Statistical Learning Theory in Reinforcement Learning & Approximate Dynamic Programming

A. Lazaric & M. Ghavamzadeh (INRIA Lille – Team SequeL)

ICML 2012 Tutorial

June 26, 2012

Sequential Decision-Making under Uncertainty

How can an agent ...
- move around in the physical world (e.g., driving, navigation)?
- play and win a game?
- retrieve information over the web?
- perform medical diagnosis and treatment?
- maximize the throughput of a factory?
- optimize the performance of a rescue team?

Reinforcement Learning (RL)

- RL: a class of learning problems in which an agent interacts with a dynamic, stochastic, and incompletely known environment.
- Goal: learn an action-selection strategy, or policy, that optimizes some measure of the agent's long-term performance.
- Interaction: modeled as an MDP or a POMDP.
- Algorithms: most RL algorithms build on the two celebrated dynamic programming algorithms, policy iteration and value iteration.

A Bit of History

- Formulation of the problem: optimal control, state, value function, Bellman equations, etc.
- Dynamic programming algorithms: policy iteration and value iteration, together with proofs of convergence to an optimal policy.
- Approximate dynamic programming:
  - Performance evaluation: how close is the obtained solution to an optimal one?
  - Asymptotic analysis: the performance with an infinite number of samples.
- However, in real problems we always have a finite number of samples.

Motivation

What about the performance with a finite number of samples?

- Approximate dynamic programming (ADP):
  - asymptotic analysis
  - finite-sample analysis
- Finite-sample analysis of ADP algorithms:
  - Error at each iteration of the algorithm: the problem at each iteration is formulated as a regression, classification, or fixed-point problem, and tools from statistical learning theory are used to bound its error.
  - How the error propagates through the iterations of the algorithm.

This Tutorial

Supervised learning: the problem is regression or classification, the algorithms are least-squares, SVM, etc., and the tools for analysis come from statistical learning theory (SLT).

Sequential decision-making: the problem is addressed by ADP, the algorithms are API and AVI, and the classical analysis is asymptotic. This tutorial uses SLT tools to provide a finite-sample analysis of ADP algorithms.

Outline

- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion

Preliminaries

(Sub-sections: Dynamic Programming, Approximate Dynamic Programming)


Markov Decision Process (MDP)

- An MDP $M$ is a tuple $\langle \mathcal{X}, \mathcal{A}, r, p, \gamma \rangle$:
  - the state space $\mathcal{X}$ is a bounded closed subset of $\mathbb{R}^d$,
  - the set of actions $\mathcal{A}$ is finite ($|\mathcal{A}| < \infty$),
  - the reward function $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is bounded by $R_{\max}$,
  - the transition model $p(\cdot \mid x, a)$ is a distribution over $\mathcal{X}$,
  - $\gamma \in (0, 1)$ is a discount factor.
- Policy: a mapping from states to actions, $\pi(x) \in \mathcal{A}$.
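To make the MDP tuple concrete, here is a minimal illustrative sketch in Python (not part of the tutorial); the number of states and actions, the rewards, the transition probabilities, and the policy below are all made-up placeholders.

```python
import numpy as np

# A small finite MDP <X, A, r, p, gamma> with |X| = 3 states and |A| = 2 actions.
# All numbers are illustrative placeholders, not from the tutorial.
n_states, n_actions, gamma = 3, 2, 0.9

# r[x, a]: bounded reward function
r = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 0.2]])

# p[x, a, y] = p(y | x, a): transition model (each row sums to 1)
p = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])

# A deterministic policy maps each state to an action.
pi = np.array([1, 0, 1])
```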

Value Function

For a policy $\pi$:

- Value function $V^\pi : \mathcal{X} \to \mathbb{R}$,
  $$V^\pi(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \, r\big(X_t, \pi(X_t)\big) \,\Big|\, X_0 = x\Big].$$
- Action-value function $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$,
  $$Q^\pi(x, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \, r(X_t, A_t) \,\Big|\, X_0 = x, A_0 = a\Big].$$

Notation: Bellman Operator

- Bellman operator for policy $\pi$: $T^\pi : B(\mathcal{X}; V_{\max}) \to B(\mathcal{X}; V_{\max})$,
  $$(T^\pi V)(x) = r\big(x, \pi(x)\big) + \gamma \int_{\mathcal{X}} p\big(dy \mid x, \pi(x)\big) V(y).$$
- $V^\pi$ is the unique fixed point of the Bellman operator $T^\pi$.
- The action-value function $Q^\pi$ is defined as
  $$Q^\pi(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a) V^\pi(y).$$

$B(\mathcal{X}; V_{\max})$ is the space of functions on $\mathcal{X}$ bounded by $V_{\max}$.
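As an illustration (an assumed example, reusing the illustrative arrays `r`, `p`, `pi`, `gamma` from the earlier snippet), the sketch below applies $(T^\pi V)(x) = r(x,\pi(x)) + \gamma \sum_y p(y \mid x, \pi(x))\, V(y)$ on a finite MDP and iterates it to approximate $V^\pi$.

```python
import numpy as np

def bellman_operator(V, r, p, pi, gamma):
    """Apply T^pi to a value-function vector V on a finite MDP:
    (T^pi V)(x) = r(x, pi(x)) + gamma * sum_y p(y | x, pi(x)) V(y)."""
    TV = np.empty(len(V))
    for x in range(len(V)):
        a = pi[x]
        TV[x] = r[x, a] + gamma * p[x, a] @ V
    return TV

# Iterating T^pi from any V converges to V^pi (gamma-contraction in max-norm).
V = np.zeros(len(pi))
for _ in range(200):
    V = bellman_operator(V, r, p, pi, gamma)
```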

Optimal Value Function and Optimal Policy

- Optimal value function: $V^*(x) = \sup_\pi V^\pi(x)$ for all $x \in \mathcal{X}$.
- Optimal action-value function: $Q^*(x, a) = \sup_\pi Q^\pi(x, a)$ for all $x \in \mathcal{X}$, $a \in \mathcal{A}$.
- A policy $\pi$ is optimal if $V^\pi(x) = V^*(x)$ for all $x \in \mathcal{X}$.

Notation: Bellman Optimality Operator

- Bellman optimality operator: $T : B(\mathcal{X}; V_{\max}) \to B(\mathcal{X}; V_{\max})$,
  $$(T V)(x) = \max_{a \in \mathcal{A}} \Big[ r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a) V(y) \Big].$$
- $V^*$ is the unique fixed point of the Bellman optimality operator $T$.
- The optimal action-value function $Q^*$ is defined as
  $$Q^*(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a) V^*(y).$$

Properties of Bellman Operators

- Monotonicity: for all $V_1, V_2 \in B(\mathcal{X}; V_{\max})$, if $V_1 \le V_2$ component-wise, then $T^\pi V_1 \le T^\pi V_2$ and $T V_1 \le T V_2$.
- Max-norm contraction:
  $$\|T^\pi V_1 - T^\pi V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty, \qquad \|T V_1 - T V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty.$$

Dynamic Programming

Dynamic Programming Algorithms: Value Iteration

- Start with an arbitrary action-value function $Q_0$.
- At each iteration $k$: $Q_{k+1} = T Q_k$.

Convergence: $\lim_{k\to\infty} V_k = V^*$, since
$$\|V^* - V_{k+1}\|_\infty = \|T V^* - T V_k\|_\infty \le \gamma \|V^* - V_k\|_\infty \le \gamma^{k+1} \|V^* - V_0\|_\infty \xrightarrow{k \to \infty} 0.$$
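A minimal value-iteration sketch on the illustrative finite MDP from the earlier snippets; the stopping tolerance is an assumption of the example, not part of the tutorial.

```python
import numpy as np

def value_iteration(r, p, gamma, tol=1e-8):
    """Iterate Q_{k+1} = T Q_k until the max-norm update is below tol."""
    n_states, n_actions = r.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                      # greedy value: max_a Q(x, a)
        Q_next = r + gamma * p @ V             # (T Q)(x, a) = r(x, a) + gamma * E[V(x')]
        if np.abs(Q_next - Q).max() < tol:     # gamma-contraction guarantees convergence
            return Q_next
        Q = Q_next

# Q_star = value_iteration(r, p, gamma); pi_star = Q_star.argmax(axis=1)
```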

Dynamic Programming Algorithms: Policy Iteration

- Start with an arbitrary policy $\pi_0$.
- At each iteration $k$:
  - Policy evaluation: compute $Q^{\pi_k}$.
  - Policy improvement: compute the greedy policy w.r.t. $Q^{\pi_k}$,
    $$\pi_{k+1}(x) = (\mathcal{G}\pi_k)(x) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(x, a).$$

Convergence: PI generates a sequence of policies with increasing performance ($V^{\pi_{k+1}} \ge V^{\pi_k}$) and stops after a finite number of iterations with the optimal policy $\pi^*$, since
$$V^{\pi_k} = T^{\pi_k} V^{\pi_k} \le T V^{\pi_k} = T^{\pi_{k+1}} V^{\pi_k} \le \lim_{n \to \infty} (T^{\pi_{k+1}})^n V^{\pi_k} = V^{\pi_{k+1}}.$$
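A minimal policy-iteration sketch for the same illustrative finite MDP; evaluating the policy by solving the linear system $(I - \gamma P^\pi) V = r^\pi$ exactly is one of several valid choices and is an assumption of this example.

```python
import numpy as np

def policy_iteration(r, p, gamma):
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: V^pi solves (I - gamma * P^pi) V = r^pi
        P_pi = p[np.arange(n_states), pi]           # P_pi[x, y] = p(y | x, pi(x))
        r_pi = r[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy w.r.t. Q^pi
        Q = r + gamma * p @ V
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):              # no change: pi is optimal
            return pi, V
        pi = pi_new
```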

Approximate Dynamic Programming & Batch Reinforcement Learning

Approximate Dynamic Programming Algorithms: Value Iteration

- Start with an arbitrary action-value function $Q_0$.
- At each iteration $k$: $Q_{k+1} = T Q_k$.

What if we can only guarantee $Q_{k+1} \approx T Q_k$? Does $\|Q^* - Q_{k+1}\| \le \gamma \|Q^* - Q_k\|$ still hold?

Approximate Dynamic Programming Algorithms: Policy Iteration

- Start with an arbitrary policy $\pi_0$.
- At each iteration $k$:
  - Policy evaluation: compute $Q^{\pi_k}$.
  - Policy improvement: compute the greedy policy w.r.t. $Q^{\pi_k}$,
    $$\pi_{k+1}(x) = (\mathcal{G}\pi_k)(x) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(x, a).$$

What if we cannot compute $Q^{\pi_k}$ exactly and compute an approximation $\widehat{Q}^{\pi_k} \approx Q^{\pi_k}$ instead? Then
$$\pi_{k+1}(x) = \arg\max_{a \in \mathcal{A}} \widehat{Q}^{\pi_k}(x, a) \neq (\mathcal{G}\pi_k)(x),$$
and is the improvement $V^{\pi_{k+1}} \ge V^{\pi_k}$ still guaranteed?

Error at Each Iteration (AVI)

Given a budget of $B$ samples and an approximation space $\mathcal{F}$, AVI computes $Q_{k+1} \approx T Q_k$.

Error at iteration $k$:
$$\|T Q_k - Q_{k+1}\|_{p,\rho} \le f(B, \mathcal{F}) \quad \text{w.h.p.}$$

Error at Each Iteration (API)

Given a budget of $B$ samples and an approximation space $\mathcal{F}$, API computes $\widehat{Q}^{\pi_k} \approx Q^{\pi_k}$ and $\pi_{k+1} = \mathcal{G}\widehat{Q}^{\pi_k} \approx \mathcal{G}\pi_k$.

Error at iteration $k$:
$$\|Q^{\pi_k} - \widehat{Q}^{\pi_k}\|_{p,\rho} \le f(B, \mathcal{F}) \quad \text{w.h.p.}$$

Final Performance Bound

Final objective: bound the error after $K$ iterations of the algorithm,
$$\|V^* - V^{\pi_K}\|_{p,\mu} \le f(B, \mathcal{F}, K) \quad \text{w.h.p.},$$
where $\pi_K$ is the policy computed by the algorithm after $K$ iterations.

Error propagation: how the error at each iteration propagates through the iterations of the algorithm.

SLT in RL & ADP

- Supervised learning methods (regression, classification) appear in the inner loop of ADP algorithms (performance at each iteration).
- Tools from SLT that are used to analyze supervised learning methods can also be used in RL and ADP (e.g., how many samples are required to achieve a certain performance).

What makes RL more challenging?
- The objective is not always to recover a target function from its noisy observations (fixed point vs. regression).
- The target sometimes has to be approximated from sample trajectories (non-i.i.d. samples).
- Propagation of error (control problem).
- The choice of the sampling distribution $\rho$ (exploration problem).

Tools from Statistical Learning Theory

(Sub-sections: Concentration Inequalities, Functional Concentration Inequalities)

Objective of the Section

- Introduce the theoretical tools used to derive the error bounds at each iteration.
- Understand the relationship between accuracy, number of samples, and confidence.

Concentration Inequalities

The Chernoff–Hoeffding Bound

Remark: all learning algorithms use random samples instead of the actual distributions.

Question: how reliable is a solution learned from finitely many random samples?

Theorem (Chernoff–Hoeffding). Let $X_1, \ldots, X_n$ be i.i.d. samples from a distribution $P$ bounded in $[a, b]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}_P[X_1]\Big| > \varepsilon\,\right] \le 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right).$$
Reading: the left-hand side is the probability of a large deviation of the empirical average, $\varepsilon$ is the accuracy, and the right-hand side controls the confidence.

The Chernoff–Hoeffding Bound (cont'd)

Equivalently, for any $\delta \in (0, 1)$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1]\Big| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\,\right] \le \delta,$$
and, for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1]\Big| > \varepsilon\,\right] \le \delta \quad \text{whenever} \quad n \ge \frac{(b-a)^2 \log 2/\delta}{2\varepsilon^2}.$$
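A small sketch illustrating the accuracy/samples/confidence trade-off above: it computes the sample size prescribed by the bound and then checks the deviation empirically; the Bernoulli distribution and the chosen numbers are illustrative assumptions.

```python
import numpy as np

def hoeffding_sample_size(eps, delta, a=0.0, b=1.0):
    """Smallest n such that P[|mean - E[X]| > eps] <= delta for [a, b]-bounded X."""
    return int(np.ceil((b - a) ** 2 * np.log(2 / delta) / (2 * eps ** 2)))

eps, delta = 0.05, 0.01
n = hoeffding_sample_size(eps, delta)          # 1060 for these values

# Empirical check on Bernoulli(0.3) samples (illustrative only).
rng = np.random.default_rng(0)
deviations = [abs(rng.binomial(1, 0.3, n).mean() - 0.3) for _ in range(1000)]
print(n, np.mean(np.array(deviations) > eps))  # empirical failure rate is well below delta
```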

The Azuma Bound

Remark: in ADP and RL, the samples are not necessarily i.i.d., but may be generated along trajectories.

Question: how can the previous results be extended to non-i.i.d. samples?

A sequence of random variables $X_1, X_2, \ldots$ is a martingale difference sequence if for any $t$,
$$\mathbb{E}[X_{t+1} \mid X_1, \ldots, X_t] = 0.$$

Theorem (Azuma). Let $X_1, \ldots, X_n$ be a martingale difference sequence bounded in $[a, b]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t\Big| > \varepsilon\,\right] \le 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right).$$
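A small illustrative simulation (entirely an assumption, not from the tutorial): increments whose scale depends on the past but whose conditional mean is zero form a bounded martingale difference sequence, and their average concentrates as the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

def mds_average(n):
    """Average of a bounded martingale difference sequence X_t = c_{t-1} * eps_t,
    where c_{t-1} in [0, 1] depends on the history and eps_t is a symmetric +/-1 coin,
    so E[X_t | past] = 0 and |X_t| <= 1."""
    xs, c = [], 1.0
    for _ in range(n):
        eps = rng.choice([-1.0, 1.0])
        xs.append(c * eps)
        c = 0.5 + 0.5 * abs(xs[-1])     # next scale depends on the past
    return float(np.mean(xs))

n = 10_000
print(max(abs(mds_average(n)) for _ in range(50)))  # stays small, as Azuma predicts
```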

Functional Concentration Inequalities

The Functional Chernoff–Hoeffding Bound

Remark: a learning algorithm returns the empirically best hypothesis from a hypothesis set (e.g., a value function, a policy).

Question: how do the previous results extend to the case of a random hypothesis chosen from a hypothesis set?

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[a, b]$. Then for any fixed $f \in \mathcal{F}$ and any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\,\right] \le 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right).$$

The Functional Chernoff–Hoeffding Bound (cont'd)

Remark: usually we do not know in advance which function $f$ the learning algorithm will return (it is random!).

Theorem (?). Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[a, b]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\,\right] \le \ ???$$

The Union Bound

Also known as Boole's inequality, Bonferroni inequality, etc.

Theorem. Let $A_1, A_2, \ldots$ be a countable collection of events. Then
$$\mathbb{P}\Big[\bigcup_i A_i\Big] \le \sum_i \mathbb{P}[A_i].$$

The Functional Chernoff–Hoeffding Bound (cont'd)

For a finite class $\mathcal{F} = \{f_1, \ldots, f_N\}$, the event that some function deviates can be written as a union of events:
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right]
= \mathbb{P}\left[\bigcup_{j=1}^{N} \left\{\Big|\frac{1}{n}\sum_{t=1}^{n} f_j(X_t) - \mathbb{E}[f_j(X_1)]\Big| > \varepsilon\right\}\right].$$

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a finite set of functions bounded in $[a, b]$ with $|\mathcal{F}| = N$. Then for any $\delta \in (0, 1)$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\right]
\le N \max_{f \in \mathcal{F}} \mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\right] \le N\delta,$$
or, equivalently,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log 2N/\delta}{2n}}\right] \le \delta.$$
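A small illustrative simulation of the uniform deviation over a finite function class; the class of threshold functions and the uniform sampling distribution are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite class of N threshold functions f_c(x) = 1{x <= c} on X = [0, 1].
thresholds = np.linspace(0.1, 0.9, 9)          # N = 9
true_means = thresholds                         # E[f_c(X)] = c when X ~ Uniform(0, 1)

n, delta = 2000, 0.05
eps = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))   # uniform CH radius

failures = 0
for _ in range(500):
    X = rng.uniform(0.0, 1.0, n)
    emp_means = np.array([(X <= c).mean() for c in thresholds])
    failures += np.any(np.abs(emp_means - true_means) > eps)
print(eps, failures / 500)   # empirical failure rate should be well below delta
```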

The Functional Chernoff–Hoeffding Bound (cont'd)

Problem: in general $\mathcal{F}$ contains an infinite number of functions (e.g., the class of linear classifiers).

The Symmetrization Trick

$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}\big[f(X_1)\big]\Big| > \varepsilon\right]
\le 2\,\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{2}\right],$$
where the ghost samples $\{X'_t\}_{t=1}^{n}$ are drawn independently from $P$.

The VC Dimension

Not all infinities are the same...

[Figure: two function classes of very different complexity fitted to the same data points]

Let us consider a binary function space $\mathcal{F} = \{f : \mathcal{X} \to \{0, 1\}\}$.

The VC Dimension (cont'd)

How many different predictions can a space $\mathcal{F}$ produce over $n$ distinct inputs?

The VC Dimension (cont'd)

[Figures: configurations of labeled points in the plane classified by linear separators]

The VC dimension of a linear classifier in dimension 2 is $VC(\mathcal{F}) = 3$.

The VC Dimension (cont'd)

Let $S = (x_1, \ldots, x_d)$ be an arbitrary sequence of points. Then
$$\Pi_S(\mathcal{F}) = \{(f(x_1), \ldots, f(x_d)) : f \in \mathcal{F}\}$$
is the set of all possible ways the $d$ points can be classified by hypotheses in $\mathcal{F}$.

Definition. A set $S$ is shattered by a hypothesis space $\mathcal{F}$ if $|\Pi_S(\mathcal{F})| = 2^d$.

The VC Dimension (cont'd)

Definition (VC dimension). The VC dimension of a hypothesis space $\mathcal{F}$ is
$$VC(\mathcal{F}) = \max\{d \mid \exists S, |S| = d, |\Pi_S(\mathcal{F})| = 2^d\}.$$

Lemma (Sauer's lemma). Let $\mathcal{F}$ be a hypothesis space with VC dimension $d$. Then for any sequence of $n > d$ points $S = (x_1, \ldots, x_n)$,
$$|\Pi_S(\mathcal{F})| \le \sum_{i=0}^{d} \binom{n}{i} \le n^d.$$
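As a quick numerical illustration of Sauer's lemma (an assumed example, not from the tutorial), the snippet below counts how many labelings 1-D threshold classifiers ($VC = 1$) realize on $n$ points and compares with the bound $\sum_{i \le d} \binom{n}{i}$.

```python
from math import comb

def sauer_bound(n, d):
    """Sauer's lemma bound on the number of labelings realizable on n points."""
    return sum(comb(n, i) for i in range(d + 1))

def threshold_labelings(n):
    """Distinct labelings of n sorted points produced by f_c(x) = 1{x <= c} (VC dim = 1)."""
    return {tuple(1 if i < k else 0 for i in range(n)) for k in range(n + 1)}

n, d = 10, 1
print(len(threshold_labelings(n)), sauer_bound(n, d))   # 11 <= 11
```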

The Functional CH Bound for Binary Spaces

Question: how many distinct value patterns can the functions $f \in \mathcal{F}$ (with $\mathcal{F}$ a binary space) take on the $2n$ samples appearing in
$$2\,\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{2}\right]?$$

If $VC(\mathcal{F}) = d$ and $2n > d$, then the answer is at most $(2n)^d$.

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of binary functions with $VC(\mathcal{F}) = d$. Then for any $\delta \in (0, 1)$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \sqrt{\frac{\log 2N/\delta}{2n}}\right] \le 2\delta,$$
with $N = (2n)^d$.

The Functional CH Bound for Binary Spaces

A simplified reading of the previous bound: for any set of $n$ i.i.d. samples and any binary function $f \in \mathcal{F}$ with $VC(\mathcal{F}) = d$,
$$\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| \le O\left(\sqrt{\frac{d \log n/\delta}{n}}\right)$$
with probability $1 - \delta$ (w.r.t. the randomness of the samples).

Pollard's Inequality

Extension: how does the previous result extend to a real-valued space $\mathcal{F}$?

Question: how many distinct value patterns can the functions $f \in \mathcal{F}$ (with $\mathcal{F}$ real-valued) take on $2n$ samples?

Answer: an infinite number of values...

Covering

Observation: we only need an accuracy of order $\varepsilon$.

Question: how many functions from $\mathcal{F}$ do we need in order to achieve an accuracy of order $\varepsilon$ on $2n$ samples?

A space $\mathcal{F}_\varepsilon \subset \mathcal{F}$ is an $\varepsilon$-cover of $\mathcal{F}$ on the states $\{x_t\}_{t=1}^{n}$ if
$$\forall f \in \mathcal{F},\ \exists f' \in \mathcal{F}_\varepsilon : \ \Big|\frac{1}{n}\sum_{t=1}^{n} f(x_t) - \frac{1}{n}\sum_{t=1}^{n} f'(x_t)\Big| \le \varepsilon.$$

The covering number of $\mathcal{F}$ is the size of the smallest such cover,
$$\mathcal{N}\big(\mathcal{F}, \varepsilon, \{x_t\}_{t=1}^{n}\big) = |\mathcal{F}_\varepsilon|.$$
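A minimal sketch (an illustrative assumption, not the tutorial's construction) of greedily building an empirical $\varepsilon$-cover of a finite family of functions evaluated on $n$ points, using the empirical distance from the definition above.

```python
import numpy as np

def greedy_eps_cover(values, eps):
    """Greedily pick an eps-cover of a set of functions given their values on n points.

    values: array of shape (num_functions, n), row i = (f_i(x_1), ..., f_i(x_n)).
    Two functions are 'close' if |mean(f on the points) - mean(f' on the points)| <= eps.
    Returns the indices of the chosen cover elements.
    """
    means = values.mean(axis=1)
    cover = []
    for i in range(len(values)):
        if not any(abs(means[i] - means[j]) <= eps for j in cover):
            cover.append(i)
    return cover

# Illustrative family: f_c(x) = c * x evaluated on 100 points, for 50 slopes c.
xs = np.linspace(0.0, 1.0, 100)
family = np.array([c * xs for c in np.linspace(0.0, 2.0, 50)])
print(len(greedy_eps_cover(family, eps=0.05)))   # cover size grows roughly like range/eps
```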

Pollard's Inequality

We build an $(\varepsilon/8)$-cover of $\mathcal{F}$ on the states $\{X_t\}_{t=1}^{n} \cup \{X'_t\}_{t=1}^{n}$, so that
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{2}\right]
\le \mathbb{P}\left[\exists f \in \mathcal{F}_{\varepsilon/8} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{4}\right]
\le \mathbb{E}\Big[\mathcal{N}\big(\mathcal{F}, \varepsilon/8, \{X_t \cup X'_t\}_{t=1}^{n}\big)\Big]\,
\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{4}\right].$$

Theorem (Pollard). Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[0, B]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right]
\le 8\,\mathbb{E}\Big[\mathcal{N}\big(\mathcal{F}, \varepsilon/8, \{X_t \cup X'_t\}_{t=1}^{n}\big)\Big] \exp\left(-\frac{n\varepsilon^2}{64 B^2}\right).$$

The Pseudo-Dimension

Question: how can the covering number be computed?

A real-valued space $\mathcal{F}$ has pseudo-dimension $d$ if
$$\mathcal{V}(\mathcal{F}) = VC\Big(\big\{(x, y) \mapsto \mathrm{sign}(f(x) - y),\ f \in \mathcal{F}\big\}\Big) = d.$$

For any $\{x_t\}_{t=1}^{n}$,
$$\mathcal{N}\big(\mathcal{F}, \varepsilon, \{x_t\}_{t=1}^{n}\big) \le O\left(\Big(\frac{B}{\varepsilon}\Big)^{d}\right).$$

Functional Concentration Inequality for the L2-norm

Remark: in some cases we need to control deviations measured in other norms.

Example: in least-squares regression, the error is measured in L2-norm, so we want to bound the deviation between
$$\left(\frac{1}{n}\sum_{t=1}^{n} f(X_t)^2\right)^{1/2} \quad \text{and} \quad \left(\mathbb{E}\big[f(X)^2\big]\right)^{1/2}.$$

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[0, B]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \left(\frac{1}{n}\sum_{t=1}^{n} f(X_t)^2\right)^{1/2} - 2\left(\mathbb{E}\big[f(X)^2\big]\right)^{1/2} > \varepsilon\right]
\le 3\,\mathbb{E}\Big[\mathcal{N}_2\big(\mathcal{F}, \tfrac{\sqrt{2}}{24}\varepsilon, \{X_t \cup X'_t\}_{t=1}^{n}\big)\Big] \exp\left(-\frac{n\varepsilon^2}{288 B^2}\right).$$

Summary

- Learning algorithms use finite random samples ⇒ concentration of empirical averages around expectations.
- Learning algorithms search over spaces of functions ⇒ concentration of empirical averages around expectations uniformly over all functions in the space.

A Step-by-step Derivation for Linear FQI

(Sub-sections: Error at Each Iteration, Error Propagation, The Final Bound)

Objective of the Section

- Derive, step by step, a performance bound for a popular algorithm.
- Show the interplay between the prediction error at each iteration and its propagation through the iterations.

Linear Fitted Q-iteration

Linear space (used to approximate action-value functions):
$$\mathcal{F} = \Big\{ f_\alpha(x, a) = \sum_{j=1}^{d} \alpha_j \varphi_j(x, a),\ \alpha \in \mathbb{R}^d \Big\},$$
with features $\varphi_j : \mathcal{X} \times \mathcal{A} \to [0, L]$ and feature vector $\phi(x, a) = [\varphi_1(x, a), \ldots, \varphi_d(x, a)]^\top$.

Linear Fitted Q-iteration

Input: space $\mathcal{F}$, number of iterations $K$, sampling distribution $\rho$, number of samples $n$.

Initialize $\widetilde{Q}^0 \in \mathcal{F}$.
For $k = 1, \ldots, K$:
- Draw $n$ samples $(x_i, a_i) \overset{\text{i.i.d.}}{\sim} \rho$.
- Sample $x'_i \sim p(\cdot \mid x_i, a_i)$ and $r_i = r(x_i, a_i)$.
- Compute $y_i = r_i + \gamma \max_a \widetilde{Q}^{k-1}(x'_i, a)$.
- Build the training set $\big\{\big((x_i, a_i), y_i\big)\big\}_{i=1}^{n}$.
- Solve the least-squares problem
  $$f_{\widehat{\alpha}_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \big(f_\alpha(x_i, a_i) - y_i\big)^2.$$
- Return $\widetilde{Q}^k = \mathrm{Trunc}(f_{\widehat{\alpha}_k})$ (truncation at $\pm V_{\max}$).

Return the greedy policy $\pi_K(\cdot) = \arg\max_a \widetilde{Q}^K(\cdot, a)$.
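A compact sketch of the loop above, kept deliberately simple: a generative-model sampler, a fixed feature map, and a ridge-regularized least-squares solve stand in for details the slides leave unspecified (the sampler interface, the feature map, the zero initialization, and the small ridge term are assumptions of this example).

```python
import numpy as np

def linear_fqi(sample_batch, features, n_actions, gamma, K, n, v_max, reg=1e-6):
    """Linear Fitted Q-iteration over the linear space F = {phi(x, a)^T alpha}.

    sample_batch(n) -> (states, actions, rewards, next_states), i.i.d. from rho.
    features(x, a)  -> 1-D feature vector phi(x, a).
    """
    alpha = None                                   # corresponds to Q^0 = 0 (a choice)
    for _ in range(K):
        xs, acts, rs, xps = sample_batch(n)
        if alpha is None:                          # targets y_i = r_i + gamma max_a Q^{k-1}
            ys = np.asarray(rs, dtype=float)
        else:
            q_next = np.array([[np.clip(features(xp, a) @ alpha, -v_max, v_max)
                                for a in range(n_actions)] for xp in xps])
            ys = np.asarray(rs, dtype=float) + gamma * q_next.max(axis=1)
        Phi = np.array([features(x, a) for x, a in zip(xs, acts)])
        # Least-squares fit (a tiny ridge term is added only for numerical stability).
        alpha = np.linalg.solve(Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ ys)
    greedy = lambda x: int(np.argmax([features(x, a) @ alpha for a in range(n_actions)]))
    return alpha, greedy
```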

Theoretical Objectives

Objective 1: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution $\mu$:
$$\|Q^* - Q^{\pi_K}\|_\mu \le \ ???$$

Error at Each Iteration


Theoretical Objectives

Target: at each iteration we want to approximate $Q^k = T\widetilde{Q}^{k-1}$.

Objective 2: derive an intermediate bound on the prediction error [random design]:
$$\|Q^k - \widetilde{Q}^k\|_\rho \le \ ???$$

Target: at each iteration we have samples $\{(x_i, a_i)\}_{i=1}^{n}$ drawn from $\rho$.

Objective 3: derive an intermediate bound on the prediction error on the samples [deterministic design]:
$$\frac{1}{n}\sum_{i=1}^{n}\big(Q^k(x_i, a_i) - \widetilde{Q}^k(x_i, a_i)\big)^2 = \|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}}^2 \le \ ???$$

Theoretical Objectives

$$\text{Obj 3: } \|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}}^2 \le \ ???
\quad\Rightarrow\quad
\text{Obj 2: } \|Q^k - \widetilde{Q}^k\|_\rho \le \ ???
\quad\Rightarrow\quad
\text{Obj 1: } \|Q^* - Q^{\pi_K}\|_\mu \le \ ???$$

Theoretical Objectives

Returned solution:
$$f_{\widehat{\alpha}_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\big(f_\alpha(x_i, a_i) - y_i\big)^2.$$

Best solution in $\mathcal{F}$:
$$f_{\alpha^*_k} = \arg\inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - Q^k\|_\rho.$$

Additional Notation

Given the set of inputs $\{(x_i, a_i)\}_{i=1}^{n}$ drawn from $\rho$:

- Vector space: $\mathcal{F}_n = \{z \in \mathbb{R}^n : z_i = f_\alpha(x_i, a_i),\ f_\alpha \in \mathcal{F}\} \subset \mathbb{R}^n$.
- Empirical L2-norm:
  $$\|f_\alpha\|_{\widehat{\rho}}^2 = \frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i, a_i)^2 = \frac{1}{n}\sum_{i=1}^{n} z_i^2 = \|z\|_n^2.$$
- Empirical orthogonal projection:
  $$\widehat{\Pi} y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n.$$
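The empirical projection onto the span of the features is just a least-squares fit; a minimal illustrative sketch follows (the random features and targets are assumptions of the example).

```python
import numpy as np

def empirical_projection(Phi, y):
    """Project y in R^n onto F_n = span of the columns of Phi (n x d feature matrix).

    Returns hat(Pi) y = Phi @ alpha with alpha = argmin_alpha ||y - Phi alpha||_n.
    """
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ alpha

rng = np.random.default_rng(3)
n, d = 200, 5
Phi = rng.normal(size=(n, d))          # phi(x_i, a_i) for illustrative inputs
y = rng.normal(size=n)                 # illustrative targets
proj = empirical_projection(Phi, y)
print(np.allclose(Phi.T @ (y - proj), 0, atol=1e-8))   # residual is orthogonal to F_n
```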

Additional Notation (cont'd)

- Target vector:
  $$q_i = Q^k(x_i, a_i) = (T\widetilde{Q}^{k-1})(x_i, a_i) = r(x_i, a_i) + \gamma \int_{\mathcal{X}} \max_a \widetilde{Q}^{k-1}(x', a)\, p(dx' \mid x_i, a_i).$$
- Observed target vector:
  $$y_i = r_i + \gamma \max_a \widetilde{Q}^{k-1}(x'_i, a).$$
- Noise vector (zero-mean and bounded):
  $$\xi_i = q_i - y_i, \qquad \mathbb{E}[\xi_i \mid x_i] = 0, \qquad |\xi_i| \le V_{\max}.$$

[Figure: the target vector q, the noisy observation y = q - ξ, and the subspace F_n of R^n]

Additional Notation (cont'd)

- Optimal solution in $\mathcal{F}_n$:
  $$\widehat{\Pi} q = \arg\min_{z \in \mathcal{F}_n} \|q - z\|_n.$$
- Returned vector:
  $$\widehat{q}_i = f_{\widehat{\alpha}_k}(x_i, a_i), \qquad \widehat{q} = \widehat{\Pi} y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n.$$

[Figure: q, y, and their projections hat(Π)q and hat(q) = hat(Π)y onto F_n]

Theoretical Analysis

$$\|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}}^2 = \|q - \widehat{q}\|_n^2.$$

[Figure: q, y, hat(Π)q, hat(q), and the projected noise hat(ξ) in F_n]

By the triangle inequality,
$$\|q - \widehat{q}\|_n \le \|q - \widehat{\Pi} q\|_n + \|\widehat{\Pi} q - \widehat{q}\|_n = \|q - \widehat{\Pi} q\|_n + \|\widehat{\xi}\|_n.$$

Theoretical Analysis

$$\underbrace{\|q - \widehat{q}\|_n}_{\text{prediction error}} \le \underbrace{\|q - \widehat{\Pi} q\|_n}_{\text{approximation error}} + \underbrace{\|\widehat{\xi}\|_n}_{\text{estimation error}}$$

- Prediction error: distance between the learned function and the target function.
- Approximation error: distance between the best function in $\mathcal{F}$ and the target function ⇒ depends on $\mathcal{F}$.
- Estimation error: distance between the best function in $\mathcal{F}$ and the learned function ⇒ depends on the samples.

Theoretical Analysis

The projected noise $\widehat{\xi} = \widehat{\Pi}\xi$ satisfies $\|\widehat{\xi}\|_n^2 = \langle \widehat{\xi}, \widehat{\xi} \rangle = \langle \widehat{\xi}, \xi \rangle$.

The projected noise belongs to $\mathcal{F}_n$, so there exists $f_\beta \in \mathcal{F}$ with $f_\beta(x_i, a_i) = \widehat{\xi}_i$ for all $(x_i, a_i)$.

By definition of the inner product,
$$\|\widehat{\xi}\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \xi_i.$$

Theoretical Analysis

The noise $\xi$ has zero mean and is bounded in $[-V_{\max}, V_{\max}]$. Thus, for any fixed $f_\beta \in \mathcal{F}$ (the expectation being conditioned on $\{(x_i, a_i)\}$),
$$\mathbb{E}_\xi\Big[\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \xi_i\Big] = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \mathbb{E}_\xi[\xi_i] = 0,$$
and
$$\frac{1}{n}\sum_{i=1}^{n} \big(f_\beta(x_i, a_i)\, \xi_i\big)^2 \le 4 V_{\max}^2\, \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)^2 = 4 V_{\max}^2\, \|f_\beta\|_{\widehat{\rho}}^2,$$
so we can use concentration inequalities.

Theoretical Analysis

Problem: $f_\beta$ is itself a random variable (it depends on the noise).

Solution: we need functional concentration inequalities.

Theoretical Analysis

Define the space of normalized functions
$$\mathcal{G} = \Big\{ g(\cdot) = \frac{f_\alpha(\cdot)}{\|f_\alpha\|_{\widehat{\rho}}},\ f_\alpha \in \mathcal{F} \Big\}.$$

- By definition, $\|g\|_{\widehat{\rho}} \le 1$ for all $g \in \mathcal{G}$.
- Since $\mathcal{F}$ is a linear space, $\mathcal{V}(\mathcal{G}) = d + 1$.

Theoretical Analysis

Applying Pollard's inequality to the space $\mathcal{G}$: for any $g \in \mathcal{G}$,
$$\Big|\frac{1}{n}\sum_{i=1}^{n} g(x_i, a_i)\, \xi_i\Big| \le 4 V_{\max} \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}$$
with probability $1 - \delta$ (w.r.t. the realization of the noise $\xi$).

Theoretical Analysis

By definition of $g$, for any $f_\alpha \in \mathcal{F}$,
$$\Big|\frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i, a_i)\, \xi_i\Big| \le 4 V_{\max}\, \|f_\alpha\|_{\widehat{\rho}} \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}.$$

For the specific $f_\beta$ equivalent to $\widehat{\xi}$,
$$\langle \widehat{\xi}, \xi \rangle \le 4 V_{\max}\, \|\widehat{\xi}\|_n \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}.$$

Recalling that $\|\widehat{\xi}\|_n^2 = \langle \widehat{\xi}, \xi \rangle$,
$$\|\widehat{\xi}\|_n^2 \le 4 V_{\max}\, \|\widehat{\xi}\|_n \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}
\quad\Rightarrow\quad
\|\widehat{\Pi} q - \widehat{q}\|_n \le 4 V_{\max} \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}.$$

Theoretical Analysis

Theorem. At each iteration $k$, given a set of state-action pairs $\{(x_i, a_i)\}$, Linear-FQI returns an approximation $\widehat{q}$ such that
$$\|q - \widehat{q}\|_n \le \|q - \widehat{\Pi} q\|_n + \|\widehat{\Pi} q - \widehat{q}\|_n
\le \|q - \widehat{\Pi} q\|_n + O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

Theoretical Analysis

Moving back from vectors to functions: $\|q - \widehat{q}\|_n = \|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}}$ and $\|q - \widehat{\Pi} q\|_n \le \|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}}$, hence
$$\|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}} \le \|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}} + O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

Theoretical Analysis

By definition of the truncation ($\widetilde{Q}^k = \mathrm{Trunc}(f_{\widehat{\alpha}_k})$):

Theorem. At each iteration $k$, given a set of state-action pairs $\{(x_i, a_i)\}$, Linear-FQI returns an approximation $\widetilde{Q}^k$ such that (Objective 3)
$$\|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}} \le \|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}}
\le \|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}} + O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

Theoretical Analysis

Remark: to move from Objective 3 to Objective 2, we need to pass from empirical to expected L2-norms.

Since $\widetilde{Q}^k$ is truncated, it is bounded in $[-V_{\max}, V_{\max}]$, so
$$2\|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}} \ge \|Q^k - \widetilde{Q}^k\|_\rho - O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

The best solution $f_{\alpha^*_k}$ is a fixed function in $\mathcal{F}$, so
$$\|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}} \le 2\|Q^k - f_{\alpha^*_k}\|_\rho + O\left(\big(V_{\max} + L\|\alpha^*_k\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right).$$

Theorem. At each iteration $k$, LinearFQI returns an approximation $\tilde Q^k$ such that (Objective 2)
$$\|Q^k - \tilde Q^k\|_{\rho} \;\le\; 4\|Q^k - f_{\alpha^*_k}\|_{\rho}
+ O\!\left(\big(V_{\max} + L\|\alpha^*_k\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right)
+ O\!\left(V_{\max}\sqrt{\frac{d\log n/\delta}{n}}\right),$$
with probability $1-\delta$.

Remarks on the approximation error $4\|Q^k - f_{\alpha^*_k}\|_{\rho}$:
- no algorithm can do better than the best approximation in $\mathcal F$
- constant 4
- it depends on the space $\mathcal F$
- it changes with the iteration $k$

Remarks on the first estimation term:
- vanishing to zero as $O(n^{-1/2})$
- it depends on the features ($L$) and on the best solution ($\|\alpha^*_k\|$)

Remarks on the second estimation term:
- vanishing to zero as $O(n^{-1/2})$
- it depends on the dimensionality of the space ($d$) and the number of samples ($n$)

A Step-by-step Derivation for Linear FQI

Error Propagation


Objective 1: $\|Q^* - Q^{\pi_K}\|_\mu$
- Problem 1: the test norm $\mu$ is different from the sampling norm $\rho$
- Problem 2: we have bounds for $\tilde Q^k$, not for the performance of the corresponding $\pi_k$
- Problem 3: we have bounds for one single iteration

Propagation of Errors

- Bellman operators (spelled out for a finite MDP in the sketch after this list):
$$\mathcal T Q(x,a) = r(x,a) + \gamma\int_{\mathcal X}\max_{a'}Q(x',a')\,p(dx'|x,a), \qquad
\mathcal T^\pi Q(x,a) = r(x,a) + \gamma\int_{\mathcal X}Q\big(x',\pi(x')\big)\,p(dx'|x,a)$$
- Optimal action–value function: $Q^* = \mathcal T Q^*$
- Greedy policy: $\pi(x) = \arg\max_a Q(x,a)$, and in particular $\pi^*(x) = \arg\max_a Q^*(x,a)$
- Prediction error: $\varepsilon_k = Q^k - \tilde Q^k$
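For a finite MDP these operators are plain matrix-vector expressions. The sketch below is an illustration under assumed array shapes (transition tensor `P` of shape (X, A, X), reward matrix `R` of shape (X, A)), not part of the original slides.

```python
# Bellman operators on a finite MDP (illustrative sketch).
import numpy as np

def bellman_optimal(Q, R, P, gamma):
    """(T Q)(x,a) = r(x,a) + gamma * sum_x' p(x'|x,a) * max_a' Q(x',a')."""
    return R + gamma * P @ Q.max(axis=1)

def bellman_policy(Q, R, P, gamma, pi):
    """(T^pi Q)(x,a) = r(x,a) + gamma * sum_x' p(x'|x,a) * Q(x', pi(x'))."""
    return R + gamma * P @ Q[np.arange(Q.shape[0]), pi]

def greedy(Q):
    """Greedy policy: pi(x) = argmax_a Q(x,a)."""
    return Q.argmax(axis=1)
```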

Step 1: upper-bound on the propagation (Problem 3). By definition $\mathcal T Q^k \ge \mathcal T^{\pi^*} Q^k$, so
$$Q^* - \tilde Q^{k+1} = \underbrace{\mathcal T^{\pi^*}Q^* - \mathcal T^{\pi^*}\tilde Q^{k}}_{\text{recursion}} + \underbrace{\mathcal T^{\pi^*}\tilde Q^{k} - \mathcal T\tilde Q^{k}}_{\le 0} + \underbrace{\varepsilon_k}_{\text{error}}
\;\le\; \gamma P^{\pi^*}\big(Q^* - \tilde Q^{k}\big) + \varepsilon_k,$$
and unrolling the recursion over $K$ iterations,
$$Q^* - \tilde Q^{K} \;\le\; \sum_{k=0}^{K-1}\gamma^{K-k-1}\big(P^{\pi^*}\big)^{K-k-1}\varepsilon_k + \gamma^{K}\big(P^{\pi^*}\big)^{K}\big(Q^* - \tilde Q^{0}\big).$$

Step 2: lower-bound on the propagation (Problem 3). By definition $\mathcal T Q^* \ge \mathcal T^{\pi_k} Q^*$, so
$$Q^* - \tilde Q^{k+1} = \underbrace{\mathcal T Q^* - \mathcal T^{\pi_k}Q^*}_{\ge 0} + \underbrace{\mathcal T^{\pi_k}Q^* - \mathcal T\tilde Q^{k}}_{\text{greedy pol.}} + \underbrace{\varepsilon_k}_{\text{error}}
\;\ge\; \mathcal T^{\pi_k}Q^* - \mathcal T^{\pi_k}\tilde Q^{k} + \varepsilon_k
\;=\; \gamma P^{\pi_k}\big(Q^* - \tilde Q^{k}\big) + \varepsilon_k$$
(using that $\pi_k$ is greedy w.r.t. $\tilde Q^k$, so that $\mathcal T\tilde Q^k = \mathcal T^{\pi_k}\tilde Q^k$), and unrolling the recursion,
$$Q^* - \tilde Q^{K} \;\ge\; \sum_{k=0}^{K-1}\gamma^{K-k-1}\big(P^{\pi_{K-1}}P^{\pi_{K-2}}\cdots P^{\pi_{k+1}}\big)\varepsilon_k + \gamma^{K}\big(P^{\pi_{K-1}}P^{\pi_{K-2}}\cdots P^{\pi_{0}}\big)\big(Q^* - \tilde Q^{0}\big).$$

Step 3: from $\tilde Q^K$ to $\pi_K$ (Problem 2). By definition $\mathcal T^{\pi_K}\tilde Q^K = \mathcal T\tilde Q^K \ge \mathcal T^{\pi^*}\tilde Q^K$, so (using the fixed-point identities $Q^* = \mathcal T^{\pi^*}Q^*$ and $Q^{\pi_K} = \mathcal T^{\pi_K}Q^{\pi_K}$)
$$Q^* - Q^{\pi_K} = \underbrace{\mathcal T^{\pi^*}Q^* - \mathcal T^{\pi^*}\tilde Q^{K}}_{\text{error}} + \underbrace{\mathcal T^{\pi^*}\tilde Q^{K} - \mathcal T^{\pi_K}\tilde Q^{K}}_{\le 0} + \underbrace{\mathcal T^{\pi_K}\tilde Q^{K} - \mathcal T^{\pi_K}Q^{\pi_K}}_{\text{function vs. policy}}$$
$$\le\; \gamma P^{\pi^*}\big(Q^* - \tilde Q^{K}\big) + \gamma P^{\pi_K}\big(\tilde Q^{K} - Q^* + Q^* - Q^{\pi_K}\big),$$
so that
$$\big(I - \gamma P^{\pi_K}\big)\big(Q^* - Q^{\pi_K}\big) \;\le\; \gamma\big(P^{\pi^*} - P^{\pi_K}\big)\big(Q^* - \tilde Q^{K}\big)
\;\;\Longrightarrow\;\;
Q^* - Q^{\pi_K} \;\le\; \gamma\big(I - \gamma P^{\pi_K}\big)^{-1}\big(P^{\pi^*} - P^{\pi_K}\big)\big(Q^* - \tilde Q^{K}\big).$$

Step 3 (continued): plugging the error propagation (Problem 2),
$$Q^* - Q^{\pi_K} \;\le\; \big(I - \gamma P^{\pi_K}\big)^{-1}\Bigg\{\sum_{k=0}^{K-1}\gamma^{K-k}\Big[\big(P^{\pi^*}\big)^{K-k} - P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]\varepsilon_k
+ \Big[\big(P^{\pi^*}\big)^{K+1} - \big(P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{0}}\big)\Big]\big(Q^* - \tilde Q^{0}\big)\Bigg\}$$

Step 4: rewrite in compact form,
$$Q^* - Q^{\pi_K} \;\le\; \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Bigg[\sum_{k=0}^{K-1}\alpha_k A_k|\varepsilon_k| + \alpha_K A_K\big|Q^* - \tilde Q^{0}\big|\Bigg]$$
- $\alpha_k$: weights
- $A_k$: summarize the $P^{\pi_i}$ terms

Step 5: take the norm w.r.t. the test distribution $\mu$:
$$\|Q^* - Q^{\pi_K}\|_\mu^2 = \int \mu(dx,da)\,\big(Q^*(x,a) - Q^{\pi_K}(x,a)\big)^2$$
$$\le\; \left(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\right)^2 \int\Bigg[\sum_{k=0}^{K-1}\alpha_k A_k|\varepsilon_k| + \alpha_K A_K\big|Q^* - \tilde Q^{0}\big|\Bigg]^2\!(x,a)\;\mu(dx,da)$$
$$\le\; \left(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\right)^2 \int\Bigg[\sum_{k=0}^{K-1}\alpha_k A_k\,\varepsilon_k^2 + \alpha_K A_K\big(Q^* - \tilde Q^{0}\big)^2\Bigg]\!(x,a)\;\mu(dx,da)$$
(the last step uses Jensen's inequality, treating the $\alpha_k$ as weights summing to one).

Focusing on one single term:
$$\mu A_k = \frac{1-\gamma}{2}\,\mu\big(I-\gamma P^{\pi_K}\big)^{-1}\Big[\big(P^{\pi^*}\big)^{K-k} + P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$
$$= \frac{1-\gamma}{2}\sum_{m\ge0}\gamma^m\,\mu\big(P^{\pi_K}\big)^m\Big[\big(P^{\pi^*}\big)^{K-k} + P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$
$$= \frac{1-\gamma}{2}\Bigg[\sum_{m\ge0}\gamma^m\,\mu\big(P^{\pi_K}\big)^m\big(P^{\pi^*}\big)^{K-k} + \sum_{m\ge0}\gamma^m\,\mu\big(P^{\pi_K}\big)^m P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Bigg]$$

Assumption (concentrability terms):
$$c(m) = \sup_{\pi_1\ldots\pi_m}\left\|\frac{d\big(\mu P^{\pi_1}\cdots P^{\pi_m}\big)}{d\rho}\right\|_\infty, \qquad
C_{\mu,\rho} = (1-\gamma)^2\sum_{m\ge1} m\,\gamma^{m-1}c(m) < +\infty$$

Step 5 (continued): under the concentrability assumption,
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \;\le\; \left(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\right)^2\Bigg[\sum_{k=0}^{K-1}\alpha_k\,\frac{1-\gamma}{2}\sum_{m\ge0}\gamma^m c(m+K-k)\,\|\varepsilon_k\|_\rho^2 + \alpha_K\,(2V_{\max})^2\Bigg]$$

Step 5 (conclusion): taking the norm w.r.t. the test distribution $\mu$ (Problem 1),
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \;\le\; \left(\frac{2\gamma}{(1-\gamma)^2}\right)^2 C_{\mu,\rho}\,\max_k\|\varepsilon_k\|_\rho^2 + O\!\left(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\right)$$

A Step-by-step Derivation for Linear FQI

The Final Bound


Plugging Per-Iteration Regret

$$\|Q^* - Q^{\pi_K}\|_\mu^2 \;\le\; \left(\frac{2\gamma}{(1-\gamma)^2}\right)^2 C_{\mu,\rho}\,\max_k\|\varepsilon_k\|_\rho^2 + O\!\left(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\right)$$

with the per-iteration error bounded as
$$\|\varepsilon_k\|_\rho = \|Q^k - \tilde Q^k\|_\rho \;\le\; 4\|Q^k - f_{\alpha^*_k}\|_\rho
+ O\!\left(\big(V_{\max} + L\|\alpha^*_k\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right)
+ O\!\left(V_{\max}\sqrt{\frac{d\log n/\delta}{n}}\right)$$

The inherent Bellman error:
$$\|Q^k - f_{\alpha^*_k}\|_\rho = \inf_{f\in\mathcal F}\|Q^k - f\|_\rho = \inf_{f\in\mathcal F}\|\mathcal T\tilde Q^{k-1} - f\|_\rho \;\le\; \inf_{f\in\mathcal F}\|\mathcal T f_{\alpha_{k-1}} - f\|_\rho \;\le\; \sup_{g\in\mathcal F}\inf_{f\in\mathcal F}\|\mathcal T g - f\|_\rho = d(\mathcal F,\mathcal T\mathcal F)$$

Since $f_{\alpha^*_k}$ is the orthogonal projection of $Q^k$ onto $\mathcal F$ w.r.t. $\rho$,
$$\|f_{\alpha^*_k}\|_\rho \;\le\; \|Q^k\|_\rho = \|\mathcal T\tilde Q^{k-1}\|_\rho \;\le\; \|\mathcal T\tilde Q^{k-1}\|_\infty \;\le\; V_{\max}$$

Gram matrix: $G_{i,j} = \mathbb E_{(x,a)\sim\rho}\big[\varphi_i(x,a)\,\varphi_j(x,a)\big]$, with smallest eigenvalue $\omega$. Then
$$\|f_\alpha\|_\rho^2 = \|\phi^\top\alpha\|_\rho^2 = \alpha^\top G\,\alpha \;\ge\; \omega\,\alpha^\top\alpha = \omega\|\alpha\|^2
\quad\Longrightarrow\quad
\max_k\|\alpha^*_k\| \;\le\; \max_k\frac{\|f_{\alpha^*_k}\|_\rho}{\sqrt\omega} \;\le\; \frac{V_{\max}}{\sqrt\omega}$$
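A small numerical sketch of this argument (assumptions: `Phi` stacks feature rows sampled from $\rho$, and the features are linearly independent so that $\omega>0$):

```python
# Estimating the Gram matrix and the bound ||alpha|| <= V_max / sqrt(omega) (sketch).
import numpy as np

def gram_matrix(Phi):
    """Phi: (n, d) matrix whose rows are phi(x_i, a_i)^T with (x_i, a_i) ~ rho."""
    return Phi.T @ Phi / Phi.shape[0]          # Monte-Carlo estimate of G

def alpha_norm_bound(Phi, v_max):
    """max_k ||alpha_k^*|| <= V_max / sqrt(omega), with omega the smallest eigenvalue of G."""
    omega = np.linalg.eigvalsh(gram_matrix(Phi)).min()
    return v_max / np.sqrt(omega)              # assumes omega > 0 (independent features)
```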

Theorem. LinearFQI with a space $\mathcal F$ of $d$ features and $n$ samples at each iteration returns a policy $\pi_K$ after $K$ iterations such that
$$\|Q^* - Q^{\pi_K}\|_\mu \;\le\; \frac{2\gamma}{(1-\gamma)^2}\sqrt{C_{\mu,\rho}}\left(4\,d(\mathcal F,\mathcal T\mathcal F) + O\!\left(V_{\max}\Big(1+\frac{L}{\sqrt\omega}\Big)\sqrt{\frac{d\log n/\delta}{n}}\right)\right) + O\!\left(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\right)$$

Remarks:
- the propagation (and the different norms) makes the problem more complex ⇒ how do we choose the sampling distribution?
- the approximation error is worse than in regression ⇒ how do we adapt to the Bellman operator?
- the dependency on $\gamma$ is worse than at each iteration ⇒ is it possible to avoid it?
- the error decreases exponentially in $K$ ⇒ $K \approx \log(1/\epsilon)/(1-\gamma)$ iterations suffice for accuracy $\epsilon$
- the smallest eigenvalue $\omega$ of the Gram matrix appears in the bound ⇒ design the features so as to be orthogonal w.r.t. $\rho$
- the asymptotic rate $O(d/n)$ is the same as for regression

Summary
- At each iteration FQI solves a regression problem ⇒ least-squares prediction error bound
- The error is propagated through iterations ⇒ propagation of any error

Both points are put together in the sketch below.
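The following is an illustrative LinearFQI loop under assumed names (`phi`, `sample_batch`, `actions`); it is a sketch of the procedure analyzed above, not the authors' implementation.

```python
# Illustrative LinearFQI loop: one least-squares regression per iteration, K iterations.
import numpy as np

def linear_fqi(phi, d, sample_batch, actions, gamma, v_max, K):
    """sample_batch() -> list of (x, a, r, x_next) with (x, a) drawn from rho."""
    alpha = np.zeros(d)
    for _ in range(K):
        samples = sample_batch()
        # Truncated previous estimate, as in the analysis (Q-tilde lives in [-V_max, V_max])
        q_prev = lambda x, a: float(np.clip(phi(x, a) @ alpha, -v_max, v_max))
        y = np.array([r + gamma * max(q_prev(xn, ap) for ap in actions)
                      for (_, _, r, xn) in samples])
        Phi = np.array([phi(x, a) for (x, a, _, _) in samples])
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    # Greedy policy w.r.t. the last truncated estimate
    return lambda x: max(actions, key=lambda a: float(np.clip(phi(x, a) @ alpha, -v_max, v_max)))
```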

Least-Squares Policy Iteration (LSPI)


Finite-Sample Performance Bound of Least-Squares Policy Iteration (LSPI)

LSPI is an approximate policy iteration algorithm that uses Least-Squares Temporal-Difference Learning (LSTD) for policy evaluation.

Objective of the Section
- a brief description of the LSTD (policy evaluation) and LSPI (policy iteration) algorithms
- report finite-sample performance bounds for LSTD and LSPI
- describe the main components of these bounds

Least-Squares Temporal-Difference Learning (LSTD)

- Linear function space $\mathcal F = \big\{f : f(\cdot) = \sum_{j=1}^d \alpha_j\varphi_j(\cdot)\big\}$, with features $\{\varphi_j\}_{j=1}^d \in \mathcal B(\mathcal X; L)$ and feature vector $\phi: \mathcal X\to\mathbb R^d$, $\phi(\cdot) = \big(\varphi_1(\cdot),\ldots,\varphi_d(\cdot)\big)^\top$
- $V^\pi$ is the fixed point of $\mathcal T^\pi$: $V^\pi = \mathcal T^\pi V^\pi$
- $V^\pi$ may not belong to $\mathcal F$: $V^\pi\notin\mathcal F$
- Best approximation of $V^\pi$ in $\mathcal F$: $\Pi V^\pi = \arg\min_{f\in\mathcal F}\|V^\pi - f\|$ ($\Pi$ is the projection onto $\mathcal F$)

[Figure: $V^\pi$ and its projection $\Pi V^\pi$ onto the space $\mathcal F$.]

- LSTD searches for the fixed point of $\Pi_?\,\mathcal T^\pi$ instead ($\Pi_?$ is a projection onto $\mathcal F$ w.r.t. some $L_?$-norm)
- $\Pi_\infty\mathcal T^\pi$ is a contraction in $L_\infty$-norm, but the $L_\infty$-projection is numerically expensive when the number of states is large or infinite
- LSTD therefore searches for the fixed point of $\Pi_{2,\rho}\,\mathcal T^\pi$, where $\Pi_{2,\rho}\,g = \arg\min_{f\in\mathcal F}\|g - f\|_{2,\rho}$

When the fixed point of $\Pi_\rho\mathcal T^\pi$ exists, we call it the LSTD solution: $V_{TD} = \Pi_\rho\mathcal T^\pi V_{TD}$.

[Figure: $V^\pi$, its projection $\Pi_\rho V^\pi$, and the LSTD fixed point $V_{TD} = \Pi_\rho\mathcal T^\pi V_{TD}$ in the space $\mathcal F$.]

$$\langle \mathcal T^\pi V_{TD} - V_{TD},\,\varphi_i\rangle_\rho = 0, \quad i=1,\ldots,d$$
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD},\,\varphi_i\rangle_\rho = 0$$
$$\underbrace{\langle r^\pi,\varphi_i\rangle_\rho}_{b_i} - \sum_{j=1}^d \underbrace{\langle \varphi_j - \gamma P^\pi\varphi_j,\,\varphi_i\rangle_\rho}_{A_{ij}}\,\alpha_{TD}^{(j)} = 0
\quad\longrightarrow\quad A\,\alpha_{TD} = b$$

- In general, $\Pi_\rho\mathcal T^\pi$ is not a contraction and does not have a fixed point.
- If $\rho = \rho^\pi$, the stationary distribution of $\pi$, then $\Pi_{\rho^\pi}\mathcal T^\pi$ has a unique fixed point.

LSTD Algorithm

Proposition (LSTD Performance):
$$\|V^\pi - V_{TD}\|_{\rho^\pi} \;\le\; \frac{1}{\sqrt{1-\gamma^2}}\,\inf_{V\in\mathcal F}\|V^\pi - V\|_{\rho^\pi}$$

- We observe a trajectory generated by following the policy $\pi$: $(X_0, R_0, X_1, R_1, \ldots, X_N)$, where $X_{t+1}\sim P\big(\cdot\,|\,X_t,\pi(X_t)\big)$ and $R_t = r\big(X_t,\pi(X_t)\big)$
- We build estimators of the matrix $A$ and vector $b$:
$$\hat A_{ij} = \frac1N\sum_{t=0}^{N-1}\varphi_i(X_t)\big[\varphi_j(X_t) - \gamma\varphi_j(X_{t+1})\big], \qquad
\hat b_i = \frac1N\sum_{t=0}^{N-1}\varphi_i(X_t)\,R_t$$
- $\hat A\,\hat\alpha_{TD} = \hat b$, and $\hat V_{TD}(\cdot) = \phi(\cdot)^\top\hat\alpha_{TD}$

When $n\to\infty$, $\hat A\to A$ and $\hat b\to b$, and thus $\hat\alpha_{TD}\to\alpha_{TD}$ and $\hat V_{TD}\to V_{TD}$. A direct numerical transcription is sketched below.
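A minimal sketch of these estimators (assumed array layout, and assuming $\hat A$ is invertible; not the authors' code):

```python
# LSTD estimators from a single trajectory (sketch).
import numpy as np

def lstd(features, rewards, gamma):
    """features: (N+1, d) array with rows phi(X_t); rewards: (N,) array of R_t."""
    Phi, Phi_next = features[:-1], features[1:]
    N = len(rewards)
    A_hat = Phi.T @ (Phi - gamma * Phi_next) / N     # A-hat_ij
    b_hat = Phi.T @ np.asarray(rewards) / N          # b-hat_i
    alpha_td = np.linalg.solve(A_hat, b_hat)         # A-hat alpha = b-hat
    return alpha_td                                  # V-hat_TD(x) = phi(x)^T alpha_td
```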

LSTD and LSPI Error Bounds

LSTD Error Bound

When the Markov chain induced by the policy under evaluation $\pi$ has a stationary distribution $\rho^\pi$ (the Markov chain is ergodic, e.g. $\beta$-mixing), then:

Theorem (LSTD Error Bound). Let $\tilde V$ be the truncated LSTD solution computed using $n$ samples along a trajectory generated by following the policy $\pi$. Then, with probability $1-\delta$,
$$\|V^\pi - \tilde V\|_{\rho^\pi} \;\le\; \underbrace{\frac{c}{\sqrt{1-\gamma^2}}\,\inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi}}_{\text{approximation error}} \;+\; \underbrace{O\!\left(\sqrt{\frac{d\log(d/\delta)}{n\,\nu}}\right)}_{\text{estimation error}}$$

- $n$ = number of samples, $d$ = dimension of the linear function space $\mathcal F$
- $\nu$ = smallest eigenvalue of the Gram matrix $\big(\int\varphi_i\varphi_j\,d\rho^\pi\big)_{i,j}$ (assume the eigenvalues of the Gram matrix are strictly positive, so that the model-based LSTD solution exists)
- the $\beta$-mixing coefficients are hidden in the $O(\cdot)$ notation
- Approximation error: depends on how well the function space $\mathcal F$ can approximate the value function $V^\pi$
- Estimation error: depends on the number of samples $n$, the dimension of the function space $d$, the smallest eigenvalue of the Gram matrix $\nu$, and the mixing properties of the Markov chain (hidden in $O$)

LSPI Error Bound

Theorem (LSPI Error Bound). Let $V_{-1}\in\tilde{\mathcal F}$ be an arbitrary initial value function, $\tilde V_0,\ldots,\tilde V_{K-1}$ be the sequence of truncated value functions generated by LSPI after $K$ iterations, and $\pi_K$ be the greedy policy w.r.t. $\tilde V_{K-1}$. Then, with probability $1-\delta$,
$$\|V^* - V^{\pi_K}\|_\mu \;\le\; \frac{4\gamma}{(1-\gamma)^2}\left\{\sqrt{C\,C_{\mu,\rho}}\left[c\,E_0(\mathcal F) + O\!\left(\sqrt{\frac{d\log(dK/\delta)}{n\,\nu_\rho}}\right)\right] + \gamma^{\frac{K-1}{2}}R_{\max}\right\}$$

- Approximation error: $E_0(\mathcal F) = \sup_{\pi\in\mathcal G(\tilde{\mathcal F})}\inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi}$
- Estimation error: depends on $n$, $d$, $\nu_\rho$, $K$
- Initialization error: error due to the choice of the initial value function or initial policy, $|V^* - V^{\pi_0}|$

Lower-Bounding Distribution: there exists a distribution $\rho$ such that for any policy $\pi\in\mathcal G(\tilde{\mathcal F})$ we have $\rho\le C\rho^\pi$, where $C<\infty$ is a constant and $\rho^\pi$ is the stationary distribution of $\pi$. Furthermore, we can define the concentrability coefficient $C_{\mu,\rho}$ as before.
- $\nu_\rho$ = the smallest eigenvalue of the Gram matrix $\big(\int\varphi_i\varphi_j\,d\rho\big)_{i,j}$

A sketch of the LSPI loop analyzed by this bound follows.
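For concreteness, here is a rough LSPI loop (a sketch under assumed names; the slides state LSTD for $V^\pi$, while for control one typically uses the state–action LSTD-Q variant shown here):

```python
# Illustrative LSPI loop: LSTD-Q evaluation of the current policy + greedy improvement.
import numpy as np

def lspi(phi_sa, actions, rollout, gamma, d, K):
    """phi_sa(x, a) -> d-dim features; rollout(policy) -> list of (x, a, r, x_next)."""
    alpha = np.zeros(d)
    policy = lambda x: actions[0]                       # arbitrary initial policy
    for _ in range(K):
        samples = rollout(policy)
        Phi = np.array([phi_sa(x, a) for (x, a, _, _) in samples])
        Phi_next = np.array([phi_sa(xn, policy(xn)) for (_, _, _, xn) in samples])
        rewards = np.array([r for (_, _, r, _) in samples])
        A_hat = Phi.T @ (Phi - gamma * Phi_next) / len(samples)
        b_hat = Phi.T @ rewards / len(samples)
        alpha = np.linalg.solve(A_hat, b_hat)           # LSTD-Q evaluation of `policy`
        policy = lambda x, a_=alpha: max(actions, key=lambda a: phi_sa(x, a) @ a_)
    return policy
```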

Classification-based Policy Iteration


Finite-Sample Performance Bound of a Classification-based Policy Iteration Algorithm

Objective of the Section
- classification-based vs. regression-based (value-function-based) policy iteration
- describe a classification-based policy iteration algorithm
- report bounds on the error at each iteration and on the error after $K$ iterations of the algorithm
- describe the main components of these bounds

Value-based (Approximate) Policy Iteration

[Diagram: given the current policy $\pi$, sample states $x_i\sim\rho$, compute rollout estimates $\hat Q^\pi(x_i,a)$, build the training set $D=\{(x_i,a),\hat Q^\pi(x_i,a)\}$, fit $\tilde Q^\pi=\mathrm{learn}(D,\mathcal F)$ by regression, and improve: $\tilde\pi = \arg\max_a\tilde Q(\cdot,a)$.]
* We use Monte-Carlo estimation for illustration purposes.

Classification-based Policy Iteration

[Diagram: given the current policy $\pi$, sample states $x_i\sim\rho$, compute rollout estimates $\hat Q^\pi(x_i,a)$, build the training set $D=\{x_i,\arg\max_a\hat Q^\pi(x_i,a)\}$, and learn the next policy directly with a classifier: $\tilde\pi=\mathrm{learn}(D,\Pi)$; there is no explicit value-function approximation.]
* First introduced by Lagoudakis & Parr (2003) and Fern et al. (2004, 2006).

Value-based vs. Classification-based Policy Iteration: the two loops share the sampling and rollout-estimation steps and differ only in the final learning step (regression of $\hat Q^\pi$ over $\mathcal F$ vs. classification of the greedy action over $\Pi$).

Appealing Properties
- Property 1. It is more important to have a policy with performance similar to the greedy policy w.r.t. $Q^{\pi_k}$ than an accurate approximation of $Q^{\pi_k}$.
- Property 2. In some problems, good policies are easier to represent and learn than their corresponding value functions.

Algorithm

Template of the Algorithm

Input: policy space $\Pi\subseteq\mathcal B^\pi(\mathcal X)$, state distribution $\rho$, number of rollout states $N$, number of rollouts per state-action pair $M$, rollout horizon $H$
Initialize: let $\pi_0\in\Pi$ be an arbitrary policy
for $k = 0, 1, 2, \ldots$ do
    Construct the rollout set $D_k = \{x_i\}_{i=1}^N$, $x_i\overset{iid}{\sim}\rho$
    for all states $x_i\in D_k$ and actions $a\in\mathcal A$ do
        for $j = 1$ to $M$ do
            Perform a rollout according to policy $\pi_k$ and return
            $R_j^{\pi_k}(x_i,a) = r(x_i,a) + \sum_{t=1}^{H-1}\gamma^t\,r\big(x_t,\pi_k(x_t)\big)$,
            with $x_t\sim p\big(\cdot\,|\,x_{t-1},\pi_k(x_{t-1})\big)$ and $x_1\sim p(\cdot\,|\,x_i,a)$
        end for
        $\hat Q^{\pi_k}(x_i,a) = \frac1M\sum_{j=1}^M R_j^{\pi_k}(x_i,a)$
    end for
    $\pi_{k+1} = \arg\min_{\pi\in\Pi}\hat{\mathcal L}_{\pi_k}(\hat\rho;\pi)$   (classifier)
end for

A minimal sketch of the rollout-estimation step is given below.

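The inner loops of the template amount to the following (a sketch: `sample_next` and `reward` are assumed stand-ins for the MDP simulator, not names from the slides):

```python
# Rollout estimate Q-hat^{pi_k}(x, a) used to build the classifier's training set (sketch).
import numpy as np

def rollout_return(x0, a0, policy, sample_next, reward, gamma, H):
    """One H-horizon rollout: r(x0,a0) + sum_{t=1}^{H-1} gamma^t r(x_t, pi(x_t))."""
    total, x = reward(x0, a0), sample_next(x0, a0)   # x is x_1 ~ p(.|x0, a0)
    for t in range(1, H):
        a = policy(x)
        total += gamma**t * reward(x, a)
        x = sample_next(x, a)
    return total

def q_hat(x, a, policy, sample_next, reward, gamma, H, M):
    """Monte-Carlo average over M rollouts."""
    return np.mean([rollout_return(x, a, policy, sample_next, reward, gamma, H)
                    for _ in range(M)])
```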

The classifier minimizes the Empirical Error
$$\hat{\mathcal L}_{\pi_k}(\hat\rho;\pi) = \frac1N\sum_{i=1}^N\Big[\max_{a\in\mathcal A}\hat Q^{\pi_k}(x_i,a) - \hat Q^{\pi_k}\big(x_i,\pi(x_i)\big)\Big]$$
($\hat\rho$ is the empirical distribution induced by the samples in $D_k$), with the objective of minimizing the Expected Error
$$\mathcal L_{\pi_k}(\rho;\pi) = \int_{\mathcal X}\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]\rho(dx)$$

Mistake-based vs. Gap-based Errors

Mistake-based error:
$$\mathcal L_{\pi_k}(\rho;\pi) = \mathbb E_{x\sim\rho}\big[\mathbb I\{\pi(x)\ne(\mathcal G\pi_k)(x)\}\big] = \int_{\mathcal X}\underbrace{\mathbb I\Big\{\pi(x)\ne\arg\max_{a\in\mathcal A}Q^{\pi_k}(x,a)\Big\}}_{\text{mistake}}\,\rho(dx)$$

Gap-based error:
$$\mathcal L_{\pi_k}(\rho;\pi) = \int_{\mathcal X}\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]\rho(dx)
= \int_{\mathcal X}\underbrace{\mathbb I\Big\{\pi(x)\ne\arg\max_{a\in\mathcal A}Q^{\pi_k}(x,a)\Big\}}_{\text{mistake}}\;\underbrace{\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]}_{\text{cost/regret}}\,\rho(dx)$$

Both losses are compared in the sketch below.
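Side by side in code (a sketch; `q` stands for any estimate of $Q^{\pi_k}$): both losses count the same mistakes, but the gap-based loss weighs each mistake by its regret.

```python
# Mistake-based vs. gap-based empirical losses (illustrative sketch).
import numpy as np

def mistake_loss(pi, states, actions, q):
    return np.mean([float(pi(x) != max(actions, key=lambda a: q(x, a)))
                    for x in states])

def gap_loss(pi, states, actions, q):
    return np.mean([max(q(x, a) for a in actions) - q(x, pi(x))
                    for x in states])
```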

Bounds on Error at each Iteration and Final Error

Error at each Iteration

Theorem. Let $\Pi$ be a policy space with $h = VC(\Pi)<\infty$ and $\rho$ be a distribution over $\mathcal X$. Let $N$ be the number of states in $D_k$ drawn i.i.d. from $\rho$, $H$ be the rollout horizon, and $M$ be the number of rollouts per state-action pair. Let
$$\pi_{k+1} = \arg\min_{\pi\in\Pi}\hat{\mathcal L}_{\pi_k}(\hat\rho;\pi)$$
be the policy computed at the $k$-th iteration of DPI. Then, for any $\delta>0$,
$$\mathcal L_{\pi_k}(\rho;\pi_{k+1}) \;\le\; \inf_{\pi\in\Pi}\mathcal L_{\pi_k}(\rho;\pi) + 2\big(\epsilon_1 + \epsilon_2 + \gamma^H Q_{\max}\big),$$
with probability $1-\delta$, where
$$\epsilon_1 = 16\,Q_{\max}\sqrt{\frac2N\left(h\log\frac{eN}{h} + \log\frac{32}{\delta}\right)}
\quad\text{and}\quad
\epsilon_2 = 8(1-\gamma^H)\,Q_{\max}\sqrt{\frac{2}{MN}\left(h\log\frac{eMN}{h} + \log\frac{32}{\delta}\right)}$$

Remarks. The bound
$$\mathcal L_{\pi_k}(\rho;\pi_{k+1}) \;\le\; \underbrace{\inf_{\pi\in\Pi}\mathcal L_{\pi_k}(\rho;\pi)}_{\text{approximation}} + \underbrace{2\big(\epsilon_1(N) + \epsilon_2(N,M,H) + \gamma^H Q_{\max}\big)}_{\text{estimation}}$$
- Approximation error: it depends on how well the policy space $\Pi$ can approximate greedy policies.
- Estimation error: it depends on the number of rollout states, the number of rollouts, and the rollout horizon.

The approximation error:
$$\inf_{\pi\in\Pi}\mathcal L_{\pi_k}(\rho;\pi) = \inf_{\pi\in\Pi}\int_{\mathcal X}\mathbb I\{\pi(x)\ne(\mathcal G\pi_k)(x)\}\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]\rho(dx)$$

The estimation error:
$$\epsilon_1 = 16\,Q_{\max}\sqrt{\frac2N\left(h\log\frac{eN}{h} + \log\frac{32}{\delta}\right)},
\qquad
\epsilon_2 = 8(1-\gamma^H)\,Q_{\max}\sqrt{\frac{2}{MN}\left(h\log\frac{eMN}{h} + \log\frac{32}{\delta}\right)}$$
- Avoid overfitting ($\epsilon_1$): take $N\gg h$
- Fixed budget of rollouts $B = MN$: take $M = 1$ and $N = B$
- Fixed budget $B = NMH$ and $M = 1$: take $H = O\big(\frac{\log B}{\log 1/\gamma}\big)$ and $N = O(B/H)$

A small numerical check of these two terms is sketched below.
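Plugging numbers in is a one-liner per term; the sketch below uses the constants exactly as stated in the bound, so treat the output as an order-of-magnitude check of how $N$, $M$, $H$, and $h$ trade off.

```python
# Numerical check of the two estimation terms eps1 and eps2 (sketch).
import numpy as np

def eps1(N, h, delta, q_max):
    return 16 * q_max * np.sqrt(2.0 / N * (h * np.log(np.e * N / h) + np.log(32 / delta)))

def eps2(N, M, H, h, delta, gamma, q_max):
    return 8 * (1 - gamma**H) * q_max * np.sqrt(
        2.0 / (M * N) * (h * np.log(np.e * M * N / h) + np.log(32 / delta)))

# Example: N = 10000 rollout states, M = 1, H = 50, VC-dim h = 10, delta = 0.05, Q_max = 1
# eps1(10000, 10, 0.05, 1.0), eps2(10000, 1, 50, 10, 0.05, 0.99, 1.0)
```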

Final Bound

Theorem. Let $\Pi$ be a policy space with VC-dimension $h$ and $\pi_K$ be the policy generated by DPI after $K$ iterations. Then, for any $\delta>0$,
$$\|V^* - V^{\pi_K}\|_{1,\mu} \;\le\; \frac{C_{\mu,\rho}}{(1-\gamma)^2}\Big[d(\Pi,\mathcal G\Pi) + 2\big(\epsilon_1 + \epsilon_2 + \gamma^H Q_{\max}\big)\Big] + \frac{2\gamma^K R_{\max}}{1-\gamma},$$
with probability $1-\delta$, where
$$\epsilon_1 = 16\,Q_{\max}\sqrt{\frac2N\left(h\log\frac{eN}{h} + \log\frac{32K}{\delta}\right)}
\quad\text{and}\quad
\epsilon_2 = 8(1-\gamma^H)\,Q_{\max}\sqrt{\frac{2}{MN}\left(h\log\frac{eMN}{h} + \log\frac{32K}{\delta}\right)}$$

- Concentrability coefficient: $C_{\mu,\rho}$
- Estimation error: depends on $M$, $N$, $H$, $h$, and $K$
- Initialization error: error due to the choice of the initial policy, $|V^* - V^{\pi_0}|$

Inherent Greedy Error $d(\Pi,\mathcal G\Pi)$ (approximation error):
$$d(\Pi,\mathcal G\Pi) = \sup_{\pi\in\Pi}\inf_{\pi'\in\Pi}\mathcal L_\pi(\rho;\pi')
= \sup_{\pi\in\Pi}\inf_{\pi'\in\Pi}\int_{\mathcal X}\mathbb I\{\pi'(x)\ne(\mathcal G\pi)(x)\}\Big[\max_{a\in\mathcal A}Q^{\pi}(x,a) - Q^{\pi}\big(x,\pi'(x)\big)\Big]\rho(dx)$$

Other Finite-Sample Analysis Results in Batch RL
- Approximate Value Iteration (Munos & Szepesvari 2008)
- Approximate Policy Iteration
  - LSTD and LSPI (Lazaric et al. 2010, 2012)
  - Bellman Residual Minimization (Maillard et al. 2010)
  - Modified Bellman Residual Minimization (Antos et al. 2008)
  - Classification-based Policy Iteration (Fern et al. 2006; Lazaric et al. 2010; Gabillon et al. 2011; Farahmand et al. 2012)
  - Conservative Policy Iteration (Kakade & Langford 2002; Kakade 2003)

- Approximate Modified Policy Iteration (Scherrer et al. 2012)
- Regularized Approximate Dynamic Programming
  - $L_2$-Regularization
    - $L_2$-Regularized Policy Iteration (Farahmand et al. 2008)
    - $L_2$-Regularized Fitted Q-Iteration (Farahmand et al. 2009)
  - $L_1$-Regularization and High-Dimensional RL
    - Lasso-TD (Ghavamzadeh et al. 2011)
    - LSTD (LSPI) with Random Projections (Ghavamzadeh et al. 2010)

Discussion


Learned Lessons

Comparison to Supervised Learning

We obtain the optimal rates of regression and classification for RL (ADP) algorithms. What makes RL more challenging, then?
- the dependency on $1/(1-\gamma)$ (sequential nature of the problem)
- the approximation error is more complex
- the propagation of error (control problem)
- the sampling problem (how to choose $\rho$, i.e. the exploration problem)

Practical Lessons
- Tuning the parameters (given a fixed accuracy $\epsilon$):
  - number of samples (inverting the bound): $n \ge \tilde\Omega\!\big(d/\epsilon^2\big)$
  - number of iterations (inverting the bound): $K \approx \log(1/\epsilon)/(1-\gamma)$
- Choice of the function space $\mathcal F$ and/or policy space $\Pi$:
  - features $\{\varphi_i\}_{i=1}^d$ should be linearly independent given the sampling distribution $\rho$ (on-policy vs. off-policy sampling)
  - tradeoff between approximation and estimation errors

A rough numerical illustration of these parameter choices is sketched below.
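The sketch below inverts the bound with constants omitted, under the assumption that the estimation term behaves like $\sqrt{d/n}$ and the initialization term like $\gamma^K$; the numbers are order-of-magnitude choices, not guarantees.

```python
# Order-of-magnitude parameter choices from inverting the bound (sketch, constants omitted).
import numpy as np

def choose_parameters(eps, gamma, d):
    n = int(np.ceil(d / eps**2))                          # sqrt(d/n) <= eps  ->  n >= d / eps^2
    K = int(np.ceil(np.log(1.0 / eps) / (1.0 - gamma)))   # gamma^K <= eps    ->  K ~ log(1/eps)/(1-gamma)
    return n, K

# e.g. choose_parameters(eps=0.1, gamma=0.95, d=20) -> (2000, 47)
```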

Open Problems

- High-dimensional spaces: how to deal with MDPs with many state-action variables?
  - first example in deterministic design for LSTD
  - extension to other algorithms
- Optimality: how optimal are the current algorithms?
  - improve the sampling distribution
  - control the concentrability terms
  - limit the propagation of error through iterations
- Off-policy learning for LSTD

Statistical Learning Theory Meets Dynamic Programming

M. Ghavamzadeh, A. Lazaric
{mohammad.ghavamzadeh, alessandro.lazaric}@inria.fr
sequel.lille.inria.fr