Least-squares Temporal Difference Learning
Reinforcement Learning

Christian Igel
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
http://www.neuroinformatik.rub.de
Notation

• [S, A, P^a_{ss'}, R^a_{ss'}]; we assume episodic tasks (discount parameter γ = 1)
• S: finite state space, A: action space
• R^a_{ss'}: expected immediate reward when action a leads to s' from s
• P^a_{ss'}: probability that action a leads to s' from s
• π : S → A: deterministic policy
• V : S → R: estimated state value function
• r ∈ R: immediate reward
• δ ∈ R: temporal difference error
• e : S → R^+_0: eligibility trace in tabular TD
• α_n ∈ R^+: learning rate, may decrease with episode number n
• s_t: state at t-th step
• s_i: state number i (according to some fixed enumeration of S)
Recall: On-line tabular temporal difference learning

Algorithm 1: On-line TD(λ)
  initialize V
  foreach episode n = 1, ... do
      initialize s_1, ∀s ∈ S : e(s) ← 0, e(s_1) ← 1, α_n ∈ R^+, t ← 1
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          δ ← r_{t+1} + V(s_{t+1}) − V(s_t)
          forall s ∈ S do
              V(s) ← V(s) + α_n δ e(s)
              e(s) ← λ e(s)
          e(s_{t+1}) ← e(s_{t+1}) + 1
          t ← t + 1
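A minimal Python sketch of Algorithm 1 for illustration; the environment interface env.reset() → state index, env.step(a) → (next_state, reward, done) and the policy function are assumptions of this sketch, not part of the slides.

import numpy as np

def tabular_td_lambda(env, policy, n_states, n_episodes, alpha=0.1, lam=0.8):
    """On-line tabular TD(lambda), gamma = 1, episodic tasks (Algorithm 1)."""
    V = np.zeros(n_states)                  # value estimates
    for _ in range(n_episodes):
        e = np.zeros(n_states)              # eligibility traces, e(s_1) <- 1
        s = env.reset()
        e[s] = 1.0
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # hypothetical env interface
            v_next = 0.0 if done else V[s_next]
            delta = r + v_next - V[s]       # TD error
            V += alpha * delta * e          # update all states at once
            e *= lam                        # decay traces
            if not done:
                e[s_next] += 1.0            # accumulate trace for new state
            s = s_next
    return V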
Linear function approximation and vector notation

• mapping to feature space:
    φ : S → R^k, s ↦ (φ_1(s), ..., φ_k(s))^T
  φ_i : S → R: i-th basis function
• linear function approximator with weights β ∈ R^k:
    V(s) = φ(s)^T β,   ∇_β V(s) = φ(s)
• e ∈ R^k: eligibility trace, e(s) = φ(s)^T e
• relation to tabular representation:
    k = |S|, S = {s_1, ..., s_k}
    φ_i(s_l) = 1 if i = l, 0 otherwise
    v = (V(1), ..., V(k))^T
    V(s_l) = v_l = φ(s_l)^T β = β_l
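As a quick illustration of the tabular special case, a short Python snippet (the one-hot feature map and the example weights are made up):

import numpy as np

# V(s) = phi(s)^T beta; with one-hot features the weights are the table entries.
k = 4
phi = lambda s: np.eye(k)[s]             # phi_i(s_l) = 1 if i == l else 0
beta = np.array([0.5, -1.0, 2.0, 0.0])   # arbitrary example weights
value = lambda s: phi(s) @ beta
assert value(2) == beta[2]               # V(s_l) = beta_l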
Recall: On-line temporal difference learning II

Algorithm 2: On-line TD(λ)
  initialize β (i.e., V)
  foreach episode n = 1, ... do
      initialize s_1, α_n ∈ R^+, t ← 1
      e ← φ(s_1)
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          δ ← e (r_{t+1} + (φ(s_{t+1}) − φ(s_t))^T β)
          e ← λ e + φ(s_{t+1})
          β ← β + α_n δ
          t ← t + 1
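A corresponding Python sketch of Algorithm 2; the env/policy interface and the feature map phi(s) → array of shape (k,) are the same assumed interfaces as in the tabular sketch above.

import numpy as np

def td_lambda_linear(env, policy, phi, k, n_episodes, alpha=0.01, lam=0.8):
    """On-line TD(lambda) with linear function approximation (Algorithm 2)."""
    beta = np.zeros(k)                      # weights of V(s) = phi(s)^T beta
    for _ in range(n_episodes):
        s = env.reset()
        e = phi(s)                          # eligibility trace vector
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi_next = np.zeros(k) if done else phi(s_next)
            delta = e * (r + (phi_next - phi(s)) @ beta)   # vector update direction
            e = lam * e + phi_next
            beta = beta + alpha * delta
            s = s_next
    return beta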
Off-line temporal difference learning

Algorithm 3: Off-line TD(λ)
  initialize β (i.e., V)
  foreach episode n = 1, ... do
      initialize s_1, α_n ∈ R^+, t ← 1
      δ ← 0
      e_t ← φ(s_1)
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          δ ← δ + e_t (r_{t+1} + (φ(s_{t+1}) − φ(s_t))^T β)
          e_{t+1} ← λ e_t + φ(s_{t+1})
          t ← t + 1
      β ← β + α_n δ
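The off-line variant differs only in accumulating the update direction and applying it once per episode; a sketch under the same assumed interfaces as above:

import numpy as np

def offline_td_lambda_linear(env, policy, phi, k, n_episodes, alpha=0.01, lam=0.8):
    """Off-line TD(lambda) with linear function approximation (Algorithm 3)."""
    beta = np.zeros(k)
    for _ in range(n_episodes):
        s = env.reset()
        e = phi(s)
        delta = np.zeros(k)                 # accumulated update direction
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi_next = np.zeros(k) if done else phi(s_next)
            delta += e * (r + (phi_next - phi(s)) @ beta)
            e = lam * e + phi_next
            s = s_next
        beta = beta + alpha * delta         # one update per episode
    return beta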
Final weights of off-line TD learning

after an observed trajectory (s_1, ..., s_t), roughly:

    β ← β + α_n (d + Cβ + ω)

    d = E[ Σ_{i=1}^t e_i r_{i+1} ],   C = E[ Σ_{i=1}^t e_i (φ(s_{i+1}) − φ(s_i))^T ]

ω: zero-mean noise, E: expectation over trajectories

Assuming a suitable decreasing schedule for α_n over time (episodes), β converges to β^λ satisfying d + Cβ^λ = 0.

In effect, TD(λ) performs stochastic gradient descent on ‖β − β^λ‖².
Least-squares temporal difference learning

• idea: solve d + Cβ^λ = 0 directly
• gather experience:

    b = Σ_{i=1}^t e_i r_{i+1},   A = Σ_{i=1}^t e_i (φ(s_i) − φ(s_{i+1}))^T

  (we do not reset t after an episode to keep the notation uncluttered, i.e., the sum over t runs over all episodes)
• b ∈ R^k and A ∈ R^{k×k} are unbiased estimates of n d ∈ R^k and −n C ∈ R^{k×k}, respectively
• hence d + Cβ^λ = 0 is solved approximately via b − Aβ = 0 ⇒ β = A^{-1} b
Least-squares temporal difference learning

Algorithm 4: Least-squares TD(λ)
  initialize β, A ← 0, b ← 0
  t ← 1   // no reset after episodes, for notational convenience
  foreach episode n = 1, ... do
      initialize s_t
      e_t ← φ(s_t)
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          A ← A + e_t (φ(s_t) − φ(s_{t+1}))^T
          b ← b + e_t r_{t+1}
          e_{t+1} ← λ e_t + φ(s_{t+1})
          t ← t + 1
  whenever updated coefficients are desired: β = A^{-1} b
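A Python sketch of Algorithm 4 under the same assumed env/policy/phi interfaces; the small ridge term used when solving for β is an addition of this sketch (to guard against a singular A early on), not part of the algorithm.

import numpy as np

def lstd_lambda(env, policy, phi, k, n_episodes, lam=0.0, ridge=1e-6):
    """Least-squares TD(lambda) (Algorithm 4), gamma = 1, episodic tasks."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for _ in range(n_episodes):
        s = env.reset()
        e = phi(s)
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi_next = np.zeros(k) if done else phi(s_next)
            A += np.outer(e, phi(s) - phi_next)   # accumulate A
            b += e * r                            # accumulate b
            e = lam * e + phi_next
            s = s_next
    # whenever updated coefficients are desired
    return np.linalg.solve(A + ridge * np.eye(k), b)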
Model-based learning

Tabular model-based RL, k = |S|, S = {s_1, ..., s_k}:

1. build empirical model, sufficient statistics:
   • n ∈ N_0^k, n_i: number of times state s_i has been visited, N = diag(n)
   • T ∈ R^{k×k}: state-transition counts, [T]_{ij}: how many times s_j directly followed s_i
   • g ∈ R^k, g_i: sum of one-step rewards observed on transitions leaving state s_i
2. for a new estimate of V^π, solve the Bellman equation
       v = (N − T)^{-1} g
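A Python sketch of these two steps; the (s, r, s_next, terminal) transition format is an assumption of this sketch, and every state must have been visited so that N − T is invertible.

import numpy as np

def model_based_values(transitions, k):
    """Tabular model-based value estimation: build n, T, g and solve v = (N - T)^{-1} g."""
    n = np.zeros(k)                # visit counts
    T = np.zeros((k, k))           # transition counts [T]_{ij}
    g = np.zeros(k)                # summed one-step rewards per source state
    for s, r, s_next, terminal in transitions:
        n[s] += 1
        g[s] += r
        if not terminal:
            T[s, s_next] += 1
    N = np.diag(n)
    return np.linalg.solve(N - T, g)   # Bellman equation in vector form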
Bellman equation

V(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V(s') ]
     = Σ_a π(s, a) Σ_{s'} P^a_{ss'} R^a_{ss'} + Σ_{s'} Σ_a π(s, a) P^a_{ss'} V(s')

replacing the model by its empirical estimates from the observed counts:

V(s) = lim_{N→∞} (1/n_s) [ g_s + Σ_{s'} [T]_{ss'} V(s') ]
⇒ n_s V(s) ≈ g_s + Σ_{s'} [T]_{ss'} V(s')

thus, in the limit we can write in vector form (v = (V(1), ..., V(k))^T):

N v = T v + g  ⇒  v = (N − T)^{-1} g
LSTD(0)

Tabular LSTD(0), k = |S|, assume states are numbered s_1, ..., s_k:

[e_t]_i = 1 if s_t = s_i, 0 otherwise

[φ(s_t) − φ(s_{t+1})]_i =  1 if s_t = s_i ≠ s_{t+1}
                          −1 if s_{t+1} = s_i ≠ s_t
                           0 otherwise

• b ← b + e_t r_{t+1} stores g
• A ← A + e_t (φ(s_t) − φ(s_{t+1}))^T stores N − T
• β = A^{-1} b = (N − T)^{-1} g

→ LSTD(0) with a function approximator is model-based learning: a compressed world model is stored!
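A small numerical check of this correspondence with synthetic data (one-hot features, random transitions and rewards invented for the check; terminal states are ignored for brevity):

import numpy as np

k = 3
rng = np.random.default_rng(0)
phi = lambda s: np.eye(k)[s]

A, b = np.zeros((k, k)), np.zeros(k)         # LSTD(0) statistics
NmT, g = np.zeros((k, k)), np.zeros(k)       # model statistics N - T and g

s = 0
for _ in range(200):
    s_next, r = int(rng.integers(k)), rng.normal()
    e = phi(s)                               # lambda = 0: e_t = phi(s_t)
    A += np.outer(e, phi(s) - phi(s_next))
    b += e * r
    NmT[s, s] += 1.0                         # visit count (N)
    NmT[s, s_next] -= 1.0                    # transition count (-T)
    g[s] += r
    s = s_next

assert np.allclose(A, NmT) and np.allclose(b, g)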
LSTD(1)

• observed trajectory (s_1, ..., s_t), corresponding features (φ(s_1), ..., φ(s_t)) and rewards (r_2, ..., r_{t+1})
• generate input/target training patterns (T: end of episode):

    ( φ(s_1), Σ_{i=1}^T r_{i+1} ), ( φ(s_2), Σ_{i=2}^T r_{i+1} ), ..., ( φ(s_t), r_{t+1} )
    = { (x_i, y_i) | 1 ≤ i ≤ t, x_i = φ(s_i), y_i = Σ_{j=i}^T r_{j+1} }

• minimizing (for simplicity, for a single episode)

    E(β) ∝ Σ_{i=1}^T ( β^T φ(s_i) − Σ_{j=i}^T r_{j+1} )² = Σ_{i=1}^T ( β^T x_i − y_i )²

  yields the same solution as LSTD(1)

→ LSTD(1) performs least-squares linear regression to model the future reward given the features
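A numerical illustration of this equivalence on a synthetic single episode (random features and rewards, invented purely for the check):

import numpy as np

k, T = 3, 6
rng = np.random.default_rng(1)
features = rng.normal(size=(T, k))          # phi(s_1), ..., phi(s_T)
rewards = rng.normal(size=T)                # r_2, ..., r_{T+1}

# LSTD(1) statistics, with phi(s_{T+1}) = 0 at the end of the episode
A, b, e = np.zeros((k, k)), np.zeros(k), np.zeros(k)
for i in range(T):
    phi_i = features[i]
    phi_next = features[i + 1] if i + 1 < T else np.zeros(k)
    e = e + phi_i                           # lambda = 1: traces accumulate
    A += np.outer(e, phi_i - phi_next)
    b += e * rewards[i]

# least-squares regression on the returns y_i = sum_{j >= i} r_{j+1}
y = np.cumsum(rewards[::-1])[::-1]
beta_reg, *_ = np.linalg.lstsq(features, y, rcond=None)

assert np.allclose(np.linalg.solve(A, b), beta_reg)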
Equivalence LSTD(1) and Regression I

∂/∂β [ (1/2) Σ_{i=1}^T ( β^T φ(s_i) − y_i )² ] = 0
⇒ Σ_{i=1}^T ( β^T φ(s_i) − y_i ) φ(s_i)^T = 0
⇒ β^T Σ_{i=1}^T φ(s_i) φ(s_i)^T = Σ_{i=1}^T y_i φ(s_i)^T
⇒ β = [ Σ_{i=1}^T φ(s_i) φ(s_i)^T ]^{-1} Σ_{i=1}^T y_i φ(s_i)
Equivalence LSTD(1) and Regression II

for λ = 1 and setting φ(s_{T+1}) = 0:

A = Σ_{i=1}^t e_i (φ(s_i) − φ(s_{i+1}))^T
  = Σ_{i=1}^t [ Σ_{j=1}^i φ(s_j) ] (φ(s_i) − φ(s_{i+1}))^T
  = φ(s_1)φ(s_1)^T + Σ_{i=2}^t Σ_{j=1}^i φ(s_j)φ(s_i)^T − Σ_{k=2}^{t+1} Σ_{j=1}^{k−1} φ(s_j)φ(s_k)^T
  = φ(s_1)φ(s_1)^T + Σ_{i=2}^t [ Σ_{j=1}^i φ(s_j)φ(s_i)^T − Σ_{k=1}^{i−1} φ(s_k)φ(s_i)^T ]
  = Σ_{i=1}^T φ(s_i)φ(s_i)^T

b = Σ_{i=1}^t e_i r_{i+1} = Σ_{i=1}^t [ Σ_{j=1}^i φ(s_j) ] r_{i+1} = Σ_{j=1}^t φ(s_j) Σ_{i=j}^t r_{i+1} = Σ_{i=1}^T y_i φ(s_i)
Summary

• least-squares temporal difference learning is very fast and does not require a carefully chosen learning-rate parameter – use it!
• for λ = 1: incremental formulation of supervised linear regression
• for λ → 0: model-based reinforcement learning

Reference
J. Boyan. Technical Update: Least-Squares Temporal Difference Learning. Machine Learning 49(2–3), pp. 233–246, 2002