Least-squares Temporal Difference Learning
Reinforcement Learning

Christian Igel
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
http://www.neuroinformatik.rub.de
Notation

• [S, A, P^a_{ss'}, R^a_{ss'}]; we assume episodic tasks (discount parameter γ = 1)
• S: finite state space, A: action space
• R^a_{ss'}: expected immediate reward when action a leads to s' from s
• P^a_{ss'}: probability that action a leads to s' from s
• π : S → A: deterministic policy
• V : S → R: estimated state value function
• r ∈ R: immediate reward
• δ ∈ R: temporal difference error
• e : S → R^+_0: eligibility trace in tabular TD
• α_n ∈ R^+: learning rate, may decrease with episode number n
• s_t: state at t-th step
• s_i: state number i (according to some fixed enumeration of S)
Recall: On-line tabular temporal difference learning

Algorithm 1: On-line TD(λ)
  initialize V
  foreach episode n = 1, ... do
      initialize s_1, ∀s ∈ S : e(s) ← 0, e(s_1) ← 1, α_n ∈ R^+, t ← 1
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          δ ← r_{t+1} + V(s_{t+1}) − V(s_t)
          forall s ∈ S do
              V(s) ← V(s) + α_n δ e(s)
              e(s) ← λ e(s)
          e(s_{t+1}) ← e(s_{t+1}) + 1
          t ← t + 1
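A minimal Python sketch of Algorithm 1 for illustration; the environment interface env.reset() → state index, env.step(a) → (next_state, reward, done) and the policy function are assumptions of this sketch, not part of the slides.

import numpy as np

def tabular_td_lambda(env, policy, n_states, n_episodes, alpha=0.1, lam=0.8):
    """On-line tabular TD(lambda), gamma = 1, episodic tasks (Algorithm 1)."""
    V = np.zeros(n_states)                  # value estimates
    for _ in range(n_episodes):
        e = np.zeros(n_states)              # eligibility traces, e(s_1) <- 1
        s = env.reset()
        e[s] = 1.0
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)   # hypothetical env interface
            v_next = 0.0 if done else V[s_next]
            delta = r + v_next - V[s]       # TD error
            V += alpha * delta * e          # update all states at once
            e *= lam                        # decay traces
            if not done:
                e[s_next] += 1.0            # accumulate trace for new state
            s = s_next
    return V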
Linear function approximation and vector notation

• mapping to feature space:
    φ : S → R^k, s ↦ (φ_1(s), ..., φ_k(s))^T
  φ_i : S → R: i-th basis function
• linear function approximator with weights β ∈ R^k:
    V(s) = φ(s)^T β,   ∇_β V(s) = φ(s)
• e ∈ R^k: eligibility trace, e(s) = φ(s)^T e
• relation to tabular representation:
    k = |S|, S = {s_1, ..., s_k}
    φ_i(s_l) = 1 if i = l, 0 otherwise
    v = (V(1), ..., V(k))^T
    V(s_l) = v_l = φ(s_l)^T β = β_l
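As a quick illustration of the tabular special case, a short Python snippet (the one-hot feature map and the example weights are made up):

import numpy as np

# V(s) = phi(s)^T beta; with one-hot features the weights are the table entries.
k = 4
phi = lambda s: np.eye(k)[s]             # phi_i(s_l) = 1 if i == l else 0
beta = np.array([0.5, -1.0, 2.0, 0.0])   # arbitrary example weights
value = lambda s: phi(s) @ beta
assert value(2) == beta[2]               # V(s_l) = beta_l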
Recall: On-line temporal difference learning II

Algorithm 2: On-line TD(λ)
  initialize β (i.e., V)
  foreach episode n = 1, ... do
      initialize s_1, α_n ∈ R^+, t ← 1
      e ← φ(s_1)
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          δ ← e (r_{t+1} + (φ(s_{t+1}) − φ(s_t))^T β)
          e ← λ e + φ(s_{t+1})
          β ← β + α_n δ
          t ← t + 1
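A corresponding Python sketch of Algorithm 2; the env/policy interface and the feature map phi(s) → array of shape (k,) are the same assumed interfaces as in the tabular sketch above.

import numpy as np

def td_lambda_linear(env, policy, phi, k, n_episodes, alpha=0.01, lam=0.8):
    """On-line TD(lambda) with linear function approximation (Algorithm 2)."""
    beta = np.zeros(k)                      # weights of V(s) = phi(s)^T beta
    for _ in range(n_episodes):
        s = env.reset()
        e = phi(s)                          # eligibility trace vector
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi_next = np.zeros(k) if done else phi(s_next)
            delta = e * (r + (phi_next - phi(s)) @ beta)   # vector update direction
            e = lam * e + phi_next
            beta = beta + alpha * delta
            s = s_next
    return beta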
Off-line temporal difference learning

Algorithm 3: Off-line TD(λ)
  initialize β (i.e., V)
  foreach episode n = 1, ... do
      initialize s_1, α_n ∈ R^+, t ← 1
      δ ← 0
      e_t ← φ(s_1)
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          δ ← δ + e_t (r_{t+1} + (φ(s_{t+1}) − φ(s_t))^T β)
          e_{t+1} ← λ e_t + φ(s_{t+1})
          t ← t + 1
      β ← β + α_n δ
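The off-line variant differs only in accumulating the update direction and applying it once per episode; a sketch under the same assumed interfaces as above:

import numpy as np

def offline_td_lambda_linear(env, policy, phi, k, n_episodes, alpha=0.01, lam=0.8):
    """Off-line TD(lambda) with linear function approximation (Algorithm 3)."""
    beta = np.zeros(k)
    for _ in range(n_episodes):
        s = env.reset()
        e = phi(s)
        delta = np.zeros(k)                 # accumulated update direction
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi_next = np.zeros(k) if done else phi(s_next)
            delta += e * (r + (phi_next - phi(s)) @ beta)
            e = lam * e + phi_next
            s = s_next
        beta = beta + alpha * delta         # one update per episode
    return beta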
Final weights of off-line TD learning

after an observed trajectory (s_1, ..., s_t), roughly:

    β ← β + α_n (d + Cβ + ω)

    d = E[ Σ_{i=1}^t e_i r_{i+1} ],   C = E[ Σ_{i=1}^t e_i (φ(s_{i+1}) − φ(s_i))^T ]

ω: zero-mean noise, E: expectation over trajectories

Assuming a suitable decreasing schedule for α_n over time (episodes), β converges to β^λ satisfying d + Cβ^λ = 0.

In effect, TD(λ) performs stochastic gradient descent on ‖β − β^λ‖².
Least-squares temporal difference learning

• idea: solve d + Cβ^λ = 0 directly
• gather experience:

    b = Σ_{i=1}^t e_i r_{i+1},   A = Σ_{i=1}^t e_i (φ(s_i) − φ(s_{i+1}))^T

  (we do not reset t after an episode to keep the notation uncluttered, i.e., the sum over t runs over all episodes)
• b ∈ R^k and A ∈ R^{k×k} are unbiased estimates of n d ∈ R^k and −n C ∈ R^{k×k}, respectively
• hence d + Cβ^λ = 0 is solved approximately via b − Aβ = 0 ⇒ β = A^{-1} b
Least-squares temporal difference learning

Algorithm 4: Least-squares TD(λ)
  initialize β, A ← 0, b ← 0
  t ← 1   // no reset after episodes, for notational convenience
  foreach episode n = 1, ... do
      initialize s_t
      e_t ← φ(s_t)
      while s_t not terminal do
          take action a_t = π_t(s_t), observe reward r_{t+1} and s_{t+1}
          A ← A + e_t (φ(s_t) − φ(s_{t+1}))^T
          b ← b + e_t r_{t+1}
          e_{t+1} ← λ e_t + φ(s_{t+1})
          t ← t + 1
  whenever updated coefficients are desired: β = A^{-1} b
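A Python sketch of Algorithm 4 under the same assumed env/policy/phi interfaces; the small ridge term used when solving for β is an addition of this sketch (to guard against a singular A early on), not part of the algorithm.

import numpy as np

def lstd_lambda(env, policy, phi, k, n_episodes, lam=0.0, ridge=1e-6):
    """Least-squares TD(lambda) (Algorithm 4), gamma = 1, episodic tasks."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for _ in range(n_episodes):
        s = env.reset()
        e = phi(s)
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi_next = np.zeros(k) if done else phi(s_next)
            A += np.outer(e, phi(s) - phi_next)   # accumulate A
            b += e * r                            # accumulate b
            e = lam * e + phi_next
            s = s_next
    # whenever updated coefficients are desired
    return np.linalg.solve(A + ridge * np.eye(k), b)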
Model-based learning

Tabular model-based RL, k = |S|, S = {s_1, ..., s_k}:

1. build empirical model, sufficient statistics:
   • n ∈ N_0^k, n_i: number of times state s_i has been visited, N = diag(n)
   • T ∈ R^{k×k}: state-transition counts, [T]_{ij}: how many times s_j directly followed s_i
   • g ∈ R^k, g_i: sum of one-step rewards observed on transitions leaving state s_i
2. for a new estimate of V^π, solve the Bellman equation
       v = (N − T)^{-1} g
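A Python sketch of these two steps; the (s, r, s_next, terminal) transition format is an assumption of this sketch, and every state must have been visited so that N − T is invertible.

import numpy as np

def model_based_values(transitions, k):
    """Tabular model-based value estimation: build n, T, g and solve v = (N - T)^{-1} g."""
    n = np.zeros(k)                # visit counts
    T = np.zeros((k, k))           # transition counts [T]_{ij}
    g = np.zeros(k)                # summed one-step rewards per source state
    for s, r, s_next, terminal in transitions:
        n[s] += 1
        g[s] += r
        if not terminal:
            T[s, s_next] += 1
    N = np.diag(n)
    return np.linalg.solve(N - T, g)   # Bellman equation in vector form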
Bellman equation

V(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + V(s') ]
     = Σ_a π(s, a) Σ_{s'} P^a_{ss'} R^a_{ss'} + Σ_{s'} Σ_a π(s, a) P^a_{ss'} V(s')

replacing the model by its empirical estimates from the observed counts:

V(s) = lim_{N→∞} (1/n_s) [ g_s + Σ_{s'} [T]_{ss'} V(s') ]
⇒ n_s V(s) ≈ g_s + Σ_{s'} [T]_{ss'} V(s')

thus, in the limit we can write in vector form (v = (V(1), ..., V(k))^T):

N v = T v + g  ⇒  v = (N − T)^{-1} g
LSTD(0)

Tabular LSTD(0), k = |S|, assume states are numbered s_1, ..., s_k:

[e_t]_i = 1 if s_t = s_i, 0 otherwise

[φ(s_t) − φ(s_{t+1})]_i =  1 if s_t = s_i ≠ s_{t+1}
                          −1 if s_{t+1} = s_i ≠ s_t
                           0 otherwise

• b ← b + e_t r_{t+1} stores g
• A ← A + e_t (φ(s_t) − φ(s_{t+1}))^T stores N − T
• β = A^{-1} b = (N − T)^{-1} g

→ LSTD(0) with a function approximator is model-based learning: a compressed world model is stored!
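A small numerical check of this correspondence with synthetic data (one-hot features, random transitions and rewards invented for the check; terminal states are ignored for brevity):

import numpy as np

k = 3
rng = np.random.default_rng(0)
phi = lambda s: np.eye(k)[s]

A, b = np.zeros((k, k)), np.zeros(k)         # LSTD(0) statistics
NmT, g = np.zeros((k, k)), np.zeros(k)       # model statistics N - T and g

s = 0
for _ in range(200):
    s_next, r = int(rng.integers(k)), rng.normal()
    e = phi(s)                               # lambda = 0: e_t = phi(s_t)
    A += np.outer(e, phi(s) - phi(s_next))
    b += e * r
    NmT[s, s] += 1.0                         # visit count (N)
    NmT[s, s_next] -= 1.0                    # transition count (-T)
    g[s] += r
    s = s_next

assert np.allclose(A, NmT) and np.allclose(b, g)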
LSTD(1)

• observed trajectory (s_1, ..., s_t), corresponding features (φ(s_1), ..., φ(s_t)) and rewards (r_2, ..., r_{t+1})
• generate input/target training patterns (T: end of episode):

    ( φ(s_1), Σ_{i=1}^T r_{i+1} ), ( φ(s_2), Σ_{i=2}^T r_{i+1} ), ..., ( φ(s_t), r_{t+1} )
    = { (x_i, y_i) | 1 ≤ i ≤ t, x_i = φ(s_i), y_i = Σ_{j=i}^T r_{j+1} }

• minimizing (for simplicity, for a single episode)

    E(β) ∝ Σ_{i=1}^T ( β^T φ(s_i) − Σ_{j=i}^T r_{j+1} )² = Σ_{i=1}^T ( β^T x_i − y_i )²

  yields the same solution as LSTD(1)

→ LSTD(1) performs least-squares linear regression to model the future reward given the features
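A numerical illustration of this equivalence on a synthetic single episode (random features and rewards, invented purely for the check):

import numpy as np

k, T = 3, 6
rng = np.random.default_rng(1)
features = rng.normal(size=(T, k))          # phi(s_1), ..., phi(s_T)
rewards = rng.normal(size=T)                # r_2, ..., r_{T+1}

# LSTD(1) statistics, with phi(s_{T+1}) = 0 at the end of the episode
A, b, e = np.zeros((k, k)), np.zeros(k), np.zeros(k)
for i in range(T):
    phi_i = features[i]
    phi_next = features[i + 1] if i + 1 < T else np.zeros(k)
    e = e + phi_i                           # lambda = 1: traces accumulate
    A += np.outer(e, phi_i - phi_next)
    b += e * rewards[i]

# least-squares regression on the returns y_i = sum_{j >= i} r_{j+1}
y = np.cumsum(rewards[::-1])[::-1]
beta_reg, *_ = np.linalg.lstsq(features, y, rcond=None)

assert np.allclose(np.linalg.solve(A, b), beta_reg)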
Equivalence LSTD(1) and Regression I

∂/∂β [ (1/2) Σ_{i=1}^T ( β^T φ(s_i) − y_i )² ] = 0
⇒ Σ_{i=1}^T ( β^T φ(s_i) − y_i ) φ(s_i)^T = 0
⇒ β^T Σ_{i=1}^T φ(s_i) φ(s_i)^T = Σ_{i=1}^T y_i φ(s_i)^T
⇒ β = [ Σ_{i=1}^T φ(s_i) φ(s_i)^T ]^{-1} Σ_{i=1}^T y_i φ(s_i)
Equivalence LSTD(1) and Regression II

for λ = 1 and setting φ(s_{T+1}) = 0:

A = Σ_{i=1}^t e_i (φ(s_i) − φ(s_{i+1}))^T
  = Σ_{i=1}^t [ Σ_{j=1}^i φ(s_j) ] (φ(s_i) − φ(s_{i+1}))^T
  = φ(s_1)φ(s_1)^T + Σ_{i=2}^t Σ_{j=1}^i φ(s_j)φ(s_i)^T − Σ_{k=2}^{t+1} Σ_{j=1}^{k−1} φ(s_j)φ(s_k)^T
  = φ(s_1)φ(s_1)^T + Σ_{i=2}^t [ Σ_{j=1}^i φ(s_j)φ(s_i)^T − Σ_{k=1}^{i−1} φ(s_k)φ(s_i)^T ]
  = Σ_{i=1}^T φ(s_i)φ(s_i)^T

b = Σ_{i=1}^t e_i r_{i+1} = Σ_{i=1}^t [ Σ_{j=1}^i φ(s_j) ] r_{i+1} = Σ_{j=1}^t φ(s_j) Σ_{i=j}^t r_{i+1} = Σ_{i=1}^T y_i φ(s_i)
Summary

• least-squares temporal difference learning is very fast and does not require a carefully chosen learning-rate parameter – use it!
• for λ = 1: incremental formulation of supervised linear regression
• for λ → 0: model-based reinforcement learning

Reference
J. Boyan. Technical Update: Least-Squares Temporal Difference Learning. Machine Learning 49(2–3), pp. 233–246, 2002