Machine Learning, 22, 33-57 (1996)
© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Linear Least-Squares Algorithms for Temporal Difference Learning

STEVEN J. BRADTKE                                                    [email protected]
GTE Data Services, One E Telecom Pkwy, DC B2H, Temple Terrace, FL 33637

ANDREW G. BARTO                                                      [email protected]
Dept. of Computer Science, University of Massachusetts, Amherst, MA 01003-4610

Editor: Leslie Pack Kaelbling
Abstract. We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
Keywords: Reinforcement learning, Markov Decision Problems, Temporal Difference Methods, Least-Squares
1. Introduction
The class of temporal difference (TD) algorithms (Sutton, 1988) was developed to provide reinforcement learning systems with an efficient means for learning when the consequences of actions unfold over extended time periods. They allow a system to learn to predict the total amount of reward expected over time, and they can be used for other prediction problems as well (Anderson, 1988, Barto, et al., 1983, Sutton, 1984, Tesauro, 1992). We introduce two new TD algorithms based on the theory of linear least-squares function approximation. The recursive least-squares function approximation algorithm is commonly used in adaptive control (Goodwin & Sin, 1984) because it can converge many times more rapidly than simpler algorithms. Unfortunately, extending this algorithm to the case of TD learning is not straightforward. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. To obtain this result, we use the instrumental variable approach (Ljung & Söderström, 1983, Söderström & Stoica, 1983, Young, 1984), which provides a way to handle least-squares estimation with training data that is noisy on both the input and output observations. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.

We begin in Section 2 with a brief overview of the policy evaluation problem for Markov decision processes, the class of problems to which TD algorithms apply. After describing the TD(λ) class of algorithms and the existing convergence results in Sections 3 and 4, we present the least-squares approach in Section 5. Section 6 presents issues relevant to selecting an algorithm, and Sections 7 and 8 introduce the TD error variance and use it to quantify the results of a simulation experiment.
2. Markov Decision Processes
TD(λ) algorithms address the policy evaluation problem associated with discrete-time stochastic optimal control problems referred to as Markov decision processes (MDPs). An MDP consists of a discrete-time stochastic dynamic system (a controlled Markov chain), an immediate reward function, R, and a measure of long-term system performance. Restricting attention to finite-state, finite-action MDPs, we let X and A respectively denote finite sets of states and actions, and P denote the state transition probability function. At time step t, the controller observes the current state, x_t, and executes an action, a_t, resulting in a transition to state x_{t+1} with probability P(x_t, x_{t+1}, a_t) and the receipt of an immediate reward r_t = R(x_t, x_{t+1}, a_t). A (stationary) policy is a function μ : X → A giving the controller's action choice for each state. For each policy μ there is a value function, V^μ, that assigns to each state a measure of long-term performance given that the system starts in the given state and the controller always uses μ to select actions. Confining attention to the infinite-horizon discounted definition of long-term performance, the value function for μ is defined as follows:

    V^μ(x) = E[ Σ_{t=0}^{∞} γ^t r_t | x_0 = x ],

where γ, 0 ≤ γ < 1, is the discount factor. (In some cases, for example when the Markov chain is absorbing, one can set γ = 1.) The objective of the MDP is to find a policy, μ*, that is optimal in the sense that V^{μ*}(x) ≥ V^μ(x) for all x ∈ X and for all policies μ. Computing the evaluation function for a given policy is called policy evaluation. This computation is a component of the policy iteration method for finding an optimal policy,
and it is sometimes of interest in its own right to solve prediction problems, the perspective taken by Sutton (Sutton, 1988). The evaluation function of a policy must satisfy the following consistency condition: for all x ∈ X,

    V^μ(x) = Σ_{y∈X} P(x, y, μ(x)) [ R(x, y, μ(x)) + γ V^μ(y) ].

This is a set of |X| linear equations which can be solved for V^μ using any of a number of standard direct or iterative methods when the functions R and P are known. The TD(λ) family of algorithms applies to this problem when these functions are not known. Since our concern in this paper is solely with the problem of evaluating a fixed policy μ, we can omit reference to the policy throughout. We therefore denote the value function V^μ simply as V, and we omit the action argument in the functions R and P. Furthermore, throughout this paper, by a Markov chain we always mean a finite-state Markov chain.
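To make the consistency condition concrete, here is a minimal sketch (not from the original paper) of exact policy evaluation for a known chain: once the policy is fixed, P and R reduce to functions of the current and next state, and the condition becomes the |X| × |X| linear system V = r̄ + γ P V, where r̄ is the vector of expected one-step rewards. The variable names below are illustrative.

```python
import numpy as np

def evaluate_policy(P, R, gamma):
    """Solve V = r_bar + gamma * P @ V for a fixed policy.

    P[i, j] : probability of a transition from state i to state j
    R[i, j] : immediate reward for the transition from i to j
    gamma   : discount factor, 0 <= gamma < 1
    """
    r_bar = (P * R).sum(axis=1)          # expected immediate reward per state
    n = P.shape[0]
    # (I - gamma * P) is nonsingular for gamma < 1, so a direct solve works.
    return np.linalg.solve(np.eye(n) - gamma * P, r_bar)

# Example usage with an arbitrary two-state chain.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(evaluate_policy(P, R, gamma=0.9))
```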
3. The TD(λ) Algorithm
Although any TD(λ) algorithm can be used with a lookup-table function representation, it is most often described in terms of a parameterized function approximator. In this case, V_t, the approximation of V at time step t, is defined by V_t(x) = f(θ_t, φ_x), for all x ∈ X, where θ_t is a parameter vector at time step t, φ_x is a feature vector representing state x, and f is a given real-valued function differentiable with respect to θ_t for all φ_x. We use the notation ∇_{θ_t} V_t(x) to denote the gradient vector at state x of V_t as a function of θ_t.
Table 1. Notation used in the discussion of the TD(λ) learning rule.

x, y, z        states of the Markov chain
x_t            the state at time step t
r_t            the immediate reward associated with the transition from state x_t to x_{t+1}; r_t = R(x_t, x_{t+1})
r̄              the vector of expected immediate rewards
s              the vector of starting probabilities
X'             the transpose of the vector or matrix X
V              the true value function
φ_x            the feature vector representing state x
φ_t            the feature vector representing state x_t; φ_t = φ_{x_t}
Φ              the matrix whose x-th row is φ_x
π_x            the proportion of time that the Markov chain is expected to spend in state x
Π              the diagonal matrix diag(π)
θ*             the true value function parameter vector
θ_t            the estimate of θ* at time t
V_t(x)         the estimated value of state x using parameter vector θ_t
α_{η(x_t)}     the step-size parameter used to update the value of θ_t
η(x_t)         the number of transitions from state x_t up to time step t
Using additional notation, summarized in Table 1, the TD(λ) learning rule for a differentiable parameterized function approximator (Sutton, 1988) updates the parameter
vector θ_t as follows:

    θ_{t+1} = θ_t + α_{η(x_t)} [ r_t + γ V_t(x_{t+1}) − V_t(x_t) ] Σ_{k=1}^{t} λ^{t−k} ∇_{θ_t} V_t(x_k)
            = θ_t + α_{η(x_t)} Δθ_t,

where

    Δθ_t = [ r_t + γ V_t(x_{t+1}) − V_t(x_t) ] Σ_{k=1}^{t} λ^{t−k} ∇_{θ_t} V_t(x_k).    (1)
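For a function approximator that is linear in the parameters, V_t(x) = θ_t'φ_x, the gradient ∇_{θ_t} V_t(x_k) is simply φ_{x_k}, so the sum in (1) can be maintained recursively as an eligibility trace z_t = λ z_{t−1} + φ_{x_t}. The following sketch (not from the original paper; names and the constant step size are illustrative) shows update (1) in that linear special case.

```python
import numpy as np

def td_lambda_linear(features, rewards, gamma, lam, alpha, theta0=None):
    """TD(lambda) for a linear approximator V_t(x) = theta . phi_x.

    features : feature vectors phi_{x_0}, ..., phi_{x_T} along one trajectory
               (one more entry than rewards)
    rewards  : rewards r_0, ..., r_{T-1}, with r_t received on the step x_t -> x_{t+1}
    gamma    : discount factor
    lam      : trace-decay parameter lambda
    alpha    : a fixed step size (the paper uses a decreasing schedule)
    """
    theta = np.zeros(len(features[0])) if theta0 is None else theta0.copy()
    z = np.zeros_like(theta)                      # eligibility trace
    for t in range(len(rewards)):
        phi_t, phi_next = features[t], features[t + 1]
        z = lam * z + phi_t                       # z_t = lambda * z_{t-1} + phi_{x_t}
        delta = rewards[t] + gamma * theta @ phi_next - theta @ phi_t
        theta = theta + alpha * delta * z         # update (1), constant step size
    return theta
```

Because the gradient does not depend on θ_t in the linear case, the trace reproduces the sum in (1) exactly.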
Notice that Δθ_t depends only on estimates, V_t(x_k), made using the latest parameter values, θ_t. This is an attempt to separate the effects of changing the parameters from the effects of moving through the state space. However, when V_t is not linear in the parameter vector θ_t and λ ≠ 0, the sum in (1) cannot be formed in an efficient, recursive manner. Instead, it is necessary to remember the x_k and to explicitly compute ∇_{θ_t} V_t(x_k) for all k ≤ t.

With η_{xy} = R(x, y) − r̄_x and ω_x = φ_x − γ Σ_{y∈X} P(x, y) φ_y, we have E{η} = 0 and Cor(ω, η) = 0. Therefore, if we know the state transition probability function, P, and the feature vectors φ_x for all x ∈ X, if we can observe the state of the Markov chain at each time step, and if Cor(ω, ω) is nonsingular and finite, then by Lemma 1 the algorithm given by (6) converges in probability to θ*. In general, however, we do not know P, {φ_x}_{x∈X}, or the state of the Markov chain at each time step. We assume that all that is available to define θ_t are φ_t, φ_{t+1}, and r_t. Instrumental variable methods allow us to solve the problem under these conditions. Let ω̂_t = φ_t − γ φ_{t+1}, and ζ_t = γ Σ_{y∈X} P(x_t, y) φ_y − γ φ_{t+1}. Then we can observe
    ω̂_t = φ_t − γ φ_{t+1}
        = (φ_t − γ Σ_{y∈X} P(x_t, y) φ_y) + (γ Σ_{y∈X} P(x_t, y) φ_y − γ φ_{t+1}),

with ω_t = φ_t − γ Σ_{y∈X} P(x_t, y) φ_y.

One of the goals of using a parameterized function approximator (of which the linear-in-the-parameters approximators considered in this article are the simplest examples) is to store the value function more compactly than it could be stored in a lookup table. Function approximators that are linear in the parameters do not achieve this goal if the feature vectors representing the states are linearly independent, since in this case they use the same amount of memory as a lookup table. However, we believe that the results presented here and elsewhere on the performance of TD algorithms with function approximators that are linear in the parameters are first steps toward understanding the performance of TD algorithms using more compact representations.
Appendix A
Proofs of Lemmas

In preparation for the proofs of Lemmas 3 and 4, we first examine the sum Σ_{y∈X} P(x, y)(R(x, y) − r̄_x) for an arbitrary state x:

    Σ_{y∈X} P(x, y)(R(x, y) − r̄_x) = Σ_{y∈X} P(x, y) R(x, y) − Σ_{y∈X} P(x, y) r̄_x
                                   = r̄_x − r̄_x
                                   = 0.
Proof of Lemma 3: The result in the preceding paragraph leads directly to a proof that E{η} = 0:

    E{η} = E{R(x, y) − r̄_x}
         = Σ_{x∈X} π_x Σ_{y∈X} P(x, y)(R(x, y) − r̄_x)
         = Σ_{x∈X} π_x · 0
         = 0,

and to a proof that Cor(ω, η) = 0:

    Cor(ω, η) = E{ω η}
              = Σ_{x∈X} π_x Σ_{y∈X} P(x, y) ω_x (R(x, y) − r̄_x)
              = Σ_{x∈X} π_x ω_x Σ_{y∈X} P(x, y)(R(x, y) − r̄_x)
              = Σ_{x∈X} π_x ω_x · 0
              = 0.
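The zero-mean and zero-correlation properties above are expectations over the stationary distribution π and the transition probabilities P, so they can be checked exactly for any small chain. The sketch below (illustrative, not from the paper) computes E{η} and Cor(ω, η) for an arbitrary chain and feature matrix and confirms that both are zero up to floating-point error.

```python
import numpy as np

def check_lemma3(P, R, Phi, gamma):
    """Exactly evaluate E{eta} and Cor(omega, eta) for a finite chain.

    eta_xy  = R(x, y) - r_bar_x                     (zero-mean reward noise)
    omega_x = phi_x - gamma * sum_y P(x, y) phi_y
    Expectations are over the stationary distribution pi and transitions P.
    """
    # Stationary distribution: left eigenvector of P with eigenvalue 1.
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()

    r_bar = (P * R).sum(axis=1)
    eta = R - r_bar[:, None]                        # eta[x, y]
    omega = Phi - gamma * P @ Phi                   # omega[x, :]

    E_eta = np.sum(pi[:, None] * P * eta)
    cor_omega_eta = (pi[:, None] * (P * eta).sum(axis=1, keepdims=True) * omega).sum(axis=0)
    return E_eta, cor_omega_eta

rng = np.random.default_rng(0)
P = rng.random((4, 4)); P /= P.sum(axis=1, keepdims=True)
R = rng.random((4, 4))
Phi = rng.random((4, 3))
print(check_lemma3(P, R, Phi, gamma=0.9))           # both entries ~ 0
```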
Proof of Lemma 4: First we consider Cor(ρ, η):

    Cor(ρ, η) = E{ρ η'}
              = Σ_{x∈X} π_x Σ_{y∈X} P(x, y) φ_x (R(x, y) − r̄_x)'
              = Σ_{x∈X} π_x φ_x Σ_{y∈X} P(x, y)(R(x, y) − r̄_x)'
              = Σ_{x∈X} π_x φ_x · 0
              = 0.

Now for Cor(ρ, ζ):

    Cor(ρ, ζ) = E{ρ ζ'}
              = Σ_{x∈X} π_x Σ_{y∈X} P(x, y) φ_x (γ Σ_{z∈X} P(x, z) φ_z − γ φ_y)'
              = Σ_{x∈X} π_x φ_x Σ_{y∈X} P(x, y) (γ Σ_{z∈X} P(x, z) φ_z − γ φ_y)'
              = Σ_{x∈X} π_x φ_x (γ Σ_{z∈X} P(x, z) φ_z − γ Σ_{y∈X} P(x, y) φ_y)'
              = Σ_{x∈X} π_x φ_x · 0
              = 0.
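The same style of exact check applies to Lemma 4, taking ρ to be the feature vector φ_x of the current state. The short sketch below (our illustration, not the paper's code) evaluates Cor(ρ, η) and Cor(ρ, ζ) exactly; both come out as zero.

```python
import numpy as np

def check_lemma4(P, R, Phi, gamma):
    """Exactly evaluate Cor(rho, eta) and Cor(rho, zeta), taking rho_x = phi_x,
    eta_xy = R(x,y) - r_bar_x, zeta_xy = gamma*(sum_z P(x,z) phi_z - phi_y)."""
    evals, evecs = np.linalg.eig(P.T)                  # stationary distribution pi
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi = pi / pi.sum()

    r_bar = (P * R).sum(axis=1)
    eta = R - r_bar[:, None]                           # eta[x, y]
    zeta = gamma * ((P @ Phi)[:, None, :] - Phi[None, :, :])   # zeta[x, y, :]

    cor_rho_eta = np.einsum('x,xy,xy,xi->i', pi, P, eta, Phi)
    cor_rho_zeta = np.einsum('x,xy,xi,xyj->ij', pi, P, Phi, zeta)
    return cor_rho_eta, cor_rho_zeta                   # both ~ 0
```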
Proof of Lemma 5: Equation 11 (repeated here) gives us the estimate found by algorithm LS TD for θ*:

    θ_t = [ Σ_{k=1}^{t} φ_k (φ_k − γ φ_{k+1})' ]^{−1} [ Σ_{k=1}^{t} φ_k R_k ].

As t grows we have by condition (2) that the sampled transition probabilities between each pair of states approach the true transition probabilities, P, with probability 1. We also have by condition (3) that each state x ∈ X is visited in the proportion π_x with probability 1. Therefore, given condition (4) we can express the limiting estimate found by algorithm LS TD, θ_{LS TD}, as

    θ_{LS TD} = lim_{t→∞} θ_t
              = lim_{t→∞} [ Σ_{k=1}^{t} φ_k (φ_k − γ φ_{k+1})' ]^{−1} [ Σ_{k=1}^{t} φ_k R_k ]
              = [ Σ_{x∈X} π_x φ_x (φ_x − γ Σ_{y∈X} P(x, y) φ_y)' ]^{−1} [ Σ_{x∈X} π_x φ_x Σ_{y∈X} P(x, y) R(x, y) ]
              = [ Φ'Π(I − γP)Φ ]^{−1} [ Φ'Π r̄ ].
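The LS TD estimate above is a ratio of two sample sums, so it can be computed by accumulating a matrix A_t = Σ_k φ_k(φ_k − γφ_{k+1})' and a vector b_t = Σ_k φ_k R_k and solving A_t θ_t = b_t. The sketch below (illustrative; not the paper's code) shows this batch computation, plus a generic recursive least-squares style variant that maintains an approximate inverse with the Sherman-Morrison formula, in the spirit of RLS TD but not necessarily identical to the paper's exact RLS TD update.

```python
import numpy as np

def lstd(features, rewards, gamma):
    """Batch LS TD: theta_t = A_t^{-1} b_t, with
    A_t = sum_k phi_k (phi_k - gamma*phi_{k+1})',  b_t = sum_k phi_k R_k.
    `features` holds one more vector than `rewards`."""
    d = len(features[0])
    A = np.zeros((d, d))
    b = np.zeros(d)
    for k in range(len(rewards)):
        phi, phi_next = features[k], features[k + 1]
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * rewards[k]
    return np.linalg.solve(A, b)

def rls_td_like(features, rewards, gamma, delta=1.0):
    """Incremental variant: maintain C ~ A^{-1} via Sherman-Morrison.
    C is initialized to delta * I so that early estimates are defined."""
    d = len(features[0])
    C = delta * np.eye(d)
    theta = np.zeros(d)
    for k in range(len(rewards)):
        phi, phi_next = features[k], features[k + 1]
        u = phi - gamma * phi_next
        Cphi = C @ phi
        denom = 1.0 + u @ Cphi
        theta = theta + Cphi * (rewards[k] - u @ theta) / denom
        C = C - np.outer(Cphi, u @ C) / denom       # rank-1 inverse update
    return theta
```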
Appendix B
The Five-State Markov Chain Example

The transition probability function of the five-state Markov chain used in the experiments appears in matrix form as follows, where the entry in row i, column j is the probability of a transition from state i to state j (rounded to two decimal places):

    P = [ 0.42  0.13  0.14  0.03  0.28 ]
        [ 0.25  0.08  0.16  0.35  0.15 ]
        [ 0.08  0.20  0.33  0.17  0.22 ]
        [ 0.36  0.05  0.00  0.51  0.07 ]
        [ 0.17  0.24  0.19  0.18  0.22 ]
The feature vectors representing the states (rounded to two decimal places) are listed as the rows of the following matrix Φ:

    Φ = [ 74.29  34.61  73.48  53.29   7.79 ]
        [ 61.60  48.07  34.68  36.19  82.02 ]
        [ 97.00   4.88   8.51  87.89   5.17 ]
        [ 41.10  40.13  64.63  92.67  31.09 ]
        [  7.76  79.82  43.78   8.56  61.11 ]
The immediate rewards were specified by the following matrix (which has been rounded to two decimal places), where the entry in row i, column j determines the immediate reward for a transition from state i to state j. The matrix was scaled to produce the different TD error variance values used in the experiments. The relative sizes of the immediate rewards remained the same.
    R = [ 104.66   29.69   82.36   37.49   68.82 ]
        [  75.86   29.24  100.37    0.31   35.99 ]
        [  57.68   65.66   56.95  100.44   47.63 ]
        [  96.23   14.01    0.88   89.77   66.77 ]
        [  70.35   23.69   73.41   70.70   85.41 ]
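Assuming the matrices are read row by row as reconstructed above (rows of P sum to 1 up to rounding), several quantities used throughout the paper can be computed directly for this example: the stationary distribution π (the π_x of Table 1), the expected immediate rewards r̄, and the exact value function V = (I − γP)^{−1} r̄. The snippet is illustrative only; the discount factor shown is an arbitrary choice, not one taken from the paper.

```python
import numpy as np

P = np.array([[0.42, 0.13, 0.14, 0.03, 0.28],
              [0.25, 0.08, 0.16, 0.35, 0.15],
              [0.08, 0.20, 0.33, 0.17, 0.22],
              [0.36, 0.05, 0.00, 0.51, 0.07],
              [0.17, 0.24, 0.19, 0.18, 0.22]])

R = np.array([[104.66,  29.69,  82.36,  37.49,  68.82],
              [ 75.86,  29.24, 100.37,   0.31,  35.99],
              [ 57.68,  65.66,  56.95, 100.44,  47.63],
              [ 96.23,  14.01,   0.88,  89.77,  66.77],
              [ 70.35,  23.69,  73.41,  70.70,  85.41]])

gamma = 0.9                                   # arbitrary illustrative discount factor

# Stationary distribution pi: left eigenvector of P for the eigenvalue closest to 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

r_bar = (P * R).sum(axis=1)                   # expected immediate reward per state
V = np.linalg.solve(np.eye(5) - gamma * P, r_bar)  # exact value function
print(pi, r_bar, V, sep="\n")
```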
Appendix C
Selecting Step-Size Parameters

The convergence theorem for NTD(0) (Bradtke, 1994) requires a separate step-size parameter, α(x), for each state x, that satisfies the Robbins and Monro (Robbins & Monro, 1951) criteria

    Σ_{k=1}^{∞} α_k(x) = ∞   and   Σ_{k=1}^{∞} α_k(x)^2 < ∞
with probability 1, where α_k(x) is the step-size parameter for the k-th visitation of state x. Instead of a separate step-size parameter for each state, we used a single parameter α_t, which we decreased at every time step. For each state x there is a corresponding subsequence of {α_t} that is used to update the value function when x is visited. We conjecture that if the original sequence {α_t} satisfies the Robbins and Monro criteria, then these subsequences also satisfy the criteria, with probability 1. The overall convergence rate may be decreased by use of a single step-size parameter since each subsequence will contain fewer large step sizes. The step-size parameter sequence {α_t} was generated using the "search then converge" algorithm described by Darken, Chang, and Moody (Darken, et al., 1992):

    α_t = α_0 · (1 + (c/α_0)(t/τ)) / (1 + (c/α_0)(t/τ) + τ (t^2/τ^2)).
The choice of parameters α_0, c, and τ determines the transition of learning from "search mode" to "converge mode". Search mode describes the time during which t < τ. α_t is nearly constant in search mode, while α_t ≈ c/t in converge mode. The ideal choice of step-size parameters moves θ_t as quickly as possible into the vicinity of θ* during search mode, and then settles into converge mode.
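A minimal sketch of this schedule as a function (names are illustrative), which makes the two regimes easy to inspect: for t much smaller than τ the rate stays near α_0, and for t much larger than τ it behaves like c/t.

```python
def search_then_converge(t, alpha0, c, tau):
    """'Search then converge' step-size schedule (Darken, Chang & Moody, 1992)."""
    ratio = (c / alpha0) * (t / tau)
    return alpha0 * (1.0 + ratio) / (1.0 + ratio + tau * (t ** 2) / (tau ** 2))

# For t << tau the rate is close to alpha0; for t >> tau it decays roughly as c/t.
for t in (0, 10, 100, 1000, 10000):
    print(t, search_then_converge(t, alpha0=0.1, c=1.0, tau=100.0))
```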
Notes

1. If the set of feature vectors is linearly independent, then there exist parameter values such that any real-valued function of X can be approximated with zero error by a function approximator linear in the parameters. Using terminology from adaptive control (Goodwin & Sin, 1984), this situation is said to satisfy the exact matching condition for arbitrary real-valued functions of X.
2. m cannot be less than rank(Φ). If m > rank(Φ), then [Φ'Π(I − γP)Φ] is an (m × m) matrix with rank less than m. It is therefore not invertible.

3. The search for the best settings for λ, α_0, c, and τ was the limiting factor on the size of the state space for this experiment.
References

Anderson, C. W., (1988). Strategy learning with multilayer connectionist representations. Technical Report 87-509.3, GTE Laboratories Incorporated, Computer and Intelligent Systems Laboratory, 40 Sylvan Road, Waltham, MA 02254.
Barto, A. G., Sutton, R. S. & Anderson, C. W., (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:835-846.
Bradtke, S. J., (1994). Incremental Dynamic Programming for On-Line Adaptive Optimal Control. PhD thesis, University of Massachusetts, Computer Science Dept. Technical Report 94-62.
Darken, C., Chang, J. & Moody, J., (1992). Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2 - Proceedings of the 1992 IEEE Workshop. IEEE Press.
Dayan, P., (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341-362.
Dayan, P. & Sejnowski, T. J., (1994). TD(λ): Convergence with probability 1. Machine Learning.
Goodwin, G. C. & Sin, K. S., (1984). Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ.
Jaakkola, T., Jordan, M. I. & Singh, S. P., (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6).
Kemeny, J. G. & Snell, J. L., (1976). Finite Markov Chains. Springer-Verlag, New York.
Ljung, L. & Söderström, T., (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA.
Lukes, G., Thompson, B. & Werbos, P., (1990). Expectation driven learning with an associative memory. In Proceedings of the International Joint Conference on Neural Networks, pages 1:521-524.
Robbins, H. & Monro, S., (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407.
Söderström, T. & Stoica, P. G., (1983). Instrumental Variable Methods for System Identification. Springer-Verlag, Berlin.
Sutton, R. S., (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, Department of Computer and Information Science, University of Massachusetts at Amherst, Amherst, MA 01003.
Sutton, R. S., (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9-44.
Tesauro, G. J., (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4):257-277.
Tsitsiklis, J. N., (1993). Asynchronous stochastic approximation and Q-learning. Technical Report LIDS-P-2172, Laboratory for Information and Decision Systems, MIT, Cambridge, MA.
Watkins, C. J. C. H., (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
Watkins, C. J. C. H. & Dayan, P., (1992). Q-learning. Machine Learning, 8(3/4):279-292.
Werbos, P. J., (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17(1):7-20.
Werbos, P. J., (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339-356.
Werbos, P. J., (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3(2):179-190.
Werbos, P. J., (1992). Approximate dynamic programming for real-time control and neural modeling. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 493-525. Van Nostrand Reinhold, New York.
Young, P., (1984). Recursive Estimation and Time-series Analysis. Springer-Verlag.
Received November 10, 1994 Accepted March 10, 1995 Final Manuscript October 4, 1995