The Asymptotic Convergence-Rate of Q-learning

Cs. Szepesvári*

Research Group on Artificial Intelligence, "József Attila" University, Szeged, Aradi vrt. tere 1, Hungary, H-6720
szepes@math.u-szeged.hu

Abstract

In this paper we show that for discounted MDPs with discount factor $\gamma > 1/2$ the asymptotic rate of convergence of Q-learning is $O(1/t^{R(1-\gamma)})$ if $R(1-\gamma) < 1/2$ and $O(\sqrt{\log\log t / t})$ otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here $R = p_{\min}/p_{\max}$ is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent on-line learning provided that $p_{\min} > 0$, where $p_{\min}$ and $p_{\max}$ now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution.

1 INTRODUCTION

Q-learning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-learning algorithms. Although this upper bound is not strict, computer experiments (to be presented elsewhere) and the form of the lemma underlying the proof indicate that the obtained upper bound can be made strict by a slightly more complicated definition for R. Our results extend to learning on aggregated states (see Singh et al., 1995) and other related algorithms which admit a certain form of asynchronous stochastic approximation (see Szepesvári and Littman, 1996).

*Present address: Associative Computing, Inc., Budapest, Konkoly Thege M. u. 29-33, HUNGARY-1121

2 Q-LEARNING

Watkins introduced the following algorithm to estimate the value of state-action pairs in discounted Markovian Decision Processes (MDPs) (Watkins, 1990):

$$Q_{t+1}(x,a) = (1 - \alpha_t(x,a))\,Q_t(x,a) + \alpha_t(x,a)\bigl(r_t(x,a) + \gamma \max_{b \in A} Q_t(y_t, b)\bigr). \qquad (1)$$

Here $x \in X$ and $a \in A$ are states and actions, respectively; $X$ and $A$ are finite.
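For concreteness, here is a minimal sketch of update (1) on a lookup table. The function name, the dict-of-dicts layout of $Q$, and the example numbers are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of update (1) for one sampled transition (x, a, y, r).
def q_update(Q, alpha, gamma, x, a, r, y):
    """Apply one step of (1) to Q with step size alpha."""
    target = r + gamma * max(Q[y].values())            # r_t(x,a) + gamma * max_b Q_t(y_t, b)
    Q[x][a] = (1.0 - alpha) * Q[x][a] + alpha * target
    return Q

# Example: two states {0, 1}, two actions {0, 1}, all values initialised to zero.
Q = {x: {a: 0.0 for a in (0, 1)} for x in (0, 1)}
q_update(Q, alpha=0.5, gamma=0.9, x=0, a=1, r=1.0, y=1)
```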

It is assumed that some random sampling mechanism (e.g. simulation or interaction with a real Markovian environment) generates random samples of the form $(x_t, a_t, y_t, r_t)$, where the probability of $y_t$ given $(x_t, a_t)$ is fixed and is denoted by $P(x_t, a_t, y_t)$, $R(x,a) = E[r_t \mid x_t, a_t]$ is the immediate average reward received when executing action $a$ from state $x$, $y_t$ and $r_t$ are assumed to be independent given the past of the process, and $\mathrm{Var}[r_t \mid x_t, a_t] < C$ for some $C > 0$.

Here $R = p_{\min}/p_{\max}$, with $p_{\min} = \min_{(x,a)} p(x,a)$ and $p_{\max} = \max_{(x,a)} p(x,a)$, where $p(x,a)$ is the sampling probability of the state-action pair $(x,a)$. Note that if $\gamma \ge 1 - p_{\max}/(2\,p_{\min})$ then the bound (4), i.e. $O(1/t^{R(1-\gamma)})$, is the slower of the two rates, while if $\gamma < 1 - p_{\max}/(2\,p_{\min})$ then (5), i.e. $O(\sqrt{\log\log t / t})$, is the slower. The proof will be presented in several steps.

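As a quick illustration of this case distinction, the following hypothetical helper (for illustration only; the function name and printed strings are assumptions, not from the paper) computes $R = p_{\min}/p_{\max}$ and reports which of the two bounds stated in the abstract gives the asymptotic rate.

```python
# Hypothetical helper: evaluate R = p_min / p_max and report which of the two
# bounds from the abstract gives the asymptotic rate of Q-learning.
def dominant_rate(gamma, p_min, p_max):
    R = p_min / p_max
    if R * (1.0 - gamma) < 0.5:
        return f"O(1 / t^{R * (1.0 - gamma):.3f})"     # the polynomial bound (4)
    return "O(sqrt(log log t / t))"                    # the iterated-logarithm bound (5)

# Uniform sampling over 8 state-action pairs: p_min == p_max, so R == 1, and
# gamma = 0.9 gives R * (1 - gamma) = 0.1 < 1/2.
print(dominant_rate(gamma=0.9, p_min=0.125, p_max=0.125))   # -> O(1 / t^0.100)
```

With uniform sampling $p_{\min} = p_{\max}$, so $R = 1$ and the polynomial bound applies whenever $\gamma > 1/2$.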
Step 1. Just like in (Littman and Szepesvári, 1996) (see also the extended version (Szepesvári and Littman, 1996)), the main idea is to compare $Q_t$ with the simpler process

$$\hat{Q}_{t+1}(x,a) = (1 - \alpha_t(x,a))\,\hat{Q}_t(x,a) + \alpha_t(x,a)\bigl(r_t(x,a) + \gamma \max_{b} Q^*(y_t, b)\bigr). \qquad (6)$$

Note that the only (but rather essential) difference between the definition of $Q_t$ and that of $\hat{Q}_t$ is the appearance of $Q^*$ in the defining equation of $\hat{Q}_t$. Firstly, notice that as a consequence of this change the process $\hat{Q}_t$ clearly converges to $Q^*$, and this convergence may be investigated along each component $(x,a)$ separately using standard stochastic-approximation techniques (see e.g. (Wasan, 1969; Poljak and Tsypkin, 1973)). Using simple devices one can show that the difference process $\Delta_t(x,a) = |Q_t(x,a) - \hat{Q}_t(x,a)|$ satisfies the following inequality:

$$\Delta_{t+1}(x,a) \le (1 - \alpha_t(x,a))\,\Delta_t(x,a) + \gamma\,\alpha_t(x,a)\bigl(\|\Delta_t\| + \|\hat{Q}_t - Q^*\|\bigr). \qquad (7)$$

Here $\|\cdot\|$ stands for the maximum norm. That is, the task of showing the convergence rate of $Q_t$ to $Q^*$ is reduced to that of showing the convergence rate of $\Delta_t$ to zero.
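To make this reduction concrete, the sketch below runs the coupled processes (1) and (6) on the same samples and reports $\|\hat{Q}_t - Q^*\|$ and $\|\Delta_t\| = \|Q_t - \hat{Q}_t\|$. It is an illustration under assumed choices (a random toy MDP, uniform sampling of state-action pairs, $1/(\text{visit count})$ learning rates, Gaussian reward noise), not an experiment from the paper.

```python
# Numerical sketch of Step 1: run Q_t (eq. 1) and Q_hat_t (eq. 6) on identical
# samples and learning rates, and track their max-norm difference Delta_t.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.8

# Random transition kernel P[x, a, y] and expected rewards R[x, a].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# Q* by value iteration; the comparison process (6) needs it.
Q_star = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

Q = np.zeros((n_states, n_actions))       # process (1)
Q_hat = np.zeros((n_states, n_actions))   # process (6)
visits = np.zeros((n_states, n_actions))
p_sample = np.full(n_states * n_actions, 1.0 / (n_states * n_actions))

for t in range(100_000):
    idx = rng.choice(n_states * n_actions, p=p_sample)   # fixed sampling distribution
    x, a = divmod(int(idx), n_actions)
    y = rng.choice(n_states, p=P[x, a])
    r = R[x, a] + 0.1 * rng.standard_normal()            # noisy reward, bounded variance
    visits[x, a] += 1
    alpha = 1.0 / visits[x, a]                           # assumed 1/(visit count) step size
    Q[x, a] += alpha * (r + gamma * Q[y].max() - Q[x, a])               # update (1)
    Q_hat[x, a] += alpha * (r + gamma * Q_star[y].max() - Q_hat[x, a])  # update (6)

print("||Q_hat - Q*||          =", np.abs(Q_hat - Q_star).max())
print("||Delta|| = ||Q - Q_hat|| =", np.abs(Q - Q_hat).max())
```

Running the two updates on identical samples makes the reduction explicit: the reported $\|Q - \hat{Q}\|$ is exactly the quantity $\|\Delta_t\|$ bounded by (7).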

Step 2. We simplify the notation by introducing the abstract process whose update equation is

$$x_{t+1}(i) = (1 - s_t(i))\,x_t(i) + \gamma\, s_t(i)\bigl(\|x_t\| + \epsilon_t\bigr), \qquad (8)$$

where $\epsilon_t$ plays the role of $\|\hat{Q}_t - Q^*\|$. Fortunately, we know by an extension of the Law of the Iterated Logarithm to stochastic approximation processes that the convergence rate of $\|\hat{Q}_t - Q^*\|$ is $O(\sqrt{\log\log t / t})$ (the uniform boundedness of the random reinforcement signals must be exploited in this step) (Major, 1973). Thus it is sufficient to provide a convergence-rate estimate for the perturbed process $x_t$, defined by (8), when $\epsilon_t = C\sqrt{\log\log t / t}$ for some $C > 0$. We state that the convergence rate of $\epsilon_t$ is faster than that of $x_t$.
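Before defining the lower-bounding process, the following scalar sketch (an illustration under assumed choices, not an experiment from the paper) simulates the recursion behind (8) with step size $1/t$, a single component so that $\|x_t\| = x_t$, and $\epsilon_t = C\sqrt{\log\log t / t}$ with $C = 1$; the fitted decay exponent comes out close to $1 - \gamma$, in line with the $O(1/t^{1-\gamma})$ rate claimed below for the lower-bounding process.

```python
# Scalar sketch of the perturbed recursion (8) under assumed choices:
# s_t = 1/t, eps_t = sqrt(log log t / t), one component so ||x_t|| = x_t.
import math

gamma, C = 0.9, 1.0
x = 1.0
samples = []
for t in range(3, 1_000_001):
    s_t = 1.0 / t
    eps_t = C * math.sqrt(math.log(math.log(t)) / t)
    x = (1.0 - s_t) * x + gamma * s_t * (x + eps_t)
    if t % 100_000 == 0:
        samples.append((t, x))

# Crude exponent estimate x_t ~ t^(-beta) from the last two recorded points.
(t1, x1), (t2, x2) = samples[-2], samples[-1]
beta = (math.log(x1) - math.log(x2)) / (math.log(t2) - math.log(t1))
print(f"fitted decay exponent ~ {beta:.3f}; 1 - gamma = {1.0 - gamma:.3f}")
```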

Define the process

$$z_{t+1}(i) = \begin{cases} \bigl(1 - (1-\gamma)\, s_t(i)\bigr)\, z_t(i), & \text{if } \eta_t = i; \\ z_t(i), & \text{if } \eta_t \ne i, \end{cases} \qquad (12)$$

where $\eta_t$ denotes the index of the component updated at time $t$.

This process clearly lower bounds the perturbed process $x_t$. Obviously, the convergence rate of $z_t$ is $O(1/t^{1-\gamma})$, which is slower than the convergence rate of $\epsilon_t$ provided that $\gamma > 1/2$, proving that $\epsilon_t$ must be faster than $x_t$. Thus, asymptotically $\epsilon_t$ is negligible compared with $x_t$. Using this, and a relation which follows from the mean-value theorem for integrals and the rule of integration by parts, we get that $x_t = O(1/t^{R(1-\gamma)})$. The case when $\gamma$