
A Policy Gradient Method for SMDPs with Application to Call Admission Control

Sumeetpal Singh¹, Vladislav Tadic
Department of Electrical and Electronic Engineering, University of Melbourne, Vic. 3010, Australia

Arnaud Doucet
Signal Processing Group, Engineering Department, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK

Abstract

Classical methods for solving a semi-Markov decision process, such as value iteration and policy iteration, require precise knowledge of the underlying probabilistic model and are known to suffer from the curse of dimensionality. To overcome both of these limitations, this paper presents a reinforcement learning approach in which one directly optimises the performance criterion with respect to a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it through stochastic approximation. The gradient estimator is based on the discounted score method introduced in [1]. We demonstrate the utility of our algorithm in a Call Admission Control problem.

1 Introduction

A semi-Markov decision process (SMDP) is a continuous-time generalisation of a discrete-time Markov decision process (MDP). Like a MDP, a SMDP can be solved using classical methods such as value iteration and policy iteration, or a hybrid combination of the two. However, these classical methods suffer from two significant drawbacks: (i) the curse of dimensionality, which implies (loosely speaking) that the memory required to represent the optimal value function (the solution to Bellman's equation) is proportional to the dimension of the MDP (or SMDP) and is prohibitive for problems with large state and action spaces; (ii) the classical methods require a model, or equivalently knowledge of the transition probabilities of the MDP. To alleviate the two shortcomings above, a methodology called reinforcement learning (RL) with function approximation is used. RL combines dynamic programming and stochastic approximation to learn the optimal value function online or via simulation, thus circumventing drawback (ii) above. Function approximation attempts to approximate the optimal value function from a family of parameterised functions; to offset drawback (i), a family that admits a compact representation is chosen.

While RL for MDPs is well studied [1, 4, 8], this is not the case for SMDPs. The interest in SMDPs stems from the fact that the SMDP framework is general enough to address a large class of sequential decision-making problems under uncertainty, for example the optimal preventive maintenance schedule of a production inventory system [3], optimal call admission control [6], and the optimal control of a queueing system [2]. The extension of the RL methodology to SMDPs has been addressed in [2, 3]. In these cited references, the authors attempt to learn the optimal value functions for discounted and average cost SMDPs. It is well known that the greedy policy derived from the current approximation of the optimal value function is not guaranteed to improve with each iteration and can be worse than the old policy by an amount proportional to the maximum approximation error over all states [1].

Inspired by the work in [1, 4, 8], the contribution of this paper is to present an alternative approach in which the policy of a SMDP is represented by its own function approximator, yielding a family of parameterised policies. We then optimise the average cost SMDP problem with respect to this family of parameterised policies. The method proposed is model free, as it requires no knowledge of the probabilistic laws governing the evolution of the SMDP, and is therefore suited to online implementation. Moreover, this framework allows one to easily handle the constrained case by using recent results for constrained stochastic approximation [9]; by constraints, we mean an average cost SMDP subject to average cost constraints. To demonstrate the utility of the method we propose, we present as a case study the optimal call admission control problem. In simulations, we demonstrate quick convergence of the online algorithm to the optimal policy.

Our approach is based on a recent paper [1] in which a policy gradient algorithm is proposed for average cost MDPs. The policy gradient method works with a family of parameterised policies, and the gradient of the performance function with respect to the policy parameters is estimated online. Using this estimated gradient, a stochastic approximation algorithm updates the policy parameters in the direction of ascent of the performance function. This paper demonstrates that the policy gradient method, with all the associated convergence results for the MDP scenario, can be extended to the more general SMDP setting. The problem of estimating the gradient of a performance function online (or via simulation) has been well studied by the Operations Research community [5]. Baxter and Bartlett [1] show that their policy gradient method is a discounted score function (or likelihood ratio) method. Discounting is used because the variance of the score function grows with time [5]; a discount factor is used to bound the variance. The use of a discount introduces a bias in the gradient estimate, as shown in [1]. However, the biased gradient itself can be accurate for Markov chains that mix quickly [1, Theorem 3.7]. In this paper, we apply the discounted score method to derive a gradient estimator for an average cost SMDP (algorithm SMDPBG). We demonstrate the efficiency of our algorithm in a call admission control case study.

¹ Research supported in part by the Centre of Expertise in Networked Decision Systems (CENDS).

2 Problem Formulation

Let $\{(x_k, \tau_k, u_k)\}_{k\ge 0}$ be a controlled (semi-)Markov process, where $x_k \in X := \{1, \ldots, n\}$ for all $k$, $X$ being the state space of the embedded chain $\{x_k\}_{k\ge 0}$. $u_k \in U$ is the control applied at time $k$; the control (or action) space $U$ is finite. $\theta = [\theta_1, \ldots, \theta_K]^T \in \mathbb{R}^K$ is the parameter that determines the control policy in effect, where typically $K < n$. $\theta$ parameterises a family of stationary randomised policies as follows: let $\mu : \mathbb{R}^K \times X \to \mathcal{P}(U)$, where $\mathcal{P}(U)$ denotes the set of all probability measures on $U$. At epoch $k$, the control $u_k$ to be applied is sampled from the measure $\mu(\theta, x_k, \cdot)$. $\tau_k \in \mathbb{R}_+ := [0, \infty)$, $\tau_k > 0$, is the time the process $\{x_k\}_{k\ge 0}$ dwells in state $x_{k-1}$ or, equivalently, the time interval between the $(k-1)$-th and the $k$-th transition. Let $Q$ be the controlled semi-Markov kernel that specifies the joint distribution of $(x_{k+1}, \tau_{k+1})$ given $(x_k, u_k)$:
$$P(\tau_{k+1} \le \tau,\; x_{k+1} = j \mid x_k = i,\; u_k = u) = Q_{ij}(\tau; u). \tag{1}$$
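For concreteness, a minimal simulation sketch of the kernel in (1) is given below. The state and action counts, the Dirichlet-sampled transition rows and the exponential sojourn-time distribution are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

# Minimal sketch of a controlled semi-Markov kernel as in (1).  Everything
# below (3 states, 2 actions, Dirichlet transition rows, exponential sojourn
# times with state/action dependent rates) is an illustrative assumption.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# P_u[u, i, j] = P(x_{k+1} = j | x_k = i, u_k = u); each row sums to one.
P_u = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
rates = 1.0 + rng.random((n_states, n_actions))  # sojourn-time rates (hypothetical)

def kernel_step(i, u):
    """Sample (x_{k+1}, tau_{k+1}): next state and the dwell time in state i under action u."""
    j = rng.choice(n_states, p=P_u[u, i])
    tau = rng.exponential(1.0 / rates[i, u])
    return j, tau
```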

(We use the symbol $P$ for probability and $E$ for the expectation operator.) Note that $Q$ itself is independent of $\theta$; the dependence of the processes $\{(x_k, \tau_k, u_k)\}_{k\ge 0}$ on $\theta$ originates only through the policy $\mu$. We assume that
$$\int_0^\infty \tau\, Q_{ij}(d\tau; u) < \infty \tag{2}$$
for all $i$, $j$ and $u$. The continuous-time semi-Markov process $\{(x^\theta(t), u^\theta(t))\}_{t\in\mathbb{R}_+}$ is now defined as follows: let $x^\theta(t)$ and $u^\theta(t)$ be the right-continuous interpolations of $\{x_k\}_{k\ge 0}$ and $\{u_k\}_{k\ge 0}$ respectively, defined by
$$(x^\theta(t), u^\theta(t)) = (x_k, u_k) \quad \text{for } t \in [t_k, t_{k+1}), \tag{3}$$
where $t_k := t_{k-1} + \tau_k$, $k > 0$, $t_0 := 0$. Let $c : X \times U \to \mathbb{R}$ be a suitably defined cost function. For a given $\theta \in \mathbb{R}^K$, we define the continuous-time average cost function as
$$J(\theta) = \lim_{T\to\infty} \frac{1}{T}\, E\left\{ \int_0^T c(x^\theta(t), u^\theta(t))\, dt \right\}. \tag{4}$$

The aim is to minimise $J(\theta)$ with respect to the parameter $\theta$. Below, we state assumptions that ensure $J(\theta)$ is a well-defined quantity, with a limit independent of the initial distribution for $x_0$.

Remark 2.1 (On state-action constraints)

State-action constraints are specified by a family of subsets of $U$:
$$\{U_i \,;\, i \in X\}, \tag{5}$$
where $U_i \subseteq U$. When in state $i \in X$, the action must be selected from $U_i$. We can easily account for state-action constraints by designing parameterised randomised policies that satisfy $\mu(\theta, i, u) = 0$ for all $u \notin U_i$.
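As an illustration of such a constrained parameterisation, the sketch below uses a softmax over the admissible actions $U_i$ with hypothetical state-action features; this is one convenient choice satisfying $\mu(\theta,i,u) = 0$ for $u \notin U_i$, not necessarily the parameterisation used in the case study.

```python
import numpy as np

# Sketch of a parameterised randomised policy respecting (5):
# mu(theta, i, u) = 0 for every u outside U_i.  The softmax form and the
# feature map phi are assumptions made for illustration only.
def mu(theta, i, allowed_actions, phi, n_actions):
    """Return the probability vector mu(theta, i, .) over all n_actions actions.

    allowed_actions : the set U_i of admissible actions in state i
    phi(i, u)       : a K-dimensional feature vector (hypothetical)
    """
    allowed = list(allowed_actions)
    scores = np.array([theta @ phi(i, u) for u in allowed])
    scores -= scores.max()                     # numerical stability
    weights = np.exp(scores)
    probs = np.zeros(n_actions)
    probs[allowed] = weights / weights.sum()   # zero mass outside U_i
    return probs
```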

Recall the definition of unichain for a finite-state, discrete-time, homogeneous Markov chain with state space $X$ and transition probability matrix $[P_{ij}]$ [11]: a Markov chain is said to be unichain if any two closed sets of states have a non-empty intersection. (A set of states $\bar{X} \subseteq X$ is closed if $\sum_{j \in \bar{X}} P_{ij} = 1$ for all $i \in \bar{X}$.) Note that if there exists a state $i$ that is accessible from every state $j \in X$, then the Markov chain is unichain. This is typical for CAC problems, since the state that corresponds to an empty system is accessible from all other states of the system. Finally, note that the semi-Markov kernel in (1) specifies the transition probabilities
$$P_{ij}(u) := P(x_{k+1} = j \mid x_k = i,\, u_k = u) = \lim_{\tau\to\infty} Q_{ij}(\tau; u). \tag{6}$$

Assumption 2.1 For any $\theta \in \mathbb{R}^K$, the embedded chain $\{x_k\}_{k\ge 0}$ with transition probabilities
$$P_{ij}(\theta) := P(x_{k+1} = j \mid x_k = i) = \sum_{u\in U} P_{ij}(u)\, \mu(\theta, i, u)$$
is unichain.
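The sketch below forms the embedded transition matrix $P(\theta)$ of Assumption 2.1 from arrays holding $P_{ij}(u)$ and $\mu(\theta,i,u)$, and checks the sufficient condition mentioned above (a state accessible from every state) by a simple reachability test; the array shapes and helper names are assumptions for illustration.

```python
import numpy as np

# Sketch: build the embedded chain of Assumption 2.1 and test the sufficient
# condition for unichain quoted above (some state accessible from every state).
# P_u has shape (n_actions, n_states, n_states); mu_table[i, u] = mu(theta, i, u).
def embedded_chain(P_u, mu_table):
    # P(theta)_{ij} = sum_u P_{ij}(u) mu(theta, i, u)
    return np.einsum('uij,iu->ij', P_u, mu_table)

def has_globally_accessible_state(P, tol=1e-12):
    n = P.shape[0]
    reach = (P > tol).astype(int) + np.eye(n, dtype=int)   # one-step reachability plus self
    for _ in range(n):                                     # iterate to a transitive closure
        reach = (reach @ reach > 0).astype(int)
    return bool(reach.all(axis=0).any())                   # some column all ones
```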

Let $\{\pi(\theta, i) : i \in X\}$ be the unique stationary distribution of $\{x_k\}_{k\ge 0}$; uniqueness follows from Assumption 2.1. In fact, $\lim_{T\to\infty} \frac{1}{T}\sum_{k=0}^{T-1} P^k_{ij}(\theta) = \pi(\theta, j)$ independently of $i$ [11, Theorem 2.3.3]; $P^k_{ij}(\theta)$ is the $ij$-th element of the matrix $(P(\theta))^k$, with $P^0(\theta) := I$. By Theorem 3.5.1 of [11], under Assumption 2.1,
$$J(\theta) = \frac{\sum_{i\in X,\, u\in U} c(i,u)\,\bar\tau(i,u)\,\mu(\theta,i,u)\,\pi(\theta,i)}{\sum_{i\in X,\, u\in U} \bar\tau(i,u)\,\mu(\theta,i,u)\,\pi(\theta,i)}, \tag{7}$$
where
$$\bar\tau(i,u) := E\{\tau_{k+1} \mid x_k = i,\, u_k = u\} = \sum_{j\in X} \int_0^\infty \tau\, Q_{ij}(d\tau; u). \tag{8}$$
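When the model quantities in (7) and (8) are available as arrays (for instance, when validating a simulator or a gradient estimate on a small test problem), $J(\theta)$ can be evaluated directly; the sketch below does this with a least-squares solve for the stationary distribution. The array layout is an assumption, and the online algorithms of Sections 3 and 4 never require this model knowledge.

```python
import numpy as np

# Sketch: evaluate J(theta) directly through (7)-(8) when the model is known
# on a small test problem.  Array layout (states x actions) is an assumption.
def stationary_distribution(P):
    """Solve pi P = pi, sum(pi) = 1 for a unichain transition matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def average_cost(P_theta, mu_table, tau_bar, cost):
    """Equation (7): stationary averages of c*tau_bar and tau_bar, then their ratio."""
    pi = stationary_distribution(P_theta)                  # pi(theta, i)
    num = np.sum(cost * tau_bar * mu_table * pi[:, None])  # eta_c(theta)
    den = np.sum(tau_bar * mu_table * pi[:, None])         # eta_tau(theta)
    return num / den
```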

3 Online estimation of a gradient for J(θ)

For convenience, write
$$J(\theta) = \eta_c(\theta)/\eta_\tau(\theta), \tag{9}$$
where $\eta_c(\theta)$ and $\eta_\tau(\theta)$ denote the numerator and the denominator of (7), respectively. Then the gradient of $J(\theta)$ is
$$\nabla J(\theta) = \frac{\eta_\tau(\theta)\,\nabla\eta_c(\theta) - \eta_c(\theta)\,\nabla\eta_\tau(\theta)}{\eta_\tau(\theta)^2}. \tag{10}$$

Throughout this paper, the following convention for the use of $\nabla$ is adopted: (i) for a scalar $b(\theta) \in \mathbb{R}$, the vector $\nabla b := [\partial b/\partial\theta_1, \ldots, \partial b/\partial\theta_K]^T$; (ii) for a vector $b(\theta) \in \mathbb{R}^n$, the matrix $\nabla b := [\partial b/\partial\theta_1, \ldots, \partial b/\partial\theta_K]$; (iii) for a matrix $B(\theta) \in \mathbb{R}^{n\times m}$ and vectors $a \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, $a^T(\nabla B)b := [a^T(\partial B/\partial\theta_1)b, \ldots, a^T(\partial B/\partial\theta_K)b]^T$. Notice that
$$\eta_c(\theta) = \lim_{T\to\infty}\frac{1}{T}\, E_\theta\Big\{\sum_{k=0}^{T-1} c(x_k,u_k)\,\bar\tau(x_k,u_k)\Big\} \tag{11}$$
and similarly
$$\eta_\tau(\theta) = \lim_{T\to\infty}\frac{1}{T}\, E_\theta\Big\{\sum_{k=0}^{T-1} \bar\tau(x_k,u_k)\Big\}. \tag{12}$$
We have dropped the superscript $\theta$ on the processes in favour of the subscript $\theta$ on the expectation operator $E_\theta$, to indicate that the randomised policy $\mu(\theta,\cdot,\cdot)$ is in use. We observe that $\eta_c$ and $\eta_\tau$ are performance criteria for average cost MDPs with instantaneous state- and action-dependent costs given by $c(x_k,u_k)\,\bar\tau(x_k,u_k)$ and $\bar\tau(x_k,u_k)$, respectively.

Define the $i$-th component of $r(\theta) \in \mathbb{R}^{|X|}$ to be
$$r(\theta, i) := E_\theta\{c(x_k,u_k)\,\bar\tau(x_k,u_k) \mid x_k = i\} = \sum_{u\in U} c(i,u)\,\bar\tau(i,u)\,\mu(\theta,i,u). \tag{13}$$
Then we have
$$\eta_c(\theta) = \sum_{i\in X} \pi(\theta,i)\, r(\theta,i). \tag{14}$$
Dropping the dependence on $\theta$ for convenience, the gradient is
$$\nabla\eta_c = (\nabla\pi)^T r + (\nabla r)^T\pi. \tag{15}$$
The following result is Theorem 2 of [1] and it is proved therein.

Result 3.1 Given $\beta \in [0,1)$, define the $i$-th component of the vector $\bar{c}_\beta(\theta) \in \mathbb{R}^{|X|}$ as
$$\bar{c}_\beta(\theta,i) := \lim_{N\to\infty} e_i^T\Big[\sum_{k=0}^N \beta^k P^k(\theta)\Big] r(\theta), \tag{16}$$
where $e_i \in \mathbb{R}^{|X|}$ is the vector with the $i$-th component equal to 1 and all other components equal to 0. Then, for all $\theta \in \mathbb{R}^K$, $\lim_{\beta\to 1} \pi^T(\nabla P(\theta))\,\bar{c}_\beta = (\nabla\pi)^T r$.

The choice of notation $\bar{c}_\beta$ is explained as follows: note that $\bar{c}_\beta(\theta)$ is the value function of an infinite-horizon $\beta$-discounted MDP with transition probability matrix $P(\theta)$ and one-step cost $r(\theta)$; hence the subscript $\beta$. The over-bar indicates that the instantaneous cost $c$ has been "smoothed" (13) so that it is only a function of the current state $x_k$. The above result states that
$$\nabla_\beta\eta_c := \pi^T(\nabla P(\theta))\,\bar{c}_\beta + (\nabla r)^T\pi \tag{17}$$
is a good estimate of $\nabla\eta_c$ when $\beta$ is close to 1. As we show below, $\pi^T(\nabla P(\theta))\,\bar{c}_\beta + (\nabla r)^T\pi$ can be estimated online using only a single trajectory of the SMDP. The effect of $\beta$ in the gradient estimator presented below is a bias-variance trade-off: a small $\beta$ gives a small variance but a large bias, and vice versa. As was done for $\nabla\eta_c$, we can similarly define a biased gradient for $\nabla\eta_\tau$:
$$\nabla_\beta\eta_\tau := \pi^T(\nabla P(\theta))\,\bar\tau_\beta + (\nabla\bar\tau)^T\pi, \tag{18}$$
where the $i$-th components of the vectors $\bar\tau_\beta(\theta), \bar\tau(\theta) \in \mathbb{R}^{|X|}$ are defined as
$$\bar\tau(\theta,i) := E_\theta\{\bar\tau(x_k,u_k) \mid x_k = i\} = \sum_{u\in U} \bar\tau(i,u)\,\mu(\theta,i,u) \tag{19}$$

and
$$\bar\tau_\beta(\theta,i) := \lim_{N\to\infty} e_i^T\Big[\sum_{k=0}^N \beta^k P^k(\theta)\Big]\bar\tau(\theta). \tag{20}$$

Owing to the discussion in the previous paragraphs, we may use the following biased version:
$$\nabla J(\theta) \approx \eta_\tau(\theta)\,\nabla_\beta\eta_c(\theta) - \eta_c(\theta)\,\nabla_\beta\eta_\tau(\theta).$$
We have dropped the squared term in (10) since it is bounded below away from 0, and from above, uniformly in $\theta$; boundedness below away from 0 follows if one assumes $\bar\tau(i,u) > 0$ for all $i$ and $u$ (cf. (8)), which is a usual stability assumption. In any case, $-\eta_\tau\nabla\eta_c + \eta_c\nabla\eta_\tau$ is a direction of descent, as it makes a negative inner product with $\nabla J(\theta)$, and therefore may be used instead of $\nabla J(\theta)$ in a gradient descent algorithm to minimise $J(\theta)$. Using the same approach as in [1], one can compute a theoretical bound for the error between $\eta_\tau\nabla\eta_c - \eta_c\nabla\eta_\tau$ and $\eta_\tau\nabla_\beta\eta_c - \eta_c\nabla_\beta\eta_\tau$. Although we do not pursue this matter here, we demonstrate in the case study in Section 5 that the error is negligible even for a value $\beta = 0.5$. By observing only a single trajectory of the SMDP, algorithm SMDPBG (SMDP Biased Gradient) below produces a sequence of iterates $\{(\eta_c^{(k)}, \eta_\tau^{(k)}, \Delta_k^c, \Delta_k^\tau)\}_{k\ge 0}$ converging to $(\eta_c, \eta_\tau, \nabla_\beta\eta_c, \nabla_\beta\eta_\tau)$ almost surely as $k$ tends to infinity; see Theorem 3.2 below. Thus, at time $k$, we use the biased estimate of $\nabla J(\theta)$
$$\eta_\tau^{(k)}\,\Delta_k^c - \eta_c^{(k)}\,\Delta_k^\tau, \tag{21}$$
which also converges almost surely to $\eta_\tau\nabla_\beta\eta_c - \eta_c\nabla_\beta\eta_\tau$.
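As a quick toy illustration of Result 3.1, which underpins the biased gradients (17) and (18): for $\beta < 1$, (16) gives $\bar{c}_\beta = (I - \beta P)^{-1} r$, and $\pi^T(\nabla P(\theta))\,\bar{c}_\beta$ approaches $(\nabla\pi)^T r$ as $\beta \to 1$. The two-state chain, the cost vector and the finite-difference gradients below are assumptions used only to exercise the identity.

```python
import numpy as np

# Toy check of Result 3.1.  For beta < 1, (16) gives c_bar_beta = (I - beta P)^{-1} r,
# and pi^T (grad P) c_bar_beta should approach (grad pi)^T r as beta -> 1.
# The 2-state chain, the cost vector r and the finite-difference gradients are
# illustrative assumptions, not taken from the paper.
def stat_dist(P):
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    return np.linalg.lstsq(A, np.append(np.zeros(n), 1.0), rcond=None)[0]

def P_of(theta):                          # toy chain controlled by a scalar theta
    p = 1.0 / (1.0 + np.exp(-theta))      # sigmoid keeps the row stochastic
    return np.array([[p, 1.0 - p], [0.3, 0.7]])

r = np.array([1.0, 4.0])                  # smoothed one-step cost r(theta), held fixed
theta, h = 0.2, 1e-5
dP = (P_of(theta + h) - P_of(theta - h)) / (2 * h)                      # finite-difference grad P
dpi = (stat_dist(P_of(theta + h)) - stat_dist(P_of(theta - h))) / (2 * h)

for beta in (0.5, 0.9, 0.99, 0.999):
    P = P_of(theta)
    c_bar = np.linalg.solve(np.eye(2) - beta * P, r)        # equation (16)
    print(beta, stat_dist(P) @ dP @ c_bar, dpi @ r)         # first column -> second as beta -> 1
```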

Algorithm 3.1 (SMDPBG) Given $\beta \in (0,1)$, fixed $\theta \in \mathbb{R}^K$, randomised policy $\mu(\theta,\cdot,\cdot)$. Given arbitrary $x_0 \in X$, $\tau_0 \in \mathbb{R}_+$, and a SMDP state sequence $u_0, x_1, \tau_1, u_1, x_2, \tau_2, u_2, \ldots$ generated using policy $\mu(\theta,\cdot,\cdot)$ and law (1). Set $z_0 = \Delta_0^c = \Delta_0^\tau = 0 \in \mathbb{R}^K$, $\eta_c^{(0)} = \eta_\tau^{(0)} = 0$.
for $k = 0$ to $T-1$ do
$$z_{k+1} = \beta z_k + \frac{\nabla\mu(\theta,x_k,u_k)}{\mu(\theta,x_k,u_k)}, \tag{22}$$
$$\Delta_{k+1}^c = \Delta_k^c + \frac{1}{k+1}\left[c(x_{k+1},u_{k+1})\,\tau_{k+2}\left(z_{k+1} + \frac{\nabla\mu(\theta,x_{k+1},u_{k+1})}{\mu(\theta,x_{k+1},u_{k+1})}\right) - \Delta_k^c\right], \tag{23}$$
$$\Delta_{k+1}^\tau = \Delta_k^\tau + \frac{1}{k+1}\left[\tau_{k+2}\left(z_{k+1} + \frac{\nabla\mu(\theta,x_{k+1},u_{k+1})}{\mu(\theta,x_{k+1},u_{k+1})}\right) - \Delta_k^\tau\right], \tag{24}$$
$$\eta_c^{(k+1)} = \eta_c^{(k)} + \frac{1}{k+1}\left[c(x_k,u_k)\,\tau_{k+1} - \eta_c^{(k)}\right], \tag{25}$$
$$\eta_\tau^{(k+1)} = \eta_\tau^{(k)} + \frac{1}{k+1}\left[\tau_{k+1} - \eta_\tau^{(k)}\right]. \tag{26}$$
end for

$\eta_c^{(k)}$ and $\eta_\tau^{(k)}$ are the sample average estimators for $\eta_c(\theta)$ and $\eta_\tau(\theta)$, respectively. In the update of $z_{k+1}$, if $\mu(\theta,x_k,u_k) = 0$, we set $\nabla\mu(\theta,x_k,u_k)/\mu(\theta,x_k,u_k) = 0$.
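A compact single-trajectory sketch of Algorithm 3.1, as reconstructed above, is given below. The helpers `env_step` (sampling from the kernel (1)), `cost`, `mu` and `grad_log_mu` (which should return 0 whenever $\mu(\theta,x_k,u_k)=0$, in line with the remark above) are hypothetical user-supplied functions; the sketch returns the combined biased estimate (21) rather than the individual iterates.

```python
import numpy as np

# Single-trajectory sketch of Algorithm 3.1 (SMDPBG).  env_step(i, u) samples
# (next state, dwell time) from the kernel (1); cost(i, u) is the instantaneous
# cost; mu(theta, i) returns the action probabilities and grad_log_mu(theta, i, u)
# the score grad mu / mu (set to 0 where mu vanishes).  All four helpers are
# hypothetical user-supplied functions.
def smdp_bg(theta, beta, T, env_step, cost, mu, grad_log_mu, x0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    K = len(theta)
    z = np.zeros(K)
    d_c, d_tau = np.zeros(K), np.zeros(K)   # Delta^c_k, Delta^tau_k
    eta_c = eta_tau = 0.0                   # eta_c^(k), eta_tau^(k)

    x = x0
    u = rng.choice(len(mu(theta, x)), p=mu(theta, x))
    x1, tau1 = env_step(x, u)                               # (x_{k+1}, tau_{k+1})
    for k in range(T):
        z = beta * z + grad_log_mu(theta, x, u)             # (22)
        u1 = rng.choice(len(mu(theta, x1)), p=mu(theta, x1))
        x2, tau2 = env_step(x1, u1)                         # (x_{k+2}, tau_{k+2})
        g = z + grad_log_mu(theta, x1, u1)
        d_c += (cost(x1, u1) * tau2 * g - d_c) / (k + 1)    # (23)
        d_tau += (tau2 * g - d_tau) / (k + 1)               # (24)
        eta_c += (cost(x, u) * tau1 - eta_c) / (k + 1)      # (25)
        eta_tau += (tau1 - eta_tau) / (k + 1)               # (26)
        x, u, x1, tau1 = x1, u1, x2, tau2
    return eta_tau * d_c - eta_c * d_tau                    # biased estimate (21)
```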

We now state the main result of this paper concerning the convergence of algorithm SMDPBG. As in [1], we make the following assumption on the derivatives of the policy.

Assumption 3.1
$$\frac{\left|\partial\mu(\theta,i,u)/\partial\theta_k\right|}{\mu(\theta,i,u)} < B < \infty \tag{27}$$
uniformly in $k$, $\theta$, $i$ and $u$. If the denominator is 0, then boundedness implies that the numerator is also 0.

Theorem 3.2 Consider algorithm SMDPBG. Let condition (2) and Assumption 2.1 hold. Then the iterates $\eta_c^{(k)}$ and $\eta_\tau^{(k)}$ converge almost surely to $\eta_c$ and $\eta_\tau$ respectively. If, in addition, Assumption 3.1 and the following condition hold,
$$E\{\tau_{k+1}^2 \mid x_k = i,\, u_k = u\} < \infty \quad \forall\, i, u, \tag{28}$$
then the iterates $\Delta_k^c$ and $\Delta_k^\tau$ converge almost surely to $\nabla_\beta\eta_c$ and $\nabla_\beta\eta_\tau$ respectively. The proof of the above theorem appears in [7].

4 Online optimization of a SMDP (OLSMDP)

An online algorithm that updates the parameter $\theta$ to minimise $J(\theta)$ using stochastic approximation is presented. The following two-time-scale stochastic approximation algorithm updates $\theta$ at every time step:

Algorithm 4.1 (OLSMDP) Given $x_0$, $\tau_0$, $\beta \in (0,1)$, $\theta_0 \in \mathbb{R}^K$, randomised policies $\{\mu(\theta,\cdot,\cdot) : \theta \in \mathbb{R}^K\}$, positive step-sizes $\{\gamma_k\}_{k\ge1}$, $\{\xi_k\}_{k\ge1}$ satisfying $\gamma_k \to 0$, $\xi_k \to 0$. Generate $u_0 \sim \mu(\theta_0, x_0, \cdot)$. Observe $x_1$, $\tau_1$, generate $u_1 \sim \mu(\theta_0, x_1, \cdot)$, observe $c(x_1, u_1)$. Set $z_0 = \Delta_0^c = \Delta_0^\tau = 0 \in \mathbb{R}^K$, $\eta_c^{(0)} = \eta_\tau^{(0)} = 0$.
for $k = 0$ to $T-1$ do
• Update
$$\eta_c^{(k+1)} = \eta_c^{(k)} + \gamma_{k+1}\left[c(x_k,u_k)\,\tau_{k+1} - \eta_c^{(k)}\right], \tag{29}$$
$$\eta_\tau^{(k+1)} = \eta_\tau^{(k)} + \gamma_{k+1}\left[\tau_{k+1} - \eta_\tau^{(k)}\right], \tag{30}$$
$$z_{k+1} = \beta z_k + \frac{\nabla\mu(\theta_k, x_k, u_k)}{\mu(\theta_k, x_k, u_k)}. \tag{31}$$
• Observe $x_{k+2}$, $\tau_{k+2}$, generate $u_{k+2} \sim \mu(\theta_k, x_{k+2}, \cdot)$, observe $c(x_{k+2}, u_{k+2})$. Update
$$\Delta_{k+1}^c = \Delta_k^c + \gamma_{k+1}\left[c(x_{k+1},u_{k+1})\,\tau_{k+2}\left(z_{k+1} + \frac{\nabla\mu(\theta_k,x_{k+1},u_{k+1})}{\mu(\theta_k,x_{k+1},u_{k+1})}\right) - \Delta_k^c\right], \tag{32}$$
$$\Delta_{k+1}^\tau = \Delta_k^\tau + \gamma_{k+1}\left[\tau_{k+2}\left(z_{k+1} + \frac{\nabla\mu(\theta_k,x_{k+1},u_{k+1})}{\mu(\theta_k,x_{k+1},u_{k+1})}\right) - \Delta_k^\tau\right], \tag{33}$$
$$\theta_{k+1} = \theta_k - \xi_{k+1}\left[\eta_\tau^{(k+1)}\,\Delta_{k+1}^c - \eta_c^{(k+1)}\,\Delta_{k+1}^\tau\right]. \tag{34}$$
end for
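The sketch below condenses the OLSMDP recursions (29)-(34) into code, using the same hypothetical helpers as the SMDPBG sketch. The step-size schedules are illustrative placeholders chosen so that the $\theta$-update runs on the slower time scale; the paper's own conditions on the step sizes are stated after the algorithm.

```python
import numpy as np

# Condensed sketch of the OLSMDP recursions (29)-(34).  The helpers are the
# same hypothetical ones as in the SMDPBG sketch; the step-size schedules
# gamma_k and xi_k are illustrative placeholders (theta moves on the slower
# time scale), not the paper's prescription.
def olsmdp(theta, beta, T, env_step, cost, mu, grad_log_mu, x0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float).copy()
    z = np.zeros_like(theta)
    d_c, d_tau = np.zeros_like(theta), np.zeros_like(theta)
    eta_c = eta_tau = 0.0

    x = x0
    u = rng.choice(len(mu(theta, x)), p=mu(theta, x))
    x1, tau1 = env_step(x, u)
    for k in range(T):
        gamma, xi = (k + 1) ** -0.6, (k + 1) ** -1.0        # assumed two-time-scale steps
        eta_c += gamma * (cost(x, u) * tau1 - eta_c)        # (29)
        eta_tau += gamma * (tau1 - eta_tau)                 # (30)
        z = beta * z + grad_log_mu(theta, x, u)             # (31)
        u1 = rng.choice(len(mu(theta, x1)), p=mu(theta, x1))
        x2, tau2 = env_step(x1, u1)
        g = z + grad_log_mu(theta, x1, u1)
        d_c += gamma * (cost(x1, u1) * tau2 * g - d_c)      # (32)
        d_tau += gamma * (tau2 * g - d_tau)                 # (33)
        theta -= xi * (eta_tau * d_c - eta_c * d_tau)       # (34)
        x, u, x1, tau1 = x1, u1, x2, tau2
    return theta
```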

The recursion involving $\theta_k$ in the above algorithm is the actual stochastic approximation step, which uses $\eta_\tau^{(k+1)}\Delta_{k+1}^c - \eta_c^{(k+1)}\Delta_{k+1}^\tau$ as the biased estimate of $\nabla J(\theta_k)$. The choice of step-sizes $\gamma_k$ and $\xi_k$ should satisfy, in particular, $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$.

5 Call Admission Control

... total bandwidth $M$ available to the network. Although we assume Poisson arrivals and departures, we do not assume knowledge of the numerical values of the intensities. This immediately precludes the use of classical value iteration, policy iteration or the linear programming method. By a loss system, we mean that an arriving class $i$ user is either admitted or rejected. If admitted, the user remains in the system for the duration of its service time. If rejected, the user is lost to the system. Note that the sojourn time of an admitted user in the system is equal to its service time. Let $x(t) = [x_1(t), \ldots, x_K(t)]^T \in X$ be the state of the network at time $t$, where the state space $X$ is defined as follows: $X := \{x = [x_1, \ldots, x_K]^T : x_i \in \mathbb{Z}_+,\ \ldots$
