The Policy Improvement Algorithm: General Theory with Applications to Queueing Networks and Their Fluid Models

Sean Meyn



Coordinated Science Laboratory and the University of Illinois 1308 W. Main Street Urbana, IL 61801

Drafted on August 28, 1995 Revised on February 22, 1996

Abstract

The average cost optimal control problem is addressed for Markov decision processes with unbounded cost. It is found that the policy improvement algorithm generates a sequence of policies which are $c$-regular (a strong stability condition), where $c$ is the cost function under consideration. This result requires only the existence of an initial $c$-regular policy and an irreducibility condition on the state space. Furthermore, under these conditions the sequence of relative value functions generated by the algorithm is bounded from below and "nearly" decreasing, from which it follows that the algorithm is always convergent. Under further conditions, it is shown that the algorithm does compute a solution to the optimality equations, and hence an optimal average cost policy. These results shed new light on the optimal scheduling problem for multiclass queueing networks. Surprisingly, it is found that the formulation of optimal policies for a network is closely linked to the optimal control of its associated fluid model. Moreover, the relative value function for the network control problem is closely related to the value function for the fluid network. These results are surprising since randomness plays such an important role in network performance.

Keywords: Markov Decision Processes, Poisson equation, Howard's algorithm, policy improvement algorithm, multiclass queueing networks. Short title: The Policy Improvement Algorithm

* Work supported in part by NSF Grant ECS 940372, University of Illinois Research Board Grant Beckman 1-6-49749, and JSEP grant N00014-90-J-1270.

1 Introduction


1.1 Background

This paper develops simple conditions which guarantee the convergence of the policy improvement algorithm. These results allow the synthesis of an average cost optimal stationary policy for a Markov decision process with unbounded cost and general state space. This paper is an abridged version of [20], which contains further details on general models.

The existence of average cost optimal policies for Markov decision processes on a general or countable state space has been studied for a number of years; see for instance [1, 23] for recent surveys. Up to now, determining whether an optimal policy exists has required strong uniform stability or recurrence conditions on the process, over all policies. To show that an algorithm can be devised to compute an optimal policy, e.g. through an associated discounted control problem, strong assumptions are frequently imposed on the intermediate functions of interest. For instance, uniform upper and lower bounds have been imposed on the value function $V_\beta$ for the discounted problem as the discount factor $\beta \uparrow 1$ [1, 12, 11, 24]. The paper [22] gives perhaps the mildest criterion for the existence of an optimal stationary policy for general state space models. However, again one must find uniform lower bounds on a normalized value function. In the countable state space case, it is shown in [2, 3] that if the cost function is suitably unbounded, then the discounted value functions possess certain solidarity properties for all sufficiently large discount rates. This can then be used to show that the normalized value functions are bounded from below.

The present paper proceeds in a manner related to [2, 3], in that we again consider cost functions which are large, perhaps in an average sense, whenever the state is "large". The focus here will be on the policy improvement algorithm, and criteria for its convergence.
We consider general state space models, so that the main results apply to systems ranging from linear Gaussian models to multiclass queueing networks. The convergence results, and the intermediate bounds obtained, are new even in the special case of countable state space models.

The policy improvement algorithm was introduced by Howard in the fifties [15]. It is a natural approach to the synthesis of optimal policies in which a succession of policies $\{f_n\}$ is constructed together with relative value functions $\{h_n\}$ and intermediate steady-state costs $\{\eta_n\}$. In the finite state space case it is known that the algorithm computes an optimal policy in a finite number of steps, but in the general state space case, or even in the case of countable state spaces, little is known about the algorithm except in special cases [13, 1, 7]. By extending recent results of [21, 10, 19, 14] on properties of solutions of Poisson's equation, we find that the relative value functions $\{h_n\}$ are surprisingly well behaved. In Theorems 2.5 and 2.6 below the following bounds are derived:

(i) When properly normalized by additive constants, there is a constant $b$ such that
$$0 \le h_n(x) \le b(h_0(x) + 1), \qquad x \in X,\ n \in \mathbb{Z}_+.$$


(ii) The function $h_{n-1}$ serves as a stochastic Lyapunov function for the chain $\Phi^{f_n}$, and hence all of the chains are $c_n$-regular, as defined in [19], where $c_n$ is the one-step cost using policy $f_n$.

(iii) The functions $h_n$ are "almost" decreasing, and hence converge pointwise to a limit $h$.

These properties hold without any blanket stability condition on the controlled processes, except for the existence of an initial stabilizing policy. The only global requirement is an irreducibility condition on a single compact subset of the state space, and a natural unboundedness condition for the cost function. Given the strong form of convergence of the algorithm, it is then easy to formulate conditions under which the limit gives rise to an optimal policy.

1.2 Markov decision processes

A Markov decision process (MDP) is a stochastic process $\Phi = \{\Phi_n : n \ge 0\}$ evolving on a state space $X$, with some control sequence taking values in the action space $A$. The state space $X$ and the action space $A$ are assumed to be locally compact, separable metric spaces, and we let $\mathcal{B}(X)$ denote the (countably generated) Borel $\sigma$-field of $X$. To each $x \in X$ there is a non-empty, closed subset $A(x) \subseteq A$ whose elements are the admissible actions when the MDP takes the value $x$. The set of admissible state-action pairs $\{(x,a) : x \in X,\ a \in A(x)\}$ is assumed to be a measurable subset of the product space $X \times A$.

The transitions of the MDP are governed by the conditional probability distributions $\{P_a(x, B)\}$, which describe the probability that the next state is in $B \in \mathcal{B}(X)$ given that the current state is $x \in X$ and the current action chosen is $a \in A$. These are assumed to be probability measures on $\mathcal{B}(X)$ for each state-action pair $(x, a)$, and measurable functions of $(x, a)$ for each $B \in \mathcal{B}(X)$.

The choice of action when in state $x$ is governed by a policy. A (stationary) policy is simply a measurable function $f \colon X \to A$ such that $f(x) \in A(x)$ for all $x$. When the policy $f$ is applied to the MDP, the action $f(x)$ is applied whenever the MDP is in state $x$, independent of the past and independent of the time period. We write $P_f(x, B) = P_{f(x)}(x, B)$ for the transition law corresponding to a stationary policy $f$. The state process $\Phi^f := \{\Phi_k^f : k \ge 0\}$ of the MDP is, for each fixed $f$, a Markov chain on $(X, \mathcal{B}(X))$, and we write the $n$-step transition probabilities for this chain as
$$P_f^n(x, B) = \mathsf{P}(\Phi_n^f \in B \mid \Phi_0^f = x), \qquad x \in X,\ B \in \mathcal{B}(X).$$
For a measurable function $h \colon X \to \mathbb{R}$, we will frequently use the operator-theoretic notation
$$P_f^n h\,(x) := \mathsf{E}[h(\Phi_n^f) \mid \Phi_0^f = x].$$
A probability measure $\pi_f$ is called invariant for the chain $\Phi^f$ if it satisfies the invariance equations
$$\pi_f(A) = \int \pi_f(dy)\, P_f(y, A), \qquad A \in \mathcal{B}(X).$$


We assume that a cost function $c \colon X \times A \to \mathbb{R}_+ := [0, \infty)$ is given. The average cost of a particular policy $f$ is, for a given initial condition $x$, defined as
$$J(f, x) := \limsup_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} \mathsf{E}_x[c_f(\Phi_k^f)],$$
where $c_f(y) = c(y, f(y))$. Typically the average cost is independent of $x$, and the resulting cost is $J(f, x) = J(f) = \int \pi_f(dy)\, c_f(y)$. A policy $f^*$ is then called optimal if $J(f^*) \le J(f)$ for all admissible policies $f$.

We assume that there is a large penalty for large control actions, or large excursions of the state. This requires that the cost be norm-like: the sublevel set $\{(x, a) : c(x, a) \le b\}$ is a precompact subset of $X \times A$ for any $b > 0$, though we relax this condition in Theorem 2.7 of [20].

One fruitful approach to finding optimal policies is through the following optimality equations:
$$\eta + h(x) = \min_{a \in A(x)} [\, c(x, a) + P_a h\,(x) \,], \qquad (1)$$
$$f(x) = \arg\min_{a \in A(x)} [\, c(x, a) + P_a h\,(x) \,], \qquad x \in X. \qquad (2)$$

If a policy $f$, a measurable function $h$, and a constant $\eta$ exist which solve these equations, then typically the policy $f$ is optimal. See, for example, [17, 12, 1, 20] for a proof of this and related results.

Theorem 1.1 Suppose that the following conditions hold:

(a) The pair $(\eta, h)$ solves the optimality equation (1);

(b) The policy $f$ satisfies (2), so that
$$c_f(x) + P_f h\,(x) \le c(x, a) + P_a h\,(x), \qquad x \in X,\ a \in A(x);$$

(c) For any $x \in X$, and any admissible policy $f$ satisfying $J(f, x) < \infty$,
$$\frac{1}{n} P_f^n h\,(x) \to 0, \qquad n \to \infty.$$

Then $f$ is an optimal control, and $\eta$ is the optimal cost, in the sense that
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathsf{E}_x[c_f(\Phi_k^f)] = \eta,$$
and $J(f, x) \ge \eta$ for all policies $f$. $\square$

We show in this paper that the policy improvement algorithm is an effective approach to establishing the existence of solutions to the optimality equations. In some cases it gives a practical algorithm for their computation; in others it gives insight into the structure of optimal policies.

The paper is organized as follows. After presenting some general definitions and deriving some bounds for Harris recurrent Markov chains in Section 2, we derive our main convergence results for the policy improvement algorithm, and give conditions under which the limiting relative value function solves the optimality equation. In Section 3 we develop some applications to the scheduling problem for multiclass queueing networks, and the control of linear stochastic systems.

2 The Policy Improvement Algorithm


The policy improvement algorithm (PIA) is a method for successively computing increasingly well-behaved stationary policies for an MDP. The important features of the algorithm can be explained in a few paragraphs.

Suppose that a policy $f_{n-1}$ is given, and assume that $h_{n-1}$ satisfies the Poisson equation
$$P_{f_{n-1}} h_{n-1} = h_{n-1} - c_{n-1} + \eta_{n-1},$$
where $c_{n-1}(x) = c_{f_{n-1}}(x) = c(x, f_{n-1}(x))$, and $\eta_{n-1}$ is a constant (presumed to be the steady-state cost with this policy). Then one attempts to find an improved policy $f_n$ by choosing, for each $x$,
$$f_n(x) = \arg\min_{a \in A(x)} [\, c(x, a) + P_a h_{n-1}(x) \,]. \qquad (3)$$
Once $f_n$ is found, policies $f_{n+1}, f_{n+2}, \ldots$ may be computed by induction, so long as the appropriate Poisson equation may be solved, and the minimization above has a solution. We will establish a general convergence result for the PIA, which when specialized to the countable state space case yields the following.

Theorem 2.1 Consider an MDP with $X = A = \mathbb{Z}_+$, and suppose that

(a) There exists a policy $f_0$ for which $\Phi^{f_0}$ is an irreducible Markov chain, and such that $J(f_0) = \eta_0 < \infty$.

(b) The cost function $c$ is norm-like on the product space $X \times A$, and there exists a norm-like function $\underline{c} \colon X \to \mathbb{R}_+$ such that $c(x, a) \ge \underline{c}(x)$ for any $x \in X$, $a \in A(x)$.

(c) There is a function $s \colon X \to (0, 1)$ such that for any policy $f$ satisfying $J(f) \le \eta_0$,
$$K_f(\ell, 0) := \sum_{k=0}^{\infty} \big(\tfrac{1}{2}\big)^{k+1} P_f^k(\ell, 0) \ge s(\ell), \qquad \ell \in X. \qquad (4)$$

Then the following hold:

(i) The optimality equations (1,2) admit a solution $(h, f, \eta)$.

(ii) Using the initial policy $f_0$, the PIA produces a sequence of solutions $(f_n, h_n, \eta_n)$ such that $\{h_n\}$ is pointwise convergent to a solution of the optimality equation (1), and any policy $f$ which is an accumulation point of the $\{f_n\}$ satisfies (2).

(iii) The limiting policy $f$ gives rise to a $c_f$-regular Markov chain. Hence, for any initial condition $x \in X$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathsf{E}_x[c_f(\Phi_k^f)] = J(f) < \infty. \qquad \square$$
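For a finite model, the policy improvement recursion can be carried out exactly by linear algebra. The sketch below (a hypothetical randomly generated MDP in Python with NumPy; not an example from the paper) solves the Poisson equation at each stage with the normalization $h(0) = 0$, and then takes the pointwise minimizer as in (3):

```python
import numpy as np

# Hypothetical finite MDP with n_states states and n_actions actions.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a][x] = row of P_a(x, .)
c = np.array([[1.0, 3.0],
              [2.0, 1.0],
              [4.0, 2.0]])          # one-step costs c(x, a)

def poisson_solve(f):
    """Solve P_f h = h - c_f + eta with the normalization h(0) = 0."""
    Pf = np.array([P[f[x], x] for x in range(n_states)])
    cf = np.array([c[x, f[x]] for x in range(n_states)])
    # Unknowns (h(1), ..., h(n-1), eta):  (I - P_f) h + eta * 1 = c_f.
    M = np.column_stack([(np.eye(n_states) - Pf)[:, 1:], np.ones(n_states)])
    sol = np.linalg.solve(M, cf)
    return np.concatenate([[0.0], sol[:-1]]), sol[-1]

f = np.zeros(n_states, dtype=int)   # initial policy f_0
for _ in range(20):
    h, eta = poisson_solve(f)
    # improvement step (3): f_{n+1}(x) = argmin_a [ c(x, a) + P_a h (x) ]
    f_new = np.array([int(np.argmin([c[x, a] + P[a, x] @ h
                                     for a in range(n_actions)]))
                      for x in range(n_states)])
    if np.array_equal(f_new, f):    # policy stable
        break
    f = f_new
```

At termination the fixed policy satisfies the optimality equations (1,2) for this toy model; the content of Theorems 2.1 and 2.6 is that an analogous conclusion survives on an unbounded state space with unbounded cost.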


This result is a consequence of the far more general Theorem 2.8 of [20]. In Section 2.3 we provide conditions which guarantee that the limiting policy $f$ is optimal.

The assumption of norm-like costs has been used extensively in the recent literature; see for instance [1, 22]. The main result, Theorem 2.7 of [20], relaxes this condition so that the norm-like condition is only required in a time/ensemble-averaged sense. This allows application to the LQG problem where the state weighting matrix is not necessarily full rank.

It is well known that the average cost optimal control problem is plagued with counterexamples [25, 1, 8]. It is of some interest then to see why Theorem 2.1 does not fall into any of these traps. Consider first counterexamples 1 and 2 of [25, p. 142]. In each of these examples the process, for any policy, is completely non-irreducible in the sense that $\mathsf{P}(\Phi_k^f < \Phi_0^f) = 0$ for all times $k$ and all policies $f$. It is clear then from the cost structure that (4) cannot hold. A third example is given in the Appendix of [25]. Here (4) is directly assumed! However, the cost is not unbounded, and in fact is designed to favor large states.

The assumptions (b) and (c) imply that the center of the state space, as measured by the cost criterion, possesses some minimal amount of irreducibility, at least for the policies $\{f_n\}$. If either the unboundedness condition or the accessibility condition is relaxed, so that the process is non-irreducible on a set where the cost is low, then optimal stationary policies may not exist.

To understand the PIA, we must consider the pair of equations
$$P_{f_n} h_n = h_n - \bar{c}_n, \qquad (5)$$
$$P_{f_n} h_{n-1} = h_{n-1} - \bar{c}_n - \Delta_n, \qquad (6)$$
where $\bar{c}_n = c_n - \eta_n$. The second identity follows from the definition of $f_n$ above, where the function $\Delta_n \colon X \to \mathbb{R}$ satisfies the lower bound $\Delta_n(x) \ge \eta_n - \eta_{n-1}$, $x \in X$. If $h_n$ is bounded from below, and if $J(f_n) = \eta_n$, it follows from the Comparison Theorem [19, p. 337] that $\Delta_n$ satisfies the average upper bound $\pi_{f_n}(\Delta_n) \le 0$.

Thus for large $n$ the error term $\Delta_n$ is small, and hence the function $h_{n-1}$ almost solves the Poisson equation for $P_{f_n}$. One might then expect that $h_n$ will be close to $h_{n-1}$. Under mild conditions, this is shown to be true in a very strong sense. In Theorem 2.6 we establish the following remarkable properties of the PIA:

(P1) Uniform boundedness from below: for some constant $0 < N < \infty$,
$$\inf_{x \in X,\ n \ge 0} h_n(x) > -N;$$

(P2) Almost decreasing property: there exists a sequence of functions $\{g_n : n \ge 0\}$ such that
$$g_n(x) \le g_{n-1}(x) \le \cdots \le g_0(x), \qquad x \in X,\ n \ge 0,$$
and for some sequences of positive numbers $\{\gamma_k\}$, $\{\beta_k\}$,
$$g_n(x) = \gamma_n h_n(x) + \beta_n, \qquad n \ge 0,\ x \in X,$$
with $\gamma_k \downarrow 1$, $\beta_k \downarrow 0$ as $k \to \infty$.

These properties together imply that the relative value functions are pointwise convergent to the function $h(x) := \lim_n g_n(x)$.


2.1 A general stability result

To begin, we show that the PIA generates policies which give rise to "stable" Markov chains $\{\Phi^{f_n} : n \ge 0\}$, provided the initial policy is suitably stabilizing and the cost functions are appropriately chosen. The processes we consider will be $\psi$-irreducible, either by assumption or by implication. The results developed here depend strongly on recent results for $\psi$-irreducible Markov chains as described in [19]. In particular, we require lower bounds and uniqueness for solutions to Poisson's equation for $c$-regular chains. To begin then, we must define these terms.

Assume momentarily that a stationary policy $f$ has been specified, and let $P = P_f$, $c(x) = c(x, f(x))$. It will simplify the development here if we assume that $c \ge 1$; this may of course be assumed without loss of generality. The Markov chain $\Phi$ with transition function $P$ is assumed to be $\psi$-irreducible for some maximal irreducibility measure $\psi$. Hence we have
$$K(x, A) := \sum_{k=0}^{\infty} \big(\tfrac{1}{2}\big)^{k+1} P^k(x, A) > 0, \qquad x \in X, \text{ whenever } \psi(A) > 0.$$

We let $\mathcal{B}^+(X)$ denote the set of $A \in \mathcal{B}(X)$ for which $\psi(A) > 0$. The $\sigma$-field $\mathcal{F}_n$ is defined to be $\mathcal{F}_n = \sigma(\Phi_0, \ldots, \Phi_n)$, $n \ge 0$, and the $\mathcal{F}_n$-adapted stopping times $\sigma_A$, $\tau_A$ are defined as
$$\sigma_A = \min(k \ge 0 : \Phi_k \in A), \qquad \tau_A = \min(k \ge 1 : \Phi_k \in A).$$
A set $C \in \mathcal{B}(X)$ is called petite if for some probability $\nu$ on $\mathcal{B}(X)$ and an $\varepsilon > 0$,
$$K(x, A) \ge \varepsilon\, \nu(A), \qquad x \in C,\ A \in \mathcal{B}(X).$$
Equivalently, for a $\psi$-irreducible chain, the set $C$ is petite if for each $A \in \mathcal{B}^+(X)$ there exist $n \ge 1$ and $\delta > 0$ such that
$$\mathsf{P}_x(\tau_A \le n) \ge \delta \quad \text{for any } x \in C. \qquad (7)$$
For a $\psi$-irreducible process there always exists a countable covering of the state space by petite sets.

A set $S \in \mathcal{B}(X)$ is called $c$-regular if for any $A \in \mathcal{B}^+(X)$,
$$\sup_{x \in S} \mathsf{E}_x\Big[ \sum_{i=0}^{\tau_A - 1} c(\Phi_i) \Big] < \infty,$$

where $\mathsf{E}_x$ denotes the expectation operator when $\Phi_0 = x$. From the characterization in (7) we see that a $c$-regular set is always petite. The Markov chain is $c$-regular if the state space $X$ admits a countable covering by $c$-regular sets. The class of $c$-regular processes is developed in [19, Chapter 14].

Regularity is closely connected to the following extension of Foster's criterion:
$$PV \le V - c + \eta, \qquad (8)$$
where $V \colon X \to \mathbb{R}_+$ and $\eta \in \mathbb{R}_+$. The bound (8) can have no meaning unless some structure is imposed on the state space and on the function $c$. Since $c$ is interpreted as a cost function, it is natural to suppose that this function is norm-like. Recall that this means that the sublevel set $C_c(n) = \{x : c(x) \le n\}$ is precompact for each $n$. To connect the topology of the state space with measure-theoretic properties of the Markov chain, we typically assume that all compact sets are petite, so that the Markov chain is a T-chain [19]. Theorem 2.2 shows that the strong $c$-regularity property is equivalent to the generalization (8) of Foster's criterion.
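As a concrete (hypothetical) instance of (8): for a discrete-time M/M/1-type queue with up-step probability $\lambda$, down-step probability $\mu > \lambda$, and cost $c(x) = x + 1$, the quadratic function $V(x) = x^2/(\mu - \lambda)$ satisfies the drift inequality with a finite constant $\eta$. The Python check below (an illustration, not taken from the paper) searches for the smallest such $\eta$ over a large range of states:

```python
# Hypothetical discrete-time M/M/1-type queue: up with probability lam,
# down with probability mu (when nonempty); cost c(x) = x + 1, and the
# candidate Lyapunov function V(x) = x^2 / (mu - lam).
lam, mu = 0.3, 0.5
V = lambda x: x * x / (mu - lam)
c = lambda x: x + 1.0

def drift(x):
    """P V (x) - V(x) for the birth-death chain."""
    down = mu if x > 0 else 0.0
    return lam * (V(x + 1) - V(x)) + down * (V(x - 1) - V(x))

# Smallest constant eta making (8), PV <= V - c + eta, hold on 0 <= x <= 10^4.
eta = max(drift(x) + c(x) for x in range(10_001))
# For these rates eta = 4, attained at x = 1; for large x the negative drift
# dominates the cost, which is the mechanism exploited in Theorem 2.2.
```

The point of the computation is that $\eta$ is attained at a small value of $x$: outside a compact set the drift term $-c$ dominates, which is exactly why (8) forces $c$-regularity.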

Theorem 2.2 Assume that $c \colon X \to \mathbb{R}_+$ is norm-like, and that all compact subsets of $X$ are petite. Then:

(a) If there exists a finite, positive-valued solution $V$ to the inequality (8), then for each $A \in \mathcal{B}^+(X)$ there exists a $d(A) < \infty$ such that
$$\mathsf{E}_x\Big[ \sum_{k=0}^{\sigma_A} c(\Phi_k) \Big] \le 2V(x) + d(A), \qquad x \in X. \qquad (9)$$
Hence each of the sublevel sets $C_V(n) = \{x : V(x) \le n\}$ is $c$-regular, and the process itself is $c$-regular.

(b) If the chain is $c$-regular, then for any $c$-regular set $S \in \mathcal{B}^+(X)$, the function
$$V(x) = \mathsf{E}_x\Big[ \sum_{k=0}^{\sigma_S} c(\Phi_k) \Big], \qquad x \in X, \qquad (10)$$
is a norm-like solution to (8).

Proof The result is essentially known: the bound (8) is equivalent to the drift condition $PV_0 \le V_0 - c + b\mathbf{1}_K$, where $K$ is compact: if (8) holds, we can take $V_0 = 2V$ and $b = 2\eta$. The result is then an immediate consequence of [19, Theorem 14.2.3]. $\square$

A critical concept in the theory of MDPs is the Poisson equation. In [19, 10] it is shown that regularity is a simple approach to obtaining bounds on solutions to this equation. We now describe how to obtain useful solutions to the Poisson equation
$$Ph = h - c + \eta, \qquad (11)$$
where $c \colon X \to [1, \infty)$ is norm-like, and $\eta > 0$. The following theorem establishes the existence of suitably bounded solutions to Poisson's equation, given that the process is $c$-regular. A proof is provided in [20].

Theorem 2.3 Assume that $c \colon X \to \mathbb{R}_+$ is norm-like, that all compact subsets of $X$ are petite, and that $\Phi$ is $c$-regular, so that there exists a finite, positive-valued solution $V$ to (8). Then there exists a solution $h$ to Poisson's equation (11) satisfying
$$-N \le h(x) \le dV(x), \qquad x \in X,$$
where $d$ and $N$ are finite constants. $\square$
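On a finite state space the solution promised by Theorem 2.3 can be computed two ways and compared: directly from (11) with the normalization $h(x_0) = 0$ at a distinguished state $x_0 = 0$, and through the standard first-passage representation $h(x) = \mathsf{E}_x[\sum_{k=0}^{\tau_0 - 1} (c(\Phi_k) - \eta)]$ (made explicit in (14) below). A hypothetical Python sketch, not from the paper:

```python
import numpy as np

# Hypothetical irreducible four-state chain.
rng = np.random.default_rng(1)
n = 4
P = rng.dirichlet(np.ones(n), size=n)   # strictly positive rows
c = np.array([1.0, 2.0, 4.0, 8.0])

# eta = pi(c), with pi the invariant probability measure.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
pi = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]
eta = pi @ c

# (i) Poisson's equation (11): (I - P) h = c - eta, normalized by h(0) = 0.
h = np.zeros(n)
h[1:] = np.linalg.solve((np.eye(n) - P)[1:, 1:], (c - eta)[1:])

# (ii) First-passage form: g(x) = E_x[ sum_{k < tau_0} (c(Phi_k) - eta) ],
# obtained by killing all transitions into state 0.
P0 = P.copy()
P0[:, 0] = 0.0
g = np.linalg.solve(np.eye(n) - P0, c - eta)

# h and g coincide, and g(0) = 0, in agreement with the uniqueness theory.
```

That the two computations agree is a finite-dimensional shadow of Theorem 2.4 below: solutions to (11), normalized at a single state, are unique.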


Given the lower bound on $h$, it is easy to establish uniqueness. The proof of Theorem 2.4 and more general results may be found in [20]; related results are given in [21, 10, 19, 26].

Theorem 2.4 Suppose that $c \colon X \to \mathbb{R}_+$ is norm-like, and that all compact subsets of $X$ are petite. Let $h$ and $g$ be two functions on $X$ which are bounded from below, and satisfy the inequalities
$$Ph \le h - c + \eta, \qquad Pg \ge g - c + \eta,$$
where $\eta = \int c\, d\pi$. Then $h - g$ is a constant, and both functions satisfy the Poisson equation (11). $\square$

We now return to the case of controlled Markov chains. We henceforth call a policy $f$ regular if the controlled process $\Phi^f$ is a $c_f$-regular Markov chain. To apply the previous results, we require the following bound on the cost function.

(A1') The cost function $c$ is norm-like on the product space $X \times A$, and there exists a norm-like function $\underline{c} \colon X \to \mathbb{R}_+$ such that $c(x, a) \ge \underline{c}(x)$ for any $x \in X$, $a \in A(x)$.

We have included a prime in (A1') to indicate that this condition is somewhat relaxed in [20]. Note again that while we are assuming that $c \ge 0$ in (A1'), in the analysis to follow we assume that $c \ge 1$. If $\{(f_n, h_n, \eta_n) : n \ge 1\}$ is a solution to the PIA with the cost $c$, then $\{(f_n, h_n, \eta_n + 1) : n \ge 1\}$ is a solution with cost $c + 1$, so this makes little difference in practice.

To invoke the algorithm we must also ensure that the required minimum exists. The following condition will hold automatically under appropriate continuity conditions, since the cost $c$ is norm-like on the product space $X \times A$; see [18] for results in this direction.

(A2) For each $n$, if the PIA yields a triple $(f_n, h_n, \eta_n)$ which solves the Poisson equation
$$P_{f_n} h_n = h_n - c_n + \eta_n,$$
with $h_n$ bounded from below, then the minimization
$$f_{n+1}(x) := \arg\min_{a \in A(x)} [\, c(x, a) + P_a h_n(x) \,]$$
admits a measurable solution $f_{n+1}$.

Condition (A2) may be relaxed by taking a "near minimizer" $f_{n+1}$ such that
$$c_{f_{n+1}}(x) + P_{f_{n+1}} h_n(x) \le c(x, a) + P_a h_n(x) + \varepsilon_{n+1}, \qquad x \in X,\ a \in A(x),$$
where $\{\varepsilon_n\}$ is a positive, summable sequence. The development to follow then requires only superficial modifications.

We now show that the algorithm recursively produces stabilizing policies under (A1')-(A2). More general results can easily be formulated following a similar proof.


Theorem 2.5 Suppose that (A1') and (A2) hold, and that for some $n$ the policies $\{f_i : i < n\}$ and relative value functions $\{h_i : i < n\}$ are defined through the policy improvement algorithm. Suppose moreover that

(a) The relative value function $h_i$ is bounded from below, $i < n$;

(b) Each policy $f_i$ is regular, $i < n$;

(c) All compact sets are petite for the Markov chain $\Phi^{f_n}$.

Then the PIA admits a solution $(f_n, h_n, \eta_n)$ such that

(i) The relative value function $h_n$ is bounded from below;

(ii) The constant $\eta_n$ is the cost at the $n$th stage, $\eta_n = J(f_n)$, and the costs are decreasing:
$$\eta_0 \ge \eta_1 \ge \cdots \ge \eta_n;$$

(iii) The policy $f_n$ is regular;

(iv) If $\eta_n = \eta_{n-1}$, then the triple $(f_n, h_n, \eta_n)$ satisfies the optimality equations (1,2).

Proof We first prove (i) and (iii). Result (ii) is then a consequence of (iii) and the Comparison Theorem [19]. The main idea is to apply (6), which is a version of (8) if $h_{n-1} \ge 0$. Assume without loss of generality that $h_{n-1} \ge 1$, and let $f_n$ be the policy which attains the minimum in (3). Then by (6) the function $V_n = h_{n-1}$ satisfies
$$P_{f_n} V_n \le V_n - c_n + \eta_{n-1},$$
a version of (8). Applying Theorem 2.2 we see that (iii) holds, and applying Theorem 2.3 we obtain (i).

To see (iv), observe that if $\eta_n = \eta_{n-1}$, then from (6) and Theorem 2.4 it follows that $h_n$ and $h_{n-1}$ differ by a constant. By definition the function $h_n$ then satisfies the optimality equation (1), and the policy $f_n$ satisfies (2). $\square$

2.2 Convergence for processes with an accessible state

The uniformity in the results (P1) and (P2) requires a rather subtle analysis of the sequence of processes $\Phi^{f_n}$. It is now well known that for a $\psi$-irreducible chain it is possible to construct an atom $\alpha \in \mathcal{B}^+(X)$ on an extended state space, with the property that $P(x, A) = P(y, A)$ for all $x, y \in \alpha$, $A \in \mathcal{B}(X)$. In this paper we assume that in fact a singleton $\theta \in X$ exists which is accessible for each chain. This is generalized in [20], where (A3') is replaced by a far more general accessibility condition by examining the split chain on an extended state space. For convergence we also require uniformity conditions on the atom $\theta$:


(A3') If the policies $\{f_n\}$ are defined through the policy improvement algorithm, then (i) all compact sets are petite for the Markov chains $\{\Phi^{f_n} : n \ge 0\}$; and (ii) there is a state $\theta \in X$ such that for some $\delta > 0$,
$$K_n(x, \theta) := \sum_{k=0}^{\infty} \big(\tfrac{1}{2}\big)^{k+1} P_{f_n}^k(x, \theta) \ge \delta \quad \text{for all } x \in S,\ n \ge 0, \qquad (12)$$
where $S$ denotes the precompact set
$$S = \{x : \underline{c}(x) \le 2\eta_0\}. \qquad (13)$$

The condition (12) on accessibility of the state $\theta$ is not strong, since $S$ is a precompact subset of $X$. It resembles the "uniform Doeblin condition" described in, for instance, [1]; but while the Doeblin condition requires a uniform bound of the form $\sup_x \mathsf{E}_x[R^{\tau}] < \infty$ for some $R > 1$, the condition on $\theta$ is in no way related to stability. It will typically hold in network applications, where $\theta$ can be taken to be the empty state.

We can apply Theorem 2.5 to deduce that $\Phi^{f_n}$ is $c_n$-regular for any $n$. For a $c_n$-regular chain, a solution $h_n$ to the Poisson equation satisfying the condition $h_n(\theta) = 0$ is
$$h_n(x) = \mathsf{E}_x\Big[ \sum_{i=0}^{\tau_\theta - 1} \bar{c}_n(\Phi_i^{f_n}) \Big]. \qquad (14)$$
We henceforth assume that each $h_n$ is of this form. We may now demonstrate the desired properties of these relative value functions:

Theorem 2.6 Suppose that the initial policy $f_0$ is regular. Then, under (A1'), (A2), and (A3'), for each $n$ the algorithm admits a solution $(f_n, h_n, \eta_n)$ such that each policy $f_n$ is regular, and the sequence of relative value functions $\{h_n\}$ given in (14) satisfies properties (P1) and (P2).

Proof From Theorem 2.5 we see that the PIA generates regular policies. For any $n$, by Theorem 2.3 the function $h_{n-1}$ is necessarily bounded from below, and by (6) it satisfies the inequality
$$P_{f_n} h_{n-1} \le h_{n-1} - \tfrac{1}{2} c_n + \eta_{n-1} \mathbf{1}_S, \qquad (15)$$
where we have used the property that $\{x : c_n(x) \le 2\eta_{n-1}\} \subseteq S$, with the set $S$ defined in (13). It follows from Dynkin's formula and Fatou's lemma, as in the proof of Proposition 11.3.2 of [19], that
$$\mathsf{E}_x\Big[ \sum_{k=0}^{\tau_\theta - 1} c_n(\Phi_k^{f_n}) \Big] \le 2\Big( h_{n-1}(x) - h_{n-1}(\theta) + \eta_{n-1}\, \mathsf{E}_x\Big[ \sum_{k=0}^{\tau_\theta - 1} \mathbf{1}_S(\Phi_k^{f_n}) \Big] \Big).$$
Arguing as in Theorem 11.3.11 of [19], and recalling that $h_{n-1}(\theta) = 0$, we obtain
$$0 \le \mathsf{E}_x[\tau_\theta] \le \mathsf{E}_x\Big[ \sum_{k=0}^{\tau_\theta - 1} c_n(\Phi_k^{f_n}) \Big] \le 2\big[ h_{n-1}(x) + \eta_0/\delta \big].$$


In particular, this shows that (P1) holds with $N = \eta_0/\delta$. Using the drift inequality $P_{f_n} h_{n-1} \le h_{n-1} - c_n + \eta_{n-1}$ once more gives an upper bound on $h_n$:
$$h_n(x) = \mathsf{E}_x\Big[ \sum_{k=0}^{\tau_\theta - 1} \bar{c}_n(\Phi_k^{f_n}) \Big] \le h_{n-1}(x) - h_{n-1}(\theta) + (\eta_{n-1} - \eta_n)\,\mathsf{E}_x[\tau_\theta] \le [1 + 2(\eta_{n-1} - \eta_n)]\, h_{n-1}(x) + 2(\eta_0/\delta)(\eta_{n-1} - \eta_n).$$

Thus, for all $n$ and $x$ we have
$$-\eta_0/\delta \le h_n(x) \le (1 + \varepsilon_n)\, h_{n-1}(x) + (\eta_0/\delta)\,\varepsilon_n, \qquad (16)$$
where $\varepsilon_n = 2(\eta_{n-1} - \eta_n)$. To prove (P2), let
$$g_n(x) = \Big[ \prod_{k=n+1}^{\infty} (1 + \varepsilon_k) \Big] h_n(x) + (\eta_0/\delta) \sum_{k=n+1}^{\infty} \varepsilon_k, \qquad n \ge 0,\ x \in X.$$

From the previous bound (16),
$$g_n(x) \le \Big[ \prod_{k=n+1}^{\infty} (1 + \varepsilon_k) \Big] \big[ (1 + \varepsilon_n) h_{n-1}(x) + (\eta_0/\delta)\varepsilon_n \big] + (\eta_0/\delta) \sum_{k=n+1}^{\infty} \varepsilon_k \le \Big[ \prod_{k=n}^{\infty} (1 + \varepsilon_k) \Big] h_{n-1}(x) + (\eta_0/\delta) \sum_{k=n}^{\infty} \varepsilon_k = g_{n-1}(x).$$

We also have, from the lower bound on $h_n$ given in (16),
$$g_n \ge -\frac{\eta_0}{\delta} \prod_{k=0}^{\infty} (1 + \varepsilon_k) \ge -\frac{\eta_0}{\delta} \exp\big( 2(\eta_0 - \eta) \big), \qquad n \in \mathbb{Z}_+,$$
where $\eta = \lim_n \eta_n$. Hence $g_n(x) \downarrow g(x) > -\infty$ as $n \to \infty$ for each $x$, and this and the form of $g_n$ proves (P2). $\square$
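The conclusions above can be observed numerically. The sketch below (a hypothetical random finite MDP in Python with NumPy, not taken from the paper) runs the PIA, solving each Poisson equation with the normalization $h_n(0) = 0$, and records the costs $\eta_n$ and relative value functions $h_n$; the costs are nonincreasing as in Theorem 2.5(ii), and the normalized $h_n$ stay bounded below, in the spirit of (P1):

```python
import numpy as np

# Hypothetical random finite MDP (all kernels strictly positive, so every
# stationary policy yields an irreducible chain).
rng = np.random.default_rng(2)
nS, nA = 6, 3
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a][x] = row of P_a(x, .)
c = rng.uniform(1.0, 10.0, size=(nS, nA))       # costs c(x, a) >= 1

def evaluate(f):
    """Return (h_n, eta_n) solving the Poisson equation, with h_n(0) = 0."""
    Pf = P[f, np.arange(nS)]
    cf = c[np.arange(nS), f]
    M = np.column_stack([(np.eye(nS) - Pf)[:, 1:], np.ones(nS)])
    sol = np.linalg.solve(M, cf)
    return np.concatenate([[0.0], sol[:-1]]), sol[-1]

f = np.zeros(nS, dtype=int)                     # initial policy f_0
etas, hs = [], []
for _ in range(30):
    h, eta = evaluate(f)
    etas.append(eta)
    hs.append(h)
    f_new = np.array([int(np.argmin(c[x] + P[:, x] @ h)) for x in range(nS)])
    if np.array_equal(f_new, f):
        break
    f = f_new

# np.diff(etas) <= 0: the costs eta_n decrease; min over hs is finite,
# mirroring the uniform lower bound (P1).
```

Of course the finite setting trivializes the delicate part of Theorem 2.6, which is to obtain these bounds uniformly on an unbounded state space.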

2.3 Optimality

Now that we know that $h_n$ is pointwise convergent, we can show that the PIA yields an optimal policy. This requires some continuity with respect to the control; weaker conditions than (A4) are surely possible for a specific application.

(A4) The function $c \colon X \times A \to [1, \infty)$ is continuous, and the functions $(P_a h_n(x) : n \ge 0)$ and $P_a h(x)$ are continuous in $a$ for any fixed $x \in X$.

Theorem 2.7 Suppose that there exists a regular policy $f_0$, and that assumptions (A1'), (A2), (A3') and (A4) hold. Then:


(a) The optimality equations (1,2) admit a solution $(h, f, \eta)$.

(b) For the initial policy $f_0$, the PIA produces a sequence of solutions $(f_n, h_n, \eta_n)$ such that $\{h_n\}$ is pointwise convergent to a solution $h$ of the optimality equation (1), and any policy $f$ which is a pointwise limit of $\{f_n\}$ satisfies (2).

(c) The limiting policy is $c_f$-regular, so that for any initial condition $x \in X$,
$$J(f) = \eta_f = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathsf{E}_x[c_f(\Phi_k^f)].$$

Proof As in [13] we may assume that for each $x$ there is a subsequence $n_i(x)$ such that $f_{n_i(x)}(x) \to f(x)$ as $i \to \infty$, where $f$ is a measurable function. Since we have assumed that $A(x)$ is closed, it follows that $f$ defines a stationary policy for $\Phi$. The existence of $f$ requires compactness of $A(x)$, but since the cost function $c(x, \cdot)$ is norm-like on $A(x)$ for any $x$, and since $\{h_n\}$ is uniformly bounded from below, compactness may be assumed without loss of generality.

We first establish the upper bound
$$\eta + h(x) \le c_f(x) + P_f h\,(x), \qquad x \in X. \qquad (17)$$
Observe that from Poisson's equation, and writing $g_n = \gamma_n h_n + \beta_n$ as in (P2),
$$\eta + h(x) - c_f(x) = \lim_{i \to \infty} P_{f_{n_i}} h_{n_i}(x) = \lim_{i \to \infty} \gamma_{n_i}^{-1} \big[ P_{f_{n_i}} g_{n_i}(x) - \beta_{n_i} \big] \le \lim_{i \to \infty} \gamma_{n_i}^{-1} P_{f_{n_i}} g_{n_{k_0}}(x) = P_f g_{n_{k_0}}(x),$$
where $k_0$ is arbitrary, and the inequality is a consequence of the fact that $\{g_n\}$ is decreasing in $n$. By dominated convergence, letting $k_0 \to \infty$, it then follows that (17) does hold.

Conversely, we have
$$\eta_{n-1} + h_{n-1}(x) - c_{f_n}(x) \ge P_{f_n} h_{n-1}(x) = \gamma_{n-1}^{-1} \big[ P_{f_n} g_{n-1}(x) - \beta_{n-1} \big] \ge \gamma_{n-1}^{-1} \big[ P_{f_n} h\,(x) - \beta_{n-1} \big].$$
Letting $n \to \infty$ through the subsequence $\{n_k\}$ then gives, by (A4),
$$\eta + h(x) \ge c_f(x) + P_f h\,(x), \qquad x \in X.$$
Hence the limit $h$ satisfies $P_f h = h - c + \eta$, where $c(x) = c_f(x)$. To see that the optimality equation is satisfied, recall from (3) that

$$c_{f_n}(x) + P_{f_n} h_{n-1}(x) - c(x, a) \le P_a h_{n-1}(x)$$
for all admissible $(x, a) \in X \times A$. It follows that for any $a$,
$$c_{f_n}(x) + \gamma_{n-1}^{-1} P_{f_n} h\,(x) - c(x, a) \le \gamma_{n-1}^{-1} P_a g_{n-1}(x).$$
Letting $n \to \infty$, we see that (1) is satisfied. $\square$

The technical assumption (c) in Theorem 1.1 is frequently met in practice via the following result. Suppose that the policy $f$ gives rise to a $\psi_f$-irreducible Markov chain. If the resulting cost $\eta_f := \pi_f(c_f) = \int c_f(x)\, \pi_f(dx)$ is finite, let $S_f$ denote any fixed $c_f$-regular set for which $\pi_f(S_f) > 0$, and define the function

$$V_f(x) = \mathsf{E}_x\Big[ \sum_{k=0}^{\sigma_{S_f} - 1} c_f(\Phi_k^f) \Big]. \qquad (18)$$

Since $\pi_f(S_f) > 0$, the function $V_f$ is a.e. $[\pi_f]$ finite-valued [19]. Note that by [19, Theorem 14.2.3], the choice of the particular $c_f$-regular set $S_f$ is not important: if $S_f^1$ and $S_f^2$ give rise to functions $V_f^1$ and $V_f^2$ of the form (18), then for some constant $\gamma \ge 1$,
$$\gamma^{-1}(V_f^1(x) + 1) \le V_f^2(x) \le \gamma(V_f^1(x) + 1), \qquad x \in X.$$

Theorem 2.8 Suppose that

(a) The optimality equations (1,2) hold for $(f^*, h, \eta)$, with $h$ bounded from below;

(b) For any policy $f$ whose cost $J(f, x)$ is not identically infinite, the function $K_f c_f$ is finite and norm-like, and all compact sets are petite for the Markov chain $\Phi^f$;

(c) For any policy $f$ whose cost is not identically infinite, there exists some constant $b = b(f) < \infty$ such that
$$|h(x)| \le b(1 + V_f(x)), \qquad x \in X. \qquad (19)$$

Then $f^*$ is optimal in the sense that for any initial condition $x \in X$, and any policy $f$,
$$\eta = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathsf{E}_x[c_{f^*}(\Phi_k^{f^*})] \le \eta_f \le \liminf_{\beta \uparrow 1}\, (1 - \beta) \sum_{k=0}^{\infty} \beta^k\, \mathsf{E}_x[c_f(\Phi_k^f)]. \qquad (20)$$
Proof First note that if $J(f, x) = \infty$ for all $x$, then there is nothing to prove. If not, then since all compact sets are petite, the Markov chain $\Phi^f$ is a positive recurrent T-chain with unique invariant probability $\pi_f$ [19]. Under the assumptions of the theorem, we can show inductively that
$$P_f^n V_f = V_f - \sum_{k=0}^{n-1} P_f^k (c_f - s_f),$$
where $s_f \ge 0$ and satisfies $\pi_f(s_f) = \pi_f(c_f)$. This function can be written explicitly as
$$s_f(x) = \int_{S_f} P_f(x, dy)\, \mathsf{E}_y\Big[ \sum_{k=1}^{\sigma_{S_f}} c_f(\Phi_k) \Big].$$

It follows from the $f$-Norm Ergodic Theorem [19] that $P_f^n V_f(x)/n \to 0$ as $n \to \infty$ for a.e. $[\pi_f]$ $x \in X$. By (19), we also have $P_f^n h\,(x)/n \to 0$ for such $x$. From (1,2) we have, for any policy $f$,
$$P_f^n h\,(x)/n + \frac{1}{n} \sum_{k=0}^{n-1} \mathsf{E}_x[c_f(\Phi_k^f)] \ge h(x)/n + \eta, \qquad n \ge 1,\ x \in X.$$
Letting $n \to \infty$ proves the inequality
$$\liminf_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathsf{E}_x[c_f(\Phi_k^f)] \ge \eta, \qquad \text{a.e. } x \in X\ [\pi_f].$$
It follows from the $f$-Norm Ergodic Theorem that $\eta_f \ge \eta$ also [19]. To prove the theorem, we must now consider the limit and limit infimum in (20) for every $x$. That
$$\frac{1}{n} \sum_{k=1}^{n} \mathsf{E}_x[c_{f^*}(\Phi_k^{f^*})] \to \eta$$
as $n \to \infty$, for the policy $f^*$ of (1,2), follows directly from the $f$-Norm Ergodic Theorem and Theorem 2.2, since the function $h$ is assumed to be bounded from below.

The limit infimum is more subtle. From the assumptions of the theorem, the function $K_f c_f$ is unbounded off of petite sets [19]. It then follows, as in the proof of Theorem 17.1.7 of [19], that for every initial condition the law of large numbers holds:
$$\lim_{\beta \uparrow 1}\, (1 - \beta) \sum_{k=0}^{\infty} \beta^k\, K_f c_f(\Phi_k^f) = \eta_f \mathbf{1}_H + \infty \cdot \mathbf{1}_{H^c}, \qquad \text{a.s. } [\mathsf{P}_x],$$
where $H$ is the event that the $f$-controlled chain $\Phi^f$ enters its maximal Harris set. On the event $H^c$ we in fact have $K_f c_f(\Phi_k^f) \to \infty$ as $k \to \infty$, since then the process visits each petite set only finitely often. Taking expectations of both sides of this equation and applying Fatou's lemma then gives
$$\liminf_{\beta \uparrow 1}\, (1 - \beta) \sum_{k=0}^{\infty} \beta^k\, \mathsf{E}_x[K_f c_f(\Phi_k^f)] \ge \eta_f, \qquad x \in X.$$
It can also be shown using the resolvent equation [19, p. 291] that
$$\lim_{\beta \uparrow 1}\, (1 - \beta) \sum_{k=0}^{\infty} \beta^k \big( \mathsf{E}_x[K_f c_f(\Phi_k^f)] - \mathsf{E}_x[c_f(\Phi_k^f)] \big) = 0,$$
and this completes the proof of (20). $\square$


3 Applications to Networks

Here we describe one application of Theorem 2.7 together with Theorem 2.8. We consider a countable state space model used to describe multiclass queueing networks with deterministic routing, as may be found in semiconductor manufacturing plants.

We consider a network composed of $d$ single server stations, which we index by $i = 1, \dots, d$. The network is populated by $K$ classes of customers: class $k$ customers require service at station $s(k)$. An exogenous stream of customers of class 1 arrives at station $s(1)$. If the service times and interarrival times are assumed to be exponentially distributed, then after a suitable time scaling and sampling of the process, the dynamics of the network can be described by the random linear system
$$\Phi_{k+1} = \Phi_k + \sum_{i=0}^{K} I_{k+1}(i)\,[e_{i+1} - e_i]\, f_k(i), \qquad (21)$$
where the state process $\Phi$ evolves on $X = \mathbb{Z}_+^K$, and $\Phi_k(i)$ denotes the number of class $i$ customers in the system at time $k$. The random variables $\{I_k : k \ge 0\}$ are i.i.d. on $\{0,1\}^{K+1}$, with $P\{\sum_i I_k(i) = 1\} = 1$, and $E[I_k(i)] = \mu_i$. For $1 \le i \le K$, $\mu_i$ denotes the service rate for class $i$ customers, and $\mu_0 := \lambda$ is the arrival rate of customers of class 1. For $1 \le i \le K$ we let $e_i$ denote the $i$th basis vector in $\mathbb{R}^K$, and we set $e_0 = e_{K+1} := 0$.

The sequence $\{f_k : k \ge 0\}$ is the control, which takes values in $\{0,1\}^{K+1}$. We define $f_k(0) \equiv 1$. The set of admissible control actions $A(x)$ is defined in an obvious manner: for $a \in A(x)$, (i) $a_i = 0$ or 1; (ii) $\sum_{k : s(k) = i} a_k \le 1$; (iii) $x_k = 0 \Rightarrow a_k = 0$. We also assume (iv) policies are non-idling, so that $\sum_{k : s(k) = i} a_k = 1$ whenever $\sum_{k : s(k) = i} x_k > 0$.

Since the control is bounded, a reasonable choice of cost function is $c(x,a) = c^T x$, where $c \in \mathbb{R}^K$ is a vector with strictly positive entries. For concreteness, we take $c(x,a) = |x| := \sum_i x_i$. Since $A(x)$ is a finite set for any $x$, it follows that (A1') holds with this cost function. The transition function has the simple form
$$P_a(x,\, x + e_{i+1} - e_i) = \mu_i a_i, \quad 0 \le i \le K, \qquad P_a(x,x) = 1 - \sum_{i=0}^{K} \mu_i a_i.$$
It is obvious in this case that the accessibility condition (12) of (A3') holds with $x^* = 0$.

Associated with this network is a fluid model. For each initial condition $\Phi(0) = x \ne 0$, we construct a continuous time process $\phi^x(t)$ as follows. If $|x|t$ is an integer, we set
$$\phi^x(t) = \frac{1}{|x|}\,\Phi(|x|t).$$
For all other $t \ge 0$, we define $\phi^x(t)$ by linear interpolation, so that it is continuous and piecewise linear in $t$. Note that $|\phi^x(0)| = 1$, and that $\phi^x$ is Lipschitz continuous. The collection of all "fluid limits" is defined by

$$L := \bigcap_{n=1}^{\infty} \overline{\{\phi^x : |x| > n\}},$$
where the overbar denotes weak closure. The process $\phi$ evolves on the state space $\mathbb{R}_+^K$, and frequently satisfies a differential equation of the form
$$\frac{d}{dt}\phi(t) = \sum_{i=0}^{K} \mu_i\,[e_{i+1} - e_i]\,u_t(i), \qquad (22)$$
where the function $u_t$ is analogous to the discrete control, and satisfies similar constraints.

[Figure 1. A multiclass network: two machines serving four customer classes $X_1, \dots, X_4$, with arrival rate $\lambda$ and service rates $\mu_1, \dots, \mu_4$.]

It is now known that stability of (21) in terms of $c$-regularity is closely connected with the stability of the fluid model [5, 16, 6, 18]. The fluid model $L$ is called $L_p$-stable if
$$\lim_{t\to\infty}\ \sup_{\phi \in L}\ E[\,|\phi(t)|^p\,] = 0.$$
It is shown in [16] that $L_2$-stability of the fluid model is equivalent to a form of $c$-regularity for the network.
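As an illustration of the sampled dynamics (21), the following sketch simulates a re-entrant line in the spirit of Figure 1 under the last-buffer-first-served policy. The routing 1 → 2 → 3 → 4, the station assignment, and the numerical rates are assumptions chosen for the example, not values taken from the paper.

```python
# Simulation sketch of the sampled chain (21). The two-station, four-class
# topology with re-entrant routing 1 -> 2 -> 3 -> 4 is an assumption for
# illustration; the rates below are hypothetical and sum to 1 (uniformization).
import random

random.seed(1)
mu = [0.08, 0.30, 0.20, 0.25, 0.17]   # mu[0] = lambda; mu[0..4] sum to 1
station = {1: 1, 2: 2, 3: 2, 4: 1}    # s(k): the station serving class k

def lbfs(x):
    """Non-idling last-buffer-first-served allocation f_k (a[0] = 1 always)."""
    a = [1, 0, 0, 0, 0]
    for st in (1, 2):
        for k in (4, 3, 2, 1):        # highest-index nonempty buffer wins
            if station[k] == st and x[k] > 0:
                a[k] = 1
                break
    return a

x = [0, 6, 0, 0, 0]                   # x[1..4]: buffer levels (x[0] unused)
total = 0.0
steps = 20000
for _ in range(steps):
    a = lbfs(x)
    # draw the single index i with I_{k+1}(i) = 1 (probabilities mu[i])
    u, cum, fired = random.random(), 0.0, 4
    for i in range(5):
        cum += mu[i]
        if u < cum:
            fired = i
            break
    if a[fired] == 1:                 # the move e_{i+1} - e_i takes effect
        if fired == 0:
            x[1] += 1                 # exogenous arrival to buffer 1
        else:
            x[fired] -= 1
            if fired < 4:
                x[fired + 1] += 1     # a class i customer becomes class i+1
    total += sum(x[1:])
print("average |x| under LBFS:", total / steps)
```

With both station loads below one, the time-averaged population settles to a moderate value, consistent with the $c$-regularity discussion above.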

Theorem 3.1 (Kumar and Meyn [16]) The following stability criteria are equivalent for the network under any non-idling policy.

(i) The drift condition (8) holds for some function $V$ which is equivalent to a quadratic in the sense that, for some $\varepsilon > 0$,
$$\varepsilon(1 + |x|^2) \le V(x) \le 1 + \varepsilon^{-1}|x|^2, \qquad x \in X.$$

(ii) For some quadratic function $V$,
$$E_x\Big[\sum_{n=0}^{\tau} |\Phi_n|\Big] \le V(x), \qquad x \in X, \qquad (23)$$
where $\tau$ is the first entrance time to $x^* = 0$.

(iii) For some quadratic function $V$ and some $\bar{c} < \infty$,
$$\sum_{n=1}^{N} E_x[\,|\Phi_n|\,] \le V(x) + \bar{c}N, \qquad \text{for all } x \text{ and } N \ge 1.$$

(iv) The fluid model $L$ is $L_2$-stable. □

We now strengthen the previous result: if the fluid model is $L_2$-stable, then in fact $\phi(t) = 0$ for all $t$ sufficiently large.

Lemma 3.2 If the fluid model is $L_2$-stable, then for any $\phi \in L$, there exists a possibly random time $\sigma_\infty = \sigma_\infty(\phi) < \infty$ such that $\phi(t) = 0$ for all $t \ge \sigma_\infty$. Moreover, these times can be chosen so that
$$\sup_{\phi \in L} E[\sigma_\infty(\phi)] < \infty.$$

Proof It can be shown that for any policy $f$,
$$\frac{1}{|x|^2}\, E^f_x\Big[\sum_{k=0}^{T|x|-1} |\Phi_k|\Big] = E^f_x\Big[\int_0^T |\phi^x(s)|\, ds\Big] + o(1), \qquad (24)$$
where the term $o(1)$ vanishes as $|x| \to \infty$. Using (iii) of Theorem 3.1 it then follows that for any $\phi \in L$,
$$E\Big[\int_0^\infty |\phi(s)|\, ds\Big] \le k := \sup_{x \ne 0} \frac{V(x)}{|x|^2} < \infty.$$
Define the supermartingale $M(s)$ as
$$M(s) = E\Big[\int_s^\infty |\phi(\tau)|\, d\tau \,\Big|\, \mathcal{G}_s\Big],$$
where $\mathcal{G}_s = \sigma(\phi(r) : r \le s)$. It is shown in [16] using the previous bound that
$$M(t) \le k\,|\phi(t)|^2, \qquad t \ge 0,\ \text{a.s.}$$
Observe that from this bound and the definition of $M(t)$, if $\phi(t) = 0$ for any $t$, then $\phi(s) = 0$ for all $s > t$ with probability one. Hence the random variable $\sigma_\infty$ described in the lemma is naturally defined as
$$\sigma_\infty := \min(s : |\phi(s)| = 0) = \min(s : M(s) = 0),$$
set to $\infty$ if $\phi$ never vanishes.

Taking square roots, $V(t) := \sqrt{M(t)}$, we obtain a new supermartingale from which we will obtain the desired bounds. The previous bound on $M(t)$ gives
$$V(t) \le \sqrt{k}\,|\phi(t)|, \qquad t \ge 0,\ \text{a.s.} \qquad (25)$$
By concavity of the square root we have, whenever $V(s) \ne 0$,
$$E[V(s+t) \mid \mathcal{G}_s] \le V(s) - \frac{1}{2V(s)}\, E\Big[\int_s^{s+t} |\phi(\tau)|\, d\tau \,\Big|\, \mathcal{G}_s\Big].$$
To bound the negative term, note that for all $t$,
$$|\phi(s+t)| \ge |\phi(s)| - t. \qquad (26)$$
This follows from the fact that $|\dot\phi| \le 1$. Letting $0 \le \varepsilon \le \tfrac{1}{2}$, the bounds (25) and (26) together with the conditional expectation bound imply that for $t \le 2\varepsilon|\phi(s)|$,
$$E[V(s+t) \mid \mathcal{G}_s] \;\le\; V(s) - \frac{t|\phi(s)| - \tfrac{1}{2}t^2}{2\sqrt{k}\,|\phi(s)|} \;\le\; V(s) - \frac{1-\varepsilon}{2\sqrt{k}}\, t. \qquad (27)$$
Define inductively the random times $\{T_k : k \ge 0\}$ as follows: $T_0 = 0$, and
$$T_{k+1} = T_k + 2\varepsilon|\phi(T_k)|.$$
We will show now that $\sigma_\infty = T_\infty$, where $T_\infty := \lim_{k\to\infty} T_k$. Note first that it is clear that $\sigma_\infty \ge T_\infty$, since $|\phi(T_k)| > 0$ for any $T_k < T_\infty$. So, we only need establish the reverse inequality. If $T_\infty = \infty$, there is nothing to prove. If $T_\infty$ is finite then
$$|\phi(T_k)| = \frac{1}{2\varepsilon}\,(T_{k+1} - T_k) \to 0, \qquad k \to \infty.$$
By continuity of $\phi$, we then have that $|\phi(T_\infty)| = 0$, so that $T_\infty \ge \sigma_\infty$.

Having shown that $T_\infty = \sigma_\infty$, we now use (27) to bound its expectation. Since $T_{k+1}$ is $\mathcal{G}_{T_k}$-measurable for each $k$, it follows from (27) that
$$E[V(T_{k+1}) \mid \mathcal{G}_{T_k}] \le V(T_k) - \frac{1-\varepsilon}{2\sqrt{k}}\,(T_{k+1} - T_k).$$
Summing over $k = 0$ to $n$, and using the smoothing property of the conditional expectation, shows that for any $n$,
$$E[T_{n+1}] \le \frac{2\sqrt{k}\,V(0)}{1-\varepsilon} \le \frac{2k}{1-\varepsilon},$$
where in the second inequality we have used (25) with $t = 0$. By the monotone convergence theorem we must have
$$E[\sigma_\infty] \le \frac{2k}{1-\varepsilon}.$$
The constant $\varepsilon$ was arbitrary, so this yields the interesting bound $E[\sigma_\infty] \le 2k$. Since $\phi \in L$ is arbitrary, this proves the lemma. □

These results will be used to establish the following theorem. A policy $f^*$ will be called optimal for the fluid model if for any policy $f$,
$$\liminf_{T\to\infty}\ \liminf_{|x|\to\infty}\ \Big( E^f_x\Big[\int_0^T |\phi^x(s)|\, ds\Big] - E^{f^*}_x\Big[\int_0^T |\phi^x(s)|\, ds\Big] \Big) \ge 0.$$

If the paths of the fluid model are purely deterministic, this form of optimality amounts to minimality of the total cost
$$\int_0^\infty |\phi(s)|\, ds.$$
Currently, there is much interest in directly addressing methods for the synthesis of optimal policies for fluid network models [9, 27, 4].
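For a single fluid queue the total cost above can be computed in closed form, which illustrates why quadratic functions of the state arise naturally here. A small numerical check (the rates and initial level are arbitrary choices of the example):

```python
# Worked special case (not from the paper): a single fluid queue with
# arrival rate lam and service rate mu > lam drains linearly, and its total
# cost is quadratic in the initial level x0.
lam, mu, x0 = 0.4, 1.0, 3.0
drain = mu - lam                # net drain rate: phi(t) = max(0, x0 - drain*t)
T = x0 / drain                  # draining time
n = 100000                      # midpoint-rule integral of |phi(s)| over [0, T]
dt = T / n
total = sum((x0 - drain * (k + 0.5) * dt) * dt for k in range(n))
closed_form = x0 ** 2 / (2.0 * drain)
print(total, closed_form)       # both equal x0^2 / (2 (mu - lam)) = 7.5
```

The quadratic dependence on the initial level is exactly the shape that the relative value function acquires in the next theorem.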

Theorem 3.3 If the initial policy $f_0$ is chosen so that the fluid model is $L_2$-stable, then

(i) The PIA produces a sequence $\{(f_n, h_n, \eta_n) : n \ge 0\}$ such that each associated fluid model is $L_2$-stable. Any policy $f$ which is a pointwise accumulation point of $\{f_n\}$ is an optimal average cost policy.

(ii) For each $n \ge 1$,
$$\liminf_{T\to\infty}\ \liminf_{|x|\to\infty}\ \Big( E^{f_{n-1}}_x\Big[\int_0^T |\phi^x(s)|\, ds\Big] - E^{f_n}_x\Big[\int_0^T |\phi^x(s)|\, ds\Big] \Big) \ge 0.$$
Hence if $f_0$ is optimal for the fluid model, so is $f_n$ for all $n$.

(iii) Any policy $f$ which is a pointwise accumulation point of $\{f_n\}$ is optimal for the fluid model. Moreover, with $h$ equal to the relative value function for the policy $f$,
$$\liminf_{T\to\infty}\ \liminf_{|x|\to\infty}\ \Big( \frac{h(x)}{|x|^2} - E^f_x\Big[\int_0^T |\phi^x(s)|\, ds\Big] \Big) = 0.$$
Hence, when properly normalized, the relative value function approximates the value function for the fluid model control problem.

Proof Observe from Theorem 3.1 and Theorem 2.3 that whenever the fluid model is $L_2$-stable, the relative value function $h$ is equivalent to a quadratic, in the sense of (23). Moreover, for any policy $f$, we have the lower bound
$$V_f(x) \ge \tfrac{1}{2}|x|^2.$$
This is a consequence of the skip free property of the network model. From this, Theorem 2.7, and Theorem 2.8 we obtain (i).

To see (ii), we will use the approximation (24). Consider the Poisson equation and the bound (6) together, which when iterated give, for any $T > 0$,
$$P_{f_n}^{T|x|} h_{n-1}(x) \le h_{n-1}(x) - \sum_{k=0}^{T|x|-1} E^{f_n}_x[\,|\Phi_k|\,] + T|x|\,\eta_{n-1},$$
$$P_{f_n}^{T|x|} h_n(x) = h_n(x) - \sum_{k=0}^{T|x|-1} E^{f_n}_x[\,|\Phi_k|\,] + T|x|\,\eta_n.$$


We will combine these equations and take limits, but to do so we must eliminate the term $P_{f_n}^{T|x|} h_n(x)/|x|^2$. To show that this converges to 0 as $|x| \to \infty$, $T \to \infty$, observe that we have the upper bound $|h_n(x)| \le b(|x|^2 + 1)$, where $b < \infty$. Hence by $L_2$-stability of the fluid model, $P_{f_n}^{T|x|} h_n(x)/|x|^2 \to 0$ as $|x| \to \infty$, and then $T \to \infty$. Combining the previous equations and using the approximation (24) then proves (ii).

To prove (iii), let $h^*$ denote the value function for the optimal policy $f^*$. Then we have for any policy $f$,
$$P_f h^* \ge h^* - c_f + \eta^*,$$
where $\eta^*$ is the optimal steady state cost. The proof then follows as above by iteration, and letting $|x| \to \infty$. □

The result leaves open the choice for the initial policy. Many scheduling policies are known to be stabilizing for general networks of this form. For instance, the last buffer first served (LBFS) policy can be used to initiate the policy improvement algorithm.

It is clear from Theorem 3.3 that when the customer population is large, it is desirable to approximate the network scheduling policy with its fluid analog. That is, the fluid control policy should be used when the network is in a transient regime, e.g., after a recent long breakdown. Since the optimal policy for the fluid model brings the fluid model to its stationary regime while minimizing inventory, it is reasonable that this policy should efficiently drain overloaded queues for the real network.

The exponential assumption here is not crucial. In [6] the case of general distributions is developed, and the analogous regularity results are obtained when a fluid model is $L_2$-stable.
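The policy improvement loop discussed above can be imitated numerically on a small truncated example. The sketch below runs Howard's algorithm for a hypothetical two-class, single-station instance of (21): each policy is evaluated by solving the Poisson equation $P_f h = h - c + \eta$ on a truncated grid, and the next policy minimizes $c + P_a h$ pointwise. The rates, truncation level, and the priority-rule conclusion are illustrative assumptions, not results from the paper.

```python
# Hypothetical illustration: Howard's policy improvement algorithm for a
# truncated two-class, single-server queue with cost c(x) = x1 + x2.
import numpy as np

N = 8                       # truncation level for each queue
lam1, lam2 = 0.12, 0.10     # arrival rates (uniformized: all rates sum < 1)
mu1, mu2 = 0.45, 0.25       # service rates; the c-mu rule favors class 1 here

states = [(i, j) for i in range(N + 1) for j in range(N + 1)]
idx = {s: k for k, s in enumerate(states)}
M = len(states)

def transition_row(x, a):
    """Transition probabilities from x when the server works on class a."""
    i, j = x
    row = np.zeros(M)
    row[idx[(min(i + 1, N), j)]] += lam1      # arrivals (blocked at boundary)
    row[idx[(i, min(j + 1, N))]] += lam2
    if a == 1 and i > 0:
        row[idx[(i - 1, j)]] += mu1           # service completion, class 1
    elif a == 2 and j > 0:
        row[idx[(i, j - 1)]] += mu2           # service completion, class 2
    row[idx[(i, j)]] += 1.0 - row.sum()       # uniformization self-loop
    return row

def evaluate(policy):
    """Solve the Poisson equation P_f h = h - c + eta, with h(0,0) = 0."""
    A = np.zeros((M + 1, M + 1))
    b = np.zeros(M + 1)
    for k, x in enumerate(states):
        A[k, :M] = -transition_row(x, policy[k])
        A[k, k] += 1.0
        A[k, M] = 1.0                         # coefficient of the unknown eta
        b[k] = x[0] + x[1]                    # c(x) = |x|
    A[M, 0] = 1.0                             # normalization h(0,0) = 0
    sol = np.linalg.solve(A, b)
    return sol[:M], sol[M]

def improve(h):
    """Choose, state by state, a non-idling action minimizing c(x) + P_a h."""
    out = []
    for x in states:
        acts = [a for a in (1, 2) if x[a - 1] > 0] or [1]
        out.append(min(acts, key=lambda a: transition_row(x, a) @ h))
    return out

policy = [1 if x[0] > 0 else 2 for x in states]   # initial stabilizing policy
costs = []
for _ in range(20):
    h, eta = evaluate(policy)
    costs.append(eta)
    new_policy = improve(h)
    if new_policy == policy:
        break
    policy = new_policy
print("average costs along the PIA:", [round(c, 4) for c in costs])
```

With these rates the average costs are non-increasing along the iterations, and the limiting policy gives priority to the faster class, in line with the classical c-mu rule for linear holding costs.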

4 Conclusions

This paper has introduced several techniques for the analysis of Markov decision processes. We expect that this is just the start of a more thorough development of the optimal control of MDPs. Given the surprising structure exhibited by the relative value functions, it appears that the duality theory using linear programming formulations may be strengthened (see [12]). Also, given the uniform lower bounds obtained in Theorem 2.6 and the lower bounds required in the analysis of, for instance, [22], it seems likely that the relationship between discounted and average cost control problems may be better developed using the techniques presented here. Currently, we are also investigating the network problem of Section 3 to see if some additional insight can be gained in this interesting application.

Acknowledgements The research for this paper was begun while the author was visiting Onesimo Hernandez-Lerma at the Centro de Investigacion del IPN, Mexico City. He wishes to thank him for sharing his unpublished work. He is also grateful for his hospitality, and his explanations of the state of the art in Markov decision theory.

References

[1] A. Arapostathis, V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim., 31:282-344, 1993.
[2] R. Cavazos-Cadena. Weak conditions for the existence of optimal stationary policies in average Markov decision chains with unbounded costs. Kybernetika, 1989.
[3] R. Cavazos-Cadena and E. Fernandez-Gaucherand. Value iteration in a class of controlled Markov chains with average criterion: unbounded costs case. In Proceedings of the 34th Conference on Decision and Control, page TP05 3:40, 1995.
[4] D. Eng, J. Humphrey, and S. Meyn. Fluid network models: linear programs for control and performance bounds. In Proceedings of the 11th IFAC World Congress, San Francisco, CA, 1996.
[5] J. G. Dai. On positive Harris recurrence for multiclass queueing networks: a unified approach via fluid limit models. Ann. Appl. Probab., 5:49-77, 1995.
[6] J. G. Dai and S. P. Meyn. Stability and convergence of moments for multiclass queueing networks via fluid limit models. IEEE Trans. Automat. Control, 40:1889-1904, November 1995.
[7] R. Dekker. Denumerable Markov decision chains: optimal policies for small interest rates. PhD thesis, University of Leiden, 1985.
[8] R. Dekker. Counterexamples for compact action Markov decision chains with average reward criteria. Comm. Statist.-Stoch. Models, 3:357-368, 1987.
[9] F. Avram, D. Bertsimas, and M. Ricard. Fluid models of sequencing problems in open queueing networks: an optimal control approach. Technical report, Massachusetts Institute of Technology, 1995.
[10] P. W. Glynn and S. P. Meyn. A Lyapunov bound for solutions of Poisson's equation. Ann. Probab., 1993 (to appear).
[11] O. Hernandez-Lerma and J. B. Lasserre. Average cost optimal policies for Markov control processes with Borel state space and unbounded costs. Systems & Control Letters, pages 349-356, 1990.
[12] O. Hernandez-Lerma and J. B. Lasserre. Discrete-Time Markov Control Processes I. 1995. To appear.
[13] O. Hernandez-Lerma and J. B. Lasserre. Policy iteration for average cost Markov control processes on Borel spaces. Technical report, IPN, Departamento de Matematicas, Mexico, and LAAS-CNRS, France, 1995.
[14] A. Hordijk. Dynamic Programming and Markov Potential Theory. 1977.
[15] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[16] P. R. Kumar and S. P. Meyn. Duality and linear programs for stability and performance analysis of queueing networks and scheduling policies. IEEE Trans. Automat. Control, 41(1):4-17, January 1996.
[17] L. I. Sennott. Average cost optimal stationary policies in infinite state Markov decision processes with unbounded cost. Operations Res., 37:626-633, 1989.


[18] S. P. Meyn. Transience of multiclass queueing networks via fluid limit models. Ann. Appl. Probab., 1995. To appear.
[19] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[20] S. P. Meyn. The policy improvement algorithm for Markov decision processes with general state space. Submitted for publication, 1995.
[21] E. Nummelin. On the Poisson equation in the potential theory of a single kernel. Math. Scand., 68:59-82, 1991.
[22] O. Hernandez-Lerma and J. B. Lasserre. Weak conditions for average optimality in Markov decision control processes. Systems Control Lett., 22:287-291, 1994.
[23] O. Hernandez-Lerma, R. Montes-de-Oca, and R. Cavazos-Cadena. Recurrence conditions for Markov decision processes with Borel state space: a survey. Ann. Operations Res., 28:29-46, 1991.
[24] R. K. Ritt and L. I. Sennott. Optimal stationary policies in general state Markov decision chains with finite action set. Operations Res., ??:??, 1993.
[25] S. M. Ross. Applied Probability Models with Optimization Applications. Dover Books on Advanced Mathematics, 1992. Republication of the work first published by Holden-Day, 1970.
[26] A. Shwartz and A. Makowski. On the Poisson equation for Markov chains: existence of solutions and parameter dependence. Technical report, Technion-Israel Institute of Technology, Haifa 32000, Israel, 1991.
[27] G. Weiss. On the optimal draining of re-entrant fluid lines. Technical report, Georgia Institute of Technology and Technion, 1994.