Algorithms for Optimization and Stabilization of Controlled Markov Chains
Sean Meyn

Abstract. This article reviews some recent results by the author on the optimal control of Markov chains. Two common algorithms for the construction of optimal policies are considered: value iteration and policy iteration. In either case, it is found that the following hold when the algorithm is properly initialized: (i): A stochastic Lyapunov function exists for each intermediate policy, and hence each policy is regular (a strong stability condition). (ii): Intermediate costs converge to the optimal cost. (iii): Any limiting policy is average cost optimal. The network scheduling problem is considered in some detail, both as an illustration of the theory and because of the strong conclusions which can be reached for this important example as an application of the general theory. This is a preliminary draft of an article which will appear in one of the Indian Academy of Sciences' journals. It describes a recent circle of ideas concerning optimization and stabilization of controlled Markov chains. Please pass along any comments! My address is included on the last page of this document.
1991 Mathematics Subject Classification. Primary 90C40, 93E20, 60J20. Secondary 90B35. Key words and phrases. Multiclass queueing networks, Markov decision processes, optimal control, dynamic programming. Work supported in part by NSF Grant ECS 940372; JSEP grant N00014-90-J-1270; and a Fulbright Research Fellowship. This work was completed with the assistance of equipment granted through the IBM Shared University Research program and managed by the Computing and Communications Services Office at the University of Illinois at Urbana-Champaign.
1. Introduction
1.1. Feedback. The subject of this article is the synthesis of feedback laws for controlled, noisy, nonlinear systems. For any sort of controlled system, such as an automobile steered by a human driver, or the cruise control system on the same vehicle, some sort of feedback will certainly play a prominent role. In this automotive example, the most significant instance of feedback involves the driver's eyes, which are hopefully following the road. The car's cruise control uses speed measurements to attempt to regulate the actual speed of the vehicle. Even in a modern oven there is a thermosensor which sends measurements to a device whose job is to regulate the temperature to ensure a well baked cake. In a control system as simple as the cruise control one must still be careful in the way that measurements are used in the adjustment of the gas flow to the engine to maintain a constant speed. In aerospace applications such control design requires even greater thought. The design of effective "feedback laws" to utilize available measurements is the subject of automatic control, an active field in the mathematics and engineering communities for over fifty years. The control systems considered in this paper may be modelled by the nonlinear state space model
$$\Phi(t+1) = F(\Phi(t), a(t), I(t+1)), \qquad t \ge 0, \eqno(1.1)$$
where time is discrete. The sequence $\Phi := \{\Phi(t) : t \ge 0\}$ denotes the state process for the system to be controlled, and $a := \{a(t) : t \ge 0\}$ is called the control sequence, or action sequence, or the policy. We assume that controllers are not clairvoyant, so that, for any $t$, the action $a(t)$ depends only upon the variables $\Phi(s)$ for $s \le t$. The state process takes values in a state space $X$, and the control sequence takes values in an action space, denoted $A$. In the example of an automobile and driver, the variable $\Phi(t)$ will denote the position, velocity, etc. of the car to be controlled. The actions $a(t)$ which the driver implements will of course depend upon these variables. The function $F$ describes the dynamics of the system, and the sequence $\{I(t) : t \ge 0\}$ represents noise which is beyond the control of the driver, such as gusts of wind or oil on the road. If this noise is modelled as an i.i.d. sequence (the noise variables $\{I(t) : t \ge 0\}$ are independent, with identical distributions), then the model (1.1) is known as a controlled Markov chain, or Markov decision process (MDP). The motivation for considering such systems is largely twofold. First of all, almost any man-made or natural system can be at least approximately modelled by an MDP. Moreover, once one has decided on such a model, there is the hope that simple and effective feedback laws may be computed. We are particularly interested in finding a useful action sequence of the stationary feedback form $a(t) = w(\Phi(t))$: the feedback law $w$ is a static function from the set of states $X$ to the set of actions $A$. We are interested in control problems of the `process control' type (such as the cruise control, or the oven). A more relevant example may be found in the area of manufacturing, where today one finds assembly production lines of high complexity. There may be many process control problems within plants in a facility.
However, because it is such an interesting example, and because the specialization of the results described here is so striking in this context, in this paper we will consider in some detail the `meta' control problem of scheduling parts in the system to
minimize inventory, and minimize production delay. The control goal is one of regulation, and such a control problem may be conveniently modelled as an MDP. To show that this scheduling problem fits into our framework, consider a network of the form illustrated in Figure 1, composed of $d$ single server machines, indexed by $\sigma = 1,\dots,d$. The network is populated by $K$ classes of "customers" (which may in fact be partially assembled parts). After completion of service at a machine, the customer changes class and enters a machine, or leaves the network. It is assumed that there is an exogenous stream of customers of class 1 which arrive to machine $s(1)$. A customer of class $k$ waits in buffer $k$ until it receives service at machine $s(k)$. If the service times and interarrival times are assumed to be exponentially distributed, then after a suitable time scaling and sampling of the process, the dynamics of the network can be described by the random linear system
$$\Phi(t+1) = \Phi(t) + \sum_{k=0}^{K} I_k(t+1)\,[e^{k+1} - e^k]\, w_k(\Phi(t)), \eqno(1.2)$$
where the state process $\Phi$ evolves on the state space $X := \mathbb{Z}_+^K$. The vector $e^k \in \mathbb{R}^K$ is the $k$th basis vector, $e^k = (0,\dots,0,1,0,\dots,0)'$. The value of $\Phi_i(t)$ is the number of class $i$ customers in the system at time $t$ which await service in buffer $i$ at machine $s(i)$. The function $w : X \to A := \{0,1\}^{K+1}$ is the feedback law to be designed. If $w_i(\Phi(t)) I_i(t+1) = 1$, this means that at the discrete time $t$ a customer of class $i$ is just completing a service, and is moving on to buffer $i+1$ or, if $i = K$, the customer then leaves the system. The set of admissible control actions that $a(t)$ may take on at time $t$ depends on the value $x$ of the system at time $t$. This set, denoted $A(x)$, is defined as follows, where we write $a$ as a column vector, $a = (a_0,\dots,a_K)' \in A(x) \subseteq \{0,1\}^{K+1}$:
(i): $a_0 = 1$, and for any $1 \le i \le K$, $a_i = 0$ or $1$;
(ii): For any $1 \le i \le K$, $x_i = 0 \Rightarrow a_i = 0$;
(iii): For any machine $\sigma$, $\sum_{i : s(i) = \sigma} a_i \le 1$;
(iv): For any machine $\sigma$, $\sum_{i : s(i) = \sigma} a_i = 1$ whenever $\sum_{i : s(i) = \sigma} x_i > 0$.
Condition (iv) is the non-idling property that a server will always work if there is work to be done. To regulate this system it is convenient to impose a cost function. Here a natural "cost" is $c(\Phi(t)) = |\Phi(t)| = \sum_i \Phi_i(t)$, which is the total customer population at time $t$. We use $|\,\cdot\,|$ here to denote the $\ell_1$ norm. If this cost criterion can be minimized in some sense, then the system will be well regulated as desired. The non-idling condition (iv) may then be assumed without any loss of generality since any optimal policy will be non-idling with this cost function.
Figure 1. A multiclass network with d = 2 and K = 3.
Suppose that the random variables $\{I(t) : t \ge 0\}$ are i.i.d. on $\{0,1\}^{K+1}$. We assume that $\mathsf{P}\{\sum_i I_i(t) = 1\} = 1$, and we define $\mathsf{E}[I_i(t)] = \mu_i$. For $1 \le i \le K$, $\mu_i$ is interpreted as the service rate for class $i$ customers, and for $i = 0$ we let $\mu_0 := \lambda$, which is interpreted as the arrival rate of customers of class 1. It is clear then that the model (1.2) is an MDP of the form (1.1). The general results of this paper will be applied to this specific example in Section 6.
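To make the sampled dynamics (1.2) concrete, the following sketch simulates a three-buffer, two-machine network of the kind shown in Figure 1 under a non-idling buffer-priority rule. The numerical rates, the routing $s(1) = s(3) = 1$, $s(2) = 2$, and the last-buffer-first-served rule are all invented for illustration; only the transition mechanism itself comes from (1.2).

```python
import random

# Illustrative rates: lambda = 0.1 and mu_1 = mu_2 = mu_3 = 0.3, so the
# rates sum to 1 and each machine's nominal load is below one.
LAM, MU = 0.1, [0.3, 0.3, 0.3]
RATES = [LAM] + MU                    # RATES[0] = mu_0 = lambda
STATION = {1: 1, 2: 2, 3: 1}          # s(k): the machine serving buffer k

def policy(x):
    """A non-idling last-buffer-first-served law w(x) in {0,1}^{K+1}:
    each machine serves its highest-indexed non-empty buffer."""
    w = [1, 0, 0, 0]                  # w_0 = 1: arrivals are always active
    for m in (1, 2):
        for k in (3, 2, 1):
            if STATION[k] == m and x[k - 1] > 0:
                w[k] = 1              # conditions (iii)-(iv): one class per machine
                break
    return w

def step(x, rng):
    """One transition of (1.2): exactly one coordinate of I(t+1) equals 1."""
    w = policy(x)
    k = rng.choices(range(4), weights=RATES)[0]
    x = list(x)
    if w[k]:                          # the event has an effect only if active
        if k == 0:
            x[0] += 1                 # arrival of a class 1 customer
        else:
            x[k - 1] -= 1             # service completion in buffer k
            if k < 3:
                x[k] += 1             # customer changes class, enters buffer k+1
    return tuple(x)

rng = random.Random(1)
x, total = (0, 0, 0), 0
for t in range(200000):
    total += sum(x)                   # accumulate c(Phi(t)) = |Phi(t)|
    x = step(x, rng)
avg = total / 200000
print("time-average of |Phi(t)|:", avg)
```

Because the load at each machine is well below one, the time-average of the population stays bounded here; the point of the later sections is that this is a property of the policy, not of the network alone.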
1.2. The optimality equations. The MDP model can be equivalently defined through its transition probabilities. Consider the case where the state space $X$ and the action space $A$ are both countably infinite. The controlled transition probability $P_a$ is then defined via
$$P_a(x,y) = \mathsf{P}\{\Phi(t+1) = y \mid \Phi(t) = x,\ a(t) = a\} = \mathsf{P}\{F(x,a,I) = y\}, \qquad x,y \in X,\ a \in A,\ t \ge 0,$$
where $I$ is a generic noise variable. We consider the constrained control problem where for each $x$ there is a set $A(x) \subseteq A$ such that $a(t) \in A(x)$ whenever $\Phi(t) = x$. Our attention is primarily restricted to Markov policies, where the action $a(t)$ only depends upon $\Phi(t)$. This means that there exists a sequence of feedback laws $\{w_t : t \ge 0\}$ with $a(t) = w_t(\Phi(t))$ for each $t$; a generalization of the stationary case where $w_t = w_0$ for all $t$. We evaluate a policy based upon the average cost criterion:
$$J(x,a) = \limsup_{n\to\infty} \frac{1}{n}\, \mathsf{E}_x\Big[\sum_{t=0}^{n-1} c(\Phi(t),a(t))\Big], \eqno(1.3)$$
where the initial condition $\Phi(0) = x$ is given. The quantity $c(\Phi(t),a(t))$ represents the cost paid at time $t$; the one step cost $c$ is a function from $X \times A$ to $[1,\infty)$. An optimal policy is one that minimizes (1.3) for any initial condition. It is hoped that an optimal policy does exist, and that it can be taken in the simpler stationary form. The motivation for considering this average cost control problem is that there is good reason to believe that the optimal solution will take on the relatively simple stationary form. Moreover, it is a natural criterion in settings such as process control, where controlling transients is not so important as achieving good performance in the steady state. The construction of an optimal policy typically involves the solution to the following pair of equations:
$$\eta^* + h^*(x) = \min_{a \in A(x)} [c(x,a) + P_a h^*(x)], \eqno(1.4)$$
$$w^*(x) = \arg\min_{a \in A(x)} [c(x,a) + P_a h^*(x)], \qquad x \in X. \eqno(1.5)$$
In these equations the transition function $P_a$ is viewed as an infinite dimensional matrix, and $h^*$ as a compatible column vector. Hence $P_a h^*$ may be viewed as a column vector whose $x$th entry is
$$P_a h^*(x) := \sum_{y \in X} P_a(x,y)\, h^*(y).$$
The equality (1.4) is known as the average cost optimality equation (ACOE). The function $h^*$ is known as a relative value function. The second equation (1.5) defines
a stationary policy $w^*$. If a solution can be found, then typically the stationary policy $w^*$ is optimal (see e.g. [19, 1, 3, 16]). The questions addressed in this paper all revolve around these two equations. Is there a solution? If so, how can it be computed? If an algorithm is found, is its convergence guaranteed? How fast? Are the intermediate feedback laws generated by the algorithm useful? The last question is important since we cannot wait forever for whatever algorithm we implement to converge. There are two popular algorithms to attempt to solve (1.4). The first is the standard dynamic programming approach known as value iteration (VIA). The second is called policy iteration (PIA). The idea is this: The equation (1.4) is a fixed point equation of the form $L(h) = h$, where $h$ is a function on $X$, and $g = L(h)$ is the function defined by
$$g(x) = \min_{a \in A(x)} \{c(x,a) - \eta^* + P_a h(x)\}.$$
The VIA is then (essentially) the process of successive substitution: Given one approximate solution $h^k$, we compute another, perhaps better, approximation via $h^{k+1} = L(h^k)$. Although it is frequently convergent, at least locally, the successive approximation method is known to be slow in general, since the "slope" of the function $h \to L(h) - h$ may be small in a neighborhood of $h^*$. The standard approach to speeding the convergence of the successive approximation approach is to use derivative information to adjust this undesirable slope; this leads to the Newton-Raphson method, to which the PIA is closely related [22, 5]. The remainder of this paper is organized as follows. The VIA and PIA are defined formally and their properties are explored in Sections 4 and 5. Sections 2 and 3 below provide some basic ergodic theory and optimization theory which form a foundation for the remainder of the paper. In the final section of the paper we present applications to the network scheduling problem.
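The fixed point structure can be explored numerically. The sketch below runs relative value iteration, a standard variant of the successive substitution just described which normalizes at a fixed reference state so that the unknown constant $\eta^*$ is estimated along the way, on a two-state MDP whose transition matrices and costs are invented for illustration. At convergence the pair $(\eta, h)$ satisfies the ACOE (1.4).

```python
# States 0 and 1, actions 0 and 1; P[a][x][y] and c[x][a] are invented numbers.
P = [
    [[0.8, 0.2], [0.6, 0.4]],      # transition matrix under action a = 0
    [[0.99, 0.01], [0.9, 0.1]],    # transition matrix under action a = 1
]
c = [[1.0, 2.0], [5.0, 6.0]]       # c[x][a]: one step cost, c >= 1
theta = 0                          # reference state playing the role of theta

h = [0.0, 0.0]                     # h is kept normalized with h(theta) = 0
for _ in range(500):
    Lh = [min(c[x][a] + sum(P[a][x][y] * h[y] for y in (0, 1)) for a in (0, 1))
          for x in (0, 1)]
    eta = Lh[theta]                # since h(theta) = 0, this estimates eta*
    h = [Lh[x] - eta for x in (0, 1)]
print("eta* ~", round(eta, 6), "  h* ~", [round(v, 6) for v in h])

# Verify the ACOE (1.4): eta + h(x) = min_a [ c(x,a) + P_a h (x) ].
for x in (0, 1):
    rhs = min(c[x][a] + sum(P[a][x][y] * h[y] for y in (0, 1)) for a in (0, 1))
    assert abs(eta + h[x] - rhs) < 1e-9
```

On this tiny example the iteration converges geometrically; the later sections show that for unbounded state spaces both convergence and the stability of the intermediate laws require care.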
2. A steady state cost?
Since the goal is to find a policy which minimizes the steady state cost given in (1.3), it is reasonable to first try to understand when the cost $J(x,a)$ can be expected to be finite, and how we can characterize the limit. We consider in this section the stationary case, where $a$ is a Markov policy defined by the feedback law $w$, so that $a(t) = w(\Phi(t))$, $t \in \mathbb{Z}_+$. We denote by $P$ the transition kernel $P(x,y) = P_w(x,y)$, and similarly drop the dependence of $c(x,a)$ upon the control, since the function $w$ is fixed throughout. Hence in this section we let $c$ denote a function on $X$ with $c \ge 1$. Although we are taking the very special case of countable state space chains, whenever possible we state definitions and results so that extensions to the general state space framework will at least seem plausible. The reader is referred to [15] for an analogous treatment of Markov chains on a general state space. We assume that the Markov chain $\Phi$ with transition law $P$ is $\psi$-irreducible, as defined in [15]. For a countable state space model this means that there is a single communicating class which is reachable, with positive probability, from any initial condition. Equivalently, there exists a state $\theta \in X$ which is accessible in the sense
that
$$\sum_{t=0}^{\infty} P^t(x,\theta) = \sum_{t=0}^{\infty} \mathsf{P}\{\Phi(t) = \theta \mid \Phi(0) = x\} > 0, \qquad x \in X.$$
Throughout the paper we use $\theta$ to denote some fixed state which is accessible in this sense. We also assume throughout that $\Phi$ is aperiodic, which is equivalent to the requirement that the accessible state $\theta$ satisfies $P^n(\theta,\theta) > 0$ for all $n$ sufficiently large.
2.1. Regularity, Lyapunov functions, and ergodicity. For this uncontrolled Markov chain the existence of a limit in (1.3) is guaranteed if the chain is $c$-regular, a form of stability. For the countable state space chains considered here the definition is simply stated: $\Phi$ is $c$-regular if for some $\theta \in X$ and every $x \in X$,
$$\mathsf{E}_x\Big[\sum_{t=0}^{\tau_\theta - 1} c(\Phi(t))\Big] < \infty,$$
where the first return time to the point $\theta$ is defined as $\tau_\theta = \min(t \ge 1 : \Phi(t) = \theta)$, set to $\infty$ if the minimum is over an empty set. A $c$-regular chain always possesses a unique invariant probability $\pi$, such that
$$\pi P(x) := \sum_{y \in X} \pi(y) P(y,x) = \pi(x), \qquad x \in X; \qquad \pi(c) := \sum_{x \in X} c(x)\pi(x) < \infty.$$
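For a finite state space these objects can be computed exactly. The fragment below, using a truncated birth-death chain as a stand-in example (invented, not from the paper), solves $\pi P = \pi$ for the invariant probability and evaluates the steady-state cost $\pi(c)$ with $c(x) = 1 + x$.

```python
import numpy as np

# A birth-death chain on {0,...,N}: up-rate p, down-rate q, lazy self-loops.
# With p < q the chain drifts toward 0, a caricature of a stable queue.
N, p, q = 20, 0.3, 0.5
P = np.zeros((N + 1, N + 1))
for x in range(N + 1):
    if x < N: P[x, x + 1] = p
    if x > 0: P[x, x - 1] = q
    P[x, x] = 1.0 - P[x].sum()     # self-loop keeps each row stochastic

# Invariant probability: solve pi (P - I) = 0 together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(N + 1), np.ones(N + 1)])
b = np.zeros(N + 2); b[-1] = 1.0
pi = np.linalg.lstsq(A, b, rcond=None)[0]

c = 1.0 + np.arange(N + 1)         # one step cost c(x) = 1 + x, so c >= 1
print("pi(c) =", pi @ c)
assert abs(pi @ P - pi).max() < 1e-8   # pi is indeed invariant
```

For this reversible chain $\pi(x) \propto (p/q)^x$, so $\pi(c) \approx 1 + (p/q)/(1 - p/q) = 2.5$ up to a tiny truncation error.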
For $c$-regular chains we can solve equations of the form $Ph = h - c + \eta$ for the unknown function $h$, with $\eta = \pi(c)$. This is obviously closely connected to the solution of the average cost optimality equation. A stochastic Lyapunov function in the context here is a function $V : X \to \mathbb{R}_+$ satisfying the drift inequality
$$\mathsf{E}[V(\Phi(t+1)) \mid \mathcal{F}_t] \le V(\Phi(t)) - c(\Phi(t)) + s(\Phi(t)), \eqno(2.1)$$
where $\mathcal{F}_t = \sigma(\Phi(r) : r \le t)$, and $s$ is a bounded, positive function. When the one step cost $c(\Phi(t))$ is large, then (2.1) tells us that we expect the value of $V(\Phi(t+1))$ to be correspondingly smaller than $V(\Phi(t))$. Hence there is a negative "drift" towards the center of the state space, where the cost is relatively low. The inequality (2.1) is closely related to both Foster's criterion for stability, and an approach known as Lyapunov's second method. Hence (2.1) is called a Foster-Lyapunov drift inequality. By the Markov property it may be equivalently expressed algebraically as $PV \le V - c + s$. The connections between Lyapunov functions and $c$-regularity are largely based upon the following general result, which is a consequence of the Comparison Theorem of [15].
Theorem 2.1 (Comparison Theorem). Let $\Phi$ be a Markov chain on $X$ satisfying the drift inequality
$$PV \le V - c + s. \eqno(2.2)$$
The functions $V$, $c$ and $s$ take values in $\mathbb{R}_+$. Then for any stopping time $\tau$,
$$\mathsf{E}_x[V(\Phi(\tau))] + \mathsf{E}_x\Big[\sum_{t=0}^{\tau-1} c(\Phi(t))\Big] \le V(x) + \mathsf{E}_x\Big[\sum_{t=0}^{\tau-1} s(\Phi(t))\Big]. \eqno(2.3)$$ □
The Comparison Theorem provides the bound that is required in the definition of $c$-regularity, provided that the RHS of (2.3) is finite valued. This of course will depend upon the function $s$. Define the resolvent as the weighted sum
$$K = \sum_{i=0}^{\infty} 2^{-(i+1)} P^i.$$
A function $s : X \to [0,1]$ is called
petite: if there is $\theta \in X$ such that
$$K(x,\theta) \ge s(x), \qquad x \in X; \eqno(2.4)$$
special: if
$$\sup_{x \in X} \mathsf{E}_x\Big[\sum_{t=0}^{\tau_\theta - 1} s(\Phi(t))\Big] < \infty.$$
In the definition of a special function we do not require that the return time $\tau_\theta$ be finite-valued. If $S \subseteq X$, and $\varepsilon\, 1\!\mathrm{l}_S$ is petite (special) for some $\varepsilon > 0$, then the set is called petite (special). A geometric trials argument shows that petite sets are always special [15, 18]. If the chain is irreducible in the usual sense then obviously every finite set is petite. If the function $s$ used in Theorem 2.1 is special, then the RHS of (2.3) will be bounded by $V(x)$ plus a constant, which implies that the chain is $c$-regular. This observation and renewal theory arguments lead to the following version of the $f$-Norm Ergodic Theorem of [15].
Theorem 2.2. Suppose that $\Phi$ is a Markov chain on $X$ satisfying the Foster-Lyapunov drift inequality
$$PV \le V - c + \eta,$$
where $\eta > 0$ is a constant, the function $c$ takes values in $[1,\infty)$, and the function $V$ takes values in $\mathbb{R}_+$. Suppose moreover that the set $S = \{x : c(x) \le 2\eta\}$ is petite. Then
(i): $\Phi$ is a $c$-regular Markov chain satisfying the bound, for some $b_0 < \infty$,
$$\mathsf{E}_x\Big[\sum_{t=0}^{\tau_\theta - 1} c(\Phi(t))\Big] \le V(x) + b_0, \qquad x \in X;$$
(ii): There is a unique invariant probability $\pi$, and $\pi(c) \le \eta$;
(iii): The "mean cost" converges for any $x$: $\mathsf{E}_x[c(\Phi(t))] \to \pi(c)$ as $t \to \infty$. □
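Theorem 2.2 can be checked numerically on a toy chain. For the invented truncated birth-death example used earlier (up-rate $p = 0.3$, down-rate $q = 0.5$, cost $c(x) = 1 + x$), the quadratic $V(x) = 2.5\,x^2$ satisfies $PV \le V - c + \eta$ with $\eta = 3$, and the invariant probability indeed satisfies $\pi(c) \le \eta$.

```python
import numpy as np

N, p, q = 20, 0.3, 0.5
P = np.zeros((N + 1, N + 1))
for x in range(N + 1):
    if x < N: P[x, x + 1] = p
    if x > 0: P[x, x - 1] = q
    P[x, x] = 1.0 - P[x].sum()

V = 2.5 * np.arange(N + 1) ** 2.0   # candidate stochastic Lyapunov function
c = 1.0 + np.arange(N + 1)
eta = 3.0
drift = P @ V - V                    # (PV - V)(x), computed state by state
# The Foster-Lyapunov inequality PV <= V - c + eta, with a tiny tolerance
# because the inequality is tight at interior states for this choice of V.
assert (drift <= -c + eta + 1e-9).all()

# Conclusion (ii) of Theorem 2.2: pi(c) <= eta.
A = np.vstack([P.T - np.eye(N + 1), np.ones(N + 1)])
b = np.zeros(N + 2); b[-1] = 1.0
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print("pi(c) =", pi @ c, "<= eta =", eta)
assert pi @ c <= eta
```

At interior states the drift is exactly $-x + 2 = -c(x) + \eta$, so this $V$ certifies stability with no slack; a larger coefficient on $x^2$ would give the same conclusion with room to spare.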
2.2. Poisson's equation. Suppose that $\Phi$ is a $c$-regular Markov chain with invariant probability $\pi$, and denote $\eta = \pi(c)$. The Poisson equation is defined as
$$Ph = h - c + \eta, \eqno(2.5)$$
where $h : X \to \mathbb{R}$. The computation of $h$ in the finite state space case involves a simple matrix inversion, which can be generalized to the present setting provided that the chain is $c$-regular. Given a function $s : X \to [0,1]$ and a probability $\nu$ on $X$, the kernel $s \otimes \nu : X \times X \to [0,1]$ is defined as the product $s \otimes \nu\,(x,y) = s(x)\nu(y)$, $x,y \in X$. Letting $\delta_\theta$ denote the point-mass at $\theta$, the minorization condition (2.4) may be expressed $K \ge s \otimes \delta_\theta$. Suppose that the Poisson equation has a solution $h$. Then we have the equivalent formula $Kh = h - K\bar{c}$, where $\bar{c}(x) = c(x) - \pi(c)$. Since any translation $h + \text{Const.}$ is also a solution, we may assume that $\nu(h) = 0$. We then have $(K - s \otimes \nu)h = h - K\bar{c}$, or
$$\big[I - (K - s \otimes \nu)\big] h = K\bar{c}.$$
The inverse of the infinite matrix on the LHS is given by the infinite sum
$$\big[I - (K - s \otimes \nu)\big]^{-1} = \sum_{i=0}^{\infty} (K - s \otimes \nu)^i.$$
Thus we may expect that the solution to Poisson's equation will be expressed by the formula
$$h(x) = \sum_{i=0}^{\infty} (K - s \otimes \nu)^i K\bar{c}\,(x), \qquad x \in X, \eqno(2.6)$$
provided the sum is absolutely convergent. This is indeed the case for $c$-regular chains [11, 16]. The advantage of this construction is that it can be applied to Markov chains on a general state space, and may even be extended to continuous time processes, where the kernel $K$ must be replaced with the continuous time version of the resolvent. When $X$ is countable and $\theta \in X$ is accessible, then a much more transparent construction is
$$h(x) = \mathsf{E}_x\Big[\sum_{t=0}^{\tau_\theta - 1} \big(c(\Phi(t)) - \eta\big)\Big], \qquad x \in X. \eqno(2.7)$$
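In the finite case, formula (2.7) also yields a direct computation: making $\theta$ "taboo" by zeroing its column in $P$ turns the expectation up to the return time into a linear system. The example chain is the same invented birth-death model; the result is checked against Poisson's equation (2.5).

```python
import numpy as np

N, p, q, theta = 20, 0.3, 0.5, 0
P = np.zeros((N + 1, N + 1))
for x in range(N + 1):
    if x < N: P[x, x + 1] = p
    if x > 0: P[x, x - 1] = q
    P[x, x] = 1.0 - P[x].sum()

# Invariant probability and eta = pi(c), with c(x) = 1 + x.
A = np.vstack([P.T - np.eye(N + 1), np.ones(N + 1)])
b = np.zeros(N + 2); b[-1] = 1.0
pi = np.linalg.lstsq(A, b, rcond=None)[0]
c = 1.0 + np.arange(N + 1)
eta = pi @ c

# (2.7): h(x) = E_x[ sum_{t<tau_theta} (c - eta) ] solves (I - Q) h = c - eta,
# where Q is P with the theta-column zeroed (transitions into theta are taboo).
Q = P.copy(); Q[:, theta] = 0.0
h = np.linalg.solve(np.eye(N + 1) - Q, c - eta)

assert abs(h[theta]) < 1e-6            # normalization: h(theta) = 0
assert np.allclose(P @ h, h - c + eta)  # Poisson's equation (2.5) holds
print("h(0..4) =", np.round(h[:5], 3))
```

The identity $h(\theta) = 0$ is exactly the regenerative representation $\eta = \mathsf{E}_\theta[\sum_{t<\tau_\theta} c(\Phi(t))] / \mathsf{E}_\theta[\tau_\theta]$ in disguise.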
Evidently, for a $c$-regular chain the function $h$ is finite-valued. The papers [11, 16] use these ideas to establish a sufficient condition for the existence of suitably bounded solutions to Poisson's equation for general state space chains. Theorem 2.3 specializes this result to the countable state space setting.
Theorem 2.3. Suppose that the Markov chain $\Phi$ is $c$-regular, and let
$$V(x) := \mathsf{E}_x\Big[\sum_{t=0}^{\tau_\theta - 1} c(\Phi(t))\Big].$$
Assume that with $\eta = \pi(c)$ the set $S$ defined below is petite:
$$S = \{x : Kc\,(x) \le 2\pi(c)\}.$$
Then the function $h$ given in (2.7) is a solution to Poisson's equation (2.5), and has the following properties:
(i): The function is bounded from above: $h(x) \le V(x)$, $x \in X$.
(ii): It is bounded from below: $\inf_{x \in X} h(x) > -\infty$.
(iii): It is essentially unique: If $g$ is any finite-valued function which is bounded from below and satisfies
$$Pg \le g - c + \eta,$$
then $g(x) - g(\theta) = h(x) - h(\theta)$ for almost every $x \in X$ $[\pi]$, and $g(x) - g(\theta) \ge h(x) - h(\theta)$ for every $x \in X$. □
The following result is then easily proven, since Poisson's equation is a very special case of the drift inequality used in Theorem 2.2.
Theorem 2.4. Suppose that $c$ is unbounded off of petite sets. That is, the sublevel set $S_\eta = \{x : c(x) \le \eta\}$ is petite for any $\eta > 0$. Then the following are equivalent:
(i): The Markov chain $\Phi$ is $c$-regular;
(ii): There exists a finite-valued function $V : X \to \mathbb{R}_+$ and a constant $\eta < \infty$ satisfying $PV \le V - c + \eta$;
(iii): There exists a positive and finite-valued solution $h$ to Poisson's equation. □
We note that if (ii) holds then the solution $h$ of Poisson's equation can be chosen so that, for some constant $b_0$,
$$0 \le h(x) \le V(x) + b_0, \qquad x \in X.$$
This follows from Theorem 2.2.
3. Existence of Optimal Policies
We now return to the case of a controlled chain with transition function $P_a$. If a stationary policy defined by a feedback law $w$ is used, then the controlled chain has stationary transition probabilities, to be denoted $P_w$, so that
$$P_w(x,y) = P_{w(x)}(x,y), \qquad x,y \in X.$$
Similarly, we write the resulting one-step cost as $c_w$, so that $c_w(x) = c(x,w(x))$, $x \in X$. A feedback law $w$ is called regular if the Markov chain with transition law $P_w$ is $c_w$-regular, as defined in the previous section. In this case the chain possesses a unique invariant probability, to be denoted $\pi_w$, and $\pi_w(c_w) < \infty$.
3.1. The minimal relative value function. With a better understanding of Poisson's equation in the control-free case we may now formulate existence criteria for the closely related ACOE (1.4). To begin, we define the following "cost over one cycle" $\eta_w$ for a Markov policy $w$:
$$\eta_w = \min\Big\{\eta \ge 1 : \mathsf{E}^w_\theta\Big[\sum_{t=0}^{\tau_\theta - 1} \big(c(\Phi(t),w_t(\Phi(t))) - \eta\big)\Big] \le 0\Big\}. \eqno(3.1)$$
The minimal cyclic cost over all Markov policies will be denoted $\eta^*$:
$$\eta^* = \inf\{\eta_w : w \text{ is Markov}\}. \eqno(3.2)$$
When $w$ is a regular feedback law, then the resulting stationary policy $w$ satisfies $\eta_w = \pi_w(c_w) = J(x,w)$, $x \in X$. The minimal relative value function $h^*$ is defined pointwise as
$$h^*(x) = \inf_w \mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta - 1} \big(c(\Phi(t),w_t(\Phi(t))) - \eta^*\big)\Big], \eqno(3.3)$$
where the infimum is over all Markov policies $w$. This definition is a generalization of the construction of a solution to Poisson's equation given in the control-free case. To obtain the desired bounds on $h^*$ (we hope that it is finite-valued, and uniformly bounded from below), the following assumptions will be imposed.
(A1): There exists a feedback law $w_{-1}$, a function $V_0 : X \to \mathbb{R}_+$, and $\eta < \infty$ satisfying
$$P_{w_{-1}} V_0 \le V_0 - c_{w_{-1}} + \eta.$$
(A2): For each fixed $x \in X$, the set $A(x)$ is finite. For each $\eta > 0$ the following sublevel set is finite:
$$S_\eta = \{x : \min_{a \in A(x)} c(x,a) \le \eta\}.$$
(A3): There is a fixed state $\theta \in X$ such that for any Markov policy $w$,
$$K_w(x,\theta) > 0, \qquad x \in X.$$
The assumption (A1) means simply that there is at least one regular policy. Condition (A2) asks that large states receive a large penalty through the cost function $c$. Condition (A3) is a generalization of the petite set condition imposed in the previous section. Using Fatou's Lemma one may show that this implies the (apparently) stronger set of conditions below, which show that the sublevel sets are "uniformly petite" and "uniformly special" over all Markov policies.
Proposition 3.1. Suppose that (A1)-(A3) hold. Then
(i): For some function $s : X \to (0,1)$ the following bound holds for all $x \in X$ and all Markov policies $w$:
$$K_w(x,\theta) \ge s(x);$$
(ii): For some non-decreasing function $B : \mathbb{R}_+ \to \mathbb{R}_+$ and any Markov policy $w$,
$$\sup_{x \in X} \mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta - 1} 1\!\mathrm{l}_{S_\eta}(\Phi(t))\Big] < B(\eta), \eqno(3.4)$$
where $S_\eta$ is defined in (A2).
Proof. (i) Suppose not. Then there exists some $x_0$ such that $\inf_w K_w(x_0,\theta) = 0$, and hence there is a sequence of Markov policies $w^n$ for which $K_{w^n}(x_0,\theta) \le 1/n$. Assume without loss of generality that $w^n \to w^\infty$ pointwise as $n \to \infty$. By Fatou's Lemma we then have
$$K_{w^\infty}(x_0,\theta) \le \liminf_{n\to\infty} K_{w^n}(x_0,\theta) = 0,$$
which contradicts (A3), establishing (i).
(ii) Fix $\eta > 0$ so that $S_\eta$ is non-empty, and find $\delta > 0$ such that $K_w(x,\theta) \ge \delta$ for $x \in S_\eta$ and any Markov policy $w$. We then have, for any such $w$,
$$\mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta - 1} 1\!\mathrm{l}_{S_\eta}(\Phi(t))\Big] \le \delta^{-1}\, \mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta - 1} K_{w^{[t]}}(\Phi(t),\theta)\Big],$$
where $w^{[t]}$ is the shifted policy $w^{[t]} = (w_t, w_{t+1}, \dots)$.
From the Markov property we can replace the resolvent by a sum over a sample path as follows:
$$\mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta-1} K_{w^{[t]}}(\Phi(t),\theta)\Big] = \mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta-1} \sum_{k=0}^{\infty} 2^{-k-1}\, 1\!\mathrm{l}(\Phi(t+k) = \theta)\Big] \le \mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta-1} \sum_{k=\tau_\theta - t}^{\infty} 2^{-k-1}\Big] = \mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta-1} 2^{-(\tau_\theta - t)}\Big] \le \sum_{t=1}^{\infty} 2^{-t} = 1. \qquad □$$
3.2. Properties. From the uniform bound (3.4) we then obtain uniform bounds on the minimal relative value function:
Proposition 3.2. Under (A1)-(A3) we have $h^*(\theta) = 0$, and for some $b_0 < \infty$,
$$-b_0 \le h^*(x) \le V_0(x) + b_0, \qquad x \in X.$$
Proof. That $h^*(\theta) = 0$ follows from the minimality of $\eta^*$ and the definition of $h^*$. The lower bound follows from Proposition 3.1, which implies the explicit bound $h^*(x) \ge -\eta^* B(\eta^*)$, $x \in X$. The upper bound is a consequence of minimality of $h^*$:
$$h^*(x) \le \mathsf{E}^{w_{-1}}_x\Big[\sum_{t=0}^{\tau_\theta-1} c(\Phi(t), w_{-1}(\Phi(t)))\Big] \le V_0(x) + \text{Const.} \qquad □$$
For any Markov policy $w' = (w'_0, w'_1, \dots)$ we can derive the following dynamic programming inequality, using the fact that $h^*(\theta) = 0$:
$$\begin{aligned}
\min_{a \in A(x)} \{c(x,a) - \eta^* + P_a h^*(x)\}
&\le c(x,w'_0(x)) - \eta^* + \sum_{y \ne \theta} P_{w'_0}(x,y)\, h^*(y) \\
&= c(x,w'_0(x)) - \eta^* + \sum_{y \ne \theta} P_{w'_0}(x,y)\, \min_w \mathsf{E}^w_y\Big[\sum_{t=0}^{\tau_\theta-1} \big(c(\Phi(t),w_t(\Phi(t))) - \eta^*\big)\Big] \\
&\le c(x,w'_0(x)) - \eta^* + \sum_{y \ne \theta} P_{w'_0}(x,y)\, \mathsf{E}^{w'^{[1]}}_y\Big[\sum_{t=0}^{\tau_\theta-1} \big(c(\Phi(t),w'_{t+1}(\Phi(t))) - \eta^*\big)\Big],
\end{aligned}$$
where in the last inequality we denote $w'^{[1]} = (w'_1, w'_2, \dots)$. From the Markov property we see that
$$\min_{a \in A(x)} \{c(x,a) - \eta^* + P_a h^*(x)\} \le \mathsf{E}^{w'}_x\Big[\sum_{t=0}^{\tau_\theta-1} \big(c(\Phi(t),w'_t(\Phi(t))) - \eta^*\big)\Big].$$
We can minimize over all Markov policies, since $w'$ was arbitrary, to obtain the inequality
$$\min_{a \in A(x)} \{c(x,a) - \eta^* + P_a h^*(x)\} \le h^*(x), \qquad x \in X.$$
Suppose that $w^*$ is a minimizing feedback law:
$$\min_{a \in A(x)} \{c(x,a) + P_a h^*(x)\} = c_{w^*}(x) + P_{w^*} h^*(x). \eqno(3.5)$$
We see then that the Poisson inequality holds:
$$c_{w^*}(x) + P_{w^*} h^*(x) \le \eta^* + h^*(x). \eqno(3.6)$$
It follows that $w^*$ is regular, and from the Comparison Theorem we have the bound $\pi_{w^*}(c_{w^*}) = \eta_{w^*} \le \eta^*$. By minimality of $\eta^*$ the equality $\eta_{w^*} = \eta^*$ must hold, and by minimality of $h^*$ and uniqueness of solutions to Poisson's equation the inequality (3.6) must also be an equality. We summarize these results in the existence theorem:
Theorem 3.3. Suppose that (A1)-(A3) hold and that $\eta^* < \infty$. Then
(i): The minimal relative value function $h^*$ is a solution to the ACOE.
(ii): Any feedback law $w^*$ satisfying (3.5) is optimal over all Markov policies.
(iii): Any feedback law $w^*$ satisfying (3.5) minimizes the "relative cost": For any Markov policy $w$, and any $x$,
$$h^*(x) = \mathsf{E}^{w^*}_x\Big[\sum_{t=0}^{\tau_\theta-1} \big(c_{w^*}(\Phi(t)) - \eta^*\big)\Big] \le \mathsf{E}^w_x\Big[\sum_{t=0}^{\tau_\theta-1} \big(c(\Phi(t),w_t(\Phi(t))) - \eta^*\big)\Big].$$
(iv): If $(h^+, w^+)$ is any other solution to (1.4) for which $\inf_x h^+(x) > -\infty$, then $w^+$ is regular with unique invariant probability $\pi^+$, and
$$h^+(x) - h^+(\theta) = h^*(x) \quad \text{if } \pi^+(x) > 0; \qquad h^+(x) - h^+(\theta) \ge h^*(x) \quad \text{for all } x. \qquad □$$
4. Value Iteration
Now that we know that the ACOE has a solution, the next question we consider is: how do we compute a solution? Recall that the ACOE may be described as a fixed point equation
$$h^* = L(h^*) = \min_a \{c(\,\cdot\,,a) - \eta^* + P_a h^*\}.$$
The first algorithm we shall look at is the value iteration algorithm, or VIA.
4.1. Successive approximation. The VIA approach to solving the ACOE was described in the introduction as being "somewhat like" the successive approximation approach to solving a fixed point equation. Suppose that $\tilde{h}_0$ is given as an initial condition, with $\tilde{h}_0 : X \to \mathbb{R}$ and $\inf_{x \in X} \tilde{h}_0(x) > -\infty$. Then the actual successive approximation algorithm defines a sequence of functions $\{\tilde{h}_n : n \ge 0\}$ inductively as follows: $\tilde{h}_{n+1} = L(\tilde{h}_n)$, or equivalently,
$$\tilde{h}_{n+1}(x) = \min_{a \in A(x)} \{c(x,a) - \eta^* + P_a \tilde{h}_n(x)\}, \qquad x \in X,\ n \ge 0. \eqno(4.1)$$
For any $n$ the function $\tilde{h}_n$ has the interpretation
$$\tilde{h}_n(x) = \min_w \mathsf{E}^w_x\Big[\sum_{t=0}^{n-1} \big(c(\Phi(t),w_t(\Phi(t))) - \eta^*\big) + \tilde{h}_0(\Phi(n))\Big],$$
where the minimum is over all Markov policies. Note the similarity between this and the definition (3.3), which is one motivation for the algorithm. However, the difficulty with (4.1) is that one must know a priori the optimal cost $\eta^*$. This turns out to be a minor difficulty, since the constant $\eta^*$ plays a minor role in the recursion (4.1). On eliminating this constant we arrive at the more easily implemented value iteration algorithm, which may be described as a two step procedure: Given the value function $V_n$,
Step 1: Compute the feedback law $w_n$ through the minimization
$$w_n(x) = \arg\min_{a \in A(x)} \{c(x,a) + P_a V_n(x)\}.$$
Step 2: Compute the value function $V_{n+1}$ via
$$V_{n+1} = c_n + P_n V_n,$$
with $c_n = c_{w_n}$ and $P_n = P_{w_n}$. The function $V_0 \ge 0$ is given as an initial condition. When the initial conditions are identical, $\tilde{h}_0 = V_0$, we then see that $V_n(x) - V_n(\theta) = \tilde{h}_n(x) - \tilde{h}_n(\theta)$ for all $x$ and $n$. Hence the function $h_n$ defined by
$$h_n(x) := V_n(x) - V_n(\theta), \qquad x \in X, \eqno(4.2)$$
is a plausible approximation to the solution of the ACOE when $n$ is large. The value function $V_n$ has the interpretation
$$V_n(x) = \min\, \mathsf{E}_x\Big[\sum_{t=0}^{n-1} c(\Phi(t),a(t)) + V_0(\Phi(n))\Big], \eqno(4.3)$$
where the minimum is over all policies, and is attained with the Markov policy $v^n$ whose first $n$ actions may be expressed
$$v^n_{[0,n-1]} = \big(w_{n-1}(\Phi(0)),\, \dots,\, w_0(\Phi(n-1))\big).$$
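The two step procedure above can be sketched as follows, on a two-state MDP whose numbers are invented for illustration. The iterates $V_n$ grow roughly linearly, but the increments $g_n = V_{n+1} - V_n$ flatten to the constant $\eta^*$, and $h_n = V_n - V_n(\theta)$ converges.

```python
# Invented two-state model: P[a][x][y] and c[x][a].
P = [
    [[0.8, 0.2], [0.6, 0.4]],      # action a = 0
    [[0.99, 0.01], [0.9, 0.1]],    # action a = 1
]
c = [[1.0, 2.0], [5.0, 6.0]]
theta = 0

V = [0.0, 0.0]                     # V_0 = 0 is harmless on this tiny model
for n in range(2000):
    # Step 1: w_n(x) = argmin_a { c(x,a) + P_a V_n (x) }
    w = [min((0, 1), key=lambda a: c[x][a] + sum(P[a][x][y] * V[y] for y in (0, 1)))
         for x in (0, 1)]
    # Step 2: V_{n+1} = c_n + P_n V_n under the feedback law just computed
    V1 = [c[x][w[x]] + sum(P[w[x]][x][y] * V[y] for y in (0, 1)) for x in (0, 1)]
    g = [V1[x] - V[x] for x in (0, 1)]   # g_n = V_{n+1} - V_n
    V = V1

h = [V[x] - V[theta] for x in (0, 1)]    # h_n(x) = V_n(x) - V_n(theta), cf. (4.2)
print("g_n (-> eta*):", [round(v, 6) for v in g], "  h_n:", [round(v, 6) for v in h])
assert abs(g[0] - g[1]) < 1e-8           # the increments flatten to a constant
```

On this example the limits agree with the relative value iteration computation: $g_n \to \eta^* = 21/11$ and $h_n(1) \to 50/11$.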
We shall let $w^n := (w_n, w_n, w_n, \dots)$ denote the stationary policy defined by the $n$th feedback law.
4.2. Stability. The VIA generates a sequence of feedback laws $\{w_n : n \ge 0\}$ which one expects will in some way approximate an optimal feedback law when $n$ becomes large. At the very least we hope that the feedback laws are regular, or exhibit some form of stability. However, one cannot even expect the controlled chains to be recurrent unless some additional assumptions are imposed. The following example is taken from [7]. Consider the network illustrated in Figure 2, consisting of four buffers and two machines fed by separate arrival streams. It is shown in [20] that the last buffer first served policy, where buffers 2 and 4 receive priority at their respective machines, will make the controlled process transient, even under the usual traffic condition $\rho < 1$, if the cross-machine condition $\lambda/\mu_2 + \lambda/\mu_4 \le 1$ is violated. If the VIA is applied to this model with $V_0 \equiv 0$, then one obtains $V_1(x) = |x|$, and
$$V_2(x) = |x| + \min_w \big(|x| + 2\lambda - \mu_2 w_2(x) - \mu_4 w_4(x)\big).$$
The minimizing feedback law $w^2$ is given by
$$w^2_2(x) = 1\!\mathrm{l}(x_2 > 0), \qquad w^2_4(x) = 1\!\mathrm{l}(x_4 > 0).$$
We conclude that $J(w^2) \equiv \infty$ when $\lambda/\mu_2 + \lambda/\mu_4 > 1$, since this is precisely the destabilizing policy introduced in [20, 13]. In fact, not only is the average cost infinite, the feedback law $w^2$ gives rise to a transient Markov chain. □
Figure 2. A multiclass network with d = 2 and K = 4.
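The destabilizing policy of this example can be observed in simulation. The sketch below uses a standard reading of the Figure 2 topology (two streams: buffer 1 feeds buffer 2, buffer 3 feeds buffer 4, with $s(1) = s(4) = 1$ and $s(2) = s(3) = 2$) and invented rates chosen so that each machine's nominal load is below one while $\lambda/\mu_2 + \lambda/\mu_4 > 1$; buffers 2 and 4 receive priority.

```python
import random

# Invented rates: lambda = 0.1 per stream, mu_1 = mu_3 = 0.4, mu_2 = mu_4 = 0.16.
# Each machine's load is 0.1/0.4 + 0.1/0.16 = 0.875 < 1, yet the cross-machine
# quantity lambda/mu_2 + lambda/mu_4 = 1.25 exceeds one.
RATES = {'a1': 0.1, 'a3': 0.1, 1: 0.4, 2: 0.16, 3: 0.4, 4: 0.16}
NEXT = {1: 2, 2: None, 3: 4, 4: None}   # routing: 1 -> 2 and 3 -> 4, then exit

def policy(x):
    """Priority to buffers 2 and 4 at their machines, non-idling otherwise."""
    w = {k: 0 for k in (1, 2, 3, 4)}
    for prefs in ((4, 1), (2, 3)):      # machine 1 serves {4,1}; machine 2 serves {2,3}
        for k in prefs:
            if x[k - 1] > 0:
                w[k] = 1
                break
    return w

def step(x, rng):
    w = policy(x)
    events = list(RATES)
    e = rng.choices(events, weights=[RATES[k] for k in events])[0]
    x = list(x)
    if e == 'a1':
        x[0] += 1                        # arrival to buffer 1
    elif e == 'a3':
        x[2] += 1                        # arrival to buffer 3
    elif w[e]:
        x[e - 1] -= 1                    # service completion in buffer e
        if NEXT[e]:
            x[NEXT[e] - 1] += 1
    return tuple(x)

rng = random.Random(2)
x = (0, 0, 0, 0)
for t in range(200000):
    x = step(x, rng)
print("|Phi| after 200000 steps:", sum(x))  # typically large: the chain is transient
```

With these rates the two high-priority buffers starve each other's upstream work in alternation, so the population typically grows without bound even though no single machine is overloaded.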
The difficulty in this example is that the VIA is initialized with $V_0 \equiv 0$, which is not remotely close to the actual relative value function $h^*$. Suppose that, instead of zero, the initial condition $V_0$ is chosen as a Lyapunov function for just one policy. In this case all of the feedback laws $\{w_n\}$ are guaranteed to be regular. To see this, let $V_0$ be the function defined in (A1), let $g_n = V_{n+1} - V_n$, set $\bar{\eta}_n = \sup_{x \in X} g_n(x)$, and let $\eta_n = \eta_{w^n}$. We let $P_n$ and $c_n$ denote the one step transition probability and one step cost obtained on the $n$th iteration, and $\pi_n$ denote the invariant probability for $P_n$, if one exists:
$$P_n = P_{w_n}, \qquad c_n = c_{w_n}, \qquad \pi_n = \pi_{w_n}.$$
Theorem 4.1. Under (A1)-(A3), if the VIA is initialized with the Lyapunov function $V_0$ then for each $n \in \mathbb{Z}_+$,
(i): $\bar{\eta}_{n+1} \le \bar{\eta}_n$;
(ii): $P_n V_n \le V_n - c_n + \bar{\eta}_n$;
(iii): The feedback law $w_n$ is regular, with $\eta_n \le \bar{\eta}_n$.
Proof. To see (i) we first note that, for all $n$,
$$P_n g_n = (P_n V_{n+1} + c_n) - (P_n V_n + c_n) \ge V_{n+2} - V_{n+1} = g_{n+1}. \eqno(4.4)$$
It immediately follows that $\bar{\eta}_n = \sup_x g_n(x) \ge \bar{\eta}_{n+1}$. The bound (ii) is then a simple consequence of the definitions of $V_{n+1}$ and $g_n$:
$$P_n V_n = V_{n+1} - c_n = V_n - c_n + g_n \le V_n - c_n + \bar{\eta}_n.$$
The final result (iii) then follows from Theorem 2.2. □
Hence it pays to initialize the VIA using a stochastic Lyapunov function. If there is a choice, one should choose the function $V_0$ so that the constant $\eta$ is as small as possible, since this represents a global upper bound on the performance of all succeeding feedback laws.
4.3. Convergence. Convergence of the VIA is more delicate than simple stability of the intermediate policies. A complete proof of the convergence of the functions $\{h_n\}$ given in (4.2) may be found in [7] for an initial function $V_0$ satisfying (A1). In [6] the case $V_0 \equiv 0$ is considered, under somewhat stronger assumptions on the model. Here we sketch the main ideas of the proof given in [7]. The first step is:
Proposition 4.2. Under (A1)-(A3) we have $h_n(\theta) = 0$ for all $n$, and for some $b_0 < \infty$,
$$-b_0 \le h_n(x) \le V_0(x) + b_0, \qquad x \in X.$$
Proof. The lower bound follows from Theorem 4.1 and the Comparison Theorem, which together give
$$h_n(x) = V_n(x) - V_n(\theta) \ \ge\ E^{w_n}_x\Big[\sum_{t=0}^{\sigma-1}\big(c_n(\Phi(t)) - \bar\eta_n\big)\Big] \ \ge\ -\bar\eta_n\, E^{w_n}_x\Big[\sum_{t=0}^{\sigma-1}\mathbf{1}_S(\Phi(t))\Big].$$
The RHS is uniformly bounded from below by Proposition 3.1. For the upper bound, use minimality of $V_n$ to obtain the bound
$$V_n(x) \ \le\ E^0_x\Big[\Big(\sum_{t=0}^{n-1} c_0(\Phi(t)) + V_0(\Phi(n))\Big)\mathbf{1}(\sigma > n)\Big] + E^0_x\Big[\Big(\sum_{t=0}^{\sigma-1} c_0(\Phi(t)) + V_{n-\sigma}(\theta)\Big)\mathbf{1}(\sigma \le n)\Big],$$
where $E^0$ is the expectation operator obtained with the policy $w_0$. Subtracting $V_n(\theta)$ from both sides then gives
(4.5)  $$h_n(x) \ \le\ E^0_x\Big[\Big(\sum_{t=0}^{n-1} c_0(\Phi(t)) + V_0(\Phi(n))\Big)\mathbf{1}(\sigma > n)\Big] + E^0_x\Big[\sum_{t=0}^{\sigma-1} c_0(\Phi(t))\Big] + E^0_x\Big[\big(V_{n-\sigma}(\theta) - V_n(\theta)\big)\mathbf{1}(\sigma \le n)\Big].$$
Each of these three terms may be bounded through regularity of $w_0$. From the inequality $P_0 V_0 \le V_0 - c_0 + \bar\eta$ and Theorem 2.2 we see that the second term is bounded by $V_0(x)$ plus a constant. The last term is similarly bounded, since one may show that $g_k(\theta) \ge -b_1$, $k \ge 0$, for some constant $b_1$. Hence
$$V_{n-\sigma}(\theta) - V_n(\theta) = -\sum_{k=n-\sigma}^{n-1} g_k(\theta) \ \le\ \sigma b_1.$$
The first term is bounded using these same ideas and an inductive argument. □
These uniform bounds may then be used to establish a form of "uniform regularity" for the Markov policies $\{v^n : n \ge 0\}$.

Lemma 4.3. Under (A1)–(A3) there is a fixed constant $b_0$ such that for all $x$ and $n$,
$$E^{v^n}_x[n \wedge \sigma] \ \le\ b_0\,(V_0(x) + 1).$$

Proof. Letting $V(t) = h_{n-t+1}(\Phi(t))$, we can establish the following drift inequality for the non-homogeneous chain:
$$E^{v^n}[V(t+1) \mid \mathcal F_t] \ \le\ V(t) - c_{v^n_t}(\Phi(t)) + \bar\eta, \qquad 0 \le t < n.$$
This is similar to the Lyapunov drift inequality (2.1), and may be used in the same way to obtain the desired bound on the mean of $\sigma$. The bound is uniform in $n$ because of Proposition 3.1 and Proposition 4.2. □

The next key step is to obtain a uniform bound on
$$g(x) := \limsup_{n\to\infty} g_n(x), \qquad x \in X.$$
Lemma 4.4. Suppose that (A1)–(A3) hold and that the initial value function $V_0$ satisfies
$$\frac{1}{n}\, P_w^n V_0\,(x) \to 0, \qquad n \to \infty,\ x \in X.$$
Then
(i): $g(x) \le \eta^*$ for every $x \in X$;
(ii): $\lim_{n\to\infty} g_n(\theta) = \eta^*$.

Proof. There are two steps to the proof. First we show that $g(\theta)$ is maximal, in the sense that
$$g(x) \ \le\ g(\theta), \qquad x \in X.$$
To see this we apply the bound $P_n g_n \ge g_{n+1}$ from (4.4) to construct a submartingale under $v^n$, defined as $M(t) = g_{n-t+1}(\Phi(t))$. Letting $\tau = \sigma \wedge n$ then gives
$$E^{v^n}_x[g_{n-\tau+1}(\Phi(\tau))] = E[M(\tau)] \ \ge\ M(0) = g_{n+1}(x).$$
Using this bound and the previous lemma giving uniform bounds on the first moment of $\tau$, we find that for any $x$,
$$g(\theta) = \limsup_{n\to\infty} E^{v^n}_x[g_{n-\tau+1}(\Phi(\tau))] \ \ge\ \limsup_{n\to\infty} g_{n+1}(x) = g(x).$$
Given that $g(\theta)$ is maximal, we then obtain a bound on $g(\theta)$. We can find a subsequence $\{n_i\}$ of $\mathbb Z_+$ such that for each $t$ the following limit exists pointwise:
$$W_t(x) := \lim_{i\to\infty} h_{n_i - t + 1}(x).$$
Moreover, from the uniform bounds on $\{h_n\}$ we may choose $\{n_i\}$ so that
$$P_w W_t \ \ge\ W_{t-1} - c_w + g(\theta),$$
and so that for some $b_0 < \infty$,
$$-b_0 \ \le\ W_t \ \le\ V_0 + b_0.$$
By iteration we then obtain
$$P_w^n W_t \ \ge\ W_{t-n} - \sum_{i=0}^{n-1} P_w^i c_w + n\,g(\theta),$$
and from the bounds on $\{W_t\}$ this gives
$$g(\theta) \ \le\ \frac{1}{n}\Big( P_w^n V_0 + 2 b_0 + \sum_{i=0}^{n-1} P_w^i c_w \Big).$$
Letting $n \to \infty$ we find that the RHS converges to $\eta^*$, which gives the desired upper bound $g(x) \le g(\theta) \le \eta^*$, $x \in X$.

To obtain equality when $x = \theta$ involves the identity
$$P_n h_n = h_n - c_n + g_n,$$
which implies that $\pi_n(g_n) = \eta_n$. An argument based upon Fatou's Lemma then shows that $g(\theta) \ge \eta^*$. Since we already have the reverse inequality, this shows that $g(\theta) = \eta^*$. □

Combining these bounds we obtain the following main result. Part (iii) follows from uniqueness of solutions to Poisson's equation, which we have established in Theorem 2.3 for the irreducible case only.

Theorem 4.5. Suppose that (A1)–(A3) hold and that the initial condition $V_0$ satisfies
$$\frac{1}{n}\, P_w^n V_0\,(x) \to 0, \qquad n \to \infty,\ x \in X.$$
Then
(i): As $n \to \infty$,
$$g_n(\theta) \to \eta^* \quad\text{and}\quad \frac{1}{n}\, V_n(\theta) \to \eta^*,$$
where $\eta^*$ is the optimal cost;
(ii): If $h_\infty$ is any pointwise limit of the sequence $\{h_n : n \ge 0\}$, then the pair $(h_\infty, \eta^*)$ is a solution to the average cost optimality equation (1.4);
(iii): Suppose in addition that $\Phi^w$ is irreducible in the usual sense for any optimal feedback law $w$. Then as $n \to \infty$, $h_n(x) \to h^*(x)$, $x \in X$, with $h^*$ equal to the minimal relative value function. □
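As a concrete illustration of the scheme analysed in this section, the following sketch runs the VIA on a small hypothetical controlled queue (a truncated birth-death chain with a choice of two service rates; all parameter values are invented for illustration), initialized with a quadratic Lyapunov-style function, and checks that the upper bounds $\bar\eta_n = \sup_x g_n(x)$ are monotone decreasing, as in Theorem 4.1 (i).

```python
import numpy as np

# Hypothetical model: states 0..N, actions a in {0, 1} with service
# probabilities mu[a], arrival probability lam, cost c(x, a) = x + kappa*a.
N, lam, mu, kappa = 20, 0.3, (0.4, 0.7), 2.0

def P(a):
    """One-step transition matrix under action a (truncated birth-death chain)."""
    M = np.zeros((N + 1, N + 1))
    for x in range(N + 1):
        up = lam if x < N else 0.0
        down = mu[a] if x > 0 else 0.0
        M[x, min(x + 1, N)] += up
        M[x, max(x - 1, 0)] += down
        M[x, x] += 1.0 - up - down
    return M

Ps = [P(0), P(1)]
cost = [np.arange(N + 1) + kappa * a for a in (0, 1)]

# Quadratic Lyapunov-style initial condition V_0 (not V_0 = 0).
V = 0.5 * np.arange(N + 1) ** 2 / (mu[1] - lam)
sup_g = []
for n in range(100):
    V_next = np.min([cost[a] + Ps[a] @ V for a in (0, 1)], axis=0)  # Bellman step
    sup_g.append(float((V_next - V).max()))  # eta_bar_n = sup_x g_n(x)
    V = V_next

# Theorem 4.1 (i): the upper bounds eta_bar_n are monotone decreasing.
assert all(b <= a + 1e-9 for a, b in zip(sup_g, sup_g[1:]))
```

The decreasing sequence `sup_g` is the running performance guarantee: each intermediate greedy policy has average cost bounded by the current value of $\bar\eta_n$.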
5. Policy Iteration
The PIA may be viewed as an optimized form of the VIA. This is seen by considering the Lyapunov drift inequality which holds at the $n$th stage of the VIA under (A1)–(A3):
(5.1)  $$c_n + P_n V_n \ \le\ V_n + \bar\eta_n.$$
It is this inequality that ensures regularity of both $w_n$ and $w_{n+1}$, and implies that $\eta_n$ and $\eta_{n+1}$ are both bounded by $\bar\eta_n$. Although there are many solutions to (5.1) in general, for an irreducible chain the solution which minimizes the upper bound $\bar\eta_n$ is unique, up to an additive constant. This `optimal' Lyapunov function is $h_n$, the solution to Poisson's equation, and the minimal upper bound is $\eta_n$, the steady state cost under $w_n$. The PIA is then an implementation of this reasoning: at each stage $n$ the solution to Poisson's equation is used to construct the next policy $w_{n+1}$, instead of the value function $V_n$. Hence we again arrive at a two step procedure: Given a feedback law $w_n$,

Step 1: Find a solution $h_n$ to the Poisson equation $c_n + P_n h_n = h_n + \eta_n$ with $\inf_x h_n(x) > -\infty$.
Step 2: Compute the next policy through the minimization
$$w_{n+1}(x) = \arg\min_{a\in A(x)} \big\{ c_a(x) + P_a h_n\,(x) \big\}.$$
Although similar in form to the VIA, this algorithm is substantially more computationally intensive due to the matrix inversion required in Step 1. To initialize the algorithm we only need a feedback law $w_0$ for which the associated Poisson equation admits a positive solution. That is, we require an initial regular feedback law, and we find in Theorem 5.1 that with this initialization all subsequent feedback laws will also be regular. This result also establishes bounds on the intermediate solutions to Poisson's equation which are tight enough to ensure convergence. For definiteness we take the specific solution to Poisson's equation given by
$$h_n(x) = E^{n}_x\Big[\sum_{t=0}^{\sigma-1}\big(c_n(\Phi(t)) - \eta_n\big)\Big], \qquad x \in X.$$
Theorem 5.1. Under (A1)–(A3), suppose that the PIA is initialized with a regular policy $w_{-1}$. Then the PIA generates a sequence of regular feedback laws $\{w_n : n \ge 0\}$ and relative value functions $\{h_n : n \ge 0\}$ such that
(i): The average costs are decreasing:
$$J(x; w_{n+1}) = \eta_{n+1} \ \le\ \eta_n, \qquad n \ge 0,\ x \in X.$$
(ii): The relative value functions are uniformly bounded from below, and "almost" decreasing: There exist $b_0, b_1 < \infty$ such that
$$-b_0 \ \le\ h_{n+1}(x) \ \le\ \big(1 + b_0(\eta_n - \eta_{n+1})\big)\, h_n(x) + b_1(\eta_n - \eta_{n+1}), \qquad n \ge 0,\ x \in X.$$
(iii): The sequence $\{h_n : n \ge 0\}$ is pointwise convergent to a limit $h_\infty$ satisfying, for some $b_2, b_3 < \infty$,
$$-b_2 \ \le\ h_\infty(x) \ \le\ b_2\, h_0(x) + b_3, \qquad x \in X.$$
Proof. The proof is similar to the proof of Theorem 4.1. To begin, observe that for each $n$ the function $h_n$ serves as a Lyapunov function for the transition function $P_{n+1}$:
(5.2)  $$c_{n+1} + P_{n+1} h_n \ \le\ h_n + \eta_n.$$
From the Comparison Theorem and Theorem 2.2 we conclude by induction that each feedback law $w_n$ is regular, and that $\eta_{n+1} \le \eta_n$ for all $n$. Also using the Comparison Theorem and (5.2), we obtain two important bounds: For some constants $b_4, b_5$,
(5.3)  $$E^{n+1}_x[\sigma] \ \le\ b_4\, h_n(x) + b_5, \qquad n \ge 0,\ x \in X,$$
and
(5.4)  $$E^{n+1}_x\Big[\sum_{t=0}^{\sigma-1}\big(c_{n+1}(\Phi(t)) - \eta_n\big)\Big] \ \le\ h_n(x) - h_n(\theta) = h_n(x), \qquad n \ge 0,\ x \in X.$$
The first bound requires some additional work, using the "uniformly special" property of the sublevel sets of $c$ given in Proposition 3.1. Since the LHS of (5.4) differs from the definition of $h_{n+1}$ only in the use of $\eta_n$ in place of $\eta_{n+1}$, we can combine (5.3) and (5.4) to obtain
$$h_{n+1}(x) \ \le\ h_n(x) + (\eta_n - \eta_{n+1})\,E^{n+1}_x[\sigma] \ \le\ \big(1 + b_4(\eta_n - \eta_{n+1})\big)\, h_n(x) + b_5(\eta_n - \eta_{n+1}).$$
These arguments complete the proof. □

From these bounds we may prove a complement to Theorem 4.5.

Theorem 5.2. Suppose that (A1)–(A3) hold and that the initial condition $h_0$ satisfies
$$\frac{1}{n}\, P_w^n h_0\,(x) \to 0, \qquad n \to \infty,\ x \in X.$$
Then, as $n \to \infty$,
$$h_n(x) \to h_\infty(x), \qquad \eta_n \to \eta^*,$$
where $(h_\infty, \eta^*)$ is a solution to the average cost optimality equations, and $\eta^*$ is the optimal cost. □
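For a finite chain, Step 1 of the two-step procedure reduces to a linear solve. The sketch below runs the PIA on a hypothetical two-action controlled queue (all parameter values invented for illustration) and checks that the average costs $\eta_n$ are decreasing, as in Theorem 5.1 (i).

```python
import numpy as np

# Hypothetical model: states 0..N, actions a in {0, 1} with service
# probabilities mu[a], arrival probability lam, cost c(x, a) = x + kappa*a.
N, lam, mu, kappa = 10, 0.3, (0.4, 0.7), 2.0
states = np.arange(N + 1)

def P(a):
    M = np.zeros((N + 1, N + 1))
    for x in range(N + 1):
        up = lam if x < N else 0.0
        down = mu[a] if x > 0 else 0.0
        M[x, min(x + 1, N)] += up
        M[x, max(x - 1, 0)] += down
        M[x, x] += 1.0 - up - down
    return M

Ps = [P(0), P(1)]
costs = [states + kappa * a for a in (0, 1)]

def poisson(Pw, cw):
    """Step 1: solve c + P h = h + eta, with the normalization h(0) = 0."""
    n = len(cw)
    # Unknowns (h(1), ..., h(n-1), eta); equation (I - P) h + eta 1 = c.
    A = np.column_stack([(np.eye(n) - Pw)[:, 1:], np.ones(n)])
    sol = np.linalg.solve(A, cw)
    return np.concatenate(([0.0], sol[:-1])), sol[-1]

w = np.zeros(N + 1, dtype=int)  # initial regular feedback law: slow server
etas = []
for _ in range(20):
    Pw = np.array([Ps[w[x]][x] for x in range(N + 1)])
    cw = np.array([costs[w[x]][x] for x in range(N + 1)])
    h, eta = poisson(Pw, cw)
    etas.append(float(eta))
    # Step 2: policy improvement through the relative value function h.
    w = np.argmin([costs[a] + Ps[a] @ h for a in (0, 1)], axis=0)

# Theorem 5.1 (i): the average costs are decreasing along the iteration.
assert all(b <= a + 1e-9 for a, b in zip(etas, etas[1:]))
```

The linear system in `poisson` is nonsingular for an irreducible chain, which is the finite-state face of the "matrix inversion required in Step 1" noted above.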
6. Networks
We now return to the network scheduling problem described in the introduction. This is given as the nonlinear state space model (1.2) with countable state space $\mathbb Z_+^K$. Viewed as a combinatorial optimization problem, the scheduling problem possesses unimaginable complexity. For the idealized model (1.2) with unbounded buffers, and hence countable state space, it is unlikely that any optimal scheduling policy will ever be explicitly computable, except in trivial cases such as a single queue where there is nothing to schedule. However, when viewed as a dynamic control problem there is much that can be said, and the general results described in the preceding sections aid a great deal in our understanding of this control problem. We describe here the form of the optimal policy when the system state corresponds to a highly congested network. In this region of the state space the optimal policy approximates the solution to a fluid control problem, which attempts to drive the state towards the origin in an optimal fashion.
The fluid model is described by an ordinary differential equation
(6.1)  $$\frac{d}{dt}\,\phi(t) = \sum_{k=0}^{K} \mu_k\, [e^{k+1} - e^k]\, u_k(t),$$
where the function $u(t) \in \mathbb R^{K+1}$ is analogous to the discrete control, and satisfies similar constraints [9]. If (6.1) is to approximate (1.2) then the control $u$ must depend on the feedback law $w$ which was used for the original discrete model. Suppose for example that $w$ is the last buffer-first served (LBFS) policy, which gives priority to a class $k$ customer over a class $j$ customer at the same machine if and only if $k \ge j$. Then the control $u$ for the fluid model (6.1) will be the same, though it must be described with greater care: at a given machine $\alpha$, if $s(k) = \alpha$ then
$$\sum_{i \ge k\,:\, s(i) = \alpha} u_i(t) = 1 \quad\text{whenever}\quad \sum_{i \ge k\,:\, s(i) = \alpha} \phi_i(t) > 0.$$
The primary reason for studying (6.1) is that, when properly scaled, this equation describes approximately the behavior of the network (1.2) in a transient phase during which the network is highly congested. The scaling is through the initial condition $\Phi(0) = x$, which is assumed to be large: If $|x|t$ is an integer, we set
$$\phi^x(t) = \frac{1}{|x|}\,\Phi(|x|t).$$
For all other $t \ge 0$, we define $\phi^x(t)$ by linear interpolation, so that it is continuous and piecewise linear in $t$. Note that $|\phi^x(0)| = 1$, and that $\phi^x$ is Lipschitz continuous. One can show that for large $x$ the process $\phi^x$ is an approximate solution to (6.1) for some choice of $u$.

In this section we describe qualitatively the consequences of the relationship between (1.2) and (6.1) for the network scheduling problem. We will not make many precise statements; for theorems and proofs, the reader is referred to the papers [9, 12] for connections with stability, and [7, 17, 16] for results pertaining to network optimization and optimization of the fluid model (6.1).

6.1. The relative value function. To understand the minimal relative value function $h^*$ for the network control problem we first must understand regularity in this context. For any feedback law $w$ it may be shown, using the skip-free property $|\Phi(t+1)| \ge |\Phi(t)| - 1$, that
$$h_w(x) := E^w_x\Big[\sum_{t=0}^{\sigma-1} c\big(\Phi(t), w_t(\Phi(t))\big)\Big] \ \ge\ \tfrac{1}{4}|x|^2 - \tfrac{1}{2}B,$$
where $B$ is a fixed constant and $\theta$ denotes the empty state, in which every buffer is empty. Thus the function $h^*$ satisfies the same lower bound. Provided that the network can be stabilized (that is, the usual traffic condition $\rho < 1$ holds), for many policies we can obtain a quadratic upper bound
$$h_w(x) \ \le\ b_0(|x|^2 + 1), \qquad x \in X,$$
where $b_0$ is some fixed constant. This bound holds for the last buffer-first served policy, for example. Hence the minimal relative value function must be equivalent to a quadratic, in the sense that for some $\varepsilon > 0$, $b_0 < \infty$,
$$\varepsilon |x|^2 - b_0 \ \le\ h^*(x) \ \le\ \varepsilon^{-1} |x|^2 + b_0, \qquad x \in X.$$
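The fluid scaling $\phi^x(t) = |x|^{-1}\Phi(|x|t)$ introduced above can be checked by simulation. The sketch below strips away the multiclass structure for brevity: a single discrete queue with invented rates is started from a large initial condition, and the scaled process is compared with the fluid trajectory $\max(1 - (\mu - \lambda)t,\, 0)$.

```python
import numpy as np

# Single-queue illustration (hypothetical rates): arrival probability lam,
# service probability mu, large initial condition x0 = |x|.
rng = np.random.default_rng(0)
lam, mu, x0 = 0.3, 0.7, 2000

Phi = [x0]
for _ in range(3 * x0):          # simulate 3|x| discrete time steps
    q = Phi[-1]
    if rng.random() < lam:       # arrival with probability lam
        q += 1
    if q > 0 and rng.random() < mu:  # service with probability mu when busy
        q -= 1
    Phi.append(q)

for t in (0.5, 1.0, 2.0):
    scaled = Phi[int(x0 * t)] / x0           # phi^x(t) = Phi(|x| t) / |x|
    fluid = max(1.0 - (mu - lam) * t, 0.0)   # fluid-model trajectory
    assert abs(scaled - fluid) < 0.1         # close when |x| is large
```

The fluctuations around the fluid path are of order $|x|^{-1/2}$, which is why the approximation is only useful in the highly congested regime described in the text.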
These observations are a consequence of the following result, taken from [12, 17]. Although this result is stated as a list of "properties", the reader must be warned that many of the results given here are not entirely precise as stated. In particular, in Property 6.1 we should not permit any solution to (6.1), but only those which arise as so-called "weak limits" of the $\{\phi^x\}$. Moreover, we are ignoring the possibility that some of these limits might be non-deterministic. These technicalities are beyond the scope of this paper: For exact statements see the aforementioned references.

Property 6.1. The following are equivalent for any feedback law $w$:
(i): There exists a solution $V$ to the Foster-Lyapunov drift criterion (2.2) which is equivalent to a quadratic.
(ii): There exists a solution to Poisson's equation (2.5) which is equivalent to a quadratic.
(iii): The "total cost" for the fluid model is uniformly bounded: For some $B_0 < \infty$,
$$\int_0^\infty |\phi(t)|\,dt \ <\ B_0$$
for any "weak limit" $\phi$ of $\{\phi^x\}$. □
The fluid limit model will be called stable if Property 6.1 (iii) is satisfied; that is, if the total cost for the fluid model is uniformly bounded over all initial conditions satisfying $|\phi(0)| = 1$. It will be useful to consider the minimal cost for the fluid model, given by
(6.2)  $$V^*(x) = \min \int_0^\infty |\phi(t)|\,dt, \qquad \phi(0) = x,\ x \in X,$$
where the minimum is over all possible controls $u$ for the fluid model.
The equivalence expressed in Property 6.1 can be refined to show that if the fluid model is stable in the sense of (iii), then the solution to Poisson's equation in (ii) can be approximated by the total cost:
(6.3)  $$h(x) \ \approx\ \int_0^\infty |\phi(t)|\,dt, \qquad \text{when } \phi(0) = x,\ |x| \ge 1.$$
This suggestive approximation leads to an interesting implementation of the VIA, and is also used to show that the PIA for the network simultaneously acts as the policy improvement algorithm for the fluid model, which computes the value function $V^*$. These connections will be explored next.

6.2. Algorithms. We have seen that the PIA leads to the pair of equations
$$P_n h_n = h_n - c + \eta_n, \qquad P_{n+1} h_n \ \le\ h_n - c + \eta_n,$$
where here we are taking $c(x) = |x|$. If $h_n$ is quadratically bounded, then it follows from Theorem 2.2 and the above inequality that $h_{n+1}$ is also quadratically bounded. Hence, from Property 6.1, if the initial feedback law $w_0$ is chosen so that its fluid model is stable, then all subsequent policies will have stable fluid models. From these equations we can also show that $h_{n+1}(x)/|x|^2 \le h_n(x)/|x|^2 + o(1)$, where the term $o(1)$ is bounded by a constant times $1/|x|$. From (6.3) we then obtain the following:
Property 6.2. If the initial feedback law $w_0$ is chosen so that the fluid model is stable, then
(i): The PIA produces a sequence $\{(w_n, h_n, \eta_n) : n \ge 0\}$ such that each associated fluid model is stable. Any feedback law $w_\infty$ which is a pointwise accumulation point of $\{w_n\}$ is an average cost optimal policy.
(ii): For each $n \ge 1$, with identical initial conditions,
$$\int_0^\infty |\phi^n(s)|\,ds \ \le\ \int_0^\infty |\phi^{n-1}(s)|\,ds.$$
Hence if $w_0$ is optimal for the fluid model, so is $w_n$ for all $n$.
(iii): With $\phi^\infty$ equal to the fluid solution for any optimal feedback law $w_\infty$ generated by the algorithm, and $h^*$ equal to the minimal relative value function,
$$\frac{h^*(x)}{|x|^2} \ \approx\ \int_0^\infty |\phi^\infty(s)|\,ds, \qquad \phi^\infty(0) = x/|x|,\ |x| \ge 1.$$
Hence, when properly normalized, the relative value function approximates the value function for the fluid model control problem.
(iv): With $\phi^\infty$ equal to the fluid solution for any optimal feedback law $w_\infty$ generated by the algorithm, and $\phi$ equal to any other fluid model solution with identical initial condition $\phi^\infty(0) = \phi(0) = x$,
$$\int_0^\infty |\phi^\infty(s)|\,ds \ \le\ \int_0^\infty |\phi(s)|\,ds.$$
Hence the limiting feedback law $w_\infty$ is optimal for the fluid model. □
Property 6.2 (iii) leads to a candidate function $V_0$ to initialize the value iteration algorithm. Note that if by luck we chose the initial function $V_0$ exactly equal to $h^*$, then we would find that $V_n = h^* + n\eta^*$ for each $n$. While we will never be so lucky in this optimization problem, given (iii) it is very sensible to use the value function $V^*$ given in (6.2) as an initial condition in the VIA since, at least for large $x$, this is approximately equal to $h^*$. While the function $V^*$ given in (6.2) is not easily computable in general, we can obtain an approximation based upon a finite dimensional linear program [8]. If a sufficiently tight approximation to (6.2) is found, then one can expect that (A1) will hold for this approximation.

We illustrate this approach with the three buffer example illustrated in Figure 1. We have computed explicitly the value function $V^*$ for the fluid model using the optimal fluid policy given in [21]. For comparison we have also computed the value function $V_0 = V_{\mathrm{LBFS}}$ for the fluid model under the LBFS policy. Three experiments were performed, based upon three different initializations: $V_0 \equiv 0$, which is the standard VIA; $V_0 = V_{\mathrm{LBFS}}$; and $V_0 = V^*$. The results from two experiments are shown in Figure 3. We have truncated buffer levels at 33 in the model to obtain a chain with $33^3 = 35{,}937$ states. In each experiment value iteration was performed for 300 steps, saving data for $n = 10, \dots, 300$. The parameter values $(\lambda, \mu_1, \mu_2, \mu_3)$ are given in [7]. The convergence is slow when $V_0 \equiv 0$, but exceptionally fast in the other two experiments. Surprisingly, the "suboptimal" choice using LBFS leads to the fastest convergence.
Figure 3. Convergence of the VIA with $V_0$ taken as the value function for the associated fluid control problem. Two value functions are found: one with the optimal fluid policy, and the other taken with the LBFS fluid policy. Both choices lead to remarkably fast convergence. Surprisingly, the "suboptimal" choice using LBFS leads to the fastest convergence of $J(w_n)$ to the optimal cost 10.9.
In summary, the optimal value function for the fluid model does approximate the minimal relative value function. We see this in a general result, and through the experiment illustrated in Figure 3. The question then is, how can we use this insight in network design? In a real, discrete queueing network with a large number of stations it is natural to construct a policy which uses the optimal fluid policy to regulate the buffer levels to some desirable, perhaps optimal, mean value. Attempting to maintain some non-zero steady state in this way has two desirable features. One is that when problems arise causing network congestion, the policy will resemble the optimal fluid policy, as desired. In normal operating conditions, when the state stays near some target level, we expect good performance, since states are neither too large nor too small. Small buffer levels are not necessarily desirable, since starvation of work at a queue can lead to a loss of resource utilization, burstiness, and even instability. Some initial experiments on a class of policies based upon these ideas were introduced in [17], which also includes some positive numerical experiments.
7. Extensions
Many extensions of these results are possible. The general state space case is particularly important in the network problem since it is unfortunate to be forced
to assume that service times and interarrival times are exponentially distributed, as we did in the previous section. Here we describe three situations which lie outside of the class of models which have been considered in this paper. We merely lay some foundation for extending the results of this paper to the more general, or more complex, setting. Details may be found in the appropriate references.

7.1. General state spaces. The situation where $X$ is uncountable, such as $X = \mathbb R^n$, is typical in many MDPs encountered in practice. The techniques used in this paper can be extended to this more complex setting, provided that (A1)–(A3) are modified appropriately. A complete treatment of the PIA is given in [16], where (A3) is replaced by

(A3'): There is a fixed probability $\nu$ on $\mathcal B(X)$ and a $\delta > 0$ such that, for any feedback law $w$ satisfying $\eta_w \le \bar\eta_0$,
$$K_w(x, A) \ \ge\ \delta\,\nu(A) \qquad \text{for all } x \in S,\ A \in \mathcal B(X),$$
where $S$ denotes the sublevel set
(7.1)  $$S = \{x : \min_a c(x, a) \le 2\bar\eta_0\}.$$

Most of the proofs go through essentially unchanged, with the solution to Poisson's equation given by (2.6) instead of (2.7), which depends upon an accessible state. Using these ideas it is shown in [16] that the PIA converges for general state space models, and from this the results of the previous section concerning optimization of networks carry over without modification.

7.2. Continuous time. Continuous time models are easily treated using these same techniques, even for general state space models. Again, the main issue is understanding Poisson's equation, which in continuous time takes the form
(7.2)  $$\mathcal D_{w_0} h = -c_{w_0} + \eta_{w_0},$$
where $\mathcal D_{w_0}$ denotes the generator of the process under the feedback law $w_0$. The existence of well behaved solutions to (7.2) follows under conditions analogous to the discrete time case [11].
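For a finite state space, a solution to (7.2) can be computed explicitly through the resolvent construction described later in this subsection, since the resolvent is then a matrix inverse and $\mathcal D_w R_w = R_w - I$. A sketch, with an invented three-state generator and cost:

```python
import numpy as np

# Hypothetical 3-state CTMC generator Q (rows sum to zero) and cost c.
Q = np.array([[-2.0, 2.0, 0.0],
              [1.0, -3.0, 2.0],
              [0.0, 1.0, -1.0]])
c = np.array([1.0, 0.0, 3.0])

# Resolvent R_w = integral of e^{-t} P^t dt = (I - Q)^{-1} for a finite chain.
R = np.linalg.inv(np.eye(3) - Q)

# Steady state pi (pi Q = 0, pi 1 = 1) and average cost eta = pi(c).
A = np.vstack([Q.T, np.ones(3)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)[0]
eta = pi @ c

# Resolvent Poisson equation R g = g - c + eta, i.e. (R - I) g = -c + eta.
# R - I is singular (Q has a zero eigenvalue), so we take a least-squares
# solution; the system is consistent because pi(-c + eta) = 0.
g = np.linalg.lstsq(R - np.eye(3), eta - c, rcond=None)[0]
h = R @ g

# h = R g then solves the continuous-time Poisson equation D h = Q h = -c + eta.
assert np.allclose(Q @ h, -c + eta)
```

The identity $\mathcal D_w R_w = R_w - I$ used here is exactly the one invoked in the text below to convert a resolvent solution into a solution of (7.2).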
To see how the optimality equations appear, and how one can establish existence of an optimal policy, suppose that $w^*$ is a feedback law such that, for any other feedback law $w$,
$$c_{w^*} + \mathcal D_{w^*} h \ \le\ c_w + \mathcal D_w h.$$
If $h$ solves Poisson's equation (7.2) with $w_0 = w^*$, then the bound above may be written
$$\eta_{w^*} \ \le\ c_w + \mathcal D_w h,$$
and then, by the definition of the extended generator,
$$\frac{1}{T}\int_0^T E_x[c_w(\Phi_t)]\,dt \ \ge\ \eta_{w^*} - \frac{1}{T}\big(P_w^T h\,(x) - h(x)\big).$$
This bound holds for any $x$, and any $T > 0$. Letting $T \to \infty$, it follows that
$$\liminf_{T\to\infty} \frac{1}{T}\int_0^T E_x[c_w(\Phi_t)]\,dt \ \ge\ \eta_{w^*},$$
provided that
$$\frac{1}{T}\, P_w^T h\,(x) \to 0 \qquad \text{as } T \to \infty.$$
To construct a solution to Poisson's equation it is simplest to first establish a solution to Poisson's equation for the resolvent, defined by
$$R_w = \int_0^\infty e^{-t}\, P_w^t\,dt.$$
Suppose that a solution to the Poisson equation is found for the resolvent:
$$R_w g = g - c_w + \eta_w.$$
Using the formula $\mathcal D_w R_w = R_w - I$ we may set $h = R_w g$ to obtain
$$\mathcal D_w h = \mathcal D_w R_w g = (R_w - I)g = -c_w + \eta_w.$$
Since stability properties of the continuous time process and the resolvent are essentially equivalent [14, 10], one can construct solutions to the Poisson equation and the minimal relative value function under assumptions analogous to (A1)–(A3).

The PIA in continuous time is given as follows. Suppose that the policy $w_{n-1}$ is given, and that $h_{n-1}$ satisfies Poisson's equation
$$\mathcal D_{n-1} h_{n-1} = -c_{n-1} + \eta_{n-1},$$
where we again adopt the notation used in the discrete time development. This equation has a solution provided that the chain $\Phi^{w_{n-1}}$ is $c_{n-1}$-regular [11]. A new feedback law $w_n$ is then found which satisfies, for any other feedback law $w$,
$$c_n + \mathcal D_{w_n} h_{n-1} \ \le\ c_w + \mathcal D_w h_{n-1}.$$
To see when this is an improved policy, let $w = w_{n-1}$: from the inequality above, and Poisson's equation which $h_{n-1}$ is assumed to satisfy, we have
$$\mathcal D_{w_n} h_{n-1} \ \le\ -c_n + \eta_{n-1}.$$
This is a stochastic Lyapunov drift inequality, but now in continuous time. Again, if $h_{n-1}$ is bounded from below, if $c_n$ is norm-like, and if compact sets are petite, then the process $\Phi^{w_n}$ is $c_n$-regular, and we have $\eta_n \le \eta_{n-1}$. Since the feedback law $w_n$ is so well behaved, one can assert that Poisson's equation
$$\mathcal D_n h_n = -c_n + \eta_n$$
has a solution $h_n$ which is bounded from below, and hence the algorithm can once again begin the policy improvement step. Results analogous to Theorem 5.2 can then be formulated using almost identical methodology [16].

7.3. Risk sensitive control. The techniques described here may also be extended to other control criteria. One which has attracted much recent attention is the "risk sensitive" criterion,
(7.3)  $$R(x, a) = \limsup_{n\to\infty} \frac{1}{n} \log E_x\Big[\exp\Big(\sum_{t=0}^{n-1} c(\Phi(t), a(t))\Big)\Big].$$
The use of the exponential reduces the probability of large excursions of the state, since the one step cost $c$ satisfies condition (A2).
Under certain conditions on the model, in particular under the assumption of linearity, the controls that optimize (7.3) are known to be robust to model uncertainty. In general it may be shown that any stationary policy which gives rise to a finite risk sensitive cost will enjoy some attractive properties. In particular, the controlled chain is geometrically ergodic [4], which itself implies some degree of robustness to model uncertainty [11].

The first step in solving this control problem is to define an appropriate analogue of the relative value function. We first define the cost over one cycle via
$$\Lambda_w := \inf\Big\{\eta \ge 0 : E^w\Big[\exp\Big(\sum_{t=0}^{\sigma-1}\big[c(\Phi(t), w_t(\Phi(t))) - \eta\big]\Big)\Big] \le 1\Big\}.$$
We then let $\Lambda^* := \inf \Lambda_w$, where the infimum is over all Markov policies, and set $\lambda^* = \exp(\Lambda^*)$. It is shown in [4] that $R(x, w) \ge \Lambda^*$ for any Markov policy.

A candidate relative value function and optimal policy are defined respectively as follows: For each $x \in X$,
(7.4)  $$h^*(x) := \min_w E^w_x\Big[\exp\Big(\sum_{t=0}^{\sigma-1}\big[c(\Phi(t), w_t(\Phi(t))) - \Lambda^*\big]\Big)\Big],$$
(7.5)  $$w^*(x) := \arg\min_{a\in A}\big\{\exp(c(x,a))\,P_a h^*\,(x)\big\},$$
where in (7.5) the feedback law $w^*$ is taken to be any solution to the minimization. With these definitions we find the following.

Proposition 7.1. Suppose that (A1)–(A3) hold, and suppose that $\Lambda^* < \infty$. Then the function $h^*$ satisfies the following bounds:
(i): $h^*(\theta) = 1$;
(ii): $\exp(c_{w^*}(x))\,P_{w^*} h^*\,(x) \le \lambda^* h^*(x)$, $x \in X$;
(iii): $\inf_{x\in X} h^*(x) > 0$.

Property (i) follows from minimality of $\Lambda^*$; (ii) follows from dynamic programming arguments, exactly as in the proof of Theorem 3.3; and (iii) follows from Proposition 3.1 (ii) and Jensen's inequality: for any Markov policy,
$$\log E^w_x\Big[\exp\Big(\sum_{t=0}^{\sigma-1}\big[c(\Phi(t), w_t(\Phi(t))) - \Lambda^*\big]\Big)\Big] \ \ge\ E^w_x\Big[\sum_{t=0}^{\sigma-1}\big[c(\Phi(t), w_t(\Phi(t))) - \Lambda^*\big]\Big] \ \ge\ -B,$$
for a fixed constant $B$. This uniform lower bound then gives a uniform lower bound on $h^*$. □

It is property (ii) that implies the geometric recurrence for the chain under the policy $w^*$. This inequality, combined with the uniform lower bound (iii), implies that $h^*$ acts as a stochastic Lyapunov function for the process, satisfying a bound far stronger than (2.2), which implies in particular that
$$E^{w^*}_x[c_{w^*}(\Phi(t))] \to \pi(c), \qquad \text{geometrically fast as } t \to \infty.$$
See [4] for other consequences. The optimality of $w^*$ is proven by iterating (ii), which gives
$$E^{w^*}_x\Big[\exp\Big(\sum_{t=0}^{n-1} c_{w^*}(\Phi(t))\Big)\,h^*(\Phi(n))\Big] \ \le\ (\lambda^*)^n\, h^*(x).$$
Using this and the lower bound (iii) shows that $R(x, w^*) = \Lambda^*$ for any $x$ with $h^*(x) < \infty$; the set of such $x$ is all of $X$ if the chain $\Phi^{w^*}$ is irreducible. When assembled together, this leads to the following existence theorem for solutions to a dynamic programming equation.

Theorem 7.2. Suppose that (A1)–(A3) hold and that $\Lambda^* < \infty$. Then
(i): The function $h^*$ solves the dynamic programming equation
$$\min_{a\in A(x)} \exp(c(x,a))\,P_a h^*\,(x) = \lambda^* h^*(x), \qquad x \in X;$$
(ii): Any feedback law $w^*$ satisfying
(7.6)  $$\min_{a\in A(x)} \exp(c(x,a))\,P_a h^*\,(x) = \exp(c_{w^*}(x))\,P_{w^*} h^*\,(x)$$
is optimal over all Markov policies;
(iii): Any feedback law $w^*$ satisfying (7.6) minimizes the "relative cost": For any Markov policy $w$,
$$h^*(x) = E^{w^*}_x\Big[\exp\Big(\sum_{t=0}^{\sigma-1}\big[c_{w^*}(\Phi(t)) - \Lambda^*\big]\Big)\Big] \ \le\ E^w_x\Big[\exp\Big(\sum_{t=0}^{\sigma-1}\big[c(\Phi(t), w_t(\Phi(t))) - \Lambda^*\big]\Big)\Big];$$
(iv): If $(h^+, w^+)$ is any other solution to the dynamic programming equation in (i) for which $\inf_{x\in X} h^+(x) > -\infty$, then $P_{w^+}$ has a unique invariant probability $\pi^+$, and
$$h^+(x)/h^+(\theta) = h^*(x) \ \text{ whenever } \pi^+(x) > 0, \qquad h^+(x)/h^+(\theta) \ \ge\ h^*(x) \ \text{ for all } x. \qquad \Box$$

Given the relative value function $h^*$ and the analogous theory of the "multiplicative Poisson equation" developed in [2], one can then construct multiplicative versions of the PIA and VIA, and prove convergence using approaches similar to those presented here.
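For a finite irreducible chain under a fixed policy, the risk sensitive cost (7.3) is the logarithm of the Perron-Frobenius eigenvalue of the "twisted" kernel $e^{c(x)}P(x,y)$; this is the finite-state face of the multiplicative ergodic theory of [2]. A sketch, with an invented three-state chain and cost:

```python
import numpy as np

# Hypothetical 3-state transition matrix P and one step cost c.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
c = np.array([0.0, 0.1, 0.3])

# Twisted kernel e^{c(x)} P(x, y); its Perron-Frobenius eigenvalue gives
# the risk sensitive cost Lambda = log(lambda_PF).
twisted = np.exp(c)[:, None] * P
Lambda = np.log(np.abs(np.linalg.eigvals(twisted)).max())

# Cross-check against the definition (1/n) log E_x exp(sum of costs):
# (twisted^n 1)(x) = E_x exp(c(Phi(0)) + ... + c(Phi(n-1))).
n = 400
v = np.ones(3)
logscale = 0.0
for _ in range(n):                 # power iteration with rescaling
    v = twisted @ v
    s = v.max()
    v /= s
    logscale += np.log(s)
R_direct = (logscale + np.log(v[0])) / n   # (1/n) log (twisted^n 1)[0]

assert abs(R_direct - Lambda) < 1e-2
```

The rescaled power iteration is just the expectation in (7.3) computed recursively; its growth rate converges to $\Lambda$ at a geometric rate governed by the spectral gap of the twisted kernel.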
References
[1] A. Arapostathis, V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim., 31:282-344, 1993.
[2] S. Balaji and S. P. Meyn. Multiplicative ergodic theorems for an irreducible Markov chain. Submitted, 1998.
[3] V. S. Borkar. Topics in Controlled Markov Chains. Pitman Research Notes in Mathematics Series #240, Longman Scientific & Technical, UK, 1991.
[4] V. S. Borkar and S. P. Meyn. Risk sensitive optimal control: Existence and synthesis for models with unbounded cost. SIAM J. Control Optim., 1998. Submitted for publication.
[5] X. Cao. The relation among potentials, perturbation analysis, and Markov decision processes. J. DEDS, to appear, 1998.
[6] R. Cavazos-Cadena. Value iteration in a class of communicating Markov decision chains with the average cost criterion. Technical report, Universidad Autonoma Agraria Antonio Narro, 1996.
[7] R.-R. Chen and S. P. Meyn. Value iteration and optimization of multiclass queueing networks. QUESTA, to appear, 1997.
[8] D. Eng, J. Humphrey, and S. P. Meyn. Fluid network models: Linear programs for control and performance bounds. In J. Gertler, J. Cruz, and M. Peshkin, editors, Proceedings of the 13th IFAC World Congress, volume B, pages 19-24, San Francisco, California, 1996.
[9] J. G. Dai. On the positive Harris recurrence for multiclass queueing networks: A unified approach via fluid limit models. Ann. Appl. Probab., 5:49-77, 1995.
[10] D. Down, S. P. Meyn, and R. L. Tweedie. Geometric and uniform ergodicity of Markov processes. Ann. Probab., 23(4):1671-1691, 1996.
[11] P. W. Glynn and S. P. Meyn. A Lyapunov bound for solutions of Poisson's equation. Ann. Probab., 24, April 1996.
[12] P. R. Kumar and S. P. Meyn. Duality and linear programs for stability and performance analysis of queueing networks and scheduling policies. IEEE Trans. Automat. Control, 41(1):4-17, January 1996.
[13] P. R. Kumar and T. I. Seidman. Dynamic instabilities and stabilization methods in distributed real-time scheduling of manufacturing systems. IEEE Trans. Automat. Control, AC-35(3):289-298, March 1990.
[14] S. P. Meyn and R. L. Tweedie. Generalized resolvents and Harris recurrence of Markov processes. Lecture Notes in Mathematics. Springer-Verlag, Berlin, 1992.
[15] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[16] S. P. Meyn. The policy improvement algorithm for Markov decision processes with general state space. 1997.
[17] S. P. Meyn. Stability and optimization of multiclass queueing networks and their fluid models. In Proceedings of the summer seminar on "The Mathematics of Stochastic Manufacturing Systems". American Mathematical Society, 1997.
[18] E. Nummelin. General Irreducible Markov Chains and Non-Negative Operators. Cambridge University Press, Cambridge, 1984.
[19] M. L. Puterman. Markov Decision Processes. Wiley, New York, NY, 1994.
[20] A. N. Rybko and A. L. Stolyar. On the ergodicity of stochastic processes describing open queueing networks. Problemy Peredachi Informatsii, 28:2-26, 1992.
[21] G. Weiss. Optimal draining of a fluid re-entrant line. In Stochastic Networks, volume 71 of IMA Volumes in Mathematics and its Applications, pages 91-103. Springer-Verlag, New York, 1995.
[22] P. Whittle. Optimisation: Basics and Beyond. John Wiley and Sons, Chichester, England, 1996.
Sean Meyn, Coordinated Science Laboratory and the University of Illinois, 1308 W. Main Street, Urbana, IL 61801. URL: http://black.csl.uiuc.edu:80/~meyn
E-mail address:
[email protected]