On Optimal Control of Markov Chains with Safety Constraint

Shun-Pin Hsu, National Chi-Nan University, Electrical Engineering, Nantou, Taiwan 545
Ari Arapostathis, The University of Texas at Austin, Electrical and Computer Engineering, Austin, TX 78712-0240
Ratnesh Kumar, Iowa State University, Electrical and Computer Engineering, Ames, IA 50011-3060

This research was supported in part by the National Science Foundation under grants ECS-0099851, ECS-0218207, ECS-0244732, EPNES-0323379, and ECS-0424048, and in part by the Office of Naval Research through the Electric Ship Research and Development Consortium.
Abstract— We study the control of completely observed Markov chains with safety bounds as introduced in [3], but with more general safety constraints and the added requirement of optimality. In [3], the safety bounds were specified as unit-interval valued vector pairs (lower and upper bounds for each component of the state probability distribution). In this paper we generalize the constraint set to be any convex polyhedral set and present a way to compute a stationary control policy which is safe (i.e., maintains the safety of a distribution that is initially safe) and at the same time long-run average optimal. We propose a linear programming formulation for computing such a safe optimal policy. Under a simplifying assumption that the optimal policy is ergodic, we present a finitely terminating iterative algorithm to compute the maximal invariant safe set (MISS), in which the initial distribution must lie so that the future distributions always remain safe. Our approach allows us to calculate an upper bound for the number of iterations needed for the algorithm to terminate. In particular, for two-state chains we show that at most one iteration is needed to compute the MISS.

Index Terms— Control of Markov chains, safety control, stochastic discrete event systems, reliability
I. INTRODUCTION

Controlled Markov Chains (CMCs) are among the most widely used models for the study of stochastic systems. A problem that has attracted a lot of attention in recent years is that of finding a policy that minimizes the induced long-run average cost. It is well known that, under certain conditions, this optimal policy can be characterized by the stochastic dynamic programming equation called Bellman's ergodic optimality equation, which can be solved by the policy iteration or value iteration algorithm (see e.g. [5]). When the CMC model is subject to other constraints (e.g., on cost), the model is called a constrained CMC, or CMC with constraints. To solve the associated dynamic programming equation with constraints, a linear programming formulation involving the concept of state-action frequency was proposed (see e.g. [1]), which is dual to the original dynamic programming equation (see [4]). Other models of CMCs with constraints can be found in [7], [8].

The optimal policy relative to the long-run average cost could have undesirable transient behavior. This might be a serious concern in the short-term investment business, or for small companies that can only survive in an environment of limited supply-demand variation. For modern computer-related industries, which provide short life-cycle products in a highly competitive market, both the long- and the short-term behavior of a policy are significant. Our work addresses this issue and combines ideas used in classical constrained CMCs and in the safety control of non-stochastic discrete event systems (DESs).

A non-stochastic DES is often modeled by a state machine or an automaton that evolves in response to the occurrence of events. Events are categorized into a controllable class, which can be controlled by an external agent, and an uncontrollable class. The policy dynamically disables controllable events so that the closed-loop behavior satisfies the control goal. The goal of safety control is normally specified by a set of forbidden states that the system must avoid. Thus, a policy performing safety control must prevent the system from visiting those pre-specified states.

We integrate the concept of safety control of non-stochastic DESs into the classical constrained CMC and form a strict-sense version of constrained CMC, in which the behavior of the system is constrained at each time step, in addition to the long-run average. Previous work in this setting can be found in [2], [3], [10], where the constraint was specified as upper/lower bounds on the system's state probability distribution vector. In this paper we consider more general constraints in the form of convex polyhedral sets, which we call constraint sets. A distribution is safe if it lies in this set. We address the problem of constructing a policy that minimizes a long-run average cost objective subject to the requirement that the state probability distribution be safe at each time step.

In this paper we first assume that every admissible policy induces a unique invariant distribution, and apply linear programming to search among all safe policies for the one which minimizes the incurred long-run average cost. The issue of existence of a safe policy is addressed via a feasibility analysis of the formulated linear program. Next we present an algorithm to construct the maximal set of safe initial distributions corresponding to a given safe policy, and show that the algorithm terminates in finitely many steps provided that under the safe policy the chain has a unique invariant distribution in the interior of the constraint set. A theoretical upper bound on the number of iterations needed by the algorithm to terminate is derived. This algorithm can be easily implemented in practice, as is illustrated by
a numerical example. Furthermore, we show that the maximal set of safe initial distributions is a safety invariant set, meaning that the chain under the safe policy leaves this set invariant. For this reason we call the maximal set of safe initial distributions the maximal invariant safe set (MISS). When the MISS of a safe policy is the constraint set itself, we say that the policy is safety enforcing. We discuss how to identify a safety enforcing policy.

II. PRELIMINARIES AND NOTATION

A controlled Markov chain is represented by the tuple (X, U, {P(u)}_{u∈U}, π_0), where X = {1, . . . , n} is a finite state space composed of n states, U = {1, . . . , k} is a finite set of control inputs, for each u ∈ U, P(u) = [p^u_ij] is a state transition matrix, and π_0 is the initial distribution of the state variable. Let Π and Π_U denote the sets of distributions on the state space and action space, respectively. Assuming complete observation of the state, for each time k ≥ 0, a history space H_k and an admissible control function µ_k are defined recursively as follows:
    H_0 = {x_0} ⊆ X ;    µ_0 : H_0 → Π_U
    H_{k+1} = H_k × Π_U × X ⊆ (X × Π_U)^k × X ;    µ_{k+1} : H_{k+1} → Π_U

The sequence µ = {µ_k | k ≥ 0} is called an admissible control strategy or policy. The set of all admissible policies will be denoted by Σ. Certain classes of admissible policies are of special interest. A strategy µ is called Markov if each µ_k depends only on x_k. A Markov policy is called stationary if there exists a function ϕ : X → Π_U such that µ_k ≡ ϕ for all k. If ϕ is a stationary policy, then the state probability distribution π_k is Markov with transition matrix P_ϕ, which can be obtained from {P(u)}_{u∈U} as follows. Note that for each i ∈ X, ϕ(i) = (ϕ_u(i))_{u∈U} is a distribution vector on U. Then P_ϕ can be defined as:

    P_ϕ,ij := Σ_{u∈U} ϕ_u(i) p^u_ij .    (1)

III. COMPUTING OPTIMAL SAFE CONTROL POLICY

A. Linear Programming Formulation

Consider a constraint set Π_c := {π ∈ Π | πA ≤ b}, where b = [b_1 b_2 ··· b_m] is a row vector of size m, and A, with (i, j)th entry A_ij, is a matrix of size n × m. Let I(b) = {1, ···, m} be the index set of b. The safety concept of the strict-sense constrained CMC is as follows.

Definition 3.1: A state probability distribution π is safe if π ∈ Π_c, where Π_c is the constraint set. An admissible policy ϕ is safe if there exists a set of safe distributions Π_in ⊂ Π_c such that, for all k ∈ N,

    π_0 ∈ Π_in ⇒ π_k := π_0 P_ϕ^k ∈ Π_c .    (2)

When (2) holds, we say Π_in is a set of safe initial distributions corresponding to the safe policy ϕ. It is not difficult to see that the existence of Π_in is assured if the policy ϕ induces a safe invariant distribution (Theorem 2, [2]). Therefore, a given constrained CMC might possess many safe policies. We identify the optimal safe policy which minimizes a pre-specified cost function. The idea is to construct a linear programming formulation as suggested in [1]. Suppose β_iu is the long-run average probability (occupation measure) that the state i is visited and the action u is then taken. The occupation measure matrix [β_iu] encodes the following stationary control policy: ϕ_u(i) = β_iu / Σ_u β_iu. The resulting transition matrix P_ϕ can be calculated by (1). Note that P_ϕ can have many invariant distributions depending on the initial distribution. The particular invariant distribution encoded by [β_iu] is given by π* = [Σ_u β_1u  Σ_u β_2u  ···  Σ_u β_nu]. If c(i, u) is the pre-specified one-step cost when the system is in state i and action u is taken, then we formulate the following linear program to search for the optimal safe policy, which minimizes the long-run average cost among all the safe policies:

    min_{[β_iu]}  Σ_{i∈X} Σ_{u∈U} c(i, u) β_iu    (3)

    s.t.  Σ_{i∈X} Σ_{u∈U} β_iu = 1 ,    (4)

          Σ_{j∈X} Σ_{u∈U} β_ju p^u_ji = Σ_{u∈U} β_iu    ∀ i ∈ X ,    (5)

          Σ_{j∈X} Σ_{u∈U} β_ju A_ji ≤ b_i    ∀ i ∈ I(b) ,    (6)

          β_iu ≥ 0    ∀ i ∈ X, u ∈ U .
Equation (4) follows from the fact that β_iu defines a stationary policy, equation (5) is simply the invariant distribution identity π* P_ϕ = π*, and inequality (6) represents the safety constraint. In general, the solution of the linear program in (3)–(6) would leave some of the constraints in (6) active. Therefore the invariant distribution π* thus obtained would normally lie on the boundary of Π_c. In the interest of robust design, we can choose to modify (6) by replacing b_i by b_i − ε, for some ε > 0. Then the solution π* lies in the interior of Π_c. This is utilized in Section IV.

B. Feasibility Analysis

In this section we discuss the feasibility of the linear programming formulation (3)–(6). This problem is equivalent to the existence problem of a safe policy for the given CMC. Let 0 and 1 denote the column vectors of 0's and 1's, respectively, and let I denote the identity matrix, with its ith column denoted by e_i. Moreover, we use A_i· and A^T to denote the ith row and the transpose of the matrix A, respectively. Equations (3)–(6) can thus be written in the following matrix form:

    min_β  c^T β
    subject to  Rβ = S ,  Wβ ≤ b ,  β ≥ 0 ,

where c = [c(1,1) ··· c(1,k) ··· c(n,k)]^T and β = [β_11 ··· β_1k ··· β_nk]^T. Let p^u_i· = [p^u_i1  p^u_i2  ···  p^u_in]; then

    R = [ (p^1_1·)^T − e_1   ···   (p^k_1·)^T − e_1   ···   (p^k_n·)^T − e_n
                1            ···           1          ···           1        ] .

Note that R ∈ R^{(n+1)×nk}. Furthermore, S = [0 ··· 0 1]^T ∈ R^{n+1}, and

    W = [ A^T_1· 1^T_k   A^T_2· 1^T_k   ···   A^T_n· 1^T_k ] ∈ R^{m×nk} ,

where 1_k is the column vector of k 1's.
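The linear program above is straightforward to set up numerically. The following is a minimal sketch, not from the paper, of how one might assemble c, R, S, and W from generic problem data and solve (3)–(6) with scipy.optimize.linprog; the function name optimal_safe_policy and the array layout are our own illustrative choices.

```python
# A sketch, assuming generic problem data: P has shape (k, n, n) with
# P[u] = P(u); cost has shape (n, k) with cost[i, u] = c(i, u);
# A (n x m) and b (m,) give the safety constraint pi @ A <= b.
import numpy as np
from scipy.optimize import linprog

def optimal_safe_policy(P, cost, A, b):
    k, n, _ = P.shape
    m = A.shape[1]
    nk = n * k                       # beta flattened: (i, u) -> i*k + u
    c = cost.reshape(nk)
    # Equality block: invariance (5) in rows 0..n-1, normalization (4) in row n.
    R = np.zeros((n + 1, nk))
    for i in range(n):
        for u in range(k):
            R[:n, i * k + u] = P[u, i, :]    # p^u_{i.}
            R[i, i * k + u] -= 1.0           # minus e_i
    R[n, :] = 1.0
    S = np.zeros(n + 1); S[n] = 1.0
    # Inequality block (6): sum_{j,u} beta_{ju} A_{ji} <= b_i.
    W = np.zeros((m, nk))
    for j in range(n):
        for u in range(k):
            W[:, j * k + u] = A[j, :]
    res = linprog(c, A_ub=W, b_ub=b, A_eq=R, b_eq=S, bounds=(0, None))
    if not res.success:
        return None                          # LP infeasible: no safe policy found
    beta = res.x.reshape(n, k)
    # Recover the stationary policy phi_u(i) = beta_iu / sum_u beta_iu;
    # states with zero occupation get an arbitrary (uniform) row.
    tot = beta.sum(axis=1, keepdims=True)
    phi = np.where(tot > 1e-12, beta / np.maximum(tot, 1e-12), 1.0 / k)
    return phi, beta
```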
TABLE I
MAINTENANCE PARAMETERS FOR EXAMPLE 3.6

Current State | Available Actions | Transition Parameters | Action Cost
--------------|-------------------|-----------------------|------------
E             | basic maint.      | pE = .12, pEG = .7    | 100
E             | advanced maint.   | pE = .08, pEG = .9    | 300
M             | basic maint.      | pM = .2,  pMG = .68   | 500
M             | advanced maint.   | pM = .05, pMG = .9    | 1000
L             | basic maint.      | pL = .2,  pLG = .65   | 50
L             | advanced maint.   | pL = .1,  pLG = .88   | 200
G             | do nothing        | pG = .85              | 0
The linear program (3)–(6) is feasible if the set

    β^≤ := {β | Rβ = S, Wβ ≤ b, β ≥ 0}

is nonempty. Consider the set

    θ^≥ := {θ = [θ_1^T θ_2^T]^T | θ_1^T S + θ_2^T b < 0, θ_1^T R + θ_2^T W ≥ 0^T, θ_2 ≥ 0} .

If both β^≤ and θ^≥ were nonempty, then there would exist β* and θ* such that θ_1*^T S = θ_1*^T Rβ* and θ_2*^T b ≥ θ_2*^T Wβ*. Hence

    θ_1*^T S + θ_2*^T b ≥ θ_1*^T Rβ* + θ_2*^T Wβ* = (θ_1*^T R + θ_2*^T W)β* ,

which is nonnegative, and a contradiction results. This means that β^≤ and θ^≥ cannot both be nonempty. Determining which of these two sets is nonempty can be accomplished by considering the following quadratic program:

    min_β  ‖Rβ − S‖²
    subject to  Wβ ≤ b ,  β ≥ 0 ,    (7)

where ‖·‖ denotes the Euclidean norm. Denote r_β := Rβ − S; we have the following result.

Lemma 3.2: In (7), for a feasible β*, that is, β* ≥ 0 and Wβ* ≤ b, if there exists v ≥ 0 such that r_{β*}^T R + v^T W ≥ 0^T, (r_{β*}^T R + v^T W)β* = 0, and v^T (Wβ* − b) = 0, then β* solves (7).

Remark 3.3: Define L_β as the index set of active constraints at β, that is, L_β := {r | W_r· β = b_r}, where W_r· is the rth row of W and b_r the rth element of b. If β* is a regular point, that is, if the rows W_r· are linearly independent for r ∈ L_{β*}, then the conditions in Lemma 3.2 are necessary and sufficient, due to the convexity of the objective function and the Karush-Kuhn-Tucker theorem.

According to the definition of r_β, we have r_{β*}^T S + v^T b = −‖r_{β*}‖² + r_{β*}^T Rβ* + v^T b. By the conditions of Lemma 3.2, v^T b = v^T Wβ* and (r_{β*}^T R + v^T W)β* = 0. Thus, we obtain the following:

Theorem 3.4: Suppose that β* solves (7) and is a regular point, and let r_{β*} = Rβ* − S. If r_{β*} = 0, then β* ∈ β^≤. Otherwise there exists v ≥ 0 such that if r*^T := [r_{β*}^T v^T] then r* ∈ θ^≥. Furthermore, r_{β*}^T S + v^T b = −‖r_{β*}‖² < 0.

The variable r* in Theorem 3.4 has the following property.

Theorem 3.5: If in Theorem 3.4 r_{β*} ≠ 0, then r*/‖r_{β*}‖ solves the following problem:

    min_{θ_1, θ_2}  θ_1^T S + θ_2^T b
    subject to  θ_1^T R + θ_2^T W ≥ 0 ,  ‖θ_1‖ = 1 ,  θ_2 ≥ 0 .
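In practice, the quadratic program (7) can be solved with any off-the-shelf QP/NLP routine. Below is a minimal sketch, under our own naming, that uses scipy.optimize.minimize (SLSQP) to decide feasibility in the sense of Theorem 3.4: a zero residual certifies β^≤ ≠ ∅, while a nonzero residual is the first block of an infeasibility certificate in θ^≥.

```python
# A sketch of the feasibility test via the quadratic program (7):
# minimize ||R beta - S||^2 subject to W beta <= b, beta >= 0.
# R, S, W, b are as in Section III-B; names here are illustrative.
import numpy as np
from scipy.optimize import minimize

def feasibility_test(R, S, W, b, tol=1e-8):
    nk = R.shape[1]
    obj = lambda x: float(np.sum((R @ x - S) ** 2))
    jac = lambda x: 2.0 * R.T @ (R @ x - S)
    cons = [{'type': 'ineq', 'fun': lambda x: b - W @ x}]   # W x <= b
    res = minimize(obj, np.full(nk, 1.0 / nk), jac=jac,
                   bounds=[(0, None)] * nk, constraints=cons,
                   method='SLSQP')
    r = R @ res.x - S                    # residual r_{beta*}
    if np.linalg.norm(r) < tol:
        return True, res.x               # beta* in beta<=: LP (3)-(6) feasible
    return False, r                      # nonzero residual: infeasible (Thm 3.4)
```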
We now present an example that uses a constrained CMC to optimize the scheduling task in machine maintenance.

Example 3.6: A quality control engineer is in charge of the scheduling task for a complicated manufacturing system composed of several subsystems such as assembly stations, robots, and computer control systems. Since the components in each subsystem are prone to failure, the engineer categorizes all the manufacturing system's components into three types: E (electrical), M (mechanical), and L (lubricant), and contracts with three companies for the associated maintenance work. The operation of the manufacturing system is then classified into four states: E, M, L, and G. If the system is in the E (M, or L) state, then its electrical (mechanical, or lubricant) components need maintenance and the associated company will be called. When the system is in the G (good) state, no maintenance work is performed. Let the system state space be S = {1, 2, 3, 4}, where 1, 2, 3, and 4 represent states E, M, L, and G, respectively. Suppose that the probability transition matrix for the operation of the system is

    P = [ pE  rE  rE  pEG
          rM  pM  rM  pMG
          rL  rL  pL  pLG
          rG  rG  rG  pG  ]

where rE = (1 − pE − pEG)/2, rM = (1 − pM − pMG)/2, rL = (1 − pL − pLG)/2, and rG = (1 − pG)/3. Due to the budget constraint, at most one company can be called at a time, and the called company can perform either basic or advanced maintenance, distinguished by the maintenance cost. Suppose the parameters listed in Table I are used. If the quality requirement asks that the average probability of the system being in E, M, or L not exceed .05, and that the average cost not exceed 65, then we can construct a linear program following (3)–(6) to obtain the optimal safe policy that maximizes the probability of the system being in the G state under these constraints.
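For concreteness, here is a minimal numerical sketch of this example, with the Table I data hard-coded; the layout and helper name are our own, the objective is written as maximizing Σ_u β_4u, and the average-cost requirement enters as one extra inequality row on β.

```python
# A sketch of Example 3.6: maximize the long-run probability of state G
# subject to pi(E), pi(M), pi(L) <= .05 and average cost <= 65.
# States 0..3 = E, M, L, G; actions 0 = basic, 1 = advanced (G: do nothing).
import numpy as np
from scipy.optimize import linprog

def trans(pE, pEG, pM, pMG, pL, pLG, pG=.85):
    rE, rM = (1 - pE - pEG) / 2, (1 - pM - pMG) / 2
    rL, rG = (1 - pL - pLG) / 2, (1 - pG) / 3
    return np.array([[pE, rE, rE, pEG], [rM, pM, rM, pMG],
                     [rL, rL, pL, pLG], [rG, rG, rG, pG]])

P = np.stack([trans(.12, .7, .2, .68, .2, .65),     # u = 0: basic maint.
              trans(.08, .9, .05, .9, .1, .88)])    # u = 1: advanced maint.
cost = np.array([[100., 300.], [500., 1000.], [50., 200.], [0., 0.]])
n, k = 4, 2
obj = np.zeros(n * k); obj[3 * k:] = -1.0           # maximize sum_u beta_{4u}
R = np.zeros((n + 1, n * k)); S = np.zeros(n + 1); S[-1] = 1.0
for i in range(n):
    for u in range(k):
        R[:n, i * k + u] = P[u, i]; R[i, i * k + u] -= 1.0
R[-1, :] = 1.0                                       # normalization (4)
W = np.zeros((4, n * k)); b = np.array([.05, .05, .05, 65.])
for s in range(3):
    W[s, s * k:(s + 1) * k] = 1.0                    # pi(s) <= .05, s = E, M, L
W[3] = cost.reshape(n * k)                           # average cost <= 65
res = linprog(obj, A_ub=W, b_ub=b, A_eq=R, b_eq=S, bounds=(0, None))
beta = res.x.reshape(n, k)                           # feasible per the paper
print(beta / beta.sum(axis=1, keepdims=True))        # stationary policy phi
```

Its output can be checked against the policy and invariant distribution reported next.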
Solving the linear program yields the optimal policy, which suggests the action of advanced maintenance when the system is in the E or L state. If the system is in the M state, the policy suggests a randomized action with weight 1/3 on basic and 2/3 on advanced maintenance. The transition matrix induced by the optimal safe policy is

    P = [ 0.08    0.01    0.01    0.90
          0.0367  0.1002  0.0367  0.8264
          0.01    0.01    0.10    0.88
          0.05    0.05    0.05    0.85  ] ,

the corresponding invariant distribution is

    π* = [π*(1) π*(2) π*(3) π*(4)] = [.0488 .0485 .0499 .8528] ,

and the average cost is C* = 65.

IV. MAXIMAL INVARIANT SAFE SET

So far we have identified the optimal policy ϕ that is safe in the long run. If we can find the corresponding set Π_in of safe initial distributions, then whenever the chain starts in this set and is controlled by the optimal policy, its distribution will remain safe at each step, and the long-run average cost will be minimized. Although this optimal safe policy induces a unique limiting distribution π*, its corresponding set Π_in might not be unique. For example, the smallest such set is the set containing π* only, and the largest such set is Π_c if the policy is safety enforcing, as defined in Section I. In this section we study an algorithm that characterizes the maximal set Π_ϕ among all the Π_in's corresponding to a given optimal safe policy ϕ. As we will see later, searching for Π_ϕ is equivalent to searching for the MISS of P_ϕ. This observation is used in the design of our algorithm. To keep our analysis simple, we make the following assumption on the induced limiting distribution of the given safe policy ϕ. More general cases can be analyzed similarly.

Assumption 4.1: The optimal safe stationary policy induces a transition matrix which has a unique communicating class of aperiodic recurrent states, and the corresponding limiting distribution lies in the interior of the safety set.

In order to ensure that the unique limiting distribution π* lies in the interior of the safety set, the optimal safe stationary policy may be computed by the linear program in (3)–(6), with b_i in (6) replaced by b_i − ε. Therefore, there exists an ε > 0 such that π*A + ε1 ≤ b, where 1 is a row vector of 1's of the same size as b.

A. Searching Algorithm

Consider the following algorithm to compute the maximal set Π_ϕ of safe initial distributions corresponding to the safe policy ϕ.

Algorithm 4.1:

    Π^(0) = {π ∈ Π | πA ≤ b} = Π_c ,
    Π^(k) = {π ∈ Π^(k−1) | πP_ϕ ∈ Π^(k−1)}
          = {π ∈ Π_c | πP_ϕ^j ∈ Π_c , 1 ≤ j ≤ k} ,    k ∈ N .
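Before characterizing Π_ϕ exactly, note that membership of a candidate π_0 can be tested empirically by iterating the safety requirement (2) for finitely many steps; a minimal sketch, with illustrative names, follows. If K is at least the termination bound of Theorem 4.2 below, the check is in fact conclusive.

```python
# A small sketch: test whether pi_0 remains in Pi_c for K steps under
# P_phi, i.e. pi_0 P_phi^j A <= b for j = 0, ..., K (names illustrative).
import numpy as np

def stays_safe(pi0, Pphi, A, b, K=100, tol=1e-12):
    pi = np.asarray(pi0, dtype=float)
    for _ in range(K + 1):
        if np.any(pi @ A > np.asarray(b) + tol):
            return False
        pi = pi @ Pphi                  # one step of the distribution dynamics
    return True
```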
To state the properties of the algorithm, we first introduce the following useful concept. The ergodicity coefficient τ(P) of a matrix P is defined as

    τ(P) := (1/2) max_{i1,i2} ‖P_i1· − P_i2·‖_1 ,

where for a vector v ∈ R^n we define ‖v‖_1 := Σ_{i=1}^n |v(i)|. It is well known (see e.g. Lemma 2.4 in [9]) that if v is nonzero and Σ_{i=1}^n v(i) = 0, then for a nonnegative matrix B ∈ R^{n×n} (one whose components are all nonnegative) we have ‖v^T B‖_1 ≤ τ(B)‖v‖_1. As a result, for any π ∈ Π we can write

    ‖(π − π*)P_ϕ‖_1 ≤ τ(P_ϕ)‖π − π*‖_1 ,

where π* is the invariant distribution of P_ϕ. Suppose q is the smallest integer such that τ(P_ϕ^q) < 1, and define

    Δ_A := max_{j∈I(b)} max_{i1,i2∈X} (A_{i1 j} − A_{i2 j}) .
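These quantities are easy to compute. The sketch below, with illustrative names, evaluates τ(P), finds the smallest q with τ(P_ϕ^q) < 1 (which Assumption 4.1 guarantees to exist), computes Δ_A, and returns the iteration bound stated in Theorem 4.2 below.

```python
# A sketch computing the ergodicity coefficient tau(P), the smallest q
# with tau(P_phi^q) < 1, Delta_A, and the bound of Theorem 4.2 below.
# Assumes eps < Delta_A, as in the robust design of Section III-A.
import numpy as np

def tau(P):
    n = P.shape[0]
    return 0.5 * max(np.abs(P[i1] - P[i2]).sum()
                     for i1 in range(n) for i2 in range(n))

def theorem_4_2_bound(Pphi, A, eps, q_max=1000):
    Pq, q = Pphi.copy(), 1
    while tau(Pq) >= 1.0:            # such q exists under Assumption 4.1
        Pq, q = Pq @ Pphi, q + 1
        if q > q_max:
            raise ValueError("no q <= q_max with tau(P^q) < 1")
    deltaA = max(A[:, j].max() - A[:, j].min() for j in range(A.shape[1]))
    return q * int(np.ceil(np.log(eps / deltaA) / np.log(tau(Pq))))
```

With the data of Example 4.7 below (q = 1, Δ_A = 832.7011, ε = 4.8485 × 10^−4, τ(P_ϕ) = .1169) this evaluates to seven, matching the paper's reported bound.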
We show in the following theorem that the algorithm terminates in finitely many steps. An upper bound on the number of steps needed for the algorithm to terminate is also provided.

Theorem 4.2: If Assumption 4.1 holds, then Algorithm 4.1 terminates in k steps, where k is the smallest integer such that Π^(k) = Π^(k+1), and

    k ≤ q · ⌈ log(ε/Δ_A) / log τ(P_ϕ^q) ⌉ ,

where ⌈x⌉ is the smallest integer not less than x. Also, Π^(k) is the maximal set of safe initial distributions.

In the following theorem we show that the maximal set of safe initial distributions Π_ϕ is actually the MISS of P_ϕ.

Theorem 4.3: Π_ϕ P_ϕ ⊆ Π_ϕ, and for any Π̂ satisfying Π̂ ⊆ Π_c and Π̂ P_ϕ ⊆ Π̂ we have Π̂ ⊆ Π_ϕ.

Algorithm 4.1 and Theorem 4.2 imply that the number of inequalities needed to characterize the MISS can amount to m × (k* + 1), in addition to the inequalities π ≥ 0^T. Intuitively, some of these inequalities might be redundant. We now propose an algorithm that removes some of the redundant inequalities. Here we use A ← B to mean that A is replaced or updated by B.

Algorithm 4.4:
step 1. i = 0, J_0 = I(b) = {1, ···, m} ;
step 2. i ← i + 1; ∀ j ∈ J_{i−1}, k_ij ← max πP^i A_·j subject to πP^r A_·s ≤ b_s, s ∈ J_r, r = 0, ···, i − 1 ;
step 3. if k_ij ≤ b_j ∀ j ∈ J_{i−1}, then stop and call the current i = i*; else J_i ← {j | k_ij > b_j}, and go to step 2.

It is easy to see that the number N(J_i) of elements in the set J_i is at most m, and the algorithm results in Σ_{i=0}^{i*} N(J_i) inequalities before it terminates, where Σ_{i=0}^{i*} N(J_i) ≤ m × (i* + 1).
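Each k_ij in step 2 is itself a small linear program over π. The sketch below, with illustrative names and a numerical tolerance of our choosing, implements Algorithm 4.4 with scipy.optimize.linprog and returns the non-redundant inequalities (pairs g, h with πg ≤ h) describing the MISS.

```python
# A sketch of Algorithm 4.4: at iteration i, for each j in J_{i-1},
# maximize pi P^i A_.j over the inequalities kept so far; keep the
# inequality pi P^i A_.j <= b_j only if the maximum k_ij exceeds b_j.
import numpy as np
from scipy.optimize import linprog

def miss_inequalities(Pphi, A, b, tol=1e-9, max_iter=100):
    n, m = A.shape
    kept_g = [A[:, j] for j in range(m)]      # constraints pi @ g <= h
    kept_h = [float(b[j]) for j in range(m)]  # J_0 = I(b)
    J, Ppow = list(range(m)), np.eye(n)
    for i in range(1, max_iter + 1):
        Ppow = Ppow @ Pphi                    # P^i
        G, h = np.array(kept_g), np.array(kept_h)
        new_g, new_h, J_new = [], [], []
        for j in J:
            d = Ppow @ A[:, j]                # coefficients of pi P^i A_.j
            res = linprog(-d, A_ub=G, b_ub=h, A_eq=np.ones((1, n)),
                          b_eq=[1.0], bounds=(0, None))
            if res.success and -res.fun > b[j] + tol:    # k_ij > b_j
                J_new.append(j); new_g.append(d); new_h.append(float(b[j]))
        if not J_new:                         # k_ij <= b_j for all j: i* = i
            return np.array(kept_g), np.array(kept_h)
        kept_g += new_g; kept_h += new_h; J = J_new
    raise RuntimeError("exceeded max_iter without terminating")
```

Per Examples 4.6 and 4.7 below, this terminates after one iteration for the looser constraint set and after two iterations for the tighter one.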
Remark 4.5: When the algorithm terminates in the first step, namely i* = 1, the MISS is actually Π_c itself. Therefore the corresponding safe policy is a safety enforcing policy. In particular, if

    A = [ I  −I ] ,    b = [ b̄  −b̲ ] ,

where I is the identity matrix of size n × n and we assume that b̲, b̄ ∈ [0, 1]^n, then we can write

    Π_c = Π(b̲, b̄) := {π ∈ Π | b̲ ≤ π ≤ b̄} .

In this case, a necessary and sufficient condition for a policy to be safety enforcing can be found in (Theorem 3.1, [3]).

Example 4.6: Suppose in Example 3.6 the given constraint set is

    Π_c = {π ∈ Π | EC ≤ (1 + 10%)C* , π(i) ≤ (1 + 10%)π*(i), i = 1, 2, 3}    (8)

where the expected cost EC := Σ_{i=1}^3 π(i)C_i. Since the optimal policy suggests the action of advanced maintenance when the system is in the E or L state, and a joint action of 1/3 basic and 2/3 advanced maintenance when the system is in the M state, we have C_1 = 300, C_2 = (1 × 500 + 2 × 1000)/3, and C_3 = 200. To calculate the maximal set Π_ϕ of safe initial distributions corresponding to the optimal policy in Example 3.6 and the constraint set Π_c in (8), we run Algorithm 4.4 and find that the algorithm terminates in the first iteration, meaning that in this case Π_ϕ = Π_c and thus the optimal policy is also a safety enforcing policy.

Example 4.7: If in Example 4.6 the constraint set is

    Π_c = {π ∈ Π | EC ≤ (1 + 1%)C* , π(i) ≤ (1 + 1%)π*(i), i = 1, 2, 3} ,

then our calculation shows that Algorithm 4.4 needs two iterations to terminate, and the resulting MISS can be expressed as {π ∈ Π | πÂ ≤ b̂}, where

    Â^T = [ 1    0       0    0
            0    1       0    0
            0    0       1    0
            300  832.70  200  0
            .08  .0367   .01  .05
            .01  .1002   .01  .05
            .01  .0367   .10  .05 ] ,

    b̂^T = 1.01 × [ π*(1)  π*(2)  π*(3)  C*  π*(1)  π*(2)  π*(3) ]^T .
Note that in this case we have, for the bound in Theorem 4.2, q = 1, Δ_A = 832.7011, ε = 4.8485 × 10^−4, and τ(P_ϕ^q) = .1169; hence Theorem 4.2 provides an upper bound of seven on the number of iterations needed for the algorithm to terminate.
B. Two-State Markov Chains

In this section we consider the special case of a two-state system. We will show that in general at most one iteration of Algorithm 4.1 is needed to obtain the MISS. This result improves the analysis of Example 3 in [2]. The transition matrix is accordingly expressed as

    P_ϕ = [ p    1−p
            1−q  q   ]    (9)

where p, q ∈ [0, 1]. Without loss of generality the constraint set reduces to Π_c := {π ∈ Π | π ≤ b = [b_1 b_2]}, where b_1, b_2 ∈ [0, 1] and b_1 + b_2 ≥ 1. To calculate the maximal set Π_ϕ of safe initial distributions corresponding to the transition matrix P_ϕ, we first consider the following special cases, whose Π_ϕ can be easily verified:

    p = q = 1  ⇒  Π_ϕ = { Π_c  if b_1 + b_2 ≥ 1 ;  ∅ otherwise } ,
    p + q = 1  ⇒  Π_ϕ = { Π_c  if p ≤ b_1 and q ≤ b_2 ;  ∅ otherwise } .

In the case of p = q = 0, let b* = min{b_1, b_2}; if b* ≥ 1/2, then

    Π_ϕ = {(x_1, x_2) ≥ 0 | x_1 + x_2 = 1, 1 − b* ≤ x_1 ≤ b*} ,

otherwise Π_ϕ = ∅. From now on, we consider the remaining cases, which satisfy the following assumption.

Assumption 4.2: The 2 × 2 transition matrix P_ϕ in (9) has entries satisfying (p, q) ≠ (1, 1), (p, q) ≠ (0, 0), p + q ≠ 1, and induces a unique limiting distribution π* that is safe.

Assumption 4.2 implies that the limiting distribution π* satisfies

    π* P_ϕ = π* = [ (1−q)/(2−p−q)   (1−p)/(2−p−q) ] ≤ b = [b_1 b_2] .    (10)

Write the kth step transition matrix as

    P_ϕ^k = [ p^k_11  p^k_12
              p^k_21  p^k_22 ] .    (11)

The following lemma establishes a monotonicity property when P_ϕ is 2 × 2.

Lemma 4.8: With the definition in (11) and under Assumption 4.2, we have for k ∈ N:

    p^k_11 − p^k_21      { < 0 if p < 1 − q and k = 1, 3, 5, ··· ;  > 0 otherwise } ,
    p^{k+1}_11 − p^k_11  { > 0 if p < 1 − q and k = 1, 3, 5, ··· ;  < 0 otherwise } ,
    p^{k+1}_21 − p^k_21  { < 0 if p < 1 − q and k = 1, 3, 5, ··· ;  > 0 otherwise } .

Now we define

    A^(k)_{=b_1} := {(x_1, x_2) ≥ (0, 0) | x_1 + x_2 = 1, x_1 p^k_11 + x_2 p^k_21 = b_1} .

Also, define M_{b_1} := {k ∈ N | A^(k)_{=b_1} ≠ ∅}. It is apparent that under Assumption 4.2 we have, for k ∈ M_{b_1},

    { p^k_11 ≤ b_1 ≤ p^k_21  if p < 1 − q and k = 1, 3, 5, ··· ;
      p^k_21 ≤ b_1 ≤ p^k_11  otherwise } .
Lemma 4.9: Suppose (x_1^(k), x_2^(k)) ∈ A^(k)_{=b_1} for k ∈ M_{b_1}. Under Assumption 4.2 we have

    x_2^(k+1) − x_2^(k)  { > 0 if p < 1 − q and k = 2, 4, 6, ··· ;  < 0 otherwise } ,
    x_2^(k+2) − x_2^(k)  { > 0 if p < 1 − q and k = 1, 3, 5, ··· ;  < 0 otherwise } .

Corollary 4.10: For k ∈ M_{b_1} define

    A^(k)_{≤b_1} := {(x_1, x_2) ≥ 0 | x_1 + x_2 = 1, x_1 p^k_11 + x_2 p^k_21 ≤ b_1} ;

then under Assumption 4.2 we have

    A^(k)_{≤b_1} ⊂ A^(k+2)_{≤b_1}  if p < 1 − q ;
    A^(k)_{≤b_1} ⊂ A^(k+1)_{≤b_1}  otherwise .

Similar arguments can be applied to the other upper bound b_2, and we obtain the following theorem.

Theorem 4.11: Under Assumption 4.2, Algorithm 4.1 terminates in one step. The MISS is Π_ϕ = Π_c for p ≥ 1 − q, and, for p < 1 − q,

    Π_ϕ = Π^(1) = {(x_1, x_2) | x_1 + x_2 = 1, x̲_2 ≤ x_2 ≤ x̄_2}    (12)

where

    x̲_2 = max{ (1 − p − b_2)/(1 − q − p) , 1 − b_1 } ,
    x̄_2 = min{ (b_1 − p)/(1 − q − p) , b_2 } .
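Theorem 4.11 makes the two-state MISS available in closed form; the following small sketch (our own naming) evaluates it.

```python
# A sketch of the two-state MISS of Theorem 4.11, returned as the
# interval [x2_lo, x2_hi] for x2 (with x1 = 1 - x2); None means empty.
# Assumes Assumption 4.2 holds (in particular b1 + b2 >= 1 and the
# invariant distribution of P_phi is safe).
def two_state_miss(p, q, b1, b2):
    if p >= 1 - q:                   # MISS is the whole constraint set Pi_c
        return max(0.0, 1 - b1), min(1.0, b2)
    x2_lo = max((1 - p - b2) / (1 - q - p), 1 - b1)
    x2_hi = min((b1 - p) / (1 - q - p), b2)
    return (x2_lo, x2_hi) if x2_lo <= x2_hi else None
```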
Remark 4.12: The conclusion of Theorem 4.11 can also be checked with the following argument. Consider a system of m linear inequalities in an unknown vector x ∈ R^N:

    Ax ≤ b ,    x ≥ 0 ,    (13)

and an additional inequality dx ≤ d_0. It is easy to see that if there exists a u ∈ R^m satisfying

    u ≥ 0 ,    d ≤ uA ,    ub ≤ d_0 ,

then dx ≤ d_0 is redundant relative to (13). We apply this observation to test the redundant inequalities generated by iterations of the algorithm. Consider only the case 1 − q > p. After one iteration we have

    [ 1
     −1 ] x_1 ≤ [ x̄_1
                  −x̲_1 ] ,    x_1 ≥ 0 ,    (14)

where

    x̄_1 = min{ b_1 , (b_2 − q)/(1 − q − p) } ,
    x̲_1 = max{ 1 − b_2 , (1 − q − b_1)/(1 − q − p) } .
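The existence test for such a u can itself be posed as a small linear program (minimize ub subject to uA ≥ d, u ≥ 0, and compare the optimum with d_0); a minimal sketch with our own naming follows.

```python
# A sketch of the redundancy test in Remark 4.12: dx <= d0 is redundant
# relative to Ax <= b, x >= 0 if some u >= 0 satisfies uA >= d, ub <= d0.
# We minimize u @ b subject to uA >= d and compare the optimum with d0.
import numpy as np
from scipy.optimize import linprog

def is_redundant(A, b, d, d0, tol=1e-9):
    # variables u (length m); uA >= d rewritten as -A^T u <= -d
    res = linprog(np.asarray(b, float), A_ub=-np.asarray(A, float).T,
                  b_ub=-np.asarray(d, float), bounds=(0, None))
    return bool(res.success and res.fun <= d0 + tol)
```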
To test whether x_1 p^m_11 + (1 − x_1) p^m_21 ≤ b_1 is redundant relative to (14), we can check whether there exist u_1 ≥ 0 and u_2 ≥ 0 such that

    p^m_11 − p^m_21 ≤ [u_1 u_2] [ 1
                                  −1 ] ,
    [u_1 u_2] [ x̄_1
                −x̲_1 ] ≤ b_1 − p^m_21 .    (15)

Note that inequality (10) in Assumption 4.2 is the same as

    p^m_21 / (1 + p^m_21 − p^m_11) ≤ b_1 ,    (1 − p^m_11) / (1 + p^m_21 − p^m_11) ≤ b_2 .    (16)

So by (16) and Lemma 4.8, if m = 2n then we can take

    [u_1 u_2] = [ (b_1 − p^{2n}_21)/b_1 ,  0 ]

as a solution of (15). If m = 2n + 1, we obtain a solution

    [u_1 u_2] = [ 0 ,  (p^{2n+1}_21 − b_1)(1 − q − p)/(1 − q − b_1) ] .
A similar argument can be applied to show, for m ≥ 2, the redundancy of x_1 p^m_12 + (1 − x_1) p^m_22 ≤ b_2. So we conclude that the MISS is Π_ϕ = {(x_1, x_2) | x_1 + x_2 = 1, x̲_1 ≤ x_1 ≤ x̄_1}, which is equivalent to (12).

V. CONCLUSION

This paper continues our prior work ([3], [2]) on controlled Markov chains with safety constraints. We generalize the constraint set to be any convex polyhedral set and propose a method to compute a safe stationary control policy that is long-run average optimal. Our method is based on solving a linear program over all occupation measures. Under a simplifying assumption that the optimal safe stationary policy is ergodic, we present a finitely terminating iterative algorithm that computes the maximal invariant safe set (MISS) of distributions for the optimal policy. This set contains all the safe distributions starting from which the future distributions remain safe under the optimal policy. An upper bound on the number of steps needed for the termination of the algorithm that computes the MISS is also presented.

REFERENCES

[1] E. Altman, Constrained Markov Decision Processes, Chapman and Hall/CRC, 1999.
[2] A. Arapostathis, R. Kumar, and S. Tangirala, "Controlled Markov chains with safety upper bounds", IEEE Transactions on Automatic Control, vol. 48, no. 7, pp. 1230–1234, July 2003.
[3] A. Arapostathis, R. Kumar, and S.-P. Hsu, "Control of Markov chains with safety bounds", IEEE Transactions on Automation Science and Engineering, vol. 2, no. 4, pp. 333–343, October 2005.
[4] A. Hordijk and L. C. M. Kallenberg, "Constrained undiscounted stochastic dynamic programming", Mathematics of Operations Research, vol. 9, no. 2, pp. 276–289, 1984.
[5] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice Hall, 1986.
[6] R. Kumar and V. K. Garg, Modeling and Control of Logical Discrete Event Systems, Kluwer Academic Publishers, MA, 1995.
[7] K. W. Ross and R. Varadarajan, "Markov decision processes with sample path constraints: the communicating case", Operations Research, vol. 37, no. 5, pp. 780–790, 1989.
[8] K. W. Ross, "Randomized and past-dependent policies for Markov decision processes with multiple constraints", Operations Research, vol. 37, no. 3, pp. 474–477, 1989.
[9] E. Seneta, Non-negative Matrices and Markov Chains, 2nd edition, Springer-Verlag, New York, NY, 1981.
[10] W. Wu, A. Arapostathis, and R. Kumar, "On non-stationary policies and maximal invariant safe sets of controlled Markov chains", Proc. IEEE Conference on Decision and Control, Nassau, Bahamas, pp. 3696–3701, December 2003.