Math Meth Oper Res (2010) 71:477–502 DOI 10.1007/s00186-010-0311-8 ORIGINAL ARTICLE
Markov control processes with pathwise constraints Armando F. Mendoza-Pérez · Onésimo Hernández-Lerma
Received: 28 January 2009 / Accepted: 11 May 2010 / Published online: 4 June 2010 © Springer-Verlag 2010
Abstract This paper deals with discrete-time Markov control processes in Borel spaces, with unbounded rewards. The criterion to be optimized is a long-run sample-path (or pathwise) average reward subject to constraints on a long-run pathwise average cost. To study this pathwise problem, we give conditions for the existence of optimal policies for the problem with "expected" constraints. Moreover, we show that the expected case can be solved by means of a parametric family of optimality equations. These results are then extended to the problem with pathwise constraints.

Keywords (Discrete-time) Markov control processes · Average reward criteria · Pathwise average reward · Constrained control problems

Mathematics Subject Classification (2000) 93E20 · 90C40
1 Introduction

This paper is about discrete-time Markov control processes (MCPs) in Borel spaces. The problem we are concerned with is to maximize a long-run pathwise (that is, sample-path) average reward subject to constraints on a long-run pathwise average cost. (For expositional convenience we consider a single constraint, but in fact our approach holds for any finite number of constraints.) To analyze our problem we
A. F. Mendoza-Pérez
Universidad Politécnica de Chiapas, Calle Eduardo J. Selvas S/N, Tuxtla Gutiérrez, Chiapas, Mexico
e-mail: [email protected]

O. Hernández-Lerma (B)
Mathematics Department, CINVESTAV-IPN, A. Postal 14-740, Mexico, DF 07000, Mexico
e-mail: [email protected]
proceed in two steps. In the first one, we give conditions for the existence of optimal policies for an average reward MCP with expected constraints. These results are extended, in the second step, to our problem with pathwise constraints. We illustrate our results with a detailed analysis of an inventory system.

Constrained MCPs form an important class of stochastic control problems with applications in many areas, including mathematical economics, signal processing, queueing systems, epidemic processes, etc.; see, for instance, Borkar (1994), Ding et al. (2003), Djonin and Krishnamurthy (2007), Feinberg and Shwartz (1996), Haviv (1996), Hernández-Lerma et al. (2003), Krishnamurthy et al. (2003), Prieto-Rumeau and Hernández-Lerma (2008), Ross and Varadarajan (1989, 1991), Vega-Amaya (Submitted), as well as the books (Altman 1999) and (Piunovskiy 1997). A common feature of these works is that almost all of them concern MCPs with expected average rewards/costs and/or countable (possibly finite) state space.

Among the few exceptions dealing with pathwise constraints we can mention the following. First of all, we have the pioneering papers by Ross and Varadarajan (1989, 1991) on MCPs with finitely many states and actions. To the best of our knowledge, these papers were the first to study pathwise constraints, sometimes called "hard constraints" (in contrast to "soft" or expected constraints; compare the pathwise problem (8)–(9) below, which uses the notation in Definition 2.1, with the expected case (37)–(38)). An interesting paper by Haviv (1996) shows by means of examples that pathwise constraints are, in general, more "natural" than expected constraints, in the sense that the latter can lead to optimal policies that do not satisfy Bellman's principle of optimality. As far as we can tell, the only paper (in addition to ours) dealing with pathwise constraints and uncountable Borel spaces is the work by Vega-Amaya (Submitted).
However, the hypotheses and general approach in that paper are completely unrelated to ours. We should note, though, that the setting in Vega-Amaya (Submitted) allows an uncountable set of constraints. Finally, we should mention the paper by Prieto-Rumeau and Hernández-Lerma (2008) on continuous-time controlled Markov chains with a countable state space. Even though that setting is quite different from ours, the approach and results in Prieto-Rumeau and Hernández-Lerma (2008) were an important motivation for our work.

The remainder of the paper is organized as follows. In Sect. 2 we recall the basic components of a Markov control model, and state some of our main assumptions. The pathwise constrained problem (CP) is introduced in Sect. 3 together with our main result, Theorem 3.4. The pathwise CP is illustrated in Sect. 4 with an application to an inventory system. Finally, Sects. 5 and 6 present the proof of Theorem 3.4. Actually, in Sect. 5 we analyze an expected constraints problem, showing that it admits an optimal policy, and then, in Sect. 6, we show that this policy is also optimal for the pathwise CP. For additional material related to constrained MCPs the interested reader may consult Mendoza-Pérez (2008).
2 The control model and main assumptions

Let (X, A, {A(x) : x ∈ X}, Q, r, c) be a discrete-time Markov control model with state space X and control (or action) set A, both assumed to be separable metric spaces with
Borel σ-algebras B(X) and B(A), respectively. For each x ∈ X there is a nonempty set A(x) in B(A) which represents the set of feasible actions in the state x. The set

K := {(x, a) : x ∈ X, a ∈ A(x)}    (1)

is assumed to be a Borel subset of X × A. The transition law Q is a stochastic kernel on X given K. The one-stage reward r and the one-stage cost c are real-valued measurable functions on K. We interpret r as a reward to be maximized with the restriction that the cost c does not exceed (in a suitably defined sense) a given value.

The class of measurable functions f : X → A such that f(x) is in A(x) for every x ∈ X is denoted by F, and we suppose that it is nonempty. Let Φ be the set of stochastic kernels ϕ on A given X for which ϕ(A(x)|x) = 1 for all x ∈ X.

Control policies. For every n = 0, 1, . . ., let Hn be the family of admissible histories up to time n; that is, H0 := X, and Hn := K^n × X if n ≥ 1. A control policy is a sequence π = {πn} of stochastic kernels πn on A given Hn such that πn(A(xn)|hn) = 1 for every n-history hn = (x0, a0, . . . , xn−1, an−1, xn) in Hn. The class of all policies is denoted by Π. Moreover, a policy π = {πn} is said to be a

(a) randomized stationary policy if there exists a stochastic kernel ϕ ∈ Φ such that πn(·|hn) = ϕ(·|xn) for all hn ∈ Hn and n = 0, 1, . . .;
(b) deterministic stationary policy if there exists f ∈ F such that πn(·|hn) is the Dirac measure at f(xn) ∈ A(xn) for all hn ∈ Hn and n = 0, 1, . . .

Therefore, we have F ⊂ Φ ⊂ Π. Following a standard convention, we identify Φ with the class of randomized stationary policies and F with the class of deterministic stationary policies. Given ϕ ∈ Φ, we will use the following notation:

rϕ(x) := ∫_A r(x, a)ϕ(da|x),  cϕ(x) := ∫_A c(x, a)ϕ(da|x),    (2)

Qϕ(·|x) := ∫_A Q(·|x, a)ϕ(da|x)    (3)

for all x ∈ X. Moreover, the n-step transition probabilities are denoted by Qϕ^n, with Qϕ^1(·|x) := Qϕ(·|x) and Qϕ^0(·|x) := δx, the Dirac measure concentrated at the initial state x. We can write Qϕ^n recursively as

Qϕ^n(·|x) = ∫_X Qϕ(·|y)Qϕ^{n−1}(dy|x),  n ≥ 1.    (4)

In particular, for a deterministic policy f ∈ F, (2)–(3) become

rf(x) = r(x, f(x)),  cf(x) = c(x, f(x)),  Qf(·|x) = Q(·|x, f(x)).
Let (Ω, F) be the (canonical) measurable space consisting of the sample space Ω := (X × A)^∞ and its product σ-algebra F. Then for each policy π ∈ Π and initial
state x ∈ X, a stochastic process {(xn, an)} and a probability measure Px^π are defined on (Ω, F) in a canonical way, where xn and an represent the state and the control at time n, n = 0, 1, . . .. The expectation operator with respect to Px^π is denoted by Ex^π. Given π ∈ Π, x ∈ X, and n = 1, 2, . . ., we define the n-stage pathwise reward and the n-stage expected reward as

Sn(π, x) := Σ_{k=0}^{n−1} r(xk, ak)  and  Jn(π, x) := Ex^π[Sn(π, x)],
respectively. Replacing the reward function r with the cost c we obtain the definitions of Sc,n(π, x) and Jc,n(π, x).

Definition 2.1 The (long-run) pathwise average reward and the (long-run) expected average reward are given by

S(π, x) := lim inf_{n→∞} (1/n)Sn(π, x)  and  J(π, x) := lim inf_{n→∞} (1/n)Jn(π, x),

respectively. Similarly, the pathwise average cost and the expected average cost are respectively defined as

Sc(π, x) := lim sup_{n→∞} (1/n)Sc,n(π, x)  and  Jc(π, x) := lim sup_{n→∞} (1/n)Jc,n(π, x).
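The two criteria in Definition 2.1 can be compared numerically. Below is a minimal sketch, assuming a hypothetical two-state, two-action model (all numbers are our own illustrative choices, not from the paper): the n-stage pathwise average (1/n)Sn(π, x) is estimated by simulating the chain under a deterministic stationary policy, and it settles near the expected average obtained from the chain's invariant distribution.

```python
import random

# Hypothetical two-state toy MCP (illustrative data, not from the paper):
# states {0, 1}, actions {0, 1}.
Q = {  # Q[(x, a)] = probability of moving to state 1 from state x under action a
    (0, 0): 0.3, (0, 1): 0.7,
    (1, 0): 0.4, (1, 1): 0.9,
}
r = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.5}

def simulate_average(policy, x0, n, rng):
    """(1/n) * S_n(pi, x0): the n-stage pathwise average reward."""
    x, total = x0, 0.0
    for _ in range(n):
        a = policy(x, rng)
        total += r[(x, a)]
        x = 1 if rng.random() < Q[(x, a)] else 0
    return total / n

rng = random.Random(0)
f = lambda x, rng: x          # deterministic stationary policy f(x) = x
est = simulate_average(f, 0, 200_000, rng)
# Under f the chain has P(0->1) = 0.3 and P(1->1) = 0.9, whose invariant
# distribution is mu = (0.25, 0.75); hence g(f) = 0.25*1.0 + 0.75*1.5 = 1.375.
print(est)   # close to 1.375
```

Under the ergodicity assumptions introduced below, the pathwise and expected averages of a stationary policy coincide, which is what the simulation illustrates.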
Observe that J(π, x) (and S(π, x)) is defined as a "lim inf", whereas Jc(π, x) (and Sc(π, x)) is a "lim sup". This is because, according to standard conventions, the function r is interpreted as a reward-per-stage function, whereas the function c is a cost-per-stage.

We will introduce four sets of hypotheses. The first one, Assumption 2.2, consists of standard continuity-compactness conditions (see, for instance, Gordienko and Hernández-Lerma 1995; Hernández-Lerma and Lasserre 1999; Hernández-Lerma et al. 1999; Puterman 1994) together with a growth condition on the one-step reward r and the cost c (see (b1)), and the Lyapunov-like condition (b3).

Assumption 2.2 For every state x ∈ X:
(a) A(x) is a compact subset of A;
(b) there exist a measurable function W ≥ 1 on X, a bounded measurable function b ≥ 0, and nonnegative constants r1, c1, and β, with β < 1, such that
(b1) |r(x, a)| ≤ r1 W(x), |c(x, a)| ≤ c1 W(x) ∀(x, a) ∈ K;
(b2) ∫_X W(y)Q(dy|x, a) is continuous in a ∈ A(x); and
(b3) ∫_X W(y)Q(dy|x, a) ≤ βW(x) + b(x) for every x ∈ X.
To state our second set of hypotheses, we will use the following notation, where W is the function in Assumption 2.2(b): BW(X) denotes the normed linear space of measurable functions u on X with a finite W-norm ‖u‖W, which is defined as

‖u‖W := sup_{x∈X} |u(x)|/W(x).    (5)

In this case we say that u is W-bounded. Similarly, we say that a function v : K → R belongs to BW(K) if x → sup_{a∈A(x)} |v(x, a)| is in BW(X). In particular, by Assumption 2.2(b1), r(x, a) and c(x, a) are both in BW(K). We write

μ(u) := ∫_X u(y)μ(dy)

whenever the integral is well defined.

The next set of hypotheses guarantees that the MCP has a nice "stable" behavior uniformly in Φ.

Assumption 2.3 For each randomized stationary policy ϕ ∈ Φ:
(a) (W-geometric ergodicity) There exists a (necessarily unique) probability measure μϕ on X such that (with Qϕ^t as in (3)–(4))

|∫_X u(y)Qϕ^t(dy|x) − μϕ(u)| ≤ ‖u‖W R ρ^t W(x)    (6)

for every t = 0, 1, . . ., u ∈ BW(X), and x ∈ X, where R > 0 and 0 < ρ < 1 are constants independent of ϕ.
(b) (Irreducibility) There exists a σ-finite measure ν on B(X) with respect to which Qϕ is ν-irreducible, which means that if B ∈ B(X) is such that ν(B) > 0, then for every x ∈ X there exists t > 0 for which Qϕ^t(B|x) > 0.

Remark 2.4 (i) By Assumptions 2.3(a) and 2.2(b3), we have that

μϕ(W) ≤ b/(1 − β) < ∞ ∀ϕ ∈ Φ,    (7)

with b = sup_{x∈X} b(x). Moreover, by (6), J(ϕ, x) and Jc(ϕ, x) in Definition 2.1 are constant (that is, they do not depend on the initial state x), and verify that

J(ϕ, x) = lim_{n→∞} (1/n) Ex^ϕ Σ_{k=0}^{n−1} r(xk, ak) = μϕ(rϕ) =: g(ϕ)

(where the letter g is an abbreviation for "gain", which is another standard name for "average reward"; see Prieto-Rumeau and Hernández-Lerma 2008; Puterman 1994), and
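The identity J(ϕ, x) = μϕ(rϕ) = g(ϕ) reduces the expected average reward of a randomized stationary policy to a stationary-distribution computation. A sketch for a hypothetical two-state, two-action model (illustrative numbers, not from the paper), where ϕ is parameterized by the probability of choosing action 1 in each state:

```python
# Hypothetical toy instance (not from the paper): two states, two actions.
Q1 = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.9}  # P(next = 1 | x, a)
r = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.5}

def gain(phi):
    """g(phi) = mu_phi(r_phi), for phi = {x: P(a = 1 | x)}."""
    # Averaged kernel and reward, as in (2)-(3):
    p01 = (1 - phi[0]) * Q1[(0, 0)] + phi[0] * Q1[(0, 1)]  # P(0 -> 1)
    p11 = (1 - phi[1]) * Q1[(1, 0)] + phi[1] * Q1[(1, 1)]  # P(1 -> 1)
    r0 = (1 - phi[0]) * r[(0, 0)] + phi[0] * r[(0, 1)]
    r1 = (1 - phi[1]) * r[(1, 0)] + phi[1] * r[(1, 1)]
    # Invariant distribution of the induced two-state chain Q_phi:
    mu1 = p01 / (p01 + 1 - p11)
    return (1 - mu1) * r0 + mu1 * r1

print(gain({0: 0.0, 1: 1.0}))   # the deterministic policy f(x) = x gives 1.375
```

The same routine with c in place of r would compute gc(ϕ), which is what the constraint in the problems below is imposed on.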
Jc(ϕ, x) = lim_{n→∞} (1/n) Ex^ϕ Σ_{k=0}^{n−1} c(xk, ak) = μϕ(cϕ) =: gc(ϕ).
(ii) Sufficient conditions for Assumption 2.3 are well known for deterministic stationary policies; see, for instance, (Vega-Amaya 2003, Theorem 3.6), (Hernández-Lerma and Lasserre 1999, Proposition 10.2.5), (Gordienko and Hernández-Lerma 1995, Lemmas 3.3 and 3.4).

(iii) Pick an arbitrary policy ϕ ∈ Φ. By Assumptions 2.2 and 2.3, the Markov chain with transition probability Qϕ is positive Harris recurrent, and so the strong law of large numbers for Markov chains holds; see, for instance, (Hernández-Lerma and Lasserre 1999, Theorem 11.2.1), (Hernández-Lerma and Lasserre 2003, Theorems 4.3.1 and 4.2.13), or (Meyn and Tweedie 1993, p. 411). Hence, in particular, for every function u ∈ BW(X) and every initial state x ∈ X,

lim_{n→∞} (1/n) Σ_{k=0}^{n−1} u(xk) = μϕ(u)  Px^ϕ-a.s.,

with μϕ as in Assumption 2.3(a). Moreover, the latter pathwise limit coincides with the limit of the expected values

lim_{n→∞} (1/n) Ex^ϕ Σ_{k=0}^{n−1} u(xk) = μϕ(u).

(iv) Under Assumptions 2.2, 2.3, and 3.2 below, if in (iii) we replace u(·) with rϕ(·) or cϕ(·) (see (2)), then from part (i) above and from Definition 2.1 we obtain (see, for instance, Proposition 2.8 in Mendoza-Pérez and Hernández-Lerma 2009)

S(ϕ, x) = lim_{n→∞} (1/n) Σ_{k=0}^{n−1} rϕ(xk) = μϕ(rϕ) = g(ϕ)  Px^ϕ-a.s.,

and

Sc(ϕ, x) = lim_{n→∞} (1/n) Σ_{k=0}^{n−1} cϕ(xk) = μϕ(cϕ) = gc(ϕ)  Px^ϕ-a.s.
3 Main results

Now we are ready to define our pathwise constrained problem (CP). Let θ ∈ R be a given constant. With the notation in Definition 2.1, and using the standard abbreviation "a.s." for "almost surely", we want to maximize the pathwise
average reward S(π, x) over the set of all policies π ∈ Π satisfying, for every initial state x ∈ X, the following constraint on the pathwise average cost:

Sc(π, x) ≤ θ  Px^π-a.s.

Hence, we can explicitly state our pathwise CP as follows:

maximize S(π, x)    (8)
subject to:  π ∈ Π and Sc(π, x) ≤ θ  Px^π-a.s. ∀x ∈ X.    (9)
A policy π ∈ Π is said to be feasible for the pathwise CP if it satisfies (9). The following definition uses the notation g(ϕ) and gc(ϕ) in Remark 2.4(i) and (iv).

Definition 3.1 Let ϕ∗ ∈ Φ be a feasible policy for the pathwise CP, i.e., gc(ϕ∗) ≤ θ. (See Remark 2.4(iv).) Then ϕ∗ is said to be optimal for the pathwise CP (8)–(9) if for each feasible π ∈ Π we have S(π, x) ≤ g(ϕ∗) Px^π-a.s. If ϕ∗ is an optimal policy for the CP (8)–(9), then we define the optimal value of the CP as V∗(θ) := g(ϕ∗).

Further assumptions. In the following assumption we strengthen the growth condition on the reward function r in Assumption 2.2(b1).

Assumption 3.2 There exist positive constants r2 and c2 such that

r(x, a)² ≤ r2 W(x)  and  c(x, a)² ≤ c2 W(x)  ∀(x, a) ∈ K.

Actually, since W ≥ 1, Assumption 3.2 implies Assumption 2.2(b1). In the remainder of this paper we consider the function

w(x) := √W(x)  ∀x ∈ X.

We now state our last assumption.

Assumption 3.3 (a) The transition law Q is strongly continuous on K; that is, the mapping

(x, a) → ∫_X v(y)Q(dy|x, a)

is continuous on K for every measurable bounded function v on X.
(b) The cost function c is nonnegative and lower semicontinuous (l.s.c.) on K.
(c) The reward function r is upper semicontinuous (u.s.c.) on K.
(d) The function w, seen as a function (x, a) → w(x) on K, is continuous. Moreover, w is a so-called moment function on K; that is, there exists a nondecreasing sequence of compact sets Kn ↑ K such that

lim_{n→∞} inf{w(x) : (x, a) ∉ Kn} = ∞.
By Assumption 3.3(b) and Remark 2.4(i), we can define

θmin := inf_{ϕ∈Φ} ∫_X cϕ(y)μϕ(dy)  and  θmax := sup_{ϕ∈Φ} ∫_X cϕ(y)μϕ(dy),

which are finite numbers. To avoid trivial situations, we will assume that the constant θ in (9) verifies

θmin < θ < θmax.    (10)
We can now state our main result, which is proved in Sects. 5 and 6.

Theorem 3.4 Suppose that Assumptions 2.2, 2.3, 3.2 and 3.3 hold. Then:
(i) There exists an optimal policy ϕ∗ ∈ Φ for the pathwise CP (8)–(9). In particular, gc(ϕ∗) ≤ θ and g(ϕ∗) = V∗(θ), with V∗(θ) as in Definition 3.1.
(ii) There exist λ0 ≤ 0 and h ∈ Bw(X) such that the average reward optimality equation (AROE)

V∗(θ) + h(x) = max_{a∈A(x)} [r(x, a) + λ0·(c(x, a) − θ) + ∫_X h(y)Q(dy|x, a)]
             = rϕ∗(x) + λ0·(cϕ∗(x) − θ) + ∫_X h(y)Qϕ∗(dy|x)    (11)

holds for every x ∈ X. Furthermore, the following "orthogonality" property is satisfied:

(gc(ϕ∗) − θ)·λ0 = 0.    (12)

(iii) For each λ ≤ 0, let (ρ(λ), hλ) ∈ R × Bw(X) be a solution to the AROE

ρ(λ) + hλ(x) = max_{a∈A(x)} [r(x, a) + λ·(c(x, a) − θ) + ∫_X hλ(y)Q(dy|x, a)]

for every x ∈ X. Then V∗(θ) = min_{λ≤0} ρ(λ).
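Part (iii) of Theorem 3.4 suggests a computational scheme: solve the unconstrained AROE for the modified reward r + λ·(c − θ) over a grid of multipliers λ ≤ 0 and minimize ρ(λ). The sketch below does this by relative value iteration for a hypothetical finite model (all data are illustrative assumptions, not from the paper):

```python
# For each lam <= 0 we solve the AROE for the reward r + lam*(c - theta)
# by relative value iteration, obtaining rho(lam); the minimum over a grid
# of lam values then upper-approximates the constrained value V*(theta).
S, A = 2, 2
P = [[[0.7, 0.3], [0.3, 0.7]],        # P[x][a][y]: transition probabilities
     [[0.6, 0.4], [0.1, 0.9]]]
r = [[1.0, 0.5], [2.0, 1.5]]          # reward r(x, a)
c = [[0.0, 2.0], [1.0, 3.0]]          # cost c(x, a)
theta = 1.2

def rho(lam, iters=2000):
    """Average reward of the unconstrained MCP with reward r + lam*(c - theta)."""
    h = [0.0] * S
    g = 0.0
    for _ in range(iters):
        Th = [max(r[x][a] + lam * (c[x][a] - theta)
                  + sum(P[x][a][y] * h[y] for y in range(S))
                  for a in range(A)) for x in range(S)]
        g = Th[0] - h[0]              # gain estimate at the reference state 0
        h = [Th[x] - Th[0] for x in range(S)]   # span normalisation
    return g

lams = [-i * 0.01 for i in range(301)]          # grid over lam in [-3, 0]
V = min(rho(l) for l in lams)
print(V)   # an approximation of V*(theta) from above (grid may miss lambda_0)
```

Since ρ(λ) ≥ V∗(θ) for every λ ≤ 0, the grid minimum always lies above the true constrained value; refining the grid around the minimizer tightens the approximation.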
We conclude this section with some comments on how to verify the hypotheses of Theorem 3.4. The easiest thing to do is to start with Assumption 2.2(a) and Assumption 3.3(a), (b), (c), because these conditions are usually given by (or immediately deduced from) the control model itself. Then we should determine the function W(·) in Assumption 3.2, which also gives Assumption 2.2(b1). This function is usually a polynomial or an exponential (see, for instance, (20) below). Verification of Assumption 2.2(b2), (b3) usually follows from direct calculations, and so does Assumption 3.3(d). This completes the verification of Assumptions 2.2, 3.2, and 3.3 [cf. Lemma 4.3(a)]. Assumption 2.3 is more complicated. As noted in Remark 2.4(ii), sufficient conditions for Assumption 2.3 are well known when dealing with deterministic stationary policies f ∈ F. In our present "constrained" case, we usually take one of those sufficient conditions and then verify that it holds in the larger set of randomized stationary policies. Lemmas 4.4–4.8 show how this is done for a particular MCP.

4 Example: an inventory system

As an example, in this section we consider an inventory-production system studied by several authors, for instance Hernández-Lerma and Lasserre (1999), Hernández-Lerma and Vega-Amaya (1998), Vega-Amaya and Montes-de-Oca (1998), Vega-Amaya (1998). (Other examples can be seen in Mendoza-Pérez 2008.) In this inventory-production system the stock level xt evolves in X := [0, ∞) according to the equation

xt+1 = max(xt + at − zt, 0),  t = 0, 1, . . . ,    (13)

for some given initial stock level x0. Here at is the amount of product ordered (and immediately supplied) at the beginning of period t, whereas zt denotes the product's demand during that period. The production variables at are supposed to take values in the interval A := [0, Λ], for some given constant Λ > 0, irrespective of the stock level; that is,

A(x) = A  ∀x ∈ X.    (14)

Additionally, we suppose that the demand process {zt}, with values in Z := [0, ∞), and its distribution G satisfy the following assumption.

Assumption 4.1 The demand process {zt} is a sequence of i.i.d. random variables, independent of the initial state x0. Moreover, the demand distribution G is such that:
(a) G has a continuous bounded density G′;
(b) G has a finite mean value z̄ := E(z0) = ∫0^∞ z G(dz) = ∫0^∞ z G′(z) dz < ∞, where E denotes the expectation with respect to G;
(c) Λ < z̄.

Note that Assumption 4.1(c) states that the average demand z̄ should exceed the maximum allowed production Λ. This assumption, in addition to being a convenient
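Assumption 4.1(c) is what keeps the stock level stable under any admissible ordering policy. A quick simulation of the recursion (13), assuming an exponential demand with mean z̄ = 2 and capacity Λ = 1 (illustrative choices only, consistent with Λ < z̄), shows the stock repeatedly returning to 0 even when the controller always orders at full capacity:

```python
import random

# Simulating the stock recursion (13) under illustrative assumptions:
# demand z_t ~ Exponential(mean zbar = 2), capacity Lambda = 1 < zbar,
# and the constant policy a_t = Lambda (order at full capacity each period).
rng = random.Random(1)
Lambda, zbar = 1.0, 2.0

x = 5.0                      # initial stock x_0
visits_at_zero = 0
n = 100_000
for _ in range(n):
    z = rng.expovariate(1.0 / zbar)      # expovariate takes the rate 1/mean
    x = max(x + Lambda - z, 0.0)
    visits_at_zero += (x == 0.0)

# Because Lambda < zbar the stock has negative drift, so the chain keeps
# hitting the regeneration state 0 (the stabilising role of Assumption 4.1(c)).
print(visits_at_zero / n)    # a substantial fraction of periods end at stock 0
```

The atom at x = 0 produced this way is exactly what the minorization argument of Lemma 4.2 below exploits, through the measure ν = δ0.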
technical condition, reflects the fact that we wish to bound from above the combined production and maintenance costs. Hence it excludes some frequently encountered cases that require the opposite, Λ ≥ z̄.

To complete our control model, we introduce a reward-per-stage function r that represents a net reward of the form

sales revenue − (production cost + maintenance cost),

given by

r(x, a) := s·E min(x + a, z0) − [p·a + m·(x + a)]    (15)

for all (x, a) ∈ K, where p, m and s are positive constants, and K := X × A. The unit production cost p and the unit maintenance cost m do not exceed the unit sale price s, i.e.,

p, m ≤ s.    (16)

Hence, the corresponding cost-per-stage function c is given by

c(x, a) := p·a + m·(x + a)  ∀(x, a) ∈ K.    (17)

We can verify that

E min(x + a, z0) = (x + a)[1 − G(x + a)] + ∫0^{x+a} z G(dz).    (18)
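Identity (18) is easy to sanity-check numerically. The sketch below assumes an exponential demand distribution (our illustrative choice, which satisfies Assumption 4.1(a)–(b); the paper does not fix G) and compares the right-hand side of (18), evaluated in closed form, against a Monte Carlo estimate of E min(x + a, z0):

```python
import math, random

# Illustrative exponential demand: G(z) = 1 - exp(-lam*z), lam = 1/zbar.
zbar = 2.0
lam = 1.0 / zbar

def emin_formula(y):
    """Right-hand side of (18) with y = x + a: y*(1 - G(y)) + int_0^y z G(dz)."""
    tail = y * math.exp(-lam * y)                      # y * (1 - G(y))
    # int_0^y z * lam * exp(-lam*z) dz, in closed form for the exponential:
    integral = (1.0 / lam) * (1.0 - math.exp(-lam * y)) - y * math.exp(-lam * y)
    return tail + integral

def emin_mc(y, n=200_000, seed=0):
    """Monte Carlo estimate of E min(y, z0)."""
    rng = random.Random(seed)
    return sum(min(y, rng.expovariate(lam)) for _ in range(n)) / n

y = 1.5   # y = x + a
print(emin_formula(y), emin_mc(y))   # the two values agree
```

For the exponential case both sides collapse to (1 − e^{−λy})/λ, which makes the agreement easy to verify by hand as well.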
Thus, the functions r and c in (15) and (17) are continuous on K.

On the other hand, let Γ(r) := E exp[r(Λ − z0)] be the moment generating function of the random variable Λ − z0. Note that Γ(0) = 1 and, by Assumption 4.1(c), Γ′(0) = E(Λ − z0) = Λ − z̄ < 0. Hence, there is a positive number r̂ such that

β := Γ(r̂) < 1.    (19)

We define the weight function

W(x) := exp[r̂(x + 2z̄)]  ∀x ∈ X.    (20)

Then W ≥ 1, and by a straightforward calculation using Assumption 4.1(c), (16) and (18), we can see that there exist constants r1 and c̄ such that

|r(x, a)| ≤ r1 W(x)  and  |c(x, a)| ≤ c̄ W(x)  ∀(x, a) ∈ K.
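For a concrete demand distribution the constants r̂ and β in (19)–(20) can be computed directly. Assuming an exponential demand with mean z̄ (again an illustrative choice), the moment generating function of Λ − z0 has the closed form e^{rΛ}·λ/(λ + r) with λ = 1/z̄, and a small positive r̂ already gives β < 1:

```python
import math

# Computing the contraction factor beta in (19) for an illustrative
# exponential demand (rate lam = 1/zbar) with capacity Lambda < zbar.
Lambda, zbar = 1.0, 2.0
lam = 1.0 / zbar

def Gamma(r):
    """E exp[r*(Lambda - z0)] = exp(r*Lambda) * lam/(lam + r), valid for r > -lam."""
    return math.exp(r * Lambda) * lam / (lam + r)

# Gamma(0) = 1 and Gamma'(0) = Lambda - zbar < 0, so some r_hat > 0 works:
r_hat = 0.5
beta = Gamma(r_hat)
print(beta)                  # strictly less than 1

def W(x):
    """Weight function (20)."""
    return math.exp(r_hat * (x + 2 * zbar))
```

Any r̂ with Γ(r̂) < 1 is admissible; choosing r̂ near the minimizer of Γ gives the smallest contraction factor β and hence the sharpest drift inequality in Lemma 4.2(c).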
Moreover, we can check that the functions r, c belong to Bw(K), where w := √W. Hence, Assumption 2.2(a), (b1), Assumption 3.3(b)–(d), and Assumption 3.2 are satisfied.

For notational purposes, given a function u on X, let

ũ(x, a) := E[u(xt+1)|xt = x, at = a] = ∫_X u(y)Q(dy|x, a).

Therefore, by (13),

ũ(x, a) = u(0)[1 − G(x + a)] + ∫0^{x+a} u(x + a − z)G′(z)dz
        = u(0)[1 − G(x + a)] + ∫0^{x+a} u(z)G′(x + a − z)dz    (21)

for every bounded measurable function u on X. Then Assumptions 2.2(b2) and 3.3(a) hold.

It only remains to prove that Assumptions 2.2(b3) and 2.3 are satisfied. To this end, define for every (x, a) ∈ K and B ∈ B(X)

l(x, a) := 1 − G(x + a)  and  ν(B) := δ0(B),    (22)

with δ0 the Dirac measure at x = 0. The following lemma summarizes important facts.

Lemma 4.2 Under Assumption 4.1, we have:
(a) Q(B|x, a) ≥ l(x, a)ν(B) for all B ∈ B(X), (x, a) ∈ K;
(b) ν(W) < ∞;
(c) for all (x, a) ∈ K,

∫_X W(y)Q(dy|x, a) ≤ βW(x) + l(x, a)ν(W);

(d) there exists a constant γ > 0 such that ν(lϕ) ≥ γ for all ϕ ∈ Φ.

Proof Part (a) follows from (21) and (22) taking u(·) = 1B(·), with B ∈ B(X). Part (b) follows easily from the definition of ν in (22). From (21) again, with u = W,

∫_X W(y)Q(dy|x, a) = W(0)[1 − G(x + a)] + ∫0^{x+a} W(x + a − z)G′(z)dz
                   = ν(W)l(x, a) + W(x) ∫0^{x+a} exp[r̂(a − z)]G(dz).
Therefore, since r̂(a − z) ≤ r̂(Λ − z) for all a ∈ A, by (19) we get

∫_X W(y)Q(dy|x, a) ≤ ν(W)l(x, a) + βW(x)  ∀(x, a) ∈ K.

This proves part (c). To prove (d), with l(x, a) and ν as in (22), observe that for each ϕ ∈ Φ

lϕ(x) ≥ 1 − G(x + Λ).

Thus, integrating both sides with respect to ν = δ0,

ν(lϕ) ≥ 1 − G(Λ)  ∀ϕ ∈ Φ.    (23)

To conclude, we only need to show that G(Λ) < 1. By contradiction, if G(Λ) = 1, then

z̄ = ∫0^∞ z G(dz) = ∫0^Λ z G(dz) + ∫Λ^∞ z G(dz) ≤ Λ·G(Λ) = Λ,

which contradicts Assumption 4.1(c).
Lemma 4.3 Suppose that Assumption 4.1 holds, and let θmin and θmax be as in (10). Then:
(a) the inventory-production system (13)–(17) satisfies Assumptions 2.2, 3.2, and 3.3;
(b) θmin = 0 and Λ(p + m) ≤ θmax.

Proof (a) By Lemma 4.2, it suffices to prove Assumption 2.2(b3). This assumption, however, follows from Lemma 4.2(c).
(b) By (17), the cost c(x, a) is nonnegative and so θmin ≥ 0. Now consider the constant policy f(x) := 0 for all x ≥ 0. Observe that cf(x) = c(x, 0) = mx. We will show that μf(cf) = gc(f) = 0, which implies θmin = 0. To prove this, let M be an upper bound for the density G′ in Assumption 4.1(a). Then a straightforward induction argument using (21) yields
Ex^f[cf(xn)] = ∫_X cf(y)Qf^n(dy|x) ≤ (m/M)·(Mx)^{n+1}/(n + 1)!

for every x ≥ 0 and n = 0, 1, . . . As n → ∞, the rightmost term in the latter inequality converges to zero [because exp(Mx) = Σ_{n≥0} (Mx)^n/n! is finite] and so, by (6), μf(cf) = 0. Finally, to obtain the second inequality in (b), consider the constant policy f(x) := Λ for all x ≥ 0. Hence

cf(x) = Λ(p + m) + mx ≥ Λ(p + m)  ∀x ≥ 0.

This gives μf(cf) ≥ Λ(p + m), which in turn gives θmax ≥ Λ(p + m).
In general, it can be difficult to compute the constants θmin and θmax in (10), and so Lemma 4.3(b) gives us estimates of the range of admissible values of the constraint parameter θ. In particular, in the remainder of this section we will replace (10) with the inequality

0 < θ < Λ(p + m).

It only remains to verify that the inventory-production system (13)–(17) satisfies Assumption 2.3. To this end we will use the following lemma, which is a special case of Theorem 2.1.4 in Mendoza-Pérez (2008).

Lemma 4.4 Suppose that Assumption 4.1 holds. Then, for each ϕ ∈ Φ, the Markov chain defined by the transition probability Qϕ(·|·) in the inventory-production system (13)–(17) is ν-irreducible [with ν as in (22)] and positive Harris recurrent; hence, it admits a unique invariant probability measure, say μϕ, such that μϕ(W) < ∞.

We will need the next three lemmas, which are Lemmas 3.1, 3.3, and 3.4 from Gordienko and Hernández-Lerma (1995) adapted to our present context. The following Lemma 4.5 is a direct consequence of Lemma 4.2.

Lemma 4.5 If Assumption 4.1 holds, then for every ϕ ∈ Φ we have the following:
(a) γ ≤ ν(lϕ) ≤ 1, with γ as in Lemma 4.2(d);
(b) 1 ≤ μϕ(W) ≤ b0, with b0 := ν(W)/(1 − β);
(c) b1 ≤ μϕ(lϕ) ≤ 1, with b1 := 1/b0 = (1 − β)/ν(W).

Let MW(X) be the normed linear space of all finite signed measures μ on X with finite norm

‖μ‖W := ∫_X W(x)|μ|(dx),

where |μ| stands for the total variation of μ. Further, for every ϕ ∈ Φ, the transition probability Qϕ defines a linear mapping from MW(X) into itself given by

Qϕμ(·) := ∫_X Qϕ(·|x)μ(dx)  ∀μ ∈ MW(X).    (24)
We also define

‖Qϕ‖W := sup{‖Qϕμ‖W : μ ∈ MW(X) and ‖μ‖W ≤ 1}.

Observe that, by Lemma 4.2(c),

‖Qϕ‖W ≤ ν(W) + β,

with β as in (19). Finally, recalling Lemma 4.2(a), let

τ(B|x, a) := Q(B|x, a) − ν(B)l(x, a) ≥ 0.    (25)

By Lemma 4.2(c), τ is contractive in the sense that

∫_X W(y)τ(dy|x, a) ≤ βW(x)  ∀(x, a) ∈ K.    (26)

Lemma 4.6 Under Assumption 4.1, for each ϕ ∈ Φ we have:
(E1) ν(lϕ) > 0, μϕ(lϕ) > 0;
(E2) the kernel τϕ(B|x) = Qϕ(B|x) − lϕ(x)ν(B) is nonnegative;
(E3) ‖τϕ‖W := sup{‖τϕμ‖W : μ ∈ MW, ‖μ‖W ≤ 1} ≤ β, for β as in (19) and Lemma 4.2(c).

Proof (E1) follows from Lemma 4.5(a) and (c), whereas (E2) and (E3) follow from (25) and (26), respectively.

Let N be the function on (0, 1] × (0, 1] defined as

N(s, r) := exp(−((1 − s)/s)·(log r)/(1 − r))  for 0 < s, r ≤ 1.    (27)
From the conditions (E1)–(E3) in Lemma 4.6 and the Corollary to Theorem 6 in Kartashov (1985), we obtain the following.

Lemma 4.7 Under Assumption 4.1, for any ϕ ∈ Φ,

‖Qϕ^t − Πϕ‖W ≤ β^t + ζ0^{t+1}·(t + 2)e(1 + σ)/β    (28)

for all t > ζ0/(1 − ζ0), where Πϕ denotes the stationary projector of μϕ, defined as

Πϕμ(·) := μ(X)μϕ(·)  ∀μ ∈ MW(X),

and σ and ζ0 are positive constants satisfying
σ ≤ b0 (with b0 as in Lemma 4.5(b)),    (29)
ζ0 = 1 − (1 − β)/(1 + βσκ(ϕ)), and    (30)
κ(ϕ) = 2N(μϕ(lϕ), ν(lϕ)) − 1,    (31)
with N as in (27).

Lemma 4.8 Under Assumption 4.1, there exist positive constants R and ρ, with ρ < 1, such that for every x ∈ X and t = 0, 1, . . . ,

sup_{ϕ∈Φ} ‖Qϕ^t δx − μϕ‖W ≤ R W(x) ρ^t,    (32)

where δx denotes the Dirac measure concentrated at x.

Proof Let N and κ(ϕ) be as in (27) and (31), and define κ∗ := 2N(b1, γ) − 1, with b1 as in Lemma 4.5(c) and γ as in Lemmas 4.2(d) and 4.5(a). Thus

κ(ϕ) ≤ κ∗  ∀ϕ ∈ Φ,    (33)

since N(s, r) is decreasing in s ∈ (0, 1] and r ∈ (0, 1). Now let b0 be as in Lemma 4.5(b) and define

ζ∗ = 1 − (1 − β)/(1 + βκ∗b0) < 1.

Then (29)–(30), together with (33), imply ζ0 ≤ ζ∗. Hence, since κ∗ and ζ∗ are independent of ϕ ∈ Φ, the inequality (28) yields

sup_{ϕ∈Φ} ‖Qϕ^t − Πϕ‖W ≤ β^t + ζ∗^{t+1}·(t + 2)e(1 + b0)/β    (34)

for all t > ζ∗/(1 − ζ∗). The estimate (34) shows, as in the proof of Lemma 3.4 in Gordienko and Hernández-Lerma (1995), that there exist positive constants R and ρ, with ρ < 1, satisfying (32). These constants can be estimated in terms of the quantities β, ν(W) and γ appearing in Lemma 4.2.

From Lemmas 4.4 and 4.8 we obtain the following.

Lemma 4.9 Under Assumption 4.1, the inventory-production system (13)–(17) satisfies Assumption 2.3.

To conclude, by Lemmas 4.2, 4.3 and 4.9, together with Theorem 3.4, we have:
Proposition 4.10 Suppose that Assumption 4.1 holds. Then the inventory-production system (13)–(17) has a pathwise constrained optimal policy. Moreover, for each λ ≤ 0, let (ρ(λ), hλ) ∈ R × BW(X) be a solution to the AROE

hλ(x) + ρ(λ) = sup_{a∈A(x)} [rλ(x, a) + ∫_X hλ(y)Q(dy|x, a)]    (35)

with rλ(x, a) := r(x, a) + (c(x, a) − θ)·λ; then the constrained optimal value V∗(θ) satisfies

V∗(θ) = min_{λ≤0} ρ(λ).    (36)
5 MCPs with expected constraints

To prove Theorem 3.4 we need some results about optimal policies for MCPs with expected constraints. Some of these results are similar to others available in the literature (for instance, compare our Theorems 5.2 and 5.3 below with Theorems 4.9 and 4.10 in (24)), but we cannot use the latter results here because their assumptions are different from ours.

Let J(π, x) and Jc(π, x) be the long-run expected averages in Definition 2.1, and let θ be a constant as in (10). Then the expected constraints problem (ECP) we are concerned with is:

maximize J(π, x)    (37)
subject to:  π ∈ Π and Jc(π, x) ≤ θ ∀x ∈ X.    (38)
In the remainder of the paper we will show that this ECP is solvable, and that a solution to it is also a solution to the pathwise CP (8)–(9).

Definition 5.1 A policy π ∈ Π is said to be feasible for the ECP if it satisfies the constraints in (38), that is, Jc(π, x) ≤ θ for all x in X. Moreover, a feasible policy π∗ is called optimal for the ECP if J(π, x) ≤ J(π∗, x) for every feasible π.

Let Φfeas ⊂ Φ be the class of feasible randomized stationary policies, i.e.,

Φfeas := {ϕ ∈ Φ : Jc(ϕ, x) ≤ θ ∀x ∈ X}.

We will show that this set is nonempty. Henceforth, v∗(θ, x) will designate the optimal value function of the ECP (37)–(38) in the class Φfeas, that is,

v∗(θ, x) := sup_{ϕ∈Φfeas} J(ϕ, x).    (39)
The following theorem states the existence of an optimal policy for the ECP. Furthermore, it establishes the existence of a solution to the average reward optimality equation (AROE) (41).

Theorem 5.2 Suppose that Assumptions 2.2, 2.3, 3.2 and 3.3 are satisfied. Then:
(i) There exist λ0 ≤ 0, a constant V(θ) which depends on θ, and h ∈ Bw(X) such that the AROE

V(θ) + h(x) = max_{a∈A(x)} [r(x, a) + λ0·(c(x, a) − θ) + ∫_X h(y)Q(dy|x, a)]    (40)

holds for every x ∈ X.
(ii) There exists an optimal randomized stationary policy ϕ∗ ∈ Φ for the ECP (37)–(38) that attains the maximum on the right-hand side of (40), i.e.,

V(θ) + h(x) = rϕ∗(x) + λ0·(cϕ∗(x) − θ) + ∫_X h(y)Qϕ∗(dy|x)    (41)

for all x ∈ X. Moreover, the following "orthogonality" property (using the notation in Remark 2.4(i)) is satisfied:

(gc(ϕ∗) − θ)·λ0 = 0,    (42)

which together with (41) gives

V(θ) = μϕ∗(rϕ∗) = g(ϕ∗).    (43)
Theorem 5.2 shows that the ECP (37)–(38) induces a non-constrained problem depending on a real number λ0 ≤ 0, which is unknown. The next theorem shows that the ECP can be solved by means of a parametric family of AROEs.

Theorem 5.3 Suppose that the hypotheses of Theorem 5.2 are satisfied, and consider the ECP (37)–(38). For each real number λ ≤ 0, let (ρ(λ), hλ) ∈ R × BW(X) be a solution to the AROE

ρ(λ) + hλ(x) = max_{a∈A(x)} [r(x, a) + (c(x, a) − θ)·λ + ∫_X hλ(y)Q(dy|x, a)]    (44)

for every x ∈ X. Then

V(θ) = min_{λ≤0} ρ(λ).    (45)
In the remainder of this section we prove Theorems 5.2 and 5.3. To this end, we first introduce some notation and establish some √preliminary facts. Let W be as in Assumption 2.2, and w := W . We denote by Pw (K) the set of probability measures μ on B(K) such that w(x)μ(d(x, a)) < ∞. K
This set is supposed to be endowed with the w-weak topology (Föllmer and Schied 2002, Appendix A.5), that is, the smallest topology for which the mapping μˆ → v d μˆ K
on Pw (K) is continuous for every v ∈ Cw (K), where Cw (K) is the linear subspace of Bw (K) that consists of the continuous functions on K. Since X and A are separable metric spaces (see Sect. 2), then K is a separable and metrizable space and, therefore, Pw (K) is separable and metrizable. For every ϕ ∈ , let μϕ be as in Assumption 2.3(a), and define μˆ ϕ ∈ Pw (K) as μˆ ϕ (B × C) := ϕ(C|x)μϕ (d x) ∀B ∈ B(X), C ∈ B(A). B
The set of all of these measures is denoted by , i.e., := {μˆ ϕ : ϕ ∈ } ⊂ Pw (K)
(46)
Moreover, for each θ ∈ [θmin , θmax ], with θmin and θmax as in (10), let ⎧ ⎫ ⎨ ⎬ (θ ) := μˆ ∈ : c d μˆ ≤ θ . ⎩ ⎭ K
It can be verified that Φ̂ and Φ̂(θ) are convex sets. Furthermore, after some calculations (see Mendoza-Pérez 2008, Lemma 5.2.2, for details) and using Prohorov's theorem (Föllmer and Schied 2002, Appendix A.5) it follows that Φ̂ and Φ̂(θ) are both compact sets in the w-weak topology.

Lemma 5.4 For each θ in [θmin, θmax], let

V(θ) := sup { ∫K r dμ̂ : μ̂ ∈ Φ̂(θ) }
      = sup { ∫K r dμ̂ : μ̂ ∈ Φ̂ and ∫K c dμ̂ ≤ θ }.   (47)
(a) The mapping θ → V(θ) is concave and nondecreasing.
(b) The function v∗(θ, x) ≡ v∗(θ) in (39) does not depend on x ∈ X, and V(θ) = v∗(θ) for all θ ∈ (θmin, θmax).
(c) There is a number λ0 ≤ 0 such that

V(θ) = sup_{μ̂∈Φ̂} ∫K [r + (c − θ) · λ0] dμ̂.

(d) There exists ϕ ∈ Φ such that μ̂ϕ attains the maximum in (47), i.e.,

∫K c dμ̂ϕ ≤ θ  and  V(θ) = ∫K r dμ̂ϕ.
Proof (a) This follows from the convexity of Φ̂.
(b) By Remark 2.4(i), for every ϕ ∈ Φ and x ∈ X we have

J(ϕ, x) = μϕ(rϕ) ≡ g(ϕ),  and  Jc(ϕ, x) = μϕ(cϕ) ≡ gc(ϕ).   (48)

This gives the first statement v∗(θ, ·) ≡ v∗(θ), while the second follows from (48), (47), and (39).
(c) For every μ̂ ∈ Φ̂ let

θ̂ := ∫K c dμ̂  and  μ̂(r) := ∫K r dμ̂.
By part (a), the function V is concave and nondecreasing, and the pair (θ̂, μ̂(r)) is in the hypograph of V. Let λ0 ≤ 0 be a real number such that −λ0 is a superdifferential of V at θ (see, for instance, Ekeland and Temam 1976, Chapter 1); that is,

V(θ) ≥ V(θ̂) + (θ̂ − θ) · λ0.

Hence

V(θ) ≥ μ̂(r) + (θ̂ − θ) · λ0 = ∫K [r + (c − θ) · λ0] dμ̂.

Since ∫K (c − θ) · λ0 dμ̂ ≥ 0 for all μ̂ ∈ Φ̂(θ), it follows that

V(θ) ≥ sup_{μ̂∈Φ̂} ∫K [r + (c − θ) · λ0] dμ̂ ≥ sup_{μ̂∈Φ̂(θ)} ∫K r dμ̂ = V(θ),

by (47). This proves (c).
(d) By standard arguments (see, for instance, Mendoza-Pérez 2008, Lemma 5.2.5), the mapping μ̂ → ∫K r dμ̂ is upper semicontinuous on Pw(K). Therefore, by the compactness of Φ̂(θ), there exists ϕ ∈ Φ that satisfies (d).

Proof of Theorem 5.2(i) By Lemma 5.4(c) and the definition (46) of Φ̂, V(θ) is the optimal value of an average reward MCP with reward-per-stage function r + (c − θ) · λ0, which is in BW(K). Therefore, by well-known dynamic programming results (see, for instance, Hernández-Lerma and Lasserre 1999, Section 10.3), there exists a function h in BW(X) such that (V(θ), h) is a solution of the corresponding AROE, that is, for every x ∈ X:

V(θ) + h(x) = max_{a∈A(x)} [ r(x, a) + (c(x, a) − θ) · λ0 + ∫X h(y) Q(dy|x, a) ].   (49)

Moreover, by Lemma 5.4(b), V(θ) = v∗(θ). Hence (49) is the same as (40). By Lemma 11.3.9 in Hernández-Lerma and Lasserre (1999), there exist constants R0 > 0 and ρ0 ∈ (0, 1) such that the geometric ergodicity condition in Assumption 2.3 is satisfied when W, R, and ρ are replaced with w, R0 and ρ0, respectively. This fact together with our Assumption 3.2 gives that the function r + (c − θ) · λ0 is in Bw(K), and also that the function h in (49) belongs to Bw(X). This completes the proof of Theorem 5.2(i).

Proof of Theorem 5.2(ii) By our continuity and compactness conditions in Assumptions 2.2 and 3.3, usual measurable selection theorems (see Hernández-Lerma and Lasserre 1996, Appendix D, for instance) give the existence of a stationary deterministic policy f ∈ F such that for every x ∈ X, the action f(x) ∈ A(x) attains the maximum on the right-hand side of (49), i.e.,

V(θ) + h(x) = rf(x) + (cf(x) − θ) · λ0 + ∫X h(y) Qf(dy|x).   (50)
To complete the proof of Theorem 5.2(ii) we will proceed in two main steps. First, we will show that the randomized policy ϕ ∈ Φ in Lemma 5.4(d) is an optimal policy for the ECP (37)–(38). Second, we use this ϕ ∈ Φ and the f ∈ F in (50) to construct a ϕ∗ ∈ Φ that satisfies the conclusions in Theorem 5.2(ii).

Step 1. By (49), we have, for all (x, a) ∈ K,

V(θ) + h(x) ≥ r(x, a) + (c(x, a) − θ) · λ0 + ∫X h(y) Q(dy|x, a).   (51)
This implies that for every π ∈ Π and x ∈ X,

V(θ) ≥ J(π, x) + [Jc(π, x) − θ] · λ0.   (52)
Let us suppose now that π ∈ Π is a feasible policy for the ECP. Hence, by (38) and the fact that λ0 ≤ 0 (by Lemma 5.4(c)),

[Jc(π, x) − θ] · λ0 ≥ 0,   (53)

which gives that V(θ) ≥ J(π, x) for every x ∈ X and every π ∈ Π as in (38). Finally, taking π = ϕ, with ϕ ∈ Φ as in Lemma 5.4(d), we conclude that ϕ is optimal for the ECP.

Step 2. Note that taking π = ϕ in (52) we obtain (recalling (48))

[Jc(ϕ, x) − θ] · λ0 = [μϕ(cϕ) − θ] · λ0 ≤ 0.

This inequality and (53) give

[μϕ(cϕ) − θ] · λ0 = ∫K (c(x, a) − θ) · λ0 μ̂ϕ(d(x, a)) = 0.   (54)
On the other hand, integrating both sides of (51) with respect to ϕ(da|x) it follows that

V(θ) + h(x) ≥ rϕ(x) + (cϕ(x) − θ) · λ0 + ∫X h(y) Qϕ(dy|x),   (55)

and so

∫X [ V(θ) + h(x) − rϕ(x) − (cϕ(x) − θ) · λ0 − ∫X h(y) Qϕ(dy|x) ] μϕ(dx) = −[μϕ(cϕ) − θ] · λ0 = 0 (by (54)).

Consequently, by (55), there exists a Borel set N ∈ B(X) with measure μϕ(N) = 0 and such that

V(θ) + h(x) = rϕ(x) + (cϕ(x) − θ) · λ0 + ∫X h(y) Qϕ(dy|x)  ∀x ∈ N^c,   (56)
with N^c = X \ N the complement of N. Now define the new policy

ϕ∗(·|x) := IN(x) δf(x)(·) + IN^c(x) ϕ(·|x)  ∀x ∈ X,   (57)

where IB denotes the indicator function of a set B, and δf(x)(·) is the Dirac measure concentrated at f(x). Therefore, in particular, for every x ∈ N^c we have

ϕ∗(·|x) = ϕ(·|x) and Qϕ∗(·|x) = Qϕ(·|x),
which give

μϕ∗ = μϕ  and  μ̂ϕ∗ = μ̂ϕ.   (58)

The latter equality (and replacing ϕ with ϕ∗ in Lemma 5.4(d)) implies that ϕ∗ in (57) is optimal for the ECP. Moreover, from (57) we immediately obtain (41). Indeed, (57) and (56) give (41) if x ∈ N^c, while (57) and (50) give (41) if x ∈ N. Finally, by (58), integration of both sides of (41) with respect to μϕ∗ (and recalling (48)) gives (42). This completes the proof of Theorem 5.2.

Proof of Theorem 5.3 Fix an arbitrary λ ≤ 0, and suppose that (44) holds. Then

ρ(λ) + hλ(x) ≥ r(x, a) + (c(x, a) − θ) · λ + ∫X hλ(y) Q(dy|x, a)   (59)
for all (x, a) ∈ K. Now, let ϕ∗ ∈ Φ be as in Theorem 5.2(ii), and integrate both sides of (59) with respect to ϕ∗(da|x), and then with respect to μϕ∗(dx), to obtain

ρ(λ) ≥ ∫X [rϕ∗(x) + (cϕ∗(x) − θ) · λ] μϕ∗(dx) = ∫K [r + (c − θ) · λ] dμ̂ϕ∗ ≥ V(θ).
(60)
≤0
Now take = 0 as in Theorem 5.2(i). Then comparison of the AROEs (40) and (44) with = 0 , together with the “uniqueness” of solution to an AROE (see Hernández-Lerma and Lasserre 1999, Section 10.3.C, for instance), shows that V (θ ) = ρ( 0 ).
This fact and (60) give (45). 6 Proof of Theorem 3.4
By Theorem 5.2, we know that there exists an optimal (randomized) stationary policy ϕ∗ ∈ Φ for the ECP (37)–(38). In particular, by (43) (using the notation (48)) we have

g(ϕ∗) = ∫K r dμ̂ϕ∗ = V(θ)  and  gc(ϕ∗) = ∫K c dμ̂ϕ∗ ≤ θ.   (61)
By Remark 2.4(iv), ϕ∗ is a feasible policy for the CP (8)–(9). The key idea to prove Theorem 3.4 is to show that this optimal policy ϕ∗ is also optimal for the pathwise CP (8)–(9). To this end we need some preliminary results.

Let us consider an arbitrary policy π ∈ Π, and a function h : X → R such that h² ∈ BW(X); equivalently, h ∈ Bw(X). For k, n = 1, 2, . . ., let

Yk := h(xk) − Exπ[h(xk)|xk−1, ak−1] = h(xk) − Exπ[h(xk)|hk−1, ak−1],   (62)

with hk−1 ∈ Hk−1 being the admissible history up to time k − 1, and

Mn := Σ_{k=1}^{n} Yk.   (63)
The next two lemmas are from Hernández-Lerma et al. (1999); see also Hernández-Lerma and Lasserre (1999), Lemmas 11.3.10 and 11.3.11.

Lemma 6.1 Let x0 = x be an (arbitrary) initial state. Then {Mn}n≥1 is a square-integrable Pxπ-martingale with respect to the σ-algebras

Fn = σ{x0, a0, . . . , xn−1, an−1, xn, an} = σ{hn, an}.

Moreover, as k → ∞, k⁻¹ w(xk) → 0 Pxπ-a.s.

Lemma 6.2 The Pxπ-martingale {Mn} satisfies

lim_{n→∞} (1/n) Mn = 0  Pxπ-a.s.
Lemmas 6.1 and 6.2 give the following.

Lemma 6.3 Given a policy π ∈ Π, an initial state x ∈ X, and an arbitrary function h ∈ Bw(X), as n → ∞ we have

(1/n) Σ_{k=0}^{n−1} Lπ h(xk) → 0  Pxπ-a.s.,   (64)

with

Lπ h(xk) := Exπ[h(xk+1)|hk, ak] − h(xk) = Exπ[h(xk+1)|Fk] − h(xk).
Proof From (62)–(63),

Mn = [h(xn) − h(x)] − Σ_{k=0}^{n−1} Lπ h(xk).   (65)

Moreover, Lemmas 6.1 and 6.2 give

(1/n) h(xn) → 0  and  (1/n) Mn → 0  Pxπ-a.s.

as n → ∞. These limits and (65) imply (64).
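Lemma 6.3 can be checked numerically. The minimal simulation sketch below (Python; the two-state chain, the function h, and all numbers are hypothetical, chosen only for illustration) averages Lπh(xk) along one simulated path of a chain under a fixed stationary policy; the average is driven to zero, as (64) asserts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state chain under a fixed stationary policy
# (illustrative numbers, not taken from the paper).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
h = np.array([1.0, -2.0])      # an arbitrary function h on X = {0, 1}

# L h(x) = E[h(x_{k+1}) | x_k = x] - h(x_k)
Lh = P @ h - h

n = 100_000
x = 0                          # initial state
total = 0.0
for _ in range(n):
    total += Lh[x]
    x = rng.choice(2, p=P[x])  # one transition of the chain

avg = total / n
print(avg)                     # close to 0, as (64) predicts
```

The decomposition (65) explains why: the running sum of Lπh(xk) differs from the martingale Mn only by the bounded term h(xn) − h(x), so dividing by n kills both contributions.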
Proof of Theorem 3.4 From Theorem 5.2, let ϕ∗ ∈ Φ be an optimal policy for the ECP (37)–(38).

Proof of (i). Let λ0 ≤ 0 and h ∈ Bw(X) be as in Theorem 5.2. Fix an arbitrary randomized policy π ∈ Π. Define

r̃(x, a) := r(x, a) + λ0 · (c(x, a) − θ)  ∀(x, a) ∈ K.   (66)

Then the AROE (40) gives

V(θ) ≥ r̃(x, a) + ∫X h(y) Q(dy|x, a) − h(x)  ∀(x, a) ∈ K.
Therefore, for every k = 0, 1, . . ., and every initial state x ∈ X,

V(θ) ≥ r̃(xk, ak) + ∫X h(y) Q(dy|xk, ak) − h(xk)
     = r̃(xk, ak) + Exπ[h(xk+1)|hk, ak] − h(xk)
     = r̃(xk, ak) + Lπ h(xk)  Pxπ-a.s.,

with Lπ h(xk) as in Lemma 6.3. Hence, from (66) and Definition 2.1, for each n = 1, 2, . . . we have

V(θ) ≥ (1/n) Sn(π, x) + λ0 · [ (1/n) Sc,n(π, x) − θ ] + (1/n) Σ_{k=0}^{n−1} Lπ h(xk)  Pxπ-a.s.
Taking the limit as n → ∞, by Lemma 6.3 and the fact that λ0 ≤ 0,

V(θ) − λ0 · (Sc(π, x) − θ) ≥ lim sup_{n→∞} [ (1/n) Sn(π, x) + (1/n) Σ_{k=0}^{n−1} Lπ h(xk) ]
  ≥ lim inf_{n→∞} (1/n) Sn(π, x) + lim_{n→∞} (1/n) Σ_{k=0}^{n−1} Lπ h(xk)
  = S(π, x)  Pxπ-a.s.

That is,

V(θ) ≥ S(π, x) + λ0 · (Sc(π, x) − θ)  Pxπ-a.s.
for all x ∈ X and π ∈ Π. Finally, if π verifies (9), i.e., Sc(π, x) ≤ θ Pxπ-a.s., then S(π, x) ≤ V(θ) Pxπ-a.s., which together with (61) completes the proof of part (i). In particular, V(θ) coincides with the optimal value V∗(θ) for the CP (8)–(9) as in Definition 3.1.

Parts (ii) and (iii) follow from Theorem 5.2(i)–(ii) and Theorem 5.3, respectively.

References

Altman E (1999) Constrained Markov decision processes. Chapman & Hall/CRC, Boca Raton
Borkar VS (1994) Ergodic control of Markov chains with constraints—the general case. SIAM J Control Optim 32:176–186
Ding Y, Jia R, Tang S (2003) Dynamical principal agent model based on CMCP. Math Methods Oper Res 58:149–157
Djonin DV, Krishnamurthy V (2007) MIMO transmission control in fading channels—a constrained Markov decision process formulation with monotone randomized policies. IEEE Trans Signal Process 55:5069–5083
Ekeland I, Temam R (1976) Convex analysis and variational problems. North-Holland, Amsterdam
Feinberg E, Shwartz A (1996) Constrained discounted dynamic programming. Math Oper Res 21:922–945
Föllmer H, Schied A (2002) Stochastic finance. An introduction in discrete time. Walter de Gruyter & Co, Berlin
Gordienko E, Hernández-Lerma O (1995) Average cost Markov control processes with weighted norms: existence of canonical policies. Appl Math (Warsaw) 23:199–218
Haviv M (1996) On constrained Markov decision processes. Oper Res Lett 19:25–28
Hernández-Lerma O, González-Hernández J, López-Martínez RR (2003) Constrained average cost Markov control processes in Borel spaces. SIAM J Control Optim 42:442–468
Hernández-Lerma O, Lasserre JB (1996) Discrete-time Markov control processes: basic optimality criteria. Springer, New York
Hernández-Lerma O, Lasserre JB (1999) Further topics on discrete-time Markov control processes. Springer, New York
Hernández-Lerma O, Lasserre JB (2003) Markov chains and invariant probabilities. Birkhäuser Verlag, Basel
Hernández-Lerma O, Vega-Amaya O (1998) Infinite-horizon Markov control processes with undiscounted cost criteria: from average to overtaking optimality. Appl Math (Warsaw) 25:153–178
Hernández-Lerma O, Vega-Amaya O, Carrasco G (1999) Sample-path optimality and variance-minimization of average cost Markov control processes. SIAM J Control Optim 38:79–93
Kartashov NV (1985) Inequalities in theorems of ergodicity and stability of Markov chains with common phase space. II. Theory Probab Appl 30:507–515
Krishnamurthy V, Vázquez Abad F, Martin K (2003) Implementation of gradient estimation to a constrained Markov decision problem. In: 42nd IEEE conference on decision and control, pp 4841–4846
Mendoza-Pérez A (2008) Pathwise average reward Markov control processes. Doctoral thesis, CINVESTAV-IPN, México. Available at http://www.math.cinvestav.mx/ohernand_students
Mendoza-Pérez A, Hernández-Lerma O (2009) Markov control processes with pathwise constraints (longer version). Available at http://www.math.cinvestav.mx/sites/default/files/art-MMOR.pdf
Meyn SP, Tweedie RL (1993) Markov chains and stochastic stability. Springer, London
Piunovskiy AB (1997) Optimal control of random sequences in problems with constraints. Kluwer, Boston
Prieto-Rumeau T, Hernández-Lerma O (2008) Ergodic control of continuous-time Markov chains with pathwise constraints. SIAM J Control Optim 47:1888–1908
Puterman ML (1994) Markov decision processes. Wiley, New York
Ross KW, Varadarajan R (1989) Markov decision processes with sample path constraints. Oper Res 37:780–790
Ross KW, Varadarajan R (1991) Multichain Markov decision processes with a sample path constraint. Math Oper Res 16:195–207
Vega-Amaya O (1998) Markov control processes in Borel spaces: undiscounted criteria. Doctoral thesis, UAM-Iztapalapa, México (in Spanish)
Vega-Amaya O, Montes-de-Oca R (1998) Application of average dynamic programming to inventory systems. Math Methods Oper Res 47:451–471
Vega-Amaya O (2003) The average cost optimality equation: a fixed point approach. Bol Soc Mat Mexicana 9:185–195
Vega-Amaya O, Expected and sample-path constrained average Markov decision processes. Internal Report no. 35, Departamento de Matemáticas, Universidad de Sonora (Submitted)