Zero-Sum Average Semi-Markov Games: Fixed Point Solutions of the Shapley Equation∗ Oscar Vega-Amaya Departamento de Matem´aticas Universidad de Sonora May 2002

Abstract. This paper deals with zero-sum average semi-Markov games with Borel state and action spaces, and unbounded payoffs and mean holding times. A solution of the Shapley equation is obtained via the Banach Fixed Point Theorem, assuming that the model satisfies a Lyapunov-like condition and a growth hypothesis on the payoff function and the mean holding time, besides standard continuity and compactness requirements.

Key words. Zero-sum semi-Markov games, average payoff criterion, Lyapunov conditions, fixed-point approach.

AMS subject classification. 90D10, 90D20, 93E05.

1  Introduction

Several recent papers have used variants of a Lyapunov-like condition to solve average payoff optimization problems for Markovian systems with unbounded payoff and Borel state and action spaces (see, e.g., [9], [13], [14] for Markov models; [15], [20], [28] for semi-Markov models; [11], [16], [23] for zero-sum Markov games; and [17] for zero-sum semi-Markov games). The key property used in all these papers is that the imposed Lyapunov condition yields the so-called weighted geometric ergodicity (WGE) property, which is a generalization of the standard uniform geometric ergodicity in Markov chain theory (see [10], [12] and [21] for a detailed discussion of these concepts). Roughly speaking, in these papers the WGE property is combined, explicitly or implicitly, either with the vanishing discount factor approach or with some variant of the policy iteration algorithm to prove the main results. These facts are the first main difference with the present paper since, in spite of imposing a similar stability condition, we use instead a "fixed-point approach" which does not rely, at least explicitly, on the WGE property.

* This research was supported by CONACyT (México) under Grant 28309-E.

The fixed-point approach allows us to obtain the Shapley equation directly, which in turn yields the existence of a stationary optimal strategy pair or saddle point; see Theorem 4.7(a) and (b). In contrast, the approaches followed in [11], [16], [23] first show the existence of a stationary saddle point and then establish the Shapley equation. On the other hand, [20], [15] and [17] resort to auxiliary models related to the original one; more precisely, [20] uses the so-called Schweitzer data transformation [26], while the analysis in [15] and [17] relies on certain perturbed models.

A second key difference concerns the times between two consecutive decision epochs. In contrast with discrete-time Markov control processes and Markov games, the decision epochs in semi-Markov control processes are random; thus it is necessary to ensure that such processes experience only finitely many transitions in each finite time period. This is usually done by assuming that the mean holding time function is bounded below by a positive constant, even in the discrete state space case (see, e.g., [2], [5], [19], [24] and their references). In particular, this condition plays a crucial role in the approaches followed in [28], [15], [17] and [20]; in fact, in the last three of these references it is also assumed that the mean holding time function is bounded above by a constant, while in the present paper it is only assumed that this function is positive.

It is important to mention that, as a by-product, the fixed-point approach yields a minimax characterization of a certain solution of the Shapley equation (Theorem 4.7(c)) which, seemingly, has not been previously discussed in the literature dealing with zero-sum stochastic games. We should also mention that the fixed-point approach has been used in several earlier papers (see, e.g., [7], [12], [18], [25]) but under much stronger ergodicity conditions, which, in particular, exclude the case of unbounded payoffs.
The variant of the Lyapunov condition we consider here was recently introduced in [27] for Markov control processes and used in [8] to study minimax problems. In fact, the present paper extends to zero-sum semi-Markov games the results of the two latter references. For brief surveys of the existing literature on stochastic games with finite or denumerable state space, the reader may consult [1], [3], [6], [7] and [19].

The remainder of the paper is organized as follows. The semi-Markov game model and the (ratio) expected average payoff criterion are introduced in Sections 2 and 3, respectively. The assumptions and main results are stated in Section 4. The proofs of all results are given in Sections 5 and 6.

2  The Game Model

Throughout the paper we shall use the following notation. Given a Borel space S (that is, a Borel subset of a complete separable metric space), B(S) denotes its Borel σ-algebra and "measurability" always means measurability with respect to B(S). The class of all probability measures on S is denoted by P(S). Given two Borel spaces S and S′, a stochastic kernel ϕ(·|·) on S given S′ is a function such that ϕ(·|s′) is in P(S) for each s′ ∈ S′, and ϕ(B|·) is a measurable function on S′ for each B ∈ B(S). Moreover, R+ := [0, +∞) stands for the set of nonnegative real numbers, and N (resp. N0) denotes the set of positive (resp. nonnegative) integers.

The semi-Markov game model. This paper is concerned with a zero-sum semi-Markov game modeled by

(X, A, B, KA, KB, Q, F, r),

where X is the state space, and the sets A and B are the action spaces for players 1 and 2, respectively. It is assumed that all these sets are Borel spaces. The constraint sets KA and KB are Borel subsets of X × A and X × B, respectively. Thus, for each x ∈ X, the x-sections

A(x) := {a ∈ A : (x, a) ∈ KA},  B(x) := {b ∈ B : (x, b) ∈ KB},

stand for the sets of admissible actions or controls for players 1 and 2, respectively. Now let

K := {(x, a, b) : x ∈ X, a ∈ A(x), b ∈ B(x)},

which, by [22], is a Borel subset of X × A × B. The transition law Q(·|·) of the system is a stochastic kernel on X given K. For each (x, a, b) ∈ K and y ∈ X, F(·|x, a, b, y) is a distribution function on R+ := [0, +∞), and F(t|·) is a measurable function on K × X for each t ∈ R+. Finally, the payoff r is a measurable function on K × R+.

The game is played over an infinite horizon as follows: at time t = 0 the game is observed in some state x0 = x ∈ X and the players independently choose controls a0 = a ∈ A(x0) and b0 = b ∈ B(x0). Then the system remains in state x0 = x for a nonnegative random time δ1 and player 1 receives the amount r(x, a, b, δ1) from player 2. At time δ1 the system jumps to a new state x1 = x′ ∈ X according to the probability measure Q(·|x, a, b). The distribution of the random variable δ1, given that the system has jumped into state x′, is F(·|x, a, b, x′); that is,

F(t|x, a, b, x′) = Pr[δ1 ≤ t | x0 = x, a0 = a, b0 = b, x1 = x′]  ∀t ∈ R+.

Thus, given that x0 = x, a0 = a and b0 = b, the distribution of δ1 is

G(t|x, a, b) := ∫_X F(t|x, a, b, y) Q(dy|x, a, b),  ∀t ∈ R+, (x, a, b) ∈ K,

and it is called the holding time distribution. Immediately after the transition occurs, the players again choose controls, say a1 = a′ ∈ A(x1) and b1 = b′ ∈ B(x1), and the above process is repeated over and over again.
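As a numerical aside, the dynamics just described are easy to exercise on a toy model. The sketch below simulates one trajectory {(x_n, a_n, b_n, δ_{n+1})}; the two-state transition law, the exponential holding times and the payoff rate are illustrative assumptions, not taken from the model above.

```python
# Simulation of the semi-Markov game dynamics on a toy model; every numeric
# ingredient below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

n_states = 2
# Q[x, a, b] = distribution of the next state (the transition law Q(.|x,a,b)).
Q = np.array([[[[0.7, 0.3], [0.4, 0.6]],
               [[0.5, 0.5], [0.2, 0.8]]],
              [[[0.6, 0.4], [0.3, 0.7]],
               [[0.9, 0.1], [0.5, 0.5]]]])
# rate[x, a, b]: delta ~ Exponential(rate), so the mean holding time is
# 1/rate > 0; here F(.|x, a, b, y) does not depend on the next state y.
rate = np.array([[[1.0, 2.0], [0.5, 1.5]],
                 [[2.0, 1.0], [1.0, 0.8]]])

def payoff(x, a, b, delta):
    # r(x, a, b, t), chosen to accrue linearly in the holding time.
    return (1.0 + x - 0.5 * a + 0.3 * b) * delta

def simulate(x0, n_steps):
    """Run n_steps decision epochs with uniformly random (pure) actions."""
    x, T, total, jump_times = x0, 0.0, 0.0, []
    for _ in range(n_steps):
        a = int(rng.integers(2))                      # player 1's action
        b = int(rng.integers(2))                      # player 2's action
        delta = rng.exponential(1.0 / rate[x, a, b])  # holding time delta_{n+1}
        total += payoff(x, a, b, delta)               # amount paid to player 1
        T += delta                                    # T_n = T_{n-1} + delta_n
        jump_times.append(T)
        x = int(rng.choice(n_states, p=Q[x, a, b]))   # jump according to Q
    return jump_times, total

times, total = simulate(x0=0, n_steps=200)
```

The transition epochs T_n are strictly increasing and, since the mean holding times are bounded away from zero in this toy model, only finitely many transitions occur in any finite time period (the regularity issue discussed in the Introduction).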

This procedure yields a stochastic process {(xn, an, bn, δn+1)} where, for each n ∈ N0, xn is the state of the system, an and bn are the actions of players 1 and 2, respectively, and δn+1 is the holding time at state xn. The goal of player 1 (resp. player 2) is to maximize (resp. minimize) his/her flow of rewards (resp. costs)

r(x0, a0, b0, δ1), r(x1, a1, b1, δ2), ...

over an infinite horizon using the "expected average reward (cost) criterion" defined by (5) below. The functions on K given by

τ(x, a, b) := ∫_0^{+∞} t G(dt|x, a, b),        (1)

R(x, a, b) := ∫_0^{+∞} r(x, a, b, t) G(dt|x, a, b),        (2)

are called the mean holding time and the mean payoff, respectively.

Strategies. Let H0 := X and Hn := K × R+ × Hn−1 for n ∈ N. Then, for each n ∈ N0, a generic element of Hn is denoted by

hn := (x0, a0, b0, δ1, ..., xn−1, an−1, bn−1, δn, xn),

which can be thought of as the history of the game up to the time of the nth transition

Tn := Tn−1 + δn,  n ∈ N,        (3)

where T0 := 0. Thus a strategy for player 1 is a sequence π1 = {πn1} of stochastic kernels πn1 on A given Hn satisfying the constraint

πn1(A(xn)|hn) = 1  ∀hn ∈ Hn, n ∈ N0.

The class of all strategies for player 1 is denoted by Π1. For each x ∈ X, let A(x) := P(A(x)), and denote by Φ1 the class of all stochastic kernels ϕ1 on A given X such that ϕ1(·|x) ∈ A(x) for all x ∈ X. A strategy π1 is called stationary if

πn1(·|hn) = ϕ1(·|xn)  ∀hn ∈ Hn, n ∈ N0,

for some stochastic kernel ϕ1 in Φ1. Following a standard convention, Φ1 is identified with the class of stationary strategies for player 1. The sets of strategies Π2 and Φ2 for player 2 are defined in a similar way, writing B(x) and B(x) instead of A(x) and A(x), respectively.

Let (Ω, F) be the (canonical) measurable space consisting of the sample space Ω := (K × R+)∞ and its product σ-algebra. Then, for each strategy pair (π1, π2) ∈ Π1 × Π2 and each "initial state" x ∈ X, there exists a probability measure Px^{π1,π2} defined on (Ω, F) which governs the evolution of the stochastic process {(xn, an, bn, δn+1)}. The expectation operator with respect to the probability measure Px^{π1,π2} is denoted by Ex^{π1,π2}.

Throughout the paper we shall use the following notation: for a measurable function u on K and a stationary strategy pair (ϕ1, ϕ2) ∈ Φ1 × Φ2, let

u_{ϕ1,ϕ2}(x) := ∫_{B(x)} ∫_{A(x)} u(x, a, b) ϕ1(da|x) ϕ2(db|x)  ∀x ∈ X.        (4)

Thus, in particular, we shall write

R_{ϕ1,ϕ2}(x) := ∫_{B(x)} ∫_{A(x)} R(x, a, b) ϕ1(da|x) ϕ2(db|x),

τ_{ϕ1,ϕ2}(x) := ∫_{B(x)} ∫_{A(x)} τ(x, a, b) ϕ1(da|x) ϕ2(db|x),

and, similarly,

Q_{ϕ1,ϕ2}(·|x) := ∫_{B(x)} ∫_{A(x)} Q(·|x, a, b) ϕ1(da|x) ϕ2(db|x),

for all x ∈ X.

If the players use a stationary strategy pair, say (ϕ1, ϕ2), then the state process {xn} is a Markov chain with transition probability Q_{ϕ1,ϕ2}(·|·). In this case, the n-step transition probability is denoted by Q^n_{ϕ1,ϕ2}(·|·) for each n ∈ N0, where Q^0_{ϕ1,ϕ2}(·|x) is the Dirac measure at x ∈ X. Thus, for each u ∈ BW(X),

Q^n_{ϕ1,ϕ2} u(x) := ∫_X u(y) Q^n_{ϕ1,ϕ2}(dy|x) = Ex^{ϕ1,ϕ2} u(xn)  ∀x ∈ X, n ∈ N0.

3  The expected average payoff criterion

The (ratio) expected average payoff (EAP) for the strategy pair (π1, π2) ∈ Π1 × Π2, given the initial state x0 = x ∈ X, is defined as

J(π1, π2, x) := lim inf_{n→∞} [ Ex^{π1,π2} Σ_{k=0}^{n−1} r(xk, ak, bk, δk+1) ] / [ Ex^{π1,π2} Tn ].        (5)

Using properties of conditional expectation, it is easy to verify that

Ex^{π1,π2} δk+1 = Ex^{π1,π2} τ(xk, ak, bk)

and also that

Ex^{π1,π2} r(xk, ak, bk, δk+1) = Ex^{π1,π2} R(xk, ak, bk),

for all x ∈ X, (π1, π2) ∈ Π1 × Π2, k ∈ N0. Thus, (5) can be rewritten as

J(π1, π2, x) = lim inf_{n→∞} [ Ex^{π1,π2} Σ_{k=0}^{n−1} R(xk, ak, bk) ] / [ Ex^{π1,π2} Σ_{k=0}^{n−1} τ(xk, ak, bk) ].        (6)
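The passage from (5) to (6) relies only on the conditioning identities displayed above. A minimal Monte Carlo check of those identities, assuming (purely for illustration) an exponential holding time and a payoff that is linear in the holding time:

```python
# Monte Carlo check of the conditional-expectation identities behind (6).
# The exponential holding time and the linear payoff are illustrative
# assumptions, not part of the model of the paper.
import numpy as np

rng = np.random.default_rng(1)

lam = 2.0          # holding time delta ~ Exponential(lam) at a fixed (x, a, b)
tau = 1.0 / lam    # mean holding time, as in (1)
c = 3.0            # r(x, a, b, t) = c * t, so R = c * tau by (2)
R = c * tau

deltas = rng.exponential(tau, size=200_000)
emp_tau = deltas.mean()        # estimates E[delta]
emp_R = (c * deltas).mean()    # estimates E[r(x, a, b, delta)]
```

Up to Monte Carlo error, emp_tau matches τ = 0.5 and emp_R matches R = 1.5, as the identities predict.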

Now consider the following functions on X:

L(x) := sup_{π1∈Π1} inf_{π2∈Π2} J(π1, π2, x)  and  U(x) := inf_{π2∈Π2} sup_{π1∈Π1} J(π1, π2, x),        (7)

which are called the lower value and the upper value of the game, respectively, for the ratio EAP criterion. In general L(·) ≤ U(·), but if L(·) = U(·) holds, the common function is called the value of the game and is denoted by V(·).

If the game has a value V(·), a strategy π1* ∈ Π1 is said to be expected average payoff (EAP-) optimal for player 1 if

inf_{π2∈Π2} J(π1*, π2, x) = V(x)  ∀x ∈ X.

Similarly, π2* ∈ Π2 is said to be EAP-optimal for player 2 if

sup_{π1∈Π1} J(π1, π2*, x) = V(x)  ∀x ∈ X.

If πi* is EAP-optimal for player i (i = 1, 2), then (π1*, π2*) is called an EAP-optimal pair or saddle point. Note that (π1*, π2*) is EAP-optimal if and only if

J(π1, π2*, x) ≤ J(π1*, π2*, x) ≤ J(π1*, π2, x)  ∀x ∈ X, (π1, π2) ∈ Π1 × Π2.

4  Assumptions and main results

The first condition imposed on the model, Assumption 4.1 below, ensures that the system is regular, which means that it experiences finitely many jumps or transitions over each finite period of time. Usually, the regularity property is obtained by assuming that the mean holding time τ is bounded below by a positive constant (see, e.g., [2], [5], [15], [17], [18], [19], [20], [24], [26], [28] and their references). In the present paper it is only assumed that the mean holding time is a positive function.

Assumption 4.1. (Regularity condition) τ(x, a, b) > 0 for all (x, a, b) ∈ K.

The second hypothesis imposes a growth condition on both the mean holding time and the mean payoff.

Assumption 4.2. There exists a measurable function W(·) on X, bounded below by a constant θ > 0, such that

max{τ(x, a, b), |R(x, a, b)|} ≤ K W(x)  ∀(x, a, b) ∈ K,

for a fixed positive constant K.

To state the third set of hypotheses (as well as several of its consequences) some notation is required. For a measurable function u(·) on X, define the weighted norm with respect to W (W-norm, for short) as

||u||_W := sup_{x∈X} |u(x)| / W(x),

and denote by BW(X) the Banach space of all measurable functions on X with finite W-norm. Moreover, for a measure γ(·) on X, let

γ(u) := ∫_X u(x) γ(dx),

whenever the integral is well defined.

Assumption 4.3. (Lyapunov condition) There exist a non-trivial measure ν(·) on X, a nonnegative measurable function S(·) on K and a positive constant λ < 1 such that:

(a) ν(W) < ∞;

(b) Q(B|x, a, b) ≥ ν(B) S(x, a, b)  ∀B ∈ B(X), (x, a, b) ∈ K;

(c) ∫_X W(y) Q(dy|x, a, b) ≤ λ W(x) + S(x, a, b) ν(W)  ∀(x, a, b) ∈ K;

(d) ν(S_{ϕ1,ϕ2}) > 0  ∀(ϕ1, ϕ2) ∈ Φ1 × Φ2.

As we mentioned in the Introduction, Assumption 4.3 allows us to use a fixed-point approach. More precisely, we consider the kernel

Q̂(·|x, a, b) := Q(·|x, a, b) − ν(·) S(x, a, b)  ∀(x, a, b) ∈ K,        (8)

which, under Assumption 4.3(b), is nonnegative. The point here is that Assumption 4.3(c) can be expressed equivalently as

∫_X W(y) Q̂(dy|x, a, b) ≤ λ W(x)  ∀(x, a, b) ∈ K,        (9)

which, roughly speaking, means that Q̂(·|·) satisfies a certain contraction property. This contraction property is precisely what we shall exploit to prove our main results (Theorems 4.5 and 4.7 below).
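On a finite state space the objects in Assumption 4.3 and the contraction property (9) can be checked directly. In the sketch below (all numbers are illustrative assumptions, and the kernel already aggregates the players' actions for brevity), S is taken as the largest minorization coefficient compatible with 4.3(b):

```python
# Numerical check of the minorization 4.3(b), the kernel (8) and the
# contraction property (9) on a two-state toy model (illustrative numbers).
import numpy as np

Q = np.array([[0.8, 0.2],      # Q(y|x): row x, column y
              [0.3, 0.7]])
nu = np.array([0.2, 0.1])      # a non-trivial measure on X
S = (Q / nu).min(axis=1)       # largest S with Q(y|x) >= nu(y) S(x)
Q_hat = Q - np.outer(S, nu)    # the nonnegative kernel (8)

W = np.array([1.0, 2.0])       # weight function, bounded below by theta = 1
lam = (Q_hat @ W / W).max()    # smallest lambda for which (9) holds
```

Here λ = 0.55 < 1, so (9) holds, and QW ≤ λW + S·ν(W) recovers Assumption 4.3(c).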

Assumption 4.3 was first used in [27], though it is actually a simplified version of the Lyapunov condition introduced in [9]. Specifically, besides the conditions in Assumption 4.3, [9] assumes the existence of a common irreducibility measure for the transition laws induced by the stationary strategies, and also that the inequality in Assumption 4.3(c) holds uniformly, that is, inf_{ϕ1,ϕ2} ν(S_{ϕ1,ϕ2}) > 0. However, as is shown in [27, Thm. 3.3] (see Proposition 4.4 below), the latter condition is not required, while the irreducibility condition is redundant.

On the other hand, several other papers have used Lyapunov conditions similar to Assumption 4.3 (see, e.g., [13], [14], [15], [16], [17], [23]) but with some important differences, which seemingly preclude the fixed-point approach. For instance, the four latter papers suppose, instead of the conditions in Assumption 4.3, that

∫_X W(y) Q(dy|x, a, b) ≤ λ W(x) + b I_C(x)  ∀(x, a, b) ∈ K,

where C is a Borel subset of X, b is a positive constant, λ ∈ (0, 1) and W(·) is bounded on C, and also that

Q_{ϕ1,ϕ2}(B|x) ≥ δ I_C(x) ν_{ϕ1,ϕ2}(B)  ∀x ∈ X, B ∈ B(X), (ϕ1, ϕ2) ∈ Φ1 × Φ2,

where each ν_{ϕ1,ϕ2}(·) is a probability measure concentrated on C and δ is a positive constant. A quick glance at the latter conditions shows that they do not lead to a contraction property as in (9), so the fixed-point approach is not applicable, at least in the way we use it here. Finally, it is convenient to point out again that, in spite of imposing conditions similar to Assumption 4.3, the approaches followed in all the papers cited so far rely on the WGE property mentioned in the Introduction, with the only exceptions of [27] and [8].

The next proposition states some important consequences of Assumptions 4.2 and 4.3, which are proved in [27] using fixed-point arguments too.

Proposition 4.4. Suppose that Assumption 4.3 holds. Then, for each stationary strategy pair (ϕ1, ϕ2) ∈ Φ1 × Φ2, the following hold:

(a) The transition law Q_{ϕ1,ϕ2}(·|·) is positive Harris recurrent. Thus, in particular, there exists a unique invariant probability measure μ_{ϕ1,ϕ2}(·), that is,

μ_{ϕ1,ϕ2}(·) = ∫_X Q_{ϕ1,ϕ2}(·|x) μ_{ϕ1,ϕ2}(dx).

Moreover, ν is an irreducibility measure for Q_{ϕ1,ϕ2}(·|·).

(b) μ_{ϕ1,ϕ2}(W) is finite; in fact, the bounds

θ ≤ μ_{ϕ1,ϕ2}(W) ≤ ν(W) / [(1 − λ) ν(X)]        (10)

hold.
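For a finite chain induced by a fixed stationary pair, the invariant measure of Proposition 4.4 can be computed by a linear solve, and the ratio μ(R)/μ(τ) (the constant ρ(ϕ1, ϕ2) defined in (11) just below) can be compared with a long simulated trajectory of the ratio criterion (6). All numbers are illustrative assumptions:

```python
# Invariant measure and ratio-average payoff for a fixed stationary pair on a
# two-state toy model; all numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

P = np.array([[0.8, 0.2],      # Q_{phi1,phi2}(y|x) for the fixed pair
              [0.3, 0.7]])
tau = np.array([0.5, 1.25])    # mean holding times, positive (Assumption 4.1)
R = np.array([1.0, -0.5])      # mean payoffs

# Invariant probability measure: solve mu P = mu together with sum(mu) = 1.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
mu = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
rho = (mu @ R) / (mu @ tau)    # the ratio mu(R)/mu(tau)

# Long-run simulation of sum_k R(x_k) / sum_k tau(x_k), as in (6).
x, num, den = 0, 0.0, 0.0
for _ in range(100_000):
    num += R[x]
    den += tau[x]
    x = int(rng.choice(2, p=P[x]))
sim_rho = num / den
```

For these numbers μ = (0.6, 0.4) and ρ = 0.4/0.8 = 0.5; the simulated ratio agrees up to Monte Carlo error.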

Next observe that, under Assumptions 4.1-4.3, by Proposition 4.4 the constants

ρ(ϕ1, ϕ2) := μ_{ϕ1,ϕ2}(R_{ϕ1,ϕ2}) / μ_{ϕ1,ϕ2}(τ_{ϕ1,ϕ2})  ∀(ϕ1, ϕ2) ∈ Φ1 × Φ2        (11)

are finite. Then, for each (ϕ1, ϕ2) ∈ Φ1 × Φ2, define on BW(X) the operator

L_{ϕ1,ϕ2} u(x) := R̄_{ϕ1,ϕ2}(x) + ∫_X u(y) Q_{ϕ1,ϕ2}(dy|x)  ∀x ∈ X,        (12)

where

R̄_{ϕ1,ϕ2}(·) := R_{ϕ1,ϕ2}(·) − ρ(ϕ1, ϕ2) τ_{ϕ1,ϕ2}(·).        (13)

Theorem 4.5. Suppose that Assumptions 4.1, 4.2 and 4.3 hold. Then, for each stationary strategy pair (ϕ1, ϕ2) ∈ Φ1 × Φ2:

(a) There exists a unique function h_{ϕ1,ϕ2} ∈ BW(X), with ν(h_{ϕ1,ϕ2}) = 0, that satisfies the (semi-Markov) Poisson equation

h_{ϕ1,ϕ2}(x) = L_{ϕ1,ϕ2} h_{ϕ1,ϕ2}(x) = R̄_{ϕ1,ϕ2}(x) + ∫_X h_{ϕ1,ϕ2}(y) Q_{ϕ1,ϕ2}(dy|x)  ∀x ∈ X;

(b) Moreover, J(ϕ1, ϕ2, ·) = ρ(ϕ1, ϕ2).

Now we impose some compactness/continuity conditions on the model to ensure the existence of measurable minimizers/maximizers; notice that this can be done in several settings (see, e.g., [10, Thm. 3.5, p. 28] or [8, Lemma 3.5]). Here, for simplicity, we consider the following one.

Assumption 4.6. (Compactness/continuity conditions) For each (x, a, b) ∈ K:

(a) A(x) and B(x) are non-empty compact sets;

(b) R(x, ·, b) is upper semicontinuous on A(x), and R(x, a, ·) is lower semicontinuous on B(x);

(c) τ(x, ·, b) and τ(x, a, ·) are continuous on A(x) and B(x), respectively;

(d) S(x, ·, b) and S(x, a, ·) are continuous on A(x) and B(x), respectively;

(e) For each bounded measurable function v on X, the functions

∫_X v(y) Q(dy|x, ·, b)  and  ∫_X v(y) Q(dy|x, a, ·)

are continuous on A(x) and B(x), respectively;

(f) The functions

∫_X W(y) Q(dy|x, ·, b)  and  ∫_X W(y) Q(dy|x, a, ·)

are continuous on A(x) and B(x), respectively.

Theorem 4.7. Suppose that Assumptions 4.1, 4.2, 4.3 and 4.6 hold. Then:

(a) There exist a unique function h* ∈ BW(X) with ν(h*) = 0, a stationary strategy pair (ϕ1*, ϕ2*) ∈ Φ1 × Φ2 and a constant ρ* which satisfy the Shapley equation

h*(x) = min_{ϕ2∈Φ2} [ R_{ϕ1*,ϕ2}(x) − ρ* τ_{ϕ1*,ϕ2}(x) + ∫_X h*(y) Q_{ϕ1*,ϕ2}(dy|x) ]

      = max_{ϕ1∈Φ1} [ R_{ϕ1,ϕ2*}(x) − ρ* τ_{ϕ1,ϕ2*}(x) + ∫_X h*(y) Q_{ϕ1,ϕ2*}(dy|x) ]

      = R_{ϕ1*,ϕ2*}(x) − ρ* τ_{ϕ1*,ϕ2*}(x) + ∫_X h*(y) Q_{ϕ1*,ϕ2*}(dy|x)  ∀x ∈ X.

(b) The constant ρ* is the value of the game and (ϕ1*, ϕ2*) is an EAP-optimal stationary strategy pair. That is, J(ϕ1*, ϕ2*, ·) = ρ* and

J(π1, ϕ2*, ·) ≤ ρ* ≤ J(ϕ1*, π2, ·)  ∀(π1, π2) ∈ Π1 × Π2.

Hence, by Theorem 4.5, h*(·) = h_{ϕ1*,ϕ2*}(·).

(c) Moreover,

ρ* = ρ(ϕ1*, ϕ2*) = max_{ϕ1∈Φ1} min_{ϕ2∈Φ2} ρ(ϕ1, ϕ2) = min_{ϕ2∈Φ2} max_{ϕ1∈Φ1} ρ(ϕ1, ϕ2),        (14)

h*(·) = h_{ϕ1*,ϕ2*}(·) = min_{ϕ2∈F2} h_{ϕ1*,ϕ2}(·) = max_{ϕ1∈F1} h_{ϕ1,ϕ2*}(·),        (15)

where Fi stands for the class of all stationary EAP-optimal strategies for player i (i = 1, 2).

It is worth mentioning that, to the best of our knowledge, the minimax characterization of the solution h*(·) of the Shapley equation given in (15) has not been discussed in any of the previous papers dealing with zero-sum stochastic games, even for the case of a discrete state space.

5  Proof of Theorem 4.5

Several preliminary results are needed for the proofs of the results of Section 4. The first ones are collected in the next lemma, which we state without proof because they follow directly from Assumptions 4.1, 4.2 and 4.3.

Lemma 5.1. Suppose that Assumption 4.3 holds. Then:

(a) For each function u in BW(X),

lim_{n→∞} (1/n) Ex^{π1,π2} u(xn) = 0  ∀x ∈ X, (π1, π2) ∈ Π1 × Π2;

(b) For each stationary strategy pair (ϕ1, ϕ2) ∈ Φ1 × Φ2, it holds that

μ_{ϕ1,ϕ2}(S_{ϕ1,ϕ2}) ≥ (1 − λ) θ / ν(W) > 0;

(c) If in addition Assumptions 4.1 and 4.2 hold, then

μ_{ϕ1,ϕ2}(S_{ϕ1,ϕ2}) / μ_{ϕ1,ϕ2}(τ_{ϕ1,ϕ2}) ≥ (1 − λ) / [K ν(W)] > 0.

The following lemma concerns the existence of solutions to the Poisson equation which, in addition to being interesting in itself, plays a key role in our development. In fact, its proof exhibits the way we take advantage of the contraction property (9).

Lemma 5.2. Suppose that Assumptions 4.2 and 4.3 hold and let (ϕ1, ϕ2) ∈ Φ1 × Φ2 be fixed but arbitrary. Then, for each function v in BW(X) there exists a unique function hv in BW(X), with ν(hv) = 0, which satisfies the Poisson equation

hv(x) = v(x) − μ_{ϕ1,ϕ2}(v) + ∫_X hv(y) Q_{ϕ1,ϕ2}(dy|x)  ∀x ∈ X.        (16)

Thus, from Lemma 5.1(a),

μ_{ϕ1,ϕ2}(v) = lim_{n→∞} (1/n) Ex^{ϕ1,ϕ2} Σ_{k=0}^{n−1} v(xk)  ∀x ∈ X.        (17)
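Lemma 5.2 is constructive: on a finite state space the operator T̂u = v − μ(v) + Q̂u can be iterated explicitly, its fixed point satisfies the Poisson equation (16) with ν(h_v) = 0, and the Cesaro averages in (17) converge to μ(v). A minimal sketch; the chain, the minorization pair (ν, S) and the function v are illustrative assumptions:

```python
# Banach fixed-point iteration for the Poisson equation (16) on a two-state
# toy chain; every numeric ingredient is an illustrative assumption.
import numpy as np

Q = np.array([[0.8, 0.2],
              [0.3, 0.7]])
nu = np.array([0.2, 0.1])
S = (Q / nu).min(axis=1)         # minorization: Q(y|x) >= nu(y) S(x)
Q_hat = Q - np.outer(S, nu)      # the kernel (8); here a sup-norm contraction
mu = np.array([0.6, 0.4])        # invariant measure of Q (mu Q = mu)
v = np.array([1.0, -2.0])

h = np.zeros(2)
for _ in range(200):             # iterate T-hat: u -> v - mu(v) + Q_hat u
    h = v - mu @ v + Q_hat @ h

# The fixed point satisfies h = v - mu(v) + Q h - nu(h) S; integrating against
# mu forces nu(h) = 0, so h solves the Poisson equation (16).
poisson_residual = np.abs(h - (v - mu @ v + Q @ h)).max()

# Check (17): the Cesaro averages (1/n) sum_{k<n} E_x v(x_k) approach mu(v).
n, acc, Pk = 5_000, np.zeros(2), np.eye(2)
for _ in range(n):
    acc += Pk @ v                # adds E_x v(x_k) for both initial states
    Pk = Pk @ Q
cesaro = acc / n
```

For these numbers the fixed point is h = (2, −4), and indeed ν(h) = 0.2·2 + 0.1·(−4) = 0, as the lemma guarantees.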

Proof of Lemma 5.2. Fix a function v ∈ BW(X), and write μ(·) := μ_{ϕ1,ϕ2}(·), S(·) := S_{ϕ1,ϕ2}(·), Q(·|·) := Q_{ϕ1,ϕ2}(·|·) and Q̂(·|·) := Q̂_{ϕ1,ϕ2}(·|·). Next, define

T̂u(x) := v(x) − μ(v) + ∫_X u(y) Q̂(dy|x)  ∀x ∈ X, u ∈ BW(X).

By Assumption 4.3(c), it is clear that T̂ maps BW(X) into itself. Moreover, for any functions u, w ∈ BW(X), it holds that

|T̂u(x) − T̂w(x)| ≤ ∫_X |u(y) − w(y)| Q̂(dy|x) ≤ ||u − w||_W ∫_X W(y) Q̂(dy|x) ≤ ||u − w||_W λ W(x)

for all x ∈ X. Hence,

||T̂u − T̂w||_W ≤ λ ||u − w||_W.

That is, T̂ is a contraction operator from BW(X) into itself with modulus λ. Then, by the Banach Fixed Point Theorem, there exists a unique function hv ∈ BW(X) that satisfies the equation

hv(x) = v(x) − μ(v) + ∫_X hv(y) Q̂(dy|x) = v(x) − μ(v) + ∫_X hv(y) Q(dy|x) − ν(hv) S(x)  ∀x ∈ X.

Now, integrating both sides of the last equation with respect to the invariant probability measure μ(·) yields ν(hv)μ(S) = 0, which, by Lemma 5.1(b), implies that ν(hv) = 0. Therefore, hv satisfies the Poisson equation

hv(x) = v(x) − μ(v) + ∫_X hv(y) Q(dy|x)  ∀x ∈ X,

which proves (16). Finally, property (17) is obtained by iterating the Poisson equation and using Lemma 5.1(a). □

Now we proceed to prove Theorem 4.5.

Proof of Theorem 4.5. Let (ϕ1, ϕ2) ∈ Φ1 × Φ2 be fixed but arbitrary. Since the function

v(·) := R̄_{ϕ1,ϕ2}(·) = R_{ϕ1,ϕ2}(·) − ρ(ϕ1, ϕ2) τ_{ϕ1,ϕ2}(·)

is in BW(X), by Lemma 5.2 there exists a unique function h_{ϕ1,ϕ2} ∈ BW(X) with ν(h_{ϕ1,ϕ2}) = 0 that satisfies the Poisson equation

h_{ϕ1,ϕ2}(x) = R̄_{ϕ1,ϕ2}(x) + ∫_X h_{ϕ1,ϕ2}(y) Q_{ϕ1,ϕ2}(dy|x)  ∀x ∈ X

(note that μ_{ϕ1,ϕ2}(v) = 0 by the definition (11) of ρ(ϕ1, ϕ2)). This proves part (a) of the theorem. Next, to prove part (b), first note that iteration of the last equation yields

h_{ϕ1,ϕ2}(x) = Ex^{ϕ1,ϕ2} [ Σ_{k=0}^{n−1} R_{ϕ1,ϕ2}(xk) − ρ(ϕ1, ϕ2) Σ_{k=0}^{n−1} τ_{ϕ1,ϕ2}(xk) ] + ∫_X h_{ϕ1,ϕ2}(y) Q^n_{ϕ1,ϕ2}(dy|x)        (18)

for all n ∈ N and x ∈ X. Moreover, by Assumptions 4.1 and 4.2, applying Lemma 5.2 with v(·) := τ_{ϕ1,ϕ2}(·), we obtain

μ_{ϕ1,ϕ2}(τ_{ϕ1,ϕ2}) = lim_{n→∞} (1/n) Ex^{ϕ1,ϕ2} Σ_{k=0}^{n−1} τ_{ϕ1,ϕ2}(xk) > 0  ∀x ∈ X,

which combined with (18) and Lemma 5.1(a) implies that

ρ(ϕ1, ϕ2) = lim_{n→∞} [ Ex^{ϕ1,ϕ2} Σ_{k=0}^{n−1} R_{ϕ1,ϕ2}(xk) ] / [ Ex^{ϕ1,ϕ2} Σ_{k=0}^{n−1} τ_{ϕ1,ϕ2}(xk) ]  ∀x ∈ X,

so that, by (6), J(ϕ1, ϕ2, ·) = ρ(ϕ1, ϕ2). □

6  Proof of Theorem 4.7

Define the constants

ρl := sup_{ϕ1∈Φ1} inf_{ϕ2∈Φ2} ρ(ϕ1, ϕ2)  and  ρu := inf_{ϕ2∈Φ2} sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2).

We show in the next lemma that these constants are finite. Observe that this trivially holds if one assumes that the mean holding time function is bounded below by a positive constant.

Lemma 6.1. Suppose that Assumptions 4.1, 4.2, 4.3 and 4.6 hold. Then |ρl| < ∞ and |ρu| < ∞.

Proof of Lemma 6.1. Let ϕ1 be a fixed but arbitrary stationary strategy for player 1 and consider the Markov (one-player) model

M = (X, KB, Q̃, τ̃),

where X and KB are as above, and the transition law and the "one-step cost" function are defined as

Q̃(·|x, b) := ∫_{A(x)} Q(·|x, a, b) ϕ1(da|x),  τ̃(x, b) := ∫_{A(x)} τ(x, a, b) ϕ1(da|x),

for all (x, b) ∈ KB, respectively. Thus, following the notation (4), for all x ∈ X and ϕ2 ∈ Φ2, define

Q̃_{ϕ2}(·|x) := ∫_{B(x)} Q̃(·|x, b) ϕ2(db|x),  τ̃_{ϕ2}(x) := ∫_{B(x)} τ̃(x, b) ϕ2(db|x).

Note that Q̃_{ϕ2}(·|·) = Q_{ϕ1,ϕ2}(·|·) and τ̃_{ϕ2}(·) = τ_{ϕ1,ϕ2}(·) for all ϕ2 ∈ Φ2. The Markov model M satisfies all the conditions in [27, Thm. 3.6]; hence, in particular, there exists a stationary policy ϕ2+ ∈ Φ2 such that

μ_{ϕ1,ϕ2+}(τ_{ϕ1,ϕ2+}) = inf_{ϕ2∈Φ2} μ_{ϕ1,ϕ2}(τ_{ϕ1,ϕ2}).

Then, by Assumption 4.1, it holds that μ_{ϕ1,ϕ2+}(τ_{ϕ1,ϕ2+}) > 0. Next observe that

|ρ(ϕ1, ϕ2)| ≤ μ_{ϕ1,ϕ2}(|R_{ϕ1,ϕ2}|) / μ_{ϕ1,ϕ2}(τ_{ϕ1,ϕ2}) ≤ K μ_{ϕ1,ϕ2}(W) / μ_{ϕ1,ϕ2+}(τ_{ϕ1,ϕ2+}) ≤ k / μ_{ϕ1,ϕ2+}(τ_{ϕ1,ϕ2+}),

where the last inequality follows from (10) with k := K ν(W)[(1 − λ) ν(X)]^{−1}. Hence,

−∞ < −k / μ_{ϕ1,ϕ2+}(τ_{ϕ1,ϕ2+}) ≤ inf_{ϕ2∈Φ2} ρ(ϕ1, ϕ2) ≤ ρ(ϕ1, ϕ2)  ∀ϕ2 ∈ Φ2.        (19)

Now fix ϕ2 ∈ Φ2 and proceed as above to get a stationary strategy ϕ1+ ∈ Φ1 such that

μ_{ϕ1+,ϕ2}(τ_{ϕ1+,ϕ2}) = inf_{ϕ1∈Φ1} μ_{ϕ1,ϕ2}(τ_{ϕ1,ϕ2}) > 0.

Then,

ρ(ϕ1, ϕ2) ≤ μ_{ϕ1,ϕ2}(|R_{ϕ1,ϕ2}|) / μ_{ϕ1,ϕ2}(τ_{ϕ1,ϕ2}) ≤ k / μ_{ϕ1+,ϕ2}(τ_{ϕ1+,ϕ2}) < +∞.

Hence,

ρ(ϕ1, ϕ2) ≤ sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2) ≤ k / μ_{ϕ1+,ϕ2}(τ_{ϕ1+,ϕ2}).        (20)

Therefore, by (19)-(20),

−∞ < ρl = sup_{ϕ1∈Φ1} inf_{ϕ2∈Φ2} ρ(ϕ1, ϕ2) ≤ ρu = inf_{ϕ2∈Φ2} sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2) < +∞,

which proves the desired result. □

For the proof of Theorem 4.7 we introduce the following operators: for each u ∈ BW(X) define

L^l u(x, a, b) := R^l(x, a, b) + ∫_X u(y) Q̂(dy|x, a, b)  ∀(x, a, b) ∈ K,        (21)

where

R^l(x, a, b) := R(x, a, b) − ρl τ(x, a, b)  ∀(x, a, b) ∈ K.        (22)

Thus, following the notation (4), for each strategy pair (ϕ1, ϕ2) ∈ Φ1 × Φ2 define the operators

L^l_{ϕ1,ϕ2} u(·) := R^l_{ϕ1,ϕ2}(·) + ∫_X u(y) Q̂_{ϕ1,ϕ2}(dy|·),        (23)

L* u(x) := sup_{ϕ1∈A(x)} inf_{ϕ2∈B(x)} L^l_{ϕ1,ϕ2} u(x),        (24)

for each u ∈ BW(X).

The results in the next lemma are a combination of a well-known measurable selection theorem [22] and Fan's Minimax Theorem [4]. The proof is omitted since it is the same as the proofs of Lemma 6.5 in [11] and Lemmas 2, 3 and 4 in [23].

Lemma 6.2. Suppose that Assumptions 4.1, 4.2, 4.3 and 4.6 hold and let u be a fixed function in BW(X). Then:

(a) For each x ∈ X, the sets A(x) and B(x) are compact with respect to the weak convergence of measures;

(b) For each x ∈ X and (ϕ1, ϕ2) ∈ Φ1 × Φ2, the mappings

ϕ1 → L^l_{ϕ1,ϕ2} u(x)  and  ϕ2 → L^l_{ϕ1,ϕ2} u(x)

are upper semicontinuous and lower semicontinuous on A(x) and B(x), respectively, with respect to the weak convergence of measures;

(c) Moreover, there exists a stationary strategy pair (ϕ1u, ϕ2u) ∈ Φ1 × Φ2 such that

L* u(·) = L^l_{ϕ1u,ϕ2u} u(·) = max_{ϕ1∈Φ1} L^l_{ϕ1,ϕ2u} u(·) = min_{ϕ2∈Φ2} L^l_{ϕ1u,ϕ2} u(·).

Hence, L* u(·) is in BW(X).

The proof of Theorem 4.7 follows the same scheme as that of Lemma 5.2. We first show, in Lemma 6.3 below, that L* is a contraction operator from BW(X) into itself with modulus λ; hence, by the Banach Fixed Point Theorem, there exists a unique function h* in BW(X) such that

h*(x) = L* h*(x) = sup_{ϕ1∈A(x)} inf_{ϕ2∈B(x)} L^l_{ϕ1,ϕ2} h*(x)  ∀x ∈ X.        (25)

(25)

As a second step, in Lemma 6.4, we prove that ρ∗ := ρl = ρu and ν(h∗ ) ≤ 0. Once the latter is done, we show in Lemma 6.5 that ν(h∗ ) = 0. Then, (25) becomes

h∗ (x) =

sup

inf 2

ϕ1 ∈A(x) ϕ ∈B(x)



Rϕ1 ,ϕ2 (x) − ρ∗ τϕ1 ,ϕ2 (x) +

Z

h∗ (y)Qϕ1 ,ϕ2 (dy|x) X



for all x ∈ X. Hence, Lemma 6.2 yields a stationary strategy pair (ϕ1∗ , ϕ2∗ ) ∈ Φ1 × Φ2 satisfying Theorem 4.7(a). Lemma 6.3. Suppose that assumptions in Theorem 4.7 hold. Then, L∗ in (24) is a contraction operator from BW (X) into itself with modulus λ. Thus, by the Banach Fixed Point Theorem and Lemma 6.2, there exists a unique function h∗ in BW (X) and a stationary strategy pair (ϕ1∗ , ϕ2∗ ) ∈ Φ1 × Φ2 such that 16

h∗ (·) = L∗ h∗ (·) = Llϕ1∗ ,ϕ2∗ h∗ (·)

(26)

= 2min Llϕ1∗ ,ϕ2 h∗ (·) = 1max Llϕ1 ,ϕ2∗ h∗ (·). ϕ ∈B(x)

ϕ ∈A(x)

(27)
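On a finite model the contraction behind Lemma 6.3 can be iterated directly. The sketch below uses a pure-strategy analogue of the operator L* in (24): at each state, the matrix game R^l(x, a, b) + Σ_y u(y) Q̂(y|x, a, b) is evaluated by max-min over actions. To avoid mixed strategies, the (illustrative) payoff is separable in (a, b) and Q̂ is taken independent of the actions, so each matrix game has a pure saddle point and max-min equals min-max; all numbers are assumptions, not data from the paper.

```python
# Fixed-point iteration of a pure-strategy analogue of L* on a toy model with
# separable payoffs (so pure saddle points exist); illustrative numbers only.
import numpy as np

Q_hat = 0.5 * np.array([[0.7, 0.3],        # substochastic kernel as in (8)-(9),
                        [0.4, 0.6]])       # taken action-independent here
f = np.array([[1.0, 0.0], [-1.0, -2.0]])   # f(x, a): player 1's part of R^l
g = np.array([[0.5, 2.0], [0.0, 1.0]])     # g(x, b): player 2's part of R^l
Rl = f[:, :, None] + g[:, None, :]         # separable R^l(x, a, b)

def L_star(u):
    # Pure max-min value of the matrix game R^l + sum_y u(y) Q_hat(y|x).
    vals = Rl + (Q_hat @ u)[:, None, None]
    return vals.min(axis=2).max(axis=1)    # max over a of min over b

u = np.zeros(2)
for _ in range(200):                       # Banach iteration, modulus 1/2
    u = L_star(u)
h_star = u

vals = Rl + (Q_hat @ h_star)[:, None, None]
maxmin = vals.min(axis=2).max(axis=1)      # sup-inf, as in (24)
minmax = vals.max(axis=1).min(axis=1)      # inf-sup; equal here (pure saddle)
```

Since the contraction modulus is 1/2, the iterates converge geometrically to the unique fixed point h*, mirroring the role of the Banach Fixed Point Theorem in Lemma 6.3.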

Proof of Lemma 6.3. By Lemma 6.2 it only remains to prove that L* is a contraction operator from BW(X) into itself with modulus λ. To prove this, consider arbitrary functions u, v in BW(X) and observe, by Assumption 4.3(b) and (9), that

|L^l_{ϕ1,ϕ2} u(·) − L^l_{ϕ1,ϕ2} v(·)| ≤ ||u − v||_W ∫_X W(y) Q̂_{ϕ1,ϕ2}(dy|·) ≤ ||u − v||_W λ W(·)

for all (ϕ1, ϕ2) ∈ Φ1 × Φ2. This implies that

L^l_{ϕ1,ϕ2} u(·) ≤ L^l_{ϕ1,ϕ2} v(·) + ||u − v||_W λ W(·)  ∀(ϕ1, ϕ2) ∈ Φ1 × Φ2.

Thus, the latter inequality together with Lemma 6.2 implies

inf_{ϕ2∈B(x)} L^l_{ϕ1,ϕ2} u(·) ≤ inf_{ϕ2∈B(x)} L^l_{ϕ1,ϕ2} v(·) + ||u − v||_W λ W(·)  ∀ϕ1 ∈ Φ1,

which, using Lemma 6.2 again, yields

L* u(·) ≤ L* v(·) + ||u − v||_W λ W(·).

Similarly, interchanging the roles of u and v, it also holds that

L* v(·) ≤ L* u(·) + ||u − v||_W λ W(·).

Therefore,

||L* u − L* v||_W ≤ λ ||u − v||_W.

That is, L* is a contraction operator from BW(X) into itself with modulus λ. Now, the Banach Fixed Point Theorem together with Lemma 6.2 ensures the existence of a unique function h* ∈ BW(X) and a stationary strategy pair (ϕ1*, ϕ2*) ∈ Φ1 × Φ2 satisfying (26)-(27). □

Lemma 6.4. Suppose that the assumptions of Theorem 4.7 hold and let h* be as in Lemma 6.3. Then

ν(h*) ≤ 0 and ρl = ρu.

Proof of Lemma 6.4. Let (ϕ1*, ϕ2*) be as in Lemma 6.3. Then,

h*(x) = min_{ϕ2∈B(x)} [ R^l_{ϕ1*,ϕ2}(x) + ∫_X h*(y) Q̂_{ϕ1*,ϕ2}(dy|x) ]        (28)

      ≤ R^l_{ϕ1*,ϕ2}(x) + ∫_X h*(y) Q̂_{ϕ1*,ϕ2}(dy|x)

      = R^l_{ϕ1*,ϕ2}(x) + ∫_X h*(y) Q_{ϕ1*,ϕ2}(dy|x) − ν(h*) S_{ϕ1*,ϕ2}(x)

for all x ∈ X, ϕ2 ∈ Φ2. Then, an integration with respect to the invariant probability measure μ_{ϕ1*,ϕ2} yields

0 ≤ μ_{ϕ1*,ϕ2}(R^l_{ϕ1*,ϕ2}) − ν(h*) μ_{ϕ1*,ϕ2}(S_{ϕ1*,ϕ2})  ∀ϕ2 ∈ Φ2,

which implies that

ν(h*) μ_{ϕ1*,ϕ2}(S_{ϕ1*,ϕ2}) ≤ μ_{ϕ1*,ϕ2}(R_{ϕ1*,ϕ2}) − ρl μ_{ϕ1*,ϕ2}(τ_{ϕ1*,ϕ2}) = μ_{ϕ1*,ϕ2}(τ_{ϕ1*,ϕ2}) [ρ(ϕ1*, ϕ2) − ρl]

for all ϕ2 ∈ Φ2. Now, taking infimum over Φ2, we obtain

inf_{ϕ2∈Φ2} [ ν(h*) μ_{ϕ1*,ϕ2}(S_{ϕ1*,ϕ2}) / μ_{ϕ1*,ϕ2}(τ_{ϕ1*,ϕ2}) ] ≤ inf_{ϕ2∈Φ2} [ρ(ϕ1*, ϕ2) − ρl] ≤ 0,

which, by Assumption 4.1 and Lemma 5.1(b), implies that ν(h*) ≤ 0. This inequality combined with (27) implies

for all ϕ2 ∈ Φ2 . Now, taking infimum over Φ2 , we obtain   ν(h∗ )µϕ1∗ ,ϕ2 (Sϕ1∗ ,ϕ2 ) ≤ 2inf ρ(ϕ1∗ , ϕ2 ) − ρl ≤ 0, inf µϕ1∗ ,ϕ2 (τϕ1∗ ,ϕ2 ) ϕ ∈B(x) ϕ2 ∈B(x) which, by Assumption 4.1 and Lemma 5.1(b), implies that ν(h∗ ) ≤ 0. This inequality combined with (27) implies h (x) = 1max



l Rϕ 1 ,ϕ2 (x) ∗

≥ 1max



l Rϕ 1 ,ϕ2 (x) ∗



ϕ ∈A(x)

ϕ ∈A(x)

l ≥ Rϕ 1 ,ϕ2 (x) + ∗

Z

X

+

+

Z Z

X

X









b ϕ1 ,ϕ2 (dy|x) h (y)Q ∗ h (y)Qϕ1 ,ϕ2∗ (dy|x)

h∗ (y)Qϕ1 ,ϕ2∗ (dy|x)

18

for all x ∈ X, ϕ1 ∈ Φ1 . Now, integrating both sides of the latter inequality with respect to the invariant probability measure µϕ1 ,ϕ2∗ , we see that l l 0 ≥ µϕ1 ,ϕ2∗ (Rϕ ∀ϕ1 ∈ Φ1 , 1 ,ϕ2 ) = µϕ1 ,ϕ2 (Rϕ1 ,ϕ2 ) − ρ µϕ1 ,ϕ2 (τϕ1 ,ϕ2 ) ∗ ∗ ∗ ∗ ∗

which implies that

ρl ≥ ρ(ϕ1, ϕ2*) = μ_{ϕ1,ϕ2*}(R_{ϕ1,ϕ2*}) / μ_{ϕ1,ϕ2*}(τ_{ϕ1,ϕ2*})  ∀ϕ1 ∈ Φ1.

Hence,

ρl ≥ sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2*) ≥ inf_{ϕ2∈Φ2} sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2) = ρu.

Therefore, ρl = ρu. □

Lemma 6.5. Suppose that the assumptions of Theorem 4.7 hold and let h* be as in Lemma 6.3. Then ν(h*) = 0.

Proof of Lemma 6.5. Let (ϕ1*, ϕ2*) be as in Lemma 6.3 and put ρ* := ρl = ρu. By (27), we have



h*(x) = max_{ϕ1∈A(x)} [ R_{ϕ1,ϕ2*}(x) − ρ* τ_{ϕ1,ϕ2*}(x) + ∫_X h*(y) Q̂_{ϕ1,ϕ2*}(dy|x) ]

      ≥ R_{ϕ1,ϕ2*}(x) − ρ* τ_{ϕ1,ϕ2*}(x) + ∫_X h*(y) Q̂_{ϕ1,ϕ2*}(dy|x)

for all x ∈ X, ϕ1 ∈ Φ1. As above, integrating both sides of the latter inequality with respect to the invariant probability measure μ_{ϕ1,ϕ2*}, we obtain

ν(h*) μ_{ϕ1,ϕ2*}(S_{ϕ1,ϕ2*}) ≥ μ_{ϕ1,ϕ2*}(τ_{ϕ1,ϕ2*}) [ρ(ϕ1, ϕ2*) − ρ*]

      = μ_{ϕ1,ϕ2*}(τ_{ϕ1,ϕ2*}) [ρ(ϕ1, ϕ2*) − inf_{ϕ2∈Φ2} sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2)]

      ≥ μ_{ϕ1,ϕ2*}(τ_{ϕ1,ϕ2*}) [ρ(ϕ1, ϕ2*) − sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2*)],

which implies that

ν(h*) μ_{ϕ1,ϕ2*}(S_{ϕ1,ϕ2*}) / μ_{ϕ1,ϕ2*}(τ_{ϕ1,ϕ2*}) ≥ ρ(ϕ1, ϕ2*) − sup_{ϕ1∈Φ1} ρ(ϕ1, ϕ2*)  ∀ϕ1 ∈ Φ1.

Then,

sup_{ϕ1∈Φ1} [ ν(h*) μ_{ϕ1,ϕ2*}(S_{ϕ1,ϕ2*}) / μ_{ϕ1,ϕ2*}(τ_{ϕ1,ϕ2*}) ] ≥ 0.

This inequality implies that ν(h*) ≥ 0. Hence, by Lemma 6.4, ν(h*) = 0. □

Finally, we are ready for the proof of Theorem 4.7.

Proof of Theorem 4.7. Let h* and (ϕ1*, ϕ2*) be as in Lemma 6.3. First note that the proof of part (a) is given through Lemmas 6.3, 6.4 and 6.5. Part (b) follows using standard dynamic programming arguments, while the first statement in part (c) is exactly Lemma 6.4. Thus, it only remains to prove the equalities in (15). To do this, first recall that Fi denotes the class of all stationary EAP-optimal strategies for player i (i = 1, 2), which is nonempty because of part (b). Now, define the following operators on BW(X):

$$Mu(x) := \max_{\varphi^1 \in A(x)} \left[ R_{\varphi^1,\varphi^2_*}(x) - \rho^* \tau_{\varphi^1,\varphi^2_*}(x) + \int_X u(y)\, \widehat{Q}_{\varphi^1,\varphi^2_*}(dy|x) \right],$$

$$Nu(x) := \min_{\varphi^2 \in B(x)} \left[ R_{\varphi^1_*,\varphi^2}(x) - \rho^* \tau_{\varphi^1_*,\varphi^2}(x) + \int_X u(y)\, \widehat{Q}_{\varphi^1_*,\varphi^2}(dy|x) \right],$$

for all $x \in X$. Proceeding as above, it is easy to check that $M$ and $N$ are well-defined $\lambda$-contraction operators from $B_W(X)$ into itself. In fact, by part (a), $h^*$ is the fixed point of both operators; that is, $h^*(\cdot) = Mh^*(\cdot) = Nh^*(\cdot)$.

Next choose an arbitrary strategy $\varphi^1_0$ in $F^1$ and note that $\rho^* = \rho(\varphi^1_0, \varphi^2_*)$. Then, by Theorem 4.5, there exists a unique function $h_{\varphi^1_0,\varphi^2_*}$ in $B_W(X)$, with $\nu(h_{\varphi^1_0,\varphi^2_*}) = 0$, which satisfies

$$h_{\varphi^1_0,\varphi^2_*}(x) = R_{\varphi^1_0,\varphi^2_*}(x) - \rho^* \tau_{\varphi^1_0,\varphi^2_*}(x) + \int_X h_{\varphi^1_0,\varphi^2_*}(y)\, \widehat{Q}_{\varphi^1_0,\varphi^2_*}(dy|x) \quad \forall\, x \in X.$$

Next, observe that

$$h_{\varphi^1_0,\varphi^2_*}(\cdot) \le M h_{\varphi^1_0,\varphi^2_*}(\cdot),$$

which implies that

$$h_{\varphi^1_0,\varphi^2_*}(\cdot) \le M^n h_{\varphi^1_0,\varphi^2_*}(\cdot) \quad \forall\, n \in \mathbb{N}.$$

Now, since $M$ is a contraction and $h^*$ is its fixed point, letting $n \to \infty$ yields $h_{\varphi^1_0,\varphi^2_*}(\cdot) \le h^*(\cdot)$. Hence, since $h^*(\cdot) = h_{\varphi^1_*,\varphi^2_*}(\cdot)$ and the policy $\varphi^1_0$ was chosen arbitrarily in $F^1$, we have

$$\max_{\varphi^1 \in F^1} h_{\varphi^1,\varphi^2_*}(\cdot) = h^*(\cdot).$$

Similar arguments, but using the operator $N$ instead of $M$, show that

$$h^*(\cdot) = \min_{\varphi^2 \in F^2} h_{\varphi^1_*,\varphi^2}(\cdot). \qquad \Box$$
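The fixed-point machinery used above admits a simple numerical illustration. The sketch below is a toy finite model with entirely hypothetical data (it is not the paper's model): it iterates a Shapley-type operator $Tu(x) = \mathrm{val}\big[R(x,a,b) - \rho^*\tau(x,a,b) + \lambda \sum_y u(y)\,P(y|x,a)\big]$ on a two-state game, where the constant $\lambda < 1$ stands in for the contraction modulus supplied by the kernel $\widehat{Q}$, and the constant $\rho^*$ is an arbitrary stand-in for the average value. By the Banach Fixed Point Theorem, successive approximations converge geometrically to the unique fixed point.

```python
# Toy illustration of the fixed-point approach (hypothetical data, not the
# paper's model): iterate a Shapley-type lambda-contraction on two states.

def val_2x2(A):
    """Value of a 2x2 zero-sum matrix game (row player maximizes)."""
    lower = max(min(row) for row in A)                             # maxmin
    upper = min(max(A[a][b] for a in range(2)) for b in range(2))  # minmax
    if lower == upper:                                             # pure saddle point
        return lower
    a, b = A[0]
    c, d = A[1]
    return (a * d - b * c) / (a + d - b - c)                       # fully mixed value

# Hypothetical primitives: payoff R[x][a][b], unit mean holding times, and a
# player-1-controlled transition law P[x][a] = distribution over next states.
R = [[[3.0, 1.0], [0.0, 2.0]],
     [[1.0, 4.0], [2.0, 0.0]]]
P = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.5, 0.5], [0.9, 0.1]]]
RHO = 1.0   # stand-in for rho*; any constant still yields a contraction
LAM = 0.9   # contraction modulus playing the role of the kernel Q-hat

def T(u):
    """Shapley-type operator: solve a 2x2 matrix game at each state."""
    out = []
    for x in range(2):
        A = [[R[x][a][b] - RHO + LAM * sum(P[x][a][y] * u[y] for y in range(2))
              for b in range(2)] for a in range(2)]
        out.append(val_2x2(A))
    return out

u = [0.0, 0.0]
for _ in range(300):   # successive approximations (Banach iteration)
    u = T(u)

residual = max(abs(T(u)[x] - u[x]) for x in range(2))
print(residual)        # geometric convergence drives the residual to ~0
```

Since the value of a matrix game is nonexpansive in its entries, $\|Tu - Tv\|_\infty \le \lambda \|u - v\|_\infty$, which is the property the proof of Theorem 4.7 exploits for the operators $M$ and $N$.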

Acknowledgment. The author thanks Prof. Onésimo Hernández-Lerma for his valuable comments on an early version of this work.

References

[1] E. Altman, A. Hordijk and F. M. Spieksma, Contraction conditions for average and α-discount optimality in countable state Markov games with unbounded rewards, Math. Oper. Res. 22 (1997), 588-618.

[2] S. Bhatnagar and V. S. Borkar, A convex analytic framework for ergodic control of semi-Markov processes, Math. Oper. Res. 20 (1995), 923-936.

[3] V. S. Borkar and M. K. Ghosh, Denumerable stochastic games with limiting average payoff, J. Optim. Theory Appl. 76 (1993), 539-560.

[4] K. Fan, Minimax theorems, Proc. Nat. Acad. Sci. USA 39 (1953), 42-47.

[5] A. Federgruen, P. J. Schweitzer and H. C. Tijms, Denumerable undiscounted semi-Markov decision processes with unbounded rewards, Math. Oper. Res. 8 (1983), 298-313.

[6] J. Filar and K. Vrieze, Competitive Markov Decision Processes, Springer-Verlag, New York, 1997.

[7] M. K. Ghosh and A. Bagchi, Stochastic games with average payoff criterion, Appl. Math. Optim. 38 (1998), 283-301.

[8] J. I. González-Trejo, O. Hernández-Lerma and L. F. Hoyos-Reyes, Minimax control of discrete-time stochastic systems, SIAM J. Control Optim., to appear.


[9] E. Gordienko and O. Hernández-Lerma, Average cost Markov control processes with weighted norms: existence of canonical policies, Appl. Math. (Warsaw) 23 (1995), 199-218.

[10] O. Hernández-Lerma and J. B. Lasserre, Further Topics on Discrete-Time Markov Control Processes, Springer-Verlag, New York, 1999.

[11] O. Hernández-Lerma and J. B. Lasserre, Zero-sum stochastic games in Borel spaces: average payoff criteria, SIAM J. Control Optim. 39 (2001), 1520-1539.

[12] O. Hernández-Lerma, R. Montes-de-Oca and R. Cavazos-Cadena, Recurrence conditions for MDPs with Borel state space, Ann. Oper. Res. 28 (1991), 29-46.

[13] O. Hernández-Lerma and O. Vega-Amaya, Infinite-horizon Markov control processes with undiscounted cost criteria: from average to overtaking optimality, Appl. Math. (Warsaw) 25 (1998), 153-178.

[14] O. Hernández-Lerma, O. Vega-Amaya and G. Carrasco, Sample-path optimality and variance-minimization of average cost Markov control processes, SIAM J. Control Optim. 38 (1999), 79-93.

[15] A. Jaśkiewicz, An approximation approach to ergodic semi-Markov control processes, Math. Methods Oper. Res. 54 (2001), 1-19.

[16] A. Jaśkiewicz and A. S. Nowak, On the optimality equation for zero-sum ergodic stochastic games, Math. Methods Oper. Res. 54 (2001), 291-301.

[17] A. Jaśkiewicz, Zero-sum semi-Markov games, SIAM J. Control Optim., to appear.

[18] M. Kurano, Average optimal adaptive policies in semi-Markov decision processes including an unknown parameter, J. Oper. Res. Soc. Japan 28 (1985), 252-266.

[19] A. K. Lal and S. Sinha, Zero-sum two-person semi-Markov games, J. Appl. Prob. 29 (1992), 56-72.

[20] F. Luque-Vásquez and O. Hernández-Lerma, Semi-Markov control models with average costs, Appl. Math. (Warsaw) 26 (1999), 315-331.

[21] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer-Verlag, London, 1993.

[22] A. S. Nowak, Measurable selection theorems for minimax stochastic optimization problems, SIAM J. Control Optim. 23 (1985), 466-477.

[23] A. S. Nowak, Optimal strategies in a class of zero-sum ergodic stochastic games, Math. Methods Oper. Res. 50 (1999), 399-419.


[24] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, 1994.

[25] U. Rieder, Average optimality in Markov games with general state space, Proc. 3rd Conf. on Approx. Theory and Optim. (1995), Puebla, México. (Available at http://www.emis.de/proceedings/)

[26] P. J. Schweitzer, Iterative solutions of functional equations of undiscounted Markov renewal programming, J. Math. Anal. Appl. (1971), 495-501.

[27] O. Vega-Amaya, The average cost optimality equation: a fixed point approach, Reporte de Investigación No. 4 (2001), Departamento de Matemáticas, Universidad de Sonora, México. (Available at http://fractus.mat.uson.mx/~tedi/reportes)

[28] O. Vega-Amaya and F. Luque-Vásquez, Sample-path average cost optimality for semi-Markov control processes on Borel spaces: unbounded costs and mean holding times, Appl. Math. (Warsaw) 27 (2000), 343-367.
