MATHEMATICS OF OPERATIONS RESEARCH
Vol. 27, No. 2, May 2002, pp. 384-400

COMPUTING STATIONARY NASH EQUILIBRIA OF UNDISCOUNTED SINGLE-CONTROLLER STOCHASTIC GAMES

T. E. S. RAGHAVAN and ZAMIR SYED

Given a two-person, nonzero-sum stochastic game in which the second player controls the transitions, we formulate a linear complementarity problem LCP(q, M) whose solution gives a Nash equilibrium pair of stationary strategies under the limiting average payoff criterion. The matrix M constructed is of the copositive-plus class, so that Lemke's algorithm will process it. We do the same for a special class of N-person stochastic games called polymatrix stochastic games.

0. Introduction. Zero-sum two-person discounted stochastic games with finite state and action spaces were first analyzed by Shapley (1953), who showed that these games possess a value and optimal stationary strategies for the two players. Undiscounted zero-sum games, namely games with Cesàro average payoffs, were first explored by Blackwell and Ferguson (1968); in general, such games lack stationary optimal strategies for the players. Mertens and Neyman (1981) proved the existence of the value in behavior strategies for zero-sum two-person undiscounted games. Non-zero-sum discounted stochastic games were studied independently by Fink (1964) and Takahashi (1964), who proved the existence of Nash equilibria in stationary strategies. Recently, Vieille (1997) proved that undiscounted non-zero-sum two-person stochastic games possess Nash equilibrium payoffs.

A problem is said to possess the ordered field property if the entries of some solution to the problem lie in the same Archimedean ordered field as the data defining the problem. The ordered field property, with an eye toward finite-step algorithms, was first studied by Parthasarathy and Raghavan (1981) for single-controller stochastic games, and was subsequently exploited for other classes by many authors (Filar 1979, Parthasarathy et al. 1984, Filar and Schultz 1987, Raghavan et al. 1986, Mohan and Raghavan 1987). Parthasarathy and Raghavan (1981) showed that single-controller stochastic games (zero-sum or non-zero-sum, discounted or undiscounted) possess stationary equilibria as well as the ordered field property. For discounted single-controller zero-sum games they showed this constructively, by reducing the problem to solving a linear programming problem. In general, games with undiscounted payoffs are more difficult to handle. Hordijk and Kallenberg (1979) first gave a linear programming reduction of the closely related problem of Markovian decision processes with undiscounted payoff. For zero-sum single-controller undiscounted stochastic games, a linear programming reduction was found independently by Vrieze (1981) and Hordijk and Kallenberg (1984). The usefulness of complementary pivoting algorithms for Markov decision processes was first demonstrated by Eaves (1977), and Schultz (1992) proposed using the LCP to study zero-sum switching-controller stochastic games. Filar and Schultz (1987), and also Filar et al. (1991), developed LCP and bilinear programming reductions of many structured stochastic games. A constructive proof via the simplex method is automatic once a problem is reduced to a linear program. The same cannot be claimed for merely reducing a problem to an LCP: no general necessary and sufficient conditions are known for solving an arbitrary LCP.

Received May 17, 1999; revised January 23, 2001, and March 6, 2001.
MSC 2000 subject classification. Primary: 91A05, 91A15.
OR/MS subject classification. Primary: Games/group decisions, stochastic.
Key words. Stochastic games, linear complementarity problem, Lemke's algorithm.


For discounted non-zero-sum two-person single-controller games, Nowak and Raghavan (1993) gave the first finite-step algorithm for locating a stationary equilibrium pair, by reducing the problem to finding a Nash equilibrium of an associated bimatrix game. Although the Lemke-Howson (1964) algorithm can be used to solve this game, computing the individual entries of the bimatrix game is laborious and the size of the bimatrix game is prohibitively large. For the discounted cost criterion, Mohan et al. (1997) gave an algorithm that reduces the problem to solving a single LCP of the so-called Eaves class $\mathcal{L}$; the entries defining the matrix of their LCP formulation involve only the data defining the game. Eaves had already shown that this class is solvable by Lemke's (1965) seminal algorithm.

The main contribution of this paper is a constructive proof that non-zero-sum single-controller undiscounted stochastic games have stationary equilibria. We reduce the problem to a single linear complementarity problem of the copositive-plus class. Again, the entries defining the matrix of our LCP formulation involve only the data defining the game, and Lemke (1965) had already shown that his algorithm solves LCPs of this class.

Non-zero-sum polymatrix static games were first introduced by Janovskaya (1968). Miller and Zucker (1991) showed that Lemke's (1965) algorithm can be used to solve polymatrix games. Mohan et al. (1997) considered single-controller discounted N-person polymatrix stochastic games and showed that Lemke's (1965) algorithm will process these games as well (that is, starting from an almost complementary solution, it terminates at a complementary solution). We show that our constructive proof for the two-person single-controller undiscounted case extends to the undiscounted N-person single-controller polymatrix stochastic games. For the sake of notational simplicity, we postpone the mathematical formulation of polystochastic games to the last section.

1. Two-person stochastic games. In a two-person stochastic game $\Gamma$, we have a finite set of states $S = \{1, 2, \ldots, s\}$, and for each state $t \in S$ there are two finite sets, $A(t) = \{1, 2, \ldots, a_t\}$ and $B(t) = \{1, 2, \ldots, b_t\}$, called the action sets of Players I and II, respectively. For each state $t$ and for each $(a, b) \in A(t) \times B(t)$ there is a pair of immediate costs $(r^1(t,a,b),\, r^2(t,a,b))$ to Players (I, II), as well as a probability distribution $p(t,a,b)$ on the set $S$. Given an initial state $t_0 \in S$, the game is played as follows. The players simultaneously choose actions $a_0 \in A(t_0)$ and $b_0 \in B(t_0)$, resulting in the costs $r^1(t_0,a_0,b_0)$ and $r^2(t_0,a_0,b_0)$ to be paid by Players I and II, respectively. The system moves to a new state $t_1$ according to $p(t_0,a_0,b_0)$, and the players again choose actions $a_1 \in A(t_1)$ and $b_1 \in B(t_1)$. Accordingly, the costs $r^1(t_1,a_1,b_1)$ and $r^2(t_1,a_1,b_1)$ are again paid by the respective players, and the game moves to a new state $t_2$ according to $p(t_1,a_1,b_1)$. At each stage, the players are reminded of the entire past history of states, actions, and costs incurred by every player. The game continues indefinitely, generating two streams of costs $r^1(t_i,a_i,b_i)$ and $r^2(t_i,a_i,b_i)$, $i = 0, 1, 2, \ldots$. A general strategy for a player is a function from the set of all possible histories into the set of probability distributions over the player's action space.
A general strategy can therefore be very complicated. Nevertheless, given a pair of strategies $(\sigma, \tau)$ for the two players, we can evaluate the expected $\beta$-discounted values (for $0 \le \beta < 1$):

(1) $\phi_\beta^1(\sigma,\tau)(t_0) = \sum_{n=0}^{\infty} \beta^n\, r_n^1(t_0,\sigma,\tau),$

(2) $\phi_\beta^2(\sigma,\tau)(t_0) = \sum_{n=0}^{\infty} \beta^n\, r_n^2(t_0,\sigma,\tau),$

where $t_0$ is the starting state, and $r_n^1(t_0,\sigma,\tau)$ ($r_n^2(t_0,\sigma,\tau)$) is the expected cost to Player I (II) at the $n$th stage when the players use $\sigma$ and $\tau$. Thus, we treat $\phi_\beta^1(\sigma,\tau)$ and $\phi_\beta^2(\sigma,\tau)$ as payoff vectors indexed by the starting state. We say that a pair of strategies $(\sigma^*, \tau^*)$ makes up a $\beta$-discounted equilibrium point (in the sense of Nash) if for any pair of strategies $(\sigma, \tau)$ the following conditions hold simultaneously:

(3) $\phi_\beta^1(\sigma^*,\tau^*)(t) \le \phi_\beta^1(\sigma,\tau^*)(t),$

(4) $\phi_\beta^2(\sigma^*,\tau^*)(t) \le \phi_\beta^2(\sigma^*,\tau)(t),$

for $t = 1, \ldots, s$.

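For a fixed pair of stationary strategies, the discounted evaluation (1)-(2) collapses to a linear system: with $P$ the induced transition matrix (formally defined in the next section) and $r$ the vector of expected immediate costs, $\phi_\beta = \sum_n \beta^n P^n r = (I - \beta P)^{-1} r$. The sketch below is our illustration of this standard dynamic-programming identity; the names `P`, `r`, and `beta` are our own, not the paper's.

```python
import numpy as np

def discounted_value(P, r, beta):
    """Expected beta-discounted cost vector under fixed stationary
    strategies: phi = sum_n beta^n P^n r = (I - beta*P)^{-1} r.
    P is the induced s x s transition matrix, r the s-vector of
    expected immediate costs. A standard MDP identity, used here
    only as a sanity check; not a quote from the paper."""
    s = len(r)
    return np.linalg.solve(np.eye(s) - beta * P, r)
```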
Another payoff criterion, generally more difficult to handle, is the limiting average cost. Under the same notation as above, we explicitly write

(5) $\phi^1(\sigma,\tau)(t_0) = \limsup_{N \to \infty} \frac{1}{N+1} \sum_{n=0}^{N} r_n^1(t_0,\sigma,\tau),$

(6) $\phi^2(\sigma,\tau)(t_0) = \limsup_{N \to \infty} \frac{1}{N+1} \sum_{n=0}^{N} r_n^2(t_0,\sigma,\tau),$

to represent the expected limiting average cost (also called the undiscounted cost) to the respective players when they use the pair of strategies $(\sigma,\tau)$. Dropping the $\beta$'s gives the analogues of (3) and (4) under this second cost criterion: a pair of strategies $(\sigma^*, \tau^*)$ for Players I and II makes up an undiscounted equilibrium point (in the sense of Nash) if for any pair of strategies $(\sigma, \tau)$ the following conditions hold simultaneously:

(7) $\phi^1(\sigma^*,\tau^*)(t) \le \phi^1(\sigma,\tau^*)(t),$

(8) $\phi^2(\sigma^*,\tau^*)(t) \le \phi^2(\sigma^*,\tau)(t),$

for all $t \in S$.

It should be noted that adding a fixed constant to all the payoffs does not affect the equilibrium points of the game. Hence, without loss of generality, we can assume that all immediate costs are strictly positive.

A strategy for Player I (II) is called stationary if it depends only on the current state and not on the past history of the play. Such strategies may be mixed; nevertheless, they are much simpler than general behavior strategies, which can vary with the states visited, the actions chosen at those states, and the time taken by the players to reach the current state. This simplicity gives hope for finite algorithms to solve games that possess stationary equilibrium points (equilibrium pairs with both strategies stationary). The notation used here is that, for any $a \in A(t)$, $\sigma(t,a)$ is the probability that Player I chooses action $a$ in state $t$ under the strategy $\sigma$; similarly, for any $b \in B(t)$, $\tau(t,b)$ is the probability that Player II chooses action $b$ in state $t$ under the strategy $\tau$. Given a pair of stationary strategies $(\sigma,\tau)$, the expected immediate cost to player $\alpha$ ($\alpha = 1,2$) starting at state $t$ can be written as the $t$th coordinate of the $s \times 1$ column vector $r^\alpha(\sigma,\tau)$, given by

$r^\alpha(\sigma,\tau)(t) = \sum_{a=1}^{a_t} \sum_{b=1}^{b_t} r^\alpha(t,a,b)\,\sigma(t,a)\,\tau(t,b), \qquad \alpha = 1,2.$

We also write $P(\sigma,\tau)$ for the $s \times s$ matrix whose $(t,l)$th entry is $\sum_{a=1}^{a_t} \sum_{b=1}^{b_t} p(t,a,b)(l)\,\sigma(t,a)\,\tau(t,b)$. Note that $P(\sigma,\tau)$ is simply the transition matrix of the Markov chain induced by $\sigma$ and $\tau$. Accordingly, $P^*(\sigma,\tau)$ will denote the stationary matrix $\lim_{N\to\infty} \frac{1}{N+1}\sum_{t=0}^{N} P^t(\sigma,\tau)$.

2. Two-person single-controller stochastic games. A two-person stochastic game is of the single-controller type if the transition probabilities depend on the actions of only one of the players. In this paper, we take the controlling player to be Player II. In terms of our notation this means that

(9) $p(t,a,b) = p(t,1,b) \qquad \forall\, t \in S,\ \forall\, (a,b) \in A(t) \times B(t).$

Hence, we may write $p(t,a,b) \equiv p(t,b)$.
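Both the induced chain and its Cesàro limit are easy to compute numerically. The sketch below is our illustration for the single-controller case just described (function names and data layout are our own): it builds $P(\tau)$ from Player II's strategy alone and approximates $P^*(\tau)$ by direct averaging of powers. An exact computation would instead solve the stationary equations of each recurrent class.

```python
import numpy as np

def induced_chain(p, tau):
    """P(tau)[t][l] = sum_b p(t,b)(l) * tau(t,b). In the single-
    controller case the chain depends only on Player II's strategy.
    p[t][b] is the transition distribution over next states."""
    s = len(p)
    P = np.zeros((s, s))
    for t in range(s):
        for b, dist in enumerate(p[t]):
            P[t] += tau[t][b] * np.asarray(dist)
    return P

def cesaro_limit(P, N=10_000):
    """Approximate P* = lim (1/(N+1)) * sum_{t=0}^{N} P^t by averaging
    powers; an illustrative approximation, not an exact method."""
    avg, power = np.zeros_like(P), np.eye(P.shape[0])
    for _ in range(N + 1):
        avg += power
        power = power @ P
    return avg / (N + 1)
```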


The following stochastic game is an example of a single-controller game with two states. The rows of the two matrices correspond to the action sets $A(t)$ of Player I at states $t = 1, 2$, respectively; the columns correspond to the action sets $B(t)$ of Player II at states $t = 1, 2$. The transitions to the various states, which depend only on the actions of the column player (Player II), appear below each cost matrix.

State 1:

                    1             2
        1         (1, 2)        (3, 5)
        2         (6, 1)        (2, 2)
                (1/2, 1/2)    (1/3, 2/3)

State 2:

                    1             2             3
        1         (5, 1)        (2, 4)        (3, 2)
        2         (1, 8)        (7, 3)        (6, 7)
                (1/4, 3/4)    (2/3, 1/3)    (1/5, 4/5)
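For later reference, the example's data can be transcribed directly into Python (our transcription of the two tables above; indices are 0-based and the variable names are ours):

```python
# Cost arrays r[state][row a][column b] and Player II's transitions
# p[state][column b] = distribution over next states, transcribed
# from the two tables above.
r1 = [[[1, 3], [6, 2]], [[5, 2, 3], [1, 7, 6]]]   # Player I's costs
r2 = [[[2, 5], [1, 2]], [[1, 4, 2], [8, 3, 7]]]   # Player II's costs
p  = [[[1/2, 1/2], [1/3, 2/3]],
      [[1/4, 3/4], [2/3, 1/3], [1/5, 4/5]]]
```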
Player I is the row player and Player II is the column player. The ordered pairs represent the immediate costs to the players: the first coordinate in each cell is the immediate cost to Player I and the second coordinate is the immediate cost to Player II. For example, if Players I and II choose Row 2 and Column 3 in State 2, Player I incurs an immediate cost 6 and Player II incurs an immediate cost 7. In both states, the transition probability vector depends only on the column player's choice, so this is a single-controller (Player II) stochastic game with two states, in which Player I has two actions in each state while Player II has two actions in State 1 and three actions in State 2. For example, in State 2, if Player II chooses Column 3, the game moves next to State 1 with probability 1/5 and to State 2 with probability 4/5.

3. The linear complementarity problem. The linear complementarity problem can be stated as follows. Given a vector $q \in R^n$ and a matrix $M \in R^{n \times n}$, find a vector $z$ such that

(10) $w = q + Mz,$

(11) $z \ge 0,\quad w \ge 0,$

(12) $z^T w = 0.$
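Conditions (10)-(12) are easy to test mechanically. The helper below is our illustration (the function name and tolerance handling are ours): it checks whether a proposed $z$ solves LCP(q, M) up to a numerical tolerance.

```python
import numpy as np

def is_lcp_solution(q, M, z, tol=1e-8):
    """Check (10)-(12): w = q + Mz, z >= 0, w >= 0, z'w = 0,
    in a tolerance-based form suitable for floating point."""
    w = q + M @ z
    return (z >= -tol).all() and (w >= -tol).all() and abs(z @ w) <= tol
```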
The system (10)-(12) is usually denoted by LCP(q, M). The LCP is a generalization of the well-known linear program (LP). In historic work, C. E. Lemke (1965) gave a simplex-like pivoting algorithm as a possible approach to solving LCPs. Unfortunately, the algorithm does not always find a solution to a given LCP. There are, however, certain classes of matrices $M$ for which Lemke's (1965) algorithm will solve LCP(q, M).

For non-zero-sum bimatrix games, linear programming is not enough. However, Lemke and Howson (1964) showed that the problem is reducible to solving a single LCP, and the Lemke-Howson (1964) algorithm always generates a solution for the class of LCPs arising from bimatrix games. Lemke (1965) studied the general LCP and showed that, in general, an almost complementary path starting at one end of an unbounded ray may fail to terminate in a complementary solution and may instead end at the extreme end of a secondary ray. However, Lemke (1965) and Eaves (1971) showed that for a wide subclass of matrices $M$, called matrices of class $\mathcal{L}$, a properly initiated almost complementary path terminates in a complementary extreme solution. This gives hope for the non-zero-sum single-controller stochastic games. Indeed, Mohan et al. (1997) gave an LCP reduction of the discounted single-controller non-zero-sum case to the class $\mathcal{L}$ studied by Eaves (1971) and Garcia (1973). For an encyclopedic reference on the LCP, see Cottle, Pang, and Stone (1992).

In this paper, we show that under the limiting average cost criterion these games can be efficiently solved by a single "Lemke-processible" LCP; that is, an algorithm that initiates at the unique almost complementary extreme solution of an unbounded edge, passes through almost complementary extreme solutions at all intermediate iterations, and finally terminates in a complementary solution that solves the problem. Indeed, our main effort here is to reduce the problem to an LCP whose matrix $M$ is of the special type first considered by Lemke (1965), namely of the copositive-plus class.

4. Formulation. We begin by analyzing Conditions (7) and (8) for the single-controller game. We first note that if, in (7) and (8), we restrict $\sigma^*$, $\tau^*$, $\sigma$, and $\tau$ to stationary strategies only, then $(\sigma^*, \tau^*)$ is still an equilibrium point as defined without the restriction. This can be seen by noticing that when one player's strategy is fixed at a stationary one, the other player faces a Markovian decision process (MDP). From the dynamic programming literature (Blackwell 1962) it is well known that $\phi^i(\sigma,\tau) = P^*(\sigma,\tau)\, r^i(\sigma,\tau)$ for $i = 1,2$. Since the transitions depend only on Player II's actions, $P(\sigma,\tau)$ can be written as $P(\tau)$. Now (7) reduces to the componentwise inequality

(13) $P^*(\tau^*)\, r^1(\sigma^*,\tau^*) \le P^*(\tau^*)\, r^1(\sigma,\tau^*) \qquad \forall\, \sigma.$

The standard reduction of (13) would entail "canceling" the $P^*(\tau^*)$ from both sides of the inequality and simply requiring that

(14) $r^1(\sigma^*,\tau^*) \le r^1(\sigma,\tau^*) \qquad \forall\, \sigma.$

Here (14) is a stronger condition than (13), and in many cases much more than we need. After all, (13) is satisfied as long as $r^1(\sigma^*,\tau^*)(t) \le r^1(\sigma,\tau^*)(t)\ \forall \sigma$ holds for all recurrent states $t$ of the Markov chain induced by $\tau^*$ (because $P^*(\tau^*)$ is zero in the columns corresponding to transient states), whereas (14) requires that it hold for all states. On Player II's side of the equilibrium condition there is really no reduction: the requirement in (8), unlike that in (7), can change entirely with each strategy $\sigma$. Thus, the task of formulating a mathematical program to solve (7) and (8) involves a combination of the traditional MDP (coming from Player II's side of the game) and simple linear inequalities (from Player I's side of the game). Fortunately, limiting average MDPs have a nice linear programming formulation. Therefore, in constructing our LCP, we will embed both the LP arising from (8) and the linear inequalities arising from (14) into a single LCP.

Next we consider the MDP that arises when Player I fixes his strategy at a stationary $\sigma$. When Player II chooses action $b$ in state $t$, the expected immediate cost incurred is $\tilde r(t,b) = \sum_{a=1}^{a_t} \sigma(t,a)\, r^2(t,a,b)$. The transitions of the MDP are the same as those of the original game, since $\sigma$ has no influence on them. Using the LP formulation for limiting average MDPs, Player II's best reply to $\sigma$ comes as a solution to the following pair of dual LPs:

Primal:

Maximize $\sum_{i=1}^{s} \frac{1}{s}\,\phi_i$

subject to

(15) $\phi_i - \sum_{j=1}^{s} p(i,b)(j)\,\phi_j \le 0 \qquad \forall\, i \in S,\ \forall\, b \in B(i),$

(16) $\phi_i + u_i - \sum_{j=1}^{s} p(i,b)(j)\, u_j \le \tilde r(i,b) \qquad \forall\, i \in S,\ \forall\, b \in B(i),$

(17) $\phi_i,\ u_i$ unrestricted $\qquad \forall\, i \in S.$

Dual:

Minimize $\sum_{i=1}^{s} \sum_{b \in B(i)} \tilde r(i,b)\, x_{ib}$

subject to

(18) $\sum_{b \in B(j)} x_{jb} + \sum_{b \in B(j)} y_{jb} - \sum_{i=1}^{s} \sum_{b \in B(i)} p(i,b)(j)\, y_{ib} = \frac{1}{s} \qquad \forall\, j \in S,$

(19) $\sum_{b \in B(j)} x_{jb} - \sum_{i=1}^{s} \sum_{b \in B(i)} p(i,b)(j)\, x_{ib} = 0 \qquad \forall\, j \in S,$

(20) $x_{ib} \ge 0,\quad y_{ib} \ge 0 \qquad \forall\, i \in S,\ \forall\, b \in B(i).$
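For concreteness, the dual program (18)-(20) can be set up and solved with an off-the-shelf LP solver. The sketch below is our illustration using scipy; the function name, the data layout `p[i][b]` (transition law) and `r_tilde[i][b]` (expected cost), and the variable ordering are all our own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def best_reply_lp(p, r_tilde):
    """Solve the dual LP (18)-(20) for Player II's best reply.
    Variable order: all x_{ib} first, then all y_{ib}. Sketch only."""
    s = len(p)
    idx = [(i, b) for i in range(s) for b in range(len(p[i]))]
    m = len(idx)                         # number of (state, action) pairs
    c = np.array([r_tilde[i][b] for (i, b) in idx] + [0.0] * m)
    A_eq = np.zeros((2 * s, 2 * m))
    b_eq = np.zeros(2 * s)
    for j in range(s):
        for k, (i, b) in enumerate(idx):
            # (18): sum_b x_jb + sum_b y_jb - sum p(i,b)(j) y_ib = 1/s
            A_eq[j, k] += (i == j)
            A_eq[j, m + k] += (i == j) - p[i][b][j]
            # (19): sum_b x_jb - sum p(i,b)(j) x_ib = 0
            A_eq[s + j, k] += (i == j) - p[i][b][j]
        b_eq[j] = 1.0 / s
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (2 * m))
    x, y = res.x[:m], res.x[m:]
    return idx, x, y
```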

The first point is that we can replace (19) with

(21) $\sum_{b \in B(j)} x_{jb} - \sum_{i=1}^{s} \sum_{b \in B(i)} p(i,b)(j)\, x_{ib} \ge 0 \qquad \forall\, j \in S.$

Clearly (19) implies (21). To see the other direction, we just sum (21) over $j$ to get

$0 \le \sum_{j \in S}\Big(\sum_{b \in B(j)} x_{jb} - \sum_{i=1}^{s}\sum_{b \in B(i)} p(i,b)(j)\, x_{ib}\Big)$
$\quad = \sum_{j \in S}\sum_{b \in B(j)} x_{jb} - \sum_{j \in S}\sum_{i=1}^{s}\sum_{b \in B(i)} p(i,b)(j)\, x_{ib}$
$\quad = \sum_{j \in S}\sum_{b \in B(j)} x_{jb} - \sum_{i=1}^{s}\sum_{b \in B(i)} x_{ib}\sum_{j \in S} p(i,b)(j)$
$\quad = \sum_{j \in S}\sum_{b \in B(j)} x_{jb} - \sum_{j \in S}\sum_{b \in B(j)} x_{jb} = 0,$

since each $p(i,b)(\cdot)$ sums to one over $j$. Each term of the first sum is nonnegative by (21), and the terms sum to zero; hence each vanishes, and (21) implies (19).

In this setup, $y_{ib}$ and $x_{ib}$ are complementary to the slacks of (15) and (16), respectively. The replacement of (19) by (21) allows us to restrict the corresponding $u_i$'s to being nonnegative. Suppose we have an optimal solution for both programs, say $(\phi^*, u^*, x^*, y^*)$. Then Player II's optimal strategy $\tau^*$ (against $\sigma$) is extracted as follows:

$\tau^*(i,b) = \begin{cases} x^*_{ib}\Big/\sum_{c \in B(i)} x^*_{ic} & \text{when } \sum_{c \in B(i)} x^*_{ic} > 0, \\ y^*_{ib}\Big/\sum_{c \in B(i)} y^*_{ic} & \text{otherwise.} \end{cases}$

One can verify (using (18)) that $\tau^*$ is well defined. Also, we have $\phi^*(i) = \phi^2(\sigma,\tau^*)(i)$. A key property of this pair of LPs is that those states $i \in S$ for which $\sum_{c \in B(i)} x^*_{ic} = 0$ are transient in the Markov chain induced by $\tau^*$.

Next, we make a few minor adjustments to (15)-(20) so that they can be put into LCP form; these changes have no effect on the optimal solutions of the LPs. Remember that we are assuming $r^i(t,a,b) > 0$, $i = 1,2$, for all $(t,a,b)$. Let $m = \min_{t,a,b} r^2(t,a,b)$. Consider the pair $(\tilde\phi, \tilde u)$, where $\tilde\phi_i = m\ \forall i \in S$ and $\tilde u_i = 0\ \forall i \in S$; it is immediate that this is a feasible solution. It is well known (see Kallenberg 1983, p. 102, or Filar and Vrieze 1996, p. 111) that an optimal $\phi^*$ for this LP is a minimal subharmonic vector and will


dominate every feasible $\phi$ in every coordinate. Hence, any optimal solution $(\phi^*, u^*)$ will satisfy $\phi^*_i \ge \tilde\phi_i = m > 0$. Therefore, we can reduce the feasible set of the primal even further by requiring $\phi_i \ge 0\ \forall i \in S$. Finally, since we know that the optimal $\phi^*$ has all positive coordinates, by complementary slackness we must have equality in (18) for any optimal pair $(x^*, y^*)$. Hence, we can change (18) to

(22) $\sum_{b \in B(j)} x_{jb} + \sum_{b \in B(j)} y_{jb} - \sum_{i=1}^{s}\sum_{b \in B(i)} p(i,b)(j)\, y_{ib} \ge \frac{1}{s} \qquad \forall\, j \in S.$

In other words, for any optimal pair $(x^*, y^*)$, if the $j$th inequality of (22) were strict, it would only hurt any optimal solution $\phi^*$ by forcing $\phi^*(j)$ to 0; this contradicts our assumption that all $r^i(t,a,b) > 0$, $i = 1,2$.

One additional adjustment (the importance of which will be seen later) is to change (16) to

(16') $\phi_i + u_i - \sum_{j=1}^{s} p(i,b)(j)\, u_j \le \tilde r(i,b) + \sum_{j=1}^{s}\sum_{c \in B(j)} x_{jc} \qquad \forall\, i \in S,\ \forall\, b \in B(i).$

Here we have simply added the fixed quantity $\sum_{j=1}^{s}\sum_{c \in B(j)} x_{jc}$ to $\tilde r(i,b)$. Since $x_{ib} \ge 0$, at a complementary solution we have $\phi_i > 0\ \forall i \in S$. Using complementary slackness in the dual LP and summing (22) over $j \in S$, we get $\sum_{i=1}^{s}\sum_{b \in B(i)} x^*_{ib} = 1$. Hence, the only effect of changing (16) to (16') is that the optimal $\phi$ will have all coordinates increased by 1. From dynamic programming, we know that adding a fixed constant to all the immediate costs has no effect on the optimal strategy. Putting in all the changes gives the following set of inequalities:

$\phi_i - \sum_{j=1}^{s} p(i,b)(j)\,\phi_j \le 0 \qquad \forall\, i \in S,\ \forall\, b \in B(i),$

$\phi_i + u_i - \sum_{j=1}^{s} p(i,b)(j)\, u_j \le \tilde r(i,b) + \sum_{j=1}^{s}\sum_{c \in B(j)} x_{jc} \qquad \forall\, i \in S,\ \forall\, b \in B(i),$

$\sum_{b \in B(j)} x_{jb} + \sum_{b \in B(j)} y_{jb} - \sum_{i=1}^{s}\sum_{b \in B(i)} p(i,b)(j)\, y_{ib} \ge \frac{1}{s} \qquad \forall\, j \in S,$

$\sum_{b \in B(j)} x_{jb} - \sum_{i=1}^{s}\sum_{b \in B(i)} p(i,b)(j)\, x_{ib} \ge 0 \qquad \forall\, j \in S,$

$\phi_i,\ u_i,\ x_{ib},\ y_{ib} \ge 0 \qquad \forall\, i \in S,\ \forall\, b \in B(i).$

Any solution of these inequalities that satisfies the complementarity conditions of the LPs provides an optimal stationary strategy for Player II when Player I fixes his strategy at an arbitrary stationary $\sigma$.

Next, we construct a set of inequalities to take care of (13). Recall that (13) is valid provided the vector inequality (14) holds in those coordinates corresponding to recurrent states of the Markov chain induced by $\tau^*$ (again, assume $(\sigma^*, \tau^*)$ satisfy (7) and (8)). Consider the following set of inequalities:

(23) $\sum_{b \in B(i)} r^1(i,a,b)\,\tau^*(i,b) \ge v(i) \qquad \forall\, i \in S,\ \forall\, a \in A(i),$

(24) $\sum_{a \in A(i)} \tilde z_{ia} \ge 1 \qquad \forall\, i \in S,$

(25) $\tilde z_{ia},\ v(i) \ge 0 \qquad \forall\, i \in S,\ \forall\, a \in A(i),$


along with the complementary slackness conditions

(26) $\tilde z_{ia}\Big[\sum_{b \in B(i)} r^1(i,a,b)\,\tau^*(i,b) - v(i)\Big] = 0 \qquad \forall\, i \in S,\ \forall\, a \in A(i),$

(27) $v(i)\Big[\sum_{a \in A(i)} \tilde z_{ia} - 1\Big] = 0 \qquad \forall\, i \in S.$

Suppose (23)-(27) are satisfied by some $(v, \tilde z)$. If $v(i) > 0$ for all $i \in S$, then by (27) we would have $\sum_{a \in A(i)} \tilde z_{ia} = 1\ \forall i \in S$. Define the stationary strategy $\tilde\sigma$ by $\tilde\sigma(i,a) = \tilde z_{ia}$. Fixing $i$ and summing (26) over $a \in A(i)$ gives $r^1(\tilde\sigma,\tau^*)(i) = v(i)$. Substituting this into (23) yields $r^1(\tilde\sigma,\tau^*)(i) \le r^1(\sigma,\tau^*)(i)$ for all stationary $\sigma$. Hence, (14) is satisfied by $\sigma^* = \tilde\sigma$. Unfortunately, this implication rested on $v(i) > 0$ for all $i \in S$, a condition that is not necessarily true; a small adjustment alleviates the problem. We replace (23) with

(28) $\sum_{j=1}^{s}\sum_{c \in A(j)} \tilde z_{jc} + \sum_{b \in B(i)} r^1(i,a,b)\,\tau^*(i,b) \ge v(i) \qquad \forall\, i \in S,\ \forall\, a \in A(i),$

and accordingly replace (26) with

(29) $\tilde z_{ia}\Big[\sum_{j=1}^{s}\sum_{c \in A(j)} \tilde z_{jc} + \sum_{b \in B(i)} r^1(i,a,b)\,\tau^*(i,b) - v(i)\Big] = 0 \qquad \forall\, i \in S,\ \forall\, a \in A(i).$

From (24), we know that for each $i \in S$ there is some $a_i \in A(i)$ with $\tilde z_{ia_i} > 0$. From the corresponding complementarity condition in (29), it then follows that $v(i) > 0$. Now $v(i) > 0$ implies $\sum_{a \in A(i)} \tilde z_{ia} = 1$ (using (27)). Thus $\sum_{i=1}^{s}\sum_{a \in A(i)} \tilde z_{ia} = s$, so in passing from (23) to (28) we have only added a constant to all the inequalities of (23). It is easy to check that this adjustment does not affect the properties of $\tilde\sigma$.

We are now ready to define an LCP whose solutions correspond to equilibrium points. Writing $\lambda_{ia}$ for the LCP variable that plays the role of $\tilde z_{ia}$, the system, call it $(\mathcal{S})$, is:

$w^1_{ia} = \sum_{j=1}^{s}\sum_{c \in A(j)} \lambda_{jc} + \sum_{b \in B(i)} r^1(i,a,b)\, x_{ib} - v(i),$

$w^2_{ib} = \sum_{j=1}^{s} p(i,b)(j)\,\zeta_j - \zeta_i,$

$w^3_{ib} = \sum_{j=1}^{s}\sum_{c \in B(j)} x_{jc} + \sum_{a \in A(i)} \lambda_{ia}\, r^2(i,a,b) + \sum_{j=1}^{s} p(i,b)(j)\, u_j - u_i - \zeta_i,$

$w^4_{j} = \sum_{b \in B(j)} x_{jb} - \sum_{i=1}^{s}\sum_{b \in B(i)} p(i,b)(j)\, x_{ib},$

$w^5_{j} = -\frac{1}{s} + \sum_{b \in B(j)} x_{jb} + \sum_{b \in B(j)} y_{jb} - \sum_{i=1}^{s}\sum_{b \in B(i)} p(i,b)(j)\, y_{ib},$

$w^6_{i} = -1 + \sum_{a \in A(i)} \lambda_{ia};$

$w^1_{ia} \ge 0,\quad \lambda_{ia} \ge 0,\quad \lambda_{ia}\, w^1_{ia} = 0 \qquad \forall\, i \in S,\ \forall\, a \in A(i),$
$w^2_{ib} \ge 0,\quad y_{ib} \ge 0,\quad y_{ib}\, w^2_{ib} = 0 \qquad \forall\, i \in S,\ \forall\, b \in B(i),$
$w^3_{ib} \ge 0,\quad x_{ib} \ge 0,\quad x_{ib}\, w^3_{ib} = 0 \qquad \forall\, i \in S,\ \forall\, b \in B(i),$
$w^4_{j} \ge 0,\quad u_j \ge 0,\quad u_j\, w^4_{j} = 0 \qquad \forall\, j \in S,$
$w^5_{j} \ge 0,\quad \zeta_j \ge 0,\quad \zeta_j\, w^5_{j} = 0 \qquad \forall\, j \in S,$
$w^6_{i} \ge 0,\quad v(i) \ge 0,\quad v(i)\, w^6_{i} = 0 \qquad \forall\, i \in S.$

Next we define $z$, $M$, and $q$ (recall that $w = q + Mz$) so as to construct an LCP that corresponds to the above set of inequalities $(\mathcal{S})$. This set of inequalities has the important property that $M$ is of the copositive-plus class, which allows Lemke's (1965) algorithm for LCP to constructively solve for a complementary solution pair $(w, z)$. Before considering the general case, we advise the reader to first examine the example below, as it will help clarify the process of specifying the LCP. We write

$z^T = (\lambda_{11}, \ldots, \lambda_{sa_s},\ y_{11}, \ldots, y_{sb_s},\ x_{11}, \ldots, x_{sb_s},\ \zeta_1, \ldots, \zeta_s,\ u_1, \ldots, u_s,\ v(1), \ldots, v(s)).$

To label certain subsets of the index set of the matrices involved, we use the pattern $z = (\lambda, y, x, \zeta, u, v)$. Let $n$ be the number of coordinates in $z$. The $n \times n$ matrix $M$ is constructed by various partitions of its rows and columns; to refer to an entry of $M$ we use the labels of $z$, so we may speak of the entry in "row $x_{ib}$" and "column $\zeta_j$" of $M$. We claim that the matrix $M$ associated with the above LCP is of the type

(30) $M = \begin{pmatrix} A & B \\ -B^T & 0 \end{pmatrix}, \qquad A = \begin{pmatrix} E_1 & 0 & R \\ 0 & 0 & 0 \\ S & 0 & E_2 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 0 & D \\ \Gamma_1 & 0 & 0 \\ G & \Gamma_2 & 0 \end{pmatrix}.$

In the partitioned matrix $M$, the northwest corner block $A$ is a square matrix whose three-way split of row and column indices is $(\lambda, y, x)$; the null matrix of the southeast corner block is then by default a square matrix whose three-way split of row and column indices is $(\zeta, u, v)$. All entries of $E_1$ and $E_2$ equal 1. The $(\lambda_{ia}, x_{ib})$th entry of $R$ is $r^1(i,a,b)$ $\forall i \in S$, $\forall a \in A(i)$, $\forall b \in B(i)$; the other entries of $R$ are 0. The $(x_{ib}, \lambda_{ia})$th entry of $S$ is $r^2(i,a,b)$ $\forall i \in S$, $\forall a \in A(i)$, $\forall b \in B(i)$; the other entries of $S$ are 0.

We now describe the northeast block $B$, corresponding to the three-way splits of row and column indices $(\lambda, y, x)$ and $(\zeta, u, v)$, respectively. The $(\lambda_{ia}, v(i))$th entry of $D$ is $-1$ $\forall i \in S$, $\forall a \in A(i)$; the rest of the entries of $D$ are 0. If $i \ne j$, the $(y_{ib}, \zeta_j)$th entry of $\Gamma_1$ is $p(i,b)(j)$; if $i = j$, the $(y_{ib}, \zeta_j)$th entry is $p(i,b)(j) - 1$. $\Gamma_1$ and $\Gamma_2$ are identical: formally, the $(x_{ib}, u_j)$th entry of the latter equals the $(y_{ib}, \zeta_j)$th entry of the former. The $(x_{ib}, \zeta_i)$th entry of $G$ is $-1$ $\forall i \in S$, $\forall b \in B(i)$; the other entries of $G$ are all 0. The southwest corner block is $-B^T$. To complete the construction of the LCP we define the vector $q$ by

$q^T = \Big(0,\ 0,\ 0,\ -\tfrac{1}{s}\, e^T,\ 0,\ -e^T\Big), \qquad \text{partitioned as } z^T = (\lambda^T, y^T, x^T, \zeta^T, u^T, v^T).$
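The block recipe above translates directly into code. The sketch below is our illustration (the function name, data layout, and index bookkeeping are ours, not the authors'); it assembles $M$ and $q$ from the game data in the coordinate order $(\lambda, y, x, \zeta, u, v)$.

```python
import numpy as np

def build_lcp(r1, r2, p):
    """Assemble (q, M) of (30) for a single-controller game.
    r1[i][a][b], r2[i][a][b]: immediate costs; p[i][b][j]: transitions
    controlled by Player II. A sketch of the text's construction."""
    s = len(p)
    lam = [(i, a) for i in range(s) for a in range(len(r1[i]))]
    xy = [(i, b) for i in range(s) for b in range(len(p[i]))]
    nl, nx = len(lam), len(xy)
    n = nl + 2 * nx + 3 * s
    oy, ox = nl, nl + nx
    oz, ou, ov = nl + 2 * nx, nl + 2 * nx + s, nl + 2 * nx + 2 * s
    M = np.zeros((n, n))
    # Block A: E1 (ones on lambda x lambda), E2 (ones on x x x), R, S.
    M[:nl, :nl] = 1.0
    M[ox:ox + nx, ox:ox + nx] = 1.0
    for k, (i, a) in enumerate(lam):
        for l, (j, b) in enumerate(xy):
            if i == j:
                M[k, ox + l] = r1[i][a][b]        # R
                M[ox + l, k] = r2[i][a][b]        # S
        M[k, ov + i] = -1.0                        # D
    # Gamma_1 (y rows, zeta cols), Gamma_2 (x rows, u cols), G (x rows, zeta cols).
    for l, (i, b) in enumerate(xy):
        for j in range(s):
            g = p[i][b][j] - (i == j)
            M[oy + l, oz + j] = g                  # Gamma_1
            M[ox + l, ou + j] = g                  # Gamma_2
        M[ox + l, oz + i] = -1.0                   # G
    # Southwest block: -B^T, where B is the northeast block.
    M[oz:, :oz] = -M[:oz, oz:].T
    q = np.zeros(n)
    q[oz:oz + s] = -1.0 / s                        # zeta coordinates
    q[ov:] = -1.0                                  # v coordinates
    return q, M
```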


Like $z$, $q$ is an $n \times 1$ vector. We set all the coordinates of $q$ to 0 with the exception of the indices in $\zeta$ and $v$: the coordinates of $q$ in $\zeta$ have value $-1/s$, and the coordinates in $v$ have value $-1$. Under this construction of $q$, $M$, and $z$, it is easy to check that Conditions (10), (11), and (12) are equivalent to $(\mathcal{S})$.

While the set of inequalities and equations given by $(\mathcal{S})$ is represented by an LCP, the real problem is whether it can be constructively solved and whether the solution so obtained is actually a solution of the stochastic game problem. In a seminal paper, Lemke (1965) studied general LCP problems and suggested an algorithm that initiates at one end of an unbounded edge of an unbounded polyhedral set, travels along almost complementary vertices, and terminates either at a terminal vertex, which is complementary, or at the almost complementary end vertex of another unbounded edge. With Lemke's (1965) algorithm, termination at a complementary vertex is guaranteed only for certain special classes of matrices $M$, notable among them the so-called copositive-plus class, defined as follows.

Definition 1. A real square matrix $M$ is called a copositive-plus matrix if

(31) $z^T M z \ge 0 \qquad \forall\, z \ge 0,$

(32) $z^T M z = 0,\ z \ge 0 \ \Rightarrow\ (M + M^T)\, z = 0.$
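Copositivity-plus cannot be certified by sampling, but a randomized check is a useful sanity test when experimenting with the construction. The helper below is our own utility, not part of the paper's algorithm: it tests (31) on random nonnegative vectors and (32) on any near-zero witnesses it happens to find.

```python
import numpy as np

def copositive_plus_spot_check(M, trials=10_000, tol=1e-9, seed=0):
    """Randomized necessary-condition test of (31)-(32). Finding no
    violation proves nothing; finding one disproves copositivity-plus."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    for _ in range(trials):
        z = rng.random(n)
        val = z @ M @ z
        if val < -tol:
            return False, z                  # (31) violated
        if abs(val) <= tol and np.linalg.norm((M + M.T) @ z) > 1e-6:
            return False, z                  # (32) violated
    return True, None
```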

 0      0       0         0       0       0       0         0       0       0    q =      0       0       0         0     1 − 2     1 −   2    0         0       −1    −1



   11     12         21       22       y11       y  12      y   21       y22       y23       x   11  z=     x12       x21       x   22       x23      1     2      u1         u2       v1    v2




In the block form of (30), the sub-blocks for this example are as follows, where $E_1$ is the $4 \times 4$ matrix of ones and $E_2$ is the $5 \times 5$ matrix of ones:

$R = \begin{pmatrix} 1 & 3 & 0 & 0 & 0 \\ 6 & 2 & 0 & 0 & 0 \\ 0 & 0 & 5 & 2 & 3 \\ 0 & 0 & 1 & 7 & 6 \end{pmatrix}, \qquad S = \begin{pmatrix} 2 & 1 & 0 & 0 \\ 5 & 2 & 0 & 0 \\ 0 & 0 & 1 & 8 \\ 0 & 0 & 4 & 3 \\ 0 & 0 & 2 & 7 \end{pmatrix},$

$\Gamma_1 = \Gamma_2 = \begin{pmatrix} -\tfrac12 & \tfrac12 \\ -\tfrac23 & \tfrac23 \\ \tfrac14 & -\tfrac14 \\ \tfrac23 & -\tfrac23 \\ \tfrac15 & -\tfrac15 \end{pmatrix}, \qquad D = \begin{pmatrix} -1 & 0 \\ -1 & 0 \\ 0 & -1 \\ 0 & -1 \end{pmatrix}, \qquad G = \begin{pmatrix} -1 & 0 \\ -1 & 0 \\ 0 & -1 \\ 0 & -1 \\ 0 & -1 \end{pmatrix}.$

Written out in full, $M$ is the $20 \times 20$ matrix obtained by substituting these blocks into (30):

$M = \begin{pmatrix} E_1 & 0 & R & 0 & 0 & D \\ 0 & 0 & 0 & \Gamma_1 & 0 & 0 \\ S & 0 & E_2 & G & \Gamma_2 & 0 \\ 0 & -\Gamma_1^T & -G^T & 0 & 0 & 0 \\ 0 & 0 & -\Gamma_2^T & 0 & 0 & 0 \\ -D^T & 0 & 0 & 0 & 0 & 0 \end{pmatrix},$

with rows and columns ordered as $(\lambda_{11}, \lambda_{12}, \lambda_{21}, \lambda_{22},\ y_{11}, \ldots, y_{23},\ x_{11}, \ldots, x_{23},\ \zeta_1, \zeta_2,\ u_1, u_2,\ v(1), v(2))$.
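Feeding the example data transcribed in §2 into our earlier `build_lcp` sketch reproduces this $20 \times 20$ matrix, and the randomized test gives a quick consistency check (both helpers are our illustrations from above, not the authors' code):

```python
q, M = build_lcp(r1, r2, p)     # r1, r2, p as transcribed in Section 2
assert M.shape == (20, 20)
ok, witness = copositive_plus_spot_check(M)
print(ok)                        # expect True: no violation found
```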

5. Main results. Given a two-player, non-zero-sum, single-controller stochastic game $\Gamma$, the following algorithm is suggested:

Algorithm 1.
Step 1. Input the data $s$, $A(1), \ldots, A(s)$, $B(1), \ldots, B(s)$, $r^1(i,a,b)$, $r^2(i,a,b)$, $p(i,b)$.
Step 2. Construct $q$ and $M$ as specified above.
Step 3. Use Lemke's (1965) algorithm to process LCP(q, M).
Step 4. From the solution $z$ obtained in Step 3, set $\sigma^*(i,a) = \lambda_{ia}$ $\forall i \in S$, $\forall a \in A(i)$, and set $\tau^*(i,b)$ according to the following rule:

$\tau^*(i,b) = \begin{cases} x_{ib}\Big/\sum_{c \in B(i)} x_{ic} & \text{if } \sum_{c \in B(i)} x_{ic} > 0, \\ y_{ib}\Big/\sum_{c \in B(i)} y_{ic} & \text{otherwise.} \end{cases}$
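For completeness, here is a compact, dense-tableau sketch of Lemke's complementary pivoting for Step 3. It is our teaching implementation under simplifying assumptions (no lexicographic degeneracy safeguards, naive tolerances), not the authors' code.

```python
import numpy as np

def lemke(q, M, max_iter=1000, tol=1e-9):
    """Lemke's algorithm for LCP(q, M): returns z with q + Mz >= 0,
    z >= 0, z'(q + Mz) = 0, or None on a secondary ray."""
    n = len(q)
    q = np.asarray(q, float)
    if np.all(q >= 0):
        return np.zeros(n)
    # Tableau of I*w - M*z - e*z0 = q; columns w:0..n-1, z:n..2n-1, z0:2n.
    T = np.hstack([np.eye(n), -M, -np.ones((n, 1)), q.reshape(-1, 1)])
    basis = list(range(n))                # all w's basic initially
    row, col = int(np.argmin(q)), 2 * n   # z0 enters at most negative q
    for _ in range(max_iter):
        T[row] /= T[row, col]             # pivot on (row, col)
        for i in range(n):
            if i != row:
                T[i] -= T[i, col] * T[row]
        basis[row], leaving = col, basis[row]
        if leaving == 2 * n:              # z0 left: complementary solution
            break
        # The complement of the leaving variable drives the next pivot.
        col = leaving + n if leaving < n else leaving - n
        ratios = [(T[i, -1] / T[i, col], i) for i in range(n) if T[i, col] > tol]
        if not ratios:
            return None                   # unbounded secondary ray
        row = min(ratios)[1]
    z = np.zeros(n)
    for i, b in enumerate(basis):
        if n <= b < 2 * n:
            z[b - n] = T[i, -1]
    return z
```

Step 4 then normalizes the $x$-coordinates of $z$ (or, in states where they vanish, the $y$-coordinates) to obtain $\tau^*$, exactly as in the LP extraction of §4.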

Theorem 1. The pair $(\sigma^*, \tau^*)$ obtained in Step 4 of Algorithm 1 constitutes a stationary equilibrium solution of $\Gamma$.

Verifying Theorem 1 requires proving that Step 3 does indeed provide a complementary solution of LCP(q, M), and further that the pair $(\sigma^*, \tau^*)$ satisfies Conditions (7) and (8). We prove these in the next two lemmas.

Lemma 1. Lemke's (1965) algorithm will provide a solution to LCP(q, M).

Proof. We show that LCP(q, M) is feasible and that the matrix $M$ is copositive plus. To show that (31) is satisfied by $M$, we simply use the partition in (30). Writing $z = (z_1, z_2)$ in accordance with the partition in (30), we get $z^T M z = z_1^T A z_1$. (Observe that the off-diagonal blocks of $M$ are skew-symmetric reflections of each other, namely $B$ and $-B^T$.)


Since $A$ has nonnegative entries, $z_1^T A z_1 \ge 0$. To show (32), assume that $z^T M z = 0$ with $z \ge 0$. Since $z^T M z = z_1^T A z_1$, we have $z_1^T A z_1 = 0$. Now, using the partition of (30) and splitting $z_1 = (z_{11}, z_{12}, z_{13})$ accordingly, we have

$0 = z_1^T A z_1 = z_{11}^T E_1 z_{11} + z_{13}^T (S + R^T)\, z_{11} + z_{13}^T E_2 z_{13}.$

Since $S + R^T \ge 0$, and since $E_1$ and $E_2$ have all entries positive, we conclude that $z_{11} = 0$ and $z_{13} = 0$. From this it is easy to see that $(M + M^T)\, z = 0$.

The feasibility of LCP(q, M) entails showing that (10) and (11) together have at least one solution $(z, w)$. Of course, we need only specify $z$, as $w = q + Mz$ is readily computable. Suppose $Mz + q \ge 0$, $z \ge 0$ is infeasible. Then by the theorem on alternative systems of inequalities (see Gale 1960, Theorem 2.8, p. 47) there exists a column vector $\eta \ge 0$ such that $\eta^T M \le 0$ and $\eta^T q < 0$. Let us split $\eta$ according to the six blocks of the matrix $M$, writing $\eta$ as a direct sum of six subvectors $\eta_k$, $k = 1, \ldots, 6$, whose dimensions match the corresponding block sizes of $M$. Thus, we can write

$M = \begin{pmatrix} M_{\lambda\lambda} & M_{\lambda y} & M_{\lambda x} & M_{\lambda\zeta} & M_{\lambda u} & M_{\lambda v} \\ M_{y\lambda} & M_{yy} & M_{yx} & M_{y\zeta} & M_{yu} & M_{yv} \\ M_{x\lambda} & M_{xy} & M_{xx} & M_{x\zeta} & M_{xu} & M_{xv} \\ M_{\zeta\lambda} & M_{\zeta y} & M_{\zeta x} & M_{\zeta\zeta} & M_{\zeta u} & M_{\zeta v} \\ M_{u\lambda} & M_{uy} & M_{ux} & M_{u\zeta} & M_{uu} & M_{uv} \\ M_{v\lambda} & M_{vy} & M_{vx} & M_{v\zeta} & M_{vu} & M_{vv} \end{pmatrix}.$

From the first block of columns of $M$ in (30) we can easily see that when $\eta^T M \le 0$, we have $\eta_1 = 0$, $\eta_3 = 0$, $\eta_6 = 0$. From the second block of columns and from the third block of columns, we conclude $\eta_4^T M_{\zeta y} \le 0$ and $\eta_4^T M_{\zeta x} + \eta_5^T M_{ux} \le 0$. Observe that only the coordinates of $q$ corresponding to blocks 4 and 6 are nonzero (and negative); that is, $q_1 = 0$, $q_2 = 0$, $q_3 = 0$, $q_5 = 0$. But $M_{\zeta y} = M_{ux}$. Consider the submatrix

$\begin{pmatrix} M_{\zeta x} \\ M_{ux} \end{pmatrix}.$

If we treat this as the payoff matrix of a zero-sum matrix game in which the row player is the minimizer, we observe that each row of $M_{\zeta x}$ weakly dominates some row of $M_{ux}$. Thus, the value of this game is the same as the value of the game $M_{ux}$. For the latter matrix the columns add up to zero, and so the value of the game is $\le 0$; any optimal strategy of the minimizing row player skips the rows corresponding to $M_{\zeta x}$. Thus, when $\eta_4^T M_{\zeta x} + \eta_5^T M_{ux} \le 0$, either $(\eta_4, \eta_5)$ is the trivial vector 0 or $\eta_5$ alone is nonzero. In either case, since $q_5 = 0$, the condition $\eta^T q < 0$ is not satisfied, and we have a contradiction. Thus, there exists $z \ge 0$ with $q + Mz \ge 0$. Feasibility of LCP(q, M) and the copositive-plus property of $M$ imply that a complementary solution exists and that Lemke's (1965) algorithm will find one. This, together with the inequalities in $(\mathcal{S})$, completes the proof of the lemma.

Remark 1. Since we have proved the existence of a feasible solution for LCP(q, M), and since $M$ is copositive plus, we can use any complementary solution $z^*$ to construct a stationary pair $(\sigma^*, \tau^*)$ via Step 4 of the algorithm.

Lemma 2. The pair $(\sigma^*, \tau^*)$ obtained in Step 4 of the algorithm is a stationary equilibrium pair; namely, it satisfies Conditions (7) and (8). (In the proof we assume that $z$ is a solution of the LCP in the algorithm.)


Proof. We first note that, since the immediate costs $r^i(\cdot,\cdot,\cdot)$ are positive, for any complementary solution $z^* = (\lambda^*, y^*, x^*, \zeta^*, u^*, v^*)$ of LCP(q, M) the first set of inequalities in $(\mathcal{S})$ forces $v^*(i) > 0\ \forall i \in S$, and hence $\sum_{a \in A(i)} \lambda^*_{ia} = 1\ \forall i \in S$. So $\sigma^*$ is indeed a stationary strategy for Player I. When Player I fixes this strategy, the conditions in $(\mathcal{S})$ are sufficient to show that the $\tau^*$ constructed via Step 4 is optimal in the resulting MDP. Thus, (8) is satisfied.

Next we show that (7) holds. Since $\sum_{i \in S}\sum_{b \in B(i)} x^*_{ib} = 1$, there must exist a nonempty subset $T \subseteq S$ for which $\sum_{b \in B(t)} x^*_{tb} > 0$ for all $t \in T$ (recall that these are precisely the recurrent states of the Markov chain induced by $\tau^*$). Now, using the complementarity conditions with $\lambda^*_{ia}$, we get for each $t \in T$

(33) $\sum_{a \in A(t)}\sum_{b \in B(t)} \sigma^*(t,a)\, r^1(t,a,b)\, x^*_{tb} = v^*(t).$

Since for each $t \in T$ we have $\tau^*(t,b) = x^*_{tb}\big/\sum_{b \in B(t)} x^*_{tb}$, we can write

(34) $\sum_{a \in A(t)}\sum_{b \in B(t)} \sigma^*(t,a)\, r^1(t,a,b)\, \tau^*(t,b) = \bar v(t) \qquad \forall\, t \in T,$

where $\bar v(t) = v^*(t)\big/\sum_{b \in B(t)} x^*_{tb}$. But this, together with the inequalities in $(\mathcal{S})$, implies that $r^1(\sigma^*,\tau^*)(t) \le r^1(\sigma,\tau^*)(t)$ $\forall \sigma$, $\forall t \in T$. If $t \notin T$, then $\sum_{b \in B(t)} x^*_{tb} = 0$, and we cannot conclude that $r^1(\sigma^*,\tau^*)(t) \le r^1(\sigma,\tau^*)(t)$ $\forall \sigma$ will necessarily hold. But this does not matter, because the states outside $T$ are transient in the Markov chain induced by $\tau^*$, and in the matrix $P^*(\tau^*)$ the columns corresponding to states outside $T$ are null vectors. From this we can conclude that (7) holds. This completes the proof.

Any solution of LCP(q, M) is a solution to the game $\Gamma$. Also, the proof of Theorem 1 in no way uses the existence of stationary equilibria for single-controller games and is entirely constructive, thus giving a new proof of the existence of stationary equilibria in this class of stochastic games. Given a solution $z = (\lambda, y, x, \zeta, u, v)$ of LCP(q, M), the optimal pair of strategies $(\sigma^*, \tau^*)$ is obtained from the algorithm after some trivial arithmetic operations. We also have $\phi^2(\sigma^*,\tau^*) = \zeta - \bar e$, where $\bar e$ is the $s \times 1$ column vector with all coordinates 1. The only work needed is to compute $\phi^1(\sigma^*,\tau^*) = P^*(\tau^*)\, r^1(\sigma^*,\tau^*)$; here $r^1(\sigma^*,\tau^*) = \bar v - s\,\bar e$. Thus, it is clear that $\Gamma$ possesses the ordered field property.

6. Polymatrix single-controller stochastic games. In a non-zero-sum $N$-person stochastic game, we have a set of $N$ players $\mathcal{N} = \{I_1, I_2, \ldots, I_N\}$. Players $I_i$, $i = 1, \ldots, N$, have respective action sets $A^i(t)$, $i = 1, \ldots, N$, in each state $t$ of the state space $S = \{1, \ldots, s\}$. The game is played just as in two-person stochastic games. Given a starting state $t_0$, all $N$ players simultaneously choose actions $a_0^1, a_0^2, \ldots, a_0^N$, $a_0^i \in A^i(t_0)$, whence come $N$ immediate costs $r^i(t_0, a_0^1, a_0^2, \ldots, a_0^N)$, $i = 1, \ldots, N$, where $I_i$ pays $r^i$. The game moves to a new state $t_1$ according to a probability distribution $p(t_0, a_0^1, \ldots, a_0^N)$ on $S$. At each stage, players are reminded of the entire past history of states, actions, and costs incurred by every player, and the process continues indefinitely. Given a collection of strategies $(\nu_1, \ldots, \nu_N)$ for the players, we again write

(35) $\phi^i(\nu_1, \ldots, \nu_N)(t_0) = \limsup_{T \to \infty} \frac{1}{T+1}\sum_{x=0}^{T} r_x^i(t_0, \nu_1, \ldots, \nu_N), \qquad i = 1, \ldots, N,$

where $r_x^i(\nu_1, \ldots, \nu_N)$ is the expected cost to $I_i$ at the $x$th stage when the players use $(\nu_1, \ldots, \nu_N)$ and the starting state is $t_0$. We say that an $N$-tuple of strategies $(\nu_1^*, \ldots, \nu_N^*)$ is an equilibrium point if for any $N$-tuple of strategies $(\nu_1, \ldots, \nu_N)$ we have

(36) $\phi^i(\nu_1^*, \ldots, \nu_N^*) \le \phi^i(\nu_1^*, \ldots, \nu_N^* \mid \nu_i), \qquad i = 1, \ldots, N,$


where $(\nu_1^*, \ldots, \nu_N^* \mid \nu_i)$ means replacing $\nu_i^*$ with $\nu_i$ while keeping all other strategies fixed. We again point out that if we restrict $(\nu_1^*, \ldots, \nu_N^*)$ and $(\nu_1, \ldots, \nu_N)$ to stationary strategies, then (36) will still yield an equilibrium point in the sense of the definition without the restriction.

In a single-controller polymatrix stochastic game, a simplified cost structure is assumed. This simplification (along with the single-controller condition) allows (36) to be written as a system of linear inequalities with complementarity conditions. We assume that $I_N$ controls the transitions of the game, so that $p(t, a^1, \ldots, a^N) = p(t, a^N)$. To help distinguish player $I_N$ in the notation, we write $B(t) = A^N(t)$ and refer to $I_N$'s strategy as $\tau$ instead of $\nu_N$. We also write $\mathcal{N}_0 = \{I_1, \ldots, I_{N-1}\}$. As in polymatrix games, we assume that the costs can be split as

(37) $r^i(t, a^1, \ldots, a^N) = \sum_{j \in \mathcal{N},\, j \ne i} R^{ij}(t, a^i, a^j) \qquad \forall\, i \in \mathcal{N}.$

Here, $R^{ij}(t, a^i, a^j)$ can be thought of as a partial cost paid by $I_i$ as a result of the mutual action $(a^i, a^j)$ of $I_i$ and $I_j$. Hence, the cost functions $r^i$ are completely determined by the partial costs $R^{ij}$. If $N = 2$, it is easy to check that the two-person game of the previous sections can be formulated to fit this model.

The following set of inequalities will provide an equilibrium point for such a game. The verification is nearly identical to that of the previous sections (with obvious modifications):

$w^1_i(t,a) = \sum_{j \in S}\sum_{c \in A^i(j)} \nu_i(j,c) + \sum_{k \in \mathcal{N}_0,\, k \ne i}\ \sum_{c \in A^k(t)} R^{ik}(t,a,c)\, \nu_k(t,c) + \sum_{b \in B(t)} R^{iN}(t,a,b)\, x_{tb} - v_i(t),$

$w^2(t,b) = \sum_{j \in S} p(t,b)(j)\, \zeta_j - \zeta_t,$

$w^3(t,b) = \sum_{j \in S}\sum_{c \in B(j)} x_{jc} + \sum_{k \in \mathcal{N}_0}\ \sum_{c \in A^k(t)} R^{Nk}(t,b,c)\, \nu_k(t,c) + \sum_{j \in S} p(t,b)(j)\, u_j - u_t - \zeta_t,$

$w^4(t) = \sum_{b \in B(t)} x_{tb} - \sum_{j \in S}\sum_{b \in B(j)} p(j,b)(t)\, x_{jb},$

$w^5(t) = -\frac{1}{s} + \sum_{b \in B(t)} x_{tb} + \sum_{b \in B(t)} y_{tb} - \sum_{j \in S}\sum_{b \in B(j)} p(j,b)(t)\, y_{jb},$

$w^6_i(t) = -1 + \sum_{a \in A^i(t)} \nu_i(t,a);$

$w^1_i(t,a) \ge 0,\quad \nu_i(t,a) \ge 0,\quad \nu_i(t,a)\, w^1_i(t,a) = 0 \qquad \forall\, i \in \mathcal{N}_0,\ \forall\, t \in S,\ \forall\, a \in A^i(t),$
$w^2(t,b) \ge 0,\quad y_{tb} \ge 0,\quad y_{tb}\, w^2(t,b) = 0 \qquad \forall\, t \in S,\ \forall\, b \in B(t),$
$w^3(t,b) \ge 0,\quad x_{tb} \ge 0,\quad x_{tb}\, w^3(t,b) = 0 \qquad \forall\, t \in S,\ \forall\, b \in B(t),$
$w^4(t) \ge 0,\quad u_t \ge 0,\quad u_t\, w^4(t) = 0 \qquad \forall\, t \in S,$
$w^5(t) \ge 0,\quad \zeta_t \ge 0,\quad \zeta_t\, w^5(t) = 0 \qquad \forall\, t \in S,$
$w^6_i(t) \ge 0,\quad v_i(t) \ge 0,\quad v_i(t)\, w^6_i(t) = 0 \qquad \forall\, i \in \mathcal{N}_0,\ \forall\, t \in S.$

In this set of inequalities, call it $(\tilde{\mathcal{S}})$, the notation is slightly different from that of $(\mathcal{S})$: for $i \in \mathcal{N}_0$, $t \in S$, and $a \in A^i(t)$, $\nu_i(t,a)$ is the probability that player $I_i$ chooses action $a$ when the system is in state $t$. Player $I_N$'s strategy $\tau$ is extracted from $(x, y)$ just as before.


 Let ait be Next we construct LCPq M whose solution corresponds to a solution of . i the number of elements in the set A t and bt the number of elements in Bt. We begin by defining the coordinates of zT , the transpose of z.     iT = i 1 1 i 1 2 · · · i s ais   i ∈  xT = x11 x12 · · · xsbs    y T = y11 y12 · · · ysbs  T = 1 2 · · · s    T uT = u1 u2 · · · us  vi = vi 1 vi 2 · · · vi s  i ∈ 

Let

  vT = v1 T v2 T · · · vN −1 T

and

  T  T = 1T 2T · · · N −1

Finally, we write zT as   zT =  T y T xT T uT vT

Just as before, we will use the coordinates of z to serve as markers for the partitions (and coordinates) of M and q. We first partition the rows and columns of M by yxuv and write     (38) M=

0 The matrix  contains the immediate cost information. Next, we partition  by 1 2  · · · N −1 yx and write   1 12 ··· 1N −1 0 1N      21 2 ··· 2N −1 0 2N       ·  · · · ·      · · ··· · · ·  = (39) 

    · · · ·   ·     N 1 N 2 ··· NN −1 0 N −1N     0 0 ··· 0 0 0  N 1 N 2 ··· NN −1 0 N For i j ∈  i = j we define the entry in the i t ath row and the j t cth column of ij to be Rij t a c, where t ∈ S a ∈ Ai t c ∈ Aj t, and we define all other entries of ij to be zero. The 0s in  represent zero matrices of appropriate dimensions, and for each i ∈  i has all entries equal to one. For each i ∈  we define the entry in the i t ath row and the xtb th column of iN to be RiN t a b, where t ∈ S a ∈ Ai t b ∈ Bt, and we define all other entries of iN to be zero. Similarly, for each j ∈ , we define the entry in the xtb th row and j t ath column of Nj to be RNj t b a, where t ∈ S a ∈ Ai t b ∈ Bt, and we define all other entries of Nj to be zero. This completes the construction of . To define , we partition its rows by yx and its columns by uv as   0 0   =  1 0 0 

(40)

2 0 For each i ∈  t ∈ S and a ∈ At, we define the entry in the i t ath row and vi tth column of  to be equal to −1, and we define all other entries of  to be zero. For each


pair $t, t' \in S$ with $t \ne t'$, define the entry of $\Gamma_1$ in row $y_{tb}$ and column $\zeta_{t'}$ to be $p(t,b)(t')$; for $t \in S$, define the entry of $\Gamma_1$ in row $y_{tb}$ and column $\zeta_t$ to be $p(t,b)(t) - 1$. We define $\Gamma_2 = \Gamma_1$ (on rows $x_{tb}$ and columns $u_{t'}$). For each $t \in S$ and $b \in B(t)$, we define the entry of $G$ in row $x_{tb}$ and column $\zeta_t$ to be $-1$, and we define all other entries of $G$ to be zero. This completes the definition of $M$. For the vector $q$, we set all coordinates in $(\nu, y, x, u)$ to zero, those in $\zeta$ to $-1/s$, and those in $v$ to $-1$.

It is easy to check that with this construction of $q$, $M$, and $z$, Conditions (10), (11), and (12) together are equivalent to $(\tilde{\mathcal{S}})$. It can also be verified that $M$ is copositive plus and that LCP(q, M) is feasible (the verification is virtually identical to that of the previous sections). Given a solution $z^*$ of LCP(q, M), the equilibrium strategies $(\nu_1^*, \ldots, \nu_{N-1}^*)$ can be immediately read off, and the strategy of $I_N$, namely $\tau^*$, can be extracted from $(x^*, y^*)$ just as before. Thus, we have:

Algorithm 2.
Step 1. Input the data $s$, $A^i(t)$, $R^{ij}(t,a,b)$, $p(t,b)$.
Step 2. Construct $q$ and $M$ as specified above.
Step 3. Use Lemke's (1965) algorithm to process LCP(q, M).
Step 4. From the solution $z^*$ obtained in Step 3, $(\nu_1^*, \ldots, \nu_{N-1}^*, \tau^*)$ is an equilibrium point, with $\nu_i^*(t,a)$, $i \in \mathcal{N}_0$, the probability that player $I_i$ chooses action $a \in A^i(t)$ when in state $t$, and $\tau^*(t,b)$ the probability that player $I_N$ chooses action $b \in B(t)$ when in state $t$, where $\tau^*(t,b)$ is defined by the following rule:

$\tau^*(t,b) = \begin{cases} x^*_{tb}\Big/\sum_{c \in B(t)} x^*_{tc} & \text{if } \sum_{c \in B(t)} x^*_{tc} > 0, \\ y^*_{tb}\Big/\sum_{c \in B(t)} y^*_{tc} & \text{otherwise.} \end{cases}$

Summing up the results, we have:

Theorem 2. Algorithm 2 will solve any single-controller polystochastic game.

Just like Theorem 1, Theorem 2 is entirely constructive in that it does not require knowing in advance that a stationary equilibrium point exists. Furthermore, it is easy to see that this class of games also possesses the ordered field property.

Applying our results to the earlier two-person example, we arrive at the solution $\lambda_{11} = 1$, $\lambda_{12} = 0$, $\lambda_{21} = 0.7$, $\lambda_{22} = 0.3$, $x_{11} = 0.5$, $x_{12} = 0$, $x_{21} = 0.3$, $x_{22} = 0.2$, $x_{23} = 0$, and $y_{ib} = 0\ \forall i, b$. This yields the equilibrium strategies $\sigma^*(1,1) = 1$, $\sigma^*(1,2) = 0$, $\sigma^*(2,1) = 0.7$, $\sigma^*(2,2) = 0.3$ for Player I, and $\tau^*(1,1) = 1$, $\tau^*(1,2) = 0$, $\tau^*(2,1) = 0.6$, $\tau^*(2,2) = 0.4$, $\tau^*(2,3) = 0$ for Player II.
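As a quick arithmetic check of the Step 4 normalization on this reported solution (our snippet, with our own variable names):

```python
x = {"state 1": [0.5, 0.0], "state 2": [0.3, 0.2, 0.0]}
for t, xs in x.items():
    total = sum(xs)                 # positive in both states here
    tau = [v / total for v in xs]
    print(t, tau)                   # state 1: [1.0, 0.0]; state 2: [0.6, 0.4, 0.0]
```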

Acknowledgments. Partially funded by NSF Grants DMS 930-1052 and DMS 970-4951.

References

Blackwell, D. 1962. Discrete dynamic programming. Ann. Math. Statist. 33 719-726.
Cottle, R. W., J. S. Pang, R. E. Stone. 1992. The Linear Complementarity Problem. Academic Press, New York.
Eaves, B. C. 1977. Complementary pivot theory and Markovian decision chains. S. Karamardian, ed. Fixed Points: Algorithms and Applications. Academic Press, New York, 59-85.
Filar, J. A., T. A. Schultz. 1987. Bilinear programming and structured stochastic games. J. Optim. Theory Appl. 53 85-104.
Filar, J. A., O. J. Vrieze. 1996. Competitive Markov Decision Processes. Springer-Verlag, New York.
Filar, J. A., T. A. Schultz, F. Thuijsman, O. J. Vrieze. 1991. Nonlinear programming and stationary equilibria of stochastic games. Math. Programming 50 227-237.
Gale, D. 1960. Linear Economic Models. McGraw-Hill, New York.
Garcia, C. B. 1973. Some classes of matrices in linear complementarity theory. Math. Programming 5 299-310.


Hordijk, A., L. C. M. Kallenberg. 1979. Linear programming and Markov decision chains. Management Sci. 25 352-362.
Hordijk, A., L. C. M. Kallenberg. 1984. Linear programming and Markov games. O. Moeschlin, D. Pallaschke, eds. Game Theory and Mathematical Economics. North-Holland, Amsterdam, The Netherlands, 307-319.
Janovskaya, E. B. 1968. Equilibrium points of polymatrix games. Litovskij Matematicheskij Sbornik (Lithuanian Math. Proc.) 8 381-384.
Kallenberg, L. C. M. 1983. Linear Programming and Finite Markovian Control Problems. Mathematical Centre Tract 148, Center for Mathematics and Computer Science, Amsterdam, The Netherlands.
Lemke, C. E. 1965. Bimatrix equilibrium points and mathematical programming. Management Sci. 11 681-689.
Lemke, C. E., J. T. Howson, Jr. 1964. Equilibrium points of bimatrix games. J. Soc. Indust. Appl. Math. 12 413-423.
Mertens, J. F., A. Neyman. 1981. Stochastic games. Internat. J. Game Theory 10 53-66.
Miller, D. A., S. W. Zucker. 1991. Copositive-plus Lemke algorithm solves polymatrix games. Oper. Res. Lett. 10 285-290.
Mohan, S. R., S. K. Neogy, T. Parthasarathy. 1997. Linear complementarity and discounted polymatrix stochastic games when one player controls transitions. M. C. Ferris, J. S. Pang, eds. Proc. Internat. Conf. on Complementarity Problems. SIAM, Philadelphia, PA, 284-294.
Nowak, A. S., T. E. S. Raghavan. 1993. A finite step algorithm via a bimatrix game to a single controller non-zero-sum stochastic game. Math. Programming 59 249-259.
Parthasarathy, T., T. E. S. Raghavan. 1981. An ordered field property for stochastic games when one player controls transition probabilities. J. Optim. Theory Appl. 33 375-392.
Shapley, L. S. 1953. Stochastic games. Proc. Natl. Acad. Sci. 39 1095-1100.
Schultz, T. A. 1992. Linear complementarity and discounted switching controller stochastic games. J. Optim. Theory Appl. 73 89-99.
Vrieze, O. J. 1981. Linear programming and undiscounted stochastic games in which one player controls transitions. OR Spektrum 3 29-35.

T. E. S. Raghavan: Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, 851 S. Morgan, SEO 517, Chicago, IL 60607-7045; e-mail: [email protected]
Z. Syed: The Hull Group, 311 South Wacker Drive, Chicago, IL 60606; e-mail: [email protected]