Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference Shanghai, P.R. China, December 16-18, 2009


Optimization and Identification in a Non-equilibrium Dynamic Game

Yifen Mu and Lei Guo

Abstract— In this paper, we consider optimization and identification problems in a non-equilibrium dynamic game. To be precise, we consider infinitely repeated games between a human and a machine based on the standard Prisoners' Dilemma model. The machine's strategy is assumed to be fixed with k-step memory, which may be unknown to the human. By analyzing the state transfer graph, it will be shown that the optimal strategy maximizing the human's averaged payoff is actually periodic after finitely many steps. This helps us to find the optimal strategy in a feasible way. Moreover, when k = 1, the human will not lose to the machine while optimizing his averaged payoff; but when k ≥ 2, he may indeed lose if he focuses on optimizing his own payoff only. The identifiability problem will also be investigated when the machine's strategy is unknown to the human.

I. INTRODUCTION

Over the past half century, a great deal of research effort has been devoted to adaptive control systems, and much progress has been made in both theory and applications; see, e.g., [1]-[10], among many others. A survey of some basic concepts, methods and results in adaptive systems can be found in [11], where the basic framework together with some fundamental theoretical difficulties in adaptive systems are elaborated in detail. In traditional parametric adaptive control, the controller may be regarded as a single agent (or player) acting on the system based on the information or measurements received, and it is usually designed by incorporating an estimation mechanism: the agent (controller) makes decisions based on estimates of the uncertain (time-varying) parameters that influence the dynamic behavior of the system. Note that in this framework, the time-varying parameter process may be regarded as the strategy of another "agent", save that the action of this "agent" usually does not depend on the control actions, and that it has no intention to gain its own "payoff".

However, in many practical systems, especially social, economic, biological and ecological systems, which involve adaptation and evolution, one often encounters the so-called complex adaptive systems (CAS) introduced by John Holland in [12]. In a CAS, as summarized in [13], a large number of components, called agents, interact with and adapt to (or learn from) each other and their environment, leading to some (possibly unexpected) macro phenomena called emergence.

1. Yifen Mu and Lei Guo are with the Laboratory of Systems and Control, ISS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. [email protected], [email protected]
2. This work was supported by the National Natural Science Foundation of China under Grant 60821091 and by the Knowledge Innovation Project of CAS under Grant KJCX3-SYW-S01.

Despite the flexibility of CAS in modeling a wide class of complex systems, it remains a great challenge to understand the evolution of a CAS mathematically, since the traditionally used mathematical tools appear to give only limited help in the study of CAS, as pointed out in [13].

As an attempt towards initiating a theoretical investigation of CAS, we will, in this paper, consider a dynamic game framework that goes somewhat beyond the current adaptive control framework. Intuitively, we consider a scenario with two heterogeneous agents (players) playing a repeated noncooperative game [14], where each agent makes its decision based on the previous actions and payoffs of both agents, but the law for generating the actions of one agent is assumed to be fixed. Specifically, we will consider infinitely repeated games between a human and a machine based on the standard Prisoners' Dilemma model. The machine's strategy is assumed to be fixed with k-step memory, which may be unknown to the human.

We would like to point out that, to the best of the authors' knowledge, the above non-equilibrium dynamic game framework is neither contained in traditional control theory nor considered in classical game theory. In fact, in the classical framework of game theory, all agents (players) stand in a symmetric position in rationality, in order to reach some kind of equilibrium; whereas, in our framework, the agents do not share a similar decision-making mechanism and do not have the same level of rationality. This difference is of fundamental importance, since in many complex systems, such as non-equilibrium economies (see, e.g., [15]), the agents are usually heterogeneous, and they may indeed differ in either the information they obtain or their ability to utilize it.

It is worth mentioning that there have been considerable investigations in game theory in relation to adaptation and learning, which can be roughly divided into two directions. One is evolutionary game theory, see, e.g., [16][17], in which all agents in a large population are programmed to play certain actions; an action spreads or diminishes according to the value of its corresponding payoff. The other direction is learning in game theory, see, e.g., [18][19], which considers whether the long-run behaviors of individual agents will arrive at some equilibrium [20]-[22]. In both directions, all the agents in the games are equal in their ability to learn or adapt to the strategies of their opponents. Some recent works and reviews can be found in [23]-[28].

The dynamic game framework studied in this paper is partly inspired by the evolutionary game framework in [29], where the best strategy emerges as a result of evolution, whereas our optimal strategy is obtained by optimization and identification. To this end, we need to analyze the state transfer graph


for machine strategies with k-memory. We will show that the optimal strategy maximizing the human's averaged payoff is actually periodic after finitely many steps. Moreover, when k = 1, the human will not lose to the machine while optimizing his averaged payoff; but when k ≥ 2, he may indeed lose if he focuses on optimizing his own payoff only. When the machine's strategy is unknown to the human, we will give a necessary and sufficient condition for identifiability, and will investigate the consequences of identification in our non-equilibrium dynamic game problem.

The remainder of this paper is organized as follows. In Section 2, the main theorems are stated following the problem formulation; in Section 3, the state transfer graph (STG) is described and some useful properties are studied. The proofs of all the theorems are given in Section 4, and Section 5 concludes the paper with some remarks.

II. PROBLEM STATEMENT AND MAIN RESULTS

Consider the well-known Prisoners' Dilemma (PD) game. In this game, the two players are two confederate suspects, called Players 1 and 2. They simultaneously choose their actions C or D, where "C" means the player cooperates with the partner, and "D" means the player defects on the partner. Different action profiles result in different payoffs to the players. The payoff matrix of the PD game can be written in the following standard way (Fig. 1):

                        Player 2
                      C           D
    Player 1    C   (r, r)      (s, t)
                D   (t, s)      (p, p)

Fig. 1. The payoff matrix of the PD game

where the parameters satisfy the following standard conditions for the PD game [13]:

t > r > p > s, \qquad r > \frac{t+s}{2}.    (1)

From the above inequalities, it is easy to see that the action profile (D, D) is the unique Nash equilibrium of this game, which is not Pareto optimal. The purpose of this paper is, however, not to investigate the Nash or Pareto equilibria of the game. Instead, we consider the scenario where Player 1 has the ability to search for the best strategy so as to optimize his payoff, while Player 2 acts according to a given strategy. Clearly, this non-equilibrium dynamic game problem is different from either the standard control problem or the classical game problem.

Vividly, let Player 1 be a human (we say it is a "he" henceforth) while his opponent, Player 2, is a machine. Assume they both know the payoff matrix. The action set of both players is denoted as A = {C, D}, and the time set is discrete, t = 0, 1, 2, .... At time t, both players choose their actions and get their payoffs simultaneously. Let h(t) denote the human's action at time t and m(t) the machine's. Define the history at time t, H_t, as the sequence of the two players' action profiles before time t, i.e., H_t ≜ (m(0), h(0); m(1), h(1); ...; m(t-1), h(t-1)). Denote the set of all possible histories at time t as \hat{H}_t, and the set of all histories over all t as H = \bigcup_t \hat{H}_t. As a start, we consider the case of pure strategies and define the strategy of either player as a function f : H → A. In this paper, we further confine the machine's strategy to finite k-memory as follows:

m(t+1) = f(m(t-k+1), h(t-k+1); \ldots; m(t), h(t)),    (2)

which, obviously, is a discrete function from {0, 1}^{2k} to {0, 1}, where and hereafter 0 and 1 stand for C and D, respectively. Moreover, the following mapping establishes a one-to-one correspondence between the vector set {0, 1}^{2k} and the integer set {1, 2, ..., 2^{2k}}:

s(t) = \sum_{l=0}^{k-1} \bigl\{ 2^{2l+1} m(t-l) + 2^{2l} h(t-l) \bigr\} + 1.    (3)

For convenience, in what follows we will denote s_i = i and call it a state of the game under the given strategies. In the simplest case where k = 1, the above mapping reduces to

s(t) = 2 m(t) + h(t) + 1,    (4)

and the machine strategy (2) can be written as

m(t+1) = f(m(t), h(t)) = a_1 I_{\{s(t)=s_1\}} + \cdots + a_4 I_{\{s(t)=s_4\}} = \sum_{i=1}^{4} a_i I_{\{s(t)=s_i\}},    (5)

which can be simply denoted as a vector A = (a_1, a_2, a_3, a_4) with a_i being 0 or 1.

Obviously, each state s(t) corresponds to a pair (m(t), h(t)), and so, by the definition of the payoff matrix, the human and the machine obtain their payoffs, denoted by p(s(t)) and p_m(s(t)), respectively. Let us further define the extended payoff vector for the human as P(s(t)) ≜ (p(s(t)), w(s(t))), where w(s(t)) indicates the human's payoff relative to the machine's at time t, i.e.,

w(s(t)) = \mathrm{sgn}\{p(s(t)) - p_m(s(t))\} \triangleq w(t),    (6)

where sgn(·) is the sign function and sgn{0} = 0. For the above infinitely repeated game, the human may only observe the payoff vector P(s(t)); but since there is an obvious one-to-one correspondence between P(s(t)) and s(t), we will assume that s(t) is observable to the human at each time t throughout the paper.

Now, for any given human and machine strategies with their corresponding realization, the averaged payoff (or ergodic payoff) [30] of the human can be defined as

P_\infty^+ = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} p(t).    (7)
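To make the above setup concrete, the following short Python sketch (our own illustration, not part of the original paper; the function names are ours) simulates the k = 1 repeated game: the machine follows a strategy vector A as in (5), the state evolves by (4), and the averaged payoff (7) is approximated by a finite truncation. The payoff values s = 0, p = 1, r = 3, t = 5 are an assumption here, matching those used later in Example 3.1.

    # Actions: 0 = C, 1 = D.  Human payoff p(s) for s = 2*m + h + 1, cf. (4),
    # under the assumed parameters s=0, p=1, r=3, t=5.
    HUMAN_PAYOFF = {1: 3, 2: 5, 3: 0, 4: 1}

    def play(A, human_strategy, s0, T):
        """Run T rounds against a 1-memory machine strategy A = (a_1,...,a_4)
        as in (5), and return the truncated average of (7)."""
        s, total = s0, 0
        for _ in range(T):
            m_next = A[s - 1]             # machine's action: m(t+1) = a_{s(t)}
            h_next = human_strategy(s)    # human reacts to the observed state
            s = 2 * m_next + h_next + 1   # state update, cf. (4)
            total += HUMAN_PAYOFF[s]
        return total / T

    # Against ALL C (A = (0,0,0,0)), always defecting earns t = 5 per round.
    print(play((0, 0, 0, 0), lambda s: 1, s0=3, T=1000))   # -> 5.0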

In the case where the limit actually exists, we simply write P∞+ = P∞. Similarly, W∞+ can be defined.

The basic problems that we are going to address are as follows: 1) How can the human choose his strategy g so as to obtain an optimal averaged payoff? 2) In addition to optimality, will the human's payoff necessarily be better than the machine's? 3) When the machine's strategy is unknown to the human, can he still obtain an optimal payoff? The following results give some answers to these questions.

Theorem 2.1. For any machine strategy with finite k-memory, there always exists a human strategy, also with k-memory, such that the human's payoff is maximized and the resulting system state sequence {s(t)} becomes periodic after some finite time.

We remark that, from the proof given in Section 4.1, the optimal payoff value is independent of the initial state whenever all the states in the state transfer graph (STG) of the machine strategy share the same reachable set, for instance when the STG is strongly connected (see Section 3 for the definition of the STG). Furthermore, as will be shown in Example 3.1, the optimal human strategy can be found by searching on the STG based on Theorem 2.1.

For the case of 1-memory, we also have the following nice result:

Theorem 2.2. For any machine strategy with 1-memory, the optimal strategy of the human will not lose to the machine.

Remark 2.1. Theorem 2.2 may not remain valid for machine strategies with general k-memory, k ≥ 2. In fact, when k ≥ 2 the game becomes more complicated and subtle. As will be demonstrated in Section 4, whether the human can win while obtaining his optimal payoff depends on delicate relationships among s, p, r, t. We note that it is the game structure that brings about this somewhat unexpected win-loss phenomenon. What interests us most is that the player solving such a one-sided optimization problem may not always win even if the opponent has a fixed strategy, a phenomenon that cannot be observed in the traditional optimization framework.

As will be shown in Section 3, when the machine strategy is known to the human, the human can find the optimal strategy with the best payoff. A natural question is: what if the machine strategy is unknown? One may hope to identify the machine strategy within finitely many steps before making optimal decisions. A machine strategy parameterized by a vector A (as in (5) for the case k = 1) is called identifiable if there exists a human strategy such that the vector A can be reconstructed from the corresponding realization† and the initial state.

Proposition 2.1. A machine strategy with k-memory is identifiable if and only if its corresponding STG is strongly connected.

Proposition 2.1 is quite intuitive; it can be used to confirm the existence of unidentifiable machine strategies and to help find them.

† Realization: the sequence of states produced by the game given any strategies of both players together with any initial state [21].

As an example, when k = 1, the STG corresponding to the machine strategy A = (0, 0, ∗, ∗) or A = (∗, ∗, 1, 1) is not strongly connected, and so is not identifiable by Proposition 2.1. In fact, as can easily be seen, only part of the entries of such an A = (a_1, a_2, a_3, a_4) can be identified from any given initial state. However, if the machine makes mistakes with a tiny probability, the machine strategy may become identifiable. For example, if it changes its planned decision to any other decision with a small positive probability, then the corresponding STG becomes a Markovian transfer graph which is strongly connected. Hence, all strategies will be identifiable.

To show how to identify the machine strategy, let us consider the case of k = 1 as an example. In this case, one effective way for the human to identify the machine strategy is to randomly choose his action at each time. One can also use the following method to identify the parameters:

h(t+1) = \begin{cases} 0, & \text{if } a_{s(t)} \text{ is not known at time } t, \text{ or } a_{s(t)} \text{ is known but } a_{2a_{s(t)}+1} \text{ is not;} \\ 1, & \text{otherwise.} \end{cases}    (8)
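As a rough illustration of this probing rule, the sketch below (our own, with hypothetical helper names) replays (8) against a simulated 1-memory machine and records the entries of A that the human can infer from the observed states. For an identifiable strategy such as the Tit-for-Tat-like A = (0, 1, 0, 1), it recovers all four entries within 7 steps, consistent with Theorem 2.3 below.

    def identify_one_memory(A, s0, max_steps=7):
        """Probe a simulated 1-memory machine strategy A = (a_1,...,a_4),
        indexed by states s = 1..4, following rule (8).
        Returns the entries a_s learned so far as a dict {state: a_s}."""
        known = {}                    # entries a_s identified so far
        s = s0                        # current state s(t), assumed observable
        for _ in range(max_steps):
            a = A[s - 1]              # machine's next action m(t+1) = a_{s(t)}
            if s not in known:
                h = 0                 # rule (8): a_{s(t)} unknown -> play C
            elif (2 * known[s] + 1) not in known:
                h = 0                 # a_{s(t)} known, a_{2a_{s(t)}+1} unknown
            else:
                h = 1                 # otherwise play D
            s_next = 2 * a + h + 1    # state update, cf. (4)
            known[s] = a              # human infers a_{s(t)} from s(t+1) and h
            s = s_next
        return known

    # Tit-for-Tat-like strategy A = (0, 1, 0, 1), starting from s(0) = s1:
    print(identify_one_memory((0, 1, 0, 1), 1))   # all four entries learned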

Theorem 2.3. For any identifiable machine strategy with k = 1, it can be identified using the above human strategy within at most 7 steps from any initial state.

Remark 2.2. For non-identifiable machine strategies, one may be surprised by the possibility that identification may lead to a worse payoff for the human. For example, if the machine takes the non-identifiable strategy A = (0, 1, 1, 1), then by blindly playing "C" the human gets the payoff r at each time by the PD payoff matrix. However, once he tries to identify the machine's strategy, he may use "D" to probe it. The machine will then be provoked and will play "D" forever, which leads to a worse human payoff p < r afterwards.

III. THE STATE TRANSFER GRAPH

To prove the main results of the above section, we need the concept of the State Transfer Graph (STG) together with some of its basic properties. Throughout this section, the machine strategy A = (a_1, a_2, a_3, a_4) is assumed to be known. Given an initial state and a machine strategy, any human strategy {h(t)} leads to a realization of the states {s(1), s(2), ..., s(t), ...}, and hence also produces a sequence of human payoffs {p(s(1)), p(s(2)), ..., p(s(t)), ...}. Thus Question 1) raised in Section 2 becomes to solve

\{h(t)\}_{t=1}^{\infty} = \arg\max P_\infty^+

among all possible human strategies. In order to solve this problem, we need the definition of the STG, and we refer to [31] for some standard concepts in graph theory, e.g., walk, path and cycle.

Definition 3.1. A directed graph with 2^{2k} vertices {s_1, s_2, ..., s_{2^{2k}}} is called the State Transfer Graph (STG) if it contains all the possible infinite walks corresponding to all possible human strategies; equivalently, it contains every possible one-step path or cycle in such walks.
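For the case k = 1 treated next, a minimal sketch (our own illustration; the helper names are hypothetical) of constructing the STG from a strategy vector A and testing the strong connectivity required by Proposition 2.1 could look as follows; it uses the k = 1 edge rule (9) stated in the next paragraph.

    def stg_edges(A):
        """Edge sets of the k = 1 STG for A = (a_1,...,a_4): from s_i the human
        can move to s_j = 2*a_i + 1 (playing C) or s_j = 2*a_i + 2 (playing D)."""
        return {i: {2 * A[i - 1] + 1, 2 * A[i - 1] + 2} for i in range(1, 5)}

    def reachable(edges, start):
        """All vertices reachable from `start` (depth-first search)."""
        seen, stack = set(), [start]
        while stack:
            v = stack.pop()
            for w in edges[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    def strongly_connected(edges):
        """Strongly connected iff every vertex reaches every vertex."""
        vertices = set(edges)
        return all(reachable(edges, v) == vertices for v in vertices)

    # Proposition 2.1: identifiable iff strongly connected.
    print(strongly_connected(stg_edges((0, 1, 0, 1))))  # Tit-for-Tat-like: True
    print(strongly_connected(stg_edges((0, 0, 0, 0))))  # ALL C: False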


In the case of k = 1, for a machine strategy A = (a_1, a_2, a_3, a_4), the STG is a directed graph whose vertices are the states s(t) ∈ {s_1, s_2, s_3, s_4} with s_i = i. An edge s_i s_j exists if s(t+1) = s_j can be realized from s(t) = s_i by choosing h(t+1) = 0 or 1. Since s_i = i, by (4) and (5) this means

the edge s_i s_j exists \Leftrightarrow s_j = 2 a_i + 1 \text{ or } s_j = 2 a_i + 2,    (9)

and the way to realize this transfer is to take the human's action h = (s_j - 1) mod 2, by (4). By the definition above, one machine strategy leads to one STG, and vice versa.

Definition 3.2. A state s_j is called reachable from the state s_i if there exists a path (or cycle) starting from s_i and ending at s_j. All the vertices reachable from s_i constitute a set, called the reachable set of the state s_i. An STG is called strongly connected if every vertex s_i has all vertices in its reachable set.

Furthermore, we need to define the payoff of a walk on the STG as follows:

Definition 3.3. The averaged payoff of an open walk W = v_0 v_1 ... v_l on an STG, with v_0 ≠ v_l, is defined as

p_W \triangleq \frac{p(v_0) + p(v_1) + \cdots + p(v_l)}{l+1},    (10)

and the averaged payoff of a closed walk W = v_0 v_1 ... v_l, with v_0 = v_l, is defined as

p_W \triangleq \frac{p(v_0) + p(v_1) + \cdots + p(v_{l-1})}{l}.    (11)

Now, we can give some basic properties of the STG.

Lemma 3.1. For a given STG, any closed walk can be divided into finitely many cycles†, such that the edge set of the walk equals the union of the edges of these cycles. In addition, any open walk can be divided into finitely many cycles plus a path.

Lemma 3.2. Assume that a closed walk W = v_0 v_1 ... v_L with length L can be partitioned into cycles W_1, W_2, ..., W_m, m ≥ 1, with respective lengths L_1, L_2, ..., L_m. Then p_W, the averaged payoff of W, can be written as

p_W = \sum_{j=1}^{m} \frac{L_j}{L} p_j,    (12)

where p_1, p_2, ..., p_m are the averaged payoffs of the cycles W_1, W_2, ..., W_m, respectively.

By Theorem 2.1, the state of the repeated PD game will be periodic under the optimal human strategy. This enables us to find the optimal human strategy by searching on the STG, as illustrated in the example below.

Example 3.1. Consider the ALL C strategy A = (0, 0, 0, 0) of the machine, and assume that the parameters in the PD payoff matrix are s = 0, p = 1, r = 3, t = 5. Then the STG can be drawn as shown in Fig. 2, in which s_1 (3, 0) means that under the state s_1 the human gets the payoff vector P(s_1) = (p(s_1), w(s_1)) = (r, sgn(r - r)) = (3, 0). The directed edge s_1 s_2 illustrates that if the human takes action D, he can transfer the state from s_1 to s_2 with payoff vector (5, 1). The other edges can be explained in the same way.

† The definition of a cycle here differs slightly from [31]: we drop the constraint that the length l ≥ 2 and include loops in the concept of "cycle".

Fig. 2. STG of the ALL C machine strategy A = (0, 0, 0, 0): vertices s1 (3, 0), s2 (5, 1), s3 (0, −1), s4 (1, 0), with each directed edge labeled by the human action (h = C or h = D) that realizes it.
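The search suggested by Theorem 2.1 can be sketched in a few lines for the k = 1 case. The following brute-force enumeration of simple cycles (our own illustration, assuming the payoff values of Example 3.1; helper names are hypothetical) reproduces the computation carried out by hand below.

    # Human payoff p(s) for s = 2*m + h + 1, with s=0, p=1, r=3, t=5 assumed.
    PAYOFF = {1: 3, 2: 5, 3: 0, 4: 1}

    def stg_edges(A):
        # k = 1 edge rule (9): from s_i the human can move to 2*a_i+1 or 2*a_i+2.
        return {i: (2 * A[i - 1] + 1, 2 * A[i - 1] + 2) for i in range(1, 5)}

    def reachable_from(edges, s0):
        seen, stack = set(), [s0]
        while stack:
            v = stack.pop()
            for w in edges[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    def best_cycle(A, s0):
        """Enumerate simple cycles reachable from s0 and return the one whose
        averaged payoff (11) is largest (brute force; at most 4 vertices)."""
        edges = stg_edges(A)
        best = (float("-inf"), None)
        def dfs(path):
            nonlocal best
            for w in edges[path[-1]]:
                if w == path[0]:                       # closed a cycle
                    avg = sum(PAYOFF[v] for v in path) / len(path)
                    best = max(best, (avg, tuple(path)))
                elif w not in path:                    # extend the simple path
                    dfs(path + [w])
        for v in reachable_from(edges, s0):
            dfs([v])
        return best

    # ALL C with s(0) = s3: expect averaged payoff 5.0 on the cycle {s2}.
    print(best_cycle((0, 0, 0, 0), 3))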

Now take the initial state to be s(0) = s_3 = (1, 0). Then the reachable set of s_3 is {s_1, s_2}, and we only need to search over cycles whose vertices lie in this set. Obviously, there are three possible cycles, W_1 = {s_1}, W_2 = {s_2}, W_3 = {s_1, s_2}, and by (11) the averaged payoffs of the human are, respectively,

p_{W_1} = p(s_1) = 3, \qquad p_{W_2} = p(s_2) = 5, \qquad p_{W_3} = \frac{p(s_1) + p(s_2)}{2} = 4.

Obviously, the optimal payoff is attained on the cycle W_2 = {s_2}. To drive the system state into this cycle, the human just takes h(1) = 1. Then, by taking h(t) = 1 for t ≥ 2, the optimal state sequence s(t) = s_2, t ≥ 1, is obtained from s(0) = s_3.

Remark 3.1. The search procedure above can be carried out in the general case by an algorithm, which is omitted here for brevity; it can also be seen that for any given machine strategy with k-memory there always exists a search method to find the optimal strategy of the human. Moreover, the optimal payoff remains the same when the initial state varies within a reachable set.

IV. PROOFS OF THE MAIN RESULTS

A. Proof of Theorem 2.1

Proof. We only need to consider the case k = 1. Note that any cycle on the STG has period not greater than 4, and hence, by searching on the STG, we can find a cycle W* with period d ≤ 4 such that its averaged payoff p* is maximized among all possible cycles.

Clearly, a sequence {s*(t)}_{t=1}^{\infty} can be constructed so that the state starting from s(0) enters the cycle W* within the smallest possible number of steps, say T_0. Obviously, the averaged payoff of this realization equals p* by definition.

Next, we prove that the averaged payoff of any state sequence {s(t)}_{t=1}^{\infty} is not greater than p*. Consider any state sequence {s(1), s(2), ..., s(L)} with length L > 4. By Lemma 3.1, the walk s(1)s(2)...s(L) on the STG can be divided into finitely many cycles (plus a path with e ≤ 4 vertices if the walk is open). In the following we need only consider the slightly more complicated case of an open walk. Let the cycles be denoted by W_1, ..., W_n with lengths L_1, ..., L_n and averaged payoffs p_1, ..., p_n, respectively.


Then, it is easy to see that L = L_1 + ... + L_n + e. Suppose the path is v_1 v_2 ... v_e. Then we have

p_L = \frac{L_1}{L} p_1 + \cdots + \frac{L_n}{L} p_n + \frac{p(v_1) + \cdots + p(v_e)}{L} \triangleq \frac{L_1}{L} p_1 + \cdots + \frac{L_n}{L} p_n + \frac{A}{L},    (13)

where 0 ≤ A ≤ α with α a constant, and L - 4 ≤ L_1 + ... + L_n ≤ L.

Now, for the above L, suppose that there are m optimal cycles W* contained in the sequence {s*(t)}_{t=1}^{L}. Then we have {s*(1), ..., s*(L)} = {s*(1), ..., s*(T_0 - 1)} ∪ {m cycles W*} ∪ {remaining states, f in number}, with T_0 - 1 < 4 and f < d ≤ 4. Then L = T_0 - 1 + md + f, and the averaged payoff of {s*(t)}_{t=1}^{L} is

p_L^* = \frac{md}{L} p^* + \frac{p(s^*(1)) + \cdots + p(s^*(T_0-1))}{L} + \frac{\sum p(\text{remaining states})}{L} \triangleq \frac{L_1 + \cdots + L_n}{L} p^* + \frac{md - (L_1 + \cdots + L_n)}{L} p^* + \frac{B}{L},    (14)

where 0 ≤ B ≤ β with β a constant, and L - 6 < md ≤ L. From (13) and (14), we have

p_L - p_L^* = \Bigl\{ \frac{L_1}{L}(p_1 - p^*) + \cdots + \frac{L_n}{L}(p_n - p^*) \Bigr\} - \frac{md - (L_1 + \cdots + L_n)}{L} p^* + \frac{A - B}{L} = \Bigl\{ \frac{L_1}{L}(p_1 - p^*) + \cdots + \frac{L_n}{L}(p_n - p^*) \Bigr\} + \frac{Z}{L} + \frac{A - B}{L},    (15)

where |Z| ≤ γ with γ a constant. Since all p_i ≤ p*, the first part of (15) satisfies

\frac{L_1}{L}(p_1 - p^*) + \cdots + \frac{L_n}{L}(p_n - p^*) \le 0.

Consequently, by letting L → ∞, we know that for any state sequence {s(t)}_{t=1}^{\infty},

P_\infty^+ = \limsup_{L \to \infty} p_L \le \lim_{L \to \infty} p_L^* = p^*.    (16)

From the definition of the STG, the state can be driven into the optimal cycle if the human chooses the optimal strategy g* given by

g^*(s^*(0), s^*(1), \ldots, s^*(t)) = g^*(s^*(t)) = (s^*(t+1) - 1) \bmod 2,

and g* is still of k-memory by (3). □

Remark 4.1. From the proof, we can see that once the state enters the optimal cycle, the limit P∞ exists and equals p*. Moreover, it is worth mentioning that the form of the averaged payoff criterion is important in Theorem 2.1; for other payoff criteria, similar results may not hold.

B. Proof of Theorem 2.2 and Remark 2.1

In the PD game under consideration, the parameters in the payoff matrix satisfy (1). Thanks to these relationships, we have the nice result of Theorem 2.2, which can be proved case by case. The proof is omitted here, while a counterexample will be constructed to prove Remark 2.1.

For k = 2, the counterexample is constructed based on the machine strategy "Two Tits For One Tat", i.e., only when the machine has received the largest payoff value t from the action profile (D, C) k times in a row will it play C once in return. When the human plays the PD game against such a machine strategy, he may lose to the machine if he only concentrates on optimizing his own payoff, which implies that he should not be "too greedy" if he also cares about winning or losing.

Example 4.1. When k = 2, by (2) the machine chooses its action at t + 1 based on (m(t-1), h(t-1); m(t), h(t)). From (3), the state at time t becomes

s(t) \triangleq 8 m(t-1) + 4 h(t-1) + 2 m(t) + h(t) + 1,    (17)

and again, this gives a one-to-one correspondence between s(t) and (m(t-1), h(t-1); m(t), h(t)). Let us denote the state space as {s_1, s_2, s_3, ..., s_16} with s_i = i, i = 1, 2, ..., 16. Note that each state s(t) still gives one of the four payoff values from {p(1) = r, p(2) = t, p(3) = s, p(4) = p}, and so the payoff of the human at state s(t) can be defined as q(s(t)) ≜ p(2 m(t) + h(t) + 1) = p((s(t) - 1) mod 4 + 1). Similar to (5), the machine's strategy can be written as

m(t+1) = \sum_{i=1}^{16} a_i I_{\{s(t)=i\}},    (18)

and denote it as a vector A = (a_1, a_2, ..., a_16) for simplicity. Similar to (9), the STG of any machine strategy with 2-memory can easily be drawn under the following rule:

s_i s_j \text{ exists} \Leftrightarrow s_j = 4[(i-1) \bmod 4] + 2 a_i + 1 \text{ if } h = 0, \text{ or } s_j = 4[(i-1) \bmod 4] + 2 a_i + 2 \text{ if } h = 1.    (19)

Now, if the machine uses the strategy "Two Tits For One Tat", it takes the action m(t+1) = 0 only under the history (m(t-1), h(t-1), m(t), h(t)) = (1, 0, 1, 0), which corresponds to the state s(t) = s_11 by (17). Hence, the strategy corresponds to the parameterized form A′ = (a′_1, a′_2, ..., a′_16) with a′_11 = 0 and a′_i = 1 for i ≠ 11. Its state transfer subgraph with initial state s_1 is shown below (Fig. 3).

By searching on the subgraph in Fig. 3, we can find all the possible cycles together with their respective averaged payoffs. When the parameters in the PD payoff matrix are set to s = 0, p = 1, r = 3, t = 5, the optimal cycle for the human is {s_10, s_7, s_11}, with averaged payoff 5/3. On the other hand, it is not difficult to calculate the machine's payoff along the same cycle, which equals 10/3. Obviously, the human loses.

We remark that if the parameters in the PD payoff matrix change, the above assertion may change too. For instance, if (2s + t)/3 < p, then in the example above the optimal cycle becomes {s_16}, and the human will no longer lose.


Fig. 3. A state transfer subgraph of A′ with initial state s1 (vertices s3, s4, s7, s8, s9, s10, s11, s12, s15, s16).
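For completeness, Example 4.1 can be checked numerically. The sketch below (our own illustration, with hypothetical helper names) encodes states by (17), generates successors by (19), and searches all simple cycles reachable from s1 for the one maximizing the human's averaged payoff; it should report the cycle {s7, s10, s11} (up to rotation) with human payoff 5/3 and machine payoff 10/3, as found above.

    from fractions import Fraction

    def payoffs(m, h):
        """(human, machine) payoffs for the current action pair (m, h),
        with s = 0, p = 1, r = 3, t = 5 as assumed in Example 4.1."""
        return {(0, 0): (3, 3), (0, 1): (5, 0), (1, 0): (0, 5), (1, 1): (1, 1)}[(m, h)]

    def decode(s):
        """s = 8*m1 + 4*h1 + 2*m + h + 1, cf. (17); returns (m1, h1, m, h)."""
        x = s - 1
        return (x >> 3) & 1, (x >> 2) & 1, (x >> 1) & 1, x & 1

    def successors(s, A):
        """Both states reachable from s (human plays C or D), cf. rule (19)."""
        _, _, m, h = decode(s)
        m_next = A[s - 1]
        return [8 * m + 4 * h + 2 * m_next + h_next + 1 for h_next in (0, 1)]

    def best_cycle_for_human(A, s0):
        """Return (human payoff, machine payoff, cycle) for the simple cycle,
        reachable from s0, that maximizes the human's averaged payoff."""
        seen, stack = {s0}, [s0]
        while stack:
            v = stack.pop()
            for w in successors(v, A):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = None
        def dfs(path):
            nonlocal best
            for w in successors(path[-1], A):
                if w == path[0]:
                    hp = Fraction(sum(payoffs(*decode(v)[2:])[0] for v in path), len(path))
                    mp = Fraction(sum(payoffs(*decode(v)[2:])[1] for v in path), len(path))
                    if best is None or hp > best[0]:
                        best = (hp, mp, tuple(path))
                elif w not in path:
                    dfs(path + [w])
        for v in seen:
            dfs([v])
        return best

    # "Two Tits For One Tat": a_11 = 0, a_i = 1 otherwise; start from s1.
    A_ttfot = tuple(0 if i == 11 else 1 for i in range(1, 17))
    print(best_cycle_for_human(A_ttfot, 1))   # human 5/3 < machine 10/3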

V. CONCLUDING REMARKS

In an attempt to study adaptive systems beyond the traditional framework of adaptive control, we have, in this paper, studied some optimization and identification problems in a non-equilibrium dynamic game framework where two heterogeneous agents, called "human" and "machine", play repeated games modeled by the standard Prisoners' Dilemma. By using the concept and properties of the state transfer graph, we have been able to establish some interesting theoretical results which we may not have in a traditional control framework. For example, we have shown that (i) the optimal strategy of the game is periodic after finitely many steps; (ii) optimizing one's own payoff solely may lead to losing to the opponent eventually; and (iii) probing the system improperly for identification may lead to a degraded payoff for the human.

It goes without saying that there may be many implications and extensions of these results. For example: a) would the theorems still be true if the machine strategy contains some randomness? b) what if the PD game is replaced by other games? c) can we construct an online adaptive optimal strategy in general? Furthermore, it would be more challenging to establish a mathematical theory for more complex systems, where many (possibly heterogeneous) agents interact with learning and adaptation. To that end, we hope that the current paper may serve as a starting point.

VI. ACKNOWLEDGMENTS

The motivation of this paper came from an attempt to understand the fundamental differences between adaptive control systems and complex adaptive systems in the area of complexity science. We are grateful to Prof. John Holland of Michigan University for his insightful and stimulating discussions on this issue in the summer of 2006 in Beijing. Thanks also go to Prof. Jing Han, Dr. Zhixin Liu and other members of our weekly seminar on complex systems for their valuable comments and discussions. The authors also thank the reviewers for their comments.

REFERENCES

[1] K. J. Astrom and B. Wittenmark, Adaptive Control, 2nd ed., Addison-Wesley, Reading, MA, 1995.

[2] L. Ljung, System Identification: Theory for the User, 2nd ed., Prentice-Hall, Upper Saddle River, NJ, 1999.
[3] L. Guo and H. Chen, "The Astrom-Wittenmark self-tuning regulator revisited and ELS-based adaptive trackers", IEEE Trans. on Automatic Control, vol. 36, 1991, pp. 802-812.
[4] L. Guo and L. Ljung, "Performance analysis of general tracking algorithms", IEEE Trans. on Automatic Control, vol. 40, 1995, pp. 1388-1402.
[5] L. Guo, "Self-convergence of weighted least-squares with applications to stochastic adaptive control", IEEE Trans. on Automatic Control, vol. 41, 1996, pp. 79-89.
[6] T. E. Duncan, L. Guo and B. Pasik-Duncan, "Continuous-time linear-quadratic Gaussian adaptive control", IEEE Trans. on Automatic Control, vol. 44, 1999, pp. 1653-1662.
[7] L. Guo, "On critical stability of discrete-time adaptive nonlinear control", IEEE Trans. on Automatic Control, vol. 42, 1997, pp. 1488-1499.
[8] G. C. Goodwin and K. S. Sin, Adaptive Filtering, Prediction and Control, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[9] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control, Prentice-Hall, Englewood Cliffs, NJ, 1986.
[10] M. Krstic, I. Kanellakopoulos and P. Kokotovic, Nonlinear and Adaptive Control Design, Wiley-Interscience, John Wiley & Sons, 1995.
[11] L. Guo, "Adaptive systems theory: some basic concepts, methods and results", Journal of Systems Science and Complexity, vol. 16, 2003, pp. 293-306.
[12] J. Holland, Hidden Order: How Adaptation Builds Complexity, Addison-Wesley, Reading, MA, 1995.
[13] J. Holland, "Studying complex adaptive systems", Journal of Systems Science and Complexity, vol. 19, 2006, pp. 1-8.
[14] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory, Society for Industrial and Applied Mathematics, 1999.
[15] W. B. Arthur, S. N. Durlauf and D. Lane, The Economy as an Evolving Complex System II, Addison-Wesley, Reading, MA, 1997.
[16] J. W. Weibull, Evolutionary Game Theory, MIT Press, Cambridge, MA, 1995.
[17] J. Hofbauer and K. Sigmund, "Evolutionary game dynamics", Bulletin of the American Mathematical Society, vol. 40, 2003, pp. 479-519.
[18] D. Fudenberg and D. K. Levine, The Theory of Learning in Games, MIT Press, Cambridge, MA, 1998.
[19] D. Fudenberg and D. K. Levine, "Learning and equilibrium", 2008. Available: http://www.dklevine.com/papers/annals38.pdf.
[20] D. Fudenberg and D. K. Levine, "Self-confirming equilibrium", Econometrica, vol. 61, 1993, pp. 523-545.
[21] E. Kalai and E. Lehrer, "Rational learning leads to Nash equilibrium", Econometrica, vol. 61, 1993, pp. 1019-1045.
[22] E. Kalai and E. Lehrer, "Subjective equilibrium in repeated games", Econometrica, vol. 61, 1993, pp. 1231-1240.
[23] R. Lahkar and W. H. Sandholm, "The projection dynamics and the geometry of population games", Games and Economic Behavior, vol. 64, 2008, pp. 565-590.
[24] J. R. Marden, G. Arslan and J. S. Shamma, "Joint strategy fictitious play with inertia for potential games", IEEE Trans. on Automatic Control, vol. 54, 2009, pp. 208-220.
[25] D. P. Foster and H. P. Young, "Learning, hypothesis testing and Nash equilibrium", Games and Economic Behavior, vol. 45, 2003, pp. 73-96.
[26] J. R. Marden, H. P. Young, G. Arslan and J. S. Shamma, "Payoff based dynamics for multi-player weakly acyclic games", in Proceedings of the 46th IEEE Conference on Decision and Control, New Orleans, USA, 2007, pp. 3422-3427.
[27] Y. Chang, "No regrets about no-regret", Artificial Intelligence, vol. 171, 2007, pp. 434-439.
[28] H. P. Young, "The possible and the impossible in multi-agent learning", Artificial Intelligence, vol. 171, 2007, pp. 429-433.
[29] R. Axelrod, The Evolution of Cooperation, Basic Books, New York, 1984.
[30] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, New York, 1994.
[31] J. Bang-Jensen and G. Gutin, Digraphs: Theory, Algorithms and Applications, Springer-Verlag, London, 2001.
