Learning by Adaptive and Forward-Looking Players with One Period Memory

by Takako Fujiwara-Greve
Institute of Economics, Norwegian School of Management BI, Elias Smiths vei 15, Box 580, Sandvika 1302, Norway
and Department of Economics, Keio University, 2-15-45 Mita, Minato-ku, Tokyo 108-8345, Japan

and Carsten Krabbe Nielsen
Istituto di Politica Economica, Università Cattolica, Via Necchi 5, 20123 Milano, Italy

February, 2003

Corresponding author: Takako Fujiwara-Greve. Phone: +47-6755-7379. Fax: +47-6755-7675. E-mail: [email protected]

Abstract. We investigate how much simple-minded players with one period memory can learn in general stage games. The adaptive and one-step forward-looking behavior rules are sufficient for convergence to a minimal weak curb set. Weak curb sets are in general larger than curb sets, but the minimal ones coincide for many games, including weakly acyclic games and supermodular games. We also provide an example in which more sophisticated players fail to converge to a minimal curb set. A key to convergence is therefore the diversity in reactions to observations, rather than the depth of thinking.

JEL classification number: C73.

Key words: Learning, forward looking, weak curb set, weakly acyclic games, supermodular games, diversity.

1 Introduction

In this paper we investigate how sophisticated players need to be for a learning process to converge. It is well known that simple deterministic dynamics, such as fictitious play and the deterministic adaptive process (the Cournot dynamic; Young, 1993), may not converge for general games. The main reason these processes cycle is that the players react to the information in a fixed way. Many researchers have therefore proposed to perturb the actions.1 In the context of learning by optimizing players, however, random actions are not justified. We therefore formulate a model in which optimizing players randomize over behavior rules (rules for choosing actions based on the available information), with rules based only on knowledge of the game and of the players' optimizing behavior.

We focus on iterative reasoning with one period memory, as follows. Each player receives information about the previous period's action combination only. Based on this information, one can believe that the next opponent plays the same action as the previous one, which leads to the adaptive behavior rule. If one knows the stage game and that the opponents optimize, then it is equally plausible that the next opponent uses the adaptive rule, in which case one should play a best response to a best response to the previous action combination. We call this behavior rule the one-step forward-looking rule. This kind of reasoning can be continued, i.e., one can choose a rule that stipulates a best response to the one-step forward-looking rule, and so on. More steps of this iterative reasoning are then interpreted as greater sophistication of the player.

Iterated best response behavior rules have already been considered in some learning and evolutionary models. Stahl (1993, 1996, 2000) studies a similar set of iterated best response rules, except that the most naive rule is to play all actions with equal probability.
Hurkens (1995, Section 5) considers the same kind of iterative reasoning as above but in addition assumes that players have probabilistic beliefs over all iterated best responses by the opponents. Hurkens shows a convergence result for this widest range of iterative behavior rules. If all the iterated best responses based on some probability distribution over the iterated best responses by the

Footnote 1: Among others, reinforcement learning models (e.g., Camerer and Ho, 1999, and Roth and Erev, 1995) and stochastic approximation models (e.g., Foster and Young, 2002) explicitly assume that players play suboptimal actions with positive probabilities. Alternatively, the no-mistake model of Young (1993) perturbs the information.


opponents have a positive probability, then the action combinations converge to a minimal curb set with probability one.

We investigate how many steps of iterated best responses are sufficient for convergence in two-person games. Proposition 1 proves that the adaptive behavior rule (no iteration) and the one-step forward-looking behavior rule (one iteration) are sufficient for the actions to converge to a minimal weak curb set, provided that there is a positive probability that the two players use different behavior rules. Thus rather simple behavior rules are sufficient. For many games, including weakly acyclic games (Young, 1993) and supermodular games (Milgrom and Roberts, 1990), minimal weak curb sets, minimal curb sets, and pure Nash equilibria coincide.

We also investigate a different type of sophistication in learning, using the same set of behavior rules as in Proposition 1. We consider a learning process in which players assess the performance of the various behavior rules and choose only the rules which prescribed an ex-post best response in the previous period. Interestingly, even with such sophisticated players, we find that the action combination may cycle and stay away from minimal weak curb sets (and hence away from minimal curb sets as well). Therefore the depth of reasoning is not as useful for convergence as the diversity of reasoning.

2 Model

Let G = (S1, S2, u1, u2) be a two-person normal form game. The elements of Si (i = 1, 2) are called pure actions, and we assume that the Si's are finite. The function ui : S1 × S2 → ℝ (i = 1, 2) is a payoff function. For each i = 1, 2 and each pure action sj ∈ Sj (j ≠ i), define the set of pure best responses by i against sj as BRi(sj) = {si ∈ Si | ui(si, sj) ≥ ui(x, sj) ∀x ∈ Si}. For simplicity, assume that BRi(sj) is a singleton for each i and sj.2

Let V1, V2 be populations of players. In each period t = 1, 2, . . ., one player is drawn from each Vi to play the role of i in the stage game G. We call each player drawn from Vi player i. Each player i chooses a pure action in Si using a behavior rule. A behavior rule is a function from one's information to the set of actions Si.

Footnote 2: The set of such games is of measure one in the space of finite two-person games.


Assume that players only receive information about the previous period's action combination (s1, s2) ∈ S1 × S2. The adaptive behavior rule for player i is the function that, for any information (s1, s2) ∈ S1 × S2, chooses a best response against the observed action of the previous opponent, i.e., it chooses the action si′ = BRi(sj) (j ≠ i). The one-step forward-looking behavior rule for player i is the function that chooses a best response against an opponent using the adaptive behavior rule, i.e., it chooses the action si′ = BRi(BRj(si)) (j ≠ i). One can define higher-step forward-looking behavior rules, which choose a best response against an opponent using the one-step-lower behavior rule, as in Hurkens (1995, Section 5.1) or Stahl (1996). Throughout the paper we assume that players use either the adaptive behavior rule or an m-step forward-looking behavior rule (m = 1, 2, . . .).

For each i = 1, 2 and each subset X ⊂ Sj (j ≠ i), define, with a slight abuse of notation, the set of best responses by i against X as BRi(X) = ∪sj∈X BRi(sj). A product set C1 × C2 ⊂ S1 × S2 is a weak curb set (w-curb set henceforth) if BR1(C2) × BR2(C1) ⊂ C1 × C2. A w-curb set C1 × C2 is a minimal w-curb set if no proper subset of C1 × C2 is a w-curb set. Basu and Weibull (1991) define a curb set as one satisfying BR1(∆(C2)) × BR2(∆(C1)) ⊂ C1 × C2, where ∆(X) is the set of all probability distributions over X. Clearly, a curb set is a w-curb set but not vice versa (see the example in Section 4). We show some classes of games in which minimal w-curb and minimal curb sets coincide at the end of Section 3. Since a curb set exists for any finite game, a w-curb set exists. A strict Nash equilibrium is both a minimal w-curb set and a minimal curb set.

Since we assume one period memory, a player's observation is a single pure action combination. Any belief based only on the observation and the iterated best responses against the observation has singleton support, and thus our definition of a w-curb set is natural. We cannot expect simple-minded players with only one period memory to always discover a curb set, which is closed under best responses against all mixed actions in itself. See Section 5 for more discussion.
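To make these definitions concrete, the following sketch enumerates the minimal w-curb sets of a small game by brute force. The game, its payoffs, and all function names are invented for illustration and are not from the paper; only pure best responses are used, in line with the definition of a w-curb set above.

```python
from itertools import combinations

# A hypothetical 2x2 coordination game (not from the paper): both (a, A)
# and (b, B) are strict Nash equilibria, hence singleton minimal w-curb sets.
S1, S2 = ["a", "b"], ["A", "B"]
u1 = {("a", "A"): 2, ("a", "B"): 0, ("b", "A"): 0, ("b", "B"): 1}
u2 = {("a", "A"): 2, ("a", "B"): 0, ("b", "A"): 0, ("b", "B"): 1}

def br1(s2):
    """Pure best responses of player 1 against the pure action s2."""
    best = max(u1[(s1, s2)] for s1 in S1)
    return {s1 for s1 in S1 if u1[(s1, s2)] == best}

def br2(s1):
    """Pure best responses of player 2 against the pure action s1."""
    best = max(u2[(s1, s2)] for s2 in S2)
    return {s2 for s2 in S2 if u2[(s1, s2)] == best}

def is_wcurb(C1, C2):
    """C1 x C2 is a w-curb set iff BR1(C2) x BR2(C1) is contained in it."""
    return (set().union(*(br1(s2) for s2 in C2)) <= C1 and
            set().union(*(br2(s1) for s1 in C1)) <= C2)

def nonempty_subsets(S):
    return [set(c) for r in range(1, len(S) + 1) for c in combinations(S, r)]

def minimal_wcurb_sets():
    """Brute-force enumeration; feasible only for small action sets."""
    wcurbs = [(C1, C2) for C1 in nonempty_subsets(S1)
              for C2 in nonempty_subsets(S2) if is_wcurb(C1, C2)]
    return [(C1, C2) for (C1, C2) in wcurbs
            if not any(D1 <= C1 and D2 <= C2 and (D1, D2) != (C1, C2)
                       for (D1, D2) in wcurbs)]
```

In this game the minimal w-curb sets are the two strict equilibria, ({a}, {A}) and ({b}, {B}). Replacing the pure best responses by best responses against all mixtures over C1 and C2 would test for Basu and Weibull's curb sets instead.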


3 Adaptive and one-step forward-looking rules are sufficient

Proposition 1 Assume that there exists ε > 0 such that in each period, the probability is at least ε that one of the players uses the adaptive behavior rule and the other uses the one-step forward-looking behavior rule.3 Then, from any initial observation (s1, s2) ∈ S1 × S2, with probability one there is a finite period t∗ < ∞ such that the action combination is in a minimal w-curb set for all t ≥ t∗.

Proof: Without loss of generality, assume that S1 × S2 is not a minimal w-curb set. Since S1 × S2 is a w-curb set, it suffices to prove that from any initial observation (s1, s2) ∈ S1 × S2, there is a positive probability that the action combination enters a smaller w-curb set in a finite number of periods. The same logic can then be applied to each smaller w-curb set which is not minimal, until the action combination reaches a minimal w-curb set. Since S1 × S2 is finite, we reach a minimal w-curb set in a finite number of periods with a positive probability.

Let Sˆ11 × Sˆ21, Sˆ12 × Sˆ22, . . . , Sˆ1K × Sˆ2K be the largest4 w-curb sets which are proper subsets of S1 × S2. In other words, each Sˆ1k × Sˆ2k is a union of w-curb sets W11 × W21, W12 × W22, . . . , W1H × W2H such that each W1h × W2h is a proper subset of S1 × S2 and, for each h, there is h′ ≠ h in {1, 2, . . . , H} such that [W1h × W2h] ∩ [W1h′ × W2h′] ≠ ∅.

Step 1: A pure action combination (s1, s2) ∈ S1 × S2 belongs to at least one of ∪k [Sˆ1k × Sˆ2k], ∪k [(S1 \ Sˆ1k) × Sˆ2k], ∪k [Sˆ1k × (S2 \ Sˆ2k)], or (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k), where the unions are over k = 1, 2, . . . , K.

Proof of Step 1: Case 1: Suppose that s1 ∈ Sˆ1k for some k ∈ {1, 2, . . . , K}. If s2 ∈ Sˆ2k, then (s1, s2) ∈ ∪k [Sˆ1k × Sˆ2k]. If s2 ∈ S2 \ Sˆ2k, then (s1, s2) ∈ ∪k [Sˆ1k × (S2 \ Sˆ2k)].

Case 2: Suppose that s1 ∉ Sˆ1k for any k ∈ {1, 2, . . . , K}, i.e., s1 ∈ S1 \ ∪k Sˆ1k. If s2 ∈ Sˆ2j for some j ∈ {1, 2, . . . , K}, then (s1, s2) ∈ ∪k [(S1 \ Sˆ1k) × Sˆ2k]. If s2 ∉ Sˆ2k for any k ∈ {1, 2, . . . , K}, then (s1, s2) ∈ (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k). //

Footnote 3: There are two such events: player 1 uses the adaptive rule and player 2 uses the one-step forward-looking rule, and vice versa. The assumption is that each of these events has probability at least ε.

Footnote 4: W-curb sets can be nested.


[Figure 1: Relevant areas of various steps.5 The figure is a table over a partition of S1 × S2: player 1's actions are grouped into rows Sˆ11, Sˆ12, Sˆ13 plus the remaining actions x, y, z, and player 2's actions into columns Sˆ21, Sˆ22, Sˆ23 plus the remaining actions X, Y, Z. Diagonal cells Sˆ1k × Sˆ2k are w-curb sets (Step 2); cells in which exactly one player's action belongs to some Sˆik are treated by Steps 3 and 4; cells in which neither player's action belongs to any Sˆik are treated by Step 5.]

Note that the four areas are not necessarily mutually exclusive. See Figure 1 for a visual aid. Note also that (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k) is not a w-curb set, since the Sˆik's cover all smaller w-curb sets of S1 × S2.

Let us first outline the rest of the proof. In Step 2, we show that once the process enters a smaller w-curb set Sˆ1k × Sˆ2k, the action combination does not get out. Steps 3 and 4 consider the case that one of the players' observed actions was in a smaller w-curb set (that is, either (s1, s2) ∈ ∪k [Sˆ1k × (S2 \ Sˆ2k)] or (s1, s2) ∈ ∪k [(S1 \ Sˆ1k) × Sˆ2k]). If, say, s1 ∈ Sˆ1k, then player 2 can use the adaptive behavior rule to play a best response so that the next period action satisfies s2′ ∈ Sˆ2k, and player 1 can then use the one-step forward-looking (F1) rule so that s1′ ∈ Sˆ1k. Finally, the most difficult part is Step 5, which deals with the case that no player was observed to play an action in a smaller w-curb set, that is, the observation was (s1, s2) ∈ (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k). We show that since (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k) is not a w-curb set, there is at least one action s̄i which leads to a best response outside of (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k). If the process hits this action at some point, then we go to one of Steps 2-4. If not, the process must stay in a set T1 × T2 smaller than (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k). This smaller set T1 × T2 cannot be a w-curb set either, and thus there exists another action which leads to a best response outside of T1 × T2, and so on. Since (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k) is finite, at some point the process must hit such an "exit" action.

Footnote 5: The sets Sˆik may contain more than one pure action.


Step 2: If (s1, s2) ∈ Sˆ1k × Sˆ2k for some k ∈ {1, 2, . . . , K}, then the next period action combination (s1′, s2′) belongs to the same w-curb set Sˆ1k × Sˆ2k. That is, the process never leaves Sˆ1k × Sˆ2k.

Proof of Step 2: Recall that all players use either the adaptive or a forward-looking behavior rule, and hence the next period action is an iterated best response. Take player 1. If he uses the adaptive behavior rule, s1′ = BR1(s2), and since s2 ∈ Sˆ2k, we have s1′ ∈ Sˆ1k. If he uses the F1 rule, s1′ = BR1(BR2(s1)). Since s1 ∈ Sˆ1k, we have BR2(s1) ∈ Sˆ2k and hence s1′ ∈ Sˆ1k. Higher-step forward-looking rules give the same conclusion, and player 2 is similar. //

Step 3: If (s1, s2) ∈ ∪k [(S1 \ Sˆ1k) × Sˆ2k], then there is a positive probability that the next period action combination (s1′, s2′) enters a smaller w-curb set Sˆ1k × Sˆ2k for some k.

Proof of Step 3: Consider the positive probability event that player 1 uses the adaptive behavior rule and player 2 uses the F1 rule. Then s1′ = BR1(s2) and s2′ = BR2(BR1(s2)) = BR2(s1′). Since s2 ∈ Sˆ2k for some k, we have s1′ = BR1(s2) ∈ Sˆ1k and s2′ = BR2(s1′) ∈ Sˆ2k. //

Step 4: If (s1, s2) ∈ ∪k [Sˆ1k × (S2 \ Sˆ2k)], then there is a positive probability that the next period action combination (s1′, s2′) enters a smaller w-curb set Sˆ1k × Sˆ2k for some k.

Proof of Step 4: This step is symmetric to Step 3. Consider the positive probability event that player 1 uses the F1 rule and player 2 uses the adaptive behavior rule. Then s2′ = BR2(s1) and s1′ = BR1(BR2(s1)) = BR1(s2′). Since s1 ∈ Sˆ1k for some k, we have s2′ = BR2(s1) ∈ Sˆ2k and s1′ = BR1(s2′) ∈ Sˆ1k. //

Step 5: If (s1, s2) ∈ (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k), then there are t̄ < ∞, a positive probability p, and a w-curb set Sˆ1k × Sˆ2k such that the action combination (s1(t), s2(t)) ∈ Sˆ1k × Sˆ2k for every t ≥ t̄ with probability p.
Proof of Step 5: Since (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k) is not a w-curb set, either (1) BR2(s̄1) ∉ (S2 \ ∪k Sˆ2k) for some s̄1 ∈ (S1 \ ∪k Sˆ1k), or (2) BR1(s̄2) ∉ (S1 \ ∪k Sˆ1k) for some s̄2 ∈ (S2 \ ∪k Sˆ2k) holds. Let {(s1(t), s2(t))}t=0,1,2,... start from (s1(0), s2(0)) = (s1, s2) ∈ (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k).

Case 1: Suppose that (s1(t∗), s2(t∗)) = (x, s̄2) for some x ∈ (S1 \ ∪k Sˆ1k) and some t∗ < ∞. Consider the positive probability event that in the next period player 1 uses the adaptive behavior rule and player 2 uses the F1 rule. Then s1(t∗ + 1) = BR1(s̄2) ∉ (S1 \ ∪k Sˆ1k), i.e., s1(t∗ + 1) ∈ Sˆ1k for some k. Hence s2(t∗ + 1) = BR2(s1(t∗ + 1)) ∈ Sˆ2k as well, and thus the process enters Sˆ1k × Sˆ2k for some k at period t∗ + 1 with a positive probability.

Case 2: Suppose that (s1(t∗), s2(t∗)) = (s̄1, y) for some y ∈ (S2 \ ∪k Sˆ2k) and some t∗ < ∞. This case is symmetric to Case 1. Consider the event that player 1 uses the F1 rule and player 2 uses the adaptive rule. Then s2(t∗ + 1) = BR2(s̄1) ∉ (S2 \ ∪k Sˆ2k), i.e., s2(t∗ + 1) ∈ Sˆ2k for some k. Hence s1(t∗ + 1) = BR1(s2(t∗ + 1)) ∈ Sˆ1k as well, and thus the process enters Sˆ1k × Sˆ2k for some k.

Case 3: Suppose that neither Case 1 nor Case 2 ever holds. We can divide this possibility into three subcases. Let T1 = (S1 \ ∪k Sˆ1k) \ {s̄1} and T2 = (S2 \ ∪k Sˆ2k) \ {s̄2}. Since T1 × T2 is not a w-curb set, either (1′) BR2(s̄1′) ∉ T2 for some s̄1′ ∈ T1, or (2′) BR1(s̄2′) ∉ T1 for some s̄2′ ∈ T2 holds.

Case 3-1: Suppose that (s1(t∗), s2(t∗)) = (x, s̄2′) for some x ∈ T1 and some t∗ < ∞. Then there is a positive probability that player 1 uses the adaptive behavior rule and player 2 uses the F1 rule, in which case (s1(t∗ + 1), s2(t∗ + 1)) ∉ T1 × T2. Since the process never hits s̄1 or s̄2, it goes to the areas considered in Step 3 or Step 4 in one period from t∗. Therefore there are a finite period t̄ and a positive probability that the action combination enters a minimal w-curb set.

Case 3-2: Suppose that (s1(t∗), s2(t∗)) = (s̄1′, y) for some y ∈ T2 and some t∗ < ∞. Similar to Case 3-1.

Case 3-3: Suppose that neither Case 3-1 nor Case 3-2 ever holds. Then again we can define a smaller product set by eliminating s̄i′ from Ti (i = 1, 2) and form three subcases. Since (S1 \ ∪k Sˆ1k) × (S2 \ ∪k Sˆ2k) is finite, by iterating the above argument we have that there exist t̄ < ∞ and a positive probability such that (s1(t̄), s2(t̄)) ∈ Sˆ1k × Sˆ2k for some k. //

Proposition 1 implies that the action combinations converge to a minimal w-curb set with probability one.
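The mechanics of the proof can be illustrated by a small simulation. The game below is hypothetical (not from the paper): it is encoded directly by its unique pure best-response maps, consistent with the singleton-best-response assumption of Section 2, and the 1/2-1/2 rule probabilities are an arbitrary choice satisfying the assumption of Proposition 1.

```python
import random

# Hypothetical best-response maps of a small 3x3 game (invented for
# illustration); (c, C) is a strict Nash equilibrium, hence a minimal
# w-curb set.
BR1 = {"A": "a", "B": "b", "C": "c"}   # player 1's best response to s2
BR2 = {"a": "B", "b": "C", "c": "C"}   # player 2's best response to s1

def step(s1, s2):
    """One period: each player independently uses the adaptive rule ("A")
    or the one-step forward-looking rule ("F") with probability 1/2, so
    the event required by Proposition 1 has probability at least 1/2."""
    rule1, rule2 = random.choice("AF"), random.choice("AF")
    next1 = BR1[s2] if rule1 == "A" else BR1[BR2[s1]]
    next2 = BR2[s1] if rule2 == "A" else BR2[BR1[s2]]
    return next1, next2

random.seed(1)
state = ("a", "A")
for t in range(200):
    state = step(*state)
# once the process enters the minimal w-curb set {(c, C)}, it never
# leaves it (Step 2 of the proof)
```

With these maps the play drifts along best-response chains toward (c, C) and is absorbed there, as Steps 2-5 describe for the general case.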
An inspection of the proof shows that it is straightforward to extend Proposition 1 to a weaker sufficient condition: there exists ε > 0 such that in each period the probability is at least ε that the two players use behavior rules which are one step apart. Experimental research (for example, Costa-Gomes et al., 2001) shows that real subjects are

heterogeneous in decision making but often not sophisticated enough to consider iterated best responses of many levels. Proposition 1 can be interpreted as giving a justification for being simple-minded.

Finally, we show some classes of games in which minimal w-curb sets, minimal curb sets, and pure Nash equilibria coincide. Young (1993) defines weakly acyclic games, in which the deterministic adaptive process converges to a sequence of strict Nash equilibria (a convention) under perturbed information of the history. For completeness we restate his definition in our notation. Define the best-reply graph of a finite game G as follows: each vertex is an action combination s ∈ S1 × S2, and for every two vertices s and s′, there is a directed edge s → s′ if and only if s ≠ s′ and there exists exactly one player i such that si′ is a best response to sj and sj′ = sj. A game G is acyclic if its best-reply graph contains no directed cycles. It is weakly acyclic if, from any initial vertex s, there exists a directed path to some vertex s∗ from which there is no exiting edge (a sink). "A game is weakly acyclic if and only if from every action combination there exists a finite sequence of best responses by one player at a time that ends in a strict, pure Nash equilibrium." (Young, 1993, p. 64.) Hence in weakly acyclic games, minimal w-curb sets and minimal curb sets coincide and are the strict pure Nash equilibria. Compared with Young's no-mistake process, our model requires less, and unperturbed, information about the history but adds the probability of non-adaptive behavior rules.

Next, consider supermodular games (Milgrom and Roberts, 1990). For completeness we repeat their definitions in our notation. Assume that each action set Si (i = 1, 2) comes with a partial order ≥i. The set of action combinations S = S1 × S2 is endowed with the product order ≥, that is, (s1, s2) ≥ (s1′, s2′) if and only if si ≥i si′ for each i = 1, 2.
The game G = ({1, 2}, S1, S2, u1, u2) is a supermodular game if for each i = 1, 2:

(A1) Si is a complete lattice; i.e., for each two-element set {x, y} ⊂ Si, there is a supremum for {x, y} (denoted x ∨ y) and an infimum (denoted x ∧ y), and for all nonempty subsets T ⊂ Si, inf(T) ∈ Si and sup(T) ∈ Si.

(A2-1) ui is order continuous in sj (for fixed si); i.e., ui converges along each chain C (a totally ordered subset of Sj) in both the increasing and decreasing directions, that is, lim_{sj∈C, sj↓inf(C)} ui(si, sj) = ui(si, inf(C)) and lim_{sj∈C, sj↑sup(C)} ui(si, sj) = ui(si, sup(C)).

(A2-2) ui : S → ℝ ∪ {−∞} is order upper semi-continuous in si (for fixed sj); i.e., lim sup_{si∈C, si↓inf(C)} ui(si, sj) ≤ ui(inf(C), sj) and lim sup_{si∈C, si↑sup(C)} ui(si, sj) ≤ ui(sup(C), sj).

(A2-3) ui has a finite upper bound.

(A3) ui is supermodular in si (for fixed sj); i.e., for all x, y ∈ Si, ui(x, sj) + ui(y, sj) ≤ ui(x ∧ y, sj) + ui(x ∨ y, sj).

(A4) ui has increasing differences in si and sj; i.e., for all si′ ≥i si, the difference ui(si′, sj) − ui(si, sj) is nondecreasing in sj.

It is easy to see that for finite games, conditions (A1)-(A3) are satisfied. Condition (A4) is the assumption of strategic complementarity: "when the second player increases his choice variable(s), it becomes more profitable for the first to increase his as well." (Milgrom and Roberts, 1990, p. 1261.)

Proposition 2 Let G be a supermodular game. Then minimal w-curb sets, minimal curb sets, and pure Nash equilibria coincide.

Proof. Suppose that C1 × C2 is a minimal w-curb set of G. Consider the restricted game G′ = (C1, C2, u1, u2). Then G′ is also a supermodular game.

Step 1: G′ has a pure Nash equilibrium (s∗1, s∗2). By Theorem 5 of Milgrom and Roberts (1990). //

Step 2: (s∗1, s∗2) is a Nash equilibrium of G. Since C1 × C2 is a w-curb set, BR1(s∗2) ⊂ C1. Since u1(s∗1, s∗2) ≥ u1(s1, s∗2) for all s1 ∈ C1, we have s∗1 ∈ BR1(s∗2). Similarly, s∗2 ∈ BR2(s∗1). //

By the minimality of C1 × C2, we have that C1 × C2 = {s∗1} × {s∗2}. Note that Proposition 2 does not require the assumptions that BRi(sj) is a singleton or that Si is finite.
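For a finite illustration of Proposition 2, the sketch below brute-forces the minimal w-curb sets of a small supermodular game and compares them with the pure Nash equilibria. The payoff functions are invented for this purpose; they have increasing differences, so (A4) holds, and (A1)-(A3) are automatic for finite games, as noted above.

```python
from itertools import combinations

# Hypothetical finite supermodular game (not from the paper):
# S_i = {0, 1, 2} with the usual order and u_i(s1, s2) = s1*s2 - 0.6*s_i,
# which has increasing differences in (s1, s2). Both of its pure Nash
# equilibria, (0, 0) and (2, 2), are strict.
S1 = S2 = [0, 1, 2]
u = {1: lambda s1, s2: s1 * s2 - 0.6 * s1,
     2: lambda s1, s2: s1 * s2 - 0.6 * s2}

def br(i, opp_action):
    """Pure best responses of player i against the opponent's pure action."""
    if i == 1:
        vals = {x: u[1](x, opp_action) for x in S1}
    else:
        vals = {y: u[2](opp_action, y) for y in S2}
    m = max(vals.values())
    return {a for a, v in vals.items() if v == m}

def is_wcurb(C1, C2):
    """C1 x C2 is a w-curb set iff BR1(C2) x BR2(C1) is contained in it."""
    return (set().union(*(br(1, y) for y in C2)) <= C1 and
            set().union(*(br(2, x) for x in C1)) <= C2)

def subsets(S):
    return [set(c) for r in range(1, len(S) + 1) for c in combinations(S, r)]

wcurbs = [(C1, C2) for C1 in subsets(S1) for C2 in subsets(S2)
          if is_wcurb(C1, C2)]
minimal = [(C1, C2) for (C1, C2) in wcurbs
           if not any(D1 <= C1 and D2 <= C2 and (D1, D2) != (C1, C2)
                      for (D1, D2) in wcurbs)]
nash = [(x, y) for x in S1 for y in S2
        if x in br(1, y) and y in br(2, x)]
```

In this example each minimal w-curb set is a singleton pure Nash equilibrium, ({0}, {0}) and ({2}, {2}), matching the proposition; note that both equilibria here are strict.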


4 Sophistication may not imply convergence

In this section we provide an example showing that (1) even if all players use high levels of forward-looking behavior rules, the action combination may not converge to a minimal w-curb set, and (2) even if players choose a behavior rule based on its past performance, the action combination may cycle. Therefore the depth of sophisticated reasoning is not as useful for convergence to a minimal w-curb set as the diversity of reasoning in Proposition 1.

Example. Consider two populations of firms, 1 and 2, producing the same good, for example cereals. An individual player from population i is called firm i. Three actions are available to both populations; the actions are symmetric, but firm 2's action names are capitalized to avoid confusion. Action x (resp. X) is to choose one type of advertisement for firm 1 (resp. firm 2), for example emphasizing that the product is healthy. Action y (resp. Y) is to choose another type of advertisement, for example emphasizing that the product is delicious. The third option, z (resp. Z), is to reduce spending on advertisement instead of engaging in an advertisement war. The reduction of advertisement is beneficial if and only if both firms choose the reduction (z and Z). Firms in population 2 are leading, well-known companies, and firms in population 1 are new companies in the industry. If the firms do not reduce advertisement, firm 1 prefers to choose the same advertisement as the leading company, while firm 2 prefers to choose a different advertisement from the entrant's. The mutual reduction of advertisement gives the highest payoff to both. An example of the payoffs is shown in Table 1.

1\2   X       Y       Z
x     2, −2   −2, 2   v, 0
y     −2, 2   2, −2   v, 0
z     v, 0    v, 0    3, 3

Table 1: Example

If v < 0, then the game has a strict Nash equilibrium6 (z, Z) as well as another minimal w-curb set {x, y} × {X, Y}, which is also a curb set. If 0 ≤ v < 2, then (z, Z) and {x, y} × {X, Y}

Footnote 6: There is a mixed Nash equilibrium as well.


are still w-curb sets, but the latter is no longer a curb set. If 2 ≤ v < 3, then (z, Z) is the unique and strict Nash equilibrium, the unique w-curb set, and the unique curb set. In the following we assume v < 0.

Table 2 shows the reactions of the various behavior rules to each of the 9 possible observations. The adaptive behavior rule is called A, and the m-step forward-looking behavior rules are called Fm rules. Table 2 shows that the rules up to the 3-step forward-looking rule give different reactions to the same observation. The 4-step forward-looking rule is identical to the adaptive behavior rule, and thus higher steps are irrelevant.

info      (x,X)    (x,Y)    (x,Z)
reaction  P1  P2   P1  P2   P1  P2
A         x   Y    y   Y    z   Y
F1        y   Y    y   X    y   Z
F2        y   X    x   X    z   X
F3        x   X    x   Y    x   Z

info      (y,X)    (y,Y)    (y,Z)
reaction  P1  P2   P1  P2   P1  P2
A         x   X    y   X    z   X
F1        x   Y    x   X    x   Z
F2        y   Y    x   Y    z   Y
F3        y   X    y   Y    y   Z

info      (z,X)    (z,Y)    (z,Z)
reaction  P1  P2   P1  P2   P1  P2
A         x   Z    y   Z    z   Z
F1        z   Y    z   X    z   Z
F2        y   Z    x   Z    z   Z
F3        z   X    z   Y    z   Z

Table 2: Reactions of various behavior rules
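Table 2 can be reproduced mechanically. The sketch below (function names invented) encodes the singleton best responses of the Table 1 game with v < 0 and computes the m-step reactions by mutual recursion, where m = 0 is the adaptive rule A.

```python
# Singleton best responses of the Table 1 game when v < 0:
BR1 = {"X": "x", "Y": "y", "Z": "z"}   # player 1 vs. firm 2's action
BR2 = {"x": "Y", "y": "X", "z": "Z"}   # player 2 vs. firm 1's action

def react1(m, s1, s2):
    """Player 1's reaction under the m-step forward-looking rule Fm
    (m = 0 is the adaptive rule A) to the observation (s1, s2)."""
    return BR1[s2] if m == 0 else BR1[react2(m - 1, s1, s2)]

def react2(m, s1, s2):
    """Player 2's reaction under Fm to the observation (s1, s2)."""
    return BR2[s1] if m == 0 else BR2[react1(m - 1, s1, s2)]

# e.g. the column of Table 2 for the observation (x, X):
column = [(react1(m, "x", "X"), react2(m, "x", "X")) for m in range(4)]
# column == [("x","Y"), ("y","Y"), ("y","X"), ("x","X")]: rules A, F1, F2, F3
```

Checking react1(4, ·, ·) against react1(0, ·, ·) over all nine observations (and similarly for player 2) confirms that the 4-step rule coincides with A, as stated above.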

It is easy to see that once the action combination falls into one of the minimal w-curb sets {x, y} × {X, Y} or {z} × {Z}, the learning process stays there forever after. It is also straightforward to see that if the players use the same behavior rule over time, even if it is the F3 rule (a rather high level of sophistication), the action combination may not converge to a minimal w-curb set.

In fact, using the same behavior rule throughout is not so sophisticated. Smart players can select a behavior rule based on its performance. To allow for this possibility, we assume that each

player receives information not only about the previous period action combination but also about the set of behavior rules which prescribed an ex-post best response.7 Note that we can still call this one period memory. To be concrete, we assume that a player at t who knows the action combination of period t − 1 observes the action combination at t as well, and then finds out which behavior rules among A and the Fm's should have been used. He then gives this information to his successor at period t + 1. The successor at t + 1 chooses one of the "correct" behavior rules and chooses an action based on that behavior rule and the information of the action combination in period t. We impose one more assumption: if there are multiple behavior rules which prescribed an ex-post best response, they are all used with a positive probability in the next period. This assumption gives the learning process more volatility than imposing a fixed rule for choosing among the behavior rules.

To determine a "correct" behavior rule after an observation at period t, we need to consider how the action combination changed from period t − 1 to t. The observation at t tells what the ex-post best response was. One then uses the previous observation at t − 1 to determine which behavior rules prescribed the ex-post best response. For example, if the same action combination (x, Z) was observed at t − 1 and t, firm 1 at period t can conclude that it should have played z instead of x, and the behavior rules which prescribed z after the observation (x, Z) were A and F2. Firm 1 at t + 1 can hear about this and would choose A or F2.

It turns out, however, that selecting an ex-post best behavior rule does not make learning easier. Clearly we only need to consider observations of (x, Z), (y, Z), (z, X) and (z, Y). Table 3 shows the "correct" behavior rules which prescribed an ex-post best response, depending on the last two periods' observations.
t−1\t   (x,Z)          (y,Z)          (z,X)          (z,Y)
(x,Z)   (A,F2), A      (A,F2), F2     F3, (F1,F3)    F1, (F1,F3)
(y,Z)   (A,F2), F2     (A,F2), A      F1, (F1,F3)    F3, (F1,F3)
(z,X)   (F1,F3), F1    (F1,F3), F3    A, (A,F2)      F2, (A,F2)
(z,Y)   (F1,F3), F3    (F1,F3), F1    F2, (A,F2)     A, (A,F2)

Table 3: "Correct" behavior rules, in the order of player 1, player 2.

Footnote 7: In a fixed-player learning context, Selten (1998) proposed a similar learning mechanism.


Using Tables 2-3, we can determine the action combination at t + 1 for the 16 different cases of movements of action combinations from t − 1 to t. In other words, the "states" of the dynamic process with behavior rule changes are not the action combinations but pairs of action combinations in two consecutive periods. As described above, if (x, Z) → (x, Z) from t − 1 to t, Table 3 shows that firm 1 will use A or F2. Table 2 shows that after (x, Z) is observed at t, both A and F2 prescribe action z for player 1. Hence at period t + 1, firm 1 will choose z. Firm 2 uses A, and thus the resulting action at t + 1 will be Y. In this case the process moves from (x, Z) at t to (z, Y) at t + 1. Take another example in which the process moved as (x, Z) → (z, X). Table 3 shows that in this case firm 1 chooses the F3 rule, which results in action z at t + 1. Firm 2 chooses F1 and F3 with positive probabilities. F1 prescribes Y and F3 prescribes X as the reaction to (z, X). Hence in period t + 1, the process moves to either (z, Y) or (z, X) with positive probability. Figure 3 summarizes all the movements under the behavior rule changes. It shows that the process stays within the 16 possible movements; hence the action combination cycles and never enters a minimal w-curb set, which here is also a minimal curb set.
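The cycle can also be verified computationally. The sketch below (function names invented) re-implements the rule-selection dynamic described above: a state is a pair of consecutive action combinations, the "correct" rules are those whose prescription given the observation at t − 1 matches the ex-post best response to the opponent's period-t action, and all positive-probability successor states are enumerated.

```python
# Singleton best responses of the Table 1 game when v < 0:
BR1 = {"X": "x", "Y": "y", "Z": "z"}
BR2 = {"x": "Y", "y": "X", "z": "Z"}

def react1(m, s1, s2):
    """Player 1's prescription under rule m (0 = A, m >= 1 = Fm)."""
    return BR1[s2] if m == 0 else BR1[react2(m - 1, s1, s2)]

def react2(m, s1, s2):
    """Player 2's prescription under rule m (0 = A, m >= 1 = Fm)."""
    return BR2[s1] if m == 0 else BR2[react1(m - 1, s1, s2)]

RULES = range(4)  # A, F1, F2, F3

def successors(prev, cur):
    """All next states (cur, nxt) with positive probability, given that
    each player picks any rule that prescribed an ex-post best response."""
    ok1 = [m for m in RULES if react1(m, *prev) == BR1[cur[1]]]
    ok2 = [m for m in RULES if react2(m, *prev) == BR2[cur[0]]]
    return {(cur, (react1(m1, *cur), react2(m2, *cur)))
            for m1 in ok1 for m2 in ok2}

# breadth-first closure from the state (x, Z) -> (x, Z):
seen, frontier = set(), [(("x", "Z"), ("x", "Z"))]
while frontier:
    state = frontier.pop()
    if state not in seen:
        seen.add(state)
        frontier.extend(successors(*state))
```

Every reachable action combination stays in {(x, Z), (y, Z), (z, X), (z, Y)}: the process never enters {x, y} × {X, Y} or {(z, Z)}, reproducing the cycle of Figure 3.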

[Figure 3: The transition diagram among the 16 states (pairs of consecutive action combinations): (x,Z)→(x,Z), (x,Z)→(y,Z), (x,Z)→(z,X), (x,Z)→(z,Y), (y,Z)→(x,Z), (y,Z)→(y,Z), (y,Z)→(z,X), (y,Z)→(z,Y), (z,X)→(x,Z), (z,X)→(y,Z), (z,X)→(z,X), (z,X)→(z,Y), (z,Y)→(x,Z), (z,Y)→(y,Z), (z,Y)→(z,X), (z,Y)→(z,Y). Solid lines are movements with probability 1 and dotted lines are movements with a positive probability.]


5 Concluding remarks

We have shown that the rather simple adaptive and one-step forward-looking behavior rules are sufficient for convergence to a minimal w-curb set, provided that the players use different ones with positive probability (Section 3), and that sophistication, in the sense of higher-level forward-looking rules or of switching to a better-performing rule, may not imply convergence to a minimal curb set, let alone a Nash equilibrium (Section 4). Therefore the diversity of behavior rules at each point in time is more instrumental for reaching a minimal w-curb set than the depth of thinking by the players. Moreover, one period memory was sufficient.

Although we have shown that minimal w-curb sets and minimal curb sets coincide for some games, it would be desirable if a set of simple behavior rules like ours converged to minimal curb sets. An inspection of the proof of Proposition 1 shows that Steps 2-4 hold for curb sets as well. The problem lies in Step 5, where in a finite number of periods the actions move away from the product set of non-w-curb actions. With only one period memory and the adaptive and one-step forward-looking behavior rules, the actions may not move out of the set of non-curb actions (which includes actions in a w-curb set). The reason is that an action in a curb set may be a best response to some mixture of the actions in the opponent's curb set, yet not a best response to any pure action. As we noted in Section 2, one period memory gives only a single action combination as information, and the simple behavior rules we use also generate only a pure expected action by the opponent (an iterated best response at some level). Therefore the players cannot play a best response against a strictly mixed action. This means that some convexification at the level of information or beliefs is needed in order to move away from non-curb actions.
The key to Hurkens' (1995) convergence result is the strong assumption that any best response against some convex combination of iterated best responses by the opponents has a positive probability.

An interesting extension is to analyze full population learning, in which all players in all populations play the stage game, instead of one player from each population in each period. When all players play the stage game, even under one period memory, multiple actions in a population can be observed. This diversity in observations may make learning easier. However, it is easy to extend the example in Section 4 to population-versus-population learning; hence the cyclic problem persists even with observations that are not singletons.

REFERENCES

Basu K, Weibull J (1991) Strategy subsets closed under rational behavior. Economics Letters 36:141-146.
Camerer C, Ho TH (1999) Experience-weighted attraction learning in normal form games. Econometrica 67:827-874.
Costa-Gomes M, Crawford V, Broseta B (2001) Cognition and behavior in normal-form games: an experimental study. Econometrica 69:1193-1235.
Foster D, Young HP (2002) Learning, hypothesis testing and Nash equilibrium. Manuscript, Johns Hopkins University.
Hurkens S (1995) Learning by forgetful players. Games and Economic Behavior 11:304-329.
Milgrom P, Roberts J (1990) Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica 58:1255-1277.
Milgrom P, Roberts J (1991) Adaptive and sophisticated learning in normal form games. Games and Economic Behavior 3:82-100.
Roth A, Erev I (1995) Learning in extensive-form games: experimental data and simple dynamic models in the intermediate term. Games and Economic Behavior 8:164-212.
Selten R (1991) Anticipatory learning in 2-person games. In: Selten R (ed) Game equilibrium models I. Springer-Verlag.
Selten R (1998) Features of experimentally observed bounded rationality. European Economic Review 42:413-436.
Stahl DO (1993) Evolution of smart_n players. Games and Economic Behavior 5:604-617.
Stahl DO (1996) Boundedly rational rule learning in a guessing game. Games and Economic Behavior 16:303-330.
Stahl DO (1999) Evidence based rules and learning in symmetric normal-form games. International Journal of Game Theory 28:111-130.
Stahl DO (2000) Rule learning in symmetric normal-form games: theory and evidence. Games and Economic Behavior 32:105-138.
Stahl DO, Wilson PW (1995) On players' models of other players: theory and experimental evidence. Games and Economic Behavior 10:218-254.
Young HP (1993) The evolution of conventions. Econometrica 61:57-84.
