A Multiagent Reinforcement Learning Algorithm using Extended Optimal Response

Nobuo Suematsu
Akira Hayashi
Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
[email protected]
[email protected]
ABSTRACT
Stochastic games provide a theoretical framework for multiagent reinforcement learning. Based on this framework, Littman proposed a multiagent reinforcement learning algorithm for zero-sum stochastic games, and Hu and Wellman extended it to general-sum games. If all agents learn with their algorithm, we can expect the agents' policies to converge to a Nash equilibrium. However, agents using their algorithm always try to converge to a Nash equilibrium, independent of the policies used by the other agents. In addition, when there are multiple Nash equilibria, the agents must agree on the equilibrium they want to reach. Thus, their algorithm lacks adaptability in a sense. In this paper, we propose a multiagent reinforcement learning algorithm based on the extended optimal response, which we introduce in this paper. The algorithm converges to a Nash equilibrium when the other agents are adaptable; otherwise it makes an optimal response. We also provide empirical results in three simple stochastic games, which show that the algorithm behaves as intended.
Categories and Subject Descriptors
I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence—Multiagent systems; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search—Dynamic programming, Plan execution, formation, and generation

General Terms
Algorithms, Experimentation

Keywords
Reinforcement learning, Q-learning, stochastic games, Markov games

1. INTRODUCTION

Multiagent systems composed of concurrent reinforcement learners have attracted increasing attention in recent years [12]. Multiagent reinforcement learning is much harder than the single-agent case. The difficulty mainly comes from the fact that, because of the other learning agents, the environment is not stationary from the viewpoint of any single agent. When there is a conflict of interests among agents and an agent simply tries to learn an optimal response to its current environment, the agent's adaptation changes the other agents' behavior, which in turn alters the agent's environment, so the agent has to adapt again. Thus, the agents may fall into an endless recursive adaptation. As the theory of Markov decision processes (e.g. [10]) provides a theoretical basis for single-agent reinforcement learning, the stochastic game (Markov game) model (e.g. [4]) provides the basis for multiagent reinforcement learning [8]. Based on the framework of stochastic games, Littman [7] introduced a multiagent reinforcement learning algorithm for zero-sum stochastic games, and Hu and Wellman [5] extended it to general-sum stochastic games. Hu and Wellman's algorithm can be seen as an extension of single-agent Q-learning [14, 13]. A Q-learning agent always tries to learn an optimal response to its environment, so in multiagent domains Q-learning agents may fall into the endless recursive adaptation described above. To avoid this, an agent using Hu and Wellman's algorithm aims at a Nash equilibrium. Thus, if all agents learn with their algorithm, we can expect their policies to converge to a Nash equilibrium. However, their algorithm lacks adaptability in the following sense. An agent using their algorithm always aims at a Nash equilibrium, independent of the policies used by the other agents, even though they may play fixed policies that do not correspond to any Nash equilibrium. In addition, Hu and Wellman have not addressed the equilibrium selection problem. This means that even if all agents learn with their algorithm, when there are multiple Nash equilibria we have to assume that the agents can agree on the equilibrium they want to reach. This assumption is unnatural under the standard reinforcement learning setting, in which agents are self-interested. In this paper, we propose a new multiagent reinforcement learning algorithm. In the algorithm, it is assumed that an agent can observe the other agents' actions and rewards, and Q-value tables are maintained for all agents, like Hu and Wellman's algorithm.
An agent that learns with our algorithm will be able to converge to a Nash equilibrium if the other agents are adaptable, and it will be able to make an optimal response when the other agents use fixed policies that do not correspond to any Nash equilibrium. Recently, a multiagent reinforcement learning algorithm called WoLF-PHC has been proposed [2]. WoLF-PHC is an extension of Q-learning that uses policy hill-climbing with a variable learning rate. WoLF-PHC is attractive because it does not need to assume that an agent can observe the other agents' actions and rewards, and it maintains only one Q-table, like single-agent Q-learning. However, policies seem to converge very slowly when WoLF-PHC is used. In Section 2, we describe basic concepts of game theory, Markov decision processes, and stochastic games. In Section 3, we investigate why Hu and Wellman's algorithm lacks adaptability. In Section 4, we introduce the new algorithm, which we call EXORL (Extended Optimal Response Learning), and examine its behavior. Finally, in Section 5 we provide empirical results which show that the proposed algorithm works as intended.
2. PRELIMINARIES

The stochastic game model provides a theoretical framework for multiagent systems. To describe stochastic games, we need some basic concepts of game theory and Markov decision processes. So, we state them first and then describe stochastic games briefly.

2.1 Bimatrix Games

A bimatrix game is a model of decision making in which two players choose their actions simultaneously and receive rewards that depend on the action pair chosen by them. A bimatrix game is defined by a tuple $\langle A^1, A^2, R^1, R^2 \rangle$. $A^i$ is the finite set of actions available to player $i$, and $R^i$ is the reward matrix for player $i$ whose entry $(a^1, a^2)$ is the reward received by player $i$ when players 1 and 2 select actions $a^1 \in A^1$ and $a^2 \in A^2$ respectively. The policy (strategy) of player $i$ is denoted by a vector $\pi^i$ which is a probability distribution over $A^i$. A Nash equilibrium of a bimatrix game is a policy pair $(\pi^1_*, \pi^2_*)$ such that

$(\pi^1_*)^T R^1 \pi^2_* \ge (\pi^1)^T R^1 \pi^2_*$  for any $\pi^1$,
$(\pi^1_*)^T R^2 \pi^2_* \ge (\pi^1_*)^T R^2 \pi^2$  for any $\pi^2$.

It has been shown that any finite bimatrix game has at least one Nash equilibrium [9]. If the policies $\pi^i_*$, $i = 1, 2$, of an equilibrium $(\pi^1_*, \pi^2_*)$ are deterministic (i.e. $\pi^i_*(a^i) \in \{0, 1\}$ for all $a^i \in A^i$, $i = 1, 2$), the equilibrium point is called a pure strategy Nash equilibrium; otherwise it is called a mixed strategy Nash equilibrium. In a mixed strategy Nash equilibrium $(\pi^1_*, \pi^2_*)$, the following equations hold:

$(\pi^{1+})^T R^1 \pi^2_* = (\pi^1_*)^T R^1 \pi^2_*$,
$(\pi^1_*)^T R^2 \pi^{2+} = (\pi^1_*)^T R^2 \pi^2_*$,

where each $\pi^{i+}$ is a deterministic policy which selects an action from the set $\{a^i \mid \pi^i_*(a^i) > 0, a^i \in A^i\}$. This is an important property for designing multiagent reinforcement learning algorithms which converge to a mixed strategy Nash equilibrium. $(\pi^1)^T R^i \pi^2$ is the expected reward for player $i$ when players 1 and 2 play $\pi^1$ and $\pi^2$ respectively. The expected reward for player $i$ at a Nash equilibrium $(\pi^1_*, \pi^2_*)$ is called the value of the equilibrium for player $i$.

2.2 Markov Decision Processes

A Markov decision process (MDP) is a model of sequential decision making by an agent embedded in an environment. At each discrete time step, the environment is in a state and the agent chooses an action. The agent then receives a reward which depends on the action and the state, and the environment changes its state to a new state stochastically. An MDP is a tuple $\langle S, A, P, R \rangle$. $S$ is a finite set of states. $A$ is the finite set of actions available to the agent. $P : A \times S \times S \to [0, 1]$ is a transition function; $P(s'|s, a)$ denotes the probability of entering state $s'$ when action $a$ is performed in state $s$. $R : S \times A \to \mathbb{R}$ is a reward function; $R(s, a)$ denotes the reward obtained by the agent when it performs action $a$ in state $s$. $\pi = \{\pi(s)\}_{s \in S}$ denotes a policy, where $\pi(s)$ is the policy in state $s$, a probability distribution over $A$. The value $v(s, \pi)$ is the expected total discounted reward when the agent is in state $s$ at time $t = 0$ and follows policy $\pi$ thereafter,

$v(s, \pi) = \sum_{t=0}^{\infty} \gamma^t E(r_t \mid \pi, s_0 = s)$,   (1)

where $\gamma$ is a discount factor and $E(r_t \mid \pi, s_0 = s)$ is the expected reward at time $t$. Using the value function, the Q-value function can be defined as

$Q(s, a, \pi) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, v(s', \pi)$.   (2)

An optimal policy in an MDP is a policy $\pi_*$ such that

$v(s, \pi_*) \ge v(s, \pi)$   $\forall \pi, s \in S$.   (3)

It has been shown that the value function of any optimal policy $\pi_*$, $v(s, \pi_*)$, is equal to the unique solution of the optimality equations (Bellman equations) (e.g. [10]),

$v(s) = \max_{a \in A} \left\{ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, v(s') \right\}$.   (4)

The solution $v(s)$ is called the optimal value function. We can also define the optimal Q-value function using Equation (2). Any finite MDP has at least one deterministic optimal policy [10], so single-agent reinforcement learning algorithms need not deal with probabilistic policies.
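As a concrete illustration of the Bellman optimality equation (4), the following is a minimal value-iteration sketch in Python. The tiny two-state MDP used here (transition and reward arrays) is a made-up example for illustration, not one from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only to illustrate Equation (4).
gamma = 0.95
# P[a][s][s'] = P(s' | s, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# R[s][a] = R(s, a)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

v = np.zeros(2)
for _ in range(1000):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) v(s')
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    v_new = q.max(axis=1)                  # Bellman backup of Equation (4)
    if np.max(np.abs(v_new - v)) < 1e-10:  # stop near the fixed point
        break
    v = v_new

print("optimal value function:", v)
print("a greedy (deterministic) optimal policy:", q.argmax(axis=1))
```

The greedy policy read off the converged Q-values is deterministic, which illustrates the last remark above.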
2.3 Stochastic Games

A stochastic game can be seen as an extension of an MDP to the multiagent case. A 2-player $\gamma$-discounted stochastic game is a tuple $\langle S, A^1, A^2, P, R^1, R^2 \rangle$. $S$ is a finite set of states and $A^i$ is the set of actions available to player $i$. $P : A^1 \times A^2 \times S \times S \to [0, 1]$ is a transition function; $P(s'|s, a^1, a^2)$ denotes the probability of entering state $s'$ when players 1 and 2 select actions $a^1$ and $a^2$ respectively in state $s$. $R^i : S \times A^1 \times A^2 \to \mathbb{R}$ is a reward function for player $i$; $R^i(s, a^1, a^2)$ denotes the reward obtained by player $i$ when players 1 and 2 perform actions $a^1$ and $a^2$ respectively in state $s$. The goal of each player is to maximize its discounted total reward. Let the policy of player $i$ be $\pi^i : S \times A^i \to [0, 1]$. In this paper we consider only the stationary policy case, in which the probability of choosing action $a^i$ in state $s$ does not change over time.
For a given initial state $s$ and policies $\pi^1$ and $\pi^2$, the value function for player $i$ is defined by

$v^i(s, \pi^1, \pi^2) = \sum_{t=0}^{\infty} \gamma^t E(r^i_t \mid \pi^1, \pi^2, s_0 = s)$,   (5)

where $E(r^i_t \mid \pi^1, \pi^2, s_0 = s)$ is the expected reward for player $i$ at time $t$. In addition, the Q-value function can be written as

$Q^i_{\pi^1 \pi^2}(s, a^1, a^2) = R^i(s, a^1, a^2) + \gamma \sum_{s' \in S} P(s'|s, a^1, a^2)\, v^i(s', \pi^1, \pi^2)$.   (6)

A Nash equilibrium of a stochastic game is a pair of policies $(\pi^1_*, \pi^2_*)$ such that

$v^1(s, \pi^1_*, \pi^2_*) \ge v^1(s, \pi^1, \pi^2_*)$   $\forall s \in S, \pi^1$,
$v^2(s, \pi^1_*, \pi^2_*) \ge v^2(s, \pi^1_*, \pi^2)$   $\forall s \in S, \pi^2$.

It has been shown that any discounted finite stochastic game has at least one Nash equilibrium [4]. A Nash equilibrium is a rational choice for the players, so multiagent reinforcement learning algorithms are expected to converge to a Nash equilibrium. Filar and Vrieze [4] have shown that the following two assertions are equivalent:

• $(\pi^1, \pi^2)$ is an equilibrium point of the discounted stochastic game with equilibrium payoff $(v^1(\pi^1, \pi^2), v^2(\pi^1, \pi^2))$, where $v^i(\pi^1, \pi^2) = (v^i(s, \pi^1, \pi^2))_{s \in S}$, $i = 1, 2$.
• For each $s \in S$, $(\pi^1(s), \pi^2(s))$ constitutes an equilibrium point of the static bimatrix game $\langle A^1, A^2, Q^1_{\pi^1 \pi^2}(s), Q^2_{\pi^1 \pi^2}(s) \rangle$ with equilibrium payoffs $(v^1(s, \pi^1, \pi^2), v^2(s, \pi^1, \pi^2))$, where $Q^i_{\pi^1 \pi^2}(s)$ is the matrix whose entry $(a^1, a^2)$ is $Q^i_{\pi^1 \pi^2}(s, a^1, a^2)$, $i = 1, 2$.

This property of equilibria in stochastic games is very important for the design of multiagent reinforcement learning algorithms.
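To make Equations (5) and (6) concrete, the following is a minimal sketch (our own illustration, not from the paper) that evaluates a fixed stationary policy pair by solving the linear system $v^i = r^i_\pi + \gamma P_\pi v^i$, where $P_\pi$ and $r^i_\pi$ are the transition matrix and expected one-step rewards induced by $(\pi^1, \pi^2)$. The function and argument names are ours; the game arrays are placeholders.

```python
import numpy as np

def evaluate_policies(P, R1, R2, pi1, pi2, gamma=0.95):
    """Value functions v^1, v^2 of a fixed stationary policy pair (Equations (5)-(6)).

    P[s, a1, a2, s2] : transition probabilities P(s2 | s, a1, a2)
    Ri[s, a1, a2]    : rewards for player i
    pii[s, ai]       : stationary policy of player i
    """
    nS = P.shape[0]
    # Joint action distribution in each state: w[s, a1, a2] = pi1(s, a1) * pi2(s, a2)
    w = np.einsum("sa,sb->sab", pi1, pi2)
    # Induced Markov chain and expected one-step rewards
    P_pi = np.einsum("sab,sabt->st", w, P)
    r1 = np.einsum("sab,sab->s", w, R1)
    r2 = np.einsum("sab,sab->s", w, R2)
    # Solve (I - gamma * P_pi) v = r for each player
    A = np.eye(nS) - gamma * P_pi
    return np.linalg.solve(A, r1), np.linalg.solve(A, r2)
```

Given $v^i$, the Q-values of Equation (6) follow by one more application of $R^i + \gamma \sum_{s'} P\, v^i$.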
3. HU AND WELLMAN'S ALGORITHM

An extension of single-agent Q-learning to the multiagent case has been proposed by Hu and Wellman [5]. In their algorithm, it is assumed that an agent can observe the other agent's action and reward, and Q-value tables for both agents are maintained. Under some assumptions, it has been shown that the agents' policies converge to a Nash equilibrium when both agents learn with their algorithm [5, 1], and the algorithm was investigated empirically in [6]. However, an agent using their algorithm always plays a policy which corresponds to a Nash equilibrium, independent of the opponent's actual policy. So, the agent lacks adaptability in a sense. In their algorithm, when the players select action pair $(a^1, a^2)$ in state $s$, the state changes to $s'$, and the players receive rewards $(r^1, r^2)$ respectively, the Q-values are updated through

$Q^i(s, a^1, a^2) \leftarrow (1 - \alpha) Q^i(s, a^1, a^2) + \alpha \{ r^i + \gamma V^i(s') \}$,   (7)

where $V^i(s')$ is the value for player $i$ of an equilibrium of the bimatrix game $\langle A^1, A^2, Q^1(s'), Q^2(s') \rangle$. The policy is then set so that $\pi^1(s)$ is player 1's policy at an equilibrium of the bimatrix game $\langle A^1, A^2, Q^1(s), Q^2(s) \rangle$. Thus, equilibrium points are used both to evaluate states and to derive policies. This is why their algorithm does not adapt to the opponent's policy.

4. A NEW ALGORITHM

In this section, we introduce a new multiagent reinforcement learning algorithm. We assume that an agent can observe the opponent's action and reward. Our goal is to design an algorithm that realizes an agent which behaves as follows:

1. It basically tries to learn a policy which is an optimal response to the opponent's policy,
2. but when the opponent is adaptable, it tries to reach a Nash equilibrium.

Single-agent Q-learning tries to learn a policy which is an optimal response to its environment (and to the opponent in the multiagent case). So, the first part of the desired behavior is achieved by single-agent Q-learning, although it cannot realize the second part in general multiagent domains. As described in the previous section, the extension of Q-learning proposed by Hu and Wellman makes it possible to converge to an equilibrium, which is the second part of the desired behavior; however, in their extension the first part has been lost. Figure 1 shows our algorithm, which we call EXORL (Extended Optimal Response Learning). The algorithm uses the extended optimal response, which is introduced to realize the desired behavior. In the following two subsections we describe the two important parts of our algorithm, the learning of Q-values and the extended optimal response. We then examine the behavior of the algorithm in Subsection 4.3.

1. Initialize: for all $s \in S$, $a^1 \in A^1$, $a^2 \in A^2$,
   $Q^1(s, a^1, a^2) \leftarrow 0$, $Q^2(s, a^1, a^2) \leftarrow 0$, $\hat{\pi}^2(s, a^2) \leftarrow 1/|A^2|$, $\pi^1(s, a^1) \leftarrow 1/|A^1|$.
2. Loop:
   (a) Choose action $a^1_t$ according to the probability distribution $\pi^1(s_t)$ with some exploration.
   (b) Observe the rewards $(r^1_{t+1}, r^2_{t+1})$, the opponent's action $a^2_t$, and the next state $s_{t+1}$.
   (c) Update the Q-values: for $i = 1, 2$,
       $Q^i(s_{t-1}, a^1_{t-1}, a^2_{t-1}) \leftarrow (1 - \alpha) Q^i(s_{t-1}, a^1_{t-1}, a^2_{t-1}) + \alpha [ r^i_t + \gamma Q^i(s_t, a^1_t, a^2_t) ]$.
   (d) Update the estimate of the opponent's policy:
       $\hat{\pi}^2(s_t) \leftarrow (1 - \beta) \hat{\pi}^2(s_t) + \beta \pi^2_t$, where $\pi^2_t$ is the vector with $\pi^2_t(a^2) = 1$ if $a^2 = a^2_t$ and $0$ otherwise.
   (e) Update the policy $\pi^1(s_{t-1})$ so that $O_{s_{t-1}}(\pi^1(s_{t-1}))$ defined in Equation (9) is maximized.

Figure 1: EXORL algorithm
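For readers who prefer code, the following is a minimal Python sketch of the loop in Figure 1, written under our own assumptions about the interface: an `env.step()` that returns both players' rewards and the opponent's action is hypothetical, and `maximize_O` stands in for the policy update of step (e), defined in Subsection 4.2.

```python
import numpy as np
from collections import defaultdict

def exorl_player1(env, n_actions1, n_actions2, steps=10000,
                  alpha=0.1, beta=0.05, gamma=0.95, epsilon=0.2,
                  maximize_O=None):
    """Sketch of Figure 1 for player 1; `env` and `maximize_O` are assumed interfaces."""
    Q1 = defaultdict(lambda: np.zeros((n_actions1, n_actions2)))     # Q^1(s, a1, a2)
    Q2 = defaultdict(lambda: np.zeros((n_actions1, n_actions2)))     # Q^2(s, a1, a2)
    pi1 = defaultdict(lambda: np.ones(n_actions1) / n_actions1)      # own policy
    pi2_hat = defaultdict(lambda: np.ones(n_actions2) / n_actions2)  # opponent estimate

    s = env.reset()
    prev = None  # (s_{t-1}, a1_{t-1}, a2_{t-1}, r1_t, r2_t)
    for _ in range(steps):
        # (a) choose a1 from pi1(s), with random exploration
        if np.random.rand() < epsilon:
            a1 = np.random.randint(n_actions1)
        else:
            a1 = np.random.choice(n_actions1, p=pi1[s])
        # (b) observe both rewards, the opponent's action, and the next state
        s_next, a2, r1, r2 = env.step(a1)
        if prev is not None:
            ps, pa1, pa2, pr1, pr2 = prev
            # (c) Sarsa-style update of both Q tables for the previous transition
            for Q, r in ((Q1, pr1), (Q2, pr2)):
                Q[ps][pa1, pa2] += alpha * (r + gamma * Q[s][a1, a2] - Q[ps][pa1, pa2])
            # (e) update pi1(s_{t-1}) by maximizing O_{s_{t-1}} of Equation (9)
            pi1[ps] = maximize_O(Q1[ps], Q2[ps], pi2_hat[ps])
        # (d) stochastic-approximation update of the opponent model
        e = np.zeros(n_actions2); e[a2] = 1.0
        pi2_hat[s] = (1 - beta) * pi2_hat[s] + beta * e
        prev, s = (s, a1, a2, r1, r2), s_next
    return pi1, Q1, Q2
```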
4.1 Learning of Q-values

In our algorithm, Q-values are updated as Sarsa [13] does (step (c) in Figure 1). Sarsa is an on-policy variation of Q-learning. In standard Q-learning, Q-values are updated as

$Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \{ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') \}$,

where $s_t$ is the state at time $t$, $a_t$ is the action chosen at time $t$, and $r_{t+1}$ is the reward obtained at time $t + 1$. When this update rule is used, under some assumptions, the Q-values converge to the optimal Q-values, independent of the policy used while learning. On the other hand, the update rule of Sarsa is
$Q(s_t, a_t) \leftarrow (1 - \alpha) Q(s_t, a_t) + \alpha \{ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \}$.

In this update rule, the value of state $s_{t+1}$ is evaluated by $Q(s_{t+1}, a_{t+1})$, where $a_{t+1}$ is the action chosen in $s_{t+1}$. Thus, the Q-values converge to $Q(s, a, \pi)$, where $\pi$ is the policy used while learning. As we can see in Equation (7), in Hu and Wellman's algorithm a state is evaluated by the value of an equilibrium of the bimatrix game $\langle A^1, A^2, Q^1(s), Q^2(s) \rangle$. This means that their algorithm estimates Q-values for a Nash equilibrium, independent of the policies used by the players. On the other hand, as shown in step (c) of Figure 1, in our algorithm a state is evaluated by the Q-value of the action pair actually selected by the players in that state. Thus, our algorithm estimates Q-values for the policies which are actually used by the players. In a multiagent domain, since an agent cannot control the other agent's policy, the Q-values for the policies actually used by the agents should be estimated. As shown in Figure 1, the opponent's policy is also estimated in EXORL. We use Robbins and Monro's stochastic approximation [11] for this estimation.
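The distinction between the two update targets can be stated compactly in code. This is a small sketch of our own (hypothetical variable names) contrasting the off-policy Q-learning target with the on-policy, Sarsa-style target used in EXORL's step (c), together with the Robbins-Monro style opponent-policy estimate of step (d).

```python
import numpy as np

def q_learning_target(Q, s_next, r, gamma):
    # Off-policy: bootstrap with the best action in s_next, ignoring the policy in use.
    return r + gamma * Q[s_next].max()

def sarsa_joint_target(Qi, s_next, a1_next, a2_next, r, gamma):
    # On-policy (EXORL step (c)): bootstrap with the joint action actually chosen in s_next,
    # so Qi tracks the value of the policies the players really use.
    return r + gamma * Qi[s_next][a1_next, a2_next]

def update_opponent_model(pi2_hat_s, a2, beta):
    # Stochastic approximation (EXORL step (d)): move the estimate a step of
    # size beta toward the indicator vector of the observed opponent action.
    e = np.zeros_like(pi2_hat_s)
    e[a2] = 1.0
    return (1.0 - beta) * pi2_hat_s + beta * e
```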
4.2 Extended Optimal Response

Single-agent Q-learning tries to make an optimal response by estimating Q-values and taking the action $\arg\max_a Q(s, a)$ in state $s$. Claus and Boutilier [3] proposed joint action learners (JALs), which are Q-learning agents modified for the multiagent case. JALs learn Q-values for action pairs and estimate the opponent's policy. In this case, an optimal response in state $s$ is the policy $\pi^1$ which maximizes

$O_s(\pi^1) = (\pi^1)^T Q^1(s)\, \hat{\pi}^2(s)$,   (8)

where $\hat{\pi}^2(s)$ is the estimate of player 2's policy in state $s$. An agent which simply tries to make an optimal response cannot learn a policy which corresponds to a mixed strategy Nash equilibrium. On the other hand, if an agent derives its policy from a Nash equilibrium of the bimatrix game $\langle A^1, A^2, Q^1(s), Q^2(s) \rangle$, the policy may not be a suitable response to the opponent's actual policy. To solve this problem, we extend the definition of Equation (8) to

$O_s(\pi^1) = (\pi^1)^T Q^1(s)\, \hat{\pi}^2(s) - \sigma \rho_s(\pi^1)$,   (9)

where $\sigma$ is a tuning parameter and

$\rho_s(\pi^1) = \max_{\pi^2} \left[ (\pi^1)^T Q^2(s)\, \pi^2 - (\pi^1)^T Q^2(s)\, \hat{\pi}^2(s) \right]$.   (10)

$\rho_s(\pi^1)$ is the possible increase in expected discounted reward for player 2 which can be achieved by changing its policy from $\hat{\pi}^2$ when player 1 uses policy $\pi^1$. When an agent takes the adaptability of the opponent into account, what the agent has to do is not only to maximize $(\pi^1)^T Q^1(s)\, \hat{\pi}^2(s)$ but also to minimize $\rho_s(\pi^1)$. Equation (9) is defined so that this maximization and minimization are accomplished together. Note that $O_s(\pi^1)$ defined in Equation (9) is a piecewise linear concave function, which has a unique maximal point except in the rare special case in which it attains its maximum value over a region.
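Step (e) of Figure 1 requires maximizing Equation (9). Since the inner maximization in Equation (10) is attained at a pure action, the objective is a concave piecewise-linear function of $\pi^1$ and can be maximized with a small linear program. The following sketch is our own reformulation (using SciPy; the function name is ours), not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def extended_optimal_response(Q1_s, Q2_s, pi2_hat, sigma=0.2):
    """Maximize Equation (9) over the simplex of player 1's policies at one state.

    Q1_s, Q2_s : (|A1| x |A2|) Q-matrices for the state; pi2_hat : opponent estimate.
    rho_s(pi1) = max_{a2} (pi1^T Q2_s)[a2] - pi1^T Q2_s pi2_hat, so the concave
    piecewise-linear objective can be maximized by one linear program.
    """
    n1, n2 = Q1_s.shape
    # Variables: x = (pi1[0..n1-1], z) with z >= (pi1^T Q2_s)[a2] for every a2.
    # Maximize pi1^T (Q1_s + sigma*Q2_s) pi2_hat - sigma*z  ==>  minimize the negation.
    c = np.concatenate([-(Q1_s + sigma * Q2_s) @ pi2_hat, [sigma]])
    A_ub = np.hstack([Q2_s.T, -np.ones((n2, 1))])          # (pi1^T Q2_s)[a2] - z <= 0
    b_ub = np.zeros(n2)
    A_eq = np.concatenate([np.ones(n1), [0.0]])[None, :]    # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n1 + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n1]
```

A policy obtained this way can be plugged in for the hypothetical `maximize_O` used in the loop sketch after Figure 1.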
4.3 Behavior of EXORL

In this subsection, we explain why EXORL can be expected to achieve the desired behavior. We divide the situations into three cases and show that the policy obtained by maximizing Equation (9) is suitable in each case, when the factor $\sigma$ is small.

1. When a pure strategy equilibrium $(\pi^1_*(s), \pi^2_*(s))$ exists and the opponent uses policy $\pi^2_*(s)$: In this case, the maximization of the expected reward $(\pi^1)^T Q^1(s)\, \pi^2_*(s)$ and the minimization of $\rho_s(\pi^1)$ with $\hat{\pi}^2(s) = \pi^2_*(s)$ are accomplished simultaneously when $\pi^1 = \pi^1_*(s)$.

2. When a mixed strategy equilibrium $(\pi^1_*(s), \pi^2_*(s))$ exists and the opponent uses policy $\pi^2_*(s)$: In this case, the expected reward $(\pi^1)^T Q^1(s)\, \pi^2_*(s)$ is constant for all $\pi^1$, as mentioned in Subsection 2.1, and $\rho_s(\pi^1)$ with $\hat{\pi}^2(s) = \pi^2_*(s)$ is minimized when $\pi^1 = \pi^1_*(s)$.

3. When the opponent plays a fixed policy $\pi^2$ which does not correspond to any equilibrium: In this case, if $\sigma$ is small enough, $\pi^1(s)$ is simply selected so that $(\pi^1)^T Q^1(s)\, \hat{\pi}^2(s)$ is maximized.

From the above description, we can see that EXORL will learn an adequate policy when the opponent plays a fixed policy. In addition, we can see that the equilibria of a stochastic game will be fixed points for a multiagent system composed of EXORL agents. As we have seen above, a small $\sigma$ is preferable to realize an optimal response when the opponent's policy is fixed (case 3). However, if $\sigma$ is too small, then because of errors in the estimates of the Q-values and the opponent's policy, the agent will not obtain a policy which corresponds to a mixed strategy equilibrium in case 2. $\sigma$ determines the agent's sensitivity to the deviation of the opponent's policy from an equilibrium: a smaller $\sigma$ means higher sensitivity (i.e. lower stability in a mixed strategy Nash equilibrium). The parameter $\sigma$ also plays another role. The r.h.s. of Equation (9) combines the expected reward for player 1 and the possible gain of reward for player 2. In general, however, the ratio of the magnitude of the rewards for player 1 to those for player 2 can be arbitrarily large or small without changing the character of the game, so $\sigma$ has to scale $\rho_s(\pi^1)$ adequately.

5. EXPERIMENTS

In this section we show experimental results in three simple multiagent domains. The results show that EXORL can realize the desired behavior described in the previous section.
Throughout the experiments, the discount factor was 0.95 and $\sigma$ was set to 0.2. In addition, we used an exploration strategy in which an action is selected at random with probability 0.2.

5.1 Matching Pennies

In this subsection, we show experimental results for the iterated "matching pennies" game, the bimatrix game shown in Figure 2. In this game, the policy pair $(\pi^1_*, \pi^2_*)$ with $\pi^1_*(\mathrm{H}) = \pi^1_*(\mathrm{T}) = \pi^2_*(\mathrm{H}) = \pi^2_*(\mathrm{T}) = 1/2$ is the unique equilibrium. The Q-values at the equilibrium are equal to the rewards shown in Figure 2, since the expected future reward is 0. In the experiments for matching pennies, the learning rate we used is $\alpha = N(a^1, a^2)^{-2/3}$, where $N(a^1, a^2)$ is the number of times action pair $(a^1, a^2)$ has been selected.

Figure 2: Matching Pennies. Rows are player 1's actions (H, T) and columns are player 2's actions (H, T): $R^1 = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$, $R^2 = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$.

Figure 3 shows the result obtained when player 1 always chooses H and player 2 learns with Hu and Wellman's algorithm. The optimal response for player 2 is clearly to choose T every time, i.e. Pr(H) = 0. However, the learned policy converged to Pr(H) = 0.5, which corresponds to the unique equilibrium. The player learning with Hu and Wellman's algorithm could not adapt to the other player's policy adequately.

Figure 4 shows the results obtained when both players learn with EXORL. As shown in Figure 4(a), the policy of player 2 quickly converges to the equilibrium. The Q-values also converge to the values at the equilibrium, as we can see in Figure 4(b). Note that the ranges of iterations shown in Figures 4(a) and 4(b) are quite different.

Figure 5 shows the results when player 1 plays fixed policies and player 2 learns with EXORL. Of course, when player 1 plays with Pr(H) = 0.5, the policy of the EXORL agent quickly converges to Pr(H) = 0.5. When player 1's policy is Pr(H) = 0.4, the learned policy sometimes leaps to Pr(H) = 1, which is the unique optimal response, while it stays near Pr(H) = 0.5 almost all the time. When player 1's policy is Pr(H) = 0.3, the learned policy is Pr(H) = 1 almost all the time and sometimes leaps to around Pr(H) = 0.5. When player 1's policy is Pr(H) = 0.2, the learned policy is Pr(H) = 1 almost all the time and hardly ever leaps to around Pr(H) = 0.5. Note that since we always adopt the exploration strategy in which an action is selected at random with probability 0.2, the effective probability that player 1 chooses H is Pr(H) × 0.8 + 0.1. As mentioned in Subsection 4.3, the sensitivity to the deviation of the opponent's policy from an equilibrium is determined by the parameter $\sigma$, so the behavior described above will change depending on the value of $\sigma$.

Figure 3: The policy for player 2 in matching pennies when player 1 always chooses H and player 2 learns with Hu and Wellman's algorithm. (Plot of Pr(H) over 1000 iterations.)

Figure 4: Results in matching pennies when both players are EXORL agents. (a) The policy for player 2. (b) Q(H,H) and Q(H,T) for player 2.

Figure 5: The policies for player 2 in matching pennies when player 2 is an EXORL agent and player 1 plays with the fixed policies shown in the plots (Pr(H) = 0.5, 0.4, 0.3, 0.2).
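The optimal-response claims above are easy to check numerically. The following small sketch (our own, not from the paper) computes player 2's expected stage-game reward in matching pennies against a fixed opponent, including the effective opponent policy Pr(H) × 0.8 + 0.1 induced by the exploration strategy.

```python
import numpy as np

# Matching pennies rewards for player 2 (rows: player 1's H/T, columns: player 2's H/T).
R2 = np.array([[-1.0,  1.0],
               [ 1.0, -1.0]])

def effective(pr_h, explore=0.2):
    # With probability `explore` the action is uniform, so Pr(H) becomes 0.8*Pr(H) + 0.1.
    return (1 - explore) * pr_h + explore / 2

for pr_h in (0.5, 0.4, 0.3, 0.2, 1.0):
    p1 = np.array([effective(pr_h), 1 - effective(pr_h)])  # player 1's effective policy
    payoff = p1 @ R2  # player 2's expected reward for playing H or T
    if payoff[0] > payoff[1]:
        best = "H"
    elif payoff[1] > payoff[0]:
        best = "T"
    else:
        best = "indifferent"
    print(f"player 1 Pr(H)={pr_h}: player 2 payoffs (H, T) = {payoff}, best response = {best}")
```

Against Pr(H) = 0.5 player 2 is indifferent (the mixed equilibrium), against Pr(H) < 0.5 the unique best response is H, and against always-H it is T, which matches the behavior reported for Figures 3 and 5.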
5.2 Presidency Game

The presidency game is a two-state stochastic game presented in [4], which models decision making by the two political parties of a country. The population of the country periodically has to vote for a president, and both parties nominate a candidate. State 1 (state 2) is the period in which the president was nominated by party 1 (party 2). The presidency game we used for the experiments is shown in Figure 6 (this is the version of the presidency game given in Example 6.3.3 of [4] with a = 15). In the figure, each cell gives the rewards and the transition consequences of the corresponding action pair in the state. For example, when action pair (U,C) is selected in state 1, players 1 and 2 receive rewards of 70 and 30 respectively, and the next state is state 1 or state 2 with probability 1/6 and 5/6 respectively. For convenience, we denote a deterministic policy as, e.g., (F,C), which means the player selects F in state 1 and C in state 2. With this notation, the game has a unique equilibrium ((U,C),(C,U)). Figure 7 shows the Q-value tables for the equilibrium under the exploration strategy. In the experiments for the presidency game, the learning rate we used is $\alpha = N(s, a^1, a^2)^{-1/2}$.

Figure 9 shows the results obtained when both players were EXORL agents. As we can see in Figure 9(a), the policy quickly converges to the Nash equilibrium, and Figure 9(b) shows $Q^2(\text{state 1}, \mathrm{U}, \mathrm{C})$ converging towards 961. The Q-value tables at 100000 iterations are shown in Figure 8; the values in Figure 8 are close to those in Figure 7. Figure 10 shows the policy of player 2, which learns with EXORL, when player 1 plays the fixed policy (F,C). As we can see in Figure 10, the policy converges to (N,U), which is the unique optimal response to player 1 playing (F,C). We also verified that an EXORL agent converges to the Nash equilibrium when the other player learns with Hu and Wellman's algorithm, although this is to be expected since their algorithm converges to the Nash equilibrium independent of the other agent's policy. The result is shown in Figure 11.

Figure 6: The Presidency Game. Each cell gives the reward pair (player 1, player 2) and the transition probabilities to (state 1, state 2). State 1 (player 1 chooses F or U, player 2 chooses C or N): (F,C): 40,60 and (F,N): 45,25, both with transitions (1/2,1/2); (U,C): 70,30 and (U,N): 50,50, both with transitions (1/6,5/6). State 2 (player 1 chooses C or N, player 2 chooses F or U): the row for C contains the reward pairs 30,70 and 25,45 with transitions (1/2,1/2); the row for N contains 60,40 and 50,50 with transitions (1/6,5/6).

Figure 7: Q-value tables of the presidency game for the equilibrium with the exploration strategy. Each entry is (Q-value for player 1, Q-value for player 2). State 1 (rows F, U; columns C, N): F: (974, 974), (964, 984); U: (987, 961), (962, 956). State 2 (rows C, N; columns F, U): C: (974, 974), (961, 987); N: (984, 964), (956, 962).

Figure 8: Learned Q-value tables. State 1 (rows F, U; columns C, N): F: (950, 952), (934, 956); U: (962, 939), (938, 931). State 2 (rows C, N; columns F, U): C: (948, 952), (936, 965); N: (954, 936), (931, 940).

Figure 9: Results in the presidency game when both players are EXORL agents. (a) Pr(C | state 1) (top) and Pr(F | state 2) (bottom) for player 2. (b) The Q-value $Q^2(\text{state 1}, \mathrm{U}, \mathrm{C})$; the horizontal dashed line shows the value to which it should converge.

Figure 10: Result when player 1 plays with (F,C) and player 2 learns with EXORL in the presidency game.

Figure 11: Result when player 1 learns with Hu and Wellman's algorithm and player 2 learns with EXORL in the presidency game.
5.3 Battle of the Sexes

The battle of the sexes is a well-known bimatrix game, whose reward matrices are shown in Figure 12. The battle of the sexes has two pure strategy Nash equilibria, $(\pi^1_{*1}, \pi^2_{*1})$ and $(\pi^1_{*2}, \pi^2_{*2})$, and a mixed strategy Nash equilibrium $(\pi^1_{*3}, \pi^2_{*3})$, where $\pi^1_{*1} = \pi^2_{*1} = (1, 0)^T$, $\pi^1_{*2} = \pi^2_{*2} = (0, 1)^T$, and $\pi^1_{*3} = (3/5, 2/5)^T$, $\pi^2_{*3} = (2/5, 3/5)^T$. The equilibrium values of $(\pi^1_{*1}, \pi^2_{*1})$, $(\pi^1_{*2}, \pi^2_{*2})$, and $(\pi^1_{*3}, \pi^2_{*3})$ are $(2, 1)$, $(1, 2)$, and $(1/5, 1/5)$ respectively. In the experiments for the battle of the sexes, the learning rate we used is $\alpha = N(a^1, a^2)^{-1/2}$.

Figure 12: Battle of the Sexes. Rows are player 1's actions (B, O) and columns are player 2's actions (B, O): $R^1 = \begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix}$, $R^2 = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}$.

Figure 13 shows the results obtained when player 2 is an EXORL agent and player 1 plays the fixed policies $\pi^1_{*1}$, $\pi^1_{*2}$, and $\pi^1_{*3}$. We can see that the EXORL agent learned $\pi^2_{*1}$, $\pi^2_{*2}$, and $\pi^2_{*3}$ respectively.

We also made 1000 trials of the game in which both players are EXORL agents. The mixed strategy equilibrium was not reached in any of the 1000 trials. Note that the value of the mixed strategy equilibrium is much smaller than those of the two pure strategy equilibria; this may mean that the mixed strategy equilibrium is unstable or has a very small absorbing area. Because of the symmetry of the two pure strategy equilibria, $(\pi^1_{*1}, \pi^2_{*1})$ and $(\pi^1_{*2}, \pi^2_{*2})$ should apparently be reached with equal probability. In our experiment, $(\pi^1_{*1}, \pi^2_{*1})$ was reached in 520 trials and $(\pi^1_{*2}, \pi^2_{*2})$ was reached in the other 420 trials. An example of the time evolution of the players' policies is shown in Figure 14.

Figure 13: The policy learned by player 2, an EXORL agent, in the battle of the sexes when player 1 plays with the fixed policies shown in the plots (Pr(B) = 1, 0, and 0.6).

Figure 14: An example of the players' policies when both players are EXORL agents in the battle of the sexes.
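The mixed strategy equilibrium and its value quoted above follow from the usual indifference conditions for a 2×2 game. A small sketch of our own that verifies them:

```python
import numpy as np

# Battle of the sexes (Figure 12); rows and columns ordered (B, O).
R1 = np.array([[2.0, -1.0], [-1.0, 1.0]])
R2 = np.array([[1.0, -1.0], [-1.0, 2.0]])

# Player 2's mixed strategy q must make player 1 indifferent between B and O,
# and player 1's p must make player 2 indifferent: two 2x2 linear systems.
q = np.linalg.solve(np.vstack([R1[0] - R1[1], np.ones(2)]), np.array([0.0, 1.0]))
p = np.linalg.solve(np.vstack([R2[:, 0] - R2[:, 1], np.ones(2)]), np.array([0.0, 1.0]))

print("pi1* =", p)                         # [0.6, 0.4], i.e. (3/5, 2/5)
print("pi2* =", q)                         # [0.4, 0.6], i.e. (2/5, 3/5)
print("values:", p @ R1 @ q, p @ R2 @ q)   # 0.2 and 0.2, i.e. (1/5, 1/5)
```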
6. CONCLUSION

In this paper, we presented a multiagent reinforcement learning algorithm which converges to a Nash equilibrium when the other agents are adaptable and otherwise makes an optimal response. To realize the algorithm, we introduced the extended optimal response described in Subsection 4.2, and in Subsection 4.3 we examined why the algorithm can be expected to realize the desired behavior. We then presented empirical results which show that the algorithm works as intended. The extended optimal response is obtained by maximizing the function defined in Equation (9), which has a tuning parameter $\sigma$. As already mentioned, this parameter determines the sensitivity to the deviation of the opponent's policy from a Nash equilibrium and the stability in a mixed strategy Nash equilibrium. Although we used a fixed $\sigma$ in this paper, it may be possible to adapt $\sigma$ using the information available while learning.
7. REFERENCES
[1] M. Bowling. Convergence problems of general-sum multiagent reinforcement learning. In Proc. 17th International Conf. on Machine Learning, pages 89–94. Morgan Kaufmann, San Francisco, CA, 2000.
[2] M. Bowling and M. Veloso. Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, WA, August 2001.
[3] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98) and of the 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pages 746–752, Menlo Park, July 26–30 1998. AAAI Press.
[4] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, 1997.
[5] J. Hu and M. P. Wellman. Multiagent reinforcement learning: theoretical framework and an algorithm. In Proc. the 15th International Conference on Machine Learning, pages 242–250. Morgan Kaufmann, San Francisco, CA, 1998.
[6] J. Hu and M. P. Wellman. Experimental results on Q-learning for general-sum stochastic games. In Proc. the 17th International Conference on Machine Learning, 2000.
[7] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proc. 11th International Conference on Machine Learning, pages 157–163, 1994.
[8] M. L. Littman. Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research, 2:55–66, 2001.
[9] J. F. Nash. Non-cooperative games. Annals of Mathematics, 54:286–295, 1951.
[10] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
[11] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
[12] P. Stone and M. M. Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000.
[13] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1997.
[14] C. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.