Solving Multiplayer Games with Reinforcement Learning

Qicheng Ma
[email protected]

Hadon Nash
[email protected]

Abstract

We propose a game solver that finds a set of mutually optimized strategies for a multi-player game. It decomposes the problem into a series of single-player machine learning problems, each solvable via Q-function reinforcement learning. The solver then approximates a series of distributions of highly rewarded strategies by applying the single-player learning algorithm iteratively, analogous to fictitious play, and ultimately produces a set of equilibrium strategies analogous to a mixed-strategy Nash equilibrium. We apply several learning methods to a complex multi-player card game to produce AI players with performance comparable to an expert heuristic player. We also demonstrate the game solver's ability to find the equilibrium strategies in a simple traffic simulation.

1. Background and motivation

Many simulations involve decisions by rational decision makers, including traffic simulations, environmental simulations, and many others. This project applies modern machine learning algorithms to modeling the rational decisions within such simulations. We describe this modeling as "game solving" because it produces an equilibrium distribution of decision strategies, i.e., a mixed-strategy Nash equilibrium. Another domain in which optimized strategies are desirable is interactive gaming, in which human players play against automated players (AIs). The most realistic automated players choose actions that maximize their own expected reward given limited observations of the game state and limited experience playing the game. One approach to developing automated players is to train them against other automated players. Within the game-solving framework, each automated player is trained against a distribution of opposing strategies, and each distribution of strategies is refined as training proceeds.

2. Game Framework Formulation

We implemented a general game framework in Java and model a multi-agent partial-information game as follows. This model is an adaptation and extension of work in [2].

1. The Game maintains a global state s_t for each time step t, and each GamePlayer(i) maintains a belief b^i_t which reflects its private information about the game state.

2. In each time step t, each player applies its decision function to compute an action based on its belief b^i_t:

       a^i_t := ChooseAction(b^i_t)

3. The Game collects all players' actions and updates the game state:

       s_{t+1} := ComputeGameState(s_t, a^1_t, a^2_t, ..., a^n_t)

   It also computes player-specific game belief updates and sends each player an observation and a reward:

       (o^i_{t+1}, r^i_{t+1}) := GenerateGameUpdate(s_{t+1}, s_t, i)

4. Each GamePlayer(i) receives its private observation and refreshes its belief for the next round:

       b^i_{t+1} := UpdateBelief(b^i_t, a^i_t, o^i_{t+1})

The game rule is completely defined by the ComputeGameState() and GenerateGameUpdate() functions above, and each player's strategy is entirely defined by ChooseAction() and UpdateBelief(); a minimal code sketch of these interfaces is given below.

It becomes clear that this formulation can model perfect-information games like Chess and tic-tac-toe with simply b^i_t = s_t = o^i_t, i.e., the full state information is publicly available to all players. It can also model partial-information games like bridge and poker, with b^i_t representing only a subset of s_t, as well as situations where the players can only observe the game with some noise, with o^i_t being a noisy observation of s_t and b^i_t representing a distribution over the true game state s_t. Although in this formulation all players act simultaneously in each time step, a simple modification models turn-based games where each player acts in turn after observing all previous players' actions.¹ From each player's perspective, the game can be viewed as a (partially observed) Markov Decision Process, where the state space, belief state, observations, and actions are defined in the natural sense, and the state-transition probabilities are jointly defined by the game rule and all other players' strategies. In section 3, we present methods to learn one player's optimal strategy, treating the game rule and all other players as a fixed but stochastic external environment.

¹ One way is to write the game rule such that only the (t mod numPlayers)-th player's action can affect the current game state, and that the observations include previous players' actions in the same round.
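To make the loop in steps 1-4 concrete, here is a minimal Java sketch of the framework interfaces. All class names, generics, and method signatures are illustrative assumptions for exposition; they are not the authors' actual framework API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names and signatures are assumptions,
// not the actual API of the framework described in the paper.
interface GamePlayer<B, A, O> {
    A chooseAction(B belief);                  // step 2: a_t^i := ChooseAction(b_t^i)
    B updateBelief(B belief, A action, O obs); // step 4: b_{t+1}^i := UpdateBelief(b_t^i, a_t^i, o_{t+1}^i)
}

final class GameUpdate<O> {
    final O observation; // o_{t+1}^i
    final double reward; // r_{t+1}^i
    GameUpdate(O observation, double reward) {
        this.observation = observation;
        this.reward = reward;
    }
}

interface Game<S, A, O> {
    S computeGameState(S state, List<A> actions);            // step 3: s_{t+1}
    GameUpdate<O> generateGameUpdate(S next, S prev, int i);  // step 3: (o_{t+1}^i, r_{t+1}^i)
    boolean isTerminal(S state);
}

final class GameRunner {
    // Runs one episode of the simultaneous-move loop in steps 1-4.
    static <S, A, O, B> void playEpisode(Game<S, A, O> game,
                                         List<GamePlayer<B, A, O>> players,
                                         S state, List<B> beliefs) {
        while (!game.isTerminal(state)) {
            List<A> actions = new ArrayList<>();
            for (int i = 0; i < players.size(); i++) {
                actions.add(players.get(i).chooseAction(beliefs.get(i)));
            }
            S next = game.computeGameState(state, actions);
            for (int i = 0; i < players.size(); i++) {
                GameUpdate<O> update = game.generateGameUpdate(next, state, i);
                beliefs.set(i, players.get(i).updateBelief(beliefs.get(i), actions.get(i), update.observation));
            }
            state = next;
        }
    }
}
```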

Treating the game as a whole, a given strategy for a player is optimal only with respect to all other players' strategies (or mixed strategies). From a game-theoretic perspective, we would like to find a set of equilibrium strategies for all players such that each is optimal with respect to the others' strategies (i.e., a Nash equilibrium). In section 4 we present ways to learn the equilibrium strategies by evolving all players in parallel.

3. Learning a Strategy for One Player

3.1. SARSA and learning state-action value

Posed as a Markov decision problem, each player's goal is to maximize the discounted sum of rewards:

    max Σ_{t=0}^{t̄} γ^t R(s_t)

Borrowing from the reinforcement learning literature, we want to learn a value function over the (state, action) pair:

    Q(s, a) = R(s) + γ E_{s' ∼ P_{π^{−i}}(a)} [ max_{a'} Q(s', a') ]

where the expectation is taken over the next state, drawn from a distribution P_{π^{−i}}(a) that is jointly defined by all other players' strategies π^{−i} and the current action a. We cannot compute this expectation directly, since it depends implicitly on other players' strategies and their interactions, so we can only hope to estimate it by sampling from it, i.e., by simulating the game. Fortunately, the SARSA algorithm [1] provides a way to learn the Q function while the player plays a strategy π_{Q̃} that chooses actions greedily according to the current estimate Q̃. The simple update rule, applied for each (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) tuple encountered in the chain of simulations, is (α is the learning rate):

    Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) ]

Adapting this to our game formulation, and noting that only the private belief is available to the player, we replace s_t by b^i_t:

    Q^i(b^i_t, a^i_t) ← (1 − α) Q^i(b^i_t, a^i_t) + α [ r^i_{t+1} + γ Q^i(b^i_{t+1}, a^i_{t+1}) ]

Using this SARSA update rule, together with the Q-greedy strategy²

    π^i(b^i_t) = argmax_a Q^i(b^i_t, a)

we can train player i by simulating game plays and collecting sample (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) data points to improve the estimates of Q^i and π^i.

² Actually we use an ε-greedy strategy, where an ε fraction of the time the player chooses a random action instead of the greedy action, so as to explore more states and accumulate statistics.
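As an illustration, a tabular SARSA learner over (belief, action) pairs with the ε-greedy policy of footnote 2 might look like the sketch below. The string-keyed table and all names are illustrative assumptions, not the paper's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative tabular SARSA learner over (belief, action) keys; not the authors' code.
class SarsaLearner<B, A> {
    private final Map<String, Double> qTable = new HashMap<>();
    private final double alpha;   // learning rate
    private final double gamma;   // discount factor
    private final double epsilon; // exploration fraction for the epsilon-greedy policy
    private final Random rng = new Random();

    SarsaLearner(double alpha, double gamma, double epsilon) {
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
    }

    private String key(B belief, A action) { return belief + "|" + action; }

    double q(B belief, A action) { return qTable.getOrDefault(key(belief, action), 0.0); }

    // Epsilon-greedy choice among the candidate actions available for this belief.
    A chooseAction(B belief, List<A> candidates) {
        if (rng.nextDouble() < epsilon) {
            return candidates.get(rng.nextInt(candidates.size()));
        }
        A best = candidates.get(0);
        for (A a : candidates) {
            if (q(belief, a) > q(belief, best)) best = a;
        }
        return best;
    }

    // SARSA update: Q(b_t, a_t) <- (1 - alpha) Q(b_t, a_t) + alpha [r_{t+1} + gamma Q(b_{t+1}, a_{t+1})]
    void update(B b, A a, double reward, B nextB, A nextA) {
        double updated = (1 - alpha) * q(b, a) + alpha * (reward + gamma * q(nextB, nextA));
        qTable.put(key(b, a), updated);
    }
}
```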

3.2. Learning Q by Regression

Next we propose learning an estimate of the Q function by linear regression. First we extract an n-dimensional feature vector φ(b, a) from the (b, a) pair, called the after-state feature vector. This can be identical to (b, a) for a small state space, or a reduced representation of the after-effect of applying action a to the belief state if the joint space is too large (for example, the after-a-move board configuration of Chess, or some features thereof). φ can also contain mappings to higher-degree polynomials to augment the linear regression. We try to learn a regression model Q(b, a) ∼ w^T φ(b, a) by generating as many game plays as desired. For the j-th game:

    (b_0, a_0) →_{r_1} (b_1, a_1) →_{r_2} ... →_{r_t̄} (b_t̄, ·)

The most natural estimate for Q(b_t, a_t) is the sum of actual observed discounted rewards until the end of the game:

    q_t = Σ_{k=t}^{t̄} γ^{k−t} r_k

Hence we train the regression model using the generated data set:

    D = {(φ(b_t^{(j)}, a_t^{(j)}), q_t^{(j)})}_{j=1...J, t=0...t̄} = {x^{(i)}, y^{(i)}}_i

In practice, we generate a fixed number J of games by playing a greedy strategy according to the current estimated Q, to obtain J·t̄ sample points, then we estimate the weights w by minimizing the sum-of-squares error. We use Conjugate Gradient Descent [3] to minimize the error and find w.³ Then we repeat with the new w, Q, and π.

³ Since we are working in Java, conjugate gradient is easier to write than the matrix inverse of the normal equations. It is also guaranteed to converge in n steps.
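As a sketch of the data-generation step, the snippet below turns one recorded game into the (φ(b_t, a_t), q_t) training pairs by accumulating the discounted return backwards through the trajectory. The types and the downstream least-squares fit are assumptions for illustration, not the authors' code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative construction of D = {(phi(b_t, a_t), q_t)} from one recorded game.
class RegressionDataBuilder {

    static class Sample {
        final double[] x; // phi(b_t, a_t), the after-state feature vector
        final double y;   // q_t, the observed discounted forward reward
        Sample(double[] x, double y) { this.x = x; this.y = y; }
    }

    // features.get(t) holds phi(b_t, a_t); rewards.get(t) holds the reward observed after step t.
    static List<Sample> fromGame(List<double[]> features, List<Double> rewards, double gamma) {
        int steps = features.size();
        double[] returns = new double[steps];
        double q = 0.0;
        // Walk backwards so that q_t = r_{t+1} + gamma * q_{t+1} is computed in one pass.
        for (int t = steps - 1; t >= 0; t--) {
            q = rewards.get(t) + gamma * q;
            returns[t] = q;
        }
        List<Sample> samples = new ArrayList<>();
        for (int t = 0; t < steps; t++) {
            samples.add(new Sample(features.get(t), returns[t]));
        }
        // The weights w of Q(b, a) ~ w^T phi(b, a) would then be fit to these samples,
        // e.g. by minimizing the sum-of-squares error with conjugate gradient as in [3].
        return samples;
    }
}
```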

3.3. E-M Mixed Gaussian Learner

Next we learn an estimate of the Q function as a mixture-of-Gaussians density, up to a scaling constant. The estimated density A(s, a) is described as the "action density" because it is used to select actions stochastically. The E-M algorithm is used to fit the density estimate to a set of weighted samples. Training is repeated for a series of epochs. On each epoch d, J sample games are played with actions selected with probability proportional to A_d(s, a). The sample actions and rewards are used to produce a new estimate A_{d+1}(s, a). On the initial epoch, actions are chosen uniformly. On successive epochs, actions are chosen proportional to higher powers of the expected rewards. As the epoch d approaches infinity, all actions are chosen according to the maximum available expected rewards. Given an action density A_d(s, a), the expected forward reward for (s_t, a_t) is:

    Q_d(s_t, a_t) = r_t + γ E_{s' ∼ P_{π^{−i}}(a)} E_{a' ∼ A_d(s', ·)} Q(s', a')

Comparing this to the standard Q-value definition in section 3.1, notice that the max operator over the next step's action is replaced by an expectation according to the action density A_d. The intuition is that A_d will be a sequence of converging "soft-max" densities that eventually approaches the true max. Since the exact expectation Q_d cannot be computed explicitly, we sample from it by simulating J games, and produce an estimate Q̃_d, which is then used to compute the next epoch's action density A_{d+1}(s, a):

    A_{d+1}(s, a) ∝ A_d(s, a) Q̃_d(s, a)

The above two steps are repeated to produce a sequence {A_d, Q̃_d}_{d=0}^{d̂}. Notice that by chaining the previous equation, we get:

    A_d(s_t, a_t) ∝ A_{d−1}(s_t, a_t) Q̃_{d−1}(s, a)
                 ∝ A_{d−2}(s_t, a_t) Q̃_{d−2}(s, a) Q̃_{d−1}(s, a)
                 ∝ ... ∝ Π_{i=0}^{d−1} Q̃_i(s, a)

which means the action density for round d is proportional to the product of the sampled rewards resulting from following each of the previous action densities. On each round of training, the E-M algorithm takes as input the set of samples (s_t, a_t, q_t), where q_t is the discounted forward reward Σ_{k=t}^{t̄} γ^{k−t} r_k. For each data point, q_t is used as its weight, so that E-M approximates a density A_d Q̃_d, as required for the action density update rule above. A useful optimization is to initialize the E-M algorithm with the Gaussian densities learned on the previous epoch. Usually, the same cluster of points is assigned to each Gaussian density, and so only a single E-step and M-step are required on each epoch.
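For illustration, sampling an action with probability proportional to A_d(s, a) over a finite candidate set can be done as below; the visited (s_t, a_t) pairs, weighted by their discounted returns q_t, would then feed the weighted E-M fit that produces A_{d+1} ∝ A_d Q̃_d. The interface and names are assumptions, not the authors' implementation.

```java
import java.util.List;
import java.util.Random;

// Illustrative stochastic action choice proportional to the current action density A_d(s, a).
interface ActionDensity<S, A> {
    double value(S state, A action); // A_d(s, a), up to a scaling constant
}

class DensitySampler<S, A> {
    private final Random rng = new Random();

    A sample(ActionDensity<S, A> density, S state, List<A> candidates) {
        double total = 0.0;
        for (A a : candidates) total += density.value(state, a);
        double u = rng.nextDouble() * total;
        for (A a : candidates) {
            u -= density.value(state, a);
            if (u <= 0) return a;
        }
        return candidates.get(candidates.size() - 1); // guard against floating-point round-off
    }
}
```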

4. Learning a Game Equilibrium

The GameSolver uses a procedure analogous to the "fictitious play" procedure from the game theory literature, whereby each player repeatedly optimizes his strategy relative to the existing opposing strategies. Fictitious play is known to converge for many interesting classes of games [4]. The GameSolver maintains a distribution of strategies for each player, which is refined gradually as the Q function is refined. In this way, each strategy distribution is gradually improved with respect to the existing opposing strategy distributions. If the procedure does not converge, it indicates that the game is likely to be unstable in play between real-world decision makers. With the E-M learning algorithm, strategy densities are defined using a reward-compounding formula. After d rounds of learning, each strategy accumulates a density R_d. As d approaches infinity, this density approaches the set of fully optimal strategies, which is the usual objective of Q-learning algorithms. This density can be interpreted as the compound reward from d successive game plays. This appears to be a good compromise between exploration and exploitation and a good approximation to real-world bounded rationality.
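The outer loop of such a solver can be sketched as below: each player is repeatedly re-trained against the other players' current strategies, analogous to one sweep of fictitious play. The Strategy and trainer types are assumptions for illustration, not the authors' GameSolver API.

```java
import java.util.List;

// Illustrative outer loop of the game solver; types and names are assumptions.
interface Strategy {}

interface SinglePlayerTrainer {
    // Learn an (approximately) best response for player i, treating the game rule
    // and all other players' current strategies as a fixed stochastic environment.
    Strategy train(int playerIndex, List<Strategy> currentStrategies);
}

class GameSolver {
    static List<Strategy> solve(SinglePlayerTrainer trainer, List<Strategy> strategies, int rounds) {
        for (int round = 0; round < rounds; round++) {
            for (int i = 0; i < strategies.size(); i++) {
                // Optimize player i against the others, analogous to one step of fictitious play.
                strategies.set(i, trainer.train(i, strategies));
            }
            // In practice one would also check a convergence criterion here and stop early.
        }
        return strategies;
    }
}
```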

5. Game Solver Applications

We apply the learning methods of the previous section to a fairly complex card game called Tractor [6] and demonstrate that the learned strategy is comparable to a complex rule-based player written by game experts.

5.1. Learning to Play a Card Game

5.1.1. The Tractor Game

Tractor is a positive-point trick-taking game similar to Bridge but played with 2 decks of cards and 4 players divided into 2 teams. The main goal of the game is to capture cards of point value (5's, 10's and K's) by winning tricks in a fashion similar to Bridge. During each round, the winner of the previous round leads by playing (a) a single card, (b) a pair of identical cards, (c) several consecutive pairs, or (d) a combination of (a)-(c) in the same suit that are each top in the suit. Subsequent players must follow with the same number of cards, following suit and pattern to the extent possible, or use a set of pattern-matching trump cards to trump and win if completely void of the leading suit. The winner of each round captures the point cards played in it. The complexity of the game arises from the dynamic trump suit, dynamic trump number, mutual inhibition between different card types, collaboration within teams to take advantage of trumping a voided suit, inferring information from other players' moves, etc. The detailed rules and common strategies of the Tractor game are outside the scope of this project; interested readers are referred to [5] and [6].

5.1.2. Learning Algorithms

The belief state of a player of this game could consist of:

1. Cards currently in my hand;
2. Cards played in the current round;
3. Superficial information about other players' hands obtained by simple card tracking (e.g., someone is void of a suit);
4. Inferred information about other players' hands (e.g., my partner did not play point cards when we were winning a round, so he probably has no point cards);
5. All previous rounds' play history.

These are sorted in increasing order of complexity. We choose to use only information from (1)-(3), since representing (4) and (5) is extremely costly. Our learning player extracts a set of features φ(b, a) computed using only (1)-(3) above for b and the candidate action a. The set of features is selected by a game expert's experience about what matters most (e.g., whether my partner is void in the suit being led), plus some pre-computed winning probabilities given previously played cards (e.g., when both A's are gone, a K has a 100% winning probability). The feature vector used in testing is in an 18-dimensional integer domain, with an average branching factor of 3. We implemented and tested the following players:⁴

• A SARSA learner using φ(b, a) instead of the full (b, a) as the index of the Q function.
• A LinearRegression (LR) learner that fits a linear model Q ∼ w^T φ(b, a).
• A QuadraticRegression (QR) learner that fits a 2nd-degree polynomial in φ(b, a).
• A CubicRegression (CR) learner that fits a 3rd-degree polynomial in φ(b, a).
——
• A Random (RAND) player which selects randomly among the candidate actions.
• An Expert (XP) player derived from human game experts' intuitions and experience. The Expert player is based on heuristics and rules such as "Try to lead with A's and sure-win cards first", "Play point cards when my partner is winning", "Play point cards when my partner but not my opponent is void of that suit," etc.

⁴ We also tried implementing an E-M learner, but it did not work well.

Figure 1: Learning against RANDOM players

5.1.3. Simulations

In each test run, we run a pair of learning players of the same type against a pair of Random or Expert players, with alternating training phases (10 games) and testing phases (1000 games), and report the ratio of scores between the two teams in the test phases. Figure 1 shows that all learning players beat the Random players substantially, and XP is the best. Although the XP player never learns, its variance in the graph is mostly due to the intrinsic variability of the game, and also due to the denominator (the Random team's score) being very close to zero. Figure 2 shows a more pronounced learning curve against the XP player. As the order of the regression learner increases from Linear to Quadratic to Cubic, the ultimately learned strategy trends higher, although the amount of training data required (the length of the upward part of the learning curve) also increases. Since SARSA tries to learn the exact Q function by accumulating statistics for each encountered φ(b, a) value, it requires a large amount of data and has not converged yet. As a one-off, we ran SARSA for long enough (10k games) and it ultimately achieves a performance ratio of 0.45, which suggests that even with enough data it may be overfitting and not generalizing well.
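The alternating train/test protocol can be sketched as follows; the MatchRunner abstraction and the score bookkeeping are illustrative assumptions rather than the actual test harness.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative alternation of training and testing phases; not the authors' harness.
interface MatchRunner {
    // Plays one game and returns the scores of (learning team, benchmark team).
    // When learningEnabled is false, the learners' strategies are frozen.
    double[] playGame(boolean learningEnabled);
}

class EvaluationLoop {
    static List<Double> run(MatchRunner runner, int phases, int trainGames, int testGames) {
        List<Double> ratios = new ArrayList<>();
        for (int phase = 0; phase < phases; phase++) {
            for (int g = 0; g < trainGames; g++) runner.playGame(true);   // training phase (e.g. 10 games)
            double learnerScore = 0, benchmarkScore = 0;
            for (int g = 0; g < testGames; g++) {                          // testing phase (e.g. 1000 games)
                double[] scores = runner.playGame(false);
                learnerScore += scores[0];
                benchmarkScore += scores[1];
            }
            ratios.add(learnerScore / benchmarkScore); // score ratio reported for this phase
        }
        return ratios;
    }
}
```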

Figure 2: Learning against EXPERT players

5.1.4. Simultaneous Learning and Shadow Learning

We explored starting with all 4 learning players and evolving them simultaneously. Due to the near zero-sum nature of the game, it is hard to see that they are improving using in-training scores alone. However, we can test against a benchmark (XP) in a parallel test game, and we observed that the learning performance is almost identical to Figure 2. We also tried letting the learning player learn from the experience generated by the Expert player (i.e., "shadow" learning), which seems to speed up the initial learning of the higher-order learners a little, but does not affect the final performance.

5.2. Learning Commuter Decisions for Traffic Simulation

This section demonstrates the use of the game solver as a simulation tool for decision support. We have implemented a simple traffic simulation tool which models the rational decisions of individual commuters. It can be used to project the benefits of a new train route and the impacts on other modes of transportation such as highways. We suppose that a new train route is being planned to run parallel to an existing highway route. We suppose that 300,000 commuters travel this route every day, and each commuter will decide each day whether to drive his car or to take the train. We assume that each commuter will attempt to minimize the inconvenience of his commute each day. The commuters are represented by three opponent players.

Multiple opponents are necessary because the expected behavior of self-interested players differs from that of cooperating players, as shown in the analytic solution below. The reward assigned to each player p is:

    R_p = exp(−D_p^3)

where D_p is the commute time of player p, and D_p^3 represents the inconvenience cost of the commute. Delay is calculated according to a simple simulation of the highway and the train:

    D_c = max(T_c, V_c · E_c)
    D_t = T_t + min(F_t, E_t / V_t)

where D_c and D_t are the car and train delay, V_c and V_t are the car and train rider volume, T_c and T_t are the car and train travel time, E_c is the car exit time, and E_t and F_t are the train capacity and maximum train wait. The formulas for D_c and D_t approximate the following rules: cars queue up to exit the highway sequentially, and trains exit the station as soon as they are filled with passengers.

The game state is represented by the volume of rail and road traffic. Each player observes the current traffic volume and decides how to commute. Each player can divide his commute between road and rail, because each player represents a large number of similar people. Each player learns a Q function approximating the expected reward for each action in each game state, using the E-M learning algorithm described above. A finite number of actions are available to each player, which enables the player to make a stochastic action choice by simply evaluating his action density for each available action.

The analytic solution for a large number of players and high traffic volumes is:

    D_c = D_t
    V_c = V/2 − (T_t + √((V E_c − T_t)^2 − 4 E_c (V T_t + E_t))) / (2 E_c)

Essentially, with self-interested players the road becomes congested to the point that it is no more convenient than the train. By contrast, with cooperating players a greater total reward can be achieved with a lower volume of car traffic:

    d/dV_c (V_c D_c^3) = d/dV_c (V_t D_t^3)

Figure 3: Car volume (y) vs. total volume (x) for the traffic simulation

Figure 3 shows the car volume as a function of total traffic volume. For low total traffic volume, all players opt for using cars (forming the slope-1 line), and for high total traffic volume, car commuters remain around 100,000, which is consistent with the analytic solution above given our particular parameters.
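The delay and reward formulas above are simple enough to restate directly in code; the sketch below implements them as given. The field names are descriptive choices of ours, and no parameter values are taken from the paper.

```java
// Sketch of the commute delay and reward model; implements the formulas above.
class TrafficModel {
    final double carTravelTime;   // T_c
    final double trainTravelTime; // T_t
    final double carExitTime;     // E_c
    final double trainCapacity;   // E_t
    final double maxTrainWait;    // F_t

    TrafficModel(double tc, double tt, double ec, double et, double ft) {
        this.carTravelTime = tc;
        this.trainTravelTime = tt;
        this.carExitTime = ec;
        this.trainCapacity = et;
        this.maxTrainWait = ft;
    }

    // D_c = max(T_c, V_c * E_c): cars queue up to exit the highway sequentially.
    double carDelay(double carVolume) {
        return Math.max(carTravelTime, carVolume * carExitTime);
    }

    // D_t = T_t + min(F_t, E_t / V_t): trains leave as soon as they are filled.
    double trainDelay(double trainVolume) {
        return trainTravelTime + Math.min(maxTrainWait, trainCapacity / trainVolume);
    }

    // R_p = exp(-D_p^3): the cubed delay models the inconvenience cost of the commute.
    double reward(double delay) {
        return Math.exp(-Math.pow(delay, 3));
    }
}
```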

6. Conclusions and future work

We have demonstrated several reinforcement learning methods for a complex multi-player card game, with final performance comparable to that of an expert heuristic player. These methods are easily generalizable to other games by simply replacing the feature and reward functions. Also, we have demonstrated discovery of a set of equilibrium strategies for a simple traffic simulation by iterative reinforcement learning. The learned strategies approximate the analytic solution to the game, and we expect them to also approximate the most prominent strategies among real-world decision makers. In future work, the reinforcement learners can be applied to producing adaptive players for other interactive games, where a heuristic player may not be available. Also, the game solver can be applied to other real-world decision problems to improve the accuracy of simulations.

References

[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998. (http://www.cs.ualberta.ca/~sutton/book/the-book.html)
[2] Michael Genesereth and Nathaniel Love, General Game Playing: Overview of the AAAI Competition, 2005. Stanford University, CA. (http://games.stanford.edu/competition/misc/aaai.pdf)
[3] Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, August 1994. (http://www.cs.cmu.edu/~jrs/jrspapers.html)
[4] Ulrich Berger, Brown's Original Fictitious Play, 2005. Game Theory and Information 0503008, EconWPA. (http://ideas.repec.org/p/wpa/wuwpga/0503008.html)
[5] John McLeod, Rules of Card Games: Tractor. (http://www.pagat.com/kt5/tractor.html)
[6] Wikipedia, Tractor (card game), viewed 2008-12-01. (http://en.wikipedia.org/wiki/Tractor_(card_game))
