Comparison of Selection Strategies in Monte-Carlo Tree Search for Computer Poker
Boris Iolis
[email protected]
Gianluca Bontempi
[email protected]
Université Libre de Bruxelles, Département d’Informatique, Machine Learning Group, Boulevard du Triomphe CP212, 1050 Bruxelles, Belgium
Abstract

This paper discusses and compares two alternative approaches to the selection problem in Monte-Carlo Tree Search for computer poker. The first approach relies on conventional multi-armed bandit algorithms, which aim to balance exploration and exploitation in order to perform a sequence of actions that maximizes the sum of collected gains. The second approach relies on a "selecting-the-best" strategy, whose objective is not to maximize the sum of collected gains but to maximize the probability of selecting the correct action after a finite number of steps. These two ways of solving a sequential selection problem lead to two alternative algorithms in the context of computer poker. This paper presents the two algorithms and assesses their impact on the performance of a Texas Hold'em poker bot.
1. Introduction

Computer poker has recently become an interesting testbed for machine learning research for two main reasons. First, poker is known to be more difficult than conventional games, like chess, since it is a stochastic, incomplete-information game. Second, the growing popularity of Texas Hold'em worldwide is attracting new players, as well as new researchers, to this game. In the computer-poker literature, the most studied variant is 2-player limit Texas Hold'em, for which the best performing bots are already close to defeating the best human players. Among the existing approaches, the bots based on game tree search seem to be the most promising. Game tree search bots (for example, Schauenberg, 2006; Billings et al., 2004) maintain a tree representation of the game, using an opponent model to represent the opponent's behaviour. When the bot needs to take a decision, it explores the game tree in order to compute the expected values of all the
possible actions, and finally chooses the action with the highest expected value.

The most popular version of Texas Hold'em is no-limit Hold'em. The difference between limit and no-limit lies in the raises that are allowed. In limit poker, a player can only raise by a predefined amount, while in no-limit the raise can be of any desired amount, limited only by the player's stack. On the one hand this makes the game more interesting (it allows bluffing with bets of any size, for instance); on the other it increases its computational complexity. In fact, the corresponding game tree becomes too large to be fully explored, due to the increased branching factor. To circumvent this problem, a new approach, based on game tree search, has recently been proposed (Van den Broeck et al., 2009). This approach builds on the Monte-Carlo Tree Search algorithm (MCTS), first proposed by Coulom (2006) for computer Go. MCTS can be used to explore the game tree partially and obtain an approximation of the expected values of all actions, which allows the poker bot to take decisions without exploring the game tree in its entirety.

This paper focuses on the core of the MCTS algorithm, which consists of a series of selection steps for which various methods have been proposed in the literature. The best-known approach interprets the selection steps as instances of the multi-armed bandit problem. The multi-armed bandit problem, first introduced by Robbins (1952), is a classical instance of an exploration/exploitation problem in which a casino player has to decide which arm of a slot machine to pull in order to maximize the total reward over a series of rounds. Each arm of the slot machine returns a reward drawn from a distribution that is unknown to the player. Several effective algorithms exist in the literature to deal with this problem, and their application to computer games has already been proven effective (Gelly and Wang, 2006, among others).

However, the bandit paradigm is not the only way of interpreting a selection problem in a stochastic environment. The problem of correct selection has also been studied extensively by the stochastic simulation community, which has proposed a set of strategies (hereafter denoted as "selecting-the-best" algorithms) to sample alternative options in order to maximize the probability of correct selection in a finite number of
steps. A detailed review of the existing approaches to sequential selection, as well as a comparison of "selecting-the-best" and bandit strategies, is given in (Caelen, 2009). Here, we advocate that a "selecting-the-best" approach should be better suited to the kind of selection problem tackled in MCTS. The reason is that "selecting-the-best" algorithms are designed to first gather as much information as possible about the expected values of the alternative actions using a number of exploratory trials, and then, at the very end, to choose the best alternative on the basis of the collected information. The intermediate gains obtained during the exploratory actions are not relevant for the final quality of the algorithm. This rationale corresponds well to the type of selection task required in MCTS, where selection is only used to explore the game tree and approximate expected values, not to take the final decision in the poker game.

In this paper, our main contribution is to study and discuss the use of both approaches for the selection problem in the context of computer poker. We assess two algorithms that are new to this problem and compare them to a more classical approach through a series of simulated poker games. The first is BAST, a multi-armed bandit algorithm introduced by Coquelin and Munos (2007). The second is ExploreMaxµg, a "selecting-the-best" algorithm first presented by Caelen (2009).

The rest of the paper is organized as follows: section 2 provides a thorough explanation of the game tree search approach, followed by a description of MCTS in section 3. Section 4 presents two types of algorithms to solve the selection problem. Section 5 reports the experimental results of our implementations. Finally, section 6 concludes and sketches some future directions.
2. Game Tree Search

The basic idea behind the game tree search approach is to build a game tree representing a full hand of poker, and then to explore this tree to compute the expected value of each possible action of the bot. To represent the opponent's strategy in the tree, information from an opponent model is used. In the limit poker variant, the tree can be fully explored each time the bot has to take a decision, and the exact expected values can be computed (assuming the opponent model is accurate). This approach has already been used in various successful poker bots (for example, Schauenberg, 2006; Billings et al., 2004).

Figure 1 shows an example of a poker game tree, taken from (Billings, 2006). The circles represent decision nodes, each having three child nodes, one for each possible action in limit Hold'em (fold, call or raise). The squares represent terminal states in which one of the players has folded. Hexagons represent chance nodes, where additional community cards are dealt into play at the end of a betting round. A small illustrative sketch of such a tree structure in code is given after Figure 1.
Figure 1. Example game tree for 2-player limit Hold'em (Billings, 2006).
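To make this structure concrete, the sketch below shows one possible way of representing such a tree in code. The node types mirror Figure 1; all names and fields are hypothetical illustrations rather than the data structures of any particular poker bot.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative representation of the game tree described above.
# Circles = DecisionNode, hexagons = ChanceNode, squares = TerminalNode.

@dataclass
class Node:
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    samples: List[float] = field(default_factory=list)  # rewards back-propagated by MCTS

    def estimate(self) -> float:
        """Sample mean of the rewards observed at this node."""
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

@dataclass
class DecisionNode(Node):   # a player chooses fold / call / raise
    player: int = 0

@dataclass
class ChanceNode(Node):     # community cards are dealt
    pass

@dataclass
class TerminalNode(Node):   # a player folded, or showdown is reached
    payoff: float = 0.0
```

The later sketches in sections 3 and 4 reuse this hypothetical `Node` shape (the `samples` list and the `estimate()` method).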
3. Monte-Carlo Tree Search

The Monte-Carlo Tree Search (MCTS) algorithm has been used extensively for computer Go in recent years (for example in Coulom, 2006), and was formalized for games in general by Chaslot et al. (2008). It is an algorithm that builds a partial game tree incrementally in order to estimate the expected values, without handling the full game tree. It works in four phases, which are iterated until the bot runs out of thinking time (see Figure 2):

Selection: A path from the root to a leaf node of the partial game tree is selected.
Expansion: If the leaf node is not a leaf of the full game tree, its children are added to the partial game tree.
Simulation: The game is simulated until its conclusion to obtain a new reward sample.
Back-propagation: The sample is propagated upwards along the selected path in the game tree.
Figure 2. Steps of the MCTS algorithm (Chaslot et al., 2008).
In this paper, the selection phase is viewed as follows: starting from the root node of the tree, a path is built by iteratively selecting the next node in the path. This subproblem of selecting the next node to be added to the path is the core of the selection phase. It can be formalized as follows. Let k be the last node that was added to the path during the current selection phase, and let C(k) be the set of children of node k, with N being the size of C(k). The next step in the selection phase is to select a node from C(k) to augment the current path. The selection phase is over when the last node in the path has no children. In this paper, the process of selecting the next node in the path is referred to as a selection problem.

After the selection is done, the tree is expanded, and a data sample is returned by the simulation step. Based on these samples, MCTS can estimate the expected values of all possible actions. When the bot runs out of thinking time, it chooses the action a*, associated with the node i*, such that

    $i^* = \arg\max_{i \in C(r)} \hat{\mu}_i$,     (1)

where $\hat{\mu}_i$ is the estimate of the expected value for node i, and r is the root node of the game tree. In other terms, the aim of the selection phase is to explore the paths that will lead to the most informative data about the expected value of each action.
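To make the interplay of the four phases and the final choice of equation (1) explicit, here is a minimal sketch of an MCTS decision loop. It reuses the hypothetical `Node` representation sketched in section 2; the `expand`, `simulate` and `select_child` callbacks are assumptions standing in for the game engine, the opponent model and the selection strategies of section 4, not the CSPoker implementation.

```python
import time

def mcts_decide(root, expand, simulate, select_child, think_time_s=0.5):
    """Sketch of an MCTS decision loop following the four phases above.

    `expand(node)` should return the node's children in the full game tree
    (an empty list for terminal states), `simulate(node)` should play the
    hand out and return a reward sample, and `select_child(node)` is the
    pluggable selection strategy discussed in section 4.  `root` is assumed
    to follow the Node sketch of section 2.
    """
    deadline = time.time() + think_time_s
    while time.time() < deadline:
        # 1. Selection: walk down the partial tree along selected children.
        node, path = root, [root]
        while node.children:
            node = select_child(node)
            path.append(node)

        # 2. Expansion: add the children of the reached node, if any.
        node.children = expand(node)

        # 3. Simulation: play the game to its conclusion for one sample.
        reward = simulate(node)

        # 4. Back-propagation: push the sample up along the selected path.
        for visited in path:
            visited.samples.append(reward)

    # Final decision (equation 1): the root child with the highest estimate.
    return max(root.children, key=lambda child: child.estimate())
```

The selection strategies discussed in section 4 (BAST and ExploreMaxµg) would plug in as the `select_child` argument.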
4. Approaches to the Selection Problem

This section presents two approaches to the selection problem. Note that these methods are used only on some nodes of the game tree, namely the bot's own decision nodes. The selection problem for opponent decision nodes and chance nodes is not addressed here; it can be solved using more direct approaches, as described in section 5.1.

4.1 The Multi-Armed Bandit Approach

A classical approach to the selection problem in a game context is to use a multi-armed bandit algorithm (Gelly and Wang, 2006, among others). A bandit problem concerns the selection among a set of arms, each giving rewards according to an unknown probability distribution. At each step, the user chooses an arm and receives a reward according to his choice. The goal of a bandit algorithm is to maximize the cumulative profit in the long run, over a potentially infinite number of steps. In our MCTS selection problem, the child nodes represent the arms of the bandit, and the rewards associated with each node are the data samples returned by the simulation step.

In the case of MCTS for computer poker, Van den Broeck et al. (2009) proposed to use UCT (Kocsis and Szepesvári, 2006), as well as their own variation, UCT+, for the selection problem. In this paper, we adapt another bandit algorithm, BAST (Coquelin and Munos, 2007), to this problem. This algorithm was designed as an improvement over UCT, especially in the case of smooth trees, and its authors have proven that it performs at least as well as UCT in all cases. Like most bandit algorithms, BAST computes an upper confidence bound $B_i$ for each child node $i$ using the previously obtained data samples, and then selects the child node with the highest bound. For each node $i$, the BAST upper bound is defined as

    $B_i = \min\left( \hat{\mu}_i + c_{n_i} + \delta_d,\; \max_{j \in C(i)} B_j \right)$

for any non-leaf node, and

    $B_i = \hat{\mu}_i + c_{n_i}$

for leaf nodes, where $\hat{\mu}_i$ denotes the mean of the samples at node $i$, and $c_{n_i}$ is a confidence term of the form

    $c_{n_i} = \sqrt{\dfrac{\log\left(2 N n (n+1)/\delta\right)}{2 n_i}}$,

with $n_i$ being the number of available samples for node $i$, $n$ the total number of samples obtained for the whole game tree, and $\delta$ a probability used in the proof of convergence of BAST (see Coquelin and Munos, 2007). The parameter $\delta_d$ represents the smoothness coefficient of the tree at the depth $d$ at which the selection is taking place. It should be defined such that, for any node $k$ of depth $d+1$ with parent $i$ of depth $d$, $\mu_i - \mu_k \leq \delta_d$ is satisfied; it can for instance be set as a polynomial function of $d$.

Note however that, while BAST was designed for smooth trees, the game tree of no-limit Hold'em is not smooth. In fact, the range of expected values that can be obtained in a particular situation does not decrease with depth. For example, the decision to call an all-in bet in the last betting round could yield the biggest reward possible (the opponent's entire stack), while folding in the same situation gives a negative reward, equal to the chips invested so far in the pot. For this reason, we propose to modify the BAST bound, replacing the smoothness coefficient with a variance measure. This gives the following bound definition for our modified BAST:
    $B_i = \min\left( \hat{\mu}_i + C \dfrac{\sigma_i}{\sqrt{n_i}},\; \max_{j \in C(i)} B_j \right)$

for non-leaf nodes, and

    $B_i = \hat{\mu}_i + C \dfrac{\sigma_i}{\sqrt{n_i}}$

for leaf nodes, where $\sigma_i$ is the standard deviation of the samples at node $i$, and $C$ is a tuning parameter (for the experiments presented in section 5, $C$ was set to 1). As is the case for many bandit algorithms, the first part of this bound is the sum of an exploitation term and an exploration term: $\hat{\mu}_i$ is the exploitation term, favouring the child nodes with the highest estimate so far, while $C \sigma_i / \sqrt{n_i}$ is the exploration term, favouring the nodes for which the estimate has the highest variance and is thus the most uncertain.

In our modified version of BAST, the $\hat{\mu}_i + C \sigma_i / \sqrt{n_i}$ part is the same as the UCT+ bound (Van den Broeck et al., 2009). However, the other term, $\max_{j \in C(i)} B_j$, can make the bound tighter. Indeed, in some cases we could have $\max_{j \in C(i)} B_j < \hat{\mu}_i + C \sigma_i / \sqrt{n_i}$, as $\sigma_i$ is also computed with samples coming from the child nodes $j \in C(i)$. Samples from sub-optimal children can increase the variance of the samples at node $i$, and thus increase $\hat{\mu}_i + C \sigma_i / \sqrt{n_i}$ compared to $\max_{j \in C(i)} B_j$ for some nodes.
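As an illustration of how this modified bound could be evaluated, the recursive sketch below follows the two equations above, again on the hypothetical `Node` representation of section 2. It is a sketch of the rule as described here, not the CSPoker code; in particular, the handling of nodes with fewer than two samples (forced exploration) is our own assumption.

```python
import math

def modified_bast_bound(node, C=1.0):
    """B_i of the modified BAST rule described above:
    leaves:     B_i = mu_i + C * sigma_i / sqrt(n_i)
    non-leaves: B_i = min(that quantity, max_j B_j over the children j)."""
    n_i = len(node.samples)
    if n_i < 2:
        # Assumption: barely visited nodes get an infinite bound so that
        # they are explored first.
        return float("inf")
    mu = sum(node.samples) / n_i
    sigma = math.sqrt(sum((x - mu) ** 2 for x in node.samples) / (n_i - 1))
    ucb = mu + C * sigma / math.sqrt(n_i)      # same form as the UCT+ bound
    if not node.children:
        return ucb
    return min(ucb, max(modified_bast_bound(child, C) for child in node.children))

def bast_select_child(node, C=1.0):
    """Selection step: descend into the child with the highest bound."""
    return max(node.children, key=lambda child: modified_bast_bound(child, C))
```

With $C = 1$ (the value used in our experiments), `bast_select_child` could be passed as the `select_child` callback of the MCTS sketch in section 3 to obtain a modified-BAST-style selection step.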
Finally, it should be noted that multi-armed bandit algorithms in general are not particularly well adapted to this selection problem, as pointed out by Coulom (2007). These algorithms aim to maximize the cumulative long-term reward, while the goal of MCTS is to maximize the information collected from the data, in order to obtain a better estimate of the expected values.

4.2 The Selecting-the-Best Approach

This section details an alternative approach to the MCTS selection problem, which relies on a "selecting-the-best" strategy: the goal is to perform a certain number of sampling actions in order to maximize the probability of selecting the best alternative after a finite number of iterations. It is important to understand the difference between the two approaches. On the one hand, in multi-armed bandit algorithms, the gain of each intermediate step matters, since it contributes incrementally to the objective function to be maximized. On the other hand, in "selecting-the-best" algorithms, intermediate steps do not matter individually (in other words, they are "free"), since their only goal is to return enough information to maximize the gain of the only action that matters, i.e. the last one (see equation 1). For these reasons we deem the "selecting-the-best" interpretation of the MCTS selection step more appropriate than the bandit interpretation: in MCTS, the choices made by the algorithm during the construction of the partial game tree have no cost and should not be taken into consideration. All the effort should be concentrated on maximizing the probability of making a correct final decision. Note that the number of iterations is limited by the available amount of thinking time. While thinking time remains, the algorithm can perform "free" iterations; when this time runs out, the algorithm chooses what has been determined to be the best action.

In this paper we employ the "selecting-the-best" algorithm ExploreMaxµg proposed in (Caelen, 2009). This algorithm is based on the definition of $\mu_g$, the expected reward of a greedy decision. Here, by greedy decision we mean the act of choosing the alternative with the highest estimated expected value.
The rationale of this algorithm is to perform a set of explorative actions in order to maximize $\mu_g$ and eventually, when the thinking time runs out, to select the alternative in a greedy way, i.e. to select the child with the highest associated estimate.
Algorithm 1 ExploreMaxµg for the selection problem

if a child was tested less than 2 times then
    Select that child $i$
else
    Compute the estimators $\hat{\mu}_g$ and $i_g$, where $i_g$ is the child with the highest estimate so far
    for each child $i \in C(k)$ do
        Compute an upper bound on the variance of the samples at node $i$, and its relative variance bound with respect to the other children
    end for
    if sampling $i_g$ is expected to increase $\hat{\mu}_g$ then
        Choose child $i_g$
    else
        Randomly choose a child node, with a probability of choosing child $i$ proportional to its relative variance bound
    end if
end if
Algorithm 1 details the pseudo-code of ExploreMaxµg. First, the algorithm makes sure that all child nodes have been selected at least 2 times, in order to have a decent estimate of each expected value. Then, the expected gain of a greedy action, $\hat{\mu}_g$, is estimated by a Monte-Carlo procedure. Afterwards, an upper bound on the variance of the samples at each node $i$ is computed, as well as the relative variance bound of node $i$ compared to the other nodes in $C(k)$. Finally, a test is made in order to determine whether choosing the greedy child $i_g$ will increase $\hat{\mu}_g$. If it is indeed the case, $i_g$ is chosen (exploitation). If not, a random node is chosen (exploration), with each node $i$ selected with a probability proportional to its relative variance bound. In this way, the algorithm favours the nodes with the highest relative variance bound, which leads to a more efficient exploration of the nodes on which the current estimate is the most uncertain. Experimental results comparing this algorithm with bandit strategies on synthetic and model selection tasks are available in (Caelen, 2009). A loose illustrative sketch of this selection step is given below; in the following section we then report some results in the context of MCTS for poker.
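The sketch below mimics the structure of Algorithm 1 on the `Node` representation of section 2. It is only a loose illustration: the exact estimators and acceptance test of Caelen (2009) are not reproduced here, and the bootstrap estimate of $\hat{\mu}_g$, the variance-of-the-mean weights and the exploit/explore test are simplified stand-ins of our own.

```python
import math
import random

def exploremax_select_child(node, n_greedy_sims=100):
    """Loose, ExploreMaxMuG-flavoured selection step (simplified stand-in)."""
    children = node.children

    # 1. Make sure every child has been sampled at least twice.
    for child in children:
        if len(child.samples) < 2:
            return child

    means = [sum(c.samples) / len(c.samples) for c in children]
    greedy = max(range(len(children)), key=lambda i: means[i])

    # 2. Monte-Carlo (bootstrap) estimate of mu_g, the reward of a greedy
    #    decision, by resampling the observed rewards of each child.
    sims = [max(random.choice(c.samples) for c in children)
            for _ in range(n_greedy_sims)]
    mu_g_hat = sum(sims) / len(sims)

    # 3. Uncertainty weight of each child: variance of its sample mean.
    var_of_mean = [
        sum((x - m) ** 2 for x in c.samples) / (len(c.samples) - 1) / len(c.samples)
        for c, m in zip(children, means)
    ]
    total = sum(var_of_mean)
    if total == 0:
        return random.choice(children)
    probs = [v / total for v in var_of_mean]

    # 4. Exploit/explore test (stand-in): if an optimistic value of the
    #    greedy child still exceeds mu_g_hat, one more sample of it may
    #    increase the greedy-reward estimate, so exploit it; otherwise
    #    explore a child drawn with probability proportional to its weight.
    if means[greedy] + math.sqrt(var_of_mean[greedy]) > mu_g_hat:
        return children[greedy]
    return random.choices(children, weights=probs, k=1)[0]
```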
5. Experimental Results
The experimental assessment and comparison of the selection strategies discussed in the previous section is carried out by means of an open-source development framework called CSPoker (available at http://code.google.com/p/cspoker/). This framework implements all the necessary components for poker play, as well as some poker bots. These bots are based on the work of Van den Broeck et al. (2009) and implement the MCTS algorithm. Our implementation work consisted of adapting the existing MCTS code by plugging in our contributed selection strategies. To test the performance of the new strategies, we ran a number of simulations, each made of a series of games between two CSPoker bots using different selection methods. We first detail the setup of these experiments, and then present the results.
5.1 Setup

Each simulation, comparing selection strategy A with selection strategy B, is composed of a number of matches between two bots. Both bots use exactly the same parameters and have the same time and space resources; the only difference lies in the selection strategy used (either A or B). The thinking time for each action was set to 500 ms. Each match is played for 3000 consecutive hands. At the beginning of each match, the memory of both bots is reset, so that they have no knowledge of their opponent. At the end of each match, the average number of chips won or lost per hand, expressed in small blinds per hand (the amount of chips won in each hand divided by the small blind value), is saved. The result of a simulation is the average over 100 such matches.

To run these simulations, we used two distinct configurations. To understand them, an explanation is required of the way the selection strategy is originally implemented in the CSPoker bots. In these bots, the multi-armed bandit approach is used only for the bot's own decision nodes in the game tree. For the other nodes (opponent decision nodes and chance nodes), more direct approaches are implemented, which implicitly reflect the beliefs of the bot about what will happen in those situations. We chose two such approaches to define our configurations. In the first configuration, the selection strategy for those nodes is to always select the child node $i$ with the minimal estimate $\hat{\mu}_i$. This implies the very pessimistic belief that the worst situation for the bot will always happen, and results in a very conservative game. In the second configuration, a more reasonable selection strategy was used, selecting a child node $i$ at random according to a fixed probability distribution over the children rather than always assuming the worst case; this results in a more balanced play by the bot. A sketch of these two rules is given below.
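As a small illustration, the two configurations can be written as two alternative selection rules for opponent and chance nodes, using the hypothetical `Node` sketch of section 2. The random rule of configuration 2 is shown here as a plain uniform draw, since the exact probability formula used in the CSPoker bots is not reproduced in this sketch.

```python
import random

def select_child_config1(node):
    """Configuration 1: pessimistic belief -- always descend into the child
    with the minimal estimated expected value for the bot."""
    return min(node.children, key=lambda child: child.estimate())

def select_child_config2(node):
    """Configuration 2 (simplified stand-in): draw a child at random; the
    actual CSPoker rule assigns a specific probability to each child, which
    is not reproduced here."""
    return random.choice(node.children)
```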
5.2 Results

We first tested the BAST and ExploreMaxµg strategies against the classical UCT strategy, already present in the CSPoker framework. Table 1 contains the results obtained (where "conf." refers to the configuration used, see section 5.1).

Table 1. Simulation results of BAST and ExploreMaxµg against UCT.

Tested strategy    Conf.    Mean EV (in sb/hand)    95% confidence interval
BAST               1        0.0935                  [0.089, 0.098]
ExploreMaxµg       1        0.1887                  [0.164, 0.214]
BAST               2        3.5152                  [3.087, 3.953]
ExploreMaxµg       2        3.1716                  [2.774, 3.570]
These results show the superiority of the two new strategies over UCT in both configurations. The second part of the experiment consisted in having BAST play against ExploreMaxµg. We obtained a positive result of 0.1571 small blinds per hand (in configuration 1) in favour of ExploreMaxµg ([0.133, 0.181] 95% confidence interval). This preliminary result suggests that the "selecting-the-best" approach suits the selection problem of MCTS better than the multi-armed bandit approach.
6. Conclusions and Future Work

In this paper, we presented two alternative approaches to the selection problem in Monte-Carlo Tree Search applied to no-limit Hold'em. For each approach, one particular algorithm was implemented in an existing framework, and we showed preliminary results for both strategies compared with a classical selection strategy. Future work will first focus on additional experimental tests. Another interesting direction is to develop methods that remove the assumption of stationary distributions of the values, which currently underlies both the bandit and the selecting-the-best approaches. We will then study the sensitivity of the bot's performance to 1) the hyperparameters of the MCTS selection algorithms, 2) the other steps of the MCTS procedure, and 3) the other components of the game strategy, notably opponent modelling.
Acknowledgments
This work was carried out during an internship at the INRIA Lille-Nord Europe research center, in the team of Rémi Munos. The internship was funded by the Erasmus Placement program.
References

Billings, D. (2006). Algorithms and Assessment in Computer Poker. PhD thesis, Department of Computer Science, University of Alberta.

Billings, D., Davidson, A., Schauenberg, T. C., Burch, N., Bowling, M., Holte, R., Schaeffer, J., Szafron, D. (2004). Game-tree search with adaptation in stochastic imperfect-information games. Computers and Games: 4th International Conference, 21-34. Springer-Verlag.

Caelen, O. (2009). Sélection Séquentielle en Environnement Aléatoire Appliquée à l'Apprentissage Supervisé. PhD thesis, Département d'Informatique, Université Libre de Bruxelles.

Chaslot, G., Bakkes, S., Szita, I., Spronck, P. (2008). Monte-Carlo Tree Search: A New Framework for Game AI. Proceedings of BNAIC 2008, 389-390.

Coquelin, P. A., Munos, R. (2007). Bandit Algorithms for Tree Search. 23rd Conference on Uncertainty in Artificial Intelligence (UAI 2007), University of British Columbia, Vancouver.

Coulom, R. (2006). Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. 5th International Conference on Computers and Games, Turin, Italy.

Gelly, S., Wang, Y. (2006). Exploration-Exploitation in Go: UCT for Monte-Carlo Go. Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006).

Kocsis, L., Szepesvári, C. (2006). Bandit Based Monte-Carlo Planning. Lecture Notes in Computer Science 4212, 282-293.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5): 527-535.

Schauenberg, T. C. (2006). Opponent Modelling and Search in Poker. Master's thesis, Department of Computer Science, University of Alberta.

Van den Broeck, G., Driessens, K., Ramon, J. (2009). Monte-Carlo Tree Search in Poker using Expected Reward Distributions. 1st Asian Conference on Machine Learning (ACML 2009).