Checkers Reinforcement Learning Project: AI Checkers Player

Ing. C.L. Dubel, Agent Technology, University Utrecht
Ing. J. Brandsema, Agent Technology, University Utrecht
L. Lefakis BSc, Applied Computer Science, University Utrecht
S. Szóstkiewicz BSc, Agent Technology, University Utrecht
e-mail: [email protected]

April 20, 2006
Abstract
This paper describes how the Monte Carlo method, TD-leaf and neural networks can be used to teach a computer to play the game of checkers. Experiments with different kinds of players (e.g. random players and neural-network players) have been performed, as well as experiments with several neural networks (different numbers of hidden neurons), all to assess the effect of the different parameters. Although the program wins over 99% of its games against a random player, we are unsure whether it poses a challenge to a reasonably good human player.
1. Introduction
Reinforcement learning (Sutton, 1988) is a "hedonistic" learning method based on the interaction between an agent and the environment in which it is situated. Contrary to, for example, supervised learning, in which the agent learns from examples provided to it, reinforcement learning enables agents to learn from the rewards that follow the actions they take in the environment. This natural approach tells the agent how good a chosen action is, and the agent seeks to maximize its reward over the long run. In our case the agent's choice of action is based on the current sensory information and on a function approximator (FA) which estimates how good the current action is. There are several ways to implement an FA; neural networks provide a reliable and model-independent solution. To learn the model we need to train the neural network, and to train the neural network we need data to learn from. This data can come from random play, from playing against humans, or from a database of previously played games. In random play, the environment replies to the agent's actions by choosing among all possible actions with equal probability. In our project we mostly use random play. This can be slow when combined with look-ahead and backpropagation, but it enables us to run a variety of experiments with different parameter settings.
1.1 Games
Scientists have always been intrigued by games of different sorts, especially board games. This is not coincidental, as board games stimulate and broaden the mind. Games like chess, checkers and go have a reputation for being both challenging and complex. These games, played for centuries, still attract AI scientists today. A milestone battle between man and machine, at the time considered the ultimate one, took place in 1997 between Deep Blue and the world chess champion Garry Kasparov; for the first time the machine prevailed, marking a new era. Many argue that this success is not due to "intelligence" as commonly understood, but to enormous computational power. In games like checkers the number of positions and available moves grows exponentially, so it is nearly impossible to search the whole game tree for the (eventually) best move; we have to settle for approximations. The most successful programs, such as TD-Gammon (Tesauro, 1995), use the aforementioned techniques and play at the level of a world-class player.

1.2 Overview
In the next sections we describe the rules of checkers as they are used in our work. We also describe our approach to the problem and the algorithms we used, such as Monte Carlo and TD-leaf. Finally, we explain and discuss the results of our experiments and consider possible future work.
2. States, Environment, Actions, Rewards
Figure 1: The user interface when playing checkers. The human player always plays white and starts at the bottom.

We consider the states of the problem to be the raw board representation: an array whose elements correspond to the dark tiles of the checkers board on which the game is played. The value of each element is determined by whether a piece occupies the tile and, if so, by the color and type of that piece. If the tile is empty, the corresponding element is set to 0; if it is occupied by a king it is set to -1.0 or 1.0 for the opponent and the player respectively; if it is occupied by a man it is set to -0.5 or 0.5 for the opponent and the player respectively. This state representation fulfills the Markov property, as it contains all relevant information for deciding the next move: all that is needed in order to select an action in a checkers game is the position of the different pieces on the board and, of course, to which player each piece belongs.
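To make the encoding concrete, the following minimal Python sketch (ours, not the authors' code) builds the 32-element state array from a hypothetical listing of occupied tiles.

```python
# A minimal sketch of the 32-element board encoding described above:
# one entry per dark tile, 0.0 for empty, +/-0.5 for men, +/-1.0 for kings,
# with positive values for the learning agent's pieces.

import numpy as np

MAN, KING = 0.5, 1.0

def encode_board(pieces):
    """pieces: dict mapping tile index (0-31) to an (is_own, is_king) pair."""
    state = np.zeros(32)
    for tile, (is_own, is_king) in pieces.items():
        value = KING if is_king else MAN
        state[tile] = value if is_own else -value
    return state

# Example: the agent has a man on tile 12, the opponent a king on tile 20.
s = encode_board({12: (True, False), 20: (False, True)})
assert s[12] == 0.5 and s[20] == -1.0
```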
The actions available in each state are all the legal checkers moves for that state. After each action returned to the environment by the agent, we obtain the new board position created by that move. Once the agent has moved, the environment prepares the board for the opponent and lets the opponent make a move, after which the environment gives the agent the new board state. The goal of the game is obviously to win. This can be accomplished either by eliminating all the opponent's pieces or by reaching a board state in which it is the opponent's turn to play and, although he still has pieces on the board, he has no legal move. The loss conditions follow from the winning conditions for the opponent. Finally, we consider the game drawn when the board is left with two kings of opposite color. We set the reward to +1 for a winning move, -1 for a losing move and 0 for a draw; all actions that do not lead to a final board position receive a reward of 0. The states that correspond to final boards are assigned the value $V_{\text{final}} = 0$. The values of all other states are given by the output of the function approximator used (i.e. the neural net).
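As an illustration only, a hedged sketch of this reward and terminal-value bookkeeping is given below; it operates on the 32-element encoding above, and the list of legal moves for the side to play is assumed to come from a move generator that is not shown here.

```python
# A sketch of the reward scheme described above, not the authors' code.

import numpy as np

def terminal_reward(state, legal_moves_for_side_to_play, agent_to_move):
    """Return (is_terminal, reward) for the learning agent."""
    own = np.sum(state > 0)                 # number of the agent's pieces
    opp = np.sum(state < 0)                 # number of the opponent's pieces
    kings_only = np.all((state == 0) | (np.abs(state) == 1.0))

    if opp == 0 or (not agent_to_move and not legal_moves_for_side_to_play):
        return True, +1.0                   # opponent wiped out or stuck: agent wins
    if own == 0 or (agent_to_move and not legal_moves_for_side_to_play):
        return True, -1.0                   # agent wiped out or stuck: agent loses
    if own == 1 and opp == 1 and kings_only:
        return True, 0.0                    # one king each: declared a draw
    return False, 0.0                       # non-terminal states receive reward 0
```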
3. Function approximator
The game of checkers is complex enough to make a tabular representation of the state values impossible. Estimates place the number of board states at approximately $10^{18}$, while the game tree is considered to be of the order of $10^{31}$. These numbers make a function approximator necessary; the one chosen here is an artificial neural network.

The input of the neural network is the state, so there are 32 input nodes, one for each playable position on the board. The values of these inputs lie between -1 and 1, as described in the previous section. Due to the complexity of the game a hidden layer is used. Experiments were run using 20, 40 and 60 hidden units; the best results were obtained with 40 hidden neurons. Each neuron of the hidden layer is connected to all the input nodes and computes its output with a variation of the sigmoid function, written $\sigma$ below. For every neuron $i$ of the hidden layer the output is given by

$$h_i = \sigma\Big(\sum_{j=1}^{32} w^{IH}_{ij}\, s_j\Big),$$

where $w^{IH}$ are the weights connecting the input layer with the hidden layer. The outputs of the hidden neurons are connected to a single output neuron, which uses the same sigmoid function. Its output is the estimated value of the state presented to the network:

$$O = \sigma\Big(\sum_i w^{HO}_i\, h_i\Big),$$

where $w^{HO}$ are the weights connecting the hidden layer with the output layer. During training the weights of the neural network are updated by gradient descent in order to minimize the mean squared error over the input data; the constant $a$ is the learning rate of the network and is set to $a = 0.01$. If $O$ is the output of the neural network given an input $s_t$ and $d$ is the desired output, then the weights are updated as follows. For $i = 1$ to (the number of hidden neurons):

$$w^{HO}_i \leftarrow w^{HO}_i + a\,(d - O)\,\sigma'(\mathrm{net}_O)\, h_i,$$

and for $i = 1$ to 32 and $j = 1$ to (the number of hidden neurons):

$$w^{IH}_{ji} \leftarrow w^{IH}_{ji} + a\,(d - O)\,\sigma'(\mathrm{net}_O)\, w^{HO}_j\,\sigma'(\mathrm{net}_j)\, s_i,$$

where $\mathrm{net}_O$ and $\mathrm{net}_j$ are the weighted input sums of the output neuron and of hidden neuron $j$. Both the value of the learning rate and the number of hidden neurons are taken from previous studies in which positive results were obtained with these parameters.
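The following Python sketch illustrates a network of this shape (32 inputs, one hidden layer, a single output) with plain gradient-descent updates toward a target value. It is an illustration consistent with the description above rather than the authors' implementation; in particular, the paper's exact "variation of the sigmoid" is not specified, so tanh stands in for the squashing function here.

```python
# A sketch of the value-function approximator described in this section.

import numpy as np

class ValueNet:
    def __init__(self, n_hidden=40, a=0.01, rng=None):
        rng = rng or np.random.default_rng(0)
        self.a = a                                                # learning rate
        self.w_ih = rng.uniform(-0.1, 0.1, size=(n_hidden, 32))   # input -> hidden
        self.w_ho = rng.uniform(-0.1, 0.1, size=n_hidden)         # hidden -> output

    def predict(self, state):
        """state: 32-element board encoding; returns (value, hidden activations)."""
        h = np.tanh(self.w_ih @ state)   # tanh assumed as the squashing function
        o = np.tanh(self.w_ho @ h)
        return o, h

    def train(self, state, d):
        """One gradient step reducing the squared error between output and target d."""
        o, h = self.predict(state)
        delta_o = (d - o) * (1.0 - o ** 2)              # output-layer error term
        delta_h = delta_o * self.w_ho * (1.0 - h ** 2)  # hidden-layer error terms
        self.w_ho += self.a * delta_o * h               # hidden -> output update
        self.w_ih += self.a * np.outer(delta_h, state)  # input -> hidden update
```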
4. Policy
In reinforcement learning problems we face the well-known trade-off between exploration and exploitation. In order to study the effect of different ratios of exploration to exploitation, we experimented with different policies. In particular, experiments were run using the greedy, ε-greedy and softmax action selection policies.
4.1 Greedy policy
For each state $V_t$ presented to the agent, the action which leads to the next state (afterstate) with the largest value is chosen. With this policy no exploration is performed; at each step the agent opts to exploit its knowledge to the maximum. Despite its lack of exploration, the greedy policy was found to be as good as the other policies tested. This can be attributed to the randomness of the environment, which systematically leads the agent to new, unknown states, thus creating a form of environment-induced exploration.
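A minimal sketch of greedy afterstate selection, assuming a value network like the one sketched in Section 3 and a hypothetical move generator that supplies (move, afterstate) pairs:

```python
# Greedy afterstate selection: score every legal afterstate with the value
# network and pick the highest-valued one. `afterstates` is a list of
# (move, 32-element state) pairs from a move generator not shown here.

def greedy_move(value_net, afterstates):
    """Return the (move, state) pair whose afterstate the network values highest."""
    return max(afterstates, key=lambda ms: value_net.predict(ms[1])[0])
```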
4.2 ε-greedy
Despite the inherent randomness of the environment, experiments were run using the ε-greedy policy in order to see the effect of exploratory policies on the agent's learning. For each state $V_t$ visited by the agent, the greedy action (which leads to the afterstate with the largest value) was chosen with probability $1 - \varepsilon$, while with probability $\varepsilon$ some other random action was chosen. Experiments were run for both $\varepsilon = 0.1$ and $\varepsilon = 0.01$. In both cases no real improvement of the agent's learning was observed. It seems that the randomness of the environment provides all the necessary exploration, so further exploration induced by the policy is unnecessary.
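For comparison, the ε-greedy variant is a small wrapper around the greedy selection sketched above (again an illustration, not the original code):

```python
# With probability epsilon a random legal afterstate is chosen instead of the
# greedy one; builds on the greedy_move helper from the previous sketch.

import random

def epsilon_greedy_move(value_net, afterstates, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(afterstates)   # explore: random legal move
    return greedy_move(value_net, afterstates)  # exploit: best-valued afterstate
```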
4.3 Softmax action selection policy
For each state $V_t$ presented to the agent, an action is chosen according to the softmax distribution: the probability of an action $a$ being chosen is

$$P(a) = \frac{e^{V(s_a)/\tau}}{\sum_{b} e^{V(s_b)/\tau}},$$

where $s_a$ is the state that is created if we choose action $a$ in state $V_t$, and $\tau$ is the temperature. The results of these experiments showed that softmax not only slowed down the learning algorithm but also gave worse results than the previous two policies. The latter was probably due in great part to the poor choice of temperature $\tau = 0.1$. This is a known problem with softmax: choosing an appropriate temperature requires a certain degree of knowledge of the problem. Based on the above results, and seeing that the performance of the greedy policy is just as good as, if not better than, the performance of the other policies, the remaining experiments were all run using the greedy policy.

5. Reinforcement Learning Algorithms
Throughout the experimentation a range of reinforcement learning algorithms was used. In particular, experiments were conducted using TD(0), TD(0.9), TD(1) (a.k.a. Monte Carlo) and TD-leaf, the latter in combination with the min-max look-ahead algorithm.

5.1 Temporal Difference Learning
Temporal difference algorithms bootstrap the value of a given state from the values of its succeeding states. In the simplest method, TD(0), the value of each state is updated by

$$V(s_t) \leftarrow V(s_t) + \alpha\,\big(r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)\big).$$

When eligibility traces are used we obtain the following update rules for the state values (using replacing traces): after each step the trace of the current state is reset to $e(s_t) = 1$, every visited state is updated by $V(s) \leftarrow V(s) + \alpha\,\delta_t\, e(s)$ with $\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)$, and every trace then decays as $e(s) \leftarrow \gamma\,\lambda\, e(s)$. Experiments run with replacing eligibility traces and $\lambda = 0$ and $\lambda = 0.9$ did not show any improvement over the Monte Carlo method, so this approach was abandoned in favor of Monte Carlo learning.
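To make the replacing-trace update concrete, here is a small tabular illustration; in the project itself the state values live in the neural network, so this dictionary-based version only serves to spell out the update rule.

```python
# TD(lambda) with replacing traces over one episode, tabular illustration only.

def td_lambda_episode(episode, V, alpha=0.01, gamma=1.0, lam=0.9):
    """episode: ordered list of (state, reward) pairs, where `reward` is the
    reward received on entering that state; states must be hashable (e.g. tuples).
    V: dict mapping states to their current value estimates."""
    traces = {}
    for (s, _), (s_next, r) in zip(episode[:-1], episode[1:]):
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # TD error
        traces[s] = 1.0                                         # replacing trace
        for state in traces:
            V[state] = V.get(state, 0.0) + alpha * delta * traces[state]
            traces[state] *= gamma * lam                        # trace decay
    return V
```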
5.2 Monte-Carlo
In Monte Carlo learning methods the value of each state is estimated from the rewards that follow a visit to that state in an episode; in particular we use constant-$\alpha$ MC. We generate an episode (by having the agent play against a random player) and for every state $s_t$ appearing in the episode we update the value of that state by

$$V(s_t) \leftarrow V(s_t) + \alpha\,\big(R - V(s_t)\big),$$

where $R$ is the reward gained at the end of the episode after visiting state $s_t$ (+1 for a win, -1 for a loss). This method proved to be just as good as any other and was used for all the experiments shown here (with the exception, of course, of those that used TD-leaf).
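A sketch of this constant-α MC update, reusing the ValueNet sketch from Section 3 (the episode generator that plays against the random player is assumed and not shown):

```python
# After an episode has been generated, every visited state is trained toward
# the final reward R (+1 win, -1 loss, 0 draw); the network's learning rate
# plays the role of alpha, nudging V(s) a step toward R on each call.

def constant_alpha_mc_update(value_net, visited_states, R):
    for state in visited_states:
        value_net.train(state, R)
```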
5.3 TD-leaf
TD-leaf is a combination of min-max look-ahead and TD learning. After scanning the game tree to find the best possible action according to the min-max criterion, the value of the corresponding leaf state is used to update the value of the current state; that is, the value of state $s$ becomes the value of the best leaf found by the search, which is used as the training target for the network. This method proved especially successful, as we shall see, reaching a winning percentage of 99%.
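The following sketch shows the TD-leaf idea as described here: a depth-limited min-max search whose leaves are evaluated by the value network, with the value of the principal leaf used as the training target for the root. The `children(state, to_move)` move generator is hypothetical and not part of this illustration.

```python
# Depth-limited min-max over afterstates, leaves scored by the value network.

def minimax_value(value_net, state, depth, maximizing, children):
    succ = children(state, maximizing)
    if depth == 0 or not succ:
        return value_net.predict(state)[0]       # evaluate leaf with the network
    values = [minimax_value(value_net, s, depth - 1, not maximizing, children)
              for s in succ]
    return max(values) if maximizing else min(values)

def td_leaf_update(value_net, state, children, depth=3):
    leaf_value = minimax_value(value_net, state, depth, True, children)
    value_net.train(state, leaf_value)           # train the root toward the leaf value
```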
6. Experiments
In this section we describe some of the experiments we performed and their outcomes. As we shall see, the results are very encouraging, showing that reinforcement learning is appropriate for learning the game of checkers. When these methods are combined with look-ahead in the game tree (using the min-max criterion), the results are exceptionally good.

6.1 Simple Intelligent Agent vs. Random Player
The first experiment shown here is a simple intelligent agent playing against a random player. Constant-$\alpha$ MC learning is used and a neural network is trained to approximate the state-value function. We generate 25,000 episodes for the agent to learn from; after each game the visited states and their updated values are used to train the neural network. The following graphs show the agent's cumulative winning percentage over these 25,000 games for different numbers of neurons in the hidden layer (20, 40 and 60). The (gray) bottom area denotes the winning percentage, the part above the crossed area is the loss percentage, and the crossed area itself is the percentage of draws. The results are the average over 10 experiments.

Figure 2: Using a hidden layer with 20 neurons.
Figure 3: Using a hidden layer with 40 neurons.
Figure 4: Using a hidden layer with 60 neurons.

As we can see, the agent learns fairly quickly: within the first 5,000 games it reaches a performance of 80%. With forty or sixty hidden neurons the agent learns even faster, reaching high performance after only 2,000 games. After this point the agent's performance more or less stabilizes, increasing slightly only in the case of 40 hidden neurons. The 40-hidden-neuron case also has the best overall performance, reaching a higher win percentage than the other two cases. For all these reasons, the remaining experiments were all run with forty hidden neurons. However, the experiments run with 40 neurons appeared to have better initialized weights (weights are initialized randomly), which may be partly responsible for these results.
6.2 Agent using look-ahead vs. Random Player
The agent looks ahead to a certain depth in the game tree and chooses an action based on the min-max criterion. Constant-$\alpha$ MC learning was previously used to help the agent learn the state values and, of course, a neural network is used to approximate the value function. The look-ahead greatly increases the agent's performance: by allowing the agent to look ahead to a depth of three, we obtain a winning percentage close to 90%. Experiments with a look-ahead deeper than three ply gave even better results, up to 98% for a seven-ply look-ahead.
6.3 Agent using TD-leaf vs. Random Player
The agent uses look-ahead while learning. At each state presented to the agent, it looks ahead to a depth of three and, after finding the best possible leaf state based on the min-max criterion, updates the root state's value. This leads to extremely high performance, with a win percentage of 99%. In the experiments run with TD-leaf, the weights were initialized using the weights obtained from the previous learning methods, so the agent already had an 80% performance rate when starting out. This was done to assist the agent in the look-ahead part of the algorithm: with a poor initial function approximator, the agent would not be able to perform meaningful searches in the game tree. In fact, when random initial weights were used, the TD-leaf learning process became unstable, reaching high performance most of the time but failing badly on occasion.
7. Conclusions
As the experimental results show, reinforcement learning methods, in combination with neural networks as function approximators and with look-ahead, can lead to an intelligent agent with almost perfect performance against a random player. This does not, however, make the agent suitable for play against more intelligent opponents: even mediocre human players had no problem winning against the agent. Taking this into consideration, together with the fact that the agent seems to learn very quickly up to a certain performance level and then stabilizes, leads us to believe that there are certain concepts inherent to the game that are relatively easy for the agent to learn from the raw board representation, while other concepts of the game cannot be learned without further information. A few experiments were run in which material balance was also used as part of the state representation. Though we are aware that this has led to positive results in other studies (in which, however, material balance was one of a number of extra features given to the agent), in this case no particular improvement was observed. Another matter which may have been (and probably was) detrimental to the agent's performance is the fact that the agent's experience consisted entirely of games played against a random player. It would perhaps be better to allow the agent to interact with more intelligent opponents,
perhaps by using a database of actual games as experience.
7.1 Future work
Future work could use a different TD method to compensate for the problems caused by a fixed opponent, where the agent may learn bad habits from a weak player. An algorithm that could be suitable for this situation is TD(µ), designed by Beal. This algorithm uses the evaluation function to examine the moves made by the opponent and does not learn from a move that is considered bad.
References

Patist, J-P., & Wiering, M.A. (2004). Learning to play draughts using temporal difference learning with neural networks and databases. In A. Nowe & K. Steenhout (Eds.), Benelearn'04: Proceedings of the Thirteenth Belgian-Dutch Conference on Machine Learning (pp. 87-94).

Ghory, I. (2004). Reinforcement learning in board games. Technical Report CSTR04-004, Department of Computer Science, University of Bristol.

Kootstra, G., & Zedde, R. van de (2002). TD(λ)-Batumi. Rijksuniversiteit Groningen.

Sutton, R.S. (1999). Reinforcement learning. In R. Wilson & F. Keil (Eds.), The MIT Encyclopedia of the Cognitive Sciences. MIT Press.

Gurney, K. (1996). An Introduction to Neural Networks. Department of Human Sciences, Brunel University, Uxbridge, Middlesex.

Sutton, R.S., & Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Griffith, N.J.L., & Lynch, M. NeuroDraughts: the role of representation, search, train regime and architecture in a TD draughts player. Department of Computer Science and Information Systems, University of Limerick, Ireland.

Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210-229.