CHECKERS: TD(λ) LEARNING APPLIED FOR DETERMINISTIC GAME
Halina Kwasnicka, Artur Spirydowicz
Department of Computer Science, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland.
Tel. (48 71) 320 23 97, Fax: (48 71) 321 10 18, E-mail:
[email protected]

Abstract: In this paper we present a game-learning program called CHECKERS. The program contains a neural network that is trained, on the basis of the rewards it receives, to serve as an evaluation function for the game of checkers by playing against itself. This method of training multilayered neural networks is called TD(λ). The main aim of the paper is to explore the possibility of using reinforcement learning, known from TD-Gammon, for a game without random factors. For this purpose we chose a popular game, checkers. The developed program has been tested by playing games against people (the authors and their colleagues), against a simple heuristic, and against another program, NEURODRAUGHTS. The program plays checkers above the intermediate level – it plays astonishingly well.

Keywords: Reinforcement learning, machine learning, computer game playing, checkers.

Introduction

People enjoy playing games of skill, such as chess and checkers, because of the intellectual challenge and the satisfaction derived from playing well. They use knowledge and search to make their decisions at the board. The person with the best "algorithm" for playing the game wins in the long run. Without perfect knowledge, mistakes are made and even World Champions lose occasionally. This gives rise to an intriguing question: is it possible to program a computer to play a game perfectly? Recently, some games have been solved, for example Qubic [6] and Go-Moku [1]. The best known computer program that is able to learn a game and beat champions is TD-Gammon, which plays backgammon. As Bill Robertie says in [8], TD-Gammon's level of play is significantly better than that of any previous computer program; it plays at a strong master level that is close to the world's best human players.

The paradigm of reinforcement learning is intuitive: a pupil (learning agent) observes an input state and produces an 'action' – an output signal. After this, it receives some 'reward' from the environment. The reward indicates how good or bad the output produced by the pupil was. The goal of such learning is to produce the optimal actions (output signals) leading to maximal reward. Often the reward is delayed, which means that it is known (given) only at the end of a long sequence of inputs and actions. The resulting problem for the pupil is known as "temporal credit assignment". The paradigm is intuitive because the learner (pupil) learns to perform its task from its own experience, without any intelligent 'teacher'. Despite the considerable attention devoted to reinforcement learning with delayed rewards, it is difficult to find many significant practical applications. Multilayered perceptrons seem capable of learning complex nonlinear functions of their inputs, and temporal difference learning seems to be a promising general-purpose technique for learning with delayed rewards – not only for prediction learning, but also for combined prediction and control tasks in which control decisions are made by optimizing predicted output. TD-Gammon explores the capability of multilayer neural networks trained by the TD(λ) method to learn such complex nonlinear functions. It also makes it possible to compare TD learning with the alternative approach of supervised training on expert-labeled examples, as in NEUROGAMMON, which was trained by backpropagation using a database of recorded expert games and achieved only an intermediate level of play.
Backgammon has some features, absent in checkers, that probably explain why TD-Gammon plays so surprisingly well. One of them is the stochastic nature of the task, coming from the random dice rolls: it ensures a wide variability of the positions visited during training, so the pupil explores more of the state space and can discover new strategies. In deterministic games such as checkers, self-play training can confine itself to a small part of the state space, because only a narrow range of different positions is produced. Problems with self-play training have been identified in such deterministic games as checkers and Go [10]. The second significant feature of backgammon is that, for all playing strategies, a sequence of moves terminates (with a win or a loss) even if we start playing with randomly initialized networks. In deterministic games, by contrast, we can obtain cycles, and in such cases the trained network cannot learn because no final reward is ever produced. This problem must be dealt with if we want to use the TD(λ) learning method for deterministic games. Another advantage of nondeterministic games is the smoothness and continuity of the target function the pupil must learn: small changes in the position cause small changes in the probability of winning. Deterministic games such as chess are discrete – a player can win, lose, or draw – so the target functions are more discontinuous and harder to learn.

The article presents a game-learning program, called CHECKERS, which uses the TD(λ) learning method for a feedforward neural network. The network learns to play checkers from experience, receiving positive and negative rewards. The following sections present the algorithm used, the developed program and the obtained results. The program was tested by playing against people (the authors and their colleagues) and against computer programs: a simple heuristic implemented in the program and a program called NEURODRAUGHTS. The results show that our approach gives satisfactory results: the modified TD(λ), as used in TD-Gammon for backgammon, can be successfully applied to deterministic games as well.
1. Checkers

Before discussing the TD checkers learning system, we state a few salient details regarding the game and the approach used. Checkers is a very popular game around the world. Over 150 documented variations of checkers exist, but two versions are in common use. In the United States and the British Commonwealth, checkers is played on an 8×8 board, with checkers moving one square forward and kings moving one square in any direction. A capture consists of jumping over an opposing piece on an adjacent square and landing on an empty square (capture moves are forced), and in one move a single piece may jump multiple pieces. Checkers promote to kings when they reach the last rank of the board. The second popular version is International Checkers, played mainly in the Netherlands and Russia (the former Soviet Union) on a 10×10 board, with checkers allowed to capture backwards and kings allowed to move many squares in one direction. In this paper we consider the United States version of checkers.

Complex board games such as Go, chess, checkers, Othello, and backgammon are an ideal testing ground for exploring a variety of approaches in artificial intelligence and machine learning. Termed by Sutton "temporal difference" learning, TD methods are a class of methods for approaching the temporal credit assignment problem; learning is based on the difference between temporally successive predictions. The best known of these is the TD(λ) algorithm proposed by Sutton [14], which can be used for training multilayer neural networks. The program TD-Gammon, developed by G. Tesauro for backgammon, uses the TD(λ) method; it represents an approach that uses massive computing power not for deep on-line searches, but for extensive off-line training of a sophisticated evaluation function. Researchers are investigating applications of TD(λ) to other games, but their results are not as good as those of TD-Gammon. In the literature it is emphasized that TD approaches to deterministic games require some kind of external noise to produce the variability and exploration that backgammon obtains from the random dice rolls. Following the observations of G. Tesauro and other researchers, we can conclude that [15]:
• during the training process, all possible states of the game should be taken into account,
• training episodes should end with a reward – for games, the rewards must be both positive and negative (a game should always stop, and the learner should win or lose).
The second condition means that the neural network needs a trainer that is neither too strong nor too weak – it must receive both positive and negative rewards to make progress in learning. To relax the above constraints on using TD learning, we propose a modification of the learning process. In our program CHECKERS, four modifications are implemented:
1. During learning, long sequences of moves without a capture are detected. Such a sequence indicates that the program is probably generating a cycle (e.g. a move forward followed by a move backward). In that case the game is terminated and a reward is assigned based on the difference between the numbers of the learner's and the opponent's checkers on the board.
2. Observation of the learning process shows that at the beginning learning proceeds properly, but when only a few checkers remain on the board, their moves look irrational.
Such moves do not assist learning; therefore we adopt the following hypothesis: a program that learns to eliminate the opponent's checkers efficiently in general will also be able to do so in the final phase of the game (when only a few checkers are left on the board). Under this hypothesis we can stop a game with a small number of checkers on the board, and the reward then depends on the difference between the numbers of the learner's and the opponent's checkers.
3. The method requires visiting a great number of possible game states during learning. Variety can be obtained by random selection of the initial state (the game begins from a random position). Of course, we cannot guarantee that randomly selected states are representative of all possible states of the game, but this yields quite good diversity of the states visited.
4. To further diversify play, we introduce random noise into the output of the trained neural network. This eliminates full determinism in the selection of a move in a given state.
The above assumptions are taken into account in the developed program CHECKERS.

2. TD(λ) learning

This section gives a short outline of the applied methods: reinforcement learning, Q learning, temporal difference learning and, at the end, the TD(λ) learning used in the CHECKERS program.

Reinforcement learning

Reinforcement learning methods deal with a class of problems in which an agent perceives the environment through sensors and acts in it through effectors. The environment is in one of a set of possible states, and the agent can perform actions that cause transitions from one state to another. The response of the environment has the form of a reward. The goal of the agent is to learn a strategy (policy) π: S → A, where S is the set of possible states and A the set of possible actions, which maximizes the reward given by the environment. The discounted cumulative reward V^π for an initial state s_t and strategy π is equal to:
V^π(s_t) = r_t + χ·r_{t+1} + χ²·r_{t+2} + … = Σ_{i=0}^{∞} χ^i·r_{t+i},    (1)
where r_t is the value of the reward at time t and χ ∈ [0,1] is a discount coefficient; it weighs the instant reward received after the current action (move) against the rewards obtained after further actions (moves) made according to the strategy π.
A larger coefficient χ gives greater weight to future rewards; a smaller χ makes the agent care mainly about immediate results. In practice a whole sequence of actions, not a single action, usually produces the result; in CHECKERS, only a finished game produces the relevant reward. The goal of the learning algorithm is therefore to find the optimal strategy π* that maximizes the discounted cumulative reward:

π* ≡ argmax_π V^π(s),  ∀s.    (2)

The method presented above does not require any training set, which of course is a great advantage.
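To make equation (1) concrete, a minimal Python sketch (ours, not part of the CHECKERS program) computes the discounted return of a finite episode:

def discounted_return(rewards, chi=0.9):
    """Discounted cumulative reward V^pi(s_t) for a finite episode.

    `rewards` is the sequence r_t, r_{t+1}, ... obtained by following
    a fixed strategy pi; `chi` is the discount coefficient of eq. (1).
    """
    total = 0.0
    for i, r in enumerate(rewards):
        total += (chi ** i) * r
    return total

# Example: a delayed reward of +1 received three moves from now
# contributes 0.9**3 = 0.729 to the value of the current state.
print(discounted_return([0.0, 0.0, 0.0, 1.0]))  # 0.729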
Q learning

From the previous section we know that a strategy π: S → A should be found, but there are no training examples. We only have three sequences: actions a_1, a_2, …, a_n, states s_1, s_2, …, s_n, and rewards r_1, r_2, …, r_n, and what we are looking for is π. Let us define the optimal action for a state s as:

π*(s) ≡ argmax_a [ r(s,a) + χ·V*(δ(s,a)) ],    (3)

where δ(s,a) is the state reached by action a from s, r(s,a) is the reward obtained after action a in state s, χ is the discount coefficient, and V* is the discounted cumulative reward for a state under the optimal strategy. We look for the action that maximizes the sum of the instant reward r(s,a) and the discounted cumulative reward in the next state δ(s,a), weighted by χ. If r(s,a) and δ(s,a) are known, learning amounts to discovering V*, from which the optimal strategy π* can be obtained. In practice, however, the agent only sees the newly reached state and the obtained reward after making the selected action. Therefore, instead of relying on knowledge of future rewards, a heuristic function Q assessing the quality of a given action is introduced:

Q(s,a) = r(s,a) + χ·V*(δ(s,a)).    (4)

Comparing equations (3) and (4), one can see that:

π*(s) ≡ argmax_a Q(s,a).    (5)

To find the optimal strategy π*, the Q function must be known. The value of Q is approximated iteratively by an estimate Q̂, according to the pseudocode [15]:

Q algorithm:
  For each pair (s,a), set Q̂(s,a) = 0 in the array Q̂.
  For each current state s do:
    Select an action a and perform it.
    Receive the instant reward r.
    Go to the state s' = δ(s,a).
    Update the corresponding entry of Q̂: Q̂(s,a) ← r + χ·max_{a'} Q̂(s',a').
    Set the current state s ← s'.
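For illustration only, the pseudocode can be written as a small tabular routine. This is our sketch, not code from CHECKERS; the `step` function (returning the next state, the instant reward and an end-of-episode flag) is an assumed interface, and the epsilon-greedy selection is a common way of keeping some exploration:

import random
from collections import defaultdict

def q_learning(states, actions, step, episodes=1000, chi=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- r + chi * max_a' Q(s',a')."""
    Q = defaultdict(float)                      # Q(s,a) initialized to 0
    for _ in range(episodes):
        s = random.choice(states)               # start from a random state
        for _ in range(100):                    # bound the episode length
            if random.random() < epsilon:       # occasionally explore
                a = random.choice(actions)
            else:                               # otherwise act greedily
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = step(s, a)        # act, observe reward and new state
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = r + chi * best_next     # the update from the pseudocode
            s = s_next
            if done:
                break
    return Q

In a deterministic environment this update needs no learning rate; with stochastic rewards a step size would normally be added.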
For a detailed description see [13]. The agent updates the value of the reward estimate for the state it has just left. An example is shown in Figure 1. Assume that a robot R moves around an environment in which reaching only one state gives a positive reward, while all other states are neutral (the reward is 0). The initial state of the robot R is s1, and the initial values of the function Q̂(s,a) (assigned to the transitions between states) are shown in the left diagram of Figure 1; for example, Q̂(s1, a_right) = 73, where a_right denotes the transition from s1 to s2.

Figure 1. An example of evaluation of the quality function Q̂ [MITT97]

Making the action a_right, the robot obtains an instant reward r equal to 0, and the value of Q̂(s1, a_right) is updated on the basis of V*(s2) = max_a Q̂(s2, a). The new approximation of Q for the state s1 is calculated as follows:

Q̂(s1, a_right) ← r + χ·max_{a'} Q̂(s2, a') = 0 + 0.9·max{66, 81, 100} = 90.
It can be shown that the estimated value Q̂ converges to the real value Q, under the assumption that all possible states are visited (in infinite time).

Temporal difference learning

Q learning takes into account only the next state, but we can look two states ahead (Q^(2)), or three (Q^(3)), or, in general, n states ahead (Q^(n)):
Q^(n)(s_t, a_t) = r_t + χ·r_{t+1} + … + χ^n·max_a Q̂(s_{t+n}, a).    (6)
Richard Sutton [14] proposes taking all of these estimates into account:

Q^(λ)(s_t, a_t) = (1 − λ)[ Q^(1)(s_t, a_t) + λ·Q^(2)(s_t, a_t) + λ²·Q^(3)(s_t, a_t) + … ],    (7)

where λ ∈ [0,1] is a coefficient governing the preference for immediate or more distant rewards. The method is known as TD(λ) learning and is very efficient, as can be seen from the results of the TD-Gammon program.
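As a small illustration of equation (7) (ours, not taken from the program), the λ-weighted combination of a finite list of n-step estimates can be computed as:

def lambda_return(n_step_estimates, lam=0.7):
    """Combine the estimates Q^(1), Q^(2), ... into Q^(lambda), eq. (7).

    `n_step_estimates` is a finite list [Q^(1), Q^(2), ..., Q^(N)];
    the geometric weights (1 - lam) * lam**(n-1) sum to 1 as N grows.
    """
    return (1.0 - lam) * sum(
        (lam ** n) * q for n, q in enumerate(n_step_estimates)
    )

# lam = 0 uses only the one-step estimate; lam close to 1 gives more
# weight to the longer look-ahead estimates (and hence later rewards).
print(lambda_return([1.0, 1.0, 1.0], lam=0.0))   # 1.0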
Neural networks and temporal difference learning

Reinforcement learning is used to search for an optimal strategy in a decision process (e.g. in a game); it requires a set of states, a set of actions, and a function evaluating the reward for each pair (s,a). But what can we do if the state space is so large that it is impossible to visit all states? In such cases we must learn the strategy from a subset of states and generalize the results to states not visited during learning. For such tasks, neural networks, with their ability to generalize, seem to be a good choice. Gerald Tesauro, in TD-Gammon, used a multilayered neural network trained according to the equation [14]:

w_{t+1} = w_t + α·(Y_{t+1} − Y_t)·Σ_{k=1}^{t} λ^{t−k}·∇_w Y_k,    (8)

where w_t and w_{t+1} are the weights of the network at times t and t+1 respectively, α is a learning coefficient, Y_t and Y_{t+1} are the outputs of the network at times t and t+1 respectively (when a game – one learning episode – is finished, Y_{t+1} is replaced by the reward), λ determines how strongly the current error of the network changes the earlier estimates of V*, and ∇_w Y_k is the vector of derivatives of the network output with respect to the weights. The neural network receives an input vector describing a position on the board; it considers all combinations of moves (one or two moves ahead), each resulting position is treated as an input vector and assessed in terms of the discounted reward, and the action that yields the maximal discounted reward is selected. In G. Tesauro's program, the network learns by playing against itself instead of using a training set.
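Equation (8) can be implemented incrementally by keeping an eligibility trace that accumulates the discounted gradients λ^{t−k}·∇_w Y_k. The NumPy sketch below is a simplified illustration for a single sigmoid unit (no hidden layer), under our own assumptions rather than Tesauro's or the CHECKERS implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def td_lambda_episode(features, reward, w, alpha=0.1, lam=0.7):
    """One TD(lambda) episode for a single sigmoid unit.

    `features` is the sequence of input vectors x_1 ... x_T seen during
    one game, `reward` is the final outcome, `w` is the weight vector.
    Eq. (8) becomes w <- w + alpha * (Y_{t+1} - Y_t) * e_t, with the
    trace e_t = sum_k lam^{t-k} * dY_k/dw updated incrementally.
    """
    trace = np.zeros_like(w)
    y_prev, x_prev = None, None
    for x in features:
        y = sigmoid(w @ x)
        if y_prev is not None:
            trace = lam * trace + y_prev * (1.0 - y_prev) * x_prev  # accumulate dY/dw
            w = w + alpha * (y - y_prev) * trace                    # ordinary TD step
        y_prev, x_prev = y, x
    # end of the game: the final prediction is pulled towards the reward
    trace = lam * trace + y_prev * (1.0 - y_prev) * x_prev
    w = w + alpha * (reward - y_prev) * trace
    return w

For a multilayer network the gradient of the output with respect to all weights would be obtained by backpropagation, but the form of the update stays the same.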
3. CHECKERS – the TD(λ) learning method applied to checkers

The approach described in the previous section was used to learn to play checkers. The developed computer program works in two modes: learning and game playing. CHECKERS builds a search tree and uses the minimax algorithm with alpha-beta pruning. We trained a three-layered neural network with 32 input neurons, which code 10 selected features, and one output neuron, which plays the role of an evaluation function. A sigmoid is used as the transfer function, and the output signal indicates the estimated chance of a reward. The set of features used as input values is listed below:
• Difference between the numbers of the player's and the opponent's checkers – 4 neurons,
• Control of selected fields on the board (fields '1' and '2' are very significant) – 1 neuron,
• Control of the center of the board (the number of own checkers in the central 4×4 area) – 3 neurons,
• Number of checkers that are close to promotion to king – 3 neurons,
• Number of checkers lying on the double diagonal – 4 neurons,
• Number of kings in the central area of the board – 3 neurons,
• Difference in lost checkers – 4 neurons,
• Control of the center of the board from the opponent's point of view – 3 neurons,
• Possible opponent moves – 4 neurons,
• Exposure of checkers (the number of checkers that can potentially be captured) – 3 neurons.
The values of all the above features are non-negative. The coding schema is presented in Table 1.

Table 1. Input layer of the neural network – feature coding schema (Nx:y means that neuron x receives input y)

Value of coded feature | Coding on 3 neurons | Coding on 4 neurons
0                      | N1:0 – N2:0 – N3:0  | N1:0 – N2:0 – N3:0 – N4:0
1                      | N1:1 – N2:0 – N3:0  | N1:1 – N2:0 – N3:0 – N4:0
2, 3                   | N1:1 – N2:1 – N3:0  | N1:1 – N2:1 – N3:0 – N4:0
4, 5                   | N1:1 – N2:1 – N3:1  | N1:1 – N2:1 – N3:1 – N4:0
6 and more             | N1:1 – N2:1 – N3:1  | N1:1 – N2:1 – N3:1 – N4:1
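The coding of Table 1 is a thermometer code and can be expressed compactly. The helper below is only our reading of the schema, with the thresholds taken directly from the table:

def encode_feature(value, n_neurons):
    """Thermometer coding from Table 1: neuron i fires if the feature
    value reaches the i-th threshold (a value of 0 is coded as all zeros)."""
    thresholds = {3: (1, 2, 4), 4: (1, 2, 4, 6)}[n_neurons]
    return [1 if value >= t else 0 for t in thresholds]

print(encode_feature(1, 4))  # [1, 0, 0, 0]
print(encode_feature(5, 4))  # [1, 1, 1, 0]
print(encode_feature(6, 4))  # [1, 1, 1, 1]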
This coding schema results in a kind of continuous representation: a neural network that sees an advantage of 4 checkers also knows that the advantage is at least 3 checkers. When using a search tree we usually have to deal with the horizon effect – the program can search only to an assumed depth. In checkers (as in chess) exchanges of pieces occur: the opponent may have a capture in reply to the player's capture, so a position reached by a capture may not be as good as it appears if this reply lies beyond the search horizon. Therefore, in CHECKERS every final position of the search is checked for possible captures; if a capture is available, the deeper nodes of the tree are explored.

Both networks, the pupil and the trainer, are initially identical: they have the same random weights and consequently produce random outputs. The trainer is a kind of clone of the pupil. Each learning episode consists of one game of the pupil (the trained network) against the trainer. After each move of the pupil its weights are changed, so that the output of the trained network in the former state (Y_t) becomes closer to its output in the current state (Y_{t+1}). From time to time, control games between the pupil and the trainer are played without any change of weights. When the pupil wins a sufficient fraction of them, all the weights of the pupil are copied to the trainer, so that the trained network always faces an opponent strong enough for it to keep making progress; a rough sketch of this self-play schedule is given below.
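The sketch includes the early-termination rules (modifications 1 and 2 of Section 1). All names (choose_move, td_update, copy_weights_from and the engine interface) and the cut-off values are our illustrative assumptions, and for brevity it counts ordinary training games instead of the separate control games used in the program:

def train_by_self_play(pupil, trainer, engine, games=5000,
                       max_quiet_moves=50, min_pieces=6,
                       test_window=100, clone_threshold=0.66):
    """Self-play schedule with early termination (hypothetical interfaces).

    pupil/trainer are assumed to expose choose_move(board), td_update(...)
    and copy_weights_from(other); engine is assumed to expose
    random_board(), apply_move(board, move), result(board),
    pieces(board) and piece_difference(board).
    """
    wins = 0
    for g in range(1, games + 1):
        board, quiet = engine.random_board(), 0            # modification 3: random start
        while True:
            # the pupil moves; its weights are adjusted after every move it makes
            board, captured = engine.apply_move(board, pupil.choose_move(board))
            quiet = 0 if captured else quiet + 1
            reward = engine.result(board)                  # +1/-1 from the pupil's side, None if unfinished
            if reward is None:
                pupil.td_update()                          # ordinary TD step, Y_t pulled towards Y_{t+1}
                # the trainer (frozen clone) replies
                board, captured = engine.apply_move(board, trainer.choose_move(board))
                quiet = 0 if captured else quiet + 1
                reward = engine.result(board)
            if reward is None and (quiet >= max_quiet_moves             # modification 1: likely cycle
                                   or engine.pieces(board) <= min_pieces):  # modification 2: few pieces left
                reward = engine.piece_difference(board)    # piece-count difference becomes the reward
            if reward is not None:
                pupil.td_update(final_reward=reward)       # final TD step of the episode, eq. (8)
                wins += reward > 0
                break
        if g % test_window == 0:
            if wins / test_window >= clone_threshold:      # cloning condition
                trainer.copy_weights_from(pupil)
            wins = 0
    return pupil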
4. CHECKERS – obtained results
The developed program allows all parameters to be tuned through a dialog window. There are four groups of parameters.

Parameters of the network:
• Learning coefficient (α in equation (8)),
• Parameter λ – it defines the influence of the error at time t on the earlier states,
• Maximal value of the weights (it defines the range [–maximum, maximum]),
• Search level (the depth of the game-tree search),
• Momentum – a learning parameter usually used in the backpropagation method.
Learning parameters:
• Maximal number of moves without a capture allowed during the learning process,
• Minimal number of checkers – a smaller number of checkers stops the game,
• Random selection of the initial state (yes/no),
• Introducing noise into the output of the network (yes/no, value of the disturbance).
Parameters of testing:
• Kind of tester: random (legal moves are generated); simple heuristic (advantage in the number of checkers); trained network (a network trained by playing games against itself); cloned network (a trainer-clone is used as the tester); "simple checkers" (an external program); "cake++" (an external program),
• Search level – the depth of the game-tree search used by the tester,
• Time (in seconds) – used for external programs instead of a search level,
• Number of test games played during a break in the learning process,
• Frequency of tests – the number of learning games after which the tests are performed.
Parameters of the trainer:
• Kind of trainer – the same options as for the tester,
• Search level – the depth of the game-tree search used by the trainer,
• Time (in seconds) – used for external programs instead of a search level,
• Cloning condition – the percentage of games won by the pupil after which the cloned network (the trainer) copies the weights from the pupil.
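For readers who wish to reproduce the setup, the four parameter groups map naturally onto a small configuration record. The field names are ours; the numeric defaults referring to equation (8) or Figure 2 come from the reported experiments, while the remaining values are placeholders:

from dataclasses import dataclass

@dataclass
class CheckersConfig:
    # network parameters
    alpha: float = 0.25            # learning coefficient in eq. (8), as in Figure 2
    lam: float = 0.0               # lambda, as in Figure 2
    max_weight: float = 1.0        # weights kept in [-max_weight, max_weight] (placeholder)
    search_level: int = 2          # depth of the game-tree search, as in Figure 2
    momentum: float = 0.0          # placeholder
    # learning parameters
    max_quiet_moves: int = 50      # moves without a capture before the game is cut off (placeholder)
    min_pieces: int = 6            # fewer pieces on the board stops the game (placeholder)
    random_start: bool = True
    output_noise: float = 0.01     # max. disturbance added to the network output, as in Figure 2
    # testing / trainer parameters
    tester: str = "clone"          # random | heuristic | trained | clone | external
    test_games: int = 50           # placeholder
    test_every: int = 100          # placeholder
    clone_threshold: float = 0.66  # cloning condition, as in Figure 2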
Each game can be displayed on the board. A status bar shows some useful and interesting statistics, such as the number of assessed nodes, the maximal depth of an evaluated node, the calculation time, data about the clone, and so on. The developed program was tested in order to answer a number of questions.

Is the network able to learn to play checkers at all? A learning network is initialized with randomly selected weights and learns by playing games against a trainer making random moves. The average number of games won by the trained network should increase as the learning process goes on; this indicates that the network is able to learn something. The number of games won is shown in Figure 2. One can see that the network learns to win against the random strategy, but only in the first part of the learning process: it levels off at about 80% of games won. When the search level was changed to 3, the network was able to win all games (see Figure 3).
Figure 2. Number of games won during the learning process with a random trainer; α = 0.25, λ = 0, level of search = 2, cloning condition = 66%, max. disturbance = 0.01
Figure 3. Further learning of the network from Figure 2, but with a level of search = 3

How good a player is the trained network? A network was trained (5000 epochs with a clone as the trainer) and then tested against the simple heuristic (advantage in the number of checkers) and against the external program NEURODRAUGHTS developed by M. Lynch [4]. The authors and their colleagues also played games against the trained network (the program CHECKERS). Since the trained network (without disturbances) and the simple heuristic make deterministic moves (all games between them would be identical),
the first three moves of the simple heuristic were chosen at random. Table 2 presents the obtained results. The trained network played 1000 games against the simple heuristic at each search depth. The network wins more games when the game tree is searched more deeply; the results suggest that at deeper search levels the network can exploit its more sophisticated knowledge of the game. NEURODRAUGHTS was also developed using temporal difference learning, so it is interesting to compare its competence at playing checkers with that of our program CHECKERS. These tests were made manually; their results suggest that the level of game-tree search in NEURODRAUGHTS is set to 5. Both programs, CHECKERS and NEURODRAUGHTS, play deterministically, therefore more games are unnecessary. It is worth mentioning that CHECKERS beat NEURODRAUGHTS at a search level of 6 – a very interesting result, because according to [4] NEURODRAUGHTS beat Dave Harte, a county master in the category up to 18 years, and it can beat other experienced players. The authors of CHECKERS turned out not to be serious opponents for the developed program – the learned network usually wins against us.

Table 2. Results of games played by the neural network trained with the TD(λ) algorithm

Match with the simple heuristic (1000 games per depth)
Depth of search | Won | Drawn | Lost
2               | 334 | 113   | 553
3               | 423 | 122   | 455
4               | 498 | 161   | 341
5               | 585 | 203   | 212

Match with NEURODRAUGHTS
Depth of search | Result
4               | Lost
5               | Drawn

5. Summary
The developed program and the briefly presented tests demonstrate that TD learning can be applied to games other than backgammon. The introduced modifications compensate for the properties that make checkers, as a deterministic game, less suited to TD(λ) learning, namely the requirements that all possible states of the game be taken into account during learning and that training episodes end with a reward (a win or a loss). The developed program plays checkers at least at an intermediate level. It is easy to find better checkers programs, but they have no learning capabilities: they work on the basis of coded knowledge acquired from experts, and most of them have a library of opening moves and endgame databases. The learning method developed in this study can also be used in other real applications. It seems that the program could be further improved by introducing methods that allow a greater number of nodes of the game tree to be processed in a shorter time.

References
[1] Allis L.V., van den Herik H.J., Huntjens M.P.H., Go-Moku Solved by New Search Techniques, Proceedings of the AAAI Fall Symposium on Games: Planning and Learning, 1993, 1-9.
[2] Buro M., Experiments with Multi-ProbCut and a New High-Quality Evaluation Function for Othello, Workshop on Game-Tree Search, NEC Research Institute, 1997.
[3] Findler N., Studies in machine cognition using the game of poker, Communications of the ACM 20(4):230-245, 1977.
[4] Lynch M., Neurodraughts – An Application of Temporal Difference Learning to Draughts, Dept. of CSIS, University of Limerick, Ireland.
[5] Olson D.K., Learning to Play Games From Experience: An Application of Artificial Neural Networks and Temporal Difference Learning, December 1993.
[6] Patashnik O., Qubic: 4×4×4 Tic-Tac-Toe, Mathematics Magazine 53 (1980), 202-216.
[7] Pollack J.B., Blair A.D., Why did TD-Gammon work?, Advances in Neural Information Processing Systems 9, 10-16, MIT Press, Cambridge, 1997.
[8] Robertie B., Carbon versus silicon: Matching wits with TD-Gammon, Inside Backgammon 2(2) (1992), 14-22.
[9] Samuel A., Some studies in machine learning using the game of checkers, IBM Journal of Research and Development, 3:211-229, July 1959.
[10] Samuel A., Some studies in machine learning using the game of checkers II, IBM Journal of Research and Development, 11:601-617, 1967.
[11] Schaeffer J., Billings D., Pena L., Szafron D., Learning to Play Strong Poker, Department of Computing Science, University of Alberta.
[12] Schraudolph N.N., Dayan P., Sejnowski T.J., Temporal difference learning of position evaluation in the game of Go, in Cowan J.D. et al. (Eds.), Advances in Neural Information Processing Systems 6, 817-824, Morgan Kaufmann, San Mateo, Calif., 1994.
[13] Spirydowicz A., Machine Learning in Games, Master Thesis, Wroclaw University of Technology, 2000.
[14] Sutton R., Learning to predict by the methods of temporal differences, Machine Learning, 3, 9-44, 1988.
[15] Tesauro G., Temporal Difference Learning and TD-Gammon, Communications of the ACM, March 1995, Vol. 38, No. 3.