Improving Temporal Difference Learning for Deterministic Sequential Decision Problems

Thomas Ragg, Heinrich Braun and Johannes Feulner
Universität Karlsruhe, ILKD, D-76128 Karlsruhe, Germany
email: {ragg, braun, [email protected]}
Abstract: We present a new connectionist framework for solving deterministic sequential decision problems based on temporal difference learning and dynamic programming. The framework is compared to Tesauro's approach of learning a strategy for playing backgammon, a probabilistic board game. It is shown that his approach is not directly applicable to deterministic games, but that simple modifications lead to impressive results. Furthermore, we demonstrate that a combination with methods from dynamic programming and the introduction of a re-learning factor can reduce the necessary training time by several orders of magnitude and improve the overall performance.

INTRODUCTION

For most real-world decision problems of practical relevance there exists no efficient algorithm to construct an optimal strategy. Therefore, heuristics are used in order to get a solution that is as good as possible within the given limits on computing time and space. Recently, neural networks have gained much interest for solving such problems by learning. We can distinguish two different approaches: supervised learning and reinforcement learning. For supervised learning we need a teacher who presents prototypical examples to the learner (i.e. the neural network). The neural network has to learn these examples, and it is expected to generalize sufficiently well to unseen situations by interpolation. A disadvantage of this approach lies in the fact that the performance of the learner is limited by the teacher. Moreover, for many problems no sufficiently well performing teacher exists. The reinforcement learning approach does not need such a teacher; instead, the learner only occasionally receives reinforcement signals about how well it is doing. Its aim is to maximize the reward. A well-known approach to assign credit to temporally successive states after receiving a reinforcement signal are temporal difference methods (Sutton, 1988). These methods have an intrinsic relationship to dynamic programming (Barto et al., 1991). They have been applied to several simple learning problems (Sutton, 1984; Anderson, 1987; Kaelbling, 1990), but also to more complex benchmarks (Tesauro, 1992). Especially Tesauro's TD-Gammon should be mentioned, since his seminal approach led to a commercially available program that achieves master-level play (Tesauro, 1993). For the evaluation of our approach we also use a game, because games offer a convenient testbed: they have a parsimoniously specified world model while being combinatorially complex, and there are many game-playing programs for benchmarking.

Applying temporal difference methods and connectionism to learn scoring functions for probabilistic games has some particular properties in comparison to a deterministic problem. Firstly, transitions in a probabilistic game proceed randomly, thus a network will always explore different paths from the same state, which
yields in general a smooth scoring function. Secondly, an optimal strategy for a probabilistic game maximizes the win expectation in each move, whereas an optimal strategy for a deterministic game minimizes (resp. maximizes) in each move the distance (i.e. the number of moves) to the win (resp. loss). Thirdly, the main difference to a deterministic decision problem is that a wrong action in a probabilistic game will only lower the probability of winning, whereas a wrong action in a deterministic game can mean the loss of the game.

As a benchmark for our research we chose the endgame of Nine Men's Morris. The whole game has recently been solved by a branch and bound method (Gasser & Nievergelt, 1994). Using three years of computing time on a workstation cluster, Ralph Gasser constructed a 10 GB database containing the distance to the win (resp. loss or draw) for all possible board positions under optimal play. Although Nine Men's Morris is thus already solved, it is nevertheless, and precisely for this reason, a convenient testbed, since we can measure the level of skill of a network by comparing its move selections to the optimal action. The endgame has about 2 million possible board positions, which can be reduced by various symmetries to 56922 essentially different configurations. Winning the game may take up to 26 half-moves, if optimal play is assumed. To our knowledge this is the first application of temporal difference learning to such a complex deterministic sequential decision problem.
PREVIOUS CONNECTIONIST APPROACHES

In the following we always assume the following general approach: the strategy of the player (learner) is a Min-Max strategy of depth 1, i.e., in a given board position the player takes the move to the succeeding board position that maximizes the scoring over all succeeding board positions. The task of the player is therefore to learn an appropriate scoring function. An optimal scoring function is the negative distance to the win in a winning position, the positive distance to the loss in a losing position, and 0 for a draw (as given by the database of Ralph Gasser). With such a scoring function the player plays an optimal strategy.

Our approach builds on earlier work done at our institute on the same topic of learning scoring functions with multilayer neural networks. Several problems arise when a task is tackled with neural techniques. The first one is always to find a good representation of the problem for the network. Possible strategies for this task lie outside the scope of this paper. In our former approach (Braun et al., 1991) we investigated two models for supervised learning. In the first approach the scoring function has to be given by the teacher. This approach has two serious disadvantages. Firstly, it is nearly impossible even for a good human player (teacher) to judge the distance to the win (resp. loss or draw) or to give another reliable scoring function. Secondly, it may be difficult for the neural network to imitate the strategy of the human player, although there may be a winning strategy that is easily learnable for the neural network; i.e., easy for a human player is not the same as easy for a neural network. For example, there exists a very simple winning strategy for the chess endgame king and rook against king which needs at most twice as many moves as the optimal strategy. A chess expert (using the optimal strategy) may overtax the neural learner, although the neural learner could have learned the simple, but not optimal, strategy.

In our second approach, we demanded from the teacher only a comparative judgment: given two board positions constructed as direct successors of a common preceding board position, the teacher has to judge which is the better one. With this information we train our scoring network by means of the following "decision network" (Fig. 1): the basis of the decision network are two identical copies of the scoring network (weight sharing) whose outputs are compared by the output unit of the decision network. For training we feed the two board positions to the two copies of the scoring network, respectively, and the judgment of the teacher to the output unit of
the decision network. In order to achieve a good generalization capability it was important to select salient training patterns carefully. Our solution was to refine the training set incrementally by having an expert play against the network: if he detects a bad move, he adds the board position to the training set and supplies a better move. For our testbed we found that this procedure works much better than choosing pairs of positions randomly and teaching them to the network. Even when using the (perfect) database for the judgment and ten times more patterns, we could not achieve the same playing skill as with the incrementally refined training set. Using this approach we succeeded in training a scoring network called Sokrates which outperformed two commercially available game-playing programs for Nine Men's Morris (Braun et al., 1991). The problems remaining unsolved are the dependence on a teacher, who has to select a training set, and whose ability to compare moves in a given board position is an upper bound on the network's performance. Temporal difference learning and dynamic programming are promising methods to overcome this limitation.
Figure 1: The figure shows the decision network, which consists of two subnets Sl and Sr connected with weights -v resp. v to the decision neuron D.
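To make the comparative training scheme concrete, the following minimal sketch (in Python/NumPy) shows one possible realization of such a decision network. The size of the scoring subnet, the tanh hidden layer, the squared-error loss and the learning rate are illustrative assumptions and are not taken from the original implementation.

# Minimal sketch of the "decision network" of Fig. 1 (assumed details, not the original code).
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID = 48, 20                     # assumed input encoding size and hidden layer size
W1 = rng.normal(0, 0.1, (N_HID, N_IN))   # shared scoring subnet, first layer
w2 = rng.normal(0, 0.1, N_HID)           # shared scoring subnet, output layer
v = 1.0                                  # weight from each subnet to the decision neuron D

def score(x):
    """Shared scoring subnet S: one hidden layer with tanh units."""
    h = np.tanh(W1 @ x)
    return w2 @ h, h

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_pair(x_left, x_right, right_is_better, eta=0.05):
    """One comparative training step: the teacher says which successor position is better."""
    global W1, w2
    s_l, h_l = score(x_left)
    s_r, h_r = score(x_right)
    d = sigmoid(v * (s_r - s_l))              # decision neuron D with weights -v and +v
    target = 1.0 if right_is_better else 0.0
    delta = (d - target) * d * (1.0 - d)      # error signal at the decision neuron
    # Backpropagate into the shared subnet: contributions of both copies add up (weight sharing).
    g_w2 = delta * v * (h_r - h_l)
    g_W1 = delta * v * (np.outer(w2 * (1 - h_r**2), x_right)
                        - np.outer(w2 * (1 - h_l**2), x_left))
    w2 -= eta * g_w2
    W1 -= eta * g_W1

With target 1 whenever the teacher prefers the right-hand position, the shared subnet is pushed to assign that position the higher score; after training, the subnet alone serves as the scoring function for depth-1 move selection.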
LEARNING A STRATEGY FOR DETERMINISTIC SEQUENTIAL DECISION PROBLEMS

If one aims to improve a scoring function for a deterministic game by temporal credit assignment and reinforcement learning, it is clear from the temporal difference algorithm

\Delta w_t = \alpha \, (P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k} \, \nabla_w P_k \qquad (1)
that all predictions P will converge to the reward value the closer the scoring function gets to an optimal scoring function. This is incompatible with a move-selection strategy that chooses the following board position with the lowest scoring for the opponent, and it will deteriorate the performance at a certain point during learning. For deterministic problems it seems natural to introduce a cost term c_t, or a decay parameter, that reflects each position's distance to the goal. This is done by replacing (P_{t+1} - P_t) in (1) with (P_{t+1} - P_t - c_{t+1}), as also proposed in (Sutton, 1988). Thus Tesauro's approach is not directly applicable to deterministic games.
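As a concrete illustration of the modified update, the following sketch applies equation (1) with the cost term to a single sequence of states. The linear scoring function, the function name and the parameter defaults are simplifying assumptions made here for brevity; the paper itself uses a multilayer network, whose gradient would take the place of the feature vector in the eligibility trace.

# Sketch of the TD(lambda) update of equation (1), extended by the cost term.
import numpy as np

def td_lambda_episode(states, reward, cost, w, alpha=0.05, lam=1.0):
    """Update the weight vector w along one sequence of feature vectors.

    states : list of feature vectors x_1 .. x_T
    reward : terminal reward of the sequence (e.g. 1 winner, 0 loser, 0.5 aborted)
    cost   : signed cost term (c for the winner sequence, -c for the loser, 0 if aborted)
    """
    trace = np.zeros_like(w)                  # eligibility trace: sum_k lam^(t-k) * grad_w P_k
    for t in range(len(states)):
        P_t = w @ states[t]
        trace = lam * trace + states[t]       # grad_w P_t = x_t in the linear case
        P_next = w @ states[t + 1] if t + 1 < len(states) else reward
        w += alpha * (P_next - P_t - cost) * trace
    return w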
Learning by experience through self-play converges slowly. TD-Gammon needed more than one million training games before it reached master level of play (Tesauro, 1993). Our results on the endgame of Nine Men's Morris confirm this observation (Fig. 2, re-learning factor 1). This shows that, if temporal difference methods are to be applied successfully to real-world problems, the training time has to be decreased. One possible way to speed up learning is to replay games from the same starting position. In a fashion similar to that proposed by Lin (1992) we used a pipeline of starting positions: games were replayed from each starting position several times before the first position was removed and a new position was appended. We couldn't find any significant increase of performance by this method of replaying. By monitoring the value of each state in the sequence, we found that a fixed sequence of board positions needs to be re-learned several hundred times by the TD algorithm before all predictions reach their target values, and that this happens in a non-monotonic way.

Therefore we altered the approach in the following way. We fixed the scoring network, i.e., the move selector, and trained a copy of this scoring network as predictor with the above TD algorithm. Re-learning a sequence for a given starting position then means re-learning exactly the same sequence (since the move selector is fixed). Thereby we can avoid the very time-consuming move selection (all directly succeeding board positions have to be evaluated for each move) simply by storing the sequence for re-learning. After re-learning each sequence a fixed number of times (the re-learning factor) we substitute the old scoring network by the prediction network and repeat the whole procedure until the performance stagnates. This approach is derived from dynamic programming, where it is proven that an optimal policy (scoring function) can be constructed by the following strategy: repeat (fix the policy; learn the correct predictions; construct the new policy from these predictions) until the policy is unchanged. This combination with dynamic programming yields the following learning algorithm (a code sketch follows the list):

1. Initialize the network randomly.
2. Play with the fixed scoring function.
   2.1 Choose n starting states.
   2.2 Create 2n sequences by self-play with the following strategy: choose from all possible moves the one which leads to the state with the lowest value according to the scoring function. If a terminal state is reached, split the sequence of states into a "winner" resp. "loser" sequence, receiving 1 resp. 0 as reward. The cost term is c for the winner sequence and -c otherwise. If the game was aborted, c is set to 0 and the reward is set to 0.5 for both sequences.
3. Learn from experience.
   3.1 Train all sequences alternately, as many times as given by the re-learning factor.
   3.2 Replace the previous scoring function by the trained network.
4. Measure the performance of the network, if possible. Continue with step 2, or abort if no further learning is necessary.
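The following sketch shows how the numbered steps above could be combined into one training loop, reusing td_lambda_episode() from the previous sketch. The helper functions play_game() (self-play with the fixed scoring function, returning the winner and loser sequences and an abort flag) and performance_stagnates() are placeholders for game-specific machinery and are not part of the original paper.

# Sketch of the outer loop (steps 1-4 above); helper names are assumptions.
def learn(w, starting_positions, relearn_factor, c=0.015):
    """Outer loop: fix the scoring function, self-play, re-learn, replace."""
    while True:
        scorer = w.copy()                        # 2. play with a fixed scoring function
        sequences = []
        for pos in starting_positions:           # 2.1 / 2.2 create 2n sequences by self-play
            winner_seq, loser_seq, aborted = play_game(scorer, pos)
            if aborted:                          # no terminal state within 32 half-moves
                sequences += [(winner_seq, 0.5, 0.0), (loser_seq, 0.5, 0.0)]
            else:
                sequences += [(winner_seq, 1.0, c), (loser_seq, 0.0, -c)]
        for _ in range(relearn_factor):          # 3.1 train all sequences alternately
            for states, reward, cost in sequences:
                w = td_lambda_episode(states, reward, cost, w)
        # 3.2 the trained network replaces the old scoring function at the top of the loop
        if performance_stagnates(w):             # 4. measure performance, abort if it stagnates
            return w

Because the move selector is frozen inside each iteration, every stored sequence can be re-learned relearn_factor times without re-evaluating all successor positions, which is where the speed-up reported below comes from.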
RESULTS

Using the algorithm given in the previous section, only a little parameter tuning was needed to obtain good results. The overall performance was almost independent of λ and of the learning rate (for learning rates up to 0.1); λ was set to 1 and c = 0.015. A game was aborted if a terminal position wasn't reached after 32 half-moves. Our first experiment examines the dependency of the network's performance on the re-learning factor. Figure 2 plots the mean deterioration compared to the optimal move, averaged over 2000 test positions equally distributed with respect to their distance to the end. The graphs show an inverse linear relationship between the re-learning factor and the training time necessary to reach good performance. Note that the number of learning steps is about the same for all values of the re-learning factor. The time needed for a multilayer network to learn one state, i.e., one backward pass, is small compared to the time needed to select a move, i.e., a forward pass for every possible move with the network as scoring function. Thus the training time for a re-learning factor of 1 is about 80 times higher than for a factor of 100. A dynamic adaptation
of the re-learning factor could lead to even better results. A side effect was that higher values of the re-learning factor also led to a slightly better overall performance.
Figure 2: Evolution of performance during training, dependent on the re-learning factor. The graphs show the average deterioration of the network at successive stages of training, measured on 2000 test positions equally distributed with respect to their distance to the end. a) Re-learning factor ∈ {1, 2, 5, 10}; b) re-learning factor ∈ {50, 100, 200}. Note the different scales on the x-axis (number of training games: up to 300,000 in a), up to 10,000 in b)).
We now consider the generalization capabilities of the TD-network and the network Sokrates, which depend on the distribution of board positions in the test set. The deterioration relative to the ideal scoring function is used as the measure of performance. If positions are chosen randomly, we get about 95% optimal moves for both networks. Since there are more board positions with low distances to the end, the test was repeated with board positions distributed equally with respect to their ideal scoring. For difficult board positions far from the end the TD-network still selects good moves. Table 1 shows the percentages of optimal and sub-optimal moves for a random and an equal distribution of test positions for the TD-network and Sokrates.

                         Deterioration compared to the optimal move (tempi lost)
Distribution  Network    0 (=opt.)   2     4     6     8     10    >10    Mean deter.  Fatal moves
random        TD         96.3        1.2   0.9   0.4   0.4   0.1   0.7    0.22         0.1
random        Sokrates   94.7        1.3   1.2   0.5   0.7   0.6   1.0    0.34         0.26
equal         TD         77.3        8.3   5.5   3.2   2.0   1.2   2.4    1.15         0.5
equal         Sokrates   64.0        9.0   6.1   4.7   4.6   3.3   8.3    2.4          2.3

Table 1: Deterioration of move selection by Sokrates and TD for a random and an equal distribution of test states. Column 3 shows the percentage of optimal moves, columns 4 to 9 give the percentage of moves where tempi are lost. Column 10 shows the average deterioration compared to Optimus, and column 11 shows the percentage of fatal moves.

The generalization capability is no direct measure of the performance in a real game. Therefore Sokrates, the TD-network and Optimus, an ideal scoring function, played against each other from 435 starting positions. The TD-network's playing strength lies between Sokrates and Optimus (Table 2). Surprisingly, Optimus doesn't win more than 50% of the games against the TD-network, i.e., the latter loses tempi but doesn't make fatal moves, that is, moves from a winning position into a loss.

            Sokrates   TD     Optimus
Sokrates    45         49     56
TD          43         47     50
Optimus     41         43     49.9

Table 2: Playing performance of the strategies Sokrates, TD and the optimal move selector Optimus against each other, measured in percentage of games won; each entry gives the percentage of games won by the column strategy against the row strategy.
CONCLUSIONS

We studied a connectionist approach to solve a deterministic sequential decision problem, the endgame of Nine Men's Morris. Temporal difference learning and dynamic programming proved to be powerful methods to learn a nearly optimal scoring function. The resulting network surpasses the level of skill of Sokrates, which plays the whole game at expert level and beats commercially available programs like Ramses. By introducing a re-learning factor, learning was sped up by two orders of magnitude, which is especially important in the case of a dynamic environment, where adaptation to the new situation should happen fast. Dynamic programming was used to improve the performance. The extension of this approach to all phases of the game is currently under investigation.
ACKNOWLEDGMENTS

We want to thank Richard Sutton for fruitful discussions on reinforcement learning and on our practical applications of his temporal difference learning algorithm.
REFERENCES

Anderson, Charles W. (1987) Neuronlike adaptive elements that can solve difficult learning control problems. In Proceedings of the Fourth International Workshop on Machine Learning, pp. 103-114.

Barto, Andrew G., S. J. Bradtke, and S. P. Singh (1991) Real-time learning and control using asynchronous dynamic programming. Technical report, University of Massachusetts.

Braun, Heinrich, Johannes Feulner, and Volker Ullrich (1991) Learning strategies for solving the problem of planning using backpropagation.

Gasser, R. and J. Nievergelt (1994) Es ist entschieden: Das Mühlespiel ist unentschieden. Informatik Spektrum, 5(17):314-317.

Kaelbling, L. P. (1990) Learning in embedded systems. Doctoral dissertation, Stanford University, Department of Computer Science.

Lin, Long-Ji (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(8):293-321.

Sutton, Richard S. (1984) Temporal credit assignment in reinforcement learning. Doctoral dissertation, University of Massachusetts, Department of Computer and Information Science.

Sutton, R. S. (1988) Learning to predict by the method of temporal differences. Machine Learning, 3:9-44.

Tesauro, G. (1992) Practical issues in temporal difference learning. Machine Learning, 8:257-277.

Tesauro, G. (1993) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, pages 215-219.