Reinforcement Learning in 2-players Games

Kazuteru Miyazaki, National Institution for Academic Degrees, 3-29-1 Ootsuka, Bunkyo-ku, Tokyo 112-0012, Japan ([email protected])
Sougo Tsuboi, TOSHIBA, 1 Toshiba, Komukai, Saiwai, Kawasaki 212-8582, Japan ([email protected])
Shigenobu Kobayashi, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226-8502, Japan ([email protected])

Abstract

The purpose of a reinforcement learning system is, in general, to learn an optimal policy. In 2-players games such as othello, however, it is also important to acquire a penalty avoiding policy. In this paper, we focus on the formation of penalty avoiding policies based on the Penalty Avoiding Rational Policy Making algorithm [2]. In applying it to large-scale problems, we are confronted with the curse of dimensionality. To overcome it in 2-players games, we introduce several ideas and heuristics. We show that our learning player can always defeat the well-known othello game program KITTY.
1 INTRODUCTION
Reinforcement learning (RL) [3] is a branch of machine learning that aims to adapt an agent to a given environment using rewards as its only clue. RL systems on Markov Decision Processes (MDPs) can be classified into two types: those that identify state transition probabilities, such as Q-learning (QL) [4] and the k-Certainty Exploration Method [1], and those that only consider whether a state transition exists or not. The 2-players games discussed in this paper belong to the latter class. In most RL systems, a positive reinforcement signal, called a reward, is given to the agent when it achieves its purpose, and a negative one, called a penalty, is given when it violates a restriction. In the class that only considers whether a transition exists, incorrectly chosen reward and penalty values lead the agent to learn unexpected behavior; it is therefore important to distinguish a reward from a penalty [2]. The Penalty Avoiding Rational Policy Making algorithm (PARP) [2] is an RL system that makes this distinction in that class. Although it can suppress every penalty as reliably as possible while obtaining rewards constantly, it has to memorize many state-action pairs. In this paper, we aim to adapt PARP to 2-players games such as othello. We introduce several ideas and heuristics to overcome the combinatorial explosion in large-scale problems, and we confirm the effectiveness of the proposed method on the othello game.
2 THE DOMAIN
Consider an agent in some unknown environment. At each time step, the agent gets information about the environment through its sensors and chooses an action. As a result of some sequence of actions, the agent receives a reward or a penalty from the environment. We assume that the target environments are MDPs. A pair of a sensory input (a state) and an action is called a rule. We denote the rule 'if x then a' as xa, where x is a state and a is an action. The function that maps states to actions is called a policy. We call a policy rational if and only if its expected reward per action is larger than zero. The function that maps a state (or a rule) to a reward (or a penalty) is the reward function. We assume that the reward function and the candidate descendant states of each state transition are available: when the agent selects an action a in state s_t at time t, we can know the possible next states s_{t+1} at time t+1 and the immediate reward or penalty. These functions need not be exactly correct. The learner is not confused by incomplete information, where a penalty or a state that actually exists is not reported to the agent, but it is confused by unreliable information, where a penalty or a state that does not exist is reported to the agent. This is a natural assumption in 2-players games such as othello, igo, shougi, backgammon and so on.
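As an illustration only, here is a minimal sketch of the interface this assumption gives the agent; the toy states, actions, and the names candidate_successors and reward_of are ours, not part of the original system.

```python
from typing import Dict, List, Tuple

State = str                     # a sensory input, e.g. an encoded board position
Action = str                    # an action, e.g. a legal move
Rule = Tuple[State, Action]     # the rule "if x then a", written xa in the text

# Candidate descendant states of each rule.  The information may be incomplete,
# but it must never report transitions or penalties that cannot actually occur.
candidate_successors: Dict[Rule, List[State]] = {
    ("x", "a"): ["y", "z"],
    ("y", "a"): ["p"],
    ("y", "b"): ["x"],
}

# Reward function: maps a state (or a rule) to a reward or a penalty.
def reward_of(state: State) -> int:
    if state == "p":
        return -1               # a penalty
    if state == "z":
        return +1               # a reward
    return 0
```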
Figure 1: An example of penalty rules (xa, ya) and a penalty state (y)
We call the sequence of rules used between the previous reward (or penalty) and the current one an episode. We call a subsequence of an episode a detour when the state of its first firing rule and that of its last firing rule are the same although the two rules are different. A rule that appears outside every detour in some episode is rational; otherwise, the rule is called irrational. We call a rule a penalty rule if and only if it directly receives a penalty or it can transit to a penalty state, that is, a state that contains only penalty or irrational rules. For example, in Figure 1, xa and ya are penalty rules, and state y is a penalty state. A policy that contains no penalty rule is called a penalty avoiding policy. For each sensory input, a deterministic policy always returns the same action. We assume that there is a deterministic rational policy among the penalty avoiding policies.
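Under one straightforward reading of these definitions, the rules lying on a detour can be found by scanning an episode for repeated states; a minimal sketch (the helper name rules_on_detours is ours):

```python
from typing import List, Set, Tuple

Rule = Tuple[str, str]   # (state, action)

def rules_on_detours(episode: List[Rule]) -> Set[Rule]:
    """Return the rules of an episode that lie on some detour.

    A detour is a subsequence whose first and last firing rules share the
    same state but are different rules; rules appearing outside every
    detour in some episode are the rational ones.
    """
    on_detour: Set[Rule] = set()
    for i, (state_i, _) in enumerate(episode):
        for j in range(len(episode) - 1, i, -1):
            state_j, _ = episode[j]
            if state_j == state_i and episode[j] != episode[i]:
                # Rules fired from position i up to (but excluding) j form a
                # loop that returns to state_i: mark them as on a detour.
                on_detour.update(episode[i:j])
                break
    return on_detour

# Example: the agent revisits state "x" before reaching the reward.
episode = [("x", "a"), ("y", "b"), ("x", "b"), ("z", "a")]
print(rules_on_detours(episode))   # {('x', 'a'), ('y', 'b')}
```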
3 PENALTY AVOIDING RATIONAL POLICY MAKING IN 2-PLAYERS GAMES

3.1 The Penalty Avoiding Rational Policy Making Algorithm (PARP)
PARP can always learn a deterministic rational policy whenever one exists among the penalty avoiding policies. It uses the Penalty Rule Judgment algorithm (PRJ) (Fig. 2) to suppress all penalty rules in the current rule set. After all penalty rules are suppressed, it makes a rational policy with the Rational Policy Improvement algorithm (RPI) [2]. RPI is not needed in 2-players games, because suppressing all penalty rules is already sufficient to obtain a reward. To find all penalty rules, PRJ has to memorize every rule that has been experienced and every descendant state reached by those rules. In applying PRJ to large-scale problems, we are therefore confronted with the curse of dimensionality. To overcome it in 2-players games, it is important to save memory and to restrict exploration; these points are discussed in sections 3.2 and 3.3, respectively.
procedure Penalty Rule Judgment
begin
  Set a mark on every rule that has directly received a penalty.
  do
    Set a mark on every state in which there is no rational rule,
      or no rule that can transit only to unmarked states.
    Set a mark on every rule that can transit to a marked state.
  while (there is a new mark on some state)
end.
Figure 2: The Penalty Rule Judgment algorithm
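A minimal Python sketch of how we read the PRJ pseudocode above; the containers (rules, transitions, irrational, and so on) are our own representation of the rule set, not the authors' implementation.

```python
from typing import Dict, Set, Tuple

State = str
Rule = Tuple[State, str]   # (state, action)

def penalty_rule_judgment(
    rules: Set[Rule],
    transitions: Dict[Rule, Set[State]],   # candidate descendant states of each rule
    direct_penalty: Set[Rule],             # rules that have directly received a penalty
    irrational: Set[Rule],                 # rules judged irrational (only seen on detours)
) -> Tuple[Set[Rule], Set[State]]:
    """Mark penalty rules and penalty states by iterating to a fixpoint."""
    penalty_rules: Set[Rule] = set(direct_penalty)
    penalty_states: Set[State] = set()
    changed = True
    while changed:                         # "while there is a new mark"
        changed = False
        # Mark a state that contains only penalty rules or irrational rules
        # (the paper's definition of a penalty state).
        for state in {s for (s, _) in rules} - penalty_states:
            rules_here = [r for r in rules if r[0] == state]
            if all(r in penalty_rules or r in irrational for r in rules_here):
                penalty_states.add(state)
                changed = True
        # Mark a rule that can transit to a marked (penalty) state.
        for rule in rules - penalty_rules:
            if transitions.get(rule, set()) & penalty_states:
                penalty_rules.add(rule)
                changed = True
    return penalty_rules, penalty_states
```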
3.2 How to Save Memory by Calculating State Transitions
We aim to adapt PRJ to the domain discussed in section 2. Before selecting an action, the agent tries to find all penalty rules in the current rule set by calculating all states that can be reached from the current state. After selecting an action, if the agent receives a penalty, it tries to find penalty rules again. This is realized by a long-term and a short-term memory.

The long-term memory: whenever a previously unknown penalty rule is found in the short-term memory, it is stored in the long-term memory. The long-term memory is retained throughout learning.

The short-term memory: all states and actions in the current episode are stored in the short-term memory. After calculating all states and rules that can be reached from the current state, they are also stored in the short-term memory. If any of these states appear in the long-term memory, PRJ tries to find a penalty rule. The short-term memory is initialized at the start of each episode.

These two memories let us avoid storing the full set of state transitions.
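A minimal sketch of this bookkeeping, under our own assumptions about the data structures (the class and method names are ours):

```python
from typing import Dict, List, Set, Tuple

State = str
Rule = Tuple[State, str]

class PenaltyMemory:
    """Long-term memory of penalty rules plus a per-episode short-term memory."""

    def __init__(self) -> None:
        self.penalty_rules: Set[Rule] = set()        # long-term: kept across episodes
        self.episode: List[Rule] = []                # short-term: rules fired this episode
        self.lookahead: Dict[Rule, Set[State]] = {}  # short-term: one-step successors

    def start_episode(self) -> None:
        """The short-term memory is initialized for each episode."""
        self.episode.clear()
        self.lookahead.clear()

    def observe(self, state: State, action: str, successors: Set[State]) -> None:
        """Record a fired rule and the candidate states it can transit to."""
        rule = (state, action)
        self.episode.append(rule)
        self.lookahead[rule] = set(successors)

    def record_penalty(self, rule: Rule) -> None:
        """A newly found penalty rule is promoted to the long-term memory."""
        self.penalty_rules.add(rule)

    def is_penalty(self, rule: Rule) -> bool:
        return rule in self.penalty_rules
```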
3.3 How to Restrict Exploration by Knowledge
In applying PRJ to large-scale problems, many trials are needed before penalty rules propagate through the rule set; this is especially serious in long episodes. To overcome it, we introduce a semi-penalty, defined separately for each problem, to expand the definition of a penalty with domain knowledge. We call a rule a semi-penalty rule if and only if it receives a penalty or a semi-penalty, or it can transit to a penalty state or a semi-penalty state, that is, a state that contains only semi-penalty, penalty or irrational rules. After finding penalty rules with PRJ, we run PRJ again to find semi-penalty rules. Note that if the semi-penalty is defined badly, more trials are needed to find penalty rules than with PRJ alone, since exploration is biased. Moreover, since a semi-penalty does not always lead to a penalty, it is possible for every state to become a semi-penalty state even though a penalty avoiding rational policy exists. This problem is avoided by the action selector: normally we select a rational rule that is neither a penalty rule nor a semi-penalty rule, but if no such rule is available in a semi-penalty state, we fall back to any rational rule that is not a penalty rule; a sketch of this preference order follows.
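A minimal sketch of that preference order; the function name and rule representation are our own illustration, not the authors' implementation.

```python
from typing import Optional, Sequence, Set, Tuple

Rule = Tuple[str, str]   # (state, action)

def select_rule(
    candidates: Sequence[Rule],     # rational rules available in the current state
    penalty: Set[Rule],
    semi_penalty: Set[Rule],
) -> Optional[Rule]:
    """Prefer rules that avoid both penalties and semi-penalties.

    If every remaining rational rule is a semi-penalty rule (which can happen,
    because a semi-penalty does not always cause a real penalty), fall back to
    any rational rule that at least avoids a penalty.
    """
    safe = [r for r in candidates if r not in penalty and r not in semi_penalty]
    if safe:
        return safe[0]
    fallback = [r for r in candidates if r not in penalty]
    return fallback[0] if fallback else None
```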
4 APPLICATION TO THE OTHELLO GAME

4.1 The Well-known Othello Game Program: KITTY

We implement our method as a learning system for an othello game player. We use KITTY (http://forum.swarthmore.edu/jay/learn-game/systems/kitty.html) by Igor Durdanovic as the opponent; it is among the strongest open-source othello programs. We use kitty.ios from KITTY's source code, which provides an interface to the Internet Othello Server (IOS) that hosts the othello game. We do not give KITTY a learning mechanism, so its action selection probabilities are stable. The depth of its search is set to 60, the maximum value.

4.2 Construction of the RL Player

4.2.1 The Setting of the Game

We now describe our RL player. It gets the state of the game from IOS and can calculate the candidate actions in that state. It selects an action and returns it to IOS; if no action is available, it returns the 'PASS' action. If it cannot win the game, it receives a penalty from IOS. We set the size of the short-term memory to 1000, which is enough to store at least a one-step state transition. The player can calculate two or three steps of state transitions in the opening and one step in the middle game. Note that there is no irrational rule in the othello game.

4.2.2 Knowledge of the Othello Game

It is important to restrict exploration from the opening through the middle game, since the state space in the middle game is huge. We use the following two types of knowledge to realize this.

i. KIFU database. A KIFU database records the moves of previous famous games. We use NEC's KIFU database (ftp://ftp.nj.nec.com/pub/igord/IOS/misc/database.zip), which contains about 100,000 games. It gives us typical state transitions for the opening and helps to avoid wasteful exploration there.

ii. Evaluation function. We use KITTY's evaluation function, whose values KITTY sends to IOS, as our RL player's evaluation function. KITTY reports a value between -64.00 and +64.00 to IOS as the evaluation value of a state. We define a semi-penalty state as a state whose evaluation value is larger than +1.

Figure 3: The Action Selector. (The selector gets state s from IOS, matches s with the short-term memory and the KIFU database, and branches on 'on KIFU database?', 'penalty state?' and 'semi-penalty state?'; depending on the branch it suppresses all penalty rules, or all penalty and semi-penalty rules, and selects an action by the basic action strategy or the most frequently used rule, until the game is over.)

4.2.3 How to Select an Action

Our RL player selects an action according to Figure 3. The basic action strategy is to select the action that has the fewest transition states among all available actions; this restricts wasteful exploration. When the total number of occupied cells exceeds 54, our RL player calculates the game through to the end. KITTY, on the other hand, does so once the number exceeds 50, since it can use min-max search with its evaluation function.
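A minimal sketch combining the suppression of (semi-)penalty rules with the basic action strategy; all names are our own assumptions, and the Figure 3 branch that picks the most frequently used rule in a semi-penalty state is omitted for brevity.

```python
from typing import Dict, List, Optional, Set, Tuple

State = str
Action = str
Rule = Tuple[State, Action]

def choose_action(
    state: State,
    legal_actions: List[Action],
    successors: Dict[Action, Set[State]],   # candidate next states of each action
    kifu_moves: Dict[State, Action],        # book moves taken from the KIFU database
    penalty: Set[Rule],
    semi_penalty: Set[Rule],
) -> Optional[Action]:
    """Select a move: book move if known, otherwise the safest least-branching move."""
    if not legal_actions:
        return None                          # the caller sends 'PASS' to IOS

    # Opening: follow the KIFU database when the current position is in it.
    if state in kifu_moves:
        return kifu_moves[state]

    # Suppress penalty rules, and semi-penalty rules whenever possible.
    safe = [a for a in legal_actions
            if (state, a) not in penalty and (state, a) not in semi_penalty]
    if not safe:
        safe = [a for a in legal_actions if (state, a) not in penalty]
    if not safe:
        safe = list(legal_actions)

    # Basic action strategy: the action with the fewest transition states.
    return min(safe, key=lambda a: len(successors.get(a, set())))
```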
4.3 Results and Discussion

Our RL player learned a deterministic rational policy within 1580 games as the first player (the black player). This means that our method outperforms KITTY, since in KITTY vs. KITTY games the winner is always the second player (the white player). Without the semi-penalty, the frontier of penalty rules reaches only the 34-cell stage after 2000 games. With the semi-penalty, in contrast, a semi-penalty rule can already be selected at the 18-cell stage after 949 games. This shows that the semi-penalty overcomes the slow propagation of penalty rules.

Figure 4 plots the number of memorized states against the number of games when we use the semi-penalty discussed in 4.2.2. The number of non-semi-penalty states that contain a semi-penalty rule stays very small, which means that this semi-penalty is hard to change into non-semi-penalty states.

We have also defined another semi-penalty: a state is a semi-penalty state when the number of available actions in the next stage is less than 2. Figure 5 shows the result. Here the number of non-semi-penalty states that contain a semi-penalty rule is not stable, which means that this semi-penalty easily changes into non-semi-penalty states. We should therefore define a semi-penalty that is hard to change into non-semi-penalty states.

Figure 4: The number of memorized states plotted against the number of games when we use the semi-penalty discussed in 4.2.2 (a victory strategy is found after 1580 games)

Figure 5: The number of memorized states plotted against the number of games when we use an inefficient semi-penalty
5 CONCLUSIONS
In this paper, we adapted the Penalty Avoiding Rational Policy Making algorithm [2] to large-scale MDPs such as 2-players games. We implemented our method as a learning system for an othello game player; it can always defeat the well-known othello game program KITTY. In future work, we will compare our method with a version of KITTY that has a learning mechanism, and we will extend our method to Partially Observable Markov Decision Processes and to multi-agent systems.
References

[1] Miyazaki, K., Yamamura, M. and Kobayashi, S.: k-Certainty Exploration Method: An Action Selector to Identify the Environment in Reinforcement Learning, Artificial Intelligence, Vol. 91, pp. 155-171, 1997.
[2] Miyazaki, K. and Kobayashi, S.: Reinforcement Learning for Penalty Avoiding Policy Making, 2000 IEEE International Conference on Systems, Man and Cybernetics, pp. 206-211, 2000.
[3] Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press, 1998.
[4] Watkins, C. J. C. H. and Dayan, P.: Technical Note: Q-learning, Machine Learning, Vol. 8, pp. 55-68, 1992.