Multi-Agent Task Division Learning in Hide-and-Seek Games

Mohamed K. Gunady1, Walid Gomaa1,*, Ikuo Takeuchi1,2

1 Egypt-Japan University of Science and Technology, New Borg El-Arab, Alexandria, Egypt
{mohamed.geunady, walid.gomaa}@ejust.edu.eg
2 Faculty of Science and Engineering, Waseda University, Tokyo, Japan
[email protected]

* Currently on leave from the Faculty of Engineering, Alexandria University, Egypt.
Abstract. This paper discusses the problem of territory division in Hide-and-Seek games. To obtain efficient seeking performance with multiple seekers, the seekers should agree on searching their own territories and learn to visit good hiding places first so that the expected time to find the hider is minimized. We propose a learning model using Reinforcement Learning in a hierarchical learning structure. Elemental tasks of planning the path to each hiding place are learnt in the lower layer, and then the composite task of finding the optimal visiting sequence is learnt in the higher layer. The proposed approach is examined on a set of different maps and converges to the optimal solution.

Keywords: Hide-and-Seek, task division, multi-agent systems, Q-learning, task sequence learning
1   Introduction
Most real-life applications of multi-agent systems and machine learning involve highly complicated tasks. One learning paradigm for solving such complex tasks is to learn simpler tasks first and then increase the task complexity until the desired level is reached, similar to the learning paradigm followed by human children. Various practical planning problems can be reduced to Hide-and-Seek games, in which children learn the basic concepts of path planning, map building, navigation, and cooperation while working in teams. Robotics applications in domains such as disaster rescue, criminal pursuit, and mine detection/demining are typical examples. One important aspect when dealing with cooperative seekers is how to divide the search environment among them for better seeking performance, namely territory division or, in more abstract terms, task division.

Previous research has addressed similar fundamental problems such as multi-agent coverage. A survey of the coverage problem was given by Choset in [1], which discussed heuristic, approximate, and exact coverage algorithms based on cellular decomposition, as well as some complete coverage algorithms for multiple robots. Another idea is to use spanning trees as a basis for efficient coverage paths in multi-robot coverage. Agmon et al. exploited this idea in [2] by introducing a polynomial-time tree construction algorithm in order to improve the coverage time. However, they assume that the robots are initially distributed over the terrain, and they build the spanning tree so that the robots do not pass by each other throughout the coverage plan.

The problem of multi-agent coverage differs from the territory division problem in Hide-and-Seek games. In seeking tasks, the seekers commonly start at adjacent points and their seeking trajectories might intersect. Furthermore, it is not necessary to cover the whole terrain in Hide-and-Seek: scanning only the possible hiding places is sufficient, and scanning the good hiding places first is preferable. These are the main differences that motivate our Reinforcement Learning approach.

In this paper, we propose a new learning model that solves the problem of territory division using a hierarchical learning structure. Section 2 gives more details about the problem of seeking task division in Hide-and-Seek games. Section 3 provides a quick review of the Q-Learning algorithm adopted in this work. Section 4 illustrates the proposed learning model for both single and multiple seekers. Experimental results and analysis of the proposed approach are presented in Section 5. Finally, Section 6 concludes the paper.
2   Task Division in Hide-and-Seek
Consider a simple form of the Hide-and-Seek game in which there are multiple seekers and one hider. Before the game starts, the hider chooses a hiding place according to a pre-determined hiding probability distribution over the environment, obtained from an earlier learning process. Afterwards, it is the seekers' task to search for the hider and find its hiding place in minimum time; a game terminates as soon as one seeker finds the hider, so neither escaping nor chasing is involved. The seekers cannot simply scan the whole environment for the hider, or they will suffer from poor seeking performance. They should learn the possible hiding places and perform an optimized scan that minimizes the expected time to find the hider (a small numerical sketch of this quantity is given at the end of this section).

For a game with multiple seekers, a good seeking strategy is to cooperate by dividing the seekers' territories, which increases the seeking efficiency and thus decreases the seeking time. Such cooperation, however, requires a physical communication channel between the seekers. In this paper full communication is assumed during the learning phase, while after learning the seekers can operate without communication.

To achieve a good territory division strategy there are three possible approaches. The first and simplest is to adopt some fixed heuristic, e.g. East-West, North-South, or a similar partitioning strategy. However, this approach is too naive to achieve well load-balanced performance across different map topologies. The second approach is to solve the territory division problem algorithmically. Constructing an algorithm that finds the optimal seeking strategy is a hard task that requires a rigorous understanding of the problem model and its details; the k-Chinese postman problem is a closely related example, which is classified as NP-complete.
The third approach is learning, which can be considered a moderate solution: a learning system can easily adapt to different map topologies and hiding probability distributions. In the next sections, a learning model is presented that solves the multi-seeker territory division problem by applying a simple Reinforcement Learning algorithm, Q-Learning.
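To make the objective above concrete, the quantity being minimized is the expected time to find the hider under a given visiting order of the hiding places. The following minimal sketch (hypothetical helper names, not the paper's implementation) computes this expectation for a single seeker from the hiding probabilities and the travel times between consecutive hiding places.

```python
from itertools import permutations

def expected_seek_time(order, hide_prob, travel_time):
    """Expected time to find the hider when hiding places are
    visited in the given order.

    order       -- sequence of hiding-place ids, e.g. (2, 0, 1)
    hide_prob   -- dict: hiding place -> probability of hiding there
    travel_time -- function(prev, place) -> steps to reach `place`
                   from `prev` (prev is None for the start position)
    """
    t, prev, expected = 0.0, None, 0.0
    for place in order:
        t += travel_time(prev, place)       # arrival time at `place`
        expected += hide_prob[place] * t    # hider is found here with this probability
        prev = place
    return expected

# Toy usage: three hiding places on a line, seeker starts at position 0.
positions = {0: 2, 1: 5, 2: 9}
hide_prob = {0: 0.2, 1: 0.2, 2: 0.6}
travel = lambda prev, p: abs(positions[p] - (0 if prev is None else positions[prev]))

best = min(permutations(hide_prob), key=lambda o: expected_seek_time(o, hide_prob, travel))
print(best, expected_seek_time(best, hide_prob, travel))
```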
3   Q-Learning
Reinforcement Learning (RL) is a machine learning paradigm in which an agent learns over time through repetitive trial-and-error interaction with the surrounding environment. Rewards are received depending on the agent's actions in the observed states of the environment. RL aims to learn an optimal decision-making strategy that gains the maximum total (discounted) reward received from the environment over the agent's lifetime. Most RL algorithms are based on the Markov Decision Process (MDP) model, in which the agent chooses an action a depending only on the current state s, resulting in a new state s', while a cumulative reward function R quantifies the quality of the temporal sequence of actions. One of these algorithms is the Q-Learning of Watkins [3].

The agent's task is to learn an optimal policy \pi : S \rightarrow A that tells the agent which action a \in A to take in the current state s \in S, with the goal of maximizing the cumulative discounted reward. Given a policy \pi and a state s_t at time step t, the value of s_t under \pi (assuming an infinite horizon) is

V^{\pi}(s_t) = \sum_{k=t}^{\infty} \gamma^{\,k-t}\, r_{k+1},    (1)

where 0 < \gamma < 1 is the discount factor and r_{k+1} is the immediate reward received at time k. Thus, by calculating V^{*}(s), the optimal value function, an optimal action policy \pi^{*} is readily available. Unfortunately, in many practical problems the reward function (and, more generally, the environment model) is unknown in advance, which means Equation (1) cannot be evaluated directly.

Let Q(s,a) be the maximum discounted cumulative reward achievable by taking action a from state s and then acting optimally. In other words, the Q value equals the immediate reward received for taking action a from state s, plus the value gained by following an optimal policy afterwards:

Q(s,a) = r(s,a) + \gamma V^{*}(s').    (2)

Notice that V^{*}(s) = \max_{a'} Q(s,a'), which allows us to rewrite Equation (2) in a recursive form:

Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s',a'),    (3)

where s' is the next state reached from state s after taking action a (this formula assumes deterministic environment dynamics). The Q-Learning algorithm constructs a lookup table of estimated Q values; at each interaction step it observes the experience tuple \langle s, a, r, s' \rangle and updates the corresponding table entry according to Equation (3).
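As a concrete illustration of the tabular update in Equation (3), the sketch below runs deterministic Q-Learning against a generic grid environment. It is a minimal sketch, not the authors' implementation: the environment interface (`reset`, `actions`, `step`) and the epsilon-greedy exploration choice are assumptions made for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, gamma=0.9, epsilon=0.2):
    """Tabular Q-Learning for a deterministic environment.

    `env` is assumed to expose reset() -> state, actions(state) -> list of
    actions, and step(state, action) -> (next_state, reward, done); these
    names are illustrative, not from the paper.
    """
    Q = defaultdict(float)                      # lookup table of Q(s, a) estimates
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            # epsilon-greedy exploration over the current estimates (assumed choice)
            if random.random() < epsilon:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            # deterministic-dynamics update, Equation (3):
            # Q(s, a) <- r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] = r + gamma * best_next
            s = s2
    return Q
```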
Fig. 2. An example of location-action interpretation with time steps for two seekers in a 4×4 map. The dashed lines represent the seeking trajectories of the two seekers. Circles represent four possible hiding places. Seekers start from the map origin (1,1) in the top-left corner.
5   Experiments
This section discusses some experiments that examine the proposed model. A set of 2D grid maps was used as a testbed. A seeker has the physical action space {N, S, E, W}, used in the lower level of learning. Results are presented, discussed, and compared with the true optimal solution.

Figure 3 shows the maps used for the experiments, with different sizes, topologies, and hiding probability distributions. The X on each map marks the starting position of the adjacent seekers, while black cells represent blocks that seekers cannot pass through. Zigzagged and dashed cells mark possible hiding places with different hiding probabilities; a zigzagged cell has a hiding probability three times that of a dashed one. Experiments are run with two seekers on each map.

Figure 4 shows the seeking trajectories of two seekers that the learning procedure converges to. Note that the optimal seeking strategy does not require the seekers to visit all the cells, but only those related to the possible hiding places. Furthermore, it is not necessarily preferable to visit the nearest hiding place first, because it may have a low hiding probability, which increases the expected time to find the hider. The results of map (e) illustrate such a case: the seekers prefer to visit far but high-probability hiding places first and then backtrack to lower-probability ones.

Over time, the maximum Q value of the initial state over all its possible actions gives an indication of the learning progress, as it converges to the optimal value gained by applying the optimal policy. To check the completeness of the results, they are compared to the results produced by an exhaustive search over the whole solution space. The comparison showed that the proposed system converges to the true optimal solution for all cases listed here. Figure 5 shows the convergence progress towards the optimal solution: it compares the maximum Q value of the initial state learnt so far with the actual cumulative discounted reward gained by the optimal solution.
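For reference, the exhaustive-search baseline used for comparison can be thought of as a brute-force enumeration of every split of the hiding places between the two seekers and every visiting order within each split, keeping the plan with the smallest expected seeking time. The sketch below is a hypothetical version of such a baseline, not the authors' exact implementation; `travel_time` and the helper names are assumptions.

```python
from itertools import permutations

def joint_expected_time(order1, order2, hide_prob, travel_time):
    """Expected time to find the hider when two seekers search in parallel,
    each following its own visiting order; travel_time(None, p) is the time
    from the common start position to p."""
    arrival = {}
    for order in (order1, order2):
        t, prev = 0.0, None
        for place in order:
            t += travel_time(prev, place)
            arrival[place] = t
            prev = place
    return sum(hide_prob[p] * arrival[p] for p in hide_prob)

def brute_force_two_seekers(places, hide_prob, travel_time):
    """Enumerate all assignments of hiding places to the two seekers and all
    visiting orders; exponential in the number of hiding places."""
    best_plan, best_cost = None, float("inf")
    n = len(places)
    for mask in range(2 ** n):                  # bit i decides which seeker takes places[i]
        part = [[], []]
        for i, p in enumerate(places):
            part[(mask >> i) & 1].append(p)
        for order1 in permutations(part[0]):
            for order2 in permutations(part[1]):
                cost = joint_expected_time(order1, order2, hide_prob, travel_time)
                if cost < best_cost:
                    best_plan, best_cost = (order1, order2), cost
    return best_plan, best_cost
```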
(a)–(f)
Fig. 3. Different maps used for experiments. X marks the starting point of the seekers; zigzagged and dashed cells are possible hiding places.
(a)–(f)
Fig. 4. Results of experiments for two seekers. The dashed lines represent the optimal seeking trajectories, while the solid line marks where the two trajectories intersect. γ = 0.99.
(a)–(f)
Fig. 5. The convergence progress over time. It compares the maximum Q value learnt so far for the initial state (solid curve) with the optimal cumulative discounted reward (dashed line) gained by the optimal policy obtained by an exhaustive search. After some learning episodes the maximum Q value converges to the optimal value. In the action selection method, actions are chosen probabilistically based on the estimated Q values of the state-action pairs.
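The probabilistic action selection mentioned in the caption can be realized, for example, with a Boltzmann (softmax) rule over the current Q estimates. The exact rule is not specified in the text, so the following sketch and its temperature parameter are assumptions for illustration only.

```python
import math
import random

def softmax_action(Q, state, actions, temperature=1.0):
    """Pick an action with probability proportional to exp(Q(s,a)/T).
    Higher-valued actions are favoured, but every action keeps a nonzero
    chance of being tried. One possible reading of 'chosen probabilistically
    based on the estimated Q value'; T is an assumed parameter.
    """
    prefs = [math.exp(Q[(state, a)] / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]
```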
As discussed in Section 2, the problem of multi-agent territory division is similar to the k-Chinese postman problem, which is classified as NP-complete, so the learning approach is expected to face similar complexity issues. However, the learning approach has advantages over a brute-force search algorithm. First, it does not require a rigorous understanding of all the problem details; the proposed model can be applied to different variations of task decomposition with only minor adjustments. Second, and more importantly, the learning process can give a good sub-optimal solution within a reasonable time, and this sub-optimal solution approaches the optimal one as the learning time limit is increased at a reasonable rate; this is not the case with a brute-force search algorithm. This behaviour can be observed in the results of Figure 5.

As the problem grows in the number of hiding places and the map size, the number of learning episodes required until convergence increases sharply. For example, map (a) of size 4×4 has only 3 hiding places and required only 5 episodes until convergence, whereas map (f) of size 10×10 has 8 hiding places and required 30,000 episodes, although a good sub-optimal solution very close to the optimal one was found within 1,500 episodes.

The growth of time and space complexity with the problem size can be tackled by generalization techniques such as state aggregation, which amounts to adding another layer to the hierarchical structure discussed in Section 4. We plan to apply our previous work on state aggregation for Reinforcement Learning [8], which exploits the idea that agents can group similar states into one abstract state and then learn at the abstract level, which is much smaller in size. In Hide-and-Seek, some hiding places lie close to each other, so a seeker should clearly search all of them together; such nearby hiding places can form a group used for an abstract level of learning. Since state aggregation is an approximation technique, a slight compromise may be needed between the space-reduction efficiency and absolute optimality.
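As an illustration of how such groups could be formed, the sketch below clusters hiding places whose pairwise grid distance falls under a chosen threshold. The threshold and the use of Manhattan distance are assumptions for illustration, not details taken from the paper or from [8].

```python
def group_hiding_places(places, max_dist=2):
    """Greedily merge hiding places whose Manhattan distance is at most
    `max_dist` into one abstract state (simple union-find grouping).
    Illustrative only: the threshold and metric are assumed.
    """
    parent = {p: p for p in places}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    for i, a in enumerate(places):
        for b in places[i + 1:]:
            if abs(a[0] - b[0]) + abs(a[1] - b[1]) <= max_dist:
                parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for p in places:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

# e.g. group_hiding_places([(1, 1), (1, 2), (7, 8)]) -> [[(1, 1), (1, 2)], [(7, 8)]]
```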
6   Conclusion
In this paper, the problem of multi-agent territory (or task) division is discussed. This problem appears in various real robotics applications that can be modeled as a simple Hide-and-Seek game. The target is not to perform a full coverage of the terrain, but to scan only the possible hiding places so that the total expected time for finding the hider is minimized. This requires finding the optimal sequence for visiting the hiding places according to the hiding probabilities. We present a new learning approach that solves this problem using Reinforcement Learning and illustrate how the learning system is built in a hierarchical structure, in which the elemental tasks of path planning to each hiding place are learnt first in the lower layer, and then the composite task of finding the optimal visiting sequence is learnt in the higher layer. The full model is described, including the state and action definitions and the reward function for both single and multiple agents. A revised version of the standard Q-Learning update rule is also discussed; this new equation is essential to deal with the problem of unequal time steps for the agents' actions. The proposed approach is examined on a set of different maps, on which the learning process converged to the complete solution. The impact of the problem size on the complexity of the system is also studied. The discussion of the results shows that the proposed learning approach has advantages over algorithmic techniques: it is easier to design, it can be applied to different variations of task decomposition, and it gives a good near-optimal solution within a reasonable time for problems with a large space.
Acknowledgements. This work has been funded by a Mitsubishi Corporation International Scholarship grant to E-JUST, which is gratefully acknowledged.
References

1. Choset, H.: Coverage for robotics – A survey of recent results. Annals of Mathematics and Artificial Intelligence 31(1–4), 113–126 (2001)
2. Agmon, N., Hazon, N., Kaminka, G.A.: Constructing spanning trees for efficient multi-robot coverage. In: Proceedings of the 2006 IEEE International Conference on Robotics and Automation (ICRA) (2006)
3. Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8(3–4), 279–292 (1992)
4. Singh, S., Jaakkola, T., Littman, M.: Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning 38(3), 287–308 (2000)
5. Singh, S.P.: The efficient learning of multiple task sequences. In: Advances in Neural Information Processing Systems 4, pp. 251–258 (1992)
6. Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44 (1988)
7. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische Mathematik 1, 269–271 (1959)
8. Gunady, M.K., Gomaa, W.: Reinforcement learning generalization using state aggregation with a maze-solving problem. In: Proceedings of the IEEE Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC), pp. 157–162 (2012)
9. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: A survey. arXiv preprint cs/9605103, pp. 237–285 (1996)
10. Mitchell, T.M.: Machine Learning. McGraw-Hill Science/Engineering/Math (1997)
11. Watkins, C.J.C.H.: Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England (1989)
12. Singh, S.P.: Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8, 323–339 (1992)
13. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)