A Reinforcement Learning Based Method for Optimizing the Process of Decision Making in Fire Brigade Agents

Abbas Abdolmaleki 1,2, Mostafa Movahedi 5, Sajjad Salehi 6, Nuno Lau 1,4, Luís Paulo Reis 2,3

1 IEETA – Institute of Electronics and Telematics Engineering of Aveiro, Portugal
2 LIACC – Artificial Intelligence and Computer Science Lab., Porto, Portugal
3 DEI/FEUP – Informatics Engineering Dep., Faculty of Engineering, Univ. of Porto, Portugal
4 DETI/UA – Electronics, Telecommunications and Informatics Dep., Univ. of Aveiro, Portugal
5 Sheikh Bahaee University, Department of Computer Engineering, Isfahan, Iran
6 Young Researchers Club, Qazvin Branch, Islamic Azad University, Qazvin, Iran
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract. Decision making in complex, multi-agent, and dynamic environments such as disaster spaces is a challenging problem in Artificial Intelligence. Uncertainty, noisy input data, and stochastic behavior, which are common characteristics of such environments, make real-time decision making even more complicated. In this paper, an approach based on reinforcement learning is presented to address the dynamicity and variety of conditions in such situations. The method is applied to the decision-making process of the RoboCup Rescue Simulation fire brigade agent, which learns a good strategy to save civilians and the city from fire. The utilized method increases the speed of learning and has very low memory usage. The effectiveness of the proposed method is shown through simulation results.

Keywords: RoboCup Rescue Simulation, Fire Brigade, Decision Making, Reinforcement Learning

1. Introduction

Disaster rescue is one of the most serious social issues and involves very large numbers of heterogeneous agents in a hostile environment. The intention of the RoboCupRescue project is to promote research and development in this socially significant domain at various levels involving multi-agent team work and coordination. In the RoboCup Rescue Simulation League, a generic urban disaster simulation environment was constructed. Heterogeneous intelligent agents such as fire brigades, police forces, ambulances and civilians conduct search and rescue activities in this virtual disaster world [1, 2]. In rescue simulation, agents must perform a sequence of actions to fulfill their tasks efficiently, such as extinguishing fires or rescuing injured civilians. Each action alters the environment and also influences the choice of the next action. The goal of the agents is to achieve the best score at the end of the simulation, so they should perform the sequence of actions that leads to that best score.

Fire brigades are responsible for controlling the fire. The spread of fire depends on wind speed and wind direction. Due to fire, buildings get damaged and collapse, which results in blocked roads. Additionally, fire can cause civilian deaths and injure agents. The most important issue is to select and extinguish the best fiery point in order to reduce the damage to civilians and the city. In this paper we present an approach to select, in each cycle, the best fiery building to extinguish, which finally leads to the best score.

So far, many methods have been proposed for the decision making of fire brigade agents. As claimed in [3], since fires start in separate locations and spread outwards, the buildings on fire tend to form clusters around their respective points of origin. Because smaller clusters of less intense fires are much easier and more vital to extinguish, that work proposes to put out the smallest fiery points first and then start to control the larger fiery areas. Although this idea may provide good performance in saving a city from fire, it does not consider civilian positions; hence it gives no guarantee of saving civilians from fire. Another method is to prioritize fire sites from a fire brigade's perspective [4]. In this method, each fire brigade assigns a priority to each fire site based on its properties, such as the volume of fire and the civilians around it, and selects the critical building in the highest-priority fire site to extinguish. Because this process is repeated in each time step, it consumes a lot of memory and CPU. In [5] a decision tree for priority extraction is proposed; decision trees have been widely used for decision making in multi-agent systems, but the features extracted in that paper to describe the environment are not sufficient. In [6] an evolving fuzzy neural network is proposed for fire prediction and fire selection. Fuzzy logic has been used to solve many decision-making problems, and given the intricacy of fire prediction and selection for rescue fire brigade agents, this method seems well suited to the problem. To coordinate agents, three options are proposed in [7]: environment partitioning, centralized direct supervision, and decentralized mutual adjustment. Among these three approaches the decentralized one is more flexible, but that does not mean it is always better. In [8] a hybrid approach is proposed to exploit the advantages of both centralized and decentralized approaches. In order to decide how many ambulances should cooperate to rescue a civilian, evolutionary reinforcement learning is utilized in [9]. In [10] the fire brigade learns to do task allocation and to choose the best building to extinguish in order to maximize the score; the learning features include the number of civilians in the building, the area of the building, the fieriness level of the building, and the building material. In [11] fire brigades learn how to distribute themselves in the city using neural reinforcement learning.

The major contribution of this paper is a novel algorithm based on reinforcement learning that lets fire brigade agents save the city and civilians from fire. Given the limits on the agents' cognition of the environment and the fact that agents do not know the effect of their actions on the environment, reinforcement learning seems a good method to solve this problem.
Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of supervised machine learning, but instead must discover which actions yield the most reward by trying them [12]. The mechanism we have used to train the agents is based on the Temporal Difference (TD) method of reinforcement learning [12]. Because TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap), this method is applicable to complex environments like rescue simulation, in which an action must be selected in each cycle.

The rest of this paper is organized as follows. Section 2 explains the test bed used to implement and test our approach. Section 3 gives a brief explanation of reinforcement learning. Section 4 discusses in detail how to design and model the problem for applying reinforcement learning. Section 5 presents the implementation details and the achieved results. Section 6 concludes.

2. Test bed

RoboCup Rescue Simulation simulates the effects of an earthquake in a city. The aim of creating this environment is to learn the best rescue strategies for humans, as well as tactics for extinguishing the fires caused by earthquakes in real situations. It is currently a major league in the RoboCup simulation competitions. The simulation consists of a kernel, which is the main part of the simulation; several simulators that simulate the earthquake, fire, blockades and civilians of the city; and a viewer that shows how the city changes over time. There are three kinds of agents in the simulation that perform the rescue job: Police Force agents, Ambulance agents and Fire Brigade agents. Police Force agents control the traffic in the city and clear the blockades of closed roads. The Ambulance agents' duty is to find injured civilians, give them first aid until they are carried out of buried buildings, and carry them to refuges. The main job of the Fire Brigade agents is to extinguish the fires caused by the earthquake. The performance score of a simulation is usually computed from the number of civilians still alive and the safe areas of the city. When the fire brigade agents want to extinguish fire, they have to order the set of burning buildings and extinguish them one after another. We have applied our method to this environment to find an optimal policy for performing the best action in each state of the environment.

Fig. 1 Rescue Simulation Environment

Fig. 1 shows a screenshot of the simulation environment. It displays a map of Kobe city after an earthquake. The blue, white and red circles represent the police force agents, ambulance agents and fire brigade agents, respectively. The bright and dull green circles show healthy and hurt civilians, and black circles represent dead civilians. Yellow, orange, dark red and black buildings represent the fieriness level of a building: yellow shows the lowest fieriness level and black shows a burned-out building. There is a special type of building: buildings marked with a home icon are refuges, where saved civilians are taken. Black areas on the roads represent blockades. The simulation runs for 300 time steps.

In order to reduce the complexity of the simulation process, we designed and implemented a simple rescue simulation environment that has all the features we need, such as a fire simulator. Our simple rescue simulation is thus much faster than the official rescue simulation while keeping the necessary capabilities. Fig. 2 shows a screenshot of the simple rescue simulation. This environment has a fire simulator with the same algorithms as the official Rescue Simulation, so the burning of buildings and the spread of fire are the same as in the original Rescue Simulation. The algorithm used to calculate the health of civilians is also the same as the one in the misc simulator of the Rescue Simulation.

Fig. 2. Simple rescue simulation

The training of the agent and the test phases are performed in this simulator, and the result, with small modifications, is used in the original rescue simulation.
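To make the structure of such a simplified test bed concrete, the sketch below outlines one possible per-cycle loop: spread the fire, let the agent act, and update civilian health. It is a minimal illustration only; the class and method names (SimpleWorld, spread_fire, apply_extinguish, and so on) are hypothetical and are not taken from the paper or from the official simulators.

```python
# Minimal sketch of a simplified rescue-simulation loop (hypothetical API).
# The toy fire-spread and health updates stand in for the real simulators.

class SimpleWorld:
    def __init__(self, buildings, civilians, max_steps=300):
        self.buildings = buildings    # e.g. dicts: {"fieriness": 0..3, "volume": float}
        self.civilians = civilians    # e.g. dicts: {"hp": 1000, "building": index}
        self.max_steps = max_steps

    def spread_fire(self):
        # Placeholder for the fire simulator: burning buildings get hotter.
        for b in self.buildings:
            if 0 < b["fieriness"] < 3:
                b["fieriness"] += 1   # toy dynamics, not the real fire model

    def update_civilians(self):
        # Civilians inside burning buildings lose health points each cycle.
        for c in self.civilians:
            if self.buildings[c["building"]]["fieriness"] > 0:
                c["hp"] = max(0, c["hp"] - 50)

    def apply_extinguish(self, building_index):
        # Extinguishing resets the fieriness of the targeted building.
        self.buildings[building_index]["fieriness"] = 0


def run_episode(world, agent):
    """Run one simulation episode with a decision-making agent."""
    for step in range(world.max_steps):
        world.spread_fire()
        target = agent.choose_target(world)   # the decision studied in this paper
        if target is not None:
            world.apply_extinguish(target)
        world.update_civilians()
```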

3. Reinforcement Learning

Reinforcement learning is a branch of machine learning in which an agent tries to find the optimal actions to achieve a goal, without having full knowledge of the environment or of the impact of its actions on it. In this type of learning the agent is not told how to perform a task, but rather what to achieve; it tries to achieve the goal by optimizing the rewards it receives. In [12] three fundamental classes of methods for solving reinforcement learning problems are described: dynamic programming (DP), Monte Carlo methods, and temporal-difference (TD) learning. Each class of methods has its strengths and weaknesses. Dynamic programming methods are well developed mathematically, but require a complete and accurate model of the environment. In rescue simulation, agents do not have a complete view of the world model, so this class of methods cannot be applied. Monte Carlo methods do not require a model and are conceptually simple, but they are not suited for the step-by-step incremental computation required in rescue simulation: they do not update their value estimates on the basis of the value estimates of successor states, i.e. they do not bootstrap. Finally, temporal-difference methods require no model and, like DP methods, update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap). We therefore chose this class of methods for the fire brigade agent.

4. Design

This section presents the necessary design issues, including the description of the environment, the design of the reward function and the learning algorithm used in our approach. Finally, the proposed learning process for the problem, which we call lesson-by-lesson learning, is presented.

4.1 Description of Environment

Although temporal-difference methods do not require a model of the environment, they do need a clear description of its states. So in this part we describe the rescue simulation environment with a set of discrete and finite states. The states of the environment are modeled by the following features:

1. freeEdges: The number of edges from which the fire can spread, represented by 0, 1, 2, 3 or 4. For example, in Fig. 2 the fire can spread only from the left side, so the number of freeEdges is 1.
2. distanceFromCenter: The distance of the nearest fiery building to the center of the city, which is low, medium or high.
3. distanceFromCivilian: The distance of the nearest fiery building to the nearest civilian, which is low, medium or high.
4. volumeOfFieryBuildings: The total volume of fiery buildings, which is veryLow, low, medium, high, veryHigh or huge.

So there are 5×3×3×6=270 different states. These states can describe all conditions in all maps. After defining the potential states of the environment, the possible actions of the agents should be determined as well. In each state an agent can perform one of the following actions:

1. extinguishEasyestBuilding: Extinguish the building that has the lowest temperature.
2. extinguishNearestToCivilian: Extinguish the fiery building nearest to a civilian.
3. extinguishNearestToCenter: Extinguish the fiery building nearest to the center of the city.
4. extinguishNearestToMe: Extinguish the fiery building nearest to the fire brigade.

These four actions are the most reasonable actions a fire brigade can take in different situations. The problem is to find the best action in each state of the environment.
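For illustration, the following sketch shows one way the discretized state could be assembled from these four features and mapped to a single index into a Q-table. The threshold values and helper names are hypothetical; they only indicate the kind of discretization described above, not the authors' actual implementation.

```python
# Hypothetical sketch of the state encoding used to index the Q-table.
# Threshold values are illustrative; the paper does not specify them.

ACTIONS = [
    "extinguishEasyestBuilding",
    "extinguishNearestToCivilian",
    "extinguishNearestToCenter",
    "extinguishNearestToMe",
]

def discretize(value, thresholds):
    """Map a continuous value to a bin index (0 = lowest)."""
    for i, t in enumerate(thresholds):
        if value < t:
            return i
    return len(thresholds)

def encode_state(free_edges, dist_center, dist_civilian, total_volume):
    """Return a single integer in [0, 270) identifying the discrete state."""
    c = discretize(dist_center, [50.0, 150.0])               # low / medium / high
    v = discretize(dist_civilian, [50.0, 150.0])             # low / medium / high
    b = discretize(total_volume, [1e3, 5e3, 1e4, 5e4, 1e5])  # 6 volume bins
    # 5 * 3 * 3 * 6 = 270 states
    return ((free_edges * 3 + c) * 3 + v) * 6 + b
```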

4.2 Reward function

The reward function plays an important role in RL, as it directs the learning process towards a solution. In a burning city the situation gets worse over time, hence the agent always receives penalties, and the fire brigade must minimize the penalties it accumulates. The reward function is defined as follows:

- For each water-damaged building (a building that was extinguished but damaged by too much water), the agent receives a reward of -1 in each time step.
- For each fiery building, the agent receives a reward of -2 in each time step.
- For each burned-out building, the agent receives a reward of -3 in each time step.
- For each damaged civilian, the agent receives a reward of HP - 1000 in each time step. The HP (Health Points) of a healthy civilian is 1000, and if the building containing a civilian ignites, the civilian's HP decreases over time.



The first reward encourages the fire brigade to put out fires without wasting water, the second and third rewards force the fire brigade to put out fires as soon as possible, and the last reward drives the fire brigade to save civilians from the fire.
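As a concrete illustration, the per-cycle reward described above could be computed as in the sketch below. The building and civilian record fields (water_damaged, fieriness, burned_out, hp) are hypothetical placeholders for whatever world-model representation the agent keeps.

```python
# Hedged sketch of the per-cycle reward described in Section 4.2.
# Field names are illustrative, not taken from the authors' code.

def cycle_reward(buildings, civilians):
    reward = 0.0
    for b in buildings:
        if b.get("water_damaged"):
            reward -= 1                    # extinguished but damaged by too much water
        elif b.get("burned_out"):
            reward -= 3                    # completely burned building
        elif b.get("fieriness", 0) > 0:
            reward -= 2                    # still burning
    for c in civilians:
        if c["hp"] < 1000:                 # damaged civilian
            reward += c["hp"] - 1000       # negative, approaches -1000 as HP drops
    return reward
```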

4.3 Learning algorithm

Learning here does not mean that the agents learn how to perform an action; rather, they learn which behavior to perform at which time and in which situation. Given the environment state, the fire brigade agent learns which of the mentioned actions is most valuable (leads to the best score) in that state. To do so, the agents are trained with an on-policy TD control method called SARSA [12]. The Q value of each action in each state is updated by formula (1), taking into account the received rewards or penalties:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r + γ Q(s_t+1, a_t+1) − Q(s_t, a_t) ]        (1)

In the above equation, s_t is the previous state before taking action a_t, s_t+1 is the current state reached after taking that action, r is the reward received from the environment after taking action a_t in the previous state and reaching the current state, a_t+1 is the action selected in the new state, Q is the Q-table, α is the learning rate and γ is the discount factor. The learning rate specifies to what extent newly learned information overrides previously learned information; it is set between 0 and 1. Setting it to 0 makes the agent learn nothing, while a high value makes the agent learn quickly. The discount factor determines to what extent future rewards are important; it is also set between 0 and 1, and values below 1 imply that immediate rewards are worth more than future rewards. The learning algorithm is as follows:

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using the policy derived from Q (e-greedy)
    Repeat (for each step of the episode):
        Take action a, observe reward r and next state s'
        Choose a' from s' using the policy derived from Q (e-greedy)
        Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
        s ← s';  a ← a'
    until s is terminal
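The update and the e-greedy selection can be written compactly as in the following sketch. It is a generic tabular SARSA implementation under the state/action encoding assumed earlier, not the authors' actual code; the parameter defaults reflect the values reported in Section 5.1.

```python
import random

# Generic tabular SARSA sketch (not the authors' implementation).
# 270 discrete states and 4 actions, as defined in Section 4.1.

N_STATES, N_ACTIONS = 270, 4
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def epsilon_greedy(state, epsilon):
    """With probability epsilon explore, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.7, gamma=0.5):
    """Formula (1): Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

# Example of one learning step (hypothetical usage):
#   s = encode_state(free_edges, dist_center, dist_civilian, total_volume)
#   a = epsilon_greedy(s, epsilon=0.2)
#   ... execute action a, observe reward r and next state s_next ...
#   a_next = epsilon_greedy(s_next, epsilon=0.2)
#   sarsa_update(s, a, r, s_next, a_next)
```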

This algorithm uses the e-greedy action selection mechanism: with probability e the agent takes a random action (exploration), and otherwise it takes the action with the highest value according to its previous experience (exploitation).

4.4 Lesson-by-Lesson Learning

In reinforcement learning methods the rewards guide the agent: if the agent gets a good reward it understands that it performed the right action sequence, and if it gets a bad reward it understands that it performed the wrong one. If the agent gets bad rewards most of the time, it cannot find the right actions. In a very complicated environment such as rescue simulation the search space of the problem is huge, so the agent may get bad rewards most of the time and fail to learn the optimal policy. To overcome this problem we organized our learning phase in a lesson-by-lesson mode, consisting of the following six steps:

1. First, teach the agent to extinguish some easy scenarios that have just one fiery point in the city. After a while the agent finds the optimal policy to extinguish that fire.
2. Second, find the optimal policy for scenarios that have several fiery points in different parts of the city with different fieriness levels.
3. Third, find the optimal policy for scenarios that have several fiery points in different parts of the city and one civilian.
4. Fourth, find the optimal policy for scenarios that have several fiery points and several civilians in different parts of the city.
5. In the fifth step, the extinguishing distance is restricted as it is in rescue simulation, so the fire brigade has to move close to the fiery points to extinguish them.
6. In the last step, the tank capacity of the fire brigade is limited, as is the case in rescue simulation. This means that after a while the fire brigade runs out of water and has to refill its tank, a process that usually takes between 10 and 20 cycles depending on the distance to a refuge.

In each step the fire brigade uses the Q-table obtained in the previous step as its initial knowledge, and the output is an improved Q-table.
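One way to realize this staged training is sketched below: each lesson reuses the Q-table produced by the previous one as its starting point. The scenario-loading and episode-running helpers are hypothetical; only the idea of carrying the Q-table across lessons and decaying exploration per episode comes from the paper.

```python
import copy

# Hedged sketch of lesson-by-lesson training: each stage starts from the
# Q-table learned in the previous stage. Names are hypothetical placeholders.

def train_lesson(q_table, scenarios, run_episode_fn, episodes, epsilon0=1.0, decay=0.9):
    """Train on one lesson's scenario set, decaying exploration each episode."""
    epsilon = epsilon0
    for _ in range(episodes):
        for scenario in scenarios:
            run_episode_fn(scenario, q_table, epsilon)   # SARSA loop from Section 4.3
        epsilon *= decay                                 # explore less over time (cf. Section 5.1)
    return q_table

def lesson_by_lesson(lessons, run_episode_fn, episodes_per_lesson=200):
    """lessons: list of scenario sets, ordered from easiest to hardest."""
    q_table = [[0.0] * 4 for _ in range(270)]            # zero-initialized, as in Section 5.1
    for scenarios in lessons:
        # Each lesson starts from the knowledge accumulated so far.
        q_table = train_lesson(copy.deepcopy(q_table), scenarios,
                               run_episode_fn, episodes_per_lesson)
    return q_table
```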

5. Implementation and results

This section presents the results obtained with the described method in the simplified and the original rescue simulation, and discusses the advantages of lesson-by-lesson learning.

5.1 Implementation in the simplified rescue simulation

In order to train the fire brigade, the parameters γ and α were set to 0.5 and 0.7, respectively, based on trial and error. The value of the e factor in the e-greedy selection algorithm is updated in each episode by the formula e_n = e_(n-1) × 0.9. This causes the agent to explore the search space more in the early episodes and, in the later episodes, to rely more on the experience obtained previously and converge to a good solution. In order to train the agents, five scenario sets were chosen, each containing 4 maps, as explained in the lesson-by-lesson learning section. In the first episode the Q-table is initialized to zero, and in subsequent episodes the previously obtained Q-table is used for initialization. Each training phase ends when the agent converges to an optimal policy for a scenario. After training on all scenario sets with the lesson-by-lesson approach, the agent has learned an appropriate action for each state.

To test the trained agent, 2 sample scenarios were used: one with 3 fire points (2 near a civilian and 1 far from the civilian) and another with 4 fire points (3 near the civilian and 1 far from the civilian). In these scenarios, five other extinguishing strategies (random action selection, extinguish the nearest fire to the civilian, extinguish the nearest fire to the center, extinguish the easiest fire, and extinguish the nearest fire to me) were tested for comparison with the trained agent's strategy. The results are shown in Table 1; the trained agent obtains better results than all other agents.

Figures 3, 4, 5, 6, 7 and 8 show snapshots of the end of the simulation in scenario 1 using the different fire extinguishing strategies. Fig. 3, which shows the final result of the RL-based agent, makes it clear that the fire is extinguished and the civilian is alive. This strategy has good performance and, considering the chosen actions, states and our method, it is the best policy among the possible ones. Fig. 4 shows the result of the extinguish-nearest-fire-to-civilian strategy: although the fire was extinguished before it could harm the civilian, about half of the city burnt out, so it does not seem to be a good strategy.

Table 1 Map score based on the strategy performed

Strategy                     Score (scenario 1)   Score (scenario 2)
Trained Firebrigade               1.953                1.509
Nearest fire to civilian          1.725                1.215
Nearest fire to center            0.889                0.367
Easiest fire                      0.306                0.400
Nearest fire to me                0.695                0.541
Random action selection           0.685                0.567

Fig. 5 shows the performance of the random strategy: about half of the city is burnt out and the fire has reached the civilian, so it is not a good strategy at all. Fig. 6 shows the performance of the extinguish-nearest-fire-to-center strategy: the fire is extinguished well, but a civilian is left trapped in one of the fiery buildings, so it has no advantage over our method. Figs. 7 and 8 show the results of the extinguish-easiest-fire strategy and the extinguish-nearest-fire-to-me strategy; it is obvious that the fire brigade could not save the city and the civilian well, so they are not acceptable strategies.

Figure 3. Learned Fire Brigade

Figure 5. Random Strategy

Figure 7. Extinguishing easiest fire

Figure 4. Extinguishing nearest fire to civilian

Figure 6. Extinguishing the nearest fire to center

Figure 8. Extinguishing nearest to me

5.2 Implementation in the original rescue simulation

In order to assess the performance of our approach in the official rescue simulation, we selected two standard maps and compared the fire extinguishing performance of our agent, using the trained Q-table, with the strategies of the first three teams of the 2010 RoboCup Rescue Simulation world competition in Singapore, and also with the previous strategy of our own team (champion of IranOpen 2011). The scores were obtained by running the released code of the RoboAKUT, IAMRescue and ZJUBase rescue teams. To use the Q-table obtained in the simple rescue simulation in the official rescue simulation, it first has to be adapted to this environment, so we trained the agent for 50 additional episodes in the original rescue simulation, using the Q-table obtained in the simple rescue simulation as the initial table. The parameters were set to the same values as in the simplified rescue simulation. Table 2 shows the results of the experiments on both maps. On all test cases our agent shows better results when using the trained Q-table. This demonstrates that the reinforcement learning system tends to refine and keep a better strategy for the fire brigade agent.

Table 2 Results of the experiments

Team                  Kobe1     Kobe2
BraveCircles(Old)     37.116    11.892
BraveCircles(New)     43.698    14.972
RoboAKUT              24.692     9.5
IAMRescue             38.862     8.93
ZJUBase               27.475     9.284

5.3 Discussion of lesson-by-lesson learning

To show the advantage of lesson-by-lesson learning, we compare the learning speed of two different agents: an experienced agent, which has completed the first five steps of the lesson-by-lesson process, and a newbie agent, which has not been trained at all. After limiting the tank capacity of the agent (step 6) and starting the training process for both agents, we observed that the learning speed of the experienced agent was much higher than that of the newbie agent. Table 3 compares the learning speed of the experienced and the newbie agent in 3 different scenarios. The results show that an experienced agent can adapt to changes faster than a newbie agent. Figs. 9 and 10 show the convergence of the experienced and the newbie agent in the same scenario; the horizontal axis is the episode and the vertical axis is the score.

Table 3 expert vs. newbie fire brigade

Map             Measure                   Expert Firebrigade   Newbie Firebrigade
Map1 (easy)     Score                           2.853                2.853
                Episode of convergence             83                   95
Map2 (medium)   Score                           1.888                1.888
                Episode of convergence             56                  220
Map3 (hard)     Score                           2.907                2.907
                Episode of convergence            188                 4760

Figure 9. Expert Firebrigade (Map3)

Figure 10. Newbie fire brigade (Map3)

6. Conclusions and Perspectives for Future Research

In this paper we discussed the use of temporal-difference learning to find an optimal policy for the fire extinguishing task of fire brigades. The results show that the agent trained with TD achieves good performance in extinguishing fires. The method also increases the learning speed and has very low memory usage. In addition, we used a lesson-by-lesson learning process, and according to the results the experienced agent can learn the optimal policy much faster than the newbie agent.

As a next step, we plan to find the best parameters for the RL method using optimization algorithms such as genetic algorithms and PSO. We also intend to apply this method to other kinds of agents, such as police and ambulance agents, to find optimal policies for their tasks. Because the rescue simulation environment is a multi-agent system, coordination between agents is an important issue, so in the future we are going to design and implement an RL-based method to achieve an optimal policy for coordinating the different agents.

7. Acknowledgment

We would like to acknowledge the support of FCT – Fundação para a Ciência e Tecnologia, through the project "Intelligent wheelchair with flexible multi modal interface" under grant FCT/RIPD/ADA/109636/2009.

References

1. Kitano, H., Tadokoro, S.: RoboCup Rescue: A grand challenge for multiagent and intelligent systems. AI Magazine 22(1), 39–52 (2001)
2. Takeshi, M.: How to develop a RoboCupRescue agent (2000)
3. Nanjanath, M., Erlandson, A.J., Andrist, S., Ragipindi, A., Mohammed, A.A., Sharma, A.S., Gini, M.: Decision and coordination strategies for RoboCup Rescue agents. In: Simulation, Modeling, and Programming for Autonomous Robots (2010)
4. Delle Fave, F.M., Packer, H., Pryymak, O., Stein, S., Stranders, R., Tran-Thanh, L., Vytelingum, P., Williamson, S.A., Jennings, N.R.: RoboCupRescue 2010 Rescue Simulation League Team Description: IAMRescue (United Kingdom) (2010)
5. Shahbazi, H., Zafarani, R.: Priority extraction using delayed rewards in multi agent systems: A case study in RoboCup. In: CSICC'06, Iran, pp. 571–574 (2006)
6. Shahgholi Ghahfarokhi, B., Shahbazi, H., Kazemifard, M., Zamanifar, K.: Evolving fuzzy neural network based fire planning in rescue firebrigade agents. In: SCSC 2006, Canada (2006)
7. Paquet, S., Bernier, N., Chaib-draa, B.: Comparison of different coordination strategies for the RoboCup Rescue simulation. In: Innovations in Applied Artificial Intelligence, vol. 3029. Springer-Verlag (2004)
8. Mohammadi, Y.B., Tazari, A., Mehrandezh, M.: A new hybrid task sharing method for cooperative multi agent systems. In: Canadian Conf. on Electrical and Computer Engineering (May 2005)
9. Martínez, I.C., Ojeda, D., Zamora, E.A.: Ambulance decision support using evolutionary reinforcement learning in RoboCup Rescue Simulation League. In: RoboCup 2006: Robot Soccer World Cup X. Springer-Verlag (2007)
10. Paquet, S., Bernier, N., Chaib-draa, B.: From global selective perception to local selective perception. In: AAMAS, pp. 1352–1353 (2004)
11. Amraii, S.A., Behsaz, B., Izadi, M.: S.O.S 2004: An attempt towards a multi-agent rescue team. In: Proc. 8th RoboCup Int'l Symposium (2004)
12. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
