Comparison of reinforcement and supervised learning methods in Farmer-Pest problem with delayed rewards

Bartłomiej Śnieżyński

AGH University of Science and Technology
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science
Al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected]
Abstract. In this paper we propose a method based on the time-window idea which allows agents to generate their strategy using supervised learning algorithms in environments with delayed rewards. It is universal and can be used in various environments. The learning speed of the proposed method and of a reinforcement learning algorithm are compared on the Farmer-Pest problem with delayed rewards. The Farmer-Pest problem is chosen for the comparison because it is designed especially for benchmarking learning algorithms. It has several dimensions which change the environment characteristics and allow algorithms to be tested in various conditions. This paper presents results for one reinforcement learning method (SARSA) and three supervised learning algorithms (Naïve Bayes, C4.5 and Ripper). These algorithms are tested on configurations of various complexity.

Keywords: agent learning, supervised learning, reinforcement learning
1 Introduction
The problem of learning naturally appears in multi-agent systems [11], which are efficient architectures for decentralized problem solving. In complex or changing environments it is very difficult, sometimes even impossible, to design all system details a priori. To overcome this problem one can apply a learning algorithm which allows the system to adapt to the environment. To apply learning in a multi-agent system, one should choose a learning method that fits the problem well. Many algorithms have been developed so far. However, in multi-agent systems most applications use reinforcement learning or evolutionary computation [7]. These methods are relatively slow and the knowledge learned is difficult for humans to analyze.

The goal of this research is to show that supervised learning algorithms may be applied by agents to generate their strategy in an environment with delayed rewards. We also compare the learning speed of supervised and reinforcement learning algorithms. For this purpose we use the Farmer-Pest problem, a learning environment designed especially for benchmarking.
Reinforcement learning is designed to work well in environments with delayed rewards. For supervised learning this is not straightforward. In this paper we propose a solution based on a time-window approach that takes several time stamps into account. It makes the search space larger, but supervised learning algorithms still achieve good results faster than reinforcement learning. Another advantage is that the learned knowledge is readable.

This paper is a continuation of [15], where a comparison of a reinforcement learning algorithm (SARSA) and three supervised learning algorithms (Naïve Bayes, C4.5 and Ripper) was performed in an environment without delays. The same learning algorithms are compared here. In our research, we make the following contributions to the state of the art: we propose a method for applying supervised learning to generate an agent strategy in an environment with delayed rewards; we present a multi-dimensional domain allowing comparison of learning algorithms; we show that methods other than reinforcement learning can be used for strategy generation; and we compare the learning algorithms in configurations of various complexity and show that supervised learning can improve the performance of agents much faster than reinforcement learning.

In the following sections we overview learning in multi-agent systems, present how supervised learning may be applied to problems with delayed rewards, and describe the proposed problem domain. Next, experimental results are described. Finally, conclusions and further work are outlined.
2 Learning Agents
A good survey of learning in multi-agent systems working in various domains can be found in [7] and [11]. Experiments are performed in various environments, using many kinds of learning algorithms. Below we illustrate the variety of these works using three example domains.

The soccer domain is very popular in multi-agent systems. The environment consists of a soccer field with two goals and a ball. Two teams of simulated or real robots are controlled by the agents. The performance is measured by the difference in scored goals. In [6] genetic programming is utilized to learn behavior-based team coordination. In [10] the C4.5 algorithm is used to classify the current opponent into predefined adversary classes. Reinforcement learning can also be applied. A good example is [16], where a "keep away" version of the domain is used: agents learn when to hold the ball and when to pass it. This domain appears to be difficult for learning algorithms: the state space is large, and uncertainty and long delays of action effects have to be taken into account.

Another environment often used in research is the Predator-Prey domain. It is a simple simulation with two types of agents: predators and preys. The aim of a predator is to hunt for a prey. A prey is captured if a predator (or several predators, if cooperation is tested) is close enough. In [18] predator agents use reinforcement learning to learn a strategy minimizing the time needed to catch a prey. Additionally, agents can cooperate by exchanging sensor data, strategies or episodes. Experimental results show that cooperation is beneficial. Other researchers working
on this domain successfully apply genetic programming [5] and evolutionary computation [4]. Results of rule induction applied in this domain can be found in [13, 14].

Target observation is another interesting problem domain. A good example is [21], where rules are evolved to control large-area surveillance from the air. In [8] Parker et al. present a cooperative observation task to test the autonomous generation of cooperative behaviors in robot teams. Agents cooperate to keep targets within a specific distance. Lazy learning based on reinforcement learning is used to generate a strategy better than a random one, but worse than a manually developed one. Results of applying reinforcement learning combined with state space generalization can be found in [3].
3 Reinforcement Learning
The classical learning method in MAS is reinforcement learning. There are several algorithms using this strategy. In the experiments we used the SARSA algorithm [17], which is described below. Knowledge is stored in a function Q that estimates the quality of an action in a given state: Q : Act × X → R, where Act is a set of actions, X is a set of possible states, and R is the set of real numbers. The reinforcement learning agent gets a description of the current state x_t ∈ X and, using its current strategy, chooses an appropriate action a_t ∈ Act. Usually, the action with the highest Q value is chosen. Next, using the reward r_t obtained from the environment, the next state description x_{t+1} ∈ X, and the action a_{t+1} ∈ Act that will be executed next, it updates its strategy:

Δ := r_t + γ Q(a_{t+1}, x_{t+1}) − Q(a_t, x_t)    (1)
Q(a_t, x_t) := Q(a_t, x_t) + β Δ    (2)
where γ ∈ [0, 1] is a discount rate (the importance of future rewards) and β ∈ (0, 1) is a learning rate. The reward characteristics depend on the problem. The reward represents the quality of the action: it is positive in case of achieving a goal (like hitting the target) and negative in case of a failure (like hitting an obstacle).

To speed up the learning process various techniques have been developed. One of them is the temporal difference mechanism TD(λ > 0) [20], which updates not only the last state but also those visited recently. The parameter λ ∈ [0, 1] is a recency factor; values close to 0 mean that the traces are very short.

In reinforcement learning various techniques are also used to prevent getting stuck in a local optimum. The idea is to explore the solution space better by choosing non-optimal actions from time to time (e.g. random ones, or ones not yet performed in a given state). Boltzmann selection [17] is used in the experiments for this purpose. Instead of selecting the action with the highest value, the action a* is selected in state x with probability P(a*, x) calculated according to the following formula:

P(a*, x) = e^{Q(a*, x)/τ} / Σ_a e^{Q(a, x)/τ},    (3)
where τ > 0 is a temperature parameter. High values of τ make the probabilities of all actions almost the same, regardless of their quality.
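For concreteness, the update rules (1)–(2) and the Boltzmann selection (3) can be combined into a small tabular agent. The following Python sketch is only an illustration of these formulas under the notation above; it is not the implementation used in the experiments, and the class and method names are our own.

```python
import math
import random
from collections import defaultdict


class SarsaAgent:
    """Minimal tabular SARSA with Boltzmann (softmax) action selection.

    Illustrative sketch only; parameter names follow Eqs. (1)-(3).
    """

    def __init__(self, actions, beta=0.3, gamma=0.8, tau=0.05):
        self.actions = list(actions)
        self.beta = beta              # learning rate
        self.gamma = gamma            # discount rate
        self.tau = tau                # Boltzmann temperature
        self.q = defaultdict(float)   # Q(a, x), indexed by (action, state)

    def select_action(self, state):
        """Boltzmann selection, Eq. (3): P(a, x) proportional to exp(Q(a, x)/tau)."""
        weights = [math.exp(self.q[(a, state)] / self.tau) for a in self.actions]
        return random.choices(self.actions, weights=weights, k=1)[0]

    def update(self, state, action, reward, next_state, next_action):
        """SARSA update, Eqs. (1)-(2)."""
        delta = (reward
                 + self.gamma * self.q[(next_action, next_state)]
                 - self.q[(action, state)])
        self.q[(action, state)] += self.beta * delta
```

The dictionary plays the role of the tabular Q-function representation referred to later in the discussion of the results.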
4 Supervised Learning
Generally, supervised learning generates an approximation of a function f : X → C which assigns labels from the set C to objects from the set X. To generate knowledge, a supervised learning algorithm needs labeled examples which consist of pairs of f arguments and values. Let us assume that elements of X are described by a set of attributes A = (a_1, a_2, ..., a_n), where a_i : X → D_i, and D_i is the domain of attribute a_i. Therefore x_A = (a_1(x), a_2(x), ..., a_n(x)) is used instead of x. If the size of C is small, as in this research, the learning is called classification, C is a set of classes, and the approximation h is called a classifier.

In the experiments we use three algorithms: Naïve Bayes (NB), C4.5 and Ripper. NB is a simple probabilistic classifier in which every attribute node a_1, a_2, ..., a_n depends on the class attribute c. Learning is a process of calculating the a priori probabilities P(c) and the conditional probabilities P(a_i|c). C4.5 is a decision tree learning algorithm developed by Ross Quinlan [9]. Ripper, developed by William W. Cohen [2], generates decision rules instead of trees.
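As an illustration of the NB model just described, the Python sketch below estimates the prior probabilities P(c) and the conditional probabilities P(a_i|c) from labeled examples and classifies by maximizing P(c) Π_i P(a_i|c). It assumes discrete attribute values and omits smoothing and the handling of unseen values; it is not the Weka implementation used later in the experiments.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Estimate P(c) and P(a_i = v | c) from (attribute_tuple, class) pairs."""
    class_counts = Counter(c for _, c in examples)
    cond_counts = defaultdict(Counter)          # (i, c) -> Counter of values
    for attrs, c in examples:
        for i, v in enumerate(attrs):
            cond_counts[(i, c)][v] += 1

    priors = {c: n / len(examples) for c, n in class_counts.items()}

    def conditional(i, v, c):
        return cond_counts[(i, c)][v] / class_counts[c]

    return priors, conditional


def classify(priors, conditional, attrs):
    """Return the class maximizing P(c) * prod_i P(a_i | c)."""
    def score(c):
        p = priors[c]
        for i, v in enumerate(attrs):
            p *= conditional(i, v, c)
        return p
    return max(priors, key=score)


# Example: priors, cond = train_naive_bayes([((10, 40), "+"), ((20, 30), "-")])
#          classify(priors, cond, (10, 40))    # -> "+"
```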
4.1 Agent Architecture for Supervised Learning
In [12] we proposed an architecture for an agent applying supervised learning for strategy generation (see Fig. 1). The agent consists of four modules: the Processing Module is responsible for basic agent activities, storing training data, executing the learning process, and using the learned knowledge; the Learning Module is responsible for executing the learning algorithm and giving answers to problems using the learned knowledge; Training Data is a storage for examples used for learning; Generated Knowledge is a storage for the learned knowledge.
Fig. 1. Learning agent architecture
These components interact in the following way. The Processing Module receives Percepts from the environment. It may process them and execute Actions. If learned knowledge is needed during processing, it formulates a Problem and sends it to the Learning Module, which generates an Answer for the Problem using the Generated Knowledge. The Processing Module also decides what data should be stored in the Training Data storage. When necessary (e.g. periodically, or when Training Data contains many new examples) it calls the Learning Module to execute the learning algorithm and generate new knowledge from the Training Data. The learned knowledge is stored in the Generated Knowledge base. The Learning Module may be defined as a four-tuple: (Learning Algorithm, Training Data, Problem, Answer). The characteristics of the training data, the problem and the answer depend on the learning strategy used in the learning module.
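The interaction described above can be summarized in code. The Python skeleton below mirrors the four modules and the (Learning Algorithm, Training Data, Problem, Answer) tuple; the helper methods (make_example, make_problem, should_relearn) and the fit/classify interface of the learning algorithm are hypothetical placeholders, not part of the architecture defined in [12].

```python
class LearningModule:
    """Wraps the (Learning Algorithm, Training Data, Problem, Answer) tuple."""

    def __init__(self, algorithm):
        self.algorithm = algorithm           # e.g. NB, C4.5 or Ripper (assumed interface)
        self.generated_knowledge = None      # Generated Knowledge storage

    def learn(self, training_data):
        # Execute the learning algorithm on the Training Data.
        self.generated_knowledge = self.algorithm.fit(training_data)

    def answer(self, problem):
        # Answer a Problem using the Generated Knowledge.
        return self.generated_knowledge.classify(problem)


class ProcessingModule:
    """Basic agent activities: perceive, store examples, trigger learning, act."""

    def __init__(self, learning_module):
        self.learning_module = learning_module
        self.training_data = []              # Training Data storage

    def step(self, percept):
        example = self.make_example(percept)             # hypothetical helper
        if example is not None:
            self.training_data.append(example)
        if self.should_relearn():
            self.learning_module.learn(self.training_data)
        if self.learning_module.generated_knowledge is None:
            return None                                  # no knowledge yet
        problem = self.make_problem(percept)             # hypothetical helper
        return self.learning_module.answer(problem)      # Action to execute

    def make_example(self, percept):
        """Domain-specific: build a labeled training example, or None."""
        raise NotImplementedError

    def make_problem(self, percept):
        """Domain-specific: build a Problem for the classifier."""
        raise NotImplementedError

    def should_relearn(self):
        # E.g. periodically; the threshold of 50 examples is an arbitrary choice.
        return len(self.training_data) > 0 and len(self.training_data) % 50 == 0
```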
5 Supervised Learning for Delayed Rewards
Below we propose a method which allows supervised learning to work in environments with delayed rewards. It is based on the time-window method. During learning, instead of a single example x, we use a sequence of examples x̄_t = (x_t, x_{t−1}, ..., x_{t−dt+1}) ∈ X̄ to represent the situation at time stamp t with a window of size dt. Every x_i is described by the attributes A: (a_1(x_i), a_2(x_i), ..., a_n(x_i), a(x_i)). The last attribute, a(x_i) ∈ Act, represents the action executed at time step i. Two categories are used, C = {+, −}, representing success (+) or failure (−) of the actions a(x_i) executed in the sequence x̄_t. In case of continuous rewards, success can be defined as a reward above a certain threshold. The training data can be defined as a sequence ((x̄_1, c_1), (x̄_2, c_2), ..., (x̄_m, c_m)). After learning we get a classifier h : X̄ → C, or more precisely h : X̄ × C → R, which for a given x̄_t and c returns the certainty of this class.

Our goal is to select the best action a* ∈ Act for the current time t using h. We apply the following method:
1. Prepare x̄_t by setting all attributes describing the current and previous states and leaving the action attributes unknown. The classifier should be able to work with unknown attribute values. If t < dt, the attributes describing non-existing states are also set to the unknown value.
2. Find t* ∈ {t, t − 1, ..., t − dt + 1} for which the dispersion of the certainties of the positive class for the various actions is the highest, assuming that actions at other time stamps are unknown:
   (a) For x̄_t prepare a set of examples {x̄_k^act} for all act ∈ Act and k = t, t−1, ..., t−dt+1, where x̄_k^act is created from x̄_t by setting a(x_k) = act.
   (b) Calculate δ_k = max_act h(x̄_k^act, +) − min_act h(x̄_k^act, +) for every k.
   (c) Choose k* which maximizes δ_k.
   (d) Return the action to execute: a* = argmax_act h(x̄_{k*}^act, +).

We also tried an entropy measure instead of the maximal dispersion, but the experimental results were worse. Experimental results also showed that exploration improves the results of supervised learning; therefore Boltzmann selection was applied here as well.
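A possible realization of steps 1–2 is sketched below in Python. The classifier h is assumed to be a callable h(example, cls) that returns the certainty of class cls ('+' or '−') for a window in which unknown values are encoded as None; this interface, like the list encoding of the window, is our own assumption and not part of the method's definition.

```python
def select_action(h, history, actions, dt):
    """Time-window action selection (steps 1-2).

    `history` holds the attribute tuples of up to dt recent states,
    most recent first (matching x_t, x_{t-1}, ...); missing states and
    unknown actions are encoded as None.
    """
    # Step 1: pad the window to length dt and leave all actions unknown.
    states = list(history[:dt]) + [None] * (dt - len(history))
    window = [(s, None) for s in states]           # (state, action) pairs

    best_delta, best_action = -1.0, None
    for k in range(dt):                            # candidate time stamp
        # Step 2(a): vary only the action at position k.
        certainty = {}
        for act in actions:
            example = list(window)
            example[k] = (states[k], act)
            certainty[act] = h(example, '+')
        # Step 2(b): dispersion of positive-class certainties at k.
        delta = max(certainty.values()) - min(certainty.values())
        # Steps 2(c)-(d): remember the best position and its best action.
        if delta > best_delta:
            best_delta = delta
            best_action = max(certainty, key=certainty.get)
    return best_action
```

In the experiments this greedy choice is additionally randomized with Boltzmann selection, as noted above.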
6 The Farmer-Pest Problem
We tested the learning method described above on the Farmer-Pest problem. This environment borrows its concept from a specific aspect of the real world, in which farmers struggle to protect their fields and crops from pests. Each farmer (the only type of agent in the problem) can manage multiple fields. On each field, multiple types of pests can appear. Each pest has a specific set of attributes, e.g. number of legs or color. The values of these attributes depend on the pest type. To protect the field, the farmer can take advantage of multiple means (e.g. pesticides), called actions. However, each pest type has a different resistance to each of the farmer's actions (hereinafter referred to as the resistance matrix). Usually, the problem is time-limited to a discrete number of turns. In every turn an agent can execute only one action. This makes simple strategies, like applying all actions to every pest, inefficient.

The key assumption here is that the farmer agent is aware of neither the possible types of pests nor the resistance matrix. What he can see are the pests' attributes. Based on them, he needs to learn how to recognize the different pest types. To learn the resistance matrix, the agent needs to experiment with different actions and observe their effects (i.e. whether the pest dies or not). To make the problem more complicated, the effects are not always immediate and they depend on the resistance matrix. The resistance of a specific pest type to a specific action is expressed as the time after which the pest dies (a pest's immunity to an action is expressed as an infinite time). Pests can also have a maximum life-span after which they die regardless of the agent's actions. This maximum life span is called the alive time. The problem can be further extended by introducing deviations to the observed values of the pests' attributes or by limiting the number of attributes the farmer agent can see.
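To make the ingredients of the problem concrete, the Python fragment below encodes a toy resistance matrix and alive time; all names and values are illustrative assumptions and do not correspond to the configuration used in the experiments.

```python
import math

# Toy Farmer-Pest configuration (illustrative values only).
# RESISTANCE[pest_type][action] is the number of turns after which the
# action kills a pest of that type; math.inf encodes immunity.
ACTIONS = ["act1", "act2", "act3"]
RESISTANCE = {
    "type1": {"act1": 2, "act2": math.inf, "act3": math.inf},
    "type2": {"act1": math.inf, "act2": 4, "act3": math.inf},
    "type3": {"act1": math.inf, "act2": math.inf, "act3": 8},
}
ALIVE_TIME = 20   # a pest dies on its own after this many turns


def pest_dies(pest_type, action, turns_since_action, age):
    """True if the pest is dead: the action took effect or the alive time passed."""
    return (turns_since_action >= RESISTANCE[pest_type][action]
            or age >= ALIVE_TIME)
```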
7 Experiments
Using the Farmer-Pest problem we are able to compare reinforcement and supervised learning algorithms. Two dimensions are chosen to define the various versions of the environment: the attribute distributions and the delay of action results. The objective is to analyze how quickly agents may improve their performance and to show that various conditions favor different learning algorithms.
7.1 Setup
Two experiments with different settings were performed: simple and complex. Four learning agents take part in every experiment. They use the SARSA, Naïve Bayes, C4.5 or Ripper learning algorithms. Their results are compared to those obtained by a fifth agent that executes random actions. Pests are described by four attributes. There are eight actions, Act = {act_1, act_2, ..., act_8}, and every pest p_t of type t can be killed by act_t only, after a delay d.
Table 1. Pest's attribute probability distributions for Experiment 2 (pest types 1–8; attributes and their possible values: size ∈ {40, 10}, legsno ∈ {40, 30}, speed ∈ {40, 20, 10}, jump ∈ {40, 20, 10}; an x marks the attribute values that occur for a given pest type)
Three values of d were used in the experiments: 2, 4 and 8. As a consequence, we used dt = 8. Data was collected by running simulation software developed in Java (Weka's J48 and JRip implementations of C4.5 and Ripper were used). Every experiment consists of 100 simulations. Every simulation consists of 20 consecutive games, and 80 pests appear in every game. The knowledge of the agents is preserved from game to game; however, it is cleared between simulations. Supervised learning is executed between games, while reinforcement learning is performed after every turn. In the experiments we check how the performance of the agents, measured by the number of pests eliminated, changes during a simulation. Figure 2 presents the mean numbers of pests eliminated by every agent in consecutive games.

Experiment 1 – Simple Environment. We start with the simple environment. Every pest p_t of type t has unique values of all attributes: a_i(p_t) = 10t, i = 1..4, t = 1..8. This allows every type to be recognized using a single value of any attribute.

Experiment 2 – Complex Environment. In this experiment four attributes with domains of size two or three are used. The attribute values are presented in Tab. 1. As we can see, the distributions are chosen in such a way that no single attribute can be used to recognize the pest type – tests on several attributes are necessary. Therefore, this domain is significantly more complex than the previous one, even though the sizes of the attribute domains are smaller.

The parameters were tuned using a hill-climbing algorithm on the complex environment with d = 4. The following values were chosen for SARSA: λ = 0.9, γ = 0.8, β = 0.3, τ = 0.05. For the supervised learning algorithms only τ had to be tuned: for Naïve Bayes τ = 0.15, for C4.5 τ = 0.15 and for Ripper τ = 0.1.
7.2 Discussion of the Results
Results for the simple case are presented in Fig. 2 (a-c). The random agent and the NB agent have poor results. The explanation for NB is simple: this model is not able to take into account the dependency between state attributes and action attributes. The remaining learning agents perform much better. If d is larger, learning is slower, but the final results are still good. The only exception is the result for SARSA with d = 8. In the last game, Ripper and C4.5 are always better than the random choice and NB, and the difference is statistically significant (t-test, p < 0.05).

Results for the complex environment are presented in Fig. 2 (d-f). NB does not work here either. The smaller d is, the better the final results that can be achieved. For
d = 2, C4.5 and SARSA learn quickly and outperform the other agents (the result is statistically significant at p < 0.05). For d = 8, C4.5 is statistically better than the others (at p < 0.05) and Ripper is better than SARSA.

Comparing the results for both environments, we can observe that the change in complexity affects the Ripper algorithm more than SARSA and C4.5. For the same d, SARSA has a similar learning speed in both cases. This is caused by the tabular Q-function representation, which has the same performance in both cases. C4.5 is only slightly slower for the complex environment; this classifier is able to separate classes well even in the more complex environment. The Ripper classifier has worse accuracy in the second one.
8 Conclusion and Further Research
In this paper we present a method of agent strategy generation using supervised learning in environments with delayed rewards. We compare the performance of reinforcement and supervised learning algorithms: SARSA, Naïve Bayes, C4.5 and Ripper. These algorithms were used by agents taking part in a simulation of the Farmer-Pest problem, which is a scalable, multi-dimensional problem domain for testing agent learning algorithms. This environment provides a large number of configurable dimensions, which enables the preparation of different testing conditions.

Experimental results show that Naïve Bayes cannot be used for this purpose because of its attribute independence assumption. The performance of the other algorithms depends on the environment configuration (e.g. the value of d). For simple environments, in which any attribute value allows the action to be decided, every learning algorithm gives fast improvement, unless the reward delay d is too large (which is a problem especially for SARSA). If the environment is more difficult, so that the agent has to take into account the values of several attributes to choose an appropriate action, and d is not small (d ≥ 4), the C4.5 and Ripper supervised learning algorithms perform better than SARSA. It appears that the most difficult case for the learning algorithms is a uniform distribution of most of the attributes, which generates noise that is especially harmful for SARSA.

It should be noted that in cases where the knowledge learned by agents should be interpreted by humans, supervised learning algorithms with symbolic knowledge representation should be preferred (if they give acceptable results). The knowledge stored during the learning process can then have a form that humans can interpret; decision rules and trees are especially good choices.

Future work will concentrate on experiments with more algorithms. Another aspect of the work will be the extension of the testing environment to cover cooperation between agents and delays in action results. Symbolic knowledge representation makes it possible to take into account complex dependencies between environment attributes and the decisions to be made. Next, we are planning other applications, such as resource allocation [1] and Focused Web Crawling [19].

Acknowledgments. The research presented here has received funding from the Polish National Science Centre under grant no. UMO-2011/01/D/ST6/06146.
Fig. 2. Mean numbers of pests eliminated by every agent in consecutive games in Experiment 1 for d = 2 (a), d = 4 (b) and d = 8 (c), and in Experiment 2 for d = 2 (d), d = 4 (e) and d = 8 (f); the following symbols are used: × for SARSA, + for Naïve Bayes, for C4.5, for Ripper, and △ for random
References

1. Cetnarowicz, K., Dreżewski, R.: Maintaining functional integrity in multi-agent systems for resource allocation. Computing and Informatics 29(6), 947–973 (2010)
2. Cohen, W.W.: Fast effective rule induction. In: Proceedings of the 12th International Conference on Machine Learning (ICML'95), pp. 115–123 (1995)
3. Fernández, F., Borrajo, D., Parker, L.E.: A reinforcement learning algorithm in cooperative multirobot domains. Journal of Intelligent and Robotic Systems (2005)
4. Giles, C.L., Jim, K.C.: Learning communication for multi-agent systems. In: WRAC, pp. 377–392 (2002)
5. Haynes, T., Sen, S.: Evolving behavioral strategies in predators and prey. In: Adaptation and Learning in Multiagent Systems, pp. 113–126. Springer Verlag (1996)
6. Luke, S., Hohn, C., Farris, J., Jackson, G., Hendler, J.: Co-evolving soccer softbot team coordination with genetic programming. In: Kitano, H. (ed.) RoboCup-97: Robot Soccer World Cup I, pp. 398–411. Springer-Verlag (1997)
7. Panait, L., Luke, S.: Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems 11 (2005)
8. Parker, L.E., Touzet, C.: Multi-robot learning in a cooperative observation task. In: Distributed Autonomous Robotic Systems 4, pp. 391–401. Springer (2000)
9. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
10. Riley, P., Veloso, M.: On behavior classification in adversarial environments. In: Distributed Autonomous Robotic Systems 4, pp. 371–380. Springer-Verlag (2000)
11. Sen, S., Weiss, G.: Learning in multiagent systems, pp. 259–298. MIT Press, Cambridge, MA, USA (1999)
12. Śnieżyński, B.: An architecture for learning agents. In: Bubak, M., van Albada, G., Dongarra, J., Sloot, P. (eds.) Computational Science - ICCS 2008, vol. III, pp. 722–730. Springer (2008)
13. Śnieżyński, B.: Agent strategy generation by rule induction in predator-prey problem. In: ICCS 2009, pp. 895–903. LNCS, Springer-Verlag (2009)
14. Śnieżyński, B.: Agent strategy generation by rule induction. Computing and Informatics 32(5) (2013)
15. Śnieżyński, B., Dajda, J.: Comparison of strategy learning methods in farmer-pest problem for various complexity environments without delays. Journal of Computational Science 4(3), 144–151 (2013)
16. Stone, P., Sutton, R.S., Kuhlmann, G.: Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior 13 (2005)
17. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press (1998)
18. Tan, M.: Multi-agent reinforcement learning: Independent vs. cooperative agents. In: Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337. Morgan Kaufmann (1993)
19. Turek, W., Opaliński, A., Kisiel-Dorohinicki, M.: Extensible web crawler - towards multimedia material analysis. In: Dziech, A., Czyżewski, A. (eds.) Multimedia Communications, Services and Security, CCIS, vol. 149, pp. 183–190. Springer Berlin Heidelberg (2011)
20. Watkins, C.J.C.H.: Learning from Delayed Rewards. Ph.D. thesis, King's College, Cambridge (1989)
21. Wu, A.S., Schultz, A.C., Agah, A.: Evolving control for distributed micro air vehicles. In: IEEE Conference on Computational Intelligence in Robotics and Automation, pp. 174–179 (1999)