Reinforcement Learning of Communication in a Multi-Agent Context

Shirley Hoet
LIP6, Pierre et Marie Curie University
Paris, France
[email protected]

Nicolas Sabouret
LIP6, Pierre et Marie Curie University
Paris, France
[email protected]

Abstract—In this paper, we present a reinforcement learning approach for multi-agent communication, in order to learn what to communicate, when, and to whom. This method is based on introspective agents that can reason about their own actions and data so as to construct appropriate communicative acts. We propose an extension of classical reinforcement learning algorithms for multi-agent communication, and we show how communicative acts and memory can help solve non-Markovian dynamics and asynchrony issues in MAS.

Keywords: Communication Learning; Reinforcement Learning; Multi-Agent System

I. INTRODUCTION

In MAS communication, it is usually assumed that agents know when to send a message, the type of communicative act they must use, the content of the message and the recipient(s) of the message. However, these hypotheses no longer hold in open and heterogeneous MAS. Thus, agents must learn what to communicate, to whom and when. In this paper, we focus on a single learner agent, learning request and query messages.

Much has been done in the field of single-agent behaviour learning, in particular using reinforcement learning. However, existing techniques have several limits when it comes to learning to communicate in MAS.

First, since the MAS is open and loosely coupled, the learner agent has no idea of the preconditions and effects of other agents' abilities. Yet, to delegate a task to another agent (and to build relevant request messages), it must determine the context and possible actions of each agent in the system.

Second, the environment is only partially observable by the learner agent. There exist techniques for learning good behaviours in a partially observable environment [1], [2]. However, they require that the agent knows a model of the environment, such as its state space and/or the transition probability from one state to another given the action performed by the agent. If the agent evolves in an open and loosely coupled MAS, it has no access to this information. Another approach consists in separating two hidden states of the environment either with the agent's memory [3], [4] or with information obtained by communication [5], [6].


In our approach, we will use communication to discover hidden states and store them in the agent's memory.

Third, when agents interact to delegate tasks (through request acts) in an asynchronous MAS, a requested action can be executed several time steps after the answer was sent and received. Thus, the state of the system at time t, as seen by the learner agent, can depend on tasks that were delegated at time t − k. This supports the idea that learner agents should store their past delegated actions, whose effects could be delayed. However, adding such a memory increases the learner agent's state space exponentially, which prevents the reinforcement learning from converging. Furthermore, the learner agent has to learn to wait for a delegated action to be performed before executing another one.

In the following section, we present our solution for the first two issues: using simple MAS protocols in the context of introspective agents allows us to discover possible interactions. In section III, we present an iterative approach to building a memory for solving the third problem (related to agent asynchrony). We discuss our evaluation results in section IV and related work on multi-agent interaction learning in section V. Finally, we conclude in section VI.

II. BUILDING MESSAGES

In our model, we use the VDL multi-agent platform [7], which proposes an agent communication language (ACL) based on the FIPA ACL model, extended with speech acts for introspection. The VDL model assumes that agents have access at runtime to the list of their capacities (with their preconditions and effects) and to their internal state. Thus, a VDL agent can answer questions such as "what can you do now?". This capability is used to extract the capacities of the learner agent's peers and, thus, to determine the content of future request and query messages. Our model uses two interaction protocols:

• The what-query protocol allows agents to discover what they can ask other agents about their internal state and, thus, to build eligible query messages. Discovered query messages lead to inform answers. The content of these inform messages is used to store beliefs in the learner agent's state.

• The what-order protocol allows agents to discover what actions they can ask other agents to perform and, thus, to build eligible request messages. Discovered request messages lead to agree or impossible messages, depending on the peer's internal state.

These protocols are illustrated in figure 1.

Figure 1. The what-query and the what-order protocols.
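As a rough sketch of how this discovery phase can be organized (the message and field names below are illustrative, not the actual VDL API), the learner agent can run both protocols against each known peer and collect the eligible messages it obtains:

```python
from dataclasses import dataclass


@dataclass
class Message:
    performative: str          # e.g. "what-query", "what-order", "query", "request"
    receiver: str
    content: object = None


class Peer:
    """Interface to another agent; the concrete transport is platform-specific."""
    def __init__(self, name: str):
        self.name = name

    def send(self, msg: Message) -> Message:
        raise NotImplementedError  # provided by the concrete platform binding


def discover_messages(peers):
    """Build the set of eligible query/request messages (illustrative only)."""
    eligible = []
    for peer in peers:
        # what-query: which parts of the peer's internal state can be queried?
        answer = peer.send(Message("what-query", peer.name))
        for item in answer.content or []:
            eligible.append(Message("query", peer.name, item))
        # what-order: which actions can the peer be requested to perform?
        answer = peer.send(Message("what-order", peer.name))
        for action in answer.content or []:
            eligible.append(Message("request", peer.name, action))
    return eligible
```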

Learning messages

The general principle is that the agent explores the MAS with our two protocols, what-order and what-query, in order to construct a maximum number of communicative acts and to test the resulting messages. This process occurs in parallel with our behavioural learning mechanism (described in section III). We deal with the exploration/exploitation dilemma using a simple simulated annealing metaheuristic.

However, even if the agent is able to build query and request messages and to interpret the answers, it is not able to understand the meaning of the messages it sends nor to determine their effects (it has no access to the semantics of the contents, as interpreted by its peers). Thus, the difficulty is to determine when to send a message in order to obtain the desired effect. For this purpose, the agent considers each different message that it can send as a possible action and uses the reinforcement algorithm presented in the next section to learn when to use it judiciously.
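One way to realize such an annealing-style exploration over the set of message-actions is a Boltzmann selection over Q-values with a decreasing temperature. This is a sketch under our own assumptions: the softmax rule and the temperature schedule are illustrative choices, not the exact mechanism of the paper.

```python
import math
import random


def select_action(q_values, actions, state, step, t0=1.0, decay=0.999):
    """Boltzmann (softmax) action selection with a decreasing temperature.

    q_values: dict mapping (state, action) -> estimated Q-value.
    actions: eligible actions, i.e. internal actions, wait, and every
             query/request message built so far by the discovery protocols.
    """
    temperature = max(t0 * (decay ** step), 0.05)
    prefs = [q_values.get((state, a), 0.0) / temperature for a in actions]
    m = max(prefs)
    weights = [math.exp(p - m) for p in prefs]  # subtract max for numerical stability
    r = random.uniform(0.0, sum(weights))
    acc = 0.0
    for action, w in zip(actions, weights):
        acc += w
        if acc >= r:
            return action
    return actions[-1]
```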

III. LEARNING TO COMMUNICATE

As explained in section I, the learner agent's state at time t depends on all tasks previously delegated to other agents (i.e. on all request messages previously sent) and on its own actions in between (including wait actions). Therefore, the learner agent must memorize its past actions. In addition, after the reception of an inform message, the learner agent stores the information as a belief. However, this belief can become false after some time, and the agent cannot spend its time sending query messages to keep it up to date. Thus, it needs to learn how long a given belief must be maintained in memory.

Our agent must therefore learn when to delegate actions (by sending request messages) and when to update its beliefs (by sending query messages), based on its current state, which includes all its memorized beliefs and past actions.

A. Background

In [3], the author proposed an iterative algorithm based on Q-Learning [8] to find the minimal memory size for convergence to a good policy. At the beginning, the agent has no memory and the algorithm performs N steps of Q-Learning. Then it extracts the k most ambiguous states from the Q-table (i.e. the states in which Q-Learning could not determine an optimal action) and, for each of these states, one additional level of memory is added to the agent, which enables it to memorize its latest action and observation. Thus, the agent's state is characterized by S + M1 (instead of S). This process iterates until no improvement is detected in the policy. After i stages, some states can contain up to i memory slots.

What makes this algorithm efficient is that it increases the state space only for the ambiguous states and not for the others. Using different memory sizes depending on the context allows the agent to reduce its search space, to keep the Q-Learning algorithm practicable and to converge towards a solution. In this subsection, we briefly present the Q-Learning algorithm and Dutech's heuristic for ambiguous states.

1) Q-Learning: Q-Learning [8] is a reinforcement learning algorithm that iteratively builds the values Q(s, a) ∈ R, which denote the expected reward for doing action a in state s. The best action for state s is the one with the highest Q-value. One limitation of Q-Learning w.r.t. our problem is that it relies on an MDP model of the problem, with the hypothesis that each state only depends on the immediately preceding state and performed action. Using memory allows the agent to overcome this limit.

2) The ambiguous state: In order to determine the ambiguity of a state, [3] proposes to use three criteria. A state s is more ambiguous when:
• the Q-values of its two best actions, Q(s, a1) and Q(s, a2), are very close to each other;
• its Q-values Q(s, .) are often updated (i.e. the state is often encountered, which could correspond to several hidden sub-states);
• Q(s, a) does not converge (strong variations appear from one trial to another).

Every state is ranked according to the sum of its scores w.r.t. each criterion. The k most ambiguous states are then associated with an additional level of memory.

B. Iterative learning in a MAS context

Our solution for reinforcement learning of communicative acts in an asynchronous context relies on Dutech's solution. As proposed in [3], we consider the memory as a variable of the learner agent (at the same level as its beliefs).

The state $s_t$ of the learner agent is defined by a tuple $s_t = \{o^1_t, \ldots, o^n_t, b_t, m^1_t, \ldots, m^k_t\}$ where:
• $\forall i \in [1, n]$, $o^i_t$ is a couple of a variable and a value;
• $b_t$ is the belief obtained by receiving an inform message at time $t-1$ ($b_t$ can be empty if the learner agent did not receive an inform message at time $t-1$);
• $\{m^1_t, \ldots, m^k_t\}$ is the memory of the learner agent at time $t$, composed alternately of past actions and past observations ($o^i_t$ and $b_t$). Formally, $\forall p \in \mathbb{N}$, $m^{2p+1}_t$ is the action that the learner executed at time $t-p$, and $m^{2p}_t$ is a copy of the belief $b_{t'}$ and of $\{o^1_{t'}, \ldots, o^n_{t'}\}$ as they were at time $t' = t-p$.
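As a minimal illustration of how this memory-augmented state can be combined with tabular Q-Learning, the state can be flattened into a hashable key and updated with the standard rule from [8]. The encoding and helper below are our own hypothetical choices, not the paper's implementation.

```python
from collections import defaultdict


def encode_state(observations, belief, memory):
    """Flatten s_t = {o_t^1..o_t^n, b_t, m_t^1..m_t^k} into a hashable key."""
    return (tuple(sorted(observations.items())), belief, tuple(memory))


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Tabular Q-Learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions) if actions else 0.0
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])


q = defaultdict(float)  # Q-table; missing entries default to 0
```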

To determine the size $k$ of the agent memory for each state using an iterative approach, we use the initial heuristic of [3], enriched with a new criterion associated with the wait action. In our first experiments, we observed that agents tend to learn that wait is the safest action in a non-Markovian environment, which prevents the algorithm's convergence. Indeed, the wait action is relevant when the learner agent needs to wait for the effect of a past delegated action in a state s. However, if this action is not stored in the memory, the agent cannot determine whether it can pursue its policy or whether it should still wait. This leads the agent to always select the wait action and to be stuck in the same state, waiting indefinitely. For this reason, a state in which the best action is wait should always be provided with an additional slot of memory. Thus, we compute the ambiguity of a state $s$ according to the following function:

$$ amb(s) = wait_s + \frac{1}{3}\Big(r_s[up(s)] + r_s\Big[\frac{1}{N}\sum_{a \in A_s}\Delta q_a\Big] + r'_s[q_{a_1} - q_{a_2}]\Big) $$

where:
• $a_1$ and $a_2$ are the best actions to perform in state $s$ (they maximize the Q-value $Q(s, .)$);
• $wait_s = \infty$ if $a_1 =$ wait, $0$ otherwise;
• $q_a = Q(s, a)$ and $\Delta q_a$ is the last modification of $q_a$;
• $N$ is the number of $Q(s, a)$ values for the state $s$;
• $r_s[x]$ (respectively $r'_s[x]$) returns the position of the state $s$ when the set of states is ranked in increasing order (resp. decreasing order) according to the value of $x$;
• $up(s)$ is the number of updates of the state $s$, i.e. the number of times this state has been visited during the learning.

Moreover, we do not want agents to spend their time sending messages when it is not necessary (as with the wait action, this can be evaluated as a safer action when in doubt). For this reason, the learner agent receives a negative reward whenever it sends a message. This cost depends on the answer received for the message: if the agent receives a positive answer (i.e. an inform or an agree message), the cost is smaller than if it receives a negative answer (i.e. an unknown, a not-understood or an impossible message).
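A possible sketch of how the ambiguity score $amb(s)$ above can be computed over the Q-table follows; the data layout and the ranking helpers are our own assumptions.

```python
import math


def rank_positions(states, key, descending=False):
    """r_s[x] / r'_s[x]: position of each state when states are ranked by key(s)."""
    ordered = sorted(states, key=key, reverse=descending)
    return {s: pos for pos, s in enumerate(ordered)}


def ambiguity_scores(states, q, actions, updates, last_delta):
    """Compute amb(s) for every state s.

    q[(s, a)]         : current Q-value estimates
    updates[s]        : up(s), number of updates of state s
    last_delta[(s, a)]: last modification of Q(s, a)
    """
    def gap(s):
        best = sorted((q[(s, a)] for a in actions), reverse=True)
        return best[0] - best[1] if len(best) > 1 else 0.0

    def mean_delta(s):
        return sum(last_delta[(s, a)] for a in actions) / max(len(actions), 1)

    r_up = rank_positions(states, lambda s: updates[s])       # increasing order
    r_delta = rank_positions(states, mean_delta)              # increasing order
    r_gap = rank_positions(states, gap, descending=True)      # decreasing order

    scores = {}
    for s in states:
        best_action = max(actions, key=lambda a: q[(s, a)])
        wait_s = math.inf if best_action == "wait" else 0.0
        scores[s] = wait_s + (r_up[s] + r_delta[s] + r_gap[s]) / 3.0
    return scores
```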

IV. EXPERIMENTS AND RESULTS

The experiment we present here was used to validate our approach against several issues: asynchronism of agents (the wait action), a partially observable environment (building and using query messages) and delegation (building and using request messages). The problem consists, for a cook learner agent, in obtaining melted chocolate through interactions with a pan agent (see figure 2).

Figure 2. The cook problem.

The difficulty of this problem comes from the fact that the cook agent (the learner) can only observe the variable on. It has no direct access to the chocolate variable, which evolves randomly. Thus, the cook agent must learn to send query communicative acts to determine the value of the chocolate variable: this information is required to prevent the chocolate from burning and to determine when to retrieve it from the pan.
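To make the setting concrete, here is a toy rendition of the pan agent. Apart from the 30 percent melted-to-burnt probability mentioned in the results below, the transition probabilities are placeholders, and the interface is ours, not VDL's.

```python
import random


class PanAgent:
    """Toy pan agent: the chocolate variable is hidden from the cook."""

    def __init__(self):
        self.on = False
        self.chocolate = "cold"   # cold -> hot -> melted -> burnt

    def step(self):
        # Stochastic evolution; only the 30% melted-to-burnt chance comes
        # from the paper, the other probabilities are made up for the sketch.
        if self.on:
            if self.chocolate == "cold" and random.random() < 0.5:
                self.chocolate = "hot"
            elif self.chocolate == "hot" and random.random() < 0.5:
                self.chocolate = "melted"
            elif self.chocolate == "melted" and random.random() < 0.3:
                self.chocolate = "burnt"

    def observe(self):
        return {"on": self.on}          # what the cook agent can see directly

    def answer_query(self, variable):
        return getattr(self, variable)  # inform answer to a query message
```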


Figure 3. Average reward (over 10 steps) as a function of the number of learning steps, for an agent 1) with memory but without query messages (⋆); 2) with query messages but without memory (+); 3) with both query messages and memory (×).

Figure 3 illustrates the reward obtained during learning for three different configurations (with memory only, with query messages only, and with both). This experiment shows that the learning process cannot converge without using query communicative acts: the average reward at the end of the process in the first configuration stays at −100, i.e. the agent only obtains cold or burnt chocolate at every run. However, the use of query messages is not sufficient to find a good policy

if it is not coupled with memory: in the second configuration, the algorithm converges on a policy where the learner agent receives an average reward of −25. In the third configuration, with a slot of memory in addition to the query messages, the algorithm finds a policy with an average reward of 60. This sub-optimal result comes from the non-deterministic environment: not only is the optimal reward impossible to obtain with certainty (there is always a 30 percent probability that the chocolate switches from melted to burnt), but the non-determinism of the changes also makes it difficult for the Q-Learning algorithm to converge. Thus, the learnt policy consists in retrieving something between melted and hot chocolate (this information being obtained through the query messages), which appears to be the safest choice.

V. RELATED WORK

As shown in [5], communication is at the core of multi-agent reinforcement learning because it allows agents to exchange information to improve the learning. For example, in [6], [9], the authors use communication in order to improve the coordination of their agents. In these models, however, each agent already knows when to communicate and how to communicate. We go beyond this limit by considering agents in asynchronous MAS that learn what and when to communicate, as presented in section I.

In [10], the authors proposed a reinforcement algorithm to determine when agents must communicate or perform an action. This work makes two strong hypotheses: first, the learner agent knows in advance the content of its messages (it only learns when to use them); second, the agent evolves in a synchronous MAS, which removes the problem of memory.

In [11], the authors are interested in learning communication signals in MAS. Messages initially have no semantic content from the receiver's point of view: the agents must learn the meaning of the messages by trial and error (by observing the impact of the actions they induce). This algorithm makes it possible to determine a communicative act policy by reinforcement learning without the agents knowing in advance the content of their messages.

One limitation of all the work presented above is that it focuses on learning informative acts (query and alike). In this paper, we propose a model that also considers the request speech act (and all the issues relevant to it).

VI. CONCLUSION

In this work, we proposed protocols and an incremental learning algorithm for a single agent to learn what and when to communicate in a non-Markovian and asynchronous MAS. Our experimental results lead us to consider three research perspectives: first, we would like to extract the preconditions and the effects of the communicative acts built by the agent in order to reuse the acquired knowledge of

our agent in different contexts or for different tasks. Second, we would like to separate relevant actions from the others: in an open and loosely coupled MAS, we do not want our algorithm to uselessly explore the whole set of actions. Last, we would like to extend our research to the context of multi-agent learning.

REFERENCES

[1] P. Poupart, "Exploiting structure to efficiently solve large scale partially observable Markov decision processes," Ph.D. dissertation, University of Toronto, Toronto, Ont., Canada, 2005.
[2] F. Doshi, J. Pineau, and N. Roy, "Reinforcement learning with limited reinforcement: using Bayes risk for active learning in POMDPs," in Proceedings of the 25th International Conference on Machine Learning (ICML '08). New York, NY, USA: ACM, 2008, pp. 256–263. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390189
[3] A. Dutech, "Solving POMDPs using selected past events," in ECAI, 2000, pp. 281–285.
[4] M. T. Todd, Y. Niv, and J. D. Cohen, "Learning to use working memory in partially observable environments through dopaminergic reinforcement," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 1689–1696.
[5] M. Tan, "Multi-agent reinforcement learning: independent vs. cooperative agents," in Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann, 1993, pp. 330–337.
[6] M. J. Matarić, "Using communication to reduce locality in distributed multi-agent learning," Journal of Experimental and Theoretical Artificial Intelligence, vol. 10, pp. 357–369, 1998.
[7] Y. Charif and N. Sabouret, "An agent interaction protocol for ambient intelligence," in 2nd International Conference on Intelligent Environments (IE'06), 2006, pp. 275–284.
[8] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Cambridge University, Cambridge, United Kingdom, 1989.
[9] D. Szer and F. Charpillet, "Improving coordination with communication in multi-agent reinforcement learning," in ICTAI '04: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence. Washington, DC, USA: IEEE Computer Society, 2004, pp. 436–440.
[10] F. S. Melo and M. Veloso, "Learning of coordination: exploiting sparse interactions in multiagent systems," in AAMAS '09: Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2009, pp. 773–780.
[11] T. Kasai, H. Tenmoto, and A. Kamiya, "Learning of communication codes in multi-agent reinforcement learning problem," in Proceedings of the 2008 IEEE Conference on Soft Computing in Industrial Applications (SMCia/08), 2008, pp. 1–6.
