Fault-Management in Multi-Agent Systems
Peng Xu, Ralph Deters
University of Saskatchewan, 57 Campus Drive, Saskatoon, Saskatchewan, S7N 5A9
{pex066, deters}@mail.usask.ca
ABSTRACT
Complex multi-agent systems (MAS) that consist of large numbers of loosely coupled, heterogeneous agents tend to be difficult to manage. The management of MAS, which is similar to that of other decentralized and highly distributed systems, covers a range of activities including fault, configuration, performance and security management. In this paper we focus on the specific issues of fault-management in distributed multi-agent systems. We present an “event message” based management approach that is applicable to all FIPA-compliant agent systems and report on its performance.
General Terms Management, FIPA, MAS
Keywords FIPA-OS, Fault-Management
1. Agent-Management
Complex multi-agent systems (MAS) consist of large numbers of loosely coupled, heterogeneous agents that are hosted by different processes. Agents are highly autonomous entities that communicate asynchronously by message passing. Because they lack centralized control structures, MAS tend to be very difficult to manage, especially since they are designed to self-organize, which allows them to constantly change their functional dependencies. The management of distributed systems typically covers a variety of activities including configuration-management, performance-management, security-management and fault-management. While all of the above-mentioned activities are important, fault-management is the essential one. Fault-management itself is a complex activity that can be decomposed into the three basic steps of fault-detection, fault-isolation and fault-clearance. In this paper we focus on the process of fault-detection. We present a non-invasive event reporting technique that allows agents to emit events such as report-, state-change-, warning- and error-events as they see fit. The event messages of the agents form the event stream of the system, which is sent to event-manager processes for storage and processing.
2. State-Detection
A central problem in the process of fault-detection is the constant monitoring of the system state and its evaluation. The key issue is to have a process in place that enables the fast, precise and non-invasive detection of critical system states. The state of the MAS can be viewed as the weighted states of its agents, whereby the weight of each individual agent is based on its importance to the system at a given moment in time. Since the MAS is self-organizing these weights can change, which makes the state evaluation of the system a complex task. Fast and precise state detection is very important since faults and failures of single agents tend to impact other agents, leading to the problem of fault-propagation. The only way to avoid fault-propagation is to ensure that critical states are detected as quickly and precisely as possible. In addition, the process of state-detection has to be conducted in the least invasive way to minimize the management overhead. To preserve the autonomy of the agent it is also important that the process is under the control of the agent. By allowing the agent to emit event messages based on its own state we are able to retrieve important data without violating the autonomy of the agent. Besides the obvious goal of determining whether the MAS is in a critical state, a system manager typically wants to observe other important aspects of the system. These include the number of agents, their tasks and the task state (e.g. started, completed, aborted). If the MAS supports agent migration (mobile agents), locality also becomes an issue.
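The weighted view of the system state can be illustrated with a small sketch. It assumes, purely for illustration, that each agent's state is reduced to a numeric health score between 0 and 1 and that the system state is the weight-normalized average of these scores; the names `AgentStatus`, `system_health` and `is_critical` and the threshold value are hypothetical and not part of the approach described here.

```python
from dataclasses import dataclass


@dataclass
class AgentStatus:
    agent_id: str
    health: float  # assumed scale: 0.0 = failed, 1.0 = fully healthy
    weight: float  # assumed importance of the agent at this moment


def system_health(statuses):
    """Weight-normalized average of the agents' health scores."""
    total_weight = sum(s.weight for s in statuses)
    if total_weight == 0:
        raise ValueError("at least one agent needs a positive weight")
    return sum(s.weight * s.health for s in statuses) / total_weight


def is_critical(statuses, threshold=0.5):
    """Flag the system as critical when its health drops below the threshold."""
    return system_health(statuses) < threshold
```

Because the weights are inputs rather than constants, the same failed agent can push the system into a critical state at one moment and not at another, which is the effect of self-organization described above.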
3. Some existing management methods
The problem of managing MAS has been recognized in the community as a central issue in the successful deployment of MAS. Consequently a large variety of approaches have been developed. These approaches focus on adding specialized guard/police/sentinel agents to the system that monitor agents and intervene in case of critical states or conflicts. While fairly easy to implement, these agents tend to introduce additional overhead and to be rather invasive since they actively query the agents. Rather than having an agent report its own state, these approaches periodically retrieve data from the agent, resulting in interruptions and additional work. Only one approach, the exception handling approach of Klein, avoids additional costs and can therefore be used as a realistic approach to state detection.
3.1 Exception handling (EH)
The concept of exception handling (EH) was first introduced by Klein [1]. It focuses on providing an exception handling service that is “plugged”, with little or no customization, into existing agent systems. This service can be viewed as a kind of “coordination doctor”. It knows about the different ways multi-agent systems can get “sick”, actively looks system-wide for symptoms of such illnesses, and prescribes specific interventions instantiated for this particular context from a body of general treatment procedures. Thus, the EH service first has to maintain a database of the various symptoms. Once an agent has registered with the EH service, the service holds a set of possible exceptions for this agent based on the task the agent carries out. For example, it is typical for agents to require the output of another agent. The processes for managing such flow dependencies need to make sure that the right thing gets to the right place at the right time (Malone and Crowston 1994). This immediately implies a set of possible failure modes, including an input being late (wrong time), of the wrong type (wrong thing) and so on. The exception handling service communicates with agents using pre-defined languages for learning about the exceptions (query language) and for describing exception resolution actions (action language). These languages represent the medium by which the exception handling service interacts with the problem-solving agents to detect, diagnose and resolve exceptions. The query language is used to get agent state information, and the action language is used to modify it, which includes changing the process model (re-ordering, deleting or adding tasks; changing the resources allocated to a task; canceling tasks) and changing the work package contents. In real time the EH service uses the query language to put questions to the executing processes in order to detect exceptions.
For example, the EH service may ask a special question to test whether an agent is still alive, or it can ask the agents which other agents they are awaiting inputs from in order to detect deadlocks. If an exception is found, the EH service picks an EH plan and executes it by means of the action language. On the agent side, the agents have to implement a minimum set of interfaces so that they can answer the questions from the EH service. This vision is realized by building on the following key innovations:
o Define a clear division of labor. That is, problem-solving agents carry out the tasks as originally designed, while exception-handling agents focus on detecting and resolving exceptions. The exception-handling agents implement a set of standard EH services that can be “plugged” into any MAS framework and/or system with limited customization.
o Develop a standard language for the communication between exception-handling agents and problem-solving agents, so that the problem-solving agents are able to understand the questions asked by the exception-handling agents, and the exception-handling agents are able to formulate questions in this language for different purposes.
o Create an efficient database for detecting and resolving exceptions.
The exception-handling services provide a way to detect critical situations of agents in a multi-agent system. The only major drawback of EH is its reactive nature. It only detects critical situations after they emerge - it does not predict them.
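The deadlock query mentioned above can be sketched as a cycle search over the agents' answers. This is a minimal illustration, not the EH service of [1]: it assumes that the answers to the "which agents are you awaiting input from" question have already been collected into a dictionary, and the function name `find_deadlock` is hypothetical.

```python
def find_deadlock(waiting_on):
    """Return a list of agent IDs that wait on each other in a cycle, or None.

    waiting_on maps an agent ID to the IDs of the agents whose output it is
    currently awaiting (the assumed result of the EH query).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(agent):
        colour[agent] = GREY
        path.append(agent)
        for other in waiting_on.get(agent, ()):
            state = colour.get(other, WHITE)
            if state == GREY:
                # 'other' is already on the current path: a wait cycle exists
                return path[path.index(other):]
            if state == WHITE:
                cycle = visit(other)
                if cycle:
                    return cycle
        path.pop()
        colour[agent] = BLACK
        return None

    for agent in waiting_on:
        if colour.get(agent, WHITE) == WHITE:
            cycle = visit(agent)
            if cycle:
                return cycle
    return None
```

A non-empty result would be the trigger for selecting an EH plan; how the plan is chosen and executed through the action language is outside the scope of this sketch.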
3.2 Events Manager
Instead of having exception-handling agents retrieve data by periodic querying, it seems better to have the agents report on themselves by emitting event messages to an events manager. Possible events in a MAS include creation, migration, state change, deletion and agent resource usage, to name a few. One agent can also report another agent’s failure as an event. Each event has attributes such as agent ID, timestamp and event type. The types of events can be purely application dependent. The purpose is to monitor each agent by knowing its most recent events. The events are sent to the events manager. The events manager does not have to be an agent, but it is “plugged” into the system. The events manager can provide different types of events viewers that allow the events to be organized and displayed in various ways. Similar to the EH approach, agents implement a minimum set of interfaces in order to generate events. The generated events should follow a pattern for a given task. The pattern for carrying out a task correctly should be defined, and it is also useful to know some incorrect patterns for carrying out the task. Failures can then be predicted by checking whether the agent is following a correct pattern for the task. The Travel Agency example in this paper adopts this events method; more detail is given later.
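As an illustration of the pattern idea, the following sketch compares the event types an agent has emitted so far with the expected sequence for a task. The event names and the three-way classification are assumptions made for this example only; a real pattern definition is application dependent and may be richer than a fixed sequence.

```python
# Hypothetical event-type sequence for a correctly handled booking task.
EXPECTED_BOOKING_PATTERN = [
    "request_received",
    "card_check_sent",
    "card_check_ok",
    "confirmation_sent",
]


def check_pattern(observed, expected):
    """Classify the observed event types against an expected pattern.

    Returns "complete" if the whole pattern was observed, "on_track" if the
    observed events are a proper prefix of the pattern, and "deviation"
    otherwise.
    """
    if len(observed) > len(expected) or observed != expected[:len(observed)]:
        return "deviation"
    return "complete" if len(observed) == len(expected) else "on_track"
```

A "deviation" result reported while the task is still running is the kind of early signal that would let a manager react before the task fails outright.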
3.3 EH + EM Overview
In the following we present an approach that combines these two methods. We call it EHEM (EH + EM) (Figure 1).
Figure 1

In this approach, exception-handling agents (3,6,3) are set to monitor the other agents in their process for event messages. They work in the same way as in the EH service. In addition, all agents on each host generate events about their work and send them to the events managers; multiple events managers can be used to avoid a single point of failure. Since there can be a large number of events and only certain events are of interest at any one time, a filter is used when the events manager passes events to the events viewer.

4. Manage According to Events
To test our approach we developed a test MAS in FIPA-OS that models the transactions of a travel agency. We implemented three types of agents: client agent, travel agency agent and bank agent. The client agent makes a reservation with the travel agency agent by providing a destination, price range and credit card number; the travel agency agent sends the credit card number to the bank agent to check the validity of the credit card; the bank agent sends the result of the check back to the travel agency agent; the travel agency agent sends a confirmation to the client agent.

4.1 Agents Communication
As mentioned above, the application was implemented using the FIPA-OS platform, which follows the widely accepted FIPA standard. Agents on this platform communicate with each other using the Message Transport Service (MTS) of FIPA-OS. Before discussing the demo itself we first introduce how the MTS works and how it performs. FIPA-OS provides two basic components, the Directory Facilitator (DF) and the Agent Management Service (AMS). A DF is a mandatory component that provides a yellow pages directory service to agents. An AMS is a mandatory component that is responsible for managing the operation of an agent platform (AP), such as the creation and deletion of agents, and for assigning IDs to new agents [2]. When an agent wants to use a certain service, it can search the DF for this service. The DF returns the agent ID of the agent that provides the service. Knowing which agent provides the right service, the requesting agent can use the MTS to contact that agent. The MTS provides a mechanism for the transfer of FIPA-OS ACL messages between agents. An Agent Communication Language (ACL) message is a data structure that contains performative, sender, receiver, reply-to and content fields.

4.2 Testing the MTS Performance
A central issue for us was the speed and available bandwidth of the MTS. In our first test we wanted to examine the stability and speed of the MTS when exchanging simple messages between two agents. We used a single process that created two FIPA agents, which sent messages to each other. Time was measured on the receiver side as the time taken to receive a certain number of messages (Figure 2).

Figure 2 (plot: seconds vs. number of messages (k))

The results showed that the receiver takes 5 seconds to receive 100 messages. The time increased as more messages were sent. However, in most cases there are multiple processes (pairs of sender and receiver) running on the same platform (possibly across different machines). Tests with 2 and 5 processes running on the same machine were also made, showing that the time increased as we increased the number of agents, which reflects the increased load. It took 8.6 and 27.8 seconds for a receiver to receive 100 messages when there were 2 and 5 processes running, respectively. Figure 3 shows the impact of adding more processes.

Figure 3 (plot: FIPA-OS performance, seconds vs. # of messages, for 1, 2 and 5 processes)

We concluded from these tests that even for a small MAS the additional load of event messages could not be handled by the FIPA-OS framework.

5. Event Message Service
Since the FIPA-OS MTS could not handle the load, we searched for alternative means of enabling the agents to send event messages. Since we wanted an open yet scalable solution, we decided to enable the agents to send event messages using CORBA as middleware. The main advantages of using a distributed computing middleware like CORBA are its platform independence, language independence, high level of abstraction, and performance. The fact that all FIPA-OS platforms are designed to support CORBA also supported our decision to use CORBA. Each event is a CORBA struct consisting of event name, agent ID and timestamp plus an unspecified tuple list to store the problem-specific data. The timestamps are used primarily to ensure that the order of the events can be preserved even if race conditions occur. Figure 4 shows the impact that the number of events had on the delivery speed (events are shown in units of one thousand event messages).

Figure 4 (plot: seconds vs. number of events (k))
5.1 Events Manager
Each agent has to implement our proprietary event interface to ensure that it can communicate with the event managers, which can be seen as a disadvantage. Once the agents have implemented the event interface they can be set to communicate with an events manager process or event manager agent using the relatively fast CORBA method invocations. The events themselves are CORBA structs, which are passed by value, ensuring that each event manager has its own copy of the received event messages. Besides giving us better performance than the MTS, this approach also minimizes the impact of event message routing on the normal message flow. The scalability problems of the MTS forced us to abandon the use of normal messages as a means of event message delivery. The events manager communicates with the agent servant interface via CORBA method calls. Each agent servant is owned by a specific agent and generates events as the agent is working. The servant then sends the generated events through a CORBA connection to the event manager. The events manager acts as a CORBA server that provides function calls for the agent servants; it receives events when a servant invokes a function on it and passes the events in as parameters. This communication structure handles a large traffic load better than the MTS does. After receiving the events, the event manager saves them for visualization and analysis. To avoid a single point of failure, multiple events managers can be used, in which case the agents send their events to all of them.
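The interaction between agent servants and events managers can be sketched without any middleware. The following is a plain in-process stand-in, not the CORBA implementation described here: the class and method names are hypothetical, ordinary method calls take the place of CORBA invocations, and a deep copy imitates the pass-by-value semantics of a CORBA struct.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str          # event name, e.g. "create" or "report"
    agent_id: str      # ID of the emitting agent
    timestamp: float   # used to restore the order of events
    data: list = field(default_factory=list)  # problem-specific (key, value) tuples


class EventManager:
    """Stand-in for an events manager process that stores received events."""

    def __init__(self):
        self._events = []

    def push(self, event):
        # Keep a private copy, imitating a struct that is passed by value.
        self._events.append(copy.deepcopy(event))

    def ordered_events(self):
        # Arrival order may differ from emission order, so sort by timestamp.
        return sorted(self._events, key=lambda e: e.timestamp)


class AgentServant:
    """Stand-in for the servant that emits events on behalf of one agent."""

    def __init__(self, agent_id, managers):
        self.agent_id = agent_id
        self.managers = list(managers)

    def emit(self, name, timestamp, data=None):
        event = Event(name, self.agent_id, timestamp, list(data or []))
        for manager in self.managers:  # send to every manager for redundancy
            manager.push(event)
        return event
```

Sending each event to every manager trades extra traffic for redundancy: any single manager can be lost without losing the event history, which matches the motivation given above.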
5.2 Events Viewer
To view the events we allowed external processes called event viewers to connect to the event manager and to display the event data for a user. We developed the event viewers as thin clients that connect to the event manager processes to retrieve the current data, preprocess it and display it. To simplify the management task for the user we developed two kinds of views: agents view (Figure 5) and events view (Figure 6). In the agents view the events are displayed according to their senders. The agents that have generated events are listed by their IDs, and the events generated by these agents are displayed under each agent’s category. The events are sorted by generation time for each agent. This view can reflect what tasks an agent has been working on and which agents are the busiest in the system.

Figure 5 – Agent View

In the events view, events are displayed according to their types, such as Create Events, Delete Events and Report Events. The individual events generated by each agent are stored under the event categories. A user can click an event to retrieve its details. By using the events view the user can monitor and analyze the system in terms of which kinds of events are generated most frequently and which tasks are most often carried out by the agents. If some events have a specific meaning for the system, such as resource alerts, the manager can make the necessary adjustments to the system on seeing these events.

Figure 6 – Events View

6. Summary
In this paper we discussed the problem of fault-management and presented event messages as a new approach to state detection. Based on our test results with a FIPA-OS application we could show that event messages cannot be implemented using the standard messaging services and that they require a more efficient transport layer. Our preliminary results show that the use of a standard CORBA ORB in combination with one or more event manager processes can allow for continuous monitoring of all activities within a MAS.

7. Future Work
Future work will focus on the development of a JADE implementation of the event message services and on tools for fast event message analysis.
8. REFERENCES
[1] Mark Klein, Chrysanthos Dellarocas, “Exception Handling in Agent Systems”
[2] FIPA Agent Management Specification, XC00023H
[5] FIPA-OS Developer Guide