Improving Fault-Tolerance by Replicating Agents

Alan Fedoruk
Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
[email protected]

Ralph Deters
Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
[email protected]

ABSTRACT

Despite the considerable effort spent on developing multi-agent systems, the actual number of deployed systems is surprisingly small. One of the reasons for the significant gap between developed and deployed systems is their brittleness. The absence of centralized control components makes it difficult to detect and treat failures of individual agents, thus risking fault propagation that can seriously impact the performance of the system. Using redundancy by replication of individual agents within a multi-agent system is one possible approach for improving fault-tolerance. Unfortunately, the introduction of replicates leads to increased complexity and system load. In this paper we examine the use of transparent agent replication, a technique in which the replicates of an agent appear and act as one entity, thus avoiding an increase in system complexity and minimizing additional system load. The paper defines transparent agent replication and identifies the key challenges in using it. Special attention is given to inter-agent communication, read/write consistency, resource locking, results synthesis and state synchronization. An implementation of transparent agent replication for the FIPA-OS framework is presented and the results of testing it within a real-world multi-agent system are shown.

Categories and Subject Descriptors
I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence—Multiagent systems

General Terms
Reliability

Keywords
Replicating Agents, Fault-Tolerance, FIPA-OS

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AAMAS'02, July 15-19, 2002, Bologna, Italy. Copyright 2002 ACM 1-58113-480-0/02/0007 ...$5.00.

1. INTRODUCTION

Multi-agent systems (MASs) are well suited to developing complex, distributed systems. However, as distributed systems, MASs are susceptible to the same faults as any distributed system, such as processor failures, communication link failures or slowdowns, and software bugs. The modular nature of a MAS gives it a certain level of inherent fault-tolerance; however, the non-deterministic nature of the agents, the dynamic environment, the interconnectedness of the agents and the lack of any central control point make it impossible to foresee all possible fault states and make fault-handling behaviour unpredictable. A minor fault in a single agent can propagate through the system and cause the entire system to fail. Much of the experimentation with MASs is done using a closed and reliable agent environment, which does not need to handle faults. When multi-agent systems are deployed in an open environment—where agents from various organizations interact in the same MAS, systems are distributed over many hosts and communicate over public networks—more attention must be paid to fault-tolerance. In an open system, agents may be malicious, poorly designed or poorly implemented; hosts may get overloaded or fail; network connections may be slow or down. For a system to avoid failures it must be able to cope with these types of faults [11, 15, 17]. Incorporating redundant copies of system components, either through hardware or software, is widely used in engineering and in software systems to improve fault-tolerance. The idea is simple: if one component fails, there will be another ready to take over. A common example of the technique in software systems is document replication on the World Wide Web (WWW). Copies of documents are made available on several servers, and as long as at least one of the servers is up at any given time, the document will still be available.

This technique can also help improve document retrieval times; the document on the nearest or fastest server can be chosen for retrieval. However, this added fault-tolerance comes at the price of complexity and system overhead. Keeping all of the various documents updated requires extra communication and extra processing. Once the original document is updated, there will always be a period of time when the copies are out of date. The client is also faced with added work: given a choice of servers to retrieve a document from, how does a client know which to choose? Replicating agents in a MAS provides some of the same benefits as document replication on the WWW, but also creates some of the same problems. Replicated agents can

provide increased fault-tolerance by adding redundancy, and performance gains by adding more control over performance management. However, keeping agents' states synchronized is a greater challenge than keeping documents synchronized, as documents are static entities and agents are active. Communication between a single agent and a group of agents also raises interesting problems: which agent should it communicate with, and what should be done with all of the replies? The remainder of this paper is structured as follows: Section 2 gives background on multi-agent systems, describes faults and failures as they relate to MASs and to agent replication, and summarizes related work; Section 3 presents agent replication, describing what it can be used for, the problems that it introduces and some methods for dealing with those problems, and introduces the idea of transparent replicate proxies; Section 4 describes an implementation of proxies using FIPA-OS [2] and reports one application of that implementation; Section 5 presents conclusions and identifies further work.

Figure 1: A Multi-Agent System (the figure shows agents on three hosts, Host1, Host2 and Host3, connected by a network and communication links and interacting with their environment)

Table 1: Fault Types

Program bugs: Errors in programming that are not detected by system testing.
Unforeseen states: Omission errors in programming; the programming does not handle a particular state, and system testing does not test for this state.
Processor faults: System crash or a shortage of system resources.
Communication faults: Slow downs, failures or other problems with the communication links.
Emerging unwanted behaviour: System behaviour which is not predicted; emerging behaviour may be beneficial or detrimental.

2. MULTI-AGENT SYSTEMS, FAULTS AND FAILURES

There are four essential characteristics of a multi-agent system: a MAS is composed of autonomous software agents, a MAS has no single point of control, a MAS interacts with a dynamic environment, and the agents within a MAS are social (agents communicate and interact with each other and may form relationships). Figure 1 shows a typical MAS. A failure occurs when the system produces results that do not meet the specified requirements [13]. A fault is defined to be a defect within a component of a MAS that may lead to a failure. Faults in a MAS can be grouped into five categories as shown in Table 1 [22]. When a fault does occur in a MAS, interactions between agents may cause the fault to spread throughout the system in unpredictable ways.

Several approaches to fault-tolerance in MASs are documented in the literature; each focuses on different aspects of fault-tolerance but none explores the possibilities of replication. Schneider [19] presents a method for improving fault-tolerance with mobile agents, using the TACOMA mobile agent framework, that focuses on ensuring that replicated agents are not tainted by faulty processors or malicious entities as agents move from host to host. Kumar et al. [15] present a methodology for using persistent teams to provide redundancy for system-critical broker agents within a MAS. The teams are similar to the replicate groups presented here, but replicate group members all perform the same task while team members do not. Hanson and Kephart [12] present methods for combating maelstroms in MASs. Maelstroms are chain reactions of messages that continue cycling until the system fails; the proposed technique is specific to this one type of fault. Klein [14] developed sentinels, agents that observe the running system and take action when faults are observed. The design goals of this system are to take the burden of fault-tolerance off the agent developer and have the agent infrastructure provide the fault-tolerance. Klein's work focuses on observing agent behaviour, diagnosing the possible fault and taking appropriate remedial action. Toyama and Hager [20] divide system robustness into two categories, ante-failure and post-failure. Ante-failure robustness is the traditional method—systems resist failure in the face of faults. Post-failure robustness deals with recovery after failure. Marin et al. [16, 10] present the DARX framework, a scheme to provide adaptive object replication. Tasks within a MAS are replicated and are able to switch between active and passive replication at run-time. The focus of this work is on providing adaptive replication, and on only using replication for tasks that are deemed to be critical. This work uses a proxy-like structure to manage the replicate group. The approaches listed above can be classified as either agent-centric or system-centric. Agent-centric approaches build fault-tolerance into the agents; this has the drawback of increasing the complexity of the agent. System-centric approaches move the monitoring and fault recovery into a separate software entity.
This increases the number of agents in the system, but has the advantage of being able to detect system wide faults. For example, agents may individually be functioning correctly, but the interaction they are engaged in may be faulty. A technique which uses aspects of both agent-centric and system-centric fault-tolerance is given in [17] which describes DaAgent—A Dependable Mobile Agent System. This agent

platform is distinguished from others by having dependability as a key design principle. The platform is geared toward providing mobile agents—in fact, agents are defined by their mobility in this work. The platform has several features similar to sentinels, called AgentHome, Watchdogs, and AgentConsultants, which monitor the agents and ensure that they are able to reach their destinations reliably and continue to run. Failure recovery is provided by state checkpointing, which provides a means of restoring an agent. Agent replication also uses aspects of both agent-centric and system-centric approaches. The redundant replicates and any proxies are external to the original agent, a system-centric approach, but individual agents must have the capability to utilize replication.

Table 2: Replication Types

Homogeneous (deterministic): Agents in a replicate group are identical and deterministic. Fault types covered: processor, communications.
Homogeneous (non-deterministic): Agents in a replicate group are identical and non-deterministic. Fault types covered: bugs, unforeseen states, processor, communications.
Heterogeneous: Agents in a replicate group are not identical, but are designed to perform the same function. Fault types covered: bugs, unforeseen states, processor, communications.

3. AGENT REPLICATION

3.1 Definition of Agent Replication

Agent replication is the act of creating one or more duplicates of one or more agents in a multi-agent system. Each of these duplicates is capable of performing the same task as the original agent. The group of duplicate agents is referred to as a replicate group and the individual agents within the replicate group are referred to as replicates. Once a replicate group is created in a MAS, if that replicate group is visible to the rest of the MAS, there are several ways that agents can interact with it:

• An agent can send requests to each replicate in turn until it receives an appropriate reply.
• An agent can send requests to all replicates and select one, or synthesize the replies that it receives.
• An agent can pick one of the replicates based on particular criteria (speed, reliability, etc.) and interact only with that agent.

All of the replication schemes presented here are variations on these ideas. Related work can be found in literature dealing with object group replication [9], N-Version Voting [4], distributed systems and distributed databases [5, 3]. Agent replication is an extension of object group replication and many of the same challenges are faced when using agent replication. Distributed systems and databases provide some of the basic techniques for solving transaction and concurrency issues. There are two basic types of agent replication: heterogeneous and homogeneous. In heterogeneous replication, the replicates created are functionally equivalent, but individual replicates may have been implemented separately. In homogeneous replication, replicates are exact copies of the original agent—the replicates are not only functionally equivalent but are copies of the same code. Agents can be of two basic types: deterministic or non-deterministic. Agents interact with their environment in some way.
When creating replicates, the replicates may not all have identical environments: some may be on fast processors, some may have access to certain databases and some may be on unreliable hosts. When homogeneous replication is used with deterministic agents and an agent encounters a program error or an unforeseen state, all of the agents in the replicate group will encounter the same bug or unforeseen state. Replication

will not help in this case. If the agents are non-deterministic, or the percepts received from their environments differ, then not all of the replicates will be in the same state, so they may not all encounter the program bug or unforeseen-state fault; in this case, replication will increase fault-tolerance. If members of the replicate group are running on more than one processor, replication will increase resistance to processor and communication faults. One of the processors may be overloaded or have failed while others have not, so the agents on those other processors can continue running. Communication faults may only be occurring along certain paths, so agents on other paths are able to communicate. If heterogeneous replication is used, fault-tolerance to bugs and unforeseen states will be increased. Two separately programmed agents may not have the same bugs, so a heterogeneous replicate group may be able to continue functioning even if some of the replicates encounter a bug and fail. The case is similar for unforeseen states: one implementation may continue given a certain system state while another implementation fails. This property can be exploited when testing new versions of an agent. A replicate group is created which consists of current-version agents and new-version agents. If the new agents encounter a bug and fail, the current-version agents are still in place and the system will continue to operate. This concept is an extension of work done on N-Version voting [4]. Heterogeneous replication increases fault-tolerance to processor and communication faults in the same fashion as homogeneous replication.
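The interaction patterns listed in Section 3.1 can be sketched as follows. This is an illustrative sketch in plain Java rather than FIPA-OS code; the class `ReplicateInteraction` and its method names are hypothetical, and a replicate is modelled simply as a function from request to reply that throws when it fails.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class ReplicateInteraction {

    // Pattern 1: send the request to each replicate in turn until one replies.
    static <Q, R> Optional<R> firstToReply(List<Function<Q, R>> group, Q request) {
        for (Function<Q, R> replicate : group) {
            try {
                R reply = replicate.apply(request);
                if (reply != null) return Optional.of(reply);
            } catch (RuntimeException failure) {
                // this replicate failed; fall through to the next one
            }
        }
        return Optional.empty();
    }

    // Pattern 2: send the request to all replicates and collect every reply;
    // the caller must then select one reply or synthesize a single result.
    static <Q, R> List<R> askAll(List<Function<Q, R>> group, Q request) {
        List<R> replies = new ArrayList<>();
        for (Function<Q, R> replicate : group) {
            try {
                replies.add(replicate.apply(request));
            } catch (RuntimeException failure) {
                // skip replicates that fail
            }
        }
        return replies;
    }

    // Pattern 3: rank replicates by some criterion (speed, reliability, ...)
    // and interact only with the best one.
    static <Q, R> R askBest(List<Function<Q, R>> group,
                            Comparator<? super Function<Q, R>> criterion, Q request) {
        return group.stream().min(criterion).orElseThrow().apply(request);
    }

    public static void main(String[] args) {
        List<Function<String, String>> group = List.of(
            q -> { throw new RuntimeException("replicate B0 is down"); },
            q -> "reply-from-B1",
            q -> "reply-from-B2");
        System.out.println(firstToReply(group, "help?").orElse("no reply")); // reply-from-B1
        System.out.println(askAll(group, "help?")); // both surviving replies
    }
}
```

Pattern 1 hides failures entirely from the caller; pattern 2 exposes them but requires results synthesis, which Section 3.2.1 discusses.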

3.2 Key Challenges Raised by Agent Replication

The key challenges raised by agent replication were investigated by making replication transparent to the rest of the MAS; that is, an agent in the system should not know it is interacting with a replicate group.

3.2.1 Inter-Agent Communication and Results Synthesis

Once an agent has been replicated, a replicate group created, and all the replicates become active, communication within the MAS becomes problematic. For example, suppose AgentA wants to interact with AgentB, which has been replicated and is now a replicate group consisting of replicates B0, B1, ..., BN. As MASs can be made up of agents from different developers and organizations, AgentA may not know that AgentB

is part of a replicate group, and if AgentA is operating in an open environment, why should it? If AgentA wants to communicate with AgentB, it can do one of two things: it can choose one of AgentB's replicates based on some criteria and carry on the conversation with that agent, or it can initiate a conversation with all of the replicates. In the latter case AgentA will have to be equipped to deal with multiple replies. If AgentB wants to communicate with AgentA, similar problems arise: either the replicate group must somehow select which of the replicates will initiate the conversation, or they all initiate the conversation and let AgentA deal with multiple incoming messages. Results synthesis is the act of taking a set of results and creating a single result from it. For certain types of results (arithmetic), it may be possible to use mathematical aggregation functions such as averages and means to create one answer. For other types of results, voting may be possible: whichever reply is in the majority is the one used. There is no general results synthesis algorithm that can be applied in all cases.
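The two synthesis strategies mentioned above, aggregation for arithmetic results and voting for discrete ones, can be sketched as follows; `ResultsSynthesis` and its method names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ResultsSynthesis {

    // Majority voting: return the reply given by more than half the
    // replicates, or empty if no reply has a strict majority.
    static <R> Optional<R> majorityVote(List<R> replies) {
        Map<R, Integer> counts = new HashMap<>();
        for (R r : replies) counts.merge(r, 1, Integer::sum);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() > replies.size() / 2)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    // Arithmetic aggregation: reduce numeric replies to their mean.
    static double mean(List<Double> replies) {
        return replies.stream().mapToDouble(Double::doubleValue)
                .average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // four replicates say "closed", two say "open": the majority wins
        System.out.println(majorityVote(List.of("closed", "open", "closed",
                                                "closed", "open", "closed")));
        System.out.println(mean(List.of(2.0, 4.0, 6.0)));
    }
}
```

Note that neither strategy generalizes: voting fails when no reply holds a majority, which is exactly the valve-A dilemma discussed under write consistency below.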

3.2.2 Read/Write Consistency

Agents always interact with their environments [18]. When a replicate group exists within a MAS, each replicate in the group needs to get percepts from its environment. Each replicate may perceive different data from the environment due to differing processor capacities, processor loads and varying delays in communication. Borrowing a term from database research, this phenomenon will be referred to as read consistency [6]. At one end of the scale, each replicate gets a different value for some piece of data that is needed to complete its task, and each arrives at different results. At the other end of the scale, for a value that is static in the environment, all replicates get the same data and produce the same results. Write consistency is closely related to read consistency. When agents are getting percepts from their environments, it is expected that they will act upon the environment in some way. Whether the action taken is writing to a file or a database, or taking some physical action like turning on a light or opening a valve, all actions may be considered to be writing to the environment. Since the replicates in a replicate group are all doing the same task, they will all want to write the same data (maybe even at the same time) to the same place. Consequently, this can lead to data inconsistency problems. In the classic database example, AgentA reads a value, AgentB reads the same value, AgentA calculates a new value by incrementing, and writes it back. At the same time, AgentB also increments the value and writes it; however, one of the increments will have been missed. In another example, in a replicate group of six replicates, two replicates determine that the action to take is to open valve-A while four replicates determine that valve-A should remain closed. What action will be taken? What action should be taken?
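The lost-update interleaving in the classic database example above can be replayed deterministically. `LostUpdate` and its method names are hypothetical; the serialized variant shows the schedule a write-serializing proxy would enforce:

```java
public class LostUpdate {

    // Interleaved schedule: both replicates read before either writes.
    static int interleaved(int initial) {
        int readByA = initial;   // AgentA reads the value
        int readByB = initial;   // AgentB reads the same value
        int value = readByA + 1; // AgentA writes its increment
        value = readByB + 1;     // AgentB overwrites it: A's increment is lost
        return value;
    }

    // Serialized schedule: each read-increment-write completes before the
    // next begins, so both increments survive.
    static int serialized(int initial) {
        int value = initial;
        value = value + 1;       // AgentA: read, increment, write
        value = value + 1;       // AgentB: read, increment, write
        return value;
    }

    public static void main(String[] args) {
        System.out.println(interleaved(10)); // 11 -- one increment lost
        System.out.println(serialized(10));  // 12 -- both increments applied
    }
}
```

Serializing writes resolves the lost update, but not the valve-A disagreement, which is a results-synthesis problem rather than a scheduling one.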

3.2.3 State Synchronization

Agent state is a central issue raised by agent replication. If the states of all members of a replicate group are kept synchronized, read and write consistency become less of a problem; all the agents will arrive at the same results and therefore will not produce inconsistencies. When a member

of a replicate group fails and one of the other replicates takes over, the replacement has to know the state of the failed agent in order to take over where the other left off. Different agent architectures will have different definitions of state. To synchronize states, agents must have the capability to read their current state and to set their state to a given one, while maintaining consistency. Setting the state requires that the agent's current state be replaced with another. For an inactive agent, this can be done in a straightforward manner, but if the agent is running, care must be taken not to disrupt current operation. When getting the state from a running agent, care must be taken to ensure that the state is consistent and that all pending operations that may affect the state have completed. In database research, a transaction is defined as a series of operations that take a database from one consistent state to another [5]. This concept can be applied to agents: only committed data should be included in the returned state. The database literature has many techniques for locking, rollback, and committing, which can be used by an agent to ensure that consistent states are returned. The FIPA-OS [2] architecture, with its concept of Tasks, is well suited to this idea; Tasks are responsible for leaving the agent in a consistent state. When homogeneous replication is employed, each agent state will be defined in the same manner. However, heterogeneous replication makes state synchronization more difficult: the heterogeneous agents must be able to read and write their states in a format acceptable to all agents. If a history of agent states is kept, it becomes possible to roll back and roll forward. This idea is expressed by Toyama and Hager in [20]: if a certain state leads to failure, return to an earlier state, take a different path and try again.
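A minimal sketch of this transaction-style state handling, assuming a state representable as a single string: `getState` returns only committed data, and a history stack enables the rollback idea of Toyama and Hager. `AgentState` and its methods are hypothetical, not the FIPA-OS Task API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AgentState {
    private String committed = "initial"; // last consistent state
    private String working = "initial";   // in-progress, possibly inconsistent
    private final Deque<String> history = new ArrayDeque<>();

    // Mutations during a task touch only the working copy.
    void update(String newState) { working = newState; }

    // A task ends by committing, taking the agent from one consistent state
    // to the next and recording the old one for rollback.
    void commit() {
        history.push(committed);
        committed = working;
    }

    // Only committed data is ever returned for synchronization.
    String getState() { return committed; }

    // Replace the current state with a snapshot from the active replicate.
    void setState(String snapshot) { committed = snapshot; working = snapshot; }

    // If the current state leads to failure, return to an earlier one.
    void rollback() {
        if (!history.isEmpty()) { committed = history.pop(); working = committed; }
    }

    public static void main(String[] args) {
        AgentState agent = new AgentState();
        agent.update("answering-query-17");   // uncommitted work...
        System.out.println(agent.getState()); // ...is not visible: "initial"
        agent.commit();
        System.out.println(agent.getState()); // "answering-query-17"
        agent.rollback();
        System.out.println(agent.getState()); // back to "initial"
    }
}
```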

3.3 Replicate Group Proxies

A transparent replicate group proxy acts as an interface between the replicates in a replicate group and the rest of the MAS. Proxies provide two important functions: they make the replicate group appear to be a single entity, and they control execution and state management of the replicate group. Using proxies means that replication is transparent to agent developers; no extra work is required to interact with a replicate group. Figure 2 illustrates a heterogeneous replicate group (replicates B0 through B4 fronted by B Proxy), the agents' environment, and another agent A interacting with the replicate group. A only sees B Proxy and not any of the replicates.

3.3.1 Communication Proxy

The communication proxy handles all communication between replicates and other agents in the MAS. To agents outside the replicate group, the replicated agent appears and acts as a single agent. This introduces five new issues.

1. The communication proxy will need to perform results synthesis. If there are N replicates in the replicate group, the proxy may get as many as N different results. Logic to determine which result, or combination of results, to use must be included in the proxy.

2. The proxy introduces another agent into the system, increasing resource usage.

3. The number of messages that the agent infrastructure must pass doubles. Instead of one message going from A to B, a message goes from A to Proxy to B.

4. The proxy will have to cope with multiple conversations.

5. The proxy is a single point of failure. Three ways to avoid this are available: a backup copy of the proxy can be created and periodically synchronized to the state of the active proxy; the proxy may be serialized and placed in persistent storage from time to time and recreated as needed; or a hierarchy of backup proxies may be used (if the first does not respond, try the next, and so on).

Figure 2: Replicate Group Structures and Proxies (the figure shows agent A communicating with B Proxy, which fronts replicates B0 through B4 and their environment)

3.3.2 Data Proxy

Data proxies are closely related to communication proxies and are used to address the problems of read and write consistency. In the case where the agents in a MAS are communicating-only agents, data proxies and communication proxies are one and the same. Data proxies sit between the replicates and the environment. A data proxy can ensure that all replicates receive the same percepts, and can handle results synthesis to ensure write consistency. However, this introduces new issues. Like a communication proxy, a data proxy is a single point of failure, and it needs to be able to deal with multiple conversations and results synthesis. Data proxies may also need to interact with a variety of environments, for example, reading a sensor, querying a database, or lighting a signal. Finally, data proxies introduce a delay between the agents and the environment; this delay may mean that the agents get out-of-date data.
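The read-consistency role of a data proxy can be sketched as a single environment read fanned out unchanged to all replicates. `DataProxy` is a hypothetical name and the sensor is simulated:

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class DataProxy {
    private final Supplier<Double> sensor; // the one read path into the environment

    DataProxy(Supplier<Double> sensor) { this.sensor = sensor; }

    // Read the environment once and hand the identical percept to every
    // replicate, so all group members compute from the same input.
    List<Double> distributePercept(int replicates) {
        double percept = sensor.get();
        return Collections.nCopies(replicates, percept);
    }

    public static void main(String[] args) {
        // a drifting sensor: direct reads by each replicate would differ
        final double[] reading = {20.0};
        DataProxy proxy = new DataProxy(() -> reading[0] += 0.1);
        System.out.println(proxy.distributePercept(3)); // three identical percepts
    }
}
```

The single shared read is what buys read consistency; it is also the source of the delay noted above, since replicates see the environment only as recently as the proxy's last read.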

3.3.3 Replicate Group Management

The other role that proxies can fill is management of the replicate group. Through various replicate group management policies, some of the replication issues can be dealt with. Many of the issues with read/write consistency are only a problem when more than one of the replicates is active at any given time. If only one replicate is active, we are back to the same scenario as if no replication was used. Three replication management policies are identified: hot-standby, cold-standby, and active. In hot-standby, the proxy selects one of the replicates as the active replicate. The other replicates are placed in a dormant mode. Periodically, the state of the active replicate is transferred to each of the dormant replicates. If the active replicate fails, the proxy will detect

the failure and select a new replicate to be active. As the new active replicate has been getting state updates, it can start processing immediately, resuming where the previous agent left off. The proxy is handling communication for the replicate group, and any messages are now routed to the new active agent. In cold-standby, the proxy selects one of the replicates as the active replicate and the other replicates are again dormant. The difference is that the state of the active replicate is stored in the proxy, but not transferred to the dormant replicates. When the active replicate fails, the proxy will detect the failure, select a new replicate to be active and transfer the current state to the new active replicate. Cold-standby will have slower switch-over times but less overhead while the system is running. Hot and cold standby both have the advantage that results synthesis is not required and read and write consistency problems are avoided. Active replication has all of the replicates active at the same time. The proxy must deal with results synthesis and ensure that reads and writes from the replicates do not result in any inconsistencies. Communication and data proxies can be used to control the state synchronization of an active replicate group. Since the two proxies effectively control all input from the outside world to the agents within the replicate group, the proxies can control the timing of data and message arrival, ensuring that all replicates receive and process the data at the same time. This will keep the replicates synchronized; however, this approach also implies that the replicate group can only proceed at the rate of the slowest group member.
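The difference between hot- and cold-standby can be sketched as follows. `StandbyProxy` is hypothetical and agent state is reduced to a string; in hot-standby every dormant replicate already holds the last synchronized state, while in cold-standby the proxy holds it and transfers it only at switch-over:

```java
import java.util.ArrayList;
import java.util.List;

public class StandbyProxy {
    static class Replicate {
        String state = "";
        boolean alive = true;
    }

    private final List<Replicate> group = new ArrayList<>();
    private int active = 0;
    private String proxyHeldState = ""; // used only by cold-standby
    private final boolean hot;

    StandbyProxy(boolean hot, int size) {
        this.hot = hot;
        for (int i = 0; i < size; i++) group.add(new Replicate());
    }

    // Periodic state push from the active replicate.
    void syncState(String state) {
        if (hot) {
            for (Replicate r : group) r.state = state; // update every replicate now
        } else {
            group.get(active).state = state;
            proxyHeldState = state;                    // the proxy keeps it instead
        }
    }

    // On detected failure: promote the next reachable replicate.
    Replicate failover() {
        group.get(active).alive = false;
        for (int i = 0; i < group.size(); i++) {
            if (group.get(i).alive) {
                active = i;
                if (!hot) group.get(i).state = proxyHeldState; // cold: transfer now
                return group.get(i);
            }
        }
        throw new IllegalStateException("no reachable replicate");
    }

    public static void main(String[] args) {
        StandbyProxy cold = new StandbyProxy(false, 3);
        cold.syncState("processed-42-requests");
        System.out.println(cold.failover().state); // state transferred at switch-over
    }
}
```

The cost trade-off described above is visible here: hot-standby pays for N state writes on every sync, cold-standby for one extra transfer on every failover.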

3.3.4 Performance Management

Proxies can be used to improve the performance management of the replicate group. In hot and cold standby, when the proxy is choosing a new active agent it can make a decision designed to meet a system goal. If the proxy knows something of the characteristics of the various replicates (either by being informed of the characteristics or by observing and learning them), that knowledge can be used in the selection process. For instance, the fastest or most reliable replicate can be chosen.

4. IMPLEMENTING AGENT REPLICATION WITH PROXIES

4.1 A Replication Server

The purpose of the replication server implementation is threefold: to apply the agent replication technique using proxies and obtain a measure of its effectiveness in improving the system failure rate; to gauge the added complexity and resource usage incurred by replication; and to provide an infrastructure for further experimentation. The FIPA-OS [2] agent toolkit was chosen as the platform for the implementation: FIPA-OS is widely used, it implements the FIPA [1] standard, and it is open-source Java code. The replication server implements transparent replication with the following features: a) a communication proxy; b) hot and cold standby replication; c) homogeneous and heterogeneous replication;

d) state synchronization within a replicate group.

Figure 3: Replication Architecture (the figure shows replication servers RepServer0 and RepServer1 hosting replicate groups RepGroupA, RepGroupB and RepGroupC, with replicates A0 through A2 and B0 through B2)

Figure 4: Simplified I-Help MAS (the figure shows a MatchMaker agent connected to several PersAgents and two UIAgents)

The replication server functions as follows:

a) A replicate group is created.
b) Agents are created within the replication server that created the replicate group, in another replication server, or as a separate process (see Figure 3).
c) Agents register with a replicate group and are placed in the role of either dormant or active.
d) Periodically, the replication server checks that the active agent in each replicate group is still reachable; if the active agent is deemed unreachable, a new active agent is chosen from the remaining replicates.
e) Periodically, the active agent sends its state to the replication server. The replication server, in turn, stores this state and distributes it to the other replicates in the group.
f) All messages going to and from a replicate group are funneled through the replicate group message proxy.

The replication server is implemented as a standard FIPA-OS agent, RepServer. A running MAS can have many RepServer agents. Each RepServer agent consists of one or more replicate groups (RepGroups) and provides replicate group management services for those groups. Each RepGroup consists of a list of the agents that make up the group, a message proxy agent, a reference to the currently active agent or agents, and a stack of past agent states. The list of agents consists of AgentIDs, a flag to indicate whether or not the last contact with each agent was successful, and a time-stamp to indicate when each agent's state was last updated. When a group is created, the message proxy registers with the platform Agent Management Service and Directory Facilitator agents; the members of the replicate group do not. When state information is sent to the replication server, the state is placed on top of a stack within the appropriate RepGroup. Currently, the most recent state is always used for synchronizing replicates.
However, having a stack of states allows other policies to be implemented—if the current state leads to a failure, it could be advantageous to start the new active agent with one of the previous states.

The stack of states can be moved to persistent storage and used for recovery when the RepServer fails. A replication agent shell (RepAgent) is to be used by agent developers. RepAgent implements the structure and functions that an agent requires to participate as part of a replicate group. The agent developer is responsible for creating methods to get and set an agent's state and for ensuring that state information is consistent. A replication server accepts requests to perform the following functions: create a replicate group, register an agent with an existing replicate group, and create an agent. For each replicate group, the replication server will periodically ping the active agent and, if it does not respond, select a new active agent. Currently, the first agent that is found to be reachable is chosen as the next active agent. Other policies for choosing the next active agent can be implemented, such as choosing the most up-to-date agent or choosing the fastest agent to reply to an inquiry. The active agent drives state synchronization: the agent decides when its state has changed and when it is consistent. The state is pushed to the replication server and the other replicates in the group are updated as appropriate. In the case of hot-standby, the replicates are kept as up to date as possible; in cold-standby, the replicates are only updated when they become active. Many replication servers can exist within a MAS; a meta-replication server can be created to allow requests for the creation of replication groups to be distributed in a balanced way over all available replication servers. The replication server allows users to create transparent replicate groups for use in a MAS built on the FIPA-OS platform.
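The periodic reachability check and the first-reachable selection policy described above can be sketched as follows. `RepGroupMonitor` is a hypothetical simplification of the RepServer logic, with ping results supplied by the caller rather than obtained over FIPA-OS messaging:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RepGroupMonitor {
    // AgentID -> did the last contact with this agent succeed?
    private final Map<String, Boolean> reachable = new LinkedHashMap<>();
    private String active;

    void register(String agentId) {
        reachable.put(agentId, true);
        if (active == null) active = agentId; // first registrant starts active
    }

    // One monitoring cycle: record the ping result for the active agent and,
    // if it is unreachable, promote the first reachable replicate (the
    // simple policy the implementation currently uses).
    String checkCycle(boolean activePingOk) {
        reachable.put(active, activePingOk);
        if (!activePingOk) {
            active = reachable.entrySet().stream()
                    .filter(Map.Entry::getValue)
                    .map(Map.Entry::getKey)
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException("group has failed"));
        }
        return active;
    }

    public static void main(String[] args) {
        RepGroupMonitor group = new RepGroupMonitor();
        group.register("A0");
        group.register("A1");
        group.register("A2");
        System.out.println(group.checkCycle(true));  // A0 still active
        System.out.println(group.checkCycle(false)); // A0 down, A1 promoted
    }
}
```

Swapping in a different selection policy (most up-to-date, fastest to reply) would only change the ordering applied before `findFirst`.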

4.2 Application

A simplified version of the I-HELP MAS [21] is implemented as a test bed for the replication server. I-HELP is a peer help system currently in use at the University of Saskatchewan; its implementation is susceptible to failures caused by individual agents failing, particularly the matchmaker agent [7]. I-HELP allows student users to find appropriate peer helpers. Each user of the system has a personal agent that maintains a small database of information about its user, such as the topics the user is competent in, the identity of the user, and whether or not the user is currently online and willing to help others. Each user communicates with his or her personal agent through a user interface agent. A matchmaker agent maintains a list of all personal agents within the system. When a user initiates a request for help, the request is sent to the matchmaker agent, which broadcasts it to all known personal agents, assembles the replies, and returns them to the requesting agent.
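This matchmaking flow can be sketched roughly as follows. The names are invented for illustration, and plain method calls stand in for the FIPA ACL messages the real agents exchange asynchronously.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the I-HELP broadcast-and-gather matchmaking flow.
public class MatchmakerDemo {
    // Stand-in for a personal agent's small database about its user.
    static class PersonalAgent {
        final String user;
        final List<String> topics; // topics the user is competent in
        final boolean online;      // online and willing to help
        PersonalAgent(String user, List<String> topics, boolean online) {
            this.user = user; this.topics = topics; this.online = online;
        }
        // A positive reply requires competence and willingness.
        boolean canHelp(String topic) {
            return online && topics.contains(topic);
        }
    }

    // The matchmaker broadcasts the request to every known personal
    // agent and assembles the replies for the requesting agent.
    static List<String> findHelpers(List<PersonalAgent> known, String topic) {
        List<String> helpers = new ArrayList<>();
        for (PersonalAgent pa : known)
            if (pa.canHelp(topic)) helpers.add(pa.user);
        return helpers;
    }

    public static void main(String[] args) {
        List<PersonalAgent> known = List.of(
            new PersonalAgent("ann", List.of("java", "sql"), true),
            new PersonalAgent("bob", List.of("java"), false),
            new PersonalAgent("cal", List.of("java"), true));
        System.out.println(findHelpers(known, "java")); // [ann, cal]
    }
}
```

Because every request fans out through the single matchmaker, that agent is the natural candidate for replication, which is why the replicated version of the test bed replicates it.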

This application is implemented in two versions. The first uses standard FIPA-OS agents with no replication. The second uses replicated agents for the personal and matchmaker agents, with replicate groups managed using hot-standby management. As this is a closed and reliable environment, the replicated agents have been given a built-in failure rate to simulate an open environment; for these experiments, the failure rate was set at 3 failures per hundred message arrivals. The application is implemented in FIPA-OS 2.1.0 and Java Version 1.3.0. All tests are performed on a Sun SunFire 3800 with 4 UltraSPARC III CPUs running at 750MHz and 8GB of RAM, running Solaris 2.8. The evaluation consists of four tests:

1. The non-replicated version of the system ran with 1 matchmaker agent, 2 user interface agents and a variable number of personal agents, from 2 to 64.

2. The replicated version of the system ran with 1 matchmaker agent, 2 user interface agents and a variable number of personal agents, from 2 to 64. Each replicate group consisted of three agents.

3. The replicated version of the system ran with 1 matchmaker agent, 2 user agents and 4 personal agents, with the number of replicates per replicate group varying from 3 to 28.

4. The agent failure rate was varied from 5% to 90%.

In all experiments, the measured variable was the time required for the system to respond to a request. Increases in response time indicate increased use of system resources, giving a comparative measure of added overhead and complexity. If no response is returned, a failure is assumed. For each configuration, a number of requests were performed and the average response time was recorded; averaging removes variation due to underlying system or network load. In these tests, each non-replicated agent ran as a separate process, and each replicated agent used a separate replication server (a single process). The tests therefore all use similar numbers of processes and are more directly comparable.

4.3 Discussion

Experiment 1 (see Figure 5) isolates the performance effects of the underlying host processor and the FIPA-OS platform. The results show a linear increase in response time as more personal agents are added to the system. This result is expected: the number of messages passed increases directly with the number of personal agents in the system.

Figure 5: Response Time vs. Number of Personal Agents

Experiment 2 (see Figure 5) determines the amount of overhead replication adds to a system. The replicated version shows a consistent increase in response time of roughly 1.25 times over the case without replication. The message proxy setup and the increased number of agents account for this: the use of a message proxy doubles the number of messages that the agent platform must deliver.

Experiment 3 (see Figure 6) illustrates the effect of increasing the replicate group size on system performance. The observed increase is very small. Increasing the number of replicates increases the system load by adding extra agents. However, since hot-standby replication is being used, the extra agents are mostly idle; the extra overhead comes mainly from updating the standby replicates.

Figure 6: Response Time vs. Number of Replicates per Group, 4 Personal Agents

Experiment 4 illustrates the effectiveness of replication at varying fault rates. For lower fault rates (< 50%), few replicates are needed to decrease the failure rate to near zero. Even with a failure rate as high as 50%, only 10 replicates are needed to lower the group failure rate to 1:1024, a reliability of 99.90234375%. Assuming agent failures to be independent events, the probability of a group failure is equal to the product of the probabilities of failure of the individual agents. At very high failure rates (90%) several agents can fail before one gets any data loaded (the UIAgent registered), so measuring the failure rate is difficult. However, with enough replicates the reliability will still be increased, and as Experiment 3 showed, adding more replicates is not costly.

Much of the overhead observed when using replication comes from the increased number of messages passed: a series of messages is needed each time the state of a replicate is updated and when a switch of active agent occurs. Performance improvements can be realized by using a direct link between the replicates and their proxy rather than passing messages through the agent message transport sub-system [8].
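The independence argument used in the discussion of Experiment 4 can be checked numerically. With per-agent failure probability p and n replicates in a group, the group fails only if all n replicates fail, i.e. with probability p^n; the small sketch below reproduces the 10-replicate, 50%-failure-rate example.

```java
// Numeric check of the independence argument: a replicate group of n
// agents, each failing independently with probability p, fails as a
// whole only when every member fails.
public class ReliabilityDemo {
    static double groupFailure(double p, int n) {
        return Math.pow(p, n);
    }

    public static void main(String[] args) {
        double pf = groupFailure(0.5, 10); // the 50% case from Experiment 4
        System.out.println("group failure:  " + pf);             // 1/1024
        System.out.println("reliability %:  " + 100 * (1 - pf)); // 99.90234375
    }
}
```

The exponential decay of p^n is why even high per-agent failure rates can be masked with a modest number of replicates, provided the failures really are independent.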

5. CONCLUSION AND FUTURE WORK

This paper introduced the topic of agent replication and examined the issues associated with using agent replication in a multi-agent system. Transparent proxies were introduced as a method of dealing with the main issues of agent communication, read/write consistency and state synchronization. A replication server based on FIPA-OS was presented, and a simple application was implemented and tested with it. Results showed that replication imposed an acceptable degree of overhead on the system while improving system reliability.

Future research will focus on distributing the proxy function in an attempt to alleviate the single point of failure imposed by the proxy. An agent teamwork approach will be used: the replicate group will act as a team with one member working as the proxy, and if the proxy fails, another team member will take on the proxy role. Test systems will be built in more open environments, using a variety of host types (large servers, desktop computers, laptops, handheld and wireless devices).

6. SOFTWARE

Full source code for all applications discussed is available for download from http://www.cs.usask.ca/grads/amf673/RepServer.

7. REFERENCES

[1] FIPA, Foundation for Intelligent Physical Agents. http://www.fipa.org/, 2001.
[2] FIPA-OS. http://fipa-os.sourceforge.net/, 2001.
[3] G. R. Andrews. Multithreaded, Parallel, and Distributed Programming. Addison-Wesley, 2000.
[4] A. Avizienis. The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering, pages 1491–1501, Dec. 1985.
[5] T. Connolly and C. Begg. Database Systems: A Practical Approach to Design, Implementation, and Management, Second Edition. Addison-Wesley, 1999.
[6] C. J. Date. An Introduction to Database Systems, Seventh Edition. Addison-Wesley, 2000.
[7] R. Deters. Developing and deploying a multi-agent system. In Proceedings of the Fourth International Conference on Autonomous Agents, 2000.
[8] A. Fedoruk and R. Deters. Using agent replication to enhance reliability and availability of multi-agent systems. In Proceedings of the Fifteenth Canadian Conference on AI (to appear), Calgary, Canada, May 27–29, 2002.
[9] P. Felber. A Service Approach to Object Groups in CORBA. PhD thesis, École Polytechnique Fédérale de Lausanne, 1998.
[10] Z. Guessoum, J.-P. Briot, P. Sens, and O. Marin. Toward fault-tolerant multi-agent systems. In MAAMAW'2001, Annecy, France, May 2001.
[11] S. Hägg. A sentinel approach to fault handling in multi-agent systems. In Proceedings of the Second Australian Workshop on Distributed AI, in conjunction with the Fourth Pacific Rim International Conference on Artificial Intelligence (PRICAI'96), Cairns, Australia, August 1996.
[12] J. E. Hanson and J. O. Kephart. Combatting maelstroms in networks of communicating agents. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on Innovative Applications of Artificial Intelligence. AAAI Press/MIT Press, 1999.
[13] J. Musa, A. Iannino, and K. Okumoto. Software Reliability: Measurement, Prediction, Application. McGraw-Hill, 1987.
[14] M. Klein and C. Dellarocas. Exception handling in agent systems. In O. Etzioni, J. P. Müller, and J. M. Bradshaw, editors, Proceedings of the Third International Conference on Autonomous Agents (Agents'99), pages 62–68, Seattle, WA, 1999. ACM Press.
[15] H. J. Levesque, S. Kumar, and P. R. Cohen. The adaptive agent architecture: Achieving fault-tolerance using persistent broker teams. In Proceedings of the Fourth International Conference on Multi-Agent Systems, July 2000.
[16] O. Marin, P. Sens, J.-P. Briot, and Z. Guessoum. Towards adaptive fault tolerance for distributed multi-agent systems. In Proceedings of MAAMAW'2001, 2001.
[17] S. Mishra and Y. Huang. Fault tolerance in agent-based computing systems. In Proceedings of the 13th ISCA International Conference on Parallel and Distributed Computing Systems, Las Vegas, NV, USA, August 2000.
[18] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1999.
[19] F. B. Schneider. Towards fault-tolerant and secure agentry. In Proceedings of the 11th International Workshop on Distributed Algorithms, Sept. 1997.
[20] K. Toyama and G. D. Hager. If at first you don't succeed... In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 3–9. AAAI Press/MIT Press, 1997.
[21] J. Vassileva, G. McCalla, R. Deters, D. Zapata, C. Mudgal, and S. Grant. A multi-agent approach to the design of peer-help environments. In Proceedings of AIED'99, 1999.
[22] D. N. Wagner. Liberal order for software agents? An economic analysis. Journal of Artificial Societies and Social Simulation, 3(1), 2000.
