Submitted to FTCS-28 in category "Regular Paper". This paper is a candidate for the William C. Carter Award. The paper is based on the graduate dissertation work of two students: Saurabh Bagchi and Keith Whisnant.

Chameleon: Adaptive Fault Tolerance Using Reliable, Mobile Agents S. Bagchi, K. Whisnant, Z. Kalbarczyk, R.K. Iyer Center for Reliable and High-Performance Computing University of Illinois at Urbana-Champaign 1308 W. Main St., Urbana, IL 61801 E-mail: [bagchi, kwhisnan, kalbar, iyer]@crhc.uiuc.edu

Abstract

This paper presents Chameleon, an adaptive infrastructure that allows different levels of availability requirements to be supported simultaneously in a networked environment. Chameleon provides dependability through the use of special agents—reliable and mobile components that control all operations in the Chameleon environment. Three broad classes of agents are defined:

• Managers oversee other agents and recover from failures in their subordinates.

• Daemons provide communication gateways to the agents at the host node. They also make a host's resources available to the Chameleon environment.

• Common agents implement specific techniques for providing application-required dependability.

Above all else, Chameleon provides a flexible architecture through which adaptive fault tolerance may be achieved in an unreliable and heterogeneous network. Key concepts used to accomplish this goal include the automated creation of new agents, the automatic extension of existing agents, the seamless integration of existing and new agents into existing execution strategies, and the creation of new fault-tolerant execution strategies. To our knowledge, Chameleon is one of very few real implementations that maintain fault tolerance via a software infrastructure alone: Chameleon provides fault tolerance from the application's point of view, and the software infrastructure itself is fault-tolerant. To gain a sense of the performance degradation associated with the Chameleon environment, we have developed a simulation of several execution strategies (dual, TMR, and quadruple execution modes) and an implementation prototype of the dual and TMR execution modes. Through these testbed environments, we measure the execution overhead and the recovery time from failures in the user application, the Chameleon agents, the hardware, or the operating system.

Keywords: adaptive fault tolerance, highly available networked computing, software-implemented fault tolerance, COTS, extendible modular architecture

1 Introduction

Traditionally, fault tolerance has been provided through dedicated hardware, dedicated software, or a combination of both. In the case of hardware, manufacturers such as Tandem provide stand-alone machines with high reliability through extensive hardware redundancy. Unfortunately, dedicated fault-tolerant architectures such as these offer a static level of fault tolerance that remains fixed throughout the lifetime of the machine. As such, these architectures are often oriented toward specific classes of applications. Distributed environments often employ software-based solutions to provide dependability. Typically, services are replicated throughout the network to provide the prerequisite level of redundancy. The applications, however, usually must be written with the intent to run in such an environment, so the benefits of such a system are unavailable to existing applications.

In contemporary networked computing systems, a broad range of commercial and scientific applications must coexist, each potentially requiring varying levels of availability and reliability. It is not cost-effective to provide dedicated hardware-based fault tolerance to each application, nor is it cost-effective to rewrite each application to take advantage of fault tolerance incorporated into a distributed network through specialized software. The pressing issue then becomes how best to achieve high dependability with off-the-shelf, unreliable hardware and off-the-shelf applications.

In this paper, we propose Chameleon, an adaptive infrastructure that allows different levels of availability requirements to be supported simultaneously in a single networked environment. Chameleon provides dependability through the use of special agents—reliable and mobile components that control all operations in the Chameleon environment. (Note that in the remainder of the paper we use the term agent to refer not only to common agents but also to other entities, including daemons and managers.)

We have developed three broad classes of agents:

1. Managers. Managers oversee other agents and recover from failures in their subordinates. Primary managers include the Fault Tolerance Manager (FTM), the highest-ranking manager, and surrogate managers that oversee the fault-tolerant execution strategy of a specific user application.

2. Daemons. Daemons allow Chameleon to access a node in the network, provide agent error detection, and provide the means through which agents may communicate among themselves across the network.

3. Common agents. Common agents implement specific techniques for providing application-required dependability. Examples of common agents include execution agents, voter agents, checkpoint agents, and heartbeat agents.

The key contributions and innovations of the proposed approach are as follows:

• Use of reliable agents to provide an adaptive infrastructure capable of supporting a user-specified level of fault tolerance for executing off-the-shelf applications, such as computationally intensive scientific applications.

• Derivation of agents (i.e., the FTM, surrogate managers, daemons, and common agents) from a base agent class. This establishes a hierarchical architecture in which different entities inherit properties supported by a corresponding parent class. This enhances reliability by reducing the debugging effort to the set of functions specific to a particular entity.

• Automated creation/synthesis of agents using a library of primitive building blocks, extension of the capabilities of existing agents, and their seamless integration with available fault-tolerant strategies through an agent manufactory.

• Support for a simple specification semantic for representing a user's availability requirements and a parser for interpreting these specifications.

• Operation in a network of heterogeneous computation nodes, including UNIX and NT platforms.

• Simulation and implementation of an actual prototype.
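The base-class derivation highlighted in the contributions above can be sketched as follows. This is an illustrative sketch only: all class and method names here are hypothetical, not the actual Chameleon interfaces.

```python
# Hypothetical sketch of the agent class hierarchy: every Chameleon
# entity (manager, daemon, common agent) derives from a common base
# class that supplies shared services such as identification, so
# per-entity debugging is confined to subclass-specific functions.

class Agent:
    """Base class: properties shared by all Chameleon entities."""
    def __init__(self, agent_id):
        self.agent_id = agent_id        # system-wide unique identifier

    def handle_message(self, msg):
        raise NotImplementedError       # specialized by each entity

class Manager(Agent):
    """Oversees subordinate agents and recovers from their failures."""
    def __init__(self, agent_id):
        super().__init__(agent_id)
        self.subordinates = {}          # agent id -> node (daemon id)

class Daemon(Agent):
    """Gateway to a node: local installation and communication."""
    def __init__(self, agent_id, node):
        super().__init__(agent_id)
        self.node = node

class CommonAgent(Agent):
    """Implements one specific fault-tolerance technique."""
    pass

class SurrogateManager(Manager):
    """Manages the fault-tolerant execution of one user application."""
    pass

# A surrogate manager is both a manager and, generically, an agent.
sm = SurrogateManager(agent_id=2)
assert isinstance(sm, Manager) and isinstance(sm, Agent)
```

The single inheritance root is what lets the paper use the generic term "agent" for managers and daemons as well as common agents.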

The remainder of the paper is organized as follows: Section 2 discusses some related research projects and contrasts them with the approaches we are taking in designing Chameleon. A behavioral overview of Chameleon is presented in Section 3. Section 4 provides an in-depth look at the functionality of the Chameleon agents. Details of the error detection and recovery mechanisms incorporated into Chameleon are discussed in Section 5. Section 6 lays out the underlying agent hierarchy. Issues of dynamic composition and customization of agents are covered in Section 7. Section 8 describes some preliminary results obtained from a working prototype. The final conclusions are provided in Section 9. In addition, an Appendix gives simulation results (from a very early design stage of Chameleon) to assess any performance penalty that Chameleon may introduce to the system.

2 Related Research

Current approaches for providing fault tolerance in a network of unreliable components are based mainly on exploiting distributed groups of cooperating processes. Consequently, the primary focus is on providing a dedicated software layer to maintain and coordinate reliable communication among groups of processes. Over the last several years, the group communication paradigm has been employed as a key premise in the design and implementation of many distributed systems1:

• ISIS provides tools for programming with process groups. By using these tools, a programmer may construct group-based software that provides reliability through explicit replication of code and data [Bir94].

• Horus, a new generation of ISIS, introduces a flexible group communication protocol that may be constructed by stacking well-defined microprotocols at run time [Ren96].

• Totem attempts to provide high performance and soft real-time guarantees to applications by providing a hierarchy of group communication protocols that is capable of delivering messages to member processes in the presence of communication and processor failures (including message loss, network partitioning, and crashed processors) [Mos96].

• Transis incorporates multicast services that are capable of recovering from network partition failures [Dol96], [Ami92].

• Rampart addresses security aspects of group communication by providing tolerance of malicious intrusions [Rei93].

These systems are more concerned with group communication than with fault tolerance. Although reliability may be achieved through the use of these protocols, "fault tolerance," Birman notes [Bir93], "is something of a side effect of the replication approach." There exist, however, examples of well-known systems that explicitly address the issue of fault tolerance in conjunction with atomic multicast:

1 The review presented here is not intended to be comprehensive; rather, an attempt is made to illustrate major trends in the area of distributed computation. For a thorough analysis of issues related to group communication and a more complete characterization of existing systems, the reader can refer to [Bir96]. Issues related to distributed fault tolerance are covered in [Cri91].



• Delta-4 sought to define and design an open, dependable, distributed architecture through the use of group communication layers built on top of an atomic multicast protocol, employing specialized hardware (a network attachment controller) to support a fail-silent failure semantic [Pow94], [Pow91], [Bar90].

• Piranha, an extension to Horus via the CORBA2 interface provided by Electra, addresses the issue of service availability in distributed applications by using a highly sophisticated ORB that provides failure detection [Maf97].

• "Wolfpack", Microsoft's clustering technology, provides clustering extensions to Windows NT for improving service availability and system scalability [Wolf97]. Although this approach is not based on the process-group paradigm, it maintains functions typical of a distributed environment, including maintaining cluster membership and sending periodic heartbeat messages to detect system failures.

• At Sun Microsystems, work has been done on the Ultra Enterprise Cluster [Sun97], designed to provide highly available data services. The Ultra Enterprise Cluster High Availability server provides automatic software-based fault detection and recovery mechanisms. Specialized software allows a set of two computing nodes to monitor each other and redirect data requests in the case of software or hardware failure.

The process-group-based computing model has also been used in real-time systems such as the Advanced Automated System for the Air Traffic Control network being developed by IBM [Cri90] and MARS [Kop88], a distributed system for real-time control. The technology here aims at providing strong real-time guarantees by employing accurate measurement of the timing properties of the network and the hardware to achieve highly predictable behavior.

Each of these systems requires a specialized and often complex software layer and/or additional hardware in order to provide group communication and fail-silent behavior. More importantly, most of these systems only provide an environment through which a programmer may construct a distributed application and provide fault tolerance through replication. When making comparisons between these systems and Chameleon, it is important to keep this in mind, for the focus of Chameleon is slightly different. Chameleon explicitly provides fault tolerance through a wide range of error detection and error recovery mechanisms. Several of the above systems detect failures solely through the use of timeouts, and some do not even mandate that recovery be initiated once failures have been detected. Finally, Chameleon tries not to make any assumptions concerning the fail-stop behavior of any of its entities. Of course, the coverage of our error detection mechanisms is the overriding factor in determining whether such a claim is reasonable.

3 Overview of Chameleon

In this section, we examine the steps that Chameleon takes to execute a user application under a fault-tolerant execution strategy. We also introduce the key components of the Chameleon environment. Note that this section is primarily concerned with introducing the error-free behavior of Chameleon; Chameleon's error detection and recovery techniques are discussed in detail in Section 5.

2 The Common Object Request Broker Architecture (CORBA) [OMG95] is emerging as a major standard for supporting object-oriented distributed environments, with the design goals of heterogeneity, interoperability, and extensibility. While vendors are increasingly providing applications conforming to CORBA specifications, the vast majority of existing applications were not built around CORBA objects. Research on availability and reliability issues in CORBA is now emerging, and we expect several CORBA-based dependability solutions in the near future.

Chameleon provides fault tolerance through the use of specialized agents. Agents include the Fault Tolerance Manager (the highest-ranking manager in the Chameleon system), surrogate managers (managers responsible for executing a particular user application under a fault-tolerant strategy), daemons (the agents responsible for error detection and communication among agents), and common agents responsible for carrying out specific actions on behalf of the managers.

3.1 Initialization of the Chameleon Environment

Essentially any network of unreliable nodes may be configured to participate in the Chameleon environment. A system administrator (or some user with comparable network privileges) manually installs a program known as the Fault Tolerance Manager (FTM) on an arbitrary node. The FTM oversees the Chameleon environment by executing as a background process that handles user requests to run an application and that configures additional nodes to participate in the Chameleon environment. It should be emphasized that a failure during the FTM initialization requires the system administrator to repeat the installation procedure. Once successfully installed, the FTM invokes a host daemon (to handle communications with remote hosts) and a heartbeat agent (to detect failures of remote nodes) on the local computation node. Finally, the FTM creates a backup FTM (usually the first surrogate manager is designated as the backup FTM, as will be seen later in this section), which closely monitors the FTM and, upon detecting an error, promotes itself to become the new FTM. Because the backup FTM's primary responsibility is to monitor the FTM, rapid error recovery is possible in the event of a failed FTM. Once the backup FTM is set up, we have a stable Chameleon environment ready to accept and serve user requests. Other nodes in the network may request to join the Chameleon environment through the FTM.
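The start-up sequence just described can be sketched in a few lines. This is an illustrative sketch, not the real Chameleon code: the component names and the flag-based promotion check stand in for the actual periodic monitoring.

```python
# Illustrative sketch (assumed names) of FTM initialization: the FTM
# brings up a local daemon and heartbeat agent, then creates a backup
# FTM that monitors it and promotes itself if the primary fails.

class FTM:
    def __init__(self):
        self.daemon = "host-daemon"         # handles remote communication
        self.heartbeat = "heartbeat-agent"  # detects remote node failures
        self.backup = None
        self.alive = True

    def create_backup(self):
        self.backup = BackupFTM(primary=self)
        return self.backup

class BackupFTM:
    def __init__(self, primary):
        self.primary = primary
        self.is_primary = False

    def monitor(self):
        # In the real system this is a periodic liveness check; here a
        # simple flag illustrates the promotion rule.
        if not self.primary.alive:
            self.is_primary = True          # promote self to new FTM
        return self.is_primary

ftm = FTM()
backup = ftm.create_backup()
assert backup.monitor() is False    # primary healthy: remain backup
ftm.alive = False
assert backup.monitor() is True     # primary failed: promote
```

Because the backup's only job is watching the primary, promotion can happen as soon as a missed liveness check is confirmed, which is the source of the rapid recovery the paper claims.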
When a new node requests to join the infrastructure, the FTM sends the necessary code to compile and execute a host daemon on the node wishing to join Chameleon3. Daemons constitute an important part of the Chameleon architecture, as all communication between agents occurs through the daemons. The node, therefore, becomes a fully-functioning member of the Chameleon environment once it has a daemon installed.

3.2 Interpreting User-Specified Dependability Requirements

The Chameleon environment allows the user to run several different applications in a fault-tolerant manner, each application potentially having different availability and reliability requirements. From the user's standpoint, he or she simply submits an application to the FTM with availability requirements specified in a semantic language that the FTM understands, and the FTM selects an appropriate execution strategy through which the fault-tolerance requirements can be met. To accomplish this selection, the FTM has a registry of several different execution strategies and their associated semantics. Selecting a fault-tolerant strategy, therefore, becomes nothing more than looking up the user-specified semantic description in the FTM's registry. The initial implementation of Chameleon provides the following execution strategies: (1) dual execution mode with and without voting, (2) triple modular redundancy (TMR) execution mode, and (3) quadruple execution mode (four application replicas with results being fed through a two-level voting hierarchy). In addition, checkpointing and recovery can be enabled in each of the above execution modes. It is important to emphasize that these are not fixed execution strategies; other execution strategies may be easily developed using the Chameleon components and mechanisms to be

3 Since the node wishing to join the Chameleon environment must be able to communicate with the FTM, a lightweight background process called the initialization process exists for this purpose. Once the FTM properly configures the node, the initialization process serves no useful purpose and may be terminated.


described later. As will be seen later, Chameleon utilizes a well-structured agent class hierarchy composed of substitutable components through which new execution strategies may easily be constructed. Once new execution strategies are developed, they need only be registered with the FTM to become available for use. After the FTM selects a particular execution strategy, it chooses or creates a corresponding surrogate manager to carry out the fault-tolerant execution of the user-supplied application (the creation of a surrogate manager is discussed later). In general, there is one surrogate manager for each user application (and usually, the first created surrogate manager also serves as the backup FTM). The surrogate manager utilizes Chameleon agents to realize the execution strategy, thus freeing the FTM from having to manage the application being executed. Since several user-supplied applications may execute simultaneously in the Chameleon environment, the loose coupling between the FTM and the executing applications allows the FTM to be more responsive to other operations (such as responding to additional user requests, overseeing recovery from a failed network node, overseeing the recovery of a surrogate manager, etc.).

3.3 Invoking a Fault-Tolerant Execution Strategy

Once the FTM selects an appropriate surrogate manager to execute the application, the FTM installs the surrogate manager on a node in the Chameleon network (i.e., a node with a daemon installed, as described in Section 3.1). Installing the surrogate manager consists of sending the surrogate manager code and the required libraries of Chameleon components to the daemon of the node on which the surrogate manager is to be installed. The daemon receives the code, compiles it, and runs the resulting executable file. As a result, the FTM must have access to the source code for the surrogate managers and the associated libraries of Chameleon functions.
Recompiling at the destination node gives the added flexibility of multi-platform support, assuming the surrogate manager source code and Chameleon libraries are written to be platform-independent. Surrogate managers utilize Chameleon components called agents to meet the user-specified levels of fault tolerance for a given application. Agents are specialized entities that perform much of the work in the Chameleon environment, and they are designed in such a way that no individual agent is a single point of failure. As discussed later, agents are constructed from a common library of basic building blocks and are intended to be reusable and extendible.

Agents supply much of the functionality of the Chameleon system. All agents in Chameleon fall into one of three broad categories: managers, daemons, and common agents. We have already seen examples of managers (the FTM and the surrogate managers) and daemons. From the standpoint of the Chameleon architecture, managers and daemons are merely specialized agents. The generic term agent, therefore, refers to the common agents as well as to managers and daemons. Common agents are those agents that the surrogate managers use to attain a particular level of dependability. The initial implementation of Chameleon provides the common agents listed in Table 1. All agents are registered with the FTM and are available for use by any surrogate manager (or any manager, for that matter). When a surrogate manager begins execution, it typically installs the necessary agents on other nodes to complete its assigned task. For example, the surrogate manager responsible for executing an application in TMR mode installs three copies of an execution agent (one for each application replica) and a voter. It is important to emphasize that after the surrogate manager successfully installs the agents necessary to execute the user application, it notifies the FTM where the individual agents are located.
Consequently, the FTM is able to initiate correct recovery actions in the case of a surrogate manager failure.
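The registry-based strategy selection described in Section 3.2 amounts to a dictionary lookup. The sketch below is a minimal illustration under assumed names; the semantic keys and factory values are hypothetical stand-ins for the paper's specification language and surrogate managers.

```python
# Hypothetical sketch of the FTM's strategy registry: selecting an
# execution strategy is a lookup of the user-specified dependability
# semantic, and new strategies become usable simply by registering them.

REGISTRY = {}

def register_strategy(semantic, surrogate_manager_factory):
    REGISTRY[semantic] = surrogate_manager_factory

def select_strategy(semantic):
    try:
        return REGISTRY[semantic]()   # create the surrogate manager
    except KeyError:
        raise ValueError(f"no execution strategy registered for: {semantic}")

# Strategies from the initial implementation (labels illustrative).
register_strategy("dual", lambda: "dual-mode surrogate manager")
register_strategy("dual+voting", lambda: "dual-with-voting surrogate manager")
register_strategy("TMR", lambda: "TMR surrogate manager")
register_strategy("quad", lambda: "quadruple-mode surrogate manager")

assert select_strategy("TMR") == "TMR surrogate manager"
```

The design choice here mirrors the paper's claim that strategies are not fixed: adding a strategy touches only the registry, never the FTM's selection logic.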


As can be seen, Chameleon provides for the reliable execution of a user application through the use of agents. A summary of the agents in the initial implementation of Chameleon, classified into the three groups, is provided in Table 1. These agents are closely monitored through the techniques described in Section 5.

Table 1: Categories and Examples of Agents

Managers
  Member agents: FTM, backup FTM, surrogate managers
  Notable features: remotely install agents; recover from failures in managed agents

Daemons
  Member agents: platform-specific daemons
  Notable features: locally install agents; monitor locally-installed agents; communicate with remote agents on behalf of locally-installed agents

Common agents
  Member agents: execution agent, voter agent, checkpoint agent, heartbeat agent
  Notable features: the execution agent executes and monitors a copy of the user application; the voter agent votes upon the results obtained from executing the user application; the checkpoint agent takes checkpoints of the user application for use in the event of error recovery; the heartbeat agent is used by the FTM to detect node and daemon failures

4 Agent Functionality

In this section, we discuss the specific responsibilities of the individual agents in Chameleon (note that the discussion does not cover behavior in the case of an agent failure; this is described in Section 5). As mentioned in the previous section, all agents in the Chameleon environment may be classified into one of three groups: managers, common agents, and daemons. A description of each group and representative examples of each group are given below. Note that the set of specific agents found in this section may be expanded to incorporate new fault-tolerant features into the Chameleon environment.

4.1 Managers

Managers are specialized agents that possess the following common capabilities:

• The ability to remotely install agents on other nodes.

• The ability to assign system-wide identifiers to the agents that they install. Each agent has a unique identification number that allows it to be distinguished from any other agent in the Chameleon system.

• The ability to maintain an updated list of their subordinates (i.e., agents installed by the manager on remote hosts) and a mapping of their identification numbers to the nodes on which they are installed. At this level, nodes are identified by the identification number of the daemon installed on the node. Since all agents (and hence daemons) have unique identification numbers, these numbers uniquely identify all nodes in the network.
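The second and third capabilities above can be sketched together: a manager hands out system-wide identifiers and records which daemon (node) each subordinate lives on. The counter-based identifier source and method names are assumptions for illustration.

```python
# Illustrative sketch of a manager's bookkeeping: unique system-wide
# identifiers for installed agents, plus a subordinate-to-node map
# keyed by the daemon's identification number.

import itertools

class Manager:
    _ids = itertools.count(1)          # shared source of unique identifiers

    def __init__(self):
        self.agent_id = next(Manager._ids)
        self.subordinates = {}          # subordinate agent id -> daemon id

    def install(self, daemon_id):
        """Install a subordinate agent on the node owned by daemon_id."""
        sub_id = next(Manager._ids)     # never reused, unique system-wide
        self.subordinates[sub_id] = daemon_id
        return sub_id

ftm = Manager()
a = ftm.install(daemon_id=100)
b = ftm.install(daemon_id=101)
assert a != b                           # identifiers are unique
assert ftm.subordinates[a] == 100       # mapping back to the node
```

In the real system identifier allocation must be coordinated through the FTM (see Section 4.1's surrogate manager description); a single shared counter stands in for that coordination here.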

The remainder of this section describes specific examples of managers in the Chameleon environment. Note that this is not a static list of managers, but rather a set of examples included in the initial implementation of Chameleon.

Fault Tolerance Manager (FTM). The FTM is the centralized, key manager of the Chameleon environment. It has the following functionality:

• Interfacing with the user to accept the application for the environment and communicating the final results of the run back to the user.

• Interpreting the user's dependability specifications for the application and mapping them onto one of the available fault-tolerant strategies.

• Determining the hosts that will be used to support the selected fault-tolerant execution strategy. Criteria used in the selection process may include the current load on a prospective node and the history log of failures of previous applications on the target node4. The list of hosts available for use by the FTM may change dynamically because of node failures, nodes voluntarily leaving the environment, and new nodes joining the environment.
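The host-selection criterion above can be sketched with the paper's simple workload metric (the number of agents installed per node, per the footnote); the failure-history weighting is omitted, and the function name is an illustrative assumption.

```python
# Sketch of FTM host selection: rank candidate nodes by the number of
# agents currently installed on them and pick the least-loaded ones.

def select_hosts(agent_counts, needed):
    """agent_counts: node name -> number of agents currently installed."""
    ranked = sorted(agent_counts, key=agent_counts.get)
    if len(ranked) < needed:
        raise RuntimeError("not enough nodes in the Chameleon environment")
    return ranked[:needed]

# Example: pick three hosts for a TMR strategy.
load = {"nodeA": 4, "nodeB": 1, "nodeC": 7, "nodeD": 2}
assert select_hosts(load, 3) == ["nodeB", "nodeD", "nodeA"]
```

Because the available-host list changes dynamically, `agent_counts` would be rebuilt from the FTM's live registry before each placement decision.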

Surrogate Manager (SM). Surrogate managers are specialized managers with the following features:

• A surrogate manager is responsible for executing a single user application under a specific fault-tolerant strategy. Currently, surrogate managers exist for the dual, TMR, and quadruple execution modes.

• The surrogate manager is capable of installing any other agents required for executing an application under the fault-tolerant strategy selected by the FTM.

• When assigning identification numbers to the agents it installs, the surrogate manager must query the FTM for a list of valid identification numbers (numbers not previously assigned to other active agents).
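The third feature can be sketched as a small request/grant protocol: the surrogate manager asks the FTM for a block of identifiers that are not in use and draws from it as it installs agents. The block size and all interface names are hypothetical.

```python
# Illustrative sketch (assumed interfaces): a surrogate manager obtains
# valid identification numbers from the FTM before assigning them.

class FTM:
    def __init__(self):
        self.assigned = set()
        self.next_id = 1

    def grant_ids(self, count):
        """Return `count` identifiers not assigned to any active agent."""
        grant = list(range(self.next_id, self.next_id + count))
        self.assigned.update(grant)
        self.next_id += count
        return grant

class SurrogateManager:
    def __init__(self, ftm):
        self.ftm = ftm
        self.pool = []

    def install_agent(self):
        if not self.pool:                 # query the FTM when pool is empty
            self.pool = self.ftm.grant_ids(8)
        return self.pool.pop(0)

ftm = FTM()
sm = SurrogateManager(ftm)
ids = {sm.install_agent() for _ in range(10)}
assert len(ids) == 10                     # all identifiers are unique
```

Granting identifiers in blocks is one way to keep the FTM off the critical path of every agent installation; the paper does not specify whether identifiers are granted singly or in blocks.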

4.2 Common Agents

Heartbeat Agent. This is an elementary common agent invoked by the FTM to query the status of nodes in the environment. In its simplest incarnation, the heartbeat agent can issue a simple ping message to determine whether a node is alive. At the other extreme, the heartbeat agent can be quite sophisticated and encapsulate information about the health of the node being monitored. Our prototype heartbeat agent can monitor the number of errors in the various components of a node (e.g., errors in memory and errors in I/O). The absence of a heartbeat from a node may not indicate that the node is down: the agent that responds to the heartbeat may itself have failed, or the heartbeat interval may be too short for a slow machine to respond within the specified time. Therefore, the heartbeat agent pings the machine a fixed number of times before it declares the machine to be down.

Execution Agent. This is the basic agent responsible for installing an application on a particular host, overseeing its execution, and finally communicating the result of the application back to the manager.

Voter Agent. This is a generalized agent capable of majority voting on the results obtained from other agents (such as execution agents or other voter agents). Different voting strategies and characteristics may be obtained by overriding the default behavior of the voter agent (e.g., n-of-m voting, self-checking voting, etc.). A critical voter parameter is the timeout interval for which the voter waits for results to arrive from the agents. Initially, the timeout is determined from hints given by the user who supplied the application. At run time, the timeout is tuned depending on the relative speed of the machines executing the application. For example, consider an application executed in TMR mode with additional checkpointing to support recovery from errors. In this scenario, each execution agent measures the application execution time. After a fixed number of checkpoints, the agent sends the measured time to the voter. The voter compares the collected times and readjusts the voting timeout according to the ratio of the worst (i.e., longest) to the best (i.e., shortest) time reported by the agents. For example, if the timeout was initialized to 5s and the three measurements (arriving from the three agents) are 1.3s, 1.1s, and 2.2s, the voting

4 For now, we are using the simple metric of the total number of agents installed on the node to ascertain the node's workload.


timeout is readjusted to the value of 10s (i.e., 5 * (2.2/1.1)) to take into account the performance of the slowest machine5.

Checkpoint Agent. The checkpoint agent interacts with the execution agent to enable the checkpointing of applications running on a particular node. This agent is crucial for application recovery on homogeneous nodes. On detection of an application failure, an initial attempt is made to restart the application on the same node from the last checkpoint. If this fails, the checkpoint agent notifies its manager to initiate recovery on a different node.

4.3 Daemons

Daemons are an important class of agents that perform the following functions in the Chameleon environment:

• Installing agents locally on the node. This is in contrast to managers, which remotely install agents by sending messages to a daemon. Daemons perform the low-level installation of an agent on a node (i.e., spawning a new process, setting up an appropriate communication channel between itself and the agent, notifying the FTM as to the location of the newly-installed agent, etc.).

• Monitoring all locally-installed agents. Details of the error detection provided by the daemons can be found in Section 5.

• Serving as the primary gateway for all communication between agents in the Chameleon environment. Section 4.4 describes the agent communication process in more detail.

• Responding to heartbeat messages from the heartbeat agent.
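The voter behavior described in Section 4.2, majority voting plus the run-time timeout readjustment, can be sketched as follows; function names are illustrative, and the numbers reproduce the paper's own example.

```python
# Sketch of the voter agent: majority voting over replica results and
# timeout tuning by the worst-to-best execution-time ratio,
# new_timeout = timeout * (longest_time / shortest_time).

from collections import Counter

def majority_vote(results):
    """Return the strict-majority value among replica results, or None."""
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > len(results) // 2 else None

def readjust_timeout(timeout, measured_times):
    """Scale the voting timeout by the worst/best execution-time ratio."""
    return timeout * (max(measured_times) / min(measured_times))

# The paper's example: timeout 5s, measured times 1.3s, 1.1s, 2.2s -> 10s.
assert readjust_timeout(5, [1.3, 1.1, 2.2]) == 5 * (2.2 / 1.1)
assert majority_vote([42, 42, 7]) == 42   # TMR masks one faulty replica
```

Overriding `majority_vote` is where the n-of-m or self-checking variants mentioned in the text would plug in.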

Because of the vital role that daemons play in providing communication capabilities to locally-installed agents, the loss of a daemon renders the entire node unusable: the locally-installed agents effectively become isolated from the rest of the Chameleon environment. Care must be taken, therefore, to accurately detect daemon failures and rapidly recover from them. Section 5 discusses how the heartbeat agent and the FTM work together to provide daemon error detection and recovery. Since the daemons perform the actual network communication in the Chameleon environment, they must be specific to a particular network protocol. Our current implementation is built on TCP/IP because of the portability and ease of implementation it offers, but there is a parallel effort to use MPI as well (we have an early implementation of some Chameleon features based on the MPICH libraries [MPICH97]).

4.4 Agent Communication

All actions in Chameleon are invoked through message passing. Daemons play an important role in this by delivering messages to the appropriate destinations. When an agent wants to communicate with another agent, it invokes the services of the local daemon. The sending agent provides the daemon with the unique identification number of the agent to whom the message is being sent. The daemon then translates this Chameleon identification number into a physical network location (e.g., an IP address in a TCP/IP implementation). If the local daemon does not know the physical location of the destination agent, it makes a request to the FTM's daemon to provide this information. Since the FTM is the highest-ranking manager in the environment, its daemon tracks the

5 To our knowledge, the issue of how to implement the voter in an environment like Chameleon has not been explicitly addressed in existing software-based, distributed fault-tolerant architectures. Although Delta-4 talks about voting upon messages, it is not clear where and how the voter is implemented.

9

physical location of all agents installed in the system. When the host daemon receives a message it performs a validity check. An example of the validity check would be ensuring that the intended destination agent of the message actually resides on the local node. In addition, the message can be protected by a CRC or a checksum to provide an extra layer of error detection. Note that because an agent sending a message only needs to know the Chameleon identification number of the agent to receive the message, the agents are free to migrate across the network, thus providing for agent mobility with little overhead associated with each agent.

5 Error Detection and Recovery

The Chameleon system is designed to recover from frequently occurring failures. These failures may be classified into three categories: (1) failures in the hardware, (2) failures in the user application, and (3) failures in the Chameleon agents. Chameleon provides specific methods to detect and recover from each of these.

5.1 Failures in the Hardware and the Operating System

The failure of a computation node (hardware or software, i.e., the operating system) is detected by loss of heartbeat. If a node fails or becomes inaccessible, its local daemon does not respond to the heartbeat agent's periodic heartbeat message, and the heartbeat agent notifies the FTM, which removes the node from the network. Consequently, when a node appears inaccessible, all agents and user applications running on the node must be presumed lost. For example, the failure of a networking hub makes all nodes connected to the hub inaccessible to the network (assuming the nodes were connected to the network only via the hub). The FTM then systematically reinstalls and restarts all affected agents installed on the failed node. In most cases, the FTM does not perform much of the agent reinstallation itself but instead delegates this responsibility to the surrogate managers (if they are still alive). Since the surrogate managers know where they have installed their agents, they are capable of restarting any of their affected agents on other nodes. In the case of erratic (or inconsistent) node behavior, the node may continue to send heartbeats despite the failure (i.e., the node is not fail-silent).
The resulting application misbehavior can be captured by: (a) the local execution agent (if the node failure did not compromise the execution agent), (b) the host daemon on another (remote) computation node to which the faulty node sent a message (e.g., a message with a valid but incorrect identifier of a destination agent, i.e., the destination agent does not exist on the node that received the message), and (c) the voter agent, which detects incorrect computation results. Note that Chameleon does not distinguish between node failures and network (i.e., link or switch) failures. An efficient means of coping with link and switch failures is to use a redundant network; ServerNet from Tandem is, to our knowledge, the only commercially available fault-tolerant network [Hor95].

5.2 Failures in the User Application

Execution agents monitor the user-supplied application executing under a fault-tolerant strategy. In most cases, the surrogate manager assigns one execution agent per copy of the application so that each copy may be monitored independently. If the application fails and the execution agent detects the failure, the execution agent restarts the application. If checkpointing is enabled, the execution agent rolls the application back to the most recent checkpoint and restarts the execution. The execution agent also notifies the appropriate surrogate manager that the recovery took place. The surrogate manager then sends a message to the voter agent (if a voter is used) to readjust the voting timeout if the voting strategy cannot mask the error (e.g., in the case of a dual execution mode). The use of checkpoints is determined at the initial invocation of the application and is based on the user requirements. The key point to note is that the overall responsibility for application recovery lies with the execution agent.

5.3 Failures in the Chameleon Agents

The Chameleon agents are rigorously tested against erroneous execution to ensure that they are free from software defects. As such, any failure of the agents should be due to other, external faults (e.g., transient faults in hardware). When a daemon locally installs a particular agent, it also begins to monitor the agent so that any premature or abnormal termination may be detected. If a daemon detects that one of the local agents has failed, the daemon notifies the failed agent's manager. Every agent in the Chameleon environment has a manager, and the agent-manager relationship is defined at the time the agent is installed. For example, if Agent A installs Agent B, then Agent A is the manager of Agent B; in essence, Agent B becomes the responsibility of Agent A. Hence, the daemon monitoring Agent B would notify Agent A if Agent B were to fail. Agent A would then take appropriate action (reinstalling Agent B on the same node or on a different node, depending on the circumstances). Such failure notification extends to the case when a surrogate manager fails: since the FTM installs the surrogate manager, the FTM is responsible for its recovery. An important point to note in surrogate manager recovery is that if the execution strategy involves voting, the voter agent holds the computation results until the surrogate manager comes back on-line (recreated by the FTM) and is ready to receive them. It is the FTM's responsibility to notify the voter agent of the corresponding surrogate manager failure. The FTM is the only Chameleon agent that does not have a direct manager.
The FTM, therefore, is monitored by a special agent, the backup FTM (usually the first-created surrogate manager is designated as the backup FTM). The backup FTM periodically sends its own heartbeat message to the FTM to ensure that the FTM is alive. If it does not receive a reply from the FTM, the backup FTM promotes itself to be the FTM. In this way, the Chameleon system can recover from an FTM failure. Any failure of the backup FTM is handled in the same way as any other agent failure; specifically, the local daemon notifies the FTM. The FTM can also fail while a new computation node is attempting to join the environment. In this scenario, the joining node re-sends the request a fixed number of times (this action is supported by the initialization process; see Section 3.1); either the FTM is reestablished and the new node joins the environment, or the node gives up.

5.4 Failure Modes

Table 2 presents the primary failure modes for the Chameleon environment. The table is intended to be self-explanatory, and hence we make only generic remarks on its contents. Each failure mode is characterized by a brief description of its consequences for the environment. In addition, the table identifies the agent responsible for detecting a particular failure and, finally, gives a description of the fundamental steps in recovering from the detected error. From Table 2 one can observe that a certain hierarchy exists in error detection and recovery. This hierarchy is illustrated in Figure 1, which shows the primary paths of error detection and recovery activities. Secondary paths, such as detection (via the host daemon) of erratic node behavior, are not depicted, to preserve the clarity of Figure 1.


Table 2: Chameleon Failure Modes and Recovery

Node / Crash
  Consequence: All agents on the node are lost
  Detection:   Heartbeat agent
  Recovery:    HB agent notifies the FTM; the FTM removes the node from the list of registered nodes and restarts any affected agents it manages on a new node; the FTM notifies the immediate managers of the crashed node, and these managers restart any of their agents and recursively notify all subordinate managers

Network / Link down
  Consequence: Unreachable node
  Detection:   Heartbeat agent
  Recovery:    Same as a node crash if a redundant link is not available; no actions are necessary if a redundant link is available

Network / Switch down
  Consequence: Network down
  Detection:   Heartbeat agent
  Recovery:    Cannot recover if a redundant switch is not available; no actions are necessary if a redundant switch is available

Application / Abnormal termination
  Consequence: Program fails to complete normally
  Detection:   Execution agent
  Recovery:    Notify the execution agent's manager; restart the application (with assistance from a checkpoint agent if enabled)

Application / Livelock
  Consequence: No forward progress made in the application
  Detection:   Execution agent, through a user-supplied timeout
  Recovery:    Kill and restart the application; if repeated restarts result in livelock, notify the execution agent's manager, which may elect to reinstall the execution agent on a node with a different platform; if installing the agent on a new platform fails, the user is notified of an apparent software bug

Application / Compilation error
  Consequence: Application executable not generated
  Detection:   Execution agent
  Recovery:    Retry compilation; if the retry repeatedly fails, request a fresh copy of the source code from the execution agent's manager; if the new source code cannot compile, notify the execution agent's manager to try installing the execution agent on a node with a different platform; if the application will not compile on the new platform, notify the user of an unrecoverable error

Application / Erroneous computation
  Consequence: Incorrect results
  Detection:   Voter agent
  Recovery:    Dual mode: restart the application and notify the user; TMR mode: mask the error (optionally notify the user)

Common Agent / Crash
  Consequence: Lost agent
  Detection:   Daemon
  Recovery:    Notify the crashed agent's manager (recovery as described in Section 5.3)

Common Agent / Compilation error
  Consequence: Agent not installed
  Detection:   Daemon
  Recovery:    Ensure that all Chameleon libraries are present on the node; if not, request the appropriate libraries from the agent's manager; if re-compilation repeatedly fails, request a fresh copy of the source code from the agent's manager; if the new source code cannot compile, notify the agent's manager to try installing the agent on a node with a different platform; if the agent will not compile on the new platform, notify the user of an unrecoverable error

Common Agent / Unresponsive
  Consequence: Process alive, but the agent cannot process incoming messages
  Detection:   Daemon
  Recovery:    Kill the unresponsive agent; notify the agent's manager to reinstall the agent

Daemon / Crash or unresponsive
  Consequence: All agents on the node cannot communicate with remote agents
  Detection:   Heartbeat agent
  Recovery:    Notify the daemon's manager (the FTM); most likely, the FTM treats a daemon failure as if the entire node had crashed and recovers as for a node failure

FTM / Crash or unresponsive
  Consequence: Environment left without its overseeing manager
  Detection:   Backup FTM (a designated surrogate manager)
  Recovery:    The backup FTM promotes itself to become the FTM; the new FTM notifies all its managed agents of the change, and all subordinate managers recursively notify their managed agents; the new FTM promotes a new backup FTM from one of the surrogate managers

FTM Daemon / Crash or unresponsive
  Consequence: FTM unreachable
  Detection:   Backup FTM
  Recovery:    Assume the FTM has crashed and recover as above



Figure 1: (A) Error detection (source node detects errors in the sink node); (B) Error recovery (source node recovers from errors in the sink node)

6 Agent Class Hierarchy

This section presents an in-depth look at the underlying agent hierarchy in the Chameleon environment. Chameleon places emphasis on ensuring both internal and external consistency among all components in the system. The need for consistency makes the Chameleon environment well suited to object-oriented concepts and constructs. In fact, much of the functionality and behavior of Chameleon can be succinctly represented in the class structure in Figure 2. Note that all Chameleon objects derive from a common base class, namely the base agent class named Agent. In this way, functionality common to all components may be placed in the Agent class. Not only do all components inherit this common functionality, but they may also modify the default behavior and add behavior of their own. Having a well-structured class hierarchy also allows responsibility and functionality to be added selectively to a group of related objects through the inherent scope provided by the class structure. For example, descendants of the Manager class possess certain behavior and functionality that are not available to other agents (e.g., managers have the ability to install other agents remotely and also have the added responsibility of recovering from failures in their managed agents). Extensive detail on the functionality of the various agents in Chameleon can be found in Section 4.

Agent (Base Agent Class)
├── Manager
│   ├── SurrogateManager
│   │   ├── SMDual
│   │   ├── SMTMR
│   │   └── SMQuad
│   └── FTM
├── Common Agents
│   ├── AgentHeartbeat
│   ├── AgentExecute
│   ├── AgentVoter
│   └── AgentCheckpoint
└── Daemon
    ├── Daemon_TCP
    │   ├── Daemon_TCP_Unix
    │   └── Daemon_TCP_Win32
    └── Daemon_MPI
        ├── Daemon_MPI_Unix
        └── Daemon_MPI_Win32

Figure 2: Agent Class Hierarchy

Most importantly, the agent hierarchy in Figure 2 provides a structured approach for defining the type of an agent. The use of the term type here goes beyond the traditional programming-language definition: the type of an agent dictates the context in which the agent may be used in Chameleon. Take, for example, the quadruple execution mode in Figure 3. This fault-tolerant execution strategy expects four execution agents and three voter agents. From the perspective of the class hierarchy in Figure 2, this means that the agents need to be descendants of AgentExec and AgentVoter, respectively. Descendants of AgentVoter, for example, must all perform the same semantic actions (because they are all of the type AgentVoter), but they may implement those actions differently (e.g., a descendant may decide to accept all values within a specified range). In this way, the class hierarchy serves to organize the functionality of all agents into specific types. These types originate from the expectations of other Chameleon components, such as the fault-tolerant execution strategy shown here. The class hierarchy, however, is completely flexible, allowing new agents and types to be easily introduced into the system. Section 7 goes into more detail on how these new agents may be created and integrated into Chameleon.

Figure 3: Quadruple Execution Mode (flow of control: the application arrives from the surrogate manager, is run by four execution agents whose outputs are compared by three voter agents, and the results are returned to the surrogate manager)

7 Adaptive Composition and Customization of Agents

In this section we look at ways to allow for adaptive agent creation and modification in order to meet user requirements. First, we introduce the library of basic building blocks; in the remainder of the section, we discuss different levels of agent manufacturing, engineering, and re-engineering. By engineering we mean constructing new agents from the basic building blocks; by re-engineering we mean extending the capabilities of existing agents, either by adding building blocks or by overriding the behavior of existing building blocks.

7.1 Library of Basic Building Blocks

As alluded to in the previous section, Chameleon functionality is distributed throughout the agent hierarchy at appropriate levels. This functionality is provided through a set of basic building blocks. Agents are constructed by combining basic building blocks in a particular manner to produce a component that behaves in a specific way. Some basic building blocks are available to all agents; these are represented by basic building blocks in the Agent class. A subset of these common basic building blocks are dubbed primitives, as they are not implemented in terms of other building blocks. Since most of the behavior of Chameleon is event-driven, the following communication-oriented primitives provide access to the underlying Chameleon architecture (see Table 3):

Table 3: Chameleon Primitives

  Primitive        Functionality
  get_type         Returns the type of the agent (e.g., voter agent, surrogate manager, etc.)
  get_id           Returns the unique ID of the agent
  get_manager_id   Returns the unique ID of the agent's manager
  get_daemon_id    Returns the unique ID of the daemon of the node on which the agent is installed
  send_message     Sends a message to the agent with the specified ID

Other basic building blocks may be constructed from these primitives. For example, an install_agent basic building block found in manager agents can be implemented in terms of send_message. In fact, the construction of the install_agent basic building block illustrates how functionality may easily be added to agents: a manager behaves like any other agent, with the added capability of sending "install agent" messages to daemons; and a daemon behaves like any other agent, with the added capability of interpreting "install agent" messages.6 Agents are then simply constructed by utilizing the appropriate basic building blocks provided in the agent class hierarchy. For example, the surrogate manager responsible for the TMR fault-tolerant execution strategy (SMTMR) may be implemented by performing the following simplified steps:

1. Make a request to SMTMR's manager for a list of hosts on which SMTMR may install its agents. This may be accomplished by sending an appropriate message to SMTMR's manager.
2. Install the agents required for the TMR execution mode (three execution agents and one voter agent) on the hosts provided by step one.
3. Send a "begin execution" message to the four agents installed in step two to initiate the fault-tolerant execution strategy.

Although this is an admittedly stripped-down description of the surrogate manager responsible for the TMR execution strategy, it should be clear that each of these steps may be implemented using the basic building blocks previously discussed.

7.2 Encapsulating Behavior and Functionality

By constructing all agents from the basic building blocks described above, a significant step is made towards providing a flexible architecture: the step of encapsulating specific implementation details. The more functionality that is encapsulated in Chameleon, the easier it is to dynamically alter the behavior of the Chameleon system. Essentially, it becomes a matter of substituting one set of components, be they primitives, basic building blocks, or agents, for another.
The reconfiguration provided by encapsulation also suggests a rather simple workaround to a limitation of the static class inheritance of C++, namely the problem of trying to automatically derive functionality from a base class. With encapsulated components, creating a new agent is simply a matter of selecting the proper components to combine. Moreover, such component selection can be made at run-time and can even change dynamically during the lifetime of the object, thus paving the way for on-the-fly reconfiguration. From the encapsulation point of view, the agents previously described are simply specific compositions of the encapsulated basic building blocks presented in Section 7.1. The Chameleon architecture allows any agent or basic building block to be constructed from a variable number of other basic building blocks or other agents. In fact, the architecture allows the exact composition of any specific entity to change dynamically without the need for recompiling. As a concrete example, consider the need for a well-defined message-passing protocol. Chameleon allows for several kinds of message-passing protocols by encapsulating the exact specification of the protocol in the send_message primitive. Since all of these encapsulated protocols conform to the semantics of "sending a message" through a common interface, they may easily be substituted for one another in any agent that uses send_message. In fact, by using the flexible agent manufacturing described in Section 7.3, the encapsulated send_message primitives may even be dynamically substituted without disturbing the system. For these reasons, the encapsulation of several aspects of Chameleon is currently being investigated to augment the traditional static-inheritance approach of providing functionality.

6 This is by no means an exhaustive list of how managers and daemons differ from other agents. Rather, it serves as one example of how agent functionality may be extended through inheritance, basic building blocks, and message handling.

7.3 Flexible Agent Manufacturing

To take advantage of newly created agents, we need a flexible mechanism to incorporate them into the fault tolerance strategy that has been chosen for executing the application. Consider the quad execution mode presented in Figure 3 of Section 6. In order to have a truly extendible design, Chameleon should not force the quad-execution-mode surrogate manager (SMQuad) to use particular instances of execution agents or voter agents. Suppose, for example, SMQuad creates voter agents of the type AgentVoter. The C++ code needed to generate one instance of the voter agent would resemble the following:

    AgentVoter *pAgentVoter = new AgentVoter;

Now assume that the user extends the AgentVoter class as before to include a new type, AgentRangeVoter. In order for SMQuad to use objects of the new class, the code would need to be modified to refer to AgentRangeVoter instead of AgentVoter. Or, more accurately, the agent creation code used by the host daemons when installing new agents would need to be modified to instantiate AgentRangeVoter objects rather than AgentVoter objects. Ideally, surrogate managers should behave more like templates that accept voter agents of any type and execution agents of any type. Taken even further, the surrogate manager should not mandate that all voter agents be of the same type or that all execution agents be of the same type (only that they be derived from an expected type). Unfortunately, the C++ language stipulates that the type of an object be statically specified at compile time. To get around this limitation of C++, the task of agent creation is encapsulated in a special object called an agent manufactory.
The agent manufactory instantiates a specific agent object based solely on a token passed to it as an argument. A token may be an integer, a string, or any other object that uniquely identifies a particular Chameleon object. Assuming integers are used as tokens, an excerpt of the agent manufactory code appears in Figure 4.

Agent *AgentManufactory::create_agent (int token)
{
    switch (token) {
        case 1:  return new AgentVoter;
        case 2:  return new AgentExec;
        case 3:  return new AgentExecPlus;
        default: return NULL;
    }
}

Figure 4: Sample Agent Manufactory Code

Although the agent manufactory appears innocent enough, it is actually quite powerful: the same agent creation code can now be used to create an instance of any Chameleon agent. The surrogate manager only needs to provide the daemon with the token of the agent that is to be installed.

The simple agent manufactory presented above works because all agents are derived from the base agent class (so all pointers to agents may be cast to the type Agent *) and because only member functions in the base agent class are called by the agent creation code.7

7 The exact agent creation code is not shown here, but after the agent manufactory creates a new agent, initializations common to all agents are performed. Agent-specific initialization can occur either through the constructor of the class or through messages sent to the agent (most likely from the agent's manager).


Note that the agent manufactories do not eliminate the need for code changes when new agent classes are introduced to the Chameleon system. Changing code in a single agent manufactory object is significantly better, however, than having to maintain an agent creation code fragment for each class, or having to create additional surrogate managers to utilize the new agent classes. In fact, given the regularity of the agent manufactory code, it would be highly feasible to automate such code changes whenever a new agent class is registered with the FTM.

7.4 Agent Re-Engineering

Agents with new functionality are created via agent engineering and re-engineering. The Chameleon software architecture provides for manual agent re-engineering through the paradigms of the agent class hierarchy and the encapsulation of object behavior. Functions of an agent are implemented by an appropriate composition of basic building blocks. Each building block encapsulates the implementation of an elementary function of the agent's overall behavior. Consequently, agents with new functions can be manually derived from existing agent classes by overriding or adding basic building blocks. Consider an example of AgentVoter that is composed of fine-grained building blocks, one of which implements a voting algorithm for exact comparison of the computation results. Now assume that the user needs to change the exact comparison to a check of whether the value is within a specified range. To achieve this, a new class AgentRangeVoter may be derived from AgentVoter, and the building block that encapsulates the implementation of the voting algorithm (i.e., exact voting) can be replaced by a new version of the voting algorithm (i.e., a range check). Note that the other building blocks incorporated into AgentVoter do not need to be modified. This example demonstrates that Chameleon's approach to agent re-engineering is quite powerful and relatively easy to implement. Chameleon also supports automated agent re-engineering.
For this purpose, simple semantic constructs are provided to allow for a unified representation of user specifications. An example format in which the user may express a specification is:

    {availability; results; resources; vote; agree; time_out}

where:

  availability  Required availability level, specified as an integer starting from 1 (the lowest availability level) up to n.
  results       Results of interest, specified as variables of interest at the end of the execution of the application or as the output file into which the results of the application are saved.
  resources     System resources required, specified as an amount of runtime memory or other application-required resources, such as ghostscript or gnuplot.
  vote          Voting strategy, specified as n_of_m, i.e., n among m machines must agree for the application to succeed (this also includes the "no_voting" strategy).
  agree         Agreement criterion, specified as an exact match or a match within a range of values, defined in terms of absolute value or percentage variation from the mean.
  time_out      An upper bound on the execution time of the application, provided by the user.

This format provides a simple set of semantic constructs for specifying the application requirements. The semantics can be easily extended by incorporating additional fields to capture new application characteristics, e.g., a requirement of fail-silent behavior in application execution. It is worth noting that, in developing such a specification language, we must balance its complexity against its capability to accommodate the necessary information. Making the language very extensive might require the user to invest a significant amount of time to learn how to specify application requirements correctly. A language that is too primitive, on the other hand, may severely limit or complicate collecting the necessary data on the application requirements, and consequently may lead to the selection of an incorrect or inappropriate execution strategy. Clearly, obtaining a robust specification language will require


an iterative approach based on experience in using the system and analysis of different operational scenarios. Once the application requirements are correctly specified, we need a mechanism to interpret them in terms of a concrete fault tolerance strategy. Chameleon provides a parser to transform the specifications into a unified representation structured for automatic interpretation. The parser is capable of mapping the requirements onto primitive functionality that may be realized through the library of basic building blocks and the library of already existing agents. Based on the identified requirements, the FTM determines the fault tolerance strategy, checks for the available resources, and designates a set of agents to support the application execution and to provide error detection and recovery.

7.5 Significance of Automated Agent Manufacturing and Re-Engineering

The ability to create agents based on an integer token (as described above) is the first step towards providing a truly adaptive infrastructure. By being adaptive, Chameleon can adjust its fault-tolerant execution strategies to provide appropriate levels of dependability, or even different forms of dependability, without having to recompile key modules in the environment. As an immediate example, consider SMQuad from the example in Section 6. By using flexible agent manufacturing, this single surrogate manager can provide different fault-tolerant execution strategies that vary according to their agent composition (i.e., which kinds of execution agents and which kinds of voter agents it chooses to use). This same surrogate manager also has the potential to use kinds of execution agents and voter agents that have yet to be created (provided that these new agents register themselves with the FTM). All of this takes place without having to recompile the surrogate manager code. This, in itself, brings up an interesting side-effect.
Suppose the user submits two applications to run in quadruple execution mode. Because of the dynamic agent composition supported by flexible agent manufacturing, the two surrogate managers are not required to have the same composition: two quadruple execution strategies with different implementations, but using the same surrogate manager template, may coexist in the Chameleon environment. Extending the idea of dynamic composition to include common agents as well as execution strategies opens the door for even more exciting possibilities. Just as execution strategies can be thought of as collections of agents, agents can be thought of as collections of building blocks. Applying the idea of flexible agent manufacturing to building blocks allows Chameleon to dynamically instantiate any building block based solely on a unique token. In this way, the composition of any agent may change dynamically without the need to recompile the agent code. Flexible agent manufacturing thus provides the mechanism through which newly created agents and building blocks may be seamlessly integrated into the Chameleon environment. Using the encapsulation provided by the Chameleon architecture as described in Section 7.2, new agents and building blocks may be constructed; flexible agent manufacturing may then be used to dynamically incorporate them into existing fault-tolerant strategies and components. By providing this technique for dynamically reconfiguring execution strategies and components, Chameleon is able to offer services that adapt to a changing environment or to changing user specifications.

8 Implementation

We have a prototype implementation of Chameleon on a testbed of heterogeneous computing nodes at the Center for Reliable and High-Performance Computing at the University of Illinois. The computation nodes communicate with one another using the TCP/IP protocol over 10 Mb/s Ethernet. The software


has been ported to SunOS, Sun Solaris, HP-UX, and Windows NT. The configuration of the machines on which the environment runs is summarized in Table 4.

Table 4: Machine Configurations

  Name       Manufacturer/Model   Memory   Disk   OS
  tyagaraj   HP9000/715           32M      2.0G   HP-UX 9.05
  mozart     Sun Ultra1-170       128M     2.1G   Solaris 2.5
  nahoona    Sun Ultra1-170       64M      2.1G   Solaris 2.5.1
  dvorak     HP9000/C160          64M      2.1G   HP-UX 10.20
  franck     Sun Ultra1-170       64M      2.1G   Solaris 2.5.1
  bizet      Sun Ultra1-140       64M      2.1G   Solaris 2.5.1
  karl       Sun Ultra1-200       192M     4.2G   Solaris 2.5.1
  bernstein  Sun Ultra1-140       64M      2.1G   Solaris 2.5.1
  wolf       Sun Ultra1-170       128M     2.1G   Solaris 2.5.1
  berg       Sun Ultra1-140       64M      2.1G   Solaris 2.5.1
  intel18    Intel P6-200         64M      4.2G   Windows NT 4.0

The prototype supports the following execution modes: (1) single-node execution offering baseline reliability, (2) duplicated execution in which the first result is accepted, (3) duplicated execution in which the two executions must agree for success, and (4) TMR execution with majority voting. All of these modes can optionally use the functionality provided by a checkpointing agent. Currently, however, the checkpointing library has been developed only for the Solaris platform; consequently, an application that wishes to take advantage of rollback recovery must run on the Solaris nodes among the machines listed above. More importantly, we have taken some steps toward supporting off-the-shelf distributed applications in addition to stand-alone applications: we have successfully taken computationally intensive operations (such as matrix multiplication), parallelized them by hand, and run them in our environment at the various reliability levels offered.

Following the hierarchical approach, we develop our agents around the base class primitives. Examples of developed agents are listed in Table 5.

Table 5: Basic Developed Agents

| Base class primitive | Invoked by        | Function |
| -------------------- | ----------------- | -------- |
| install_agent        | Surrogate manager | Installs an agent on a particular host. |
| get_appinfo          | FTM               | Collects information from the user specification about the requirements of the application. |
| voter                | Surrogate manager | Votes on a parametrized number of results. Can be passed the entities from which results are expected and the comparison scheme to be followed. |
| monitor_agent        | Host daemon       | Monitors one of the agents installed on the host.[9] |
| send_ready_notif     | Execution agent   | Sends notification to the SM that it is ready to accept an application for execution. |
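The voter primitive above is parametrized by the number of results and by the comparison scheme to be followed. A minimal sketch of such a voter is given below; the names `vote`, `compare`, and `quorum` are our own, since the prototype's actual interface is not shown here.

```python
def vote(results, compare=lambda a, b: a == b, quorum=None):
    """Majority-vote over a list of replica results.

    `compare` is the pluggable comparison scheme (exact equality by
    default; an approximate comparison can be passed for floating-point
    results).  `quorum` defaults to a strict majority of the inputs.
    Returns the winning result, or None if no result reaches the quorum.
    """
    if quorum is None:
        quorum = len(results) // 2 + 1
    for candidate in results:
        agreeing = sum(1 for r in results if compare(candidate, r))
        if agreeing >= quorum:
            return candidate
    return None
```

With three replicas this gives the TMR majority vote of mode (4); with two replicas and `quorum=2` it gives the duplicated-with-agreement mode (3), returning `None` on disagreement.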

8.1 Scalability

The Chameleon implementation has proven easy to extend to accommodate different fault tolerance strategies, as we concluded from our experience in migrating from the duplicated to the TMR execution mode. Because the environment is intended to operate in a heterogeneous network of computation nodes, all platform-dependent features have multiple platform-specific implementations. Adding a node to the environment is easy if the node's architecture is already supported by the environment (see the architectures enumerated in Table 4). If the architecture falls outside this list, an additional case needs to be added to each entity's (i.e., agent's) machine-dependent code. Our experience with porting the environment to Windows NT[8] has

[8] For the NT machine, the FTM, the SM for the duplicated mode, and the voter are capable of running on the NT node and installing agents on Unix nodes.

[9] The monitoring in the current implementation is done by trapping illegal signals raised and by timeout intervals.


been quite positive in this respect. The portions that required the most effort were process/thread management and handling shell commands.

8.2 Application and Background Workload

The benchmark application used in our experiments is a distributed matrix multiplication (the results presented in the next section are for a run with two matrices of sizes 200×400 and 400×200; the executable file for each sub-part of the application is 33.5K). It employs the simple matrix multiplication algorithm[10], distributed over two machines. At the end of its computation, each part dumps its result into a file; the partial results are combined at the voter, and the combinations are voted upon. The application is run in TMR mode, which involves three independent pairs of machines, each pair executing a replica of the distributed application. One execution agent on each of the six machines monitors the execution of its part of the distributed application. The FTM, the surrogate manager, and the voter run on a separate computation node. Since we wish to use application checkpointing and measure recovery times for application failures, we are constrained to run the six copies on Solaris machines. The configuration is as follows: the FTM, the surrogate manager, and the voter run on monn, and the three copies of the application run on the machine pairs (berg, bernstein), (bizet, karl), and (mozart, nahoona). The background workload for the experiment is varied from the baseline case of the normal background load on our network of machines to one or more copies of a computationally intensive factorial calculation executed on the machines participating in the Chameleon environment. We do not mandate an idle workload because the normal workload is more representative of what would be experienced in a cluster whose nodes are not dedicated.
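The distribution scheme used by the benchmark (footnote 10: each part computes one horizontal strip of the result matrix) can be sketched as follows. This is an illustrative pure-Python rendering, not the benchmark code itself, and `strip_multiply` is a hypothetical name.

```python
def matmul(a, b):
    """Plain triple-loop matrix multiply: c[i][j] = sum_k a[i][k]*b[k][j]."""
    z, n = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(z)) for j in range(n)]
            for row in a]

def strip_multiply(a, b, parts=2):
    """Split A into `parts` horizontal strips of rows, multiply each
    strip independently (in the prototype, on a different node), and
    concatenate the partial results into the full product A*B."""
    m = len(a)
    bounds = [(p * m // parts, (p + 1) * m // parts) for p in range(parts)]
    strips = [matmul(a[lo:hi], b) for lo, hi in bounds]  # one strip per node
    return [row for strip in strips for row in strip]
```

With `parts=2`, each of the two machines computes half the rows of the result, and the voter only needs to compare the concatenated outputs of the replicas.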
8.3 Measurements

To determine the effectiveness of the Chameleon environment, it is essential to demonstrate the system's capability to provide fault tolerance against different failure modes (i.e., application, hardware, and agent failures) while preserving an acceptable level of performance overhead. We have therefore conducted direct measurements in the prototype implementation of Chameleon to obtain the overheads in application execution and the recovery times for various failure scenarios. Note that the entities on which the measurements are made are in an active phase of development; hence, the numbers do not reflect any of the optimization techniques (chiefly with respect to the number of handshaking messages exchanged) that we plan to apply to reduce the communication overhead.

8.3.1 Time to launch basic agents

To get a sense of the time overhead involved in launching the different components of the Chameleon infrastructure, we conducted the measurements reported below.

• Installing the execution agent (average time over six machines): 2589 ms
  - Opening a connection with the host daemon (HD): 8 ms
  - Sending the execution agent code (24,612 bytes) over to the HD: 63 ms
  - HD compiling and starting execution of the agent: 2518 ms

[10] If C(m×n) = A(m×z) · B(z×n), then c(i,j) = ∑_{k=1,...,z} a(i,k) · b(k,j), where A and B are the input matrices and C is the resulting matrix. The distribution is such that each part computes one horizontal strip of the result matrix, consisting of half the rows.




• Installing the host daemon (on dvorak): 4526 ms
  - Opening a connection with the initialization process[11]: 8 ms
  - Sending the host daemon code (11,151 bytes) over: 32 ms
  - Initialization process compiling and starting the HD: 2281 ms
  - Initialization agent: 2205 ms

8.3.2 Overhead in the Application Execution and Recovery Times

To quantify the time overhead in the application execution, we conducted several experiments. First, we give the times to detect node (or host daemon) failures and application failures:

• Time for local detection of a failure by trapping abnormal signals (as done by the execution agent while monitoring an application that misbehaves, on berg): 928 ms

• Overhead of the heartbeat agent (implemented as ICMP requests and echoes, from the FTM at monn): 10,494 ms if the node has failed (the default timeout period of 10 s dominates); 2.716 ms if the node is okay.
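The two detection mechanisms above — trapping abnormal termination of a monitored application and heartbeating a node with a timeout — can be sketched as follows. This is a simplified stand-in: `monitor_once` and `heartbeat` are hypothetical names, and a generic probe callable replaces the ICMP request/echo (raw ICMP requires privileges).

```python
import subprocess

def monitor_once(cmd, timeout=10.0):
    """One iteration of application monitoring, as an execution agent
    might do it: on POSIX, a negative return code means the process was
    killed by a signal (abnormal termination); a timeout means a hang.
    Returns 'ok', 'signal', or 'timeout'."""
    try:
        proc = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 'timeout'
    return 'signal' if proc.returncode < 0 else 'ok'

def heartbeat(probe, timeout=10.0):
    """Classify a node from one heartbeat probe.  `probe` stands in for
    the ICMP request/echo: it returns the round-trip time in seconds,
    or None if no echo arrived before `timeout` expired."""
    rtt = probe(timeout)
    return 'alive' if rtt is not None and rtt < timeout else 'failed'
```

The 10 s figure above corresponds to the default `timeout`: a failed node costs the full timeout period, while a healthy node answers in a few milliseconds.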

Second, we measured the time overhead in the application execution for (1) fault-free execution without checkpointing and (2) execution with a fault injected into the application and recovery from a checkpoint. The measurements (averaged over five application runs) are given in Table 6 as a function of the background workload, which varies from 0 to 3 additional processes per node (each process executes a factorial computation program).

Table 6: Time Overhead in Application Execution

| Workload processes | Standalone execution [s] | Fault-free execution in Chameleon [s] (overhead) | Execution in Chameleon with fault injection and recovery [s] (overhead) |
| --- | --- | --- | --- |
| 0 | 38.01 | 49.40 (30.0%) | 63.69 (28.9%) |
| 1 | 49.36 | 53.86 (9.1%)  | 73.05 (35.6%) |
| 2 | 54.87 | 59.78 (8.9%)  | 97.64 (63.3%) |
| 3 | 73.52 | 79.72 (6.2%)  | 99.17 (34.9%) |

Specific comments on the execution times given in Table 6:
1. All execution times include the time to compile the two parts of the application.
2. The fault-free execution in Chameleon encompasses the time to launch the agents needed to support the application (i.e., the execution agents, the surrogate manager, and the voting agent), the time to vote on the results from the application, and the time to communicate the results to the surrogate manager.
3. The execution with fault injection and recovery comprises, in addition to the times described in point 2, the time to launch the checkpoint agent, to modify the application (i.e., to incorporate the function call that invokes checkpointing), and to link the checkpoint agent with the execution agent on the node. The execution time also includes the time spent setting checkpoints (the overhead of each checkpointing operation is 70 ms on monn, and the checkpointing interval is 1.5 s) and, finally, the time for error detection and recovery from the last checkpoint.

For the readings presented in Table 6 we injected a single fault at a random offset from the start of the application. The observed overhead in the application execution is about 30% as compared to the fault-free execution in Chameleon. A higher overhead of 63% is measured for the execution with two background processes running on each node. This higher overhead is due to changes in the load on the individual nodes and in the network traffic; recall that the application is executed in a network of regular workstations that are used by other users.

Third, we present times for recovery from failures of various components under the normal workload (i.e., no copy of our workload process running). The times are estimated by summing the times for the individual sub-operations and are given in Table 7. Note that for host daemon recovery the assumption is that the node is alive (i.e., only the daemon process crashed) and that a single execution agent and application were on the host and need to be restarted. The installation times for the application, the execution agent, the voter agent, and the host daemon are averaged over the six machines.

Table 7: Recovery Times

| Entity to be recovered | Recovery time (ms) |
| ---------------------- | ------------------ |
| Agent                  | 5017               |
| Host Daemon            | 8925               |
| Surrogate Manager      | 4446               |
| Voter                  | 4099               |

[11] As mentioned in Section 3.1, an initialization process exists on each node that wishes to join the environment. The initialization process is capable of accepting and installing the host daemon from the FTM.
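Comment 3 above describes the checkpointing agent's role: checkpoints are taken periodically, and after a fault the application rolls back to the last checkpoint instead of restarting from the beginning. A minimal sketch of that rollback-recovery loop is given below, with a step-count interval standing in for the prototype's 1.5 s timer; all names are ours.

```python
import copy

def run_with_checkpoints(steps, step_fn, state, interval=10, fail_at=None):
    """Run `steps` iterations of `step_fn`, checkpointing `state` every
    `interval` steps.  A single fault injected at step `fail_at` discards
    the live state; execution then resumes from the last checkpoint
    rather than from the beginning (rollback recovery).
    Returns (final_state, steps_executed_including_redone_work)."""
    checkpoint, ckpt_step = copy.deepcopy(state), 0
    executed, injected = 0, False
    i = 0
    while i < steps:
        if fail_at is not None and i == fail_at and not injected:
            injected = True
            state, i = copy.deepcopy(checkpoint), ckpt_step  # roll back
            continue
        state = step_fn(state)
        executed += 1
        i += 1
        if i % interval == 0:
            checkpoint, ckpt_step = copy.deepcopy(state), i  # set checkpoint
    return state, executed
```

The redone work is bounded by the checkpoint interval: a fault at step 17 with `interval=10` repeats only steps 10-16, which is why the recovery rows in Table 6 grow by seconds rather than by a full re-execution.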

Finally, we present some preliminary results from executing a simple loop application in duplicated mode with the FTM running on a Windows NT machine (intel1); the two machines selected for executing the application are wolf and monn.

• Time to transmit, install, and get an acknowledgment from the execution agent (average over six measurements): 1462 ms (on wolf); 1717 ms (on monn)

• Time to send the application to the execution agent and get the results back (average over six measurements): 1056 ms (on wolf); 1059 ms (on monn)
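Execution mode (2) from Section 8 — duplicated execution in which the first result is accepted — can be sketched with two concurrent replicas, using threads in place of the prototype's separate nodes; `first_result` is a hypothetical name.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def first_result(task, replicas=2):
    """Run `replicas` copies of `task` concurrently and accept whichever
    finishes first.  The slower replica's result is simply discarded;
    this mode masks slow or hung nodes rather than wrong answers."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task) for _ in range(replicas)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
```

The design trade-off against mode (3) is visible here: accepting the first result gives latency equal to the faster replica, but provides no agreement check on the value returned.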

8.4 Discussion

In this section we discuss lessons learned from our first, very positive experience with the Chameleon environment. We contend that while the theory of fault tolerance in distributed systems is rich and varied, there is as yet limited experience in the research community in actually building software-based, fault-tolerant systems. As a result, the design and implementation of distributed, dependable systems is still in the experimental phase. Most existing approaches to software-based, distributed fault tolerance employ the process-group paradigm, e.g., Delta-4[12], Isis, Horus, and Totem (see Section 2 for a description of related work). Their underlying communication layer is built on top of a reliable multicast protocol (such as atomic multicast), and the systems rely on fail-silent behavior of their components. Although these architectures are capable of improving the availability of delivered services, it is still not clear that any of them provides the best answer to building a distributed fault-tolerant system out of unreliable computation resources.

The question to be asked here is why we do not base the underlying communication layer in Chameleon on a reliable multicast protocol. The answer is multifold. The group-communication paradigm is not always the most efficient approach. For example, in providing access to a reliable database, the primary-backup paradigm may offer a relatively cheap and robust solution (e.g., Tandem

[12] Delta-4 was the first system that explicitly addressed the issue of fault tolerance in distributed systems. Work on the Delta-4 project resulted in many valuable ideas, not least of which is the assurance of fail-silent behavior via custom hardware.


process pairs used in providing highly fault-tolerant architectures [Jew91]). Similarly, systems such as Wolfpack from Microsoft [Wolf97] and the Ultra Enterprise Cluster from Sun Microsystems [Sun97], although at different stages of development, promise to provide fault-tolerant services without using the group-communication paradigm. Also, Lucent Technologies - Bell Labs, by employing software-implemented fault tolerance in some telecommunication products, enhanced those products' dependability with acceptable performance overhead [Hua93]. Ideally, a system would be able to invoke multicast when necessary; the design of Chameleon is such that it can easily be ported on top of an existing reliable atomic multicast.

We feel that broader experimentation with a variety of systems is the only way to understand which techniques work better and under what conditions. To investigate some of the features of the existing systems, we have conducted direct fault injection experiments with a representative of the family of systems that maintain groups of cooperating processes. Our key observation is that in most of the created failure scenarios the system survived and was able to operate correctly; however, the system crashed or hung when the assumption of fail-silent behavior was violated or when a component of the protocol stack failed. In our view it would be a mistake to consider the different systems as merely competing; rather, they are complementary approaches that contribute to the ultimate goal of achieving highly available services in a network of unreliable components. Detailed failure-mode analysis and evaluation of error detection and recovery mechanisms can provide valuable insight into the capabilities of these systems and can be used as a basis for meaningful comparison among them.
However, such a comparison can be inadequate or inaccurate (or, at best, incomplete), because many implicit assumptions made in designing a system are hard (or impossible) to quantify, even for the designers themselves. Ideally, one would develop a common benchmark capable of assessing the different approaches. While such benchmarks are still not widely available, there is ongoing research: two examples of existing benchmarks are a robustness benchmark developed at CMU [Sie93] and a fault-tolerance benchmark designed and implemented at the University of Illinois [Tsa96].

We believe that to make a breakthrough in designing and implementing cost-effective, distributed, fault-tolerant architectures, a joint effort from industry and academia is needed. We expect that a number of different implementations developed by researchers at universities and available as experimental testbeds would provide an unprecedented opportunity to make rapid progress in the field of distributed, dependable computing. A variety of software infrastructures is not only good for the fault tolerance community itself; it also helps to identify and understand the key issues and difficulties in designing robust, distributed, fault-tolerant architectures.

9 Conclusions

This paper presents Chameleon, an adaptive infrastructure that allows different levels of availability requirements to be supported simultaneously in a single networked environment. Chameleon provides a highly structured agent class hierarchy and makes extensive use of encapsulation of behavior. These two features promote a flexible architecture that allows for dynamic, even automatic, adaptation to changing dependability requirements. Carefully defined and implemented error detection and recovery techniques play the key role in attaining high availability for off-the-shelf applications.
The initial implementation of Chameleon demonstrates the feasibility of the proposed agent-based technology, which allows an entire service, including process and data management, error detection, checkpointing, and recovery, to be composed from agents. The proposed software infrastructure is capable of providing an adaptive level of fault tolerance in a network of unreliable, heterogeneous computation nodes (including UNIX- and Windows NT-based platforms). The experimental results show that the overhead in application execution and the recovery times are acceptable for computation-intensive applications, including scientific stand-alone programs and distributed applications such as matrix multiplication. We

are aware that much work needs to be done on extending Chameleon. In particular, the library of basic building blocks and the mechanisms for flexible and automated agent manufacturing need further development. We will also need to provide more sophisticated semantics and a parser for user-supplied requirements. To our knowledge, Chameleon is one of very few real implementations that maintain fault tolerance via a software infrastructure only. Chameleon provides fault tolerance from the application's point of view, and the software infrastructure itself is fault-tolerant: robust error detection and recovery mechanisms provide recovery from application failures, failures due to the hardware and operating system, and failures of the Chameleon entities (agents) themselves. Our first, very positive experience with the Chameleon environment has convinced us that the chosen path is the right one.

Acknowledgment

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract DABT63-94-C-0045 and by NASA under grant NAG 1-613, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS). We would like to thank J. Wang, M. Kalyanakrishnan, and E. Haakenson for their contributions in developing the application, porting to Windows NT, and conducting simulation-based fault injection.

REFERENCES

[Ami92] Amir Y., D. Dolev, S. Kramer, D. Malki, "Transis: A Communication Sub-System for High Availability," Proc. FTCS-22, 1992, pp. 76-84.
[Bar90] Barrett P.A., et al., "The Delta-4 Extra Performance Architecture," Proc. FTCS-20, 1990, pp. 481-488.
[Bir96] Birman K.P., "Building Secure and Reliable Network Applications," Manning Publications Co., 1996.
[Bir93] Birman K.P., "The Process Group Approach to Reliable Distributed Computing," Communications of the ACM, vol. 36, no. 12, 1993, pp. 37-53.
[Bir94] Birman K.P., R. van Renesse, "Reliable Distributed Computing with the Isis Toolkit," IEEE Computer Society Press, Los Alamitos, California, 1994.
[Cri91] Cristian F., "Understanding Fault-Tolerant Distributed Systems," Communications of the ACM, vol. 34, no. 2, 1991, pp. 57-78.
[Cri90] Cristian F., B. Dancey, J. Dehn, "Fault Tolerance in the Advanced Automation System," Proc. FTCS-20, 1990, pp. 6-11.
[Dol96] Dolev D., D. Malki, "The Transis Approach to High Availability Cluster Communication," Communications of the ACM, vol. 39, no. 4, 1996, pp. 64-70.
[Hor95] Horst R.W., "TNet: A Reliable System Area Network," IEEE Micro, February 1995, pp. 37-45.
[Hua93] Huang Y., C. Kintala, "Software Implemented Fault Tolerance: Technologies and Experience," Proc. FTCS-23, 1993, pp. 2-9.
[Jew91] Jewett D., "Integrity S2: A Fault-Tolerant Unix Platform," Proc. FTCS-21, 1991, pp. 512-519.
[Kop88] Kopetz H., et al., "Distributed Fault-Tolerant Real-Time Systems: The MARS Approach," IEEE Micro, vol. 9, no. 1, pp. 25-40.
[Maf97] Maffeis S., "Piranha: A CORBA Tool for High Availability," IEEE Computer, vol. 30, no. 4, 1997, pp. 59-66.
[Mos96] Moser L.E., P.M. Melliar-Smith, D.A. Agarwal, R.K. Budhia, C.A. Lingley-Papadopoulos, "Totem: A Fault-Tolerant Multicast Group Communication System," Communications of the ACM, vol. 39, no. 4, 1996, pp. 54-63.
[MPICH97] MPICH - A Portable Implementation of MPI, http://www.mcs.anl.gov/mpi/mpich.
[OMG95] Object Management Group, "The Common Object Request Broker: Architecture and Specification (CORBA)," Revision 2.0, OMG Publications, 1995.
[Pow94] Powell D., "Lessons Learned from Delta-4," IEEE Micro, vol. 14, no. 4, 1994, pp. 36-47.
[Pow91] Powell D., ed., "Delta-4: A Generic Architecture for Dependable Distributed Computing," ESPRIT Research Reports, vol. 1, Springer-Verlag, 1991.
[Rei93] Reiter M.K., "Distributing Trust with the Rampart Toolkit," Communications of the ACM, vol. 36, no. 12, 1993, pp. 71-74.
[Ren96] van Renesse R., K.P. Birman, S. Maffeis, "Horus: A Flexible Group Communication System," Communications of the ACM, vol. 39, no. 4, 1996, pp. 76-83.
[Sie93] Siewiorek D., J. Hudak, B-H. Suh, Z. Segall, "Development of a Benchmark to Measure System Robustness," Proc. FTCS-23, 1993, pp. 88-97.


[Sun97] Sun RAS Solutions for Mission-Critical Computing, White Paper, Sun Microsystems, October 1997, http://www.sun.com/cluster/wp-ras/.
[Tsa96] Tsai T., R. Iyer, D. Jewett, "An Approach towards Benchmarking of Fault-Tolerant Commercial Systems," Proc. FTCS-26, 1996, pp. 314-323.
[Wolf97] Microsoft Clustering Architecture "Wolfpack," White Paper, Microsoft, May 1997, http://www.microsoft.com/ntserver/info/wolfpack.htm.
