Availability Management of Distributed Programs and Services
Markus Endler
Departamento de Ciência da Computação, IME-Universidade de São Paulo, São Paulo, Brazil
E-mail:
[email protected]
Abstract

Modern distributed applications pose increasing demands for high availability, automatic management, and dynamic configuration of their software systems. This paper presents the architecture of Sampa, a System for Availability Management of Process-based Applications, which aims at fulfilling these requirements. The system has been designed to support the management of fault-tolerant DCE-based distributed programs according to user-provided and application-specific availability specifications. It is supposed to detect and automatically react to faults such as node crashes, network partitions, process crashes, and hang-ups. In this paper, we focus on the design of some of its services (the monitoring, checkpointing, and configuration management facilities) and show how they can be used for managing a generic fault-tolerant service.

1 Introduction

Distributed applications are becoming bigger and more complex, and users are increasingly demanding high availability from these applications. This situation creates a huge demand for systems that support automatic and availability-oriented management of distributed programs and services through monitoring and dynamic configuration facilities. Such management aims at guaranteeing the reliable and efficient execution of distributed programs, and the availability of essential services or processes in spite of node or communication failures. Some work has been done in implementing specific tools for monitoring and controlling distributed applications, such as the Meta Toolkit[12], the Megascope tool[15] within the Project Pilgrim[14], and the tools developed by Huang and Kintala[7]. However, until now, we have not seen any system supporting the availability management of DCE-based applications.

Sampa, which stands for System for Availability Management of Process-based Applications, will be a decentralized and fault-tolerant system intended to support the management of fault-tolerant DCE-based services and programs through the enforcement of application-specific availability specifications. Sampa is supposed to detect faults such as node crashes, network partitions, process crashes, and hang-ups, and to automatically execute the necessary recovery actions according to a user-provided availability specification. Due to Sampa's limited support for checkpointing, its fault detection and recovery capabilities are constrained to the automatic detection and recovery of the faults mentioned above, and to periodic checkpointing and recovery of some of the program's internal state, which corresponds to level 2 in Huang and Kintala's [7] classification of fault-tolerance facilities. The main application areas for this kind of fault tolerance are systems with higher demands on availability than on strong data consistency, such as telephone switching systems or information retrieval systems.

In this paper we will focus on the design of some of the services within Sampa, and show how these can be used for managing a generic fault-tolerant service. Section 2 presents the system's global architecture and its main components. In sections 3, 4, and 5 we describe the design and the main features of its monitoring, checkpointing, and configuration management facilities, respectively. In section 6, we present Sampa's language for writing availability specifications, through the use of an example. Finally, in sections 8 and 9, we comment on related work, and draw some conclusions on the current status and future steps of the project.
2 Global Architecture

Sampa's architecture is based on the principle of separating all management functions (e.g., monitoring, process management, checkpointing) into two levels. At the lower level, agents executing on every host perform all sorts of local process management tasks (e.g., process creation, deletion, stopping and resuming, checkpointing) and monitoring operations (e.g., collection, filtering, and data pre-analysis), and interact with a supervisor process to notify it of fault occurrences and to receive monitoring and reconfiguration commands. At the higher level, a supervisor makes global management decisions for each distributed service (or application program) according to its availability specification, and based on the monitored data received from the agents. The availability specification consists of global monitoring and reconfiguration directives which are interpreted by the supervisor. The supervisor then delegates these directives in the form of simpler commands to the corresponding agents. The supervisor also provides a user interface for configuring the distributed programs/services in an ad-hoc manner.

Following this approach, Sampa's architecture consists of some base services, agents executing on every system node, and a single coordinating supervisor. The agents are responsible for monitoring several hardware, operating system, and DCE resources at their node (e.g., CPU and memory utilization, RPC request rate, etc.), and for analyzing the data before sending it to the supervisor. Besides this, the agents have full control of all application processes and servers executing on their node. This control can be implemented efficiently through the use of signals and local process communication facilities, which are available in most operating systems.

Due to the fault-tolerance requirements of the system itself, agents are also in charge of monitoring the availability of the other agents and the supervisor, and of reporting any failures (node crash, network partition) to the other agents and the supervisor.

Although in principle this architecture may be used with more than one supervisor, each specialized in a certain kind of management (e.g., performance, configuration, or availability management), our initial goal in the project is to implement a single supervisor that controls monitoring, system availability, and reconfiguration.

2.1 Base Services
For implementing fault-tolerant services and applications, a minimum set of base services is required on top of DCE: reliable group communication, monitoring, and basic checkpointing support. They will be used by Sampa's agents (e.g., to maintain a coherent view of the system, and to recover their state when a node fails), and may also be used by some applications. The above-mentioned services are being implemented as runtime libraries and as system processes (daemons or clerks) which issue operating system calls and use DCE services.

DCE does not provide any form of reliable multicast communication mechanism. However, since many fault-tolerant programs (e.g., those which require their processes to maintain a consistent and synchronized state) rely on such a facility, we decided to implement such a service in Sampa. Although many others have also implemented reliable group communication for various kinds of operating systems and runtime environments, the main challenge in building such a service for DCE is to implement it efficiently in user space and with RPCs. The architecture of the group communication service is described in [5].

The second of the base services is the monitoring support, which is essential for the ability to detect stable failures or temporary overload situations, such as a node failure, a process fail-stop, or a server overload. For the sole purpose of availability and configuration management, we found that one needs monitoring only at the process level. This monitoring can be implemented by simple queries to the operating system and runtime libraries for checking the availability and activity of processes and performing simple performance measurements. Such queries do not need any instrumentation of the application programs and can be implemented in user space. The basic idea is to implement the sampling, storage, filtering, and pre-analysis of the monitored data in special monitoring processes, which are created by the agents, and which communicate with them for receiving commands with the monitoring criteria, and for transmitting monitored data and events to the supervisor.

The third of the base services is the checkpointing support, which includes facilities for saving and restoring values of global variables within application processes, and for retransmitting the messages sent to a failed process. With respect to the first facility, the basic idea is to instrument all critical application processes with procedures to save and load the relevant portion of their execution state to and from stable storage. We chose to automate this instrumentation by a macro preprocessor for application programs written in the C language. Based on an annotation of the variables and data structures that characterize the relevant process states, this macro preprocessor generates the procedures to save and load the values of these variables and data structures to and from a file, and also the calls to DCE's runtime library to export this interface and to wait for call requests. At runtime, procedures load_state and store_state may then be executed by this or any other authorized process (e.g., an agent). However, this will happen only at user-specified program points, where the process stops and temporarily waits for an execution request for either of the two procedures. The second facility is implemented through checkpoint daemons, which perform the logging of all the messages sent among application processes, and which re-transmit these messages to the new instances of failed processes. This makes it possible to re-establish the global state of the application program reached just before the failure of one of its components.

For all these services, the main challenge has been to design them in such a way that they do not require any changes to the underlying DCE runtime system and operating system.
A second goal has been to automate as much as possible the instrumentation of the application processes using these services.
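To give a concrete picture of what such generated state-saving procedures can boil down to, the following is a small, self-contained sketch. It is not Sampa's actual generated code (which additionally exports an RPC interface and reports sequence numbers to a checkpoint daemon); all names are illustrative, and it only shows how the values of an annotated integer and linked list might be written to and re-read from a checkpoint file:

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

/* Hypothetical annotated process state: an integer and a linked list. */
typedef struct reg { int val; struct reg *next; } reg_t;

/* Save/load an integer in a canonical text format. */
static void sv_int(FILE *f, int v) { fprintf(f, "%d\n", v); }
static int  ld_int(FILE *f) { int v = 0; fscanf(f, "%d", &v); return v; }

/* Save/load a linked list: element count first, then the values. */
static void sv_list(FILE *f, const reg_t *l) {
    int n = 0;
    for (const reg_t *p = l; p; p = p->next) n++;
    fprintf(f, "%d\n", n);
    for (const reg_t *p = l; p; p = p->next) fprintf(f, "%d\n", p->val);
}
static reg_t *ld_list(FILE *f) {
    int n = ld_int(f);
    reg_t *head = NULL, **tail = &head;
    for (int i = 0; i < n; i++) {
        reg_t *p = malloc(sizeof *p);
        p->val = ld_int(f);
        p->next = NULL;
        *tail = p;
        tail = &p->next;
    }
    return head;
}

/* What a generated save/load pair for this state might look like. */
void save_state(const char *file, int state, const reg_t *list) {
    FILE *f = fopen(file, "w");
    sv_int(f, state);
    sv_list(f, list);
    fclose(f);
}
void load_state(const char *file, int *state, reg_t **list) {
    FILE *f = fopen(file, "r");
    *state = ld_int(f);
    *list = ld_list(f);
    fclose(f);
}
```

In Sampa, such procedures would additionally be invokable remotely (e.g., by an agent) and would report the current send and receive sequence numbers to the checkpoint daemon.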
2.2 Agents

The management agents are responsible for performing all the basic and local monitoring and process control at one host on behalf of the supervisor, thus combining the functionality of Huang and Kintala's watchdog daemon [7] and Pilgrim's [15] Generic Instantiation Service. Agents are in charge of creating and destroying application processes, checkpointing these processes, and controlling the monitoring activities at the host. In order to perform these functions, agents use Sampa's base services, local operating system facilities, and DCE services. The following are the three main tasks of the agents. They are determined by commands that are received from the supervisor and are derived from the availability specifications.

Monitoring. One of the main tasks of agents is to configure and control the monitoring of local processes with respect to their availability, resource usage, request and message receive rates, etc.

Configuration Control. Agents are also responsible for creating, checkpointing, and deleting application processes, and for keeping track of the current communication binding relations among these processes.

Mutual Control. Because Sampa is supposed to be itself fault-tolerant, agents will periodically check the availability of the other agents and the supervisor, and broadcast any faults to the remaining agents and the supervisor. This mutual control will be performed using a synchronous membership protocol based on a cyclic organization of the agents [4].

2.3 Supervisor

The management supervisor is responsible for analyzing the monitored data and enforcing the global availability policy for each of the managed programs or services. Since a supervisor must be able to manage several programs within the same network, every such application will have a unique application identification (AppId), which will have the status of a DCE principal [16] and thus will be subject to the authorization and authentication procedures in DCE. Within each application program, every process will be uniquely identified by an identification of the agent managing it (AgentId) and by a unique process identification (ObjectId). The triple AppId, AgentId, ObjectId will thus uniquely identify every process in Sampa and will be used in every monitoring or configuration control message between the supervisor and an agent. In some supervisor-agent interactions, some of the above-mentioned identifiers may be set to a default empty value; e.g., when AppId is set but the other two fields are empty, the request is defining a global, application-specific parameter that applies to all agents.

The main tasks of a supervisor are to maintain and display an image of the current configuration of a distributed program or service in terms of the distribution of its processes and the communication patterns; to accept monitoring and reconfiguration commands from the user; and to control the distributed programs and services according to the specific availability specifications provided by the user. These availability specifications are written in a rule-based language as event-action pairs. In section 6 we will show an example of such a specification.
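To illustrate the addressing scheme, here is a minimal sketch of how a request carrying the triple AppId, AgentId, ObjectId with default empty fields might be matched against concrete process identifications. The struct layout and function names are invented for illustration; they are not Sampa's actual encoding:

```c
#include <string.h>
#include <assert.h>

/* Illustrative encoding of the (AppId, AgentId, ObjectId) triple; an empty
 * string plays the role of the "default empty value" described above. */
typedef struct {
    char app[32];
    char agent[32];
    char obj[32];
} sampa_id;

/* An empty field in a request acts as a wildcard. */
static int field_matches(const char *req, const char *val) {
    return req[0] == '\0' || strcmp(req, val) == 0;
}

/* Does the concrete process identification fall under the request? */
int id_matches(const sampa_id *req, const sampa_id *proc) {
    return field_matches(req->app, proc->app)
        && field_matches(req->agent, proc->agent)
        && field_matches(req->obj, proc->obj);
}
```

With this reading, a request whose AgentId and ObjectId are empty addresses every process of the application, which is how an application-wide parameter would reach all agents.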
3 Monitoring Support

Since monitoring in Sampa is focused on availability management, it requires only facilities for collecting process-level information: i.e., data and events that are visible externally to the processes and can be collected asynchronously to the processes' execution. Examples of such data are CPU utilization, process activity, and RPC request rates. Thus, Sampa's monitoring facilities are restricted to the sampling and analysis of data that can be obtained by queries to the local operating system and the DCE runtime library. Such external monitoring has the advantage of being more flexible and less intrusive than other approaches in which application processes and runtime libraries must be instrumented with probes that provide runtime information to sensors. The other advantage of external monitoring is that it does not require any changes to the operating system or runtime libraries, which makes it more portable than other techniques.

3.1 Basic Concepts

We define monitoring instances to be active entities at one host that collect runtime data related either to the host or to an individual application process running on that host. Every monitoring instance is derived from a monitoring type, which defines a monitoring metric: i.e., the type of data collected and the parameters that can be set for every instance of this type, such as a sampling interval, low_lim and high_lim thresholds, or a notification interval. Examples of monitoring types could be programs that query the operating system state (similar to the Unix utilities ps and vmstat), call the DCE runtime library and services, or periodically measure the RPC response time between two machines. Every type provides a specific kind of monitored data, which can range from a simple integer to a structured set of data.

3.2 Architecture

As mentioned, all monitoring activities at a host will be controlled by the agent executing on that host. It will create processes responsible for monitoring a particular instance of a host, operating system, or DCE resource according to the functionality defined by its corresponding type. Instances will sample performance and resource utilization data by making the appropriate calls to the operating system and shared DCE runtime libraries; store the data in internal data structures; and communicate with the controlling agent whenever data or an event needs to be forwarded to the supervisor. Communication between the agent and its monitoring instances will be performed using a local IPC mechanism, such as Unix named pipes. All the agent's local monitoring control will be done according to monitoring commands received from the supervisor. Through these monitoring commands, the supervisor may enable or disable monitoring of a particular instance, set threshold values and sampling intervals, or set other monitoring parameters.

The main advantage of this monitoring architecture is that, since monitoring instances are actually processes fork-ed and exec-ed by the agent, it yields a flexible monitoring approach where new and operating system-specific types can be easily incorporated. The other advantage is the uniform treatment of monitoring and application processes within Sampa. The obvious disadvantage of having additional processes for monitoring is that their resource usage will have some influence on the system's performance. The other disadvantage is the lack of control over the scheduling of the monitoring instances (done by the operating system), which makes it impossible to enforce specific sampling and notification intervals.
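As an illustration of the filtering a monitoring instance might perform before bothering the agent, the following sketch checks each sample against configured low_lim/high_lim thresholds and reports an event only when the classification changes. The types and names are assumptions for this sketch, not Sampa's API:

```c
#include <assert.h>

/* Illustrative filtering step of a monitoring instance: a sample is turned
 * into an event only when it crosses the configured low_lim/high_lim
 * thresholds, and repeated notifications of the same condition are
 * suppressed. */
typedef enum { EV_NONE, EV_LOW, EV_HIGH } mon_event;

typedef struct {
    double low_lim, high_lim;
    mon_event last;             /* last condition reported to the agent */
} mon_instance;

/* Classify one sample; returns the event to forward, or EV_NONE. */
mon_event mon_check(mon_instance *m, double sample) {
    mon_event ev = EV_NONE;
    if (sample > m->high_lim)
        ev = EV_HIGH;
    else if (sample < m->low_lim)
        ev = EV_LOW;
    if (ev == m->last)
        return EV_NONE;         /* condition unchanged: stay silent */
    m->last = ev;
    return ev;
}
```

Suppressing repeated notifications of the same condition keeps the traffic between instance, agent, and supervisor proportional to state changes rather than to the sampling rate.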
4 Checkpointing Support

Sampa's approach to checkpointing is based on the premise that state-saving operations are essentially application-specific, and therefore should incorporate much of the application programmer's knowledge of which data structures within each process are the ones that fully characterize the program state. Therefore, we decided to implement a simple checkpointing support where the application programmer has to specify which data is to be included in the checkpoint, and at which program points this is to be done. Due to the high overhead of coordinated checkpointing, we chose to support asynchronous checkpointing (i.e., where processes perform their checkpoints independently of the other processes) with a message logging and retransmission facility.

Sampa's checkpointing facility consists of a source-code preprocessor and a set of checkpoint daemons (cpd) implementing a variant of the sender-based message logging proposed by Johnson and Zwaenepoel [8]. Each of the cpds is responsible for a set of application processes, for which it maintains a log of all the messages sent by any of the processes. In this log, every message is stored with the following information: (Data, SenderId, ReceiverId, SenderSequenceNr (SSN), ReceiveSequenceNr (RSN)). When a process fails, a new instance is restarted from the last checkpoint, and all the messages received by the (faulty) process since the checkpoint are re-transmitted by all cpds to the cpd responsible for the new instance, which then orders the messages according to the RSN field and forwards them to the new instance. This message retransmission protocol assumes that cpds are notified about any process crash, and about the identity of the newly created instance and its supervising cpd. It also requires that the cpd responsible for collecting and forwarding the retransmitted messages to the new instance queries the configuration service for obtaining the new bindings to the interfaces of that instance.

The preprocessor is responsible for generating the procedures save_state and load_state, which store and load the values of annotated (global) variables to and from a file[1] provided as a parameter to these procedures. These procedures also checkpoint the current SSN and RSN values, and send these values to the corresponding cpd, so that it can prune its message log. Procedures save_state and load_state are also exported to the CDS namespace, by which they may be invoked by other processes too, as, for example, the agents. This facility is necessary, for example, for the purpose of a process migration triggered by the supervisor. Notice that the above procedures have exactly the same signature for every process, but their definition will differ from one process to another due to the specific data structures to be saved in each process. By exporting the bindings of this checkpointing interface into separate, process-specific entries of the CDS namespace, every process (with the proper authorization) will be able to obtain these bindings and distinguish them from its own procedures.
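The cpd's replay step described above (select the failed process's messages that lie past its last checkpointed RSN, and deliver them in RSN order) can be sketched as follows. The data layout is illustrative; only the tuple fields come from the text:

```c
#include <stdlib.h>
#include <assert.h>

/* Sketch of the cpd's replay step: pick from the log the messages that the
 * failed process received after its last checkpoint (RSN greater than the
 * checkpointed RSN), and order them by RSN for redelivery. The tuple
 * follows (Data, SenderId, ReceiverId, SSN, RSN); the rest is invented. */
typedef struct {
    const char *data;
    int sender, receiver;
    int ssn, rsn;
} log_msg;

static int by_rsn(const void *a, const void *b) {
    return ((const log_msg *)a)->rsn - ((const log_msg *)b)->rsn;
}

/* Copies the messages to replay into out[] (capacity >= n); returns count. */
int collect_replay(const log_msg *log, int n, int failed_proc,
                   int ckpt_rsn, log_msg *out) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (log[i].receiver == failed_proc && log[i].rsn > ckpt_rsn)
            out[k++] = log[i];
    qsort(out, k, sizeof *out, by_rsn);   /* redeliver in receive order */
    return k;
}
```

Sorting by RSN is what lets the new instance re-execute with the same message delivery order the failed instance saw, which is the point of logging the RSN in the first place.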
4.1 The Preprocessor

In this section we present some of the preprocessor macros and explain how the preprocessor works along a simple example. As mentioned earlier, the source code of an application process must be slightly modified in order to allow its relevant data to be checkpointed. These changes include the annotation of some variables and data structures (with a tilde ~), as well as macro calls for initialization (INITCP) and for enabling checkpointing (CPHERE) at any program point within main. Figure 1 shows a simple example of an annotated source code. In this case, variables state and list represent the process state to be checkpointed.

    main(int argc, char **argv)
    {
        int    ~state;
        reg_t *~list;
        INITCP
        while (1) {
            CPHERE
            /* other processing */
        }
    }

Figure 1: Annotated source code

After running the checkpoint preprocessor over this source, the original source code is expanded, as shown in Figure 2. The main extensions are the definition and call of procedures to initialize and wait for requests for executing save_state and load_state, which have also been generated.

    #include "spcp.h"

    int    state;
    reg_t *list;

    void sp_cp_init(void);
    void listen_until(int t);
    void save_state(char *filename);
    void load_state(char *filename);

    main(int argc, char *argv[])
    {
        sp_cp_init();
        while (1) {
            pthread_create(listen_until(T));
            rpc_server_listen(..);
            /* other processing */
        }
    }

    void save_state(char *file)
    {
        FILE *fd;
        fd = fopen(file, "w");
        sv_int(fd, state);
        sv_lregt(fd, list);
        sv_seq_nr(_ssn, _rsn);
        inform_cpd(_ssn, _rsn);
        fclose(fd);
    }

    /* ...definition of the other procedures... */

Figure 2: Expanded source code

Notice that CPHERE has been substituted by a call to DCE's blocking rpc_server_listen()[2] and the starting of a thread that will unblock our program after time T by calling DCE's runtime procedure rpc_mgmt_stop_server_listening. Notice that T must be chosen big enough to allow either save_state or load_state to complete. In the generated procedure save_state, sv_int is a library procedure for saving integer values (in a canonical format) to the file, and sv_lregt is a generated procedure for saving a linked data structure based on reg_t. Procedures sv_seq_nr() and inform_cpd() are standard procedures for saving and informing the values of SSN and RSN to the checkpoint daemon.

Sampa's checkpointing approach thus requires the application programmer to decide which data structures characterize the relevant process state, and to choose the program point(s) where checkpointing should be allowed. For server processes that also export another interface, the programmer need not set CPHERE, since the process can listen to all its interfaces (including the checkpoint interface) simultaneously.

[1] Hence our approach requires the existence of a distributed file system like DCE's DFS.
[2] If local checkpointing should also be done at this point, save_state must be called after this command.

5 Configuration Control

Unlike the other base services, configuration control is the central and unifying service in Sampa. It is coordinated by the supervisor, and is supported by all the agents. Configuration control not only deals with the creation and removal of application processes, but also handles the communication bindings among them. In order to be manageable by Sampa, an application process must include calls to a special runtime library that performs the registration (and updates) of its interfaces and handles at the local agent, which in turn passes them to the supervisor. As with the checkpointing support, the required instrumentation of the application program will be supported by language preprocessing. The supervisor implements a central database holding an image of the current global configuration of every application program being managed. To keep this image consistent with the actual program configuration, the supervisor must be notified of every change in the pattern of process-to-process bindings.

Every application process must register a set of interfaces and a set of handles. An interface consists of a set of functions provided by the process to its environment, and a handle is a potential reference to a remote interface required by the process. Every such interface and handle has a name and a cell-wide unique identification (UUID). Two processes are said to be connected (or bound) when the server's interface (i.e., Interface Id) is assigned to the matching handle on the client's side. Such an assignment is possible only if the interface and the handle have matching UUIDs, but it can be changed arbitrarily often during program execution. A client process that is within an application program and is subject to Sampa's configuration management may connect itself to a server in either of two ways. It may establish the binding directly via DCE's CDS and rpcd, and then register the newly established connection at the supervisor; or it may send a connection request to the supervisor (via the agent) for obtaining the required binding. In either case, the supervisor will be informed about the new binding, and may update its image of the program configuration accordingly.

5.1 The Agent's Role

Essentially, configuration management in Sampa is carried out through agent actions (on behalf of supervisor commands) on the application processes' state and interfaces. Whenever a process starts execution, it first registers all its interfaces and handles and stops, waiting for a signal from the agent. The agent sends the interface and handle declarations to the supervisor, and waits for a process start command that unblocks the process. The supervisor may also send a request to establish, remove, or update a connection, which causes an update of the corresponding handle of the process. When an agent receives such a request from the supervisor, it stores the binding update and sends a special signal to the corresponding process. Only when the application process makes a call to a library routine for querying the validity of a binding or for updating it is the new binding assigned to the corresponding handle. Thus, any program subject to Sampa's management should always check the validity of a binding before using it. Notice that Sampa's configuration management service does not guarantee the consistency of the program's state during a reconfiguration, since reconfigurations may cause message loss.

6 Availability Specification

In this section we give an idea of Sampa's language for specifying the availability policy of distributed programs, using a simple example. This language is a rule-based scripting language similar to Perl[17], with a set of basic operators for comparison and manipulation of strings, lists, and sets, such as == (equality), () (list catenation), in and ?? (set membership and set difference), and others. Besides these, there is a set of built-in primitives for performing basic configuration and monitoring control, which are translated into equivalent supervisor commands to the agents. The table in Figure 3 gives the semantics of some primitives for different purposes, where prog, proc, hand, and itf are place-holders for the name of an executable (i.e., the command line), a process, a handle, and an interface identifier, respectively; and the suffix L always denotes a list of such names/identifiers.

Configuration:
    Create(prog,host): proc      -- creates a new process of prog at host
    Connect(proc,hand,itf)       -- sets handle hand of proc to itf
    SiteOf(proc): host           -- returns the host of proc
    GetItf(proc,itfname): itf    -- interface Id of itfname at proc
    LinkedHandles(itfL): handL   -- all handles bound to any interface in itfL
    UpdHandles(handL,itf)        -- updates all handles in handL with itf

Monitoring:

    Crashed(hostL): hostL        -- crashed host(s) among hostL
    Unreach(hostL): hostL        -- unreachable host(s) among hostL
    Aborted(prog,hostL): procL   -- process crash on any of hostL
    Stopped(prog,hostL): procL   -- process hang-up on any of hostL

Miscellaneous:

    Any(list): elem              -- selects any elem from list
Figure 3: Language primitives

Every rule in this script language is an event-action pair written in the form event >> actions. An event may be the detection of a failure, the satisfaction of a condition involving a boolean expression and script variables, or the passing of a given time period. The latter case is used for specifying periodic control actions. There is also init, a special event that is triggered by an interactive user command to start a distributed program. The language also supports script variables (preceded by %), whose values are lists of names, identifiers, or integers, all of them represented as strings. As in most scripting languages, these variables do not have to be declared before they are used. Some special (environment) variables have predefined values that are set when Sampa is started, such as %AllHosts, which contains the list of all managed hosts. However, most of these variables can be redefined for each application.
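The event >> actions structure can be mimicked in a few lines of C: a rule table pairs an event predicate with an action, and each evaluation cycle fires the actions of all rules whose events hold. This is only an illustration of the rule model, with invented state and rule names, not how the supervisor is implemented:

```c
#include <assert.h>

/* Toy model of the rule language's evaluation cycle. */
typedef struct {
    int primary_alive;
    int failovers;                /* number of recovery actions taken */
} svc_state;

typedef struct {
    int  (*event)(const svc_state *);   /* left-hand side of a rule  */
    void (*action)(svc_state *);        /* right-hand side of a rule */
} rule;

static int ev_primary_down(const svc_state *s) { return !s->primary_alive; }
static void ac_failover(svc_state *s) { s->primary_alive = 1; s->failovers++; }

/* One evaluation cycle over the rule table. */
void run_rules(const rule *rules, int n, svc_state *s) {
    for (int i = 0; i < n; i++)
        if (rules[i].event(s))
            rules[i].action(s);
}
```

In Sampa the predicates would be fed by monitoring events from the agents (e.g., Crashed, Aborted) and the actions would be translated into configuration commands back to the agents.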
7 Example

In this section we give an example of an availability specification for a hypothetical fault-tolerant service PBservice implemented using the primary-backup approach [13]. In this approach, a service is implemented by two servers executing on different hosts, a primary server pri and a backup server bck. These servers run an identical program, but in different roles. In normal operation, only pri serves the client processes, while bck is continuously being updated with pri's local state. This is done by having pri call a function in interface sync_in at server bck for sending its state updates. We assume that even when there is no ongoing client request, pri also periodically sends some alive signal to bck. Thus, if bck does not receive any update or alive message from pri for a certain period of time, it assumes that pri has aborted or crashed, and simply switches its function to the primary role, becoming the new pri. Obviously this alone is not sufficient for providing a transparent transition from one server to the other. Two other tasks have to be done: First, a new bck has to be created, initialized with the current state of the service, and the new pri must be connected to it. Second, all bindings of clients to the former primary server have to be redirected to the new primary server.

In order to accomplish the first task, we assume that during its normal operation, pri periodically checkpoints its state into a service-specific file, say PBfile, which will be loaded at initialization of the new backup. The second task is done by updating the binding handles at all current clients to the interface address of the new pri, and also by updating the CDS entry accordingly with the new interface. As mentioned in section 5, this binding redirection is done cooperatively between the supervisor and the agents. Since the supervisor has complete information about the current bindings, it can identify and locate all clients using the PBservice. It then sends handle update requests to the corresponding agents, which in turn notify the processes of their new bindings.

[Figure 4 (diagram): clients cli_1, cli_2, and cli_3 are bound to pri through interface serv_itf, and pri updates bck through interface sync_in; after the failure, bck becomes the new pri (bck->pri) and a new bck is created.]

Figure 4: Failure of the primary server

Figure 4 shows a scenario of a primary failure. In this case, bck switches to the primary role, a new backup server is created, and all client processes are redirected to the new primary server. Figure 5 shows part of the script with the monitoring and configuration control actions describing the above-mentioned reconfiguration. The first part of the script shows function CrLinkBck, which creates a new backup server (command server -B) at any host provided in list %Hosts, connects its interface to the primary identified by parameter %pri, and returns the identification of the newly created process. Event init is executed only once at service initialization. It creates the servers, publishes pri's interface in a CDS entry, and records that the servers' sites are the critical hosts, which must be monitored with respect to their availability and reachability. The section introduced by +10 specifies that ev[...] to the old primary are obtained by primitive LinkedHandles, and are updated (by UpdHandles) to the interface of the new primary, stored in variable %priItf. Finally, this interface is also published into the appropriate CDS entry, and the list of critical hosts is updated. The availability script would also contain the reconfiguration actions for the backup failure case, which are even simpler and are not shown in Figure 5. Although this language is still being defined, the example should give an idea of how it can be used to describe reconfigurations that ensure continuous availability for some class of fault-tolerant distributed programs.

    PBservice {
        sub CrLinkBck(%pri, %args, %Hosts) {
            local(%pid, %bckItf);
            %pid = Create(("server -B",%args), Any(%Hosts));
            if (%args == "L") load(%pid, %PBfile);
            %bckItf = GetItf(%pid, "sync_in");
            Connect(%pri, "sync_out", %bckItf);
            return %pid
        }

        init >> {
            %pri = Create("server -P", "host1");
            %bck = CrLinkBck(%pri, "", "host2");
            %critHosts = ("host1", "host2");
            %priItf = GetItf(%pri, "serv_itf");
            Publish("PBservice", "CDS-entry", %priItf);
        }

        +10 >> {
            %SWprob = ( Aborted("server", %critHosts),
                        Stopped("server", %critHosts) );
            %HWprob = ( Crashed(%critHosts),
                        Unreach(%critHosts) );
        }

        // primary server or its host crashed
        (%pri in %SWprob) || (SiteOf(%pri) in %HWprob) >> {
            %oldpri = %pri;
            %pri = %bck;
            %bck = CrLinkBck(%pri, "L", %AllHosts - SiteOf(%oldpri));
            %priItf = GetItf(%pri, "serv_itf");
            forall %x in LinkedHandles(GetItf(%oldpri, "serv_itf")) {
                UpdHandles(%x, %priItf);
            }
            Publish("PBservice", "CDS-entry", %priItf);
            %critHosts = (SiteOf(%pri), SiteOf(%bck));
        }

        // backup server or its host crashed
        ...
    }
8 Related Work Several other groups working with monitoring, fault-tolerance, dynamic con guration and distributed application management have had in uence on the design of Sampa's architecture and services. CONIC [11] and its successor languages [10] have shown the advantages of a strict separation between the program algorithms and con guration. With Sampa, we aim at extending this approach for availability policies for fault-tolerant programs. The main dierence to Kramer's approach to handle fault-tolerance[9], is that in Sampa the support for fault-tolerance is based on checkpointing rather than on moving processes to a passive state. Hong and Bauer proposed a reference architecture for general-purpose distributed application management [6], but their main focus is on monitoring, management integration, and system modeling. Becker [2] describes an architecture based on the concept of a fault-tolerance layer that hides fault-tolerance issues from distributed programs. Similar to Sampa's base services and agents, this layer provides special services (e.g., surveillance, checkpointing, atomic broadcast) for services with fault-tolerance requirements. The main dierence is that in his approach, the availability policy is hard-coded in this layer, rather than being the input for a supervising program.
Figure 5: Sampa Script for Primary-Backup Service ery 10 seconds the system must check if the program server is running at both critical hosts, and if these hosts are available. If any failure is detected, the corresponding primitive (e.g., Aborted, Crashed) returns the identi cation of the aborted/stopped process or crashed host, and stores it either in script variables %SWprob or %HWprob. The last section speci es the con guration actions to be taken when the pri is unavailable. First, variable %pri is reassigned with the former backup server's identi cation, and a new backup is created by calling function CrLinkBck. Then, all the binding handles of clients connected
3 Although this process may not exist any more, the binding information is still recorded in the supervisor.
Specific tools for monitoring and controlling distributed applications have also been implemented. One of them is the Megascope [15] of the Pilgrim Project [14]. It is a basic monitoring service for the management of DCE-based applications, designed around a centralized database (panel) for collecting and querying cell-specific monitored data. Huang and Kintala [7] have also implemented a useful set of tools for controlling availability and checkpointing, but their work is not concerned with describing global availability specifications. Another tool is the Meta Toolkit [12], which is based on the ISIS system [3]. It provides means for instrumenting distributed application processes with sensors and actuators, which are used for monitoring and controlling the execution of the application processes from a control layer, where monitoring and configuration are specified in a rule-based language (Lomita). Compared to Meta, our work differs in that it allows for less intrusive and more flexible monitoring, and defines a higher-level language for specifying availability and monitoring directives.
9 Conclusion

The Sampa project started from the belief that distributed applications, in particular those with high availability requirements, need tools that allow for automating at least some of their reconfiguration and recovery actions. Since OSF's DCE is emerging as a de facto standard environment for developing distributed programs, we decided to design such a system for DCE-based programs/services. We have already implemented a prototype of the configuration control service, which includes a configuration database server (supervisor), agents, and a preprocessor that generates the registration procedures for the application processes' interfaces and handles. We are also currently implementing the base services, and will soon have prototypes of the monitoring and checkpointing services. Our decision to implement monitoring as a service performed by user-level processes was driven by the requirement of providing a flexible service, which could be customized for the specific operating systems in use. However, we still have to evaluate the overhead caused by such a service, and its implications for the application program's performance.

For the checkpointing service we have already implemented checkpoint daemons that perform sender-based message logging. Soon we will finish the preprocessor that generates the save-state and load-state procedures for arbitrarily structured and linked data structures written in the C language, and then integrate this part with the remaining checkpointing support. During this implementation we noticed that using only DCE's RPC would lead to poor performance. Therefore, we decided to implement these services using Concert/C [1], which provides both asynchronous message passing and RPC facilities, and which also allows for communication with DCE services and applications. Soon we will also start implementing the group communication service, complete the configuration control service, and integrate the corresponding supervisor-agent protocols. At a later stage, we will finish the definition of the scripting language and implement the supervisor with full functionality. We have not yet tackled many other important issues, such as security management, support for multiple management, and scalability, which we will consider in later steps of the project.
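The sender-based message logging [8] performed by our checkpoint daemons can be illustrated with a minimal in-memory sketch: the sender logs each message at send time and later records the receive sequence number (RSN) acknowledged by the receiver, so that after a receiver crash the messages can be replayed in RSN order. The types and functions below (sender_log_t, log_send, log_ack, replay) are invented for illustration and do not reflect Sampa's actual daemon interface:

```c
#include <assert.h>

#define LOG_SIZE 64

typedef struct {
    int data;   /* message payload (simplified to an int) */
    int rsn;    /* receive sequence number, -1 until acknowledged */
} log_entry_t;

typedef struct {
    log_entry_t log[LOG_SIZE];
    int n;
} sender_log_t;

/* Log a message at send time; its RSN is not yet known. */
int log_send(sender_log_t *l, int data) {
    l->log[l->n].data = data;
    l->log[l->n].rsn  = -1;
    return l->n++;
}

/* Record the RSN returned in the receiver's acknowledgement. */
void log_ack(sender_log_t *l, int idx, int rsn) {
    l->log[idx].rsn = rsn;
}

/* Recovery: return the payload of the message with the given RSN, or -1. */
int replay(const sender_log_t *l, int rsn) {
    for (int i = 0; i < l->n; i++)
        if (l->log[i].rsn == rsn)
            return l->log[i].data;
    return -1;
}
```

The appeal of this scheme, and the reason we adopted it for the checkpoint daemons, is that the log is kept in the sender's volatile memory, so no synchronous stable-storage write is needed on the message path.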
Acknowledgments

We gratefully acknowledge the financial support from CNPq (Integrated Project 522112/94-3 and Project Protem TCPAC) and FAPESP (Project 94/5816-6). The author would also like to thank Anil D'Souza and Sergio R. da Conceição for their contributions to this work, and the anonymous referees for their valuable comments.
About the Author

Markus Endler is currently Assistant Professor at the Computer Science Department of the University of São Paulo. In 1992 he received his Dr. rer. nat. from the Technical University of Berlin. From 1989 through 1994 he worked at the German National Center for Computer Science (GMD) institute at Karlsruhe, where he participated in the ESPRIT Project REX, developing a high-level scripting language for dynamic reconfiguration called Gerel. Since then he has been leading the Sampa Project. His main research interests include distributed operating systems, fault tolerance, distributed application management, and distributed algorithms.

References

[1] J.S. Auerbach et al. Concert/C Tutorial and User Guide: An Introduction to a Language for Distributed C Programming. Technical report, IBM T.J. Watson Research Center, January 1995.
[2] T. Becker. Application-Transparent Fault Tolerance in Distributed Systems. In Proc. 2nd Int. Workshop on Configurable Distributed Systems, pages 36-45, March 1994.
[3] K.P. Birman and R. Cooper. The ISIS Project: Real Experience with a Fault Tolerant Programming System. Operating Systems Review, 25(2):103-107, April 1990.
[4] M. Endler. The Design of Sampa. In Proc. 2nd International Workshop on Services in Distributed and Networked Environments, Whistler, CA, pages 86-92. IEEE, June 1995.
[5] M. Endler and A. D'Souza. Supporting Distributed Application Management in Sampa. Technical Report RT-MAC-9516, IME/USP, November 1995.
[6] J.W. Hong and M.A. Bauer. A Generic Management Framework for Distributed Applications. In Proc. 1st Int. Workshop on System Management, pages 63-71. IEEE, April 1993.
[7] Y. Huang and C. Kintala. Software Fault Tolerance in the Application Layer, chapter 10. John Wiley & Sons, 1995.
[8] D.B. Johnson and W. Zwaenepoel. Sender-Based Message Logging. In Proc. 17th Int. Symposium on Fault-Tolerant Computing, pages 14-19, July 1987.
[9] J. Kramer. Configuration Programming: A Framework for the Development of Distributable Systems. In Proc. IEEE Int. Conf. on Computer Systems and Software Engineering (CompEuro 90), Tel Aviv, Israel, May 1990.
[10] J. Magee, N. Dulay, and J. Kramer. Structuring Parallel and Distributed Programs. In Proc. Int. Workshop on Configurable Distributed Systems, pages 102-117. IEE, March 1992.
[11] J. Magee, J. Kramer, and M. Sloman. Constructing Distributed Systems in Conic. IEEE Transactions on Software Engineering, SE-15(6), June 1989.
[12] K. Marzullo, R. Cooper, M.D. Wood, and K.P. Birman. Tools for Distributed Application Management. IEEE Computer, 24(8):42-51, August 1991.
[13] S. Mullender. Distributed Systems. Addison-Wesley, 1993.
[14] J.D. Narkiewicz, M. Girkar, M. Srivastava, A.S. Gaylord, and M. Rahman. Pilgrim OSF DCE-based Services Architecture. In Proc. International DCE Workshop, LNCS 731, pages 120-134, October 1993.
[15] B. Obrenic, K.S. DiBella, and A.S. Gaylord. DCE Cells under Megascope: Pilgrim Insight into the Resource Status. In Proc. International DCE Workshop, LNCS 731, pages 162-178, October 1993.
[16] J. Shirley, W. Hu, and D. Magin. Guide to Writing DCE Applications. O'Reilly & Associates, 1994.
[17] L. Wall and R.L. Schwartz. Programming Perl. O'Reilly & Associates, 1991.