Reliable Management of Distributed Computations in Nexus

Anand Tripathi, Neeran M. Karnik, Surya P. Koneru, Clifton Nock, Renu Tewari, Vijay Bandi, Khaled Day, Terence Noonan

Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

E-mail contact: [email protected]

Abstract

This paper describes the approach taken to configuration management in the Nexus distributed operating system. This approach uses kernel-level support for monitoring the status of the distributed components of an application, so that periodic user-level messages are no longer required for status monitoring. Group and dependency relationships between such components can be defined by the programmer for the purpose of configuration monitoring and management. An object belonging to a distributed application can be monitored by its host kernel for certain system-defined exception conditions. When any of these conditions arises, other objects are notified through signals or messages, as specified by the programmer.

1 Introduction

Workstation clusters connected by local-area networks offer great potential for high-performance distributed computing by exploiting idle computing power for parallel processing. In this paper we discuss the facilities provided by the Nexus distributed operating system [16] for supporting configuration management of distributed computations. The objective of the Nexus design is to provide a set of simple, network-transparent abstractions that allow the programmer to utilize the distributed computing power and resources of such a cluster as a single monolithic computing facility. Nexus has been implemented on a cluster of Sun workstations. The Nexus approach centers around building distributed applications using object-based programming methods. The programming model is designed to be independent of any specific language.

Dealing with failure conditions is a major issue in distributed systems. Management of a distributed computation requires monitoring the status of its components and initiating suitable recovery mechanisms and reconfiguration protocols under exception conditions. Most existing systems require periodic status messages (also referred to as heartbeat messages) to detect the failure of a distributed computation's components. This approach is highly sensitive to the frequency of such messages: a lower latency requirement implies higher message traffic and a higher computation load to process those messages. Nexus instead exploits kernel-level mechanisms to monitor the objects of a distributed application; a network-wide service then initiates programmer-defined actions when exception conditions are detected. This approach completely eliminates user-level heartbeat message traffic and its latency problems.

The configuration management mechanisms described in this paper have been developed with the following objectives. The mechanisms should be compatible with the Nexus computation model and with UNIX signal management; the primary reason is to be able to exploit the object-based distributed programming capabilities of Nexus, which uses UNIX as its implementation base. There should be a clear separation of mechanisms from any reconfiguration policies: a programmer should be able to implement any desired reconfiguration policy using these mechanisms. Using these mechanisms, it should be possible to monitor an object for exception conditions, or deliver a signal to it, without requiring knowledge of its location. Finally, these mechanisms should be usable both by user-level applications and by the Nexus operating system components.

(UNIX is a registered trademark of AT&T Bell Laboratories.)

The next section presents a brief summary of related work in the field. Section 3 describes the object-based computation model of Nexus. Section 4 presents the user-level view of the configuration management primitives. Section 5 describes the system-level protocols for configuration management.

2 Related Work

The various systems developed in the past for supporting parallel and distributed computing in local-area networks can be grouped into three general categories: parallel programming environments and tools, programming models and languages, and distributed operating systems. Programming environments are generally implemented as user-level software using the available operating system facilities, and for this reason such systems are easily portable. Examples of systems in this category are PVM [18], Express [5], and Isis [2]. In the second category we have distributed programming languages and models such as Argus [8], SR [1], and Conic [9]. In the third category we have distributed operating systems such as Amoeba [6], Mach [11], Cronus [13], Clouds [12], and the V system [4], which are kernel-level implementations.

Parallel programming environments are implemented at the user level using the standard protocols available for message passing. The foremost attractive feature of such systems is their portability to different computers, because they do not require kernel modifications. Most such systems support programming in different languages and execution on heterogeneous platforms. Their disadvantage is that they cannot fully exploit the features of the underlying system for high performance. For example, in an Ethernet-based environment many such systems still perform broadcast communication using point-to-point messages. Stability of such systems can also become a problem when a long computation spans a large number of computers.

Distributed systems often deal with groups of objects or entities: instead of the single resource of conventional systems, a group of servers or resources is fundamental in distributed systems [7]. A process group in the V system is a set of processes, identified by a single identifier, which can be distributed over various hosts in the network. The process group mechanism and multicast communication are used to implement distributed and replicated services. For reliable distributed computing, Isis [2] provides process groups and group programming tools. Groups in Isis are viewed as a set of communication end points. Isis tools provide support for group communication, synchronization using locks, monitoring of group membership and site failures, and triggering of recovery. Express does not provide any general-purpose support for defining groups. It is only recently that PVM has included support for process groups.

Reconfiguration for fault tolerance requires mechanisms for two functions: monitoring of events and exception conditions, and specification of the actions to be taken when such events occur. Most programming environments use periodic heartbeat messages for status monitoring. The group membership service in Isis supports facilities to monitor group membership and site failures and to trigger recoveries. Isis/Meta [10] uses rule-based specifications for configuration management. PVM and Express also support detection of process crashes, but do not support explicit mechanisms for reconfiguration. Both the V system and Mach provide kernel-level support for exception handling, but neither gives the programmer mechanisms for conveniently specifying application-level global policies for status monitoring and configuration management; such policies have to be implemented using kernel-level primitives in each component of the application. The approach adopted in Nexus eliminates user-level heartbeat messages by supporting remote signal monitoring and delivery, and allows global, application-level specification of the actions to be taken when exception conditions arise. As in the Isis/Meta system, these specifications are rule-based.

Load-balancing techniques are important in selecting the host machine for scheduling a process. As part of its system-level configuration management functions, the operating system has to keep track of the current set of nodes in the system and their load status. In the V system, the least loaded machine is used for scheduling a process, thereby distributing the load among the machines in the cluster; the multicast mechanism is also used for sending load information to the distributed scheduling mechanism. PVM and Express adopt simple load-balancing techniques in scheduling processes. Conic allows the programmer to control the mapping of a process onto a specific host node, but provides no support for load balancing. Mach provides an elaborate set of mechanisms for programmer-level control of scheduling policies for the concurrent threads of a task; however, it does not include any user-transparent load-balancing mechanisms.


3 Nexus Computation Model

In Nexus, an object is an abstraction for some data or process. It is viewed as an instance of some abstract type, whose internal state can be accessed or modified only by invoking its interface procedures. Each abstract type is also an object in the Nexus system; we refer to such objects as type-objects. All type-objects are managed by a system-defined object called NexusType. By invoking operations on a type-object one can create or delete its instances; such operations are called class operations. A 64-bit unique identifier (UID) is assigned to every object, which serves as a system-wide unique name for accessing it in the network. In accessing an object through its UID, the object's location is transparent to the client. The Nexus kernel supports request-reply based communication between objects.

A UNIX process is associated with each object; this process implements the functional abstraction of the object. It receives the invocation messages sent to the object and executes the requested interface operations. This process is called the object manager. Internal to the system is the concept of communication ports: invocation messages sent to an object are delivered by the system to the port of its object manager process.

In Nexus, an active object represents a process; it encapsulates some data and the execution context of an activity. A passive object encapsulates data only. For active objects, one object manager process is dedicated to each object, whereas for passive objects one manager process generally manages multiple instances. An object is persistent if its state continues to exist even after the process that created it has terminated.

The Nexus operating system's services are provided by a set of system-defined objects, namely NexusType, the Nexus Process Manager (NPM), the Nexus Name Server, and the Nexus Configuration Manager (NCM), which are always active on some hosts. Each Nexus host executes an instance of the Nexus Process Manager, whose primary function is to activate or deactivate object managers at its host, as requested by NexusType. The Nexus Name Server supports the mapping of programmer-assigned symbolic names for objects to their UIDs. NexusType has two major functions: (1) creation and management of type definitions in the system, and (2) maintenance of the location and active/inactive status of the object managers of the different types in the system. It also interacts with the Nexus Configuration Manager, whose function is to monitor the status of specified objects in the network.

To define a new object type, the interface procedure New of NexusType is invoked; this procedure returns the UID for the newly created type-object. As a parameter to the New call, the programmer specifies an executable file name which contains the code for the object manager. The programmer can also specify the DOMAIN, a set of preferred machines or architectures on which instances of the type are to be created. Tools such as the Nexus RPC compiler [17] and the Nexus Thread package are available to the programmer for building object managers.

In defining a new type-object, the programmer has to specify how the class operations are to be implemented and how the instances are to be managed. There are two choices. One is to use the system-defined default implementation, in which each instance is an active object and the class operations are implemented by NexusType. To create a new instance, NexusType selects a node (based on the scheduling and load-balancing policies) and creates a UNIX process on that node to execute the object manager for the new instance. The second option allows type-objects and their instances to be managed by one or more UNIX processes that execute code defined by the programmer.

4 Primitives for Configuration Management

To introduce the configuration management primitives of Nexus, we shall use a simple application example, the existential worm program described in [14], since it exhibits many of the requirements of a typical distributed application. This application has a user-interface (or manager) process and some n processes called worm segments, possibly running on different machines. The main function of the existential worm is to ensure that all segments are running at all times: if one segment dies, a new one must be created in its place. In a practical application, each segment would perform some useful computation. To implement the existential worm, facilities are needed for dynamically creating instances of the worm segment, for monitoring the user-interface process and the worm segments (possibly on remote machines), and for defining the actions to be taken when exceptions occur. It is also useful to be able to view all the segments as a group, both to simplify the specification of these actions and for multicast communication. These requirements are addressed by the primitives described below for supporting group management and dependency specifications.

4.1 Process Groups

In Nexus, a process group is defined as a set of active objects which can be viewed as a single logical entity for communication and exception handling. The members of a process group can be either active objects or process groups themselves. Any object in the system can send a message to a group; such a message is delivered to all currently active members of the group. The sender can also specify the number of replies expected from the group. The user sends a message to a group using the invoke call and waits for each reply serially. Members can communicate among each other using their logical indices or their object IDs. A process group in Nexus can have objects of different types as its members, and the members of a group can be present at any set of nodes in the network. To keep the implementation simple, no special categories are defined for groups. A group in Nexus is indexed so that its members can be addressed using their logical indices for the purposes of communication, status monitoring, and exception handling. This primarily supports parallel programming, where the computation and communication performed by a process depend on its logical position in the group.

System functions are available to the programmer for creating and deleting process groups. A group is viewed as an object and is assigned a UID, called its group-id. A group object in Nexus is of a special system type called GroupType, to distinguish it from other objects. The object that creates the group is the owner of the group. System functions are also available for adding or removing members of a group. A member can be deleted by specifying either its index in the group or its UID. It is also possible to replace the member at a specified index in the group with another object. A set of system functions supports queries about group configuration and membership information.

In the worm example, the user interface (UI) first defines a new type that represents a worm segment. This is done by executing the library function NexusTypeNew and specifying the file containing the binary code for the worm segment. The UID of this new type is stored in the variable wormSegmentType:

    NexusTypeNew(wormSegmentCodeFile, &wormSegmentType);

The UI then spawns n worm segments in the network by creating instances of this newly created type. This is achieved by invoking the New operation on wormSegmentType using the call function of the Nexus RPC system. This invocation returns the UID of the newly created segment in the variable seg[i]. The user-interface process puts all these worm segments in one group. These steps are shown below:

    CreateGroup(&seg_groupid);
    for (i = 0; i < n; i++) {
        call(wormSegmentType, "New", &seg[i]);
        AddMember(seg_groupid, seg[i]);
    }
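The membership-removal and query operations mentioned above are not exercised by the worm example; the following minimal sketch suggests how they might be used. The names GroupSize and RemoveMember, and their signatures, are our assumptions, since the paper names the operations but not their interfaces:

    /* Sketch with assumed signatures: query the group's size and
     * remove a member.  The text says a member can be removed by
     * logical index or by UID; here we remove by index. */
    int size;
    GroupSize(seg_groupid, &size);            /* membership query */
    if (size > 0)
        RemoveMember(seg_groupid, size - 1);  /* drop the last segment */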

4.2 Exception Dependencies

A dependency is a cause-effect relationship between objects. In Nexus, a user defines a dependency so that, if some exception occurs on a particular active object (called the trigger), then a signal, and possibly a message, is sent to another object, called the target. The target then takes appropriate action based on the context. The trigger-target dependency can be represented as:

    (trigger, exception) ==> (target, signal [, message])

A dependency is registered with the NCM using the AddDependency call and removed using the RemoveDependency call. The exception defined for the trigger object can be (i) any UNIX signal, (ii) process termination, or (iii) any user-defined exception. The programmer can specify a UNIX signal, and any one of a set of pre-defined messages, to be sent to the target when the exception occurs.

A set of group exceptions is also available to the programmer. They include (i) group member addition/deletion, (ii) group member dependency changes, and (iii) messages or signals addressed to the group. The trigger here is the UID of the group to be monitored. The target can be any active object or a group object. In this way all group operations can be monitored as required.

In our example, the worm segments form a circular dependency, making every segment responsible for monitoring one other segment. The dependencies are set up such that, if the ith worm segment terminates, then the (i+1)th segment is notified by sending it the UNIX signal SIGUSR1. In addition, all segments must be killed if the user interface is terminated. This can be done as follows:

    for (i = 0; i < n; i++) {
        AddDependency(seg[i], TERMINATION, seg[(i+1) % n], SIGUSR1);
    }
    AddDependency(user_int, TERMINATION, seg_groupid, SIGKILL);
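Group exceptions can be wired up the same way. The sketch below is our illustration: MEMBER_DELETION is an assumed constant name for the member-deletion group exception described above, and SIGUSR2 is an arbitrary choice of notification signal. It would notify the user interface whenever a segment leaves the group:

    /* Sketch: the trigger is the group's UID, as the text describes. */
    AddDependency(seg_groupid, MEMBER_DELETION, user_int, SIGUSR2);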

4.3 Setting Monitors

For each trigger object that needs to be monitored, the Nexus kernel on that object's host machine has to be informed; only then will the kernel monitor the trigger for the exception and inform the NCM when it occurs. Such monitoring can be activated using one of the following: (i) a SetObjectMonitor call to the NCM, specifying an active object to be monitored, or (ii) a SetProcessMonitor call, specifying a UNIX process to be monitored. An active object can make these calls to enable monitoring of itself for a particular exception, or some other object can make these calls on its behalf. Similarly, two primitives disable monitoring by the kernel: (i) ClearObjectMonitor stops the monitoring of an active object by the Nexus kernel for the specified exception, and (ii) ClearProcessMonitor does the same for a UNIX process. The user interface and each worm segment in our example enable their own monitoring for the termination exception as follows:

    SetProcessMonitor(SELF, TERMINATION);
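Monitoring can also be enabled, and later disabled, on another object's behalf. For instance, the user interface could watch each segment itself; this is a sketch, assuming SetObjectMonitor and ClearObjectMonitor take the same argument order as SetProcessMonitor above:

    /* Sketch: the UI enables termination monitoring for every segment. */
    for (i = 0; i < n; i++)
        SetObjectMonitor(seg[i], TERMINATION);

    /* Later, stop watching segment 0, e.g. before removing it. */
    ClearObjectMonitor(seg[0], TERMINATION);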

4.4 Raising Exceptions and Sending Signals

A provision is made to allow the user to simulate the occurrence of an exception on some trigger object of a dependency. The RaiseException call to the NCM gives the programmer the ability to force the target to perform some action based on an exception, even when the exception has not actually occurred. This is also useful when the exception which activates a dependency is not system-defined, and hence cannot be monitored by the kernel: the trigger, on reaching some condition, can raise the exception using this call and force the target to take appropriate action. For example, if normal termination is to be distinguished from a termination exception, then a dependency can be defined based on a Normal Termination exception, and the process can make a RaiseException call for Normal Termination just before exiting. Similarly, the user can request the NCM to send a signal (using SendSignal) to the target or trigger object of a dependency, thus simulating an exception and forcing the target into its signal-handling routine.

The target object will usually specify the action to be taken on receipt of a signal using the UNIX signal system call. If a message is also sent to the target, it can receive and process the message in its signal-handling routine. For example, each worm segment has a signal handler which is invoked when the process receives a SIGUSR1. This signal implies that the segment's neighbor has terminated and must be re-created: a new instance of wormSegmentType is created and inserted in the group at the appropriate index. Using the Nexus primitives, these steps can be implemented by the ith segment as shown below (the modulus is written (i - 1 + n) % n to avoid a negative index when i is 0):

    call(wormSegmentType, "New", &new_segid);
    AddMemberAtIndex(seg_groupid, (i - 1 + n) % n, new_segid);
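A minimal sketch of how a segment might put these pieces together follows. This is our illustration, not code from the paper: NORMAL_TERMINATION is an assumed constant for the user-defined exception discussed above, RaiseException's signature is assumed, and the handler installation uses the standard UNIX signal call as the text describes.

    #include <signal.h>

    typedef unsigned long long nexus_uid_t;   /* 64-bit Nexus UID */

    int i, n;                                 /* this segment's index and the group size,
                                                 initialized at startup (not shown) */
    nexus_uid_t wormSegmentType, seg_groupid; /* set up as in the earlier examples */

    void neighbor_died(int sig)
    {
        nexus_uid_t new_segid;
        /* SIGUSR1 means our ring neighbor terminated: re-create it. */
        call(wormSegmentType, "New", &new_segid);
        AddMemberAtIndex(seg_groupid, (i - 1 + n) % n, new_segid);
    }

    int main(void)
    {
        signal(SIGUSR1, neighbor_died);        /* standard UNIX signal call */
        SetProcessMonitor(SELF, TERMINATION);  /* enable kernel monitoring */
        /* ... perform the segment's useful computation ... */
        RaiseException(SELF, NORMAL_TERMINATION);  /* mark this exit as intentional */
        return 0;
    }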

5 System Support for Configuration Management

This section describes the Nexus system architecture and its internal protocols supporting configuration management, status monitoring, object location, and dynamic binding.

5.1 System Components

5.1.1 Nexus kernel

The primary functions supported by the Nexus kernel are: request-reply based asynchronous inter-object communication, object location in the network, transport of messages across the network, and support for status monitoring of objects at its node. It also participates in the status monitoring of some subset of nodes in the Nexus environment. The kernel executing at each node is itself viewed as an object, with a UID assigned to it; one can communicate with a remote kernel using its UID.

To invoke an operation on an object, the client executes the invoke call. The important parameters to this call include the UID of the object and the invocation message. Another parameter can be used to indicate that no reply is expected from the callee object, in which case the kernel need not maintain any information for this invocation.

The kernel caches the currently known port-ids of objects that have been recently accessed. Entries are made in these caches whenever the kernel discovers the current port-id of an object from the request or reply message traffic. These caches are used in the object location protocol described in Section 5.2.

The Nexus kernel also maintains a list of the currently active object manager processes at its node, recording the association between their UNIX process-ids, their Nexus UIDs, and their communication port-ids. For each group, the kernel maintains the list of the local active objects belonging to the group; any message sent to a group is delivered by each kernel to its local members' ports.

The signal delivery system of UNIX has been modified to support signal monitoring in a distributed environment. For this purpose, the system call SetProcessMonitor directs the kernel to monitor the delivery of a specified signal to a given process on its node. Whenever the kernel delivers a signal to a process being monitored, it also sends a message to the NCM, informing it of this event.

5.1.2 Nexus Configuration Manager

The Nexus Configuration Manager is a system object that maintains the user-defined dependencies between objects. The NCM is replicated on a subset of the hosts. When the user makes the SetObjectMonitor call, the NCM queries NexusType for the process-id of the trigger object and stores the pid-to-UID mapping in a local table. The NCM then requests the host kernel, using the SetProcessMonitor call, to monitor the process for the given exception. When the Nexus kernel informs the NCM of an exception that occurred for a particular object, the NCM searches its tables for all occurrences of that object as a trigger in a dependency relationship. It then sends the appropriate signal and message (if any) to the corresponding targets. Since the NCM cannot directly signal an object on a remote machine, it sends a message to the remote kernel requesting that the signal be delivered to that object.

Whenever any group operations are performed, the NCM searches the group dependency list to determine what action to take for such group exceptions. If the user defines a member dependency in which the trigger object is ALL MEMBERS, the dependency is expanded for all existing members of the group. Whenever an object is added to a group, the NCM informs the host kernel of that object, since group membership information is required by the kernel to support group communication.

The NCM also maintains a log of the context of the last signal sent to a target object, for a predefined timeout period. The target can query the NCM for the cause of the signal; the context of the signal is the (trigger, exception) pair that activated the dependency. If the NCM detects that a node has crashed, it raises the termination exception for all the objects on that host.

5.2 Object Location Protocol

To send an invocation message to an object, the Nexus kernel needs the port-id of the particular object manager that manages the requested instance. The kernel first looks in its local caches to see if it knows the location of the specified object. If the object has been accessed recently, the corresponding port-id will be available in the cache. If it is not, the Nexus kernel must query NexusType to find it: the kernel makes a GetPort call to NexusType, requesting the list of port-ids at which the specified object may be located. NexusType is the main service responsible for registering the locations of all Nexus objects, and therefore the location of NexusType itself has to be known to the kernels on all machines; this problem is dealt with in Section 5.3.2.

NexusType searches its directories to locate the desired entry. If the required object manager is active, its port-id is returned to the kernel. If the object manager is inactive, NexusType activates it on some workstation in the system. For this, NexusType makes a remote procedure call to the Nexus Process Manager (NPM) on the target machine, where the new object manager is to be executed. This target machine can be chosen using some load-balancing algorithm, for example selecting the least loaded machine in the system; the user is also allowed to specify a set of preferred machines, or a set of machines to be excluded from consideration, during this scheduling. The corresponding NPM then creates a UNIX process to execute the specified binary on its local host, assigns it a port-id, and returns this port-id to NexusType.
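The lookup path can be summarized by the following sketch. This is our pseudocode: the cache_lookup and cache_insert helpers and both typedefs are assumptions, since the paper does not describe the kernel's data structures, and GetPort (the call named above) is simplified here to return a single port-id rather than a list:

    typedef unsigned long long nexus_uid_t;  /* 64-bit Nexus UID */
    typedef int port_t;                      /* assumed port-id type */

    /* Sketch of the kernel's object-location logic described above. */
    port_t locate_object(nexus_uid_t obj)
    {
        port_t port;
        if (cache_lookup(obj, &port))   /* accessed recently? */
            return port;
        /* Cache miss: ask NexusType.  NexusType may first activate
         * the object manager through an NPM before replying. */
        GetPort(obj, &port);
        cache_insert(obj, port);        /* remember for later invocations */
        return port;
    }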


5.3 System Configuration Protocols

It is necessary for Nexus to keep track of the current configuration of nodes and system components, in order to be able to make scheduling decisions. This includes knowledge about which nodes are currently active, where NexusType instances are running, and some metric of the loads on the active machines. The design goal in our approach was to reduce periodic broadcasts or heartbeat messages.

5.3.1 Node Status Monitoring

Our approach to tracking the active nodes in the system is to organize the set of workstations as a logical ring. Each kernel sends its status information to just two other nodes in the system: its neighbors in the ring. At boot time, the kernel informs the NCM of its intention to join the system. Since it has no means of knowing the location of the NCM, it broadcasts a message to the system-defined NCM group. The NCM detects this message, inserts the new node into the logical ring, and broadcasts the new logical configuration of the system to all nodes. It also informs the newly booted kernel of the locations of the NexusType service. From then onwards, the kernel periodically receives heartbeat messages from its neighboring machines. If a neighbor misses a heartbeat, the kernel reports this to the NCM. If both neighbors of a node report that it is no longer sending status messages, the NCM marks that node as down. It then rearranges the ring to exclude the failed node and sends the new ring configuration to all kernels. Each kernel then resumes sending messages to its (potentially new) neighbors, and the new ring is established. When the failed node restarts, it again goes through the boot-time procedure and rejoins the system.

The advantage of our approach is that heartbeat messages are no longer broadcast, but sent to only two nodes; similarly, a node receives heartbeat messages from only two other nodes. Messages to a central entity (the NCM) are necessary only when the system configuration changes (addition or deletion of a node), which is relatively infrequent, so the NCM does not become a bottleneck.
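The per-kernel side of this protocol amounts to a periodic check on two timestamps. A sketch follows; it is our illustration, and the timeout value, the last_heard bookkeeping, and the ReportNodeSuspect call are all assumptions:

    #include <time.h>

    typedef unsigned long long nexus_uid_t;   /* 64-bit Nexus UID */

    #define HEARTBEAT_TIMEOUT 5               /* seconds; assumed value */

    time_t      last_heard[2];    /* arrival time of each neighbor's last heartbeat */
    nexus_uid_t neighbor_uid[2];  /* ring neighbors, set from the NCM's broadcast */

    /* Sketch: run periodically; report a neighbor whose heartbeats have
     * stopped.  The NCM marks a node down only when both of its
     * neighbors report it. */
    void check_neighbors(void)
    {
        time_t now = time(NULL);
        for (int i = 0; i < 2; i++)
            if (now - last_heard[i] > HEARTBEAT_TIMEOUT)
                ReportNodeSuspect(neighbor_uid[i]);
    }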

5.3.2 Location and Monitoring of System Components

Another problem in configuration management in Nexus is that all the kernels need to know the location of at least one instance of NexusType. The NexusType service is replicated on some subset of hosts, in order to provide better availability and load balancing. One possible solution would be for each instance of NexusType to periodically broadcast heartbeat messages to all kernels, thus advertising its location. Instead, to avoid this broadcast traffic, we use our own configuration management facilities, i.e., the NCM. All the instances of NexusType form a process group, which is monitored by the NCM. Whenever an instance of NexusType goes down, this event is detected by the NCM. The other active instances of NexusType are informed, and they can re-create the failed instance on another node if necessary. Whenever an instance of NexusType comes back up, it contacts the NCM and asks to be added to the process group. This new instance then uses a predefined protocol to join the other running instances and synchronize its directories with theirs. In either case, when the state of the NexusType process group changes, one of its members broadcasts the latest configuration to all the kernels, which then update their local caches to reflect the new status. Since this status update is idempotent, it is repeated a few times for greater reliability. Again, broadcast communication is needed only when the state changes, i.e., when an instance of NexusType goes down or comes up.

The NCM service is also replicated on some subset of hosts, so that a single point of failure is avoided. Further, the two or more copies of the NCM monitor one another, and if one of them fails, it can be re-created on some available machine, in a worm-like fashion. This is achieved by putting all copies of the NCM into a group and setting up dependencies so that, whenever any member of the group is terminated, the others are notified.
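Expressed with the primitives of Section 4, this mutual monitoring looks much like the worm itself. In the sketch below, ncm[], m, and ncm_groupid are our illustrative names for the NCM replicas, their count, and their group:

    /* Sketch: NCM replicas form a group and watch one another. */
    CreateGroup(&ncm_groupid);
    for (i = 0; i < m; i++)
        AddMember(ncm_groupid, ncm[i]);

    /* Notify the whole group when any replica terminates; the
     * survivors can then re-create the failed copy. */
    for (i = 0; i < m; i++)
        AddDependency(ncm[i], TERMINATION, ncm_groupid, SIGUSR1);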

6 Discussion

Configuration management in the Nexus system, using the dependency specifications and the group primitives, provides the user a convenient environment for parallel programming on workstation clusters. The kernel-level support for monitoring exception conditions gives better latency in detecting remote status changes than the use of periodic status messages. The overhead of this scheme is some additional code that the kernel has to execute whenever it delivers a signal to any of its local processes; this code checks a table containing the list of local processes that are to be monitored for specified signals. When a trigger object incurs an exception, delivering the resulting signal to the target object takes two message-transmission delays plus some processing time at the NCM. The mechanism adopted in the Nexus system completely eliminates the need for user-level periodic status-monitoring messages, thus reducing message traffic. The configuration management mechanisms available to the application programmer are also used for managing the configuration of the Nexus operating system itself.

References

[1] Gregory Andrews, Ronald Olsson, Michael Coffin, Irving Elshoff, Kelvin Nilsen, Titus Purdin, and Gregg Townsend. An overview of the SR language and implementation. ACM Transactions on Programming Languages and Systems, 10(1):51-86, January 1988.

[2] Kenneth Birman. The Process Group Approach to Reliable Distributed Computing. Technical Report TR-91-1216, Cornell University, July 1991.

[3] Yih-Farn Chen, Atul Prakash, and C. V. Ramamoorthy. The Network Event Manager. In Proceedings of the Computer Networking Symposium, pages 169-178, 1986.

[4] David Cheriton. The V Distributed System. Communications of the ACM, March 1988.

[5] Parasoft Corporation. Express 3.2 Introductory Guide. Parasoft Corporation, 2500 E. Foothill Blvd, Pasadena, CA 91107, 1990.

[6] A. S. Tanenbaum et al. Experiences with the Amoeba Distributed Operating System. Communications of the ACM, 33(12), December 1990.

[7] Luping Liang, Samuel Chanson, and Gerald Neufeld. Process Groups and Group Communications: Classification and Requirements. IEEE Computer, February 1990.

[8] Barbara Liskov. Distributed Programming in Argus. Communications of the ACM, 31(3), March 1988.

[9] Jeff Magee, Jeff Kramer, and Morris Sloman. Constructing Distributed Systems in Conic. IEEE Transactions on Software Engineering, 15(6):663-675, June 1989.

[10] Keith Marzullo, Robert Cooper, Mark Wood, and Kenneth Birman. Tools for Distributed Application Management. IEEE Computer, pages 42-51, August 1991.

[11] M. Young, A. Tevanian, R. Rashid, D. Golub, J. Eppinger, J. Chew, W. Bolosky, D. Black, and R. Baron. The Duality of Memory and Communication in the Implementation of a Multiprocessor Operating System. In Proceedings of the 11th Symposium on Operating System Principles, November 1987.

[12] P. Dasgupta, R. LeBlanc, M. Ahamad, and U. Ramachandran. The Clouds Distributed Operating System. IEEE Computer, pages 34-44, November 1991.

[13] Richard Schantz, Robert Thomas, and Girome Bono. The Architecture of the Cronus Distributed Operating System. In Proceedings of the 6th International Conference on Distributed Computing Systems, pages 250-259, 1986.

[14] J. F. Shoch and J. A. Hupp. The Worm Programs - Early Experience with a Distributed Computation. Communications of the ACM, 25(3), March 1982.

[15] A. S. Tanenbaum and R. van Renesse. Distributed Operating Systems. ACM Computing Surveys, December 1985.

[16] Anand Tripathi. An Overview of the Nexus Distributed Operating System Design. IEEE Transactions on Software Engineering, 15(6), June 1989.

[17] Anand Tripathi and Terence Noonan. Design of a Remote Procedure Call System for Object-Oriented Distributed Programming. Technical Report 92-20, University of Minnesota, Minneapolis, 1992.

[18] V. S. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice & Experience, 2(4), December 1990.

