First Steps in the Implementation of a Fault-Tolerant Tuple Space Machine for Volatile Data

Thomas Liefke, Fachbereich Informatik, Technical University Darmstadt, Germany ([email protected])
Ohad Rodeh, Institute of Computer Science, The Hebrew University of Jerusalem, Israel ([email protected])
Thomas Setz, Fachbereich Informatik, Technical University Darmstadt, Germany ([email protected])

Abstract

Transis [ADKM92,AAD93,ADM+93] is a tool for group communication that provides reliable ordered multicast along with membership services and strong group semantics. Transis can currently be used by processes residing on nodes within a BCD (Broadcast Domain). Building distributed applications on top of these services enables the programmer to assume ordering constraints on message delivery even if participating nodes crash or the BCD is partitioned. The LiPS system [Set97,SL97] is an emerging system for fault-tolerant distributed computing using the shared-memory paradigm. The LiPS system uses a protocol based on the Totem protocol [ADM+93] to ensure fault-tolerant services of the shared memory space. Trips is the first step in the cooperation between the Transis group and the LiPS group. It is used to scale the LiPS system to work in an environment of several BCDs. It uses the Transis functionality for fault-tolerant and consistent gathering of information about the machines in a BCD, which is needed for the operation of the LiPS system. This article first describes how the daemons of both systems are integrated into a single daemon process in order to benefit from a shared address space in the daemon process, resulting in enhanced performance and avoiding the copying of data when sending and receiving. This was supported by the new fastsession layer that has been added to Transis. Second, a new protocol is presented that handles membership change messages in the Trips daemon. Transis reliably reports changes in the membership of connected machines by sending membership change messages to the application. It distinguishes two types of membership change messages: regular and transitional.

1 Introduction and Overview

In the next two subsections, the projects Transis and LiPS are described. Subsection 1.3 then introduces the idea of the Trips project and, in particular, the points addressed in this article.

1.1 Transis Overview

Transis [ADKM92,AAD93,ADM+93] is being developed at The Hebrew University of Jerusalem. It is a tool for group communication that supports a variety of reliable multicast message passing services between processors. It provides highly tuned multicast and control services for scalable systems with arbitrary topology. The communication domain comprises a set of processors that can initiate multicast messages to a chosen subset. Transis delivers them reliably and maintains the membership of connected processors automatically, in the presence of arbitrary communication delays, message losses, and processor failures and joins. Each machine participating in Transis runs a Transis daemon. In addition, the Transis library provides an

API for accessing the Transis services. A full description of all Transis services is beyond the scope of this document; here, a summary of the services supplied by Transis is given. Transis offers the following message ordering services:

1. Atomic multicast: guarantees delivery of the message at all the active sites. This service delivers the message immediately to the upper level.
2. Causal multicast: guarantees that if the sending of m′ causally follows the delivery of m, then each processor delivers m before delivering m′. A message m2 causally follows a message m1 either if m1 and m2 were sent by the same processor and m1 was sent before m2, or if m2 was sent by a processor after it had delivered m1.
3. Agreed multicast: delivers messages in the same order at all sites.
4. Safe multicast: delivers a message after all the active processors have acknowledged its reception.

Configuration changes are reported to processors using Transis through membership messages (view change messages), which are delivered in the regular flow of messages. Membership messages hold the list of IDs of the processors which currently belong to the configuration. The Transis membership algorithm achieves the following properties:

- It handles partitions and merges correctly.
- It allows a regular flow of messages of all the supported types while membership changes are handled.
- It guarantees that members of the same configurations receive the same set of messages between every pair of membership changes.

Two changes made to Transis by the Transis group during the last two years have consequences for Trips. The first change is that Transis now implements Extended Virtual Synchrony as described in [MAMA93,ACDV97], as opposed to the previously implemented Virtual Synchrony. Extended Virtual Synchrony maintains a consistent relationship between the delivery of messages and the delivery of configuration changes across all processes in the system, even during network partitioning and re-merging and with failing and recovering processes. In addition to a regular configuration, it uses the notion of the transitional configuration of processes, which consists of the set of processes in the next regular configuration that have the same preceding regular configuration. As a consequence, Transis delivers two types of view change messages to the application: regular view change messages and transitional view change messages. The second change is related to the API of Transis. In the previous version, all the messages between Transis and the application that uses the Transis services were communicated via sockets. In order to support a design that integrates both the Transis daemon and the application in one single process, an additional layer called fastsession was added to Transis. With this interface, the Transis daemon uses a callback function instead of socket communication to deliver a message to the application, which is implemented in the same process.
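To illustrate the callback-based interface, the following minimal C sketch shows how an application embedded in the same process might register a delivery callback with the fastsession layer and send a message. The function names FS_Init, FS_RegisterCallback and FS_Send are placeholders assumed for this sketch, not the actual fastsession API.

    /* Minimal sketch of a callback-based fastsession interface.
     * FS_Init, FS_RegisterCallback and FS_Send are placeholders;
     * the real fastsession API in Transis may differ. */
    #include <stddef.h>
    #include <stdio.h>

    /* Signature of the delivery callback registered with the fastsession layer. */
    typedef void (*fs_deliver_cb)(const char *group, const void *msg, size_t len);

    /* Hypothetical fastsession entry points (assumed, not the real API). */
    extern int FS_Init(const char *daemon_name);
    extern int FS_RegisterCallback(const char *group, fs_deliver_cb cb);
    extern int FS_Send(const char *group, const void *msg, size_t len);

    /* Delivery callback: Transis calls this directly in the same address
     * space instead of pushing the message through a socket. */
    static void on_deliver(const char *group, const void *msg, size_t len)
    {
        printf("delivered %zu bytes in group %s\n", len, group);
        /* In tripsd this would enqueue the message into the receive queue. */
    }

    int trips_connect(void)
    {
        if (FS_Init("tripsd") < 0)
            return -1;
        if (FS_RegisterCallback("trips-box", on_deliver) < 0)
            return -1;
        return FS_Send("trips-box", "hello", 5);  /* no copy to another process */
    }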

1.2 LiPS Overview

The LiPS system [Set97,SL97] is an emerging system for fault-tolerant distributed computing and was developed at the Technical University of Darmstadt. The LiPS system has been built to exploit parallelism by using the idle time of networked workstations and has been proven to work within an environment of about 250 machines. The system is to be enhanced to work within an environment of more than 1000 machines all over the world within the next few years. LiPS provides the application programmer with the tuple space paradigm of distributed computing, in which the processes of a distributed application communicate through a shared memory that is laid out as a tuple space. The tuple space is replicated on several machines, each using a server process called MessageServer [Set97]. The MessageServers communicate through a total order protocol to ensure consistent and fault-tolerant replication of the tuple space. Each application uses a logically separate application tuple space.
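As a brief illustration of the tuple space paradigm (not the actual LiPS API), the following C sketch shows Linda-style operations in which a master deposits task tuples and workers withdraw them; the operation names ts_out, ts_in and ts_rd are hypothetical placeholders.

    /* Sketch of Linda-style tuple space use as provided by LiPS.
     * ts_out, ts_in and ts_rd are placeholder names; the real LiPS
     * library calls may be named differently. */
    #include <stdio.h>

    extern void ts_out(const char *fmt, ...);  /* deposit a tuple           */
    extern void ts_in(const char *fmt, ...);   /* withdraw a matching tuple */
    extern void ts_rd(const char *fmt, ...);   /* read without withdrawing  */

    /* A master process distributes tasks, workers pick them up. */
    void master(int ntasks)
    {
        for (int i = 0; i < ntasks; i++)
            ts_out("task %d", i);              /* replicated by the MessageServers */
    }

    void worker(void)
    {
        int task;
        ts_in("task %d", &task);               /* blocks until a task tuple exists */
        printf("working on task %d\n", task);
        ts_out("result %d %d", task, task * task);
    }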

[Figure 1. The LiPS Runtime Systems. Labels: LiPS-application runtime system with application master and clients, application tuple spaces 1 and 2 replicated by MsgServer processes (FTTM); LiPS-system runtime system with lipsd daemons, lipsdc, and the system tuple space replicated by FixServer processes (FTTM); legend: FTTM = Fault-Tolerant Tuple Space Machine, tuple space access, UDP communication, processes residing on the same machine.]

On each node participating in the LiPS system, a dedicated server process, called lipsd (LiPS daemon), aggregates node-state information regarding the node on which it executes. A LiPS daemon is responsible for spawning application processes (via fork()) on its node. The daemons also implement mechanisms that "automagically" detect crashed nodes, re-integrate newly started machines into the LiPS system, and restart LiPS daemons on them.

Node-state information is periodically sent to a single tuple space called the runtime system tuple space (system tuple space for short). Like the application tuple space, the system tuple space is also replicated (using the Totem protocol) on several machines, each using a FixServer for this purpose. A node can be in idle state, meaning that LiPS is allowed to run an application process on the node, or in busy state, in which LiPS is not allowed to run any application process on the node. An idle node locally checks for running permissions every few seconds (15 seconds in the current version) and updates the system tuple space. A node in busy state checks running permissions and updates the system tuple space every minute. Figure 1 shows how the application runtime system and the system runtime system cooperate in LiPS. A more elaborate description of the runtime systems can be found in [SL97].
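The periodic check can be pictured with the following C sketch. In the real lipsd this runs off the event library's timer callbacks rather than a blocking loop, and has_running_permission() and update_system_tuple_space() are assumed placeholder helpers.

    /* Sketch of the periodic node-state update described above.
     * has_running_permission() and update_system_tuple_space() are
     * placeholders for the corresponding LiPS internals. */
    #include <unistd.h>
    #include <stdbool.h>

    #define IDLE_CHECK_INTERVAL  15   /* seconds, current version  */
    #define BUSY_CHECK_INTERVAL  60   /* seconds (once per minute) */

    extern bool has_running_permission(void);          /* assumed helper */
    extern void update_system_tuple_space(bool idle);  /* assumed helper */

    void lipsd_state_loop(void)
    {
        for (;;) {
            bool idle = has_running_permission();      /* idle: LiPS may spawn work */
            update_system_tuple_space(idle);
            sleep(idle ? IDLE_CHECK_INTERVAL : BUSY_CHECK_INTERVAL);
        }
    }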

1.3 Trips Overview

Currently, each lipsd communicates directly with one of the FixServers, thereby updating system information. This cannot be scaled to worldwide computing without risking heavy load on intermediate networks, an effect which does not conform with the LiPS idle-time goals. In order to solve this problem, a localized information gathering approach is adopted. Up-to-date information within a BCD (broadcast domain) is accumulated using Transis and sent by a representative LiPS daemon to one of the FixServers. The implementation of this idea is called Trips, and the module within each BCD is called a Trips-box. This way, an entire BCD, represented as a Trips-box, can be treated conceptually as one node. Information sent to the system tuple space need not be at machine level: the representative may send aggregated information regarding the entire BCD. Thus, the system tuple space maintains "high-level" information regarding the broadcast clusters, and more detailed information is maintained at the BCD level. Jobs for execution are scheduled to BCDs rather than nodes. The representative accepts scheduling requests and schedules them according to the information aggregated at its BCD. (At preliminary stages of development, a representative will send complete, node-level information regarding its BCD, so the scheduling ideas raised above may be implemented at later stages of development.) Therefore, the data in the local tuple space of each BCD is regarded as volatile data. Following the lines of fault-tolerant services provided by LiPS, Trips is resilient to machine crashes within a BCD (and in particular to a representative crash), as well as to network partitions. Self-repairing mechanisms are integrated into the Trips-box. Transis is a natural candidate for implementing the Trips-box. Dissemination of runtime information within a BCD is maintained through the Transis reliable multicast service, which utilizes the broadcast medium. Fault detection and recovery are implemented using the Transis membership service. The first prototype that was realized in previous stages of the cooperation comprises two separate processes, the Transis daemon (transisd) and the Trips daemon (tripsd), which communicate via sockets. The drawback of this approach is that each message that is sent between the processes must be copied from one process to the other. To increase efficiency, the processes should be integrated into one single process which integrates the functionality of both Transis and LiPS. This eliminates the overhead of the socket communication between the two processes.


The first approach we took was to use threads: both the pthread package and our own thread implementation based on the standard C library functions setjmp() and longjmp(). Both led to programs that were hard to debug. Therefore, we decided to integrate the two processes at the level of the event library. The next section describes how the integration is done by using the fastsession layer introduced in Transis and by choosing one of the event libraries as the primary one, into which the secondary one is integrated. The translation between the two event libraries is realized in an additional mediator. The section after that describes a new protocol for maintaining consistency of the tuple space. It utilizes the new message type for transitional views that has been added to Transis, as mentioned in subsection 1.1. The article closes by presenting the results and giving an outlook on future work.

2 Integrating the Daemon Processes

This section describes the integration of the functionality of the LiPS daemon and the Transis daemon into the Trips daemon. The integration into one single tripsd process was supported by the fastsession layer that has recently been added to Transis. By calling functions of the fastsession layer, messages can be sent to the transisd directly, as depicted in Figure 2. Messages that are received by the transisd are delivered to the lipsd by calling a previously registered callback function. This function accumulates every message in a receive queue, from which the messages are dequeued on demand. Figure 3 depicts the new structure of the tripsd, which is contained in one process. Besides using the fastsession layer, it was necessary to unify the event libraries of the transisd and lipsd. In the previous version of Trips, the lipsd and the transisd each had their own event library, which caused the processes to work in an event-driven manner. Therefore, each of the daemons had its own main loop in the event library. The event libraries provide functions for registering callback functions which are triggered from the main loop. In a process that integrates both the lipsd and the transisd, there cannot be two main loops executing concurrently. Instead, it was necessary to choose one of them as the primary one and to integrate the secondary one within the primary. This is only possible if every functionality of the secondary event library can be realized by the primary one. The integration of the secondary event library into the primary is done in the following way: every callback A that is registered in the secondary library is additionally registered as A′ in the primary library. A′ is not the original function; instead, it is a special function that dispatches the original callback function A in the secondary library. Furthermore, every callback A that is removed from the secondary library is also removed as A′ from the primary one. Finally, the main loop of the primary library is executed (see the sketch below). A comparison of the two event libraries has shown that every functionality of the Transis event library can be translated to the LiPS event library. The opposite is not the case, because the LiPS event library is capable of handling child callbacks and exit callbacks, which do not exist in the Transis event library and would have to be implemented there as well.
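The following C sketch outlines this wrapper mechanism under simplified, assumed registration interfaces (lips_event_add_socket, lips_event_remove and the callback types are placeholders, not the real library signatures): the mediator registers a dispatcher A′ in the primary (LiPS) library that invokes the original Transis callback A.

    /* Sketch of wrapping a Transis-event-library callback A as A' in the
     * LiPS event library.  The registration functions shown here are
     * simplified stand-ins for the real ones. */
    #include <stdlib.h>

    typedef void (*transis_cb)(int fd, void *arg);  /* callback type, secondary library */
    typedef void (*lips_cb)(void *arg);             /* callback type, primary library   */

    extern int lips_event_add_socket(int fd, lips_cb cb, void *arg);  /* assumed API */
    extern int lips_event_remove(int handle);  /* assumed API; removal uses the handle */

    /* The wrapper context for A': carries enough state to dispatch the original A. */
    struct dispatch_ctx {
        transis_cb original;   /* the callback A registered in the Transis library */
        int        fd;
        void      *arg;
    };

    static void dispatch_wrapper(void *arg)         /* this is A' */
    {
        struct dispatch_ctx *ctx = arg;
        ctx->original(ctx->fd, ctx->arg);           /* invoke the original A */
    }

    /* Called by the mediator whenever the Transis library registers a socket callback. */
    int mediator_register_socket(int fd, transis_cb cb, void *arg)
    {
        struct dispatch_ctx *ctx = malloc(sizeof *ctx);
        if (!ctx)
            return -1;
        ctx->original = cb;
        ctx->fd       = fd;
        ctx->arg      = arg;
        return lips_event_add_socket(fd, dispatch_wrapper, ctx);  /* handle kept for removal */
    }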

[Figure 2. A Trips Box. Labels: Machines 1-3, lipsd, transisd, FS_... (fastsession layer of Transis), representative tripsd, local tuple space, system tuple space, FTTM (Fault-Tolerant Tuple Space Machine); WAN communication, LAN communication, function call, processes residing on the same machine.]

[Figure 3. Structure of the Trips Daemon. Layers: application layer (TRIPS, tuple space), middle layer, multiplex layer (LiPS events, mediator, fastsession, Transis events), Transis layer, network layer (stream sockets, datagram sockets).]


The mediation between these differences is done in an additional mediator between the LiPS event library and the Transis event library, shown in the multiplex layer in Figure 3. It mediates (translates) between the differences of the Transis event library functions and the corresponding LiPS event library functions:

- Socket callbacks that are to be removed are identified in Transis by the file descriptor, while in LiPS they are identified by the related callback handle.
- In Transis, timers are given as absolute microseconds since 1970, whereas in LiPS they are given in seconds and are added relative to the current time.
- Timer callbacks that are to be removed are identified in Transis by the callback function, while in LiPS they are identified by the callback handle.
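A minimal sketch of the timer translation, assuming a placeholder LiPS registration call lips_event_add_timer that takes a relative interval in seconds: the mediator converts the absolute microsecond deadline used by Transis into a relative number of seconds before registering the callback.

    /* Sketch of the timer translation done in the mediator: Transis gives
     * timers as absolute microseconds since 1970, the LiPS event library
     * expects a relative interval in seconds.  lips_event_add_timer is an
     * assumed stand-in for the real LiPS registration call. */
    #include <sys/time.h>

    typedef void (*lips_cb)(void *arg);
    extern int lips_event_add_timer(long rel_seconds, lips_cb cb, void *arg);  /* assumed */

    int mediator_add_timer(long long abs_usec_since_1970, lips_cb cb, void *arg)
    {
        struct timeval now;
        gettimeofday(&now, NULL);

        long long now_usec = (long long)now.tv_sec * 1000000LL + now.tv_usec;
        long long diff     = abs_usec_since_1970 - now_usec;
        if (diff < 0)
            diff = 0;                               /* already due: fire as soon as possible */

        /* Round up so the timer never fires early after the unit conversion. */
        long rel_seconds = (long)((diff + 999999LL) / 1000000LL);
        return lips_event_add_timer(rel_seconds, cb, arg);
    }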

3 Handling View Changes

This section presents the protocol for handling view change messages, which retains tuple space consistency. In contrast to the previously implemented protocol, it uses the newly integrated additional message type, the transitional view change message, as described in subsection 1.1. If the view change is due to a network partition, i.e. the previous Trips-box is partitioned into several Trips-boxes, every Trips-box is allowed to proceed independently of the others, each with an individual tuple space. If the view change is due to members from different Trips-boxes merging, the tuple spaces cannot be merged. Instead, the protocol solves the problem by choosing one process whose tuple space is replicated among the tripsds. This process is called the replica. A tripsd of a merging Trips-box may decide to replace its tuple space by the tuple space of the replica (if it is newly started or if it has not accessed the tuple space yet) or to restart itself, because the tuple spaces cannot be merged. The replica is the process whose tuple space is the same in as many other tripsds as possible, i.e. it is a process of the largest of the merging partitions. This choice minimizes the number of tripsds that need to restart themselves. If the representative of a Trips-box crashes, a new one must be chosen; this is also done by the protocol. The protocol is executed by every tripsd locally. It is triggered when a view change message is received. The protocol is written in a pseudo-code manner; its syntax and semantics should be obvious.

I Variables

ul_my_id_g: holds my Trips-ID (every tripsd within a BCD has a unique Trips-ID)
v_cur_view_m: holds the currently operational view
v_trans_view_m: holds the most recently received transitional view
v_reg_view_m: holds the most recently received regular view
v_intersect_view_m: holds the intersection of all previously received transitional views until the next v_cur_view_m is established
i_state_val: holds the state value of the current process, i.e. |v_intersect_view_m|
ul_replica: holds the Trips-ID of the process that is chosen as the replica, i.e. the process whose local tuple space will be replicated to all other processes during tuple space transfer
i_tspace_accessed_g: TRUE if the process has accessed the current local tuple space by reading or writing, FALSE otherwise

II Initialization

v_cur_view_m = { ul_my_id_g }
v_trans_view_m = {}
i_tspace_accessed_g = FALSE
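For concreteness, the variables and the initialization step can be collected in a C structure as in the sketch below; the view representation (a fixed-size array) and the type names are simplifications assumed for illustration, not the actual Trips data structures.

    /* Sketch of the per-tripsd protocol state, using C equivalents of the
     * pseudo-code variables; the view type and sizes are simplifications. */
    #include <stdbool.h>

    #define MAX_VIEW 64                       /* assumed upper bound on view size */

    typedef struct {
        unsigned long ids[MAX_VIEW];          /* Trips-IDs of the members */
        int           count;
    } view_t;

    struct trips_state {
        unsigned long ul_my_id_g;             /* my unique Trips-ID within the BCD        */
        view_t        v_cur_view_m;           /* currently operational view               */
        view_t        v_trans_view_m;         /* most recently received transitional view */
        view_t        v_reg_view_m;           /* most recently received regular view      */
        view_t        v_intersect_view_m;     /* intersection of transitional views       */
        int           i_state_val;            /* |v_intersect_view_m|                     */
        unsigned long ul_replica;             /* Trips-ID of the chosen replica           */
        bool          i_tspace_accessed_g;    /* tuple space read or written since reset  */
    };

    /* Initialization as in step II. */
    void trips_state_init(struct trips_state *s, unsigned long my_id)
    {
        *s = (struct trips_state){0};         /* empty views, counters at zero */
        s->ul_my_id_g          = my_id;
        s->v_cur_view_m.ids[0] = my_id;       /* v_cur_view_m = { ul_my_id_g } */
        s->v_cur_view_m.count  = 1;
        s->i_tspace_accessed_g = false;       /* tuple space not yet accessed  */
    }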

III View Change Messages

A transitional view change message delivers a transitional view, which is stored in v_trans_view_m. No transitional view change message is delivered before the first regular view change message. A regular view change message delivers a regular view, which is stored in v_reg_view_m. Between a transitional and a regular view change message only regular messages are delivered; they are all handled as usual.
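Continuing the sketch, hypothetical handlers for the two message types would simply record the delivered views in the structure above and, on a regular view, run the protocol of step IV (sketched after the pseudo code):

    /* Hypothetical handlers invoked when Transis delivers view change
     * messages; they only record the views and trigger step IV below. */
    void handle_regular_view_change(struct trips_state *s);  /* sketched after step IV */

    void on_transitional_view(struct trips_state *s, const view_t *v)
    {
        s->v_trans_view_m = *v;            /* most recently received transitional view */
    }

    void on_regular_view(struct trips_state *s, const view_t *v)
    {
        s->v_reg_view_m = *v;              /* most recently received regular view */
        handle_regular_view_change(s);     /* run the protocol of step IV */
    }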

IV Handling View Change Messages

After the regular view change message has been received:

  v_intersect_view_m = v_trans_view_m

START:
  if v_intersect_view_m = {} then goto END                        (I am newly started)
  if v_intersect_view_m = v_reg_view_m then goto END              (only leaving nodes)
  run state transfer protocol:
      i_state_val = |v_intersect_view_m|
      multicast i_state_val
      receive |v_reg_view_m| replies
      ul_replica = process with largest reply (i.e. i_state_val)
  if largest reply (i_state_val) = 0 then goto END                (all are new)
  if ul_replica = ul_my_id_g
      then multicast tuple space, receive (ignore) it, goto END   (I am the replica)
  if ul_replica ∈ v_intersect_view_m and ul_my_id_g ∈ v_intersect_view_m
      then receive (ignore) tuple space, goto END                 (I have the same tuple space as the replica)


  if ul_replica ∉ v_intersect_view_m and ul_my_id_g ∈ v_intersect_view_m
      then
          if ¬i_tspace_accessed_g then receive tuple space, goto END   (I have a different tuple space than the replica but have not accessed it yet)
          if i_tspace_accessed_g then restart                          (I cannot merge my tuple space with the replica's)
  if ul_replica ∉ v_intersect_view_m and ul_my_id_g ∉ v_intersect_view_m
      then receive tuple space                                         (I am new)

Summary of the last three cases:

                                      ul_replica ∈ v_intersect_view_m    ul_replica ∉ v_intersect_view_m
  ul_my_id_g ∈ v_intersect_view_m     same TS / ignore                   different TS / restart or install
  ul_my_id_g ∉ v_intersect_view_m     cannot happen                      I am new / install

END:
  v_cur_view_m = v_reg_view_m
  declare representative (has largest Trips-ID in v_cur_view_m)
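As a final illustration, the decision logic of step IV can be sketched in C, reusing the trips_state and view_t definitions from the earlier sketch. The helper functions declared extern below (view_contains, view_equal, multicast_state_val, collect_replies, send_tuple_space, receive_tuple_space, restart_tripsd) are assumed stand-ins for the corresponding Trips internals, and the handling of nested view changes (see section 5) is omitted.

    /* Sketch of step IV as C control flow, under the same simplifications
     * as the state struct sketched above. */
    #include <stdbool.h>

    extern bool view_contains(const view_t *v, unsigned long id);            /* assumed helper */
    extern bool view_equal(const view_t *a, const view_t *b);                /* assumed helper */
    extern void multicast_state_val(int state_val);                          /* assumed helper */
    extern unsigned long collect_replies(int nreplies, int *largest_reply);  /* assumed helper */
    extern void send_tuple_space(void);                                      /* assumed helper */
    extern void receive_tuple_space(bool install);                           /* assumed helper */
    extern void restart_tripsd(void);                                        /* assumed helper */

    void handle_regular_view_change(struct trips_state *s)
    {
        s->v_intersect_view_m = s->v_trans_view_m;

        if (s->v_intersect_view_m.count == 0)                     /* I am newly started */
            goto end;
        if (view_equal(&s->v_intersect_view_m, &s->v_reg_view_m)) /* only leaving nodes */
            goto end;

        /* State transfer protocol: every member multicasts |v_intersect_view_m|. */
        int largest = 0;
        s->i_state_val = s->v_intersect_view_m.count;
        multicast_state_val(s->i_state_val);
        s->ul_replica = collect_replies(s->v_reg_view_m.count, &largest);

        if (largest == 0)                                         /* all are new */
            goto end;
        if (s->ul_replica == s->ul_my_id_g) {                     /* I am the replica */
            send_tuple_space();
            receive_tuple_space(false);                           /* ignore own copy */
            goto end;
        }

        bool replica_in = view_contains(&s->v_intersect_view_m, s->ul_replica);
        bool me_in      = view_contains(&s->v_intersect_view_m, s->ul_my_id_g);

        if (replica_in && me_in)
            receive_tuple_space(false);                           /* same tuple space  */
        else if (!replica_in && me_in && !s->i_tspace_accessed_g)
            receive_tuple_space(true);                            /* install replica's */
        else if (!replica_in && me_in)
            restart_tripsd();                                     /* cannot merge      */
        else
            receive_tuple_space(true);                            /* I am new: install */

    end:
        s->v_cur_view_m = s->v_reg_view_m;
        /* The member with the largest Trips-ID in v_cur_view_m becomes representative. */
    }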

4 Current State of the Project

The current tripsd prototype is implemented as a single process using the shared memory space in the daemon, thereby resulting in better performance. It uses the LiPS event library as the primary library, and it makes use of the newly added fastsession layer. In different test scenarios, several tripsds are able to run and to communicate correctly with each other by exchanging data of the local (Trips-box) tuple space using the different tuple space operations. Due to the new protocol for handling view changes, this holds even if the operations are interrupted by several consecutive view changes, e.g. when the processes are (manually) killed and restarted. During the implementation of the prototype, regression tests are implemented using the LiPS test environment [SL98], although it is hard to design regression tests in a distributed environment.

5 Future Work

While a view change message is being handled, it is possible for another view change to occur due to other processes joining or leaving the group. In order to handle such nested view changes, the protocol described here needs to be extended.


References

[AAD93] Amir Y., Amir O., and Dolev D. A Highly Available Application in the Transis Environment. In Proceedings of the Hardware and Software Architectures for Fault Tolerance, LNCS 774, June 1993.
[ACDV97] Amir Y., Chockler G. V., Dolev D., and Vitenberg R. Efficient State Transfer in Partitionable Environments. In Proceedings of the European Research Seminar in Advanced Distributed Systems (ERSADS'97), Zinal (Valais, Switzerland), March 1997.
[ADKM92] Amir Y., Dolev D., Kramer S., and Malki D. Transis: A Communication Sub-System for High Availability. In Annual International Symposium on Principles of Distributed Computing, July 1992.
[ADM+93] Amir Y., Dolev P., Melliar-Smith P., Agarwal D., and Ciarfella P. Fast Message Ordering and Membership using a Logical Token-Passing Ring. In 13th International Conference on Distributed Computing Systems (ICDCS), pages 551-560, Pittsburgh, May 1993.
[MAMA93] Moser L. E., Amir Y., Melliar-Smith P. M., and Agarwal D. A. Extended Virtual Synchrony. Technical Report ECE#93-22, Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA, 1993.
[Set97] Setz T. Design, Implementation and Performance of a Fault Tolerant Tuple Space Machine. In Proceedings of ICPADS'97: 1997 International Conference on Parallel and Distributed Systems, Seoul, Korea, December 10-13, 1997. IEEE, December 1997.
[SL97] Setz T. and Liefke T. The LiPS Runtime Systems based on Fault-Tolerant Tuple Space Machines. In Proceedings of the Workshop on Runtime Systems for Parallel Programming (RTSPP), 11th International Parallel Processing Symposium (IPPS'97), Geneva, Switzerland, April 1997. Appeared as Technical Report IR-417, Vrije Universiteit Amsterdam, Faculteit der Wiskunde en Informatica, February 1997.
[SL98] Setz T. and Lippmann J. An Integrated Test-Environment for in CWEB written C-Programs. In 9th International Symposium on Software Reliability Engineering, November 4-7, 1998, Paderborn, Germany. IEEE Computer Society Press, 1998.