Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

A Fault-Tolerant Architecture Based on Autonomous Replicated Objects

Toshibumi Seki, Tetsuo Hasegawa, Yasukuni Okataku, Shinsuke Tamura
Systems & Software Engineering Laboratory, R&D Center, TOSHIBA Corporation
70, Yanagi-cho, Saiwai-ku, Kawasaki, 210, Japan

Abstract

This paper proposes an architecture for replicating program modules so that they behave in accordance with their own local knowledge. Their behavior is not influenced by their location, replication degree or fault-tolerant mechanism, nor by system-level modules. In the proposed architecture, program modules are implemented as objects. Communication among them is carried out by a total ordering broadcast protocol, which enables individual objects to behave autonomously. Individual objects can therefore choose the most suitable replication degree and fault-tolerant mechanism for their own required reliability and execution efficiency, without changing programs and without affecting object location.

1 Introduction
Due to technical advances, user requirements are becoming increasingly diverse. Computer systems are now expected to offer not only higher functional performance but also expansibility, reliability, maintainability and adaptability. Declarative distributed systems can respond to these requirements. They consist simply of a set of system elements whose abilities are declared independently of one another, and they have no centralized mechanism. Individual system elements find their own roles adaptively and dynamically through negotiation in order to accomplish given jobs. The location and replication independence of individual elements is the basis of this kind of declarative system. Namely, each element acts autonomously without knowing the location and replication degree of its related elements.

This paper proposes a fault-tolerant technique based on element replication. To establish a fault-tolerant declarative distributed system, each element is expressed as an object based on an object model and exchanges messages through broadcast communication.

Mechanisms have been proposed for constructing highly reliable systems in which software modules, instead of hardware modules, are used as replication units. Either an active replication mechanism [4-7] or a passive replication mechanism [8-11] is chosen according to the system's requirements and conditions, i.e., response time and available resources. In an active replication mechanism, individual objects carry out the same operation in parallel according to input messages. In passive replication mechanisms, on the other hand, usually only one replica is active and the other replicas are in a stand-by state (passive mode). When an active object fails, one of the passive objects is activated and takes over the operation of the failed active object.

In declarative systems, individual objects also should not be aware of the fault-tolerant mechanism employed by their own replicas or by related replicated objects. Replication mechanisms should therefore be independent, so that individual objects can change their replication mechanisms dynamically according to system requirements without influencing other objects. Fault-tolerance transparency from the user's point of view has already been proposed. In those proposals, however, the location, replication degree and fault-tolerant mechanism of individual objects are managed at the system level, for example by a Group Manager [12]. Each object's behavior is thus controlled by a manager holding global knowledge, and whenever an object's status changes, the object has to report the change to the system-level managers.

This paper proposes a fault-tolerant technique that allows individual objects to operate using only their local knowledge, e.g., their execution mode (active or passive). No information has to be exchanged even when an object's location, replication degree or fault-tolerant mechanism changes.

Section 2 briefly describes the distributed system architecture and the environment assumed in this paper. Section 3 explains fault tolerance based on replicated objects in a declarative system. Section 4 proposes a passive replication mechanism focusing on checkpoint transparency, which can be used simultaneously with a previously proposed active replication mechanism [3]. Section 5 introduces a new consensus protocol that performs exactly-once operation of replicated objects interacting with the external environment through real actions, e.g., I/O actions. Section 6 compares the proposed fault-tolerant technique with related works, and Section 7 shows an application example of the proposed technique.

1060-3425/95 $4.00 © 1995 IEEE
Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS'95) 1060-3425/95 $10.00 © 1995 IEEE

2 Broadcast Based Distributed System Model
This section describes the Intellectual Distributed Processing System (IDPS) [1], developed at Toshiba, as the execution environment for the proposed mechanism. IDPS is an object-oriented distributed system based on broadcast communication, and it is the basic architecture used to realize the declarative distributed system.

2.1 IDPS Architecture
Figure 1 shows the IDPS configuration. The IDPS operating system (IDPS-OS) is a broadcast based object-oriented operating system [2]. Individual system elements are described as objects based on an object-oriented model. They exchange information through asynchronous broadcast communication, which realizes a declarative system with location and replication independence. Each object is replicated on distinct sites so that computing services remain highly available in spite of failures. In Fig. 1, replicated objects are identified by the same pattern. These objects can behave independently of their location and replication degree. Objects can also be created at any site and can migrate to any site according to the system state.

Figure 1: Configuration of the IDPS

2.2 System Environment
The proposed replication mechanism works under the following system environment.

EN1) Each object behaves deterministically. Namely, replicated objects will produce the same results if they receive the same messages in the same order. This requires the programmer to construct objects as state machines.
EN2) All sites that are alive and connected to the LANs receive the same messages in the same order. Namely, replicated objects can receive the same messages in the same order. A total ordering reliable broadcast protocol [14-17], which has also been realized on IDPS [3], guarantees this environment.
EN3) All objects that carry out operations correctly will never send out erroneous messages. Namely, fail-stop objects [19] are assumed: a failed passive object will not take over the corresponding active object's operation, and a failed active object will not continue erroneous operation.

Under EN2) and EN3), every live replica can receive the same correct messages in the same order. The replicas therefore reach the same state simply by committing the first-come message among those sent by replicated objects. IDPS has an efficient message commitment mechanism [3] in which each object commits a message when it receives a pre-defined number (PDN) of messages with the same message identifier (ID). The message ID identifies the same message sent from different replicated objects. It is issued asynchronously, in a distributed manner, using only the local information in each object. When only active replication is required, the IDPS message commitment mechanism tolerates fail-uncontrolled behavior of individual objects. EN3) is necessary to support passive replication.

2.3 Access Types to the External Environment
There are two techniques for accessing the external environment.
Message Type Access: The receiving site is responsible for realizing exactly-once operation. Even if messages with the same content are transmitted from several replicated objects, each receiving object selects one message, which achieves exactly-once operation.
I/O Type Access: The transmitting site is responsible for establishing exactly-once operation. Because I/O operations take place in the external environment, repeated requests for an action by the copies of a replicated object must be prohibited so that the same operation is not performed more than once. Prohibition must take place at the transmitting object, because the receiver has no means of recognizing duplicated requests. For example, in valve control, if the same operation were repeated due to duplicated requests, the valve would be rotated by twice the desired angle.

The access technique is divided into these two types to reduce extra messages and extra synchronous operations. Message Type Access is more efficient than I/O Type Access, because I/O Type Access requires consensus taking at every I/O request. In Message Type Access, by contrast, individual objects can broadcast messages without knowing the destination objects, and they receive exactly one message selected by the IDPS message commitment mechanism. An active object therefore only has to process that single message and broadcast the result messages. A passive object queues the received messages. When it is activated to take over the operation of the failed active object, it is permitted to re-transmit the same messages. The fault-tolerant mechanism for this kind of object is described in detail in Section 4. The consensus protocol that performs exactly-once operation for I/O Type Access is introduced in Section 5.
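The PDN-based commitment described in Section 2.2 can be illustrated with a minimal Python sketch. The sketch makes three assumptions. First, a message ID is a pair of the sending replica group's name and a per-object sequence counter, which deterministic replicas (EN1) receiving the same ordered input (EN2) advance identically. Second, copies are fed to the receiver in total broadcast order. Third, a message is committed only when PDN copies also carry identical content. `CommitBuffer` and `Sender` are hypothetical names, not the IDPS interface.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Optional, Set, Tuple

MessageId = Tuple[str, int]  # hypothetical: (sender group name, local sequence number)


class Sender:
    """Issues message IDs from local information only (hypothetical)."""

    def __init__(self, group: str) -> None:
        self.group = group
        self.seq = 0

    def next_id(self) -> MessageId:
        # Deterministic replicas that process the same inputs in the same order
        # call this the same number of times, so their n-th outputs share an ID.
        self.seq += 1
        return (self.group, self.seq)


class CommitBuffer:
    """Commits a message once `pdn` identical copies with the same ID arrive."""

    def __init__(self, pdn: int) -> None:
        self.pdn = pdn
        # message ID -> content -> number of identical copies received so far
        self.pending: DefaultDict[MessageId, DefaultDict[Hashable, int]] = defaultdict(
            lambda: defaultdict(int)
        )
        self.committed: Set[MessageId] = set()

    def receive(self, msg_id: MessageId, content: Hashable) -> Optional[Hashable]:
        """Feed one copy in total broadcast order; return the content on commit."""
        if msg_id in self.committed:
            return None  # late copy from another replica: already handled once
        counts = self.pending[msg_id]
        counts[content] += 1
        if counts[content] >= self.pdn:
            self.committed.add(msg_id)
            del self.pending[msg_id]
            return content
        return None


# Two active replicas of object A emit the same result with the same ID.
a1, a2 = Sender("A"), Sender("A")
b = CommitBuffer(pdn=1)
print(b.receive(a1.next_id(), "open valve"))  # prints: open valve
print(b.receive(a2.next_id(), "open valve"))  # prints: None (duplicate of ("A", 1))
```

With `pdn=1` a receiver acts on the first-come copy, as in Message Type Access. A larger PDN additionally requires identical content from several active replicas before the message is committed.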

3 Fault-Tolerance Based on Replicated Objects

3.1 Replication Mechanism for Autonomous Objects
First, we define a replica group as the set of objects that have the same function. For example, all copies of an object form one replica group. Mixed-replication is defined as a replica group consisting of two or more active objects and at least one passive object. It achieves both non-stop operability and resource availability within the replica group. Autonomous objects should behave only in accordance with their local state, e.g., their execution mode (active or passive). Therefore, when all objects in a replica group are declared active, the replica group is autonomously controlled by the active replication mechanism. When only one object is active and the others are in passive mode, the replica group is controlled by the passive replication mechanism. When two or more objects are declared active and at least one object is declared passive, mixed-replication is possible.

Figure 2 shows an example of the proposed replication mechanism for autonomous objects. In Fig. 2, objects A, B, C and D are triplicated, duplicated, triplicated and duplicated, respectively. The Object-A group, consisting of two active objects and one passive object, uses mixed-replication: A1 and A2 are controlled by the active replication mechanism, and A3 is a passive object. The Object-B group is controlled by the active replication mechanism, and the Object-C and Object-D groups are controlled by the passive replication mechanism. The active objects A1 and A2 autonomously broadcast to Bi (i=1,2). The passive object A3 takes over the operation when active object A1 or A2 fails. Because both B1 and B2 are active objects, each receives one message with the same content by selecting among the messages sent from A1 and A2, and each broadcasts to Ci (i=1,2,3). In the same way, active object C1 receives the messages from B1 and B2 and broadcasts to Di (i=1,2).

As described above, the replication technique of each object is decided according to its required reliability and responsiveness. No other object and no centralized manager holding global information influences this decision. An object can continue its operation by changing its execution mode through cooperation among related objects. For example, suppose the pre-defined number (PDN) of identical messages that the IDPS commitment mechanism requires before Bi (i=1,2) accepts the messages sent from Ai (i=1,2) is two. If the content of the message sent from A1 differs from that of the message sent from A2, the IDPS message commitment mechanism of B1 and B2 cannot accept either message. The mechanism then autonomously broadcasts a mode change request message to Ai (i=1,2,3) to increase the number of active objects. A3 is activated and broadcasts its result message to Bi (i=1,2). Bi (i=1,2) therefore receives the correct message by majority decision.

Figure 2: An Example of the Replication for Autonomous Objects (legend: Active Object, Passive Object)
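The recovery path in the Fig. 2 example can be sketched as follows. The sketch assumes a receiver that counts identical contents among the copies from the currently active senders. When no content reaches PDN, the receiver asks the sender group to activate one more replica. The callback `activate_passive` stands in for broadcasting the mode change request and receiving the newly activated replica's result. Both it and `commit_with_escalation` are hypothetical names used only for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional


def commit_with_escalation(
    active_outputs: List[str],
    pdn: int,
    activate_passive: Callable[[], Optional[str]],
) -> Optional[str]:
    """Commit a content once `pdn` identical copies exist, activating
    passive replicas one at a time while no content has enough votes."""
    outputs = list(active_outputs)
    while True:
        for content, votes in Counter(outputs).items():
            if votes >= pdn:
                return content  # majority-style decision among replicas
        extra = activate_passive()  # models the mode change request to Ai
        if extra is None:
            return None  # no passive replica left: the message cannot be accepted
        outputs.append(extra)


# Fig. 2 scenario with PDN = 2: A1 and A2 disagree, so A3 is activated.
standby_results = ["42"]  # result A3 will broadcast once activated
result = commit_with_escalation(
    ["42", "41"],  # outputs of A1 and A2
    pdn=2,
    activate_passive=lambda: standby_results.pop() if standby_results else None,
)
print(result)  # prints: 42
```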

3.2 Requirements for the Replication Mechanism
This Subsection lists the requirements for realizing a fault-tolerant declarative system based on autonomous objects.


RQ1) Transparency for fault tolerance: This characteristic allows each object to change its fault-tolerant mechanism dynamically. If a checkpoint function is written explicitly in a program, the object's replica group cannot be used under the active replication mechanism.
RQ2) Independence of an object's location, replication degree and execution mode: This characteristic makes it possible to balance load among computers and to change the replication degree and execution mode according to the required reliability. Broadcast communication provides location, replication and execution mode transparency.
RQ3) Consistency within a replica group: Each replica group's determinism and consistency should not be endangered even if its location, replication degree or execution mode changes. Internal state consistency within a replica group must therefore be guaranteed at all times.

When these requirements are fully achieved, fault tolerance for autonomous objects can be realized. A broadcast based active replication mechanism that maintains the location and replication independence of objects has already been realized [3]. In that mechanism, the replicas are given the same initial state and are invoked by totally ordered broadcast messages. Provided that the replicas operate deterministically, consistency within a replica group is therefore guaranteed. The application program is, of course, not aware of the active replication mechanism, so the active replication mechanism satisfies all three requirements above.

The passive replication mechanism needs additional support. To satisfy RQ1), it requires a user-transparent checkpointing mechanism that hides the checkpoint timing and the checkpoint procedure from application programs. If active objects had a different name from their corresponding passive objects, RQ2) would not be satisfied, because an object would have to recognize the execution mode of related objects. In IDPS, however, RQ2) is satisfied because an active object and its corresponding passive objects have the same name. Because objects exchange messages through broadcast communication, they can send messages to other objects without knowing their location, replication degree or execution mode, and each message is committed in both active and passive objects. To satisfy RQ3), a checkpointing mechanism must maintain consistency within a replica group even in the presence of a failure, and it should not need to know the number of active and passive objects in the group. As stated earlier, the active replication mechanism can easily be used in the proposed replication setting, but the passive replication mechanism requires further support. The following Section introduces a passive replication mechanism that achieves user-transparent checkpointing and enables mixed-replication within a replica group.

4 Broadcast Based Passive Replication Mechanism
This Section proposes a new passive replication mechanism. Conventional checkpointing mechanisms [8,9,11] are active-oriented: the active object detects its own failures and controls the checkpoint timing. The proposed mechanism is instead passive-oriented: passive objects detect the corresponding active object's failures and control the checkpoint timing. This realizes user-transparent checkpointing and prevents the message logging buffer from overflowing. It also removes the cumbersome checkpointing operation from the active object and enables quick recovery. Two mechanisms are needed to take over correct operation when the active object fails. The checkpointing mechanism decides the checkpoint timing and copies the internal state information of an active object to its corresponding passive objects. The failure recovery mechanism selects one new active object from among the passive objects. The new active object carries out a rollforward operation from the latest checkpoint state to take over the operation of the failed active object.

4.1 Checkpointing Mechanism
This Subsection describes the proposed techniques for realizing user-transparent checkpointing and mixed-replication while preserving the object's location and replication transparency.

4.1.1 Checkpoint Timing
Checkpoint timing is made transparent, and copy operations are reduced despite finite-length message logging buffers, by taking a checkpoint whenever one of the following conditions is satisfied.
CO1) The total volume of messages received by a passive object reaches a predetermined constant (CT).
CO2) CO1) has not been satisfied within a certain period of time (Texp).

Condition CO1) fixes the message logging buffer of each passive object to a constant size. It also removes unnecessary copy operations, such as copying an object's unchanged state, which occurs in cyclic checkpointing. The constant CT is determined according to the message exchange rate. With condition CO1) alone, however, a long calculation with a low message exchange rate may go without a checkpoint for a long period. The rollforward operation after an active object's failure could then take a long time. A cyclic checkpointing mechanism, i.e., CO2), is therefore also used to bound the maximum rollforward time.

When CO2) triggers a checkpoint, the internal state of the active object may not have changed since the latest checkpoint. To avoid unnecessary copying, the internal state is copied only when it has changed. A change can be detected by checking the object's message sending and receiving counters in the IDPS-OS kernel. If the counter values differ from the previous ones, the internal state is considered to have changed since the latest checkpoint. Even when the values are the same, a copy is necessary if a method in the object is busy, because a long calculation in that method may alter the internal state. This check still causes unnecessary copies in two cases: when a busy object is merely waiting a long time for a message, and when all input messages are read-only requests that do not modify the object's state. Copying only modified pages resolves this problem. When the internal state is smaller than a page, however, the unnecessary copy operations cannot be removed.

With these conditions, the checkpointing frequency is proportional to the communication rate. Because the passive object enforces conditions CO1) and CO2), most unnecessary internal state copies are removed and the message logging buffer size stays constant.
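The two checkpoint conditions and the change check can be sketched as below. The sketch assumes that message volume is measured in bytes and that the CO2 watchdog is reset whenever a state or "I'm alive" message arrives. It also assumes that the kernel's send and receive counters, plus a busy flag, are visible to the State Transmission Method. `PassiveTrigger`, `ActiveState` and their fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PassiveTrigger:
    """Passive-side enforcement of CO1 and CO2 (hypothetical sketch)."""
    ct: int                  # CO1 threshold on logged message volume (bytes, assumed)
    t_exp: float             # CO2 watchdog period Texp (seconds, assumed)
    logged_bytes: int = 0
    last_reset: float = 0.0  # last time a state or "I'm alive" message arrived

    def on_message(self, size: int) -> Optional[str]:
        # Message Queuing Method: log the message, then check CO1.
        self.logged_bytes += size
        return "CO1" if self.logged_bytes >= self.ct else None

    def on_tick(self, now: float) -> Optional[str]:
        # State Reception Method watchdog: CO2 fires after Texp without a reply.
        return "CO2" if now - self.last_reset >= self.t_exp else None

    def on_state_or_alive(self, now: float, remaining_bytes: int) -> None:
        self.last_reset = now
        self.logged_bytes = remaining_bytes  # volume left after purging


@dataclass
class ActiveState:
    """Active-side check of whether a state copy is needed (hypothetical)."""
    sent: int = 0        # message sending counter kept by the kernel
    received: int = 0    # message receiving counter kept by the kernel
    busy: bool = False   # True while a user method is still running
    copied_at: Tuple[int, int] = (0, 0)  # counters at the latest checkpoint

    def needs_copy(self, reason: str) -> bool:
        if reason == "CO1":
            return True  # CO1 requests are served unconditionally
        changed = (self.sent, self.received) != self.copied_at
        return changed or self.busy  # otherwise only "I'm alive" is sent

    def mark_copied(self) -> None:
        self.copied_at = (self.sent, self.received)
```

Both trigger checks run in the passive object, so the active object only answers requests. This mirrors the passive-oriented design. The byte-based volume measure and the timer representation are illustrative choices.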

4.1.2 Checkpointing Procedure
The application programmer does not need to know the checkpointing procedure, because it is installed automatically into each object as a set of system methods. Figure 3 shows the proposed checkpointing mechanism. Each object requires four additional methods to support the checkpointing procedure:
- The State Transmission Method sends a modified internal state from an active object.
- The State Reception Method receives an internal state from an active object.
- The Message Queuing Method receives the messages sent to its corresponding active object and queues them in its message logging buffer.
- The Mode Management Method is used in failure recovery and is explained in further detail in the next Subsection.

Figure 3: Checkpointing Mechanism

Condition CO1) is enforced by the Message Queuing Method in each passive object. Whenever the method receives a message, it checks the total volume of messages stored in its message logging buffer. When the passive object detects that condition CO1) is satisfied, it broadcasts a state copy request message for checkpointing to the State Transmission Method of the corresponding active object. Condition CO2) is detected when the State Reception Method of the passive object receives no message within the pre-defined period of time (Texp). The method then broadcasts a checkpointing request message to the State Transmission Method of the corresponding active object.

The State Transmission Method broadcasts the internal state of the corresponding object. When it is invoked by condition CO1), it broadcasts the state information unconditionally. When it is invoked by condition CO2), it first checks whether a copy is necessary, because the internal state may not have changed. Only when the internal state has changed does it broadcast the state information, such as the program counter, the data and stack areas and the latest received message ID, to the State Reception Method of the corresponding passive object.

The State Reception Method of a passive object receives the internal state of the corresponding active object and replaces its own internal state with the received one. It also resets the watchdog timer for Texp. The latest received message ID is used to purge unnecessary messages from the message logging buffer: messages with IDs issued before the latest received message ID are cleared, and the remaining messages are kept. This maintains consistency between an active object and its passive objects. Because these checkpointing methods run at a higher execution priority than user methods, user methods cannot change the internal state during checkpointing. If an active object's internal state has not changed, the State Transmission Method only broadcasts an "I'm alive" message, and the State Reception Method only resets the watchdog timer for Texp.
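The passive side of this procedure, which logs messages and purges them when a state copy arrives, is sketched below. The sketch assumes integer message IDs that increase in total broadcast order. It also treats the message carrying the latest received ID as already reflected in the copied state, so that message is purged as well. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class PassiveLog:
    """Hypothetical passive-replica side of the checkpointing procedure."""
    state: Any = None
    log: List[Tuple[int, Any]] = field(default_factory=list)

    def queue(self, msg_id: int, msg: Any) -> None:
        # Message Queuing Method: keep every message sent to the active object.
        self.log.append((msg_id, msg))

    def receive_state(self, state: Any, last_msg_id: int) -> None:
        # State Reception Method: adopt the active object's checkpointed state...
        self.state = state
        # ...and drop logged messages whose effects that state already contains.
        self.log = [(i, m) for (i, m) in self.log if i > last_msg_id]


p = PassiveLog()
for i in range(1, 6):
    p.queue(i, f"m{i}")
p.receive_state({"counter": 3}, last_msg_id=3)  # active had processed m1..m3
print([i for i, _ in p.log])  # prints: [4, 5]
```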

4.1.3 Consistency within a Replica Group
Figure 4(a) shows passive replication with only one active object and several passive objects. All of the passive objects broadcast a request message for copying the corresponding active object's internal state. Because of the IDPS message commitment mechanism, the active object receives only one of these requests, and it replies with its internal state to the replicated passive objects only once. Internal state consistency among the replicated passive objects is therefore guaranteed. Mixed-replication is shown in Fig. 4(b). Here too, each passive object need not know the replication degree of the corresponding active objects, because the passive objects request the internal state copy through broadcast communication. Replicated active objects run asynchronously, so differences in checkpoint timing may lead them to broadcast different internal states with the same message ID. All passive objects nevertheless adopt the same state, because each selects the first-come internal state message. The total ordering broadcast communication and the IDPS message commitment mechanism provide this consistency by ensuring that every passive object receives the same messages in the same order.

Figure 4: Internal State Copy for a Checkpoint. (a) Passive Replication (only one active object possible); (b) Mixed-Replication (more than one active object possible).

4.2 Failure Recovery Mechanism
The failure of an active object is detected in a passive-oriented manner. After requesting a state copy, the State Reception Method in a passive object waits a specified period for the corresponding active object's internal state. If the state does not arrive, the method assumes the active object has failed and invokes the Mode Management Method in the same passive object. This failure detection relies on the total ordering broadcast communication mechanism: in the absence of failures of the active and passive objects, it ensures that the passive object receives the messages. On detecting the active object's failure, the Mode Management Method changes the passive object's execution mode from passive to active and activates the messages logged in its message logging buffer. The new active objects can then autonomously carry out their rollforward operation. The internal state of the failed active object was copied into them at the previous checkpoint, and all messages sent to the failed active object since that checkpoint have been logged. The domino effect [18] is inherently avoided, because the passive object behaves deterministically and logs all messages sent to the corresponding active object since the latest checkpoint. A new active object can therefore carry out its rollforward operation purely from the messages in its message logging buffer. When several passive objects exist, the Mode Management Method has to select one new active object from among them. In the proposed mechanism, individual passive objects become active autonomously, without negotiating among themselves, as soon as they detect the corresponding active object's failure. This shortens the recovery time needed to choose a new active object and process a rollforward operation. After their modes have changed to active, the new active objects negotiate among themselves and return the extra active objects to passive mode.
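Passive-oriented failure detection followed by rollforward can be sketched as follows. The sketch assumes a deterministic handler `(state, message) -> state` that stands in for the object's methods, and a single reply timeout after a state copy request. `PassiveReplica`, `tick` and `take_over` are hypothetical names. Log purging by message ID is omitted for brevity.

```python
from typing import Any, Callable, List, Optional

Handler = Callable[[Any, Any], Any]  # deterministic: (state, message) -> new state


class PassiveReplica:
    def __init__(self, handler: Handler, checkpoint_state: Any, reply_timeout: float) -> None:
        self.handler = handler
        self.state = checkpoint_state      # state copied at the latest checkpoint
        self.log: List[Any] = []           # messages logged since that checkpoint
        self.mode = "passive"
        self.reply_timeout = reply_timeout
        self.requested_at: Optional[float] = None

    def queue(self, msg: Any) -> None:
        self.log.append(msg)

    def request_state(self, now: float) -> None:
        self.requested_at = now  # the broadcast of the request itself is not modelled

    def receive_state(self, state: Any) -> None:
        self.state = state
        self.requested_at = None  # the active object answered: it is alive

    def tick(self, now: float) -> None:
        # No state reply within the timeout: assume the active object failed.
        if (self.mode == "passive" and self.requested_at is not None
                and now - self.requested_at > self.reply_timeout):
            self.take_over()

    def take_over(self) -> None:
        # Mode Management Method: switch to active and roll forward by replaying
        # the logged messages in order; determinism (EN1) reproduces the lost state.
        self.mode = "active"
        for msg in self.log:
            self.state = self.handler(self.state, msg)
        self.log.clear()


r = PassiveReplica(lambda s, m: s + m, checkpoint_state=10, reply_timeout=2.0)
for m in (1, 2, 3):
    r.queue(m)
r.request_state(now=0.0)
r.tick(now=1.0)  # still within the timeout
r.tick(now=2.5)  # timeout expired: take over
print(r.mode, r.state)  # prints: active 16
```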


Extra active objects are returned to passive mode without any central management element. The Mode Management Method of each new active object broadcasts its site load information, that is, its site ID and the number of waiting events at its site. When a site finds that its load is larger than that of another site, its object returns to passive mode simply by changing its execution mode from active to passive. As a result, only the active object at the lowest-load site remains active. If a new active object receives no other site's load information, it concludes that no other active objects exist. It then dynamically copies itself to another distinct site as a passive object, so that at least one passive object remains available.

In the mixed-replication case, the number of active objects may need to be kept at k (>1). The PDN for the internal state message should then equal k, so each passive object cannot commit the internal state message until it has received all PDN (=k) messages with the same message ID. If the number of active objects falls below PDN, the passive object can never commit the internal state message, and it thereby detects the active object's failure. The new active objects negotiate among themselves by broadcasting each site's load information. If a new active object receives load information lower than its own from PDN or more sites, it returns to passive mode. As explained in the previous Subsection, consistency within the replica group is guaranteed even while several active objects exist during the negotiation.
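The load-based negotiation that returns extra active objects to passive mode can be expressed as a local decision rule, sketched below. Each new active object applies the rule to the load reports it has received, including its own. Because every object sees the same reports in the same order, they all reach consistent decisions. Breaking ties by site ID is an assumption added here so that exactly k objects survive. The function names are hypothetical.

```python
from typing import Dict


def stays_active(self_site: int, loads: Dict[int, int], k: int = 1) -> bool:
    """Return True if this new active object should remain active.

    loads maps site ID -> number of waiting events, as broadcast by each
    new active object. An object returns to passive mode once k (= PDN)
    or more other sites report a lower load; ties are broken by site ID.
    """
    mine = (loads[self_site], self_site)
    lower = sum(
        1 for site, load in loads.items()
        if site != self_site and (load, site) < mine
    )
    return lower < k


def needs_new_passive(self_site: int, loads: Dict[int, int]) -> bool:
    """No other site's load arrived: copy this object elsewhere as a passive."""
    return set(loads) == {self_site}


loads = {1: 7, 2: 3, 3: 5}  # site ID -> waiting events
print(sorted(s for s in loads if stays_active(s, loads)))       # prints: [2]
print(sorted(s for s in loads if stays_active(s, loads, k=2)))  # prints: [2, 3]
```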

5 I/O Type Operation
This Section presents a consensus protocol that achieves exactly-once operation for objects that interact with the external environment through real actions, such as I/O device operations. The main idea is that the object that reaches the I/O Type Access first executes the real action. Figure 5 shows the pseudo-procedures for writing to and reading from the external environment. These procedures are linked with an object as a software library.

In a write operation, the real external operation is performed after negotiation within the replica group. The object first checks whether another object in the replica group has already executed the corresponding I/O action by trying to receive EXECUTION-DECLARE messages. The Async-Wait function asynchronously gets a message with the attribute name EXECUTION-DECLARE. The IDPS message commitment mechanism, with PDN set to one, receives the first-come message and ignores subsequent messages with the same message ID. Because each I/O operation issues a different message ID, the function does not receive EXECUTION-DECLARE messages sent for other I/O operations. If the object has already received the message, it recognizes that the real action has already been performed and does not perform it. Otherwise, the object recognizes that it may be the earliest object to reach the real action. It then broadcasts an EXECUTION-DECLARE message carrying its own site ID to the replicated objects to reach a consensus within the replica group. If the Site-ID in the message received by Sync-Wait matches its own site ID, that site gets the right to execute the real action, and I/O-Write writes the User-Write-Data to the specified I/O-Port. The Sync-Wait function is the same as the Async-Wait function except that it blocks until the corresponding message is received. Even if several replicated objects in a replica group broadcast EXECUTION-DECLARE messages at the same time, only one of them gets permission to execute. This holds because the total ordering broadcast protocol delivers the same messages in the same order to all of them. When only one active object exists, it receives the message it broadcast itself.

A read operation follows the same idea as a write operation, except that the read is performed before the consensus is taken. All the replicated objects in the replica group therefore obtain the same value, namely the one read by the earliest object. This consensus protocol is carried out among the active objects within the replica group. Passive objects queue the EXECUTION-DECLARE messages in the message

    Write(I/O-Port, User-Write-Data)
    {
        Async-Wait(EXECUTION-DECLARE, &Site-ID);
        if (No reception of the EXECUTION-DECLARE messages) {
            Broadcast(Self-Replica-Group, EXECUTION-DECLARE, Self-Site-ID);
            Sync-Wait(EXECUTION-DECLARE, &Site-ID);
            if (Site-ID == Self-Site-ID)
                I/O-Write(I/O-Port, User-Write-Data);
        }
    }
(a) Write Operation

    Read(I/O-Port, &User-Read-Data)
    {
        Async-Wait(EXECUTION-DECLARE, &Site-ID, &Rcv-Value);
        if (No reception of the EXECUTION-DECLARE messages) {
            I/O-Read(I/O-Port, &Temp-Value);
            Broadcast(Self-Replica-Group, EXECUTION-DECLARE, Self-Site-ID, Temp-Value);
            Sync-Wait(EXECUTION-DECLARE, &Site-ID, &Rcv-Value);
        }
        User-Read-Data = Rcv-Value;
    }
(b) Read Operation

Figure 5: A Consensus Procedure


Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS'95) 1060-3425/95 $10.00 © 1995 IEEE


6 Comparison with Related Works
In this Section, the proposed fault-tolerant technique is compared with related works, i.e., Delta-4 [7,8,12], ISIS [9,20], and Auragen [11], as shown in Table 1. Several points are of interest in comparing fault-tolerant mechanisms: transparency, checkpoint technique, failure recovery mechanism, non-deterministic operation, other replication techniques, and communication. First is transparency. All of these works realize transparency of object information, i.e., the location, replication degree, and fault-tolerant mechanism of each object, from the user's point of view. However, transparency from the system's point of view is realized only in IDPS. An IDPS object behaves autonomously in accordance with only its own local information, without any system level manager. In the other three works, on the other hand, the system level (the Group Manager in Delta-4, system processes in ISIS, and low-level software in Auragen) manages the object information to realize user transparency, so extra messages must be exchanged among related objects whenever that information changes. Second is the checkpoint technique. Both IDPS and Auragen take checkpoints under the same conditions, i.e., periodically and according to the number of received messages. In IDPS, however, checkpoint timing control and failure detection of active objects are performed in a passive-oriented manner. The proposed mechanism therefore removes the extra operations for maintaining reliability from the active object and lets the user operation proceed much more efficiently. Delta-4 and ISIS take a checkpoint at every Remote Procedure Call (RPC); although this lets them handle non-deterministic operation, it is expensive owing to the numerous message exchanges. Delta-4 also takes a checkpoint at every real action. Because IDPS broadcasts only minimal information, i.e., the site ID and the input value, at every real action, the volume of these exchanged messages is limited in

Figure 6: Roll-Forward Operation Using I/O Operation

logging buffer in the same way as normal messages. When the passive objects become active to take over the operation of the corresponding failed active object, the consensus protocol guarantees exactly-once operation. In a write operation, if the old active object has already executed the I/O operation, the new active object does not repeat it. Likewise, in a read operation, the new active object consumes the value read by the old active object, which is carried in the logged EXECUTION-DECLARE message, if that message has already been broadcast. Therefore, the system preserves consistency within the replica group and among related objects. Figure 6 shows the rollforward operation of objects that interact with both Message Type Access and I/O Type Access.
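The write and read procedures of Figure 5, together with the way a passive object reuses logged EXECUTION-DECLARE messages when it takes over, can be simulated in a single process. The sketch below is an illustration under stated assumptions, not the IDPS library: TotalOrderBus, ReplicaObject, receive_pending, and integer I/O IDs are hypothetical names; the total ordering broadcast is modelled as one shared append-only sequence that each replica consumes at its own pace; and a Python list stands in for the I/O-Port. The methods async_wait and sync_wait play the roles of Async-Wait and Sync-Wait.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical encoding of an EXECUTION-DECLARE message: (io_id, site_id, value).
Declaration = Tuple[int, str, Optional[int]]


class TotalOrderBus:
    """Stand-in for the total ordering broadcast protocol: one global sequence
    that every replica receives in the same order, at its own pace."""

    def __init__(self) -> None:
        self.sequence: List[Declaration] = []

    def broadcast(self, msg: Declaration) -> None:
        self.sequence.append(msg)


class ReplicaObject:
    def __init__(self, site_id: str, bus: TotalOrderBus, port: List[int]) -> None:
        self.site_id = site_id
        self.bus = bus
        self.port = port        # a list standing in for the external I/O-Port
        self.cursor = 0         # next position of the bus sequence to receive
        self.committed: Dict[int, Tuple[str, Optional[int]]] = {}
        self.next_io_id = 0     # deterministic replicas issue identical I/O IDs

    def _receive_one(self) -> None:
        io_id, site, value = self.bus.sequence[self.cursor]
        self.cursor += 1
        # Message commitment: keep the first-come declaration, ignore later ones.
        self.committed.setdefault(io_id, (site, value))

    def receive_pending(self) -> None:
        """Consume everything broadcast so far, e.g. the logging buffer of a
        passive object at the moment it becomes active."""
        while self.cursor < len(self.bus.sequence):
            self._receive_one()

    def async_wait(self, io_id: int) -> Optional[Tuple[str, Optional[int]]]:
        # Non-blocking: look only at declarations already received.
        return self.committed.get(io_id)

    def sync_wait(self, io_id: int) -> Tuple[str, Optional[int]]:
        # "Blocking": keep receiving in total order until a declaration arrives.
        while io_id not in self.committed:
            self._receive_one()
        return self.committed[io_id]

    def _new_io_id(self) -> int:
        io_id = self.next_io_id
        self.next_io_id += 1
        return io_id

    def write(self, data: int) -> None:
        io_id = self._new_io_id()
        if self.async_wait(io_id) is None:
            self.bus.broadcast((io_id, self.site_id, None))
        winner, _ = self.sync_wait(io_id)
        if winner == self.site_id:
            self.port.append(data)  # the real action, executed once per io_id

    def read(self, device: Callable[[], int]) -> Optional[int]:
        io_id = self._new_io_id()
        if self.async_wait(io_id) is None:
            # The device is read before the consensus is taken ...
            self.bus.broadcast((io_id, self.site_id, device()))
        _, value = self.sync_wait(io_id)
        return value  # ... and every replica adopts the earliest value
```

For example, if replica A calls write(10) and replica B then calls write(10) before it has received A's declaration, B also broadcasts a declaration, but only A writes, because every replica receives A's declaration first in the total order. A third replica that is activated later and first consumes its logging buffer with receive_pending() repeats neither the write nor the device read; it simply adopts the value A read.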

During the rollforward operation, although message type operations are re-executed, the re-transmitted messages are ignored at the receiving sites. The consensus protocol prevents I/O type operations from being executed again until a new active object reaches the point of failure, and then permits normal operation from that point. Therefore, the consensus protocol guarantees consistency in the system even while several new active objects run under the active replication mechanism, before one new active object is selected.
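The rule that re-transmitted messages are ignored at the receiving sites can be sketched as follows, assuming (hypothetically) that a message ID is the pair of the sender's replica group and a send sequence number. Because objects are deterministic, a new active object that re-executes from the last checkpoint produces the same IDs again, so a first-come filter at the receiver drops the repeats and accepts only messages beyond the point of failure. Receiver and run_sender are illustrative names, not IDPS interfaces.

```python
from typing import List, Set, Tuple

# Hypothetical message ID: (sender replica group, send sequence number).
MessageId = Tuple[str, int]


class Receiver:
    """Receiving object that applies each message ID at most once."""

    def __init__(self) -> None:
        self.seen: Set[MessageId] = set()
        self.applied: List[str] = []

    def deliver(self, msg_id: MessageId, payload: str) -> bool:
        if msg_id in self.seen:
            return False  # re-transmitted during rollforward: ignored
        self.seen.add(msg_id)
        self.applied.append(payload)
        return True


def run_sender(group: str, first_seq: int, count: int, receiver: Receiver) -> int:
    """Deterministic sender: re-executing from the same checkpoint reproduces
    the same sequence numbers, hence the same message IDs.
    Returns how many messages the receiver accepted."""
    accepted = 0
    for seq in range(first_seq, first_seq + count):
        if receiver.deliver((group, seq), f"msg {seq}"):
            accepted += 1
    return accepted
```

For instance, if the old active object sends messages 0 to 2 and then fails, and its successor rolls forward from a checkpoint taken after message 0 and goes on to send messages 1 to 4, the receiver ignores the second copies of messages 1 and 2 and accepts only messages 3 and 4.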

Table 1: Comparison of Passive Replication Mechanism
Auragen ISIS IDPS Delta-4 User Level User Level User Level System & User Level (None) (Group Manager) (isis protos process) (H/W & Low Level S/W) Active Oriented Active Oriented Active Oriented Passive Oriented Cyclic or Rcv.Mes.# Cyclic & Rcv.Mes.# RPC or Real Action Cyclic or RPC Unnecessary After Negotiation After Negotiation Before Negotiation Recovery Negotiation (Rollforward operation) (Mes: ReTrans, I/O: Once) (Mes. & I/O: Once) (Message: Once) (Message: Once) Forbidden Allowed Forbidden Allowed Non-Deterministic None None Active Replication Active Replication Other Techniques Mixed-Replication Leader-Follower Atomic Broadcast 3-Way Atomic Multicast Atomic Broadcast Reliable Multicast Communication Project Name Transparency (Object Management) Checkpoint


comparison with that of the checkpoint information. In IDPS, deterministic operation of each object is assumed in order to allow two or more active objects in a replica group. If non-deterministic operation is required, a checkpoint should be taken both at every message transmission and at every I/O operation, as in Delta-4 and ISIS. Third is failure recovery speed. In IDPS, as soon as the individual passive objects detect the failure of the corresponding active object, all of them autonomously become active. After that, the extra active objects are returned to passive mode by negotiation among the new active objects. The recovery response time is therefore shorter than with the Delta-4 and ISIS techniques, in which an object's execution mode is changed only after negotiation among the passive objects. In IDPS, the total system load increases while all replicas in the replica group are active, before one new active object is selected. However, as the rollforward operation is carried out at a lower priority until one new active object is selected, the recovery response time of IDPS is guaranteed to be less than or equal to that of the other works. Namely, when no CPU idle time exists during the negotiation among the new active objects, the recovery response time is equal to that of the other works. If CPU idle time exists, the new active objects use it to execute the rollforward operation, so in general the recovery response time is reduced. Finally, we compare whether other replication techniques are supported. IDPS and Delta-4 support other replication techniques. IDPS supports an active replication technique similar to that of Delta-4. In addition, as a technique similar to mixed-replication on IDPS, Delta-4 supports the leader-follower technique, in which all the replicas behave in active mode and only the leader objects of the individual replica groups communicate with one another.
One might regard this mechanism as mixing active and passive objects, because the followers do not send any messages and are in that sense passive. However, all of the followers perform the actual processing, which makes inefficient use of hardware resources. Efficient use of resources is precisely the merit of the passive replication mechanism; therefore, the leader-follower technique differs from mixed-replication on IDPS.
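The recovery behaviour described above can be illustrated with two small functions. Both are hypothetical sketches: this Section does not specify the negotiation rule, so the first-claim-in-total-order rule below is an assumption borrowed from the EXECUTION-DECLARE consensus of Section 5, and recovery_time is a simplified model (rollforward work measured in CPU time, a constant idle fraction during negotiation) meant only to show why IDPS recovery is never slower than negotiating first and rolling forward afterwards.

```python
from typing import List


def mode_after_negotiation(self_site: str, ordered_claims: List[str]) -> str:
    """Decide locally whether this replica stays active after a takeover.

    Assumed rule (not specified in the text): every replica that turned active
    broadcasts a claim, and the first claim in the total order wins. Because
    all replicas receive the same claims in the same order, each one reaches
    the same decision on its own, without a system level manager."""
    if ordered_claims and ordered_claims[0] == self_site:
        return "active"
    return "passive"


def recovery_time(negotiation: float, rollforward: float, idle_fraction: float) -> float:
    """Simplified model of the recovery-time argument.

    negotiation:   time until one new active object is selected
    rollforward:   CPU time needed to re-execute from the last checkpoint
    idle_fraction: share of the CPU left idle during negotiation (0..1),
                   which the low-priority rollforward can use

    Negotiate-then-roll-forward schemes need negotiation + rollforward.
    Here the part done early on idle CPU is subtracted, so the result is
    never larger, and it is equal when idle_fraction is 0."""
    done_early = min(rollforward, idle_fraction * negotiation)
    return negotiation + rollforward - done_early
```

With idle_fraction set to 0 the model reproduces the no-idle-CPU case in the text, where IDPS and the other works recover in the same time; any idle CPU during negotiation shortens IDPS recovery.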

7 Application Example
This Section describes a train traffic control system as a practical application example. A train traffic control system traces train locations and controls train directions so as to ensure safe and punctual operation. A railway is divided into several straight portions called sections, each of which is a unit of train location management. Trains are traced around the whole railway based on the status of the sections and signals and on the train service diagram. A central management mechanism for such a system becomes a bottleneck in load, reliability, processing, and expansibility. As shown in Fig. 7, four different kinds of objects, i.e., I/O, TRAIN, STATION, and SECTION objects, control train directions through communication among themselves. SECTION corresponds to a section of the railway and traces train locations by exchanging messages with its neighboring SECTIONs. It also passes this location information to the relevant TRAINs. A TRAIN object is dynamically generated in duplicate on the two least loaded sites when the train's starting time arrives. Each TRAIN contains its diagram and, when it reaches a specified location, sends a request message for controlling apparatuses, such as points and signals, to the next STATION. STATION produces control signals for the apparatuses in the station in response to requests from TRAINs and sends these signals to I/O. Finally, I/O sends the control signals to the apparatuses using the proposed I/O type operation. I/O also detects any change in the railway state and sends this information to the relevant SECTIONs.

Figure 7: Objects of Train Traffic Control Systems

Applying the proposed replication mechanism to the train traffic control system yields the following benefits.
*Expansibility of systems: A system's expansion and modification, such as a railway extension or a station addition, are achieved by the addition and modification of


the corresponding object without stopping the system.
*Improvement of processing performance: A load increase owing to high-density train traffic and/or railway extension is resolved by adding computers and then copying some objects to the added computers during operation, so as to maintain acceptable load levels.

*Selection of the most suitable replication mechanism: As a STATION object requires non-stop operation when a train approaches a station, the active replication mechanism is suitable for it. On the other hand, as a SECTION object that contains no point tolerates a little latency for switching from passive to active, the passive replication mechanism is suitable for it, reducing the load on resources. In this situation, the proposed mechanism can easily mix the active and passive replication mechanisms in one target system. In addition, system designers can freely decide each object's replication degree in proportion to its required reliability; for example, STATION is triplicated and TRAIN is duplicated. Therefore, higher resource utilization and better economy are achieved while the required performance is maintained.
*Improvement of non-stop maintainability: Maintenance during operation is possible by detaching the target computer after the replica groups of all objects on it have been changed to active replication. This execution mode change is achieved by invoking the Mode Management Method of the corresponding objects.

8 Conclusion
This paper has proposed a fault-tolerant technique, based on an object model and a total ordering broadcast protocol, in which an object behaves autonomously, independent of the location, replication degree, and fault-tolerant mechanism of each object. In addition, this paper has proposed a new passive replication mechanism that realizes an automatic and transparent checkpoint mechanism in each object. A consensus protocol for the I/O type operation has also been proposed to realize exactly-once operation. With the proposed mechanism, each object can autonomously change its location, replication degree, and fault-tolerant mechanism in accordance with its required reliability and responsiveness, without affecting other objects. Furthermore, the practical application of the proposed mechanism has shown that an optimal reliable system can be easily constructed.

References
[1] S. Tamura, et al., IDPS: Intellectual Distributed Processing, Pacific Computer Communications Systems, Proc. Symposium, pp. 129-133, 1985.
[2] T. Seki, et al., An Operating System for the Intellectual Distributed Processing System - An Object Oriented Approach Based on Broadcast Communication -, IPSJ JIP 14(44), 1992.
[3] T. Seki, et al., A Fault-Tolerant Architecture for the Intellectual Distributed Processing System, Dependable Computing and Fault-Tolerant Systems, Vol. 6, Springer-Verlag, pp. 333-353, 1991.
[4] E. C. Cooper, Replicated Distributed Programs, Proc. 10th ACM Symp. on Operating System Principles, Operating Systems Review, 19(5), pp. 63-78, 1985.
[5] D. Powell, et al., The Delta-4 Approach to Dependability in Open Distributed Computing Systems, Proc. of 18th FTCS, pp. 246-251, 1988.
[6] R. F. Cmelik, et al., Fault Tolerant Concurrent C: A Tool for Writing Fault Tolerant Distributed Programs, Proc. of 18th FTCS, pp. 56-61, 1988.
[7] M. Chereque, et al., Active Replication in Delta-4, Proc. of 22nd FTCS, pp. 28-37, 1992.
[8] N. A. Speirs, Using Passive Replications in Delta-4 to Provide Dependable Distributed Computing, Proc. of 19th FTCS, pp. 184-190, 1989.
[9] K. P. Birman, et al., Implementing Fault-Tolerant Distributed Objects, IEEE Trans. on SE, 11(6), pp. 502-508, 1985.
[10] O. Babaoglu, Fault-Tolerant Computing Based on Mach, ACM Operating Systems Review, 24(1), pp. 27-39, 1990.
[11] A. Borg, et al., A Message System Supporting Fault Tolerance, Proc. of 9th ACM Symp. on Operating System Principles, pp. 90-99, 1983.
[12] P. A. Barrett, et al., The Delta-4 Extra Performance Architecture (XPA), Proc. of 20th FTCS, pp. 481-488, 1990.
[13] M. Ahamad, et al., Fault Tolerant Computing in Object Based Distributed Operating Systems, Proc. of 6th Symposium on Reliable Distributed Systems, pp. 115-125, 1987.
[14] K. Birman, et al., Lightweight Causal and Atomic Group Multicast, ACM Trans. on Comp. Syst., 9(3), pp. 272-314, 1991.
[15] S. W. Luan, et al., A Fault-Tolerant Protocol for Atomic Broadcast, IEEE Trans. on Parallel and Distributed Systems, 1(3), pp. 271-285, 1990.
[16] J. M. Chang, et al., Reliable Broadcast Protocols, ACM Trans. on Comp. Syst., 2(3), pp. 251-273, 1984.
[17] M. F. Kaashoek, et al., An Efficient Reliable Broadcast Protocol, Operating Systems Review, 23(4), pp. 5-19, 1989.
[18] R. Koo, et al., Checkpointing and Rollback-Recovery for Distributed Systems, IEEE Trans. on SE, 13(1), pp. 23-31, 1987.
[19] F. B. Schneider, Byzantine Generals in Action: Implementing Fail-Stop Processors, ACM Trans. on Comp. Syst., 2(2), pp. 145-154, 1984.
[20] The Isis Distributed Toolkit, V3.0, User Reference Manual.
