A Quasi-synchronous Algorithm for Checkpointing in ... - CiteSeerX

A Quasi-synchronous Algorithm for Checkpointing in Distributed Systems D. Manivannan and M. Singhal Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210. email: fmanivann, [email protected] Abstract

In this paper, we propose a quasi-synchronous algorithm for checkpointing in distributed systems. This algorithm preserves process autonomy by allowing them to take checkpoints asynchronously and uses communication-induced checkpoint coordination for the progression of the recovery line which helps in bounding rollback propagation during a recovery. Thus, it has the easeness and low overhead of asynchronous checkpointing and the recovery time advantages of synchronous checkpointing. There is no extra message overhead involved during checkpointing and the additional checkpointing overhead is nominal. The algorithm ensures the existence of a recovery line consistent with the latest checkpoint of any process all the time. This property of the algorithm is a highly desirable feature for rollback recovery because a failed process needs only rollback to its latest checkpoint and request other processes to rollback to a consistent checkpoint, thus avoiding domino eect completely. It also helps in garbage collection because after a process has established a recovery line consistent with its latest checkpoint, all the processes can delete the checkpoints that precede the recovery line. Multiple processes can establish recovery lines consistent with their latest checkpoints simultaneously without intruding other processes.

Keywords: distributed checkpointing, global snapshot collection, failure recovery, fault-

tolerance.

1

1 Introduction An important consideration in the design of systems is how to make them resilient to failures. Checkpointing is a technique used to store the state of a process in the stable storage so that the process can resume to its normal computation from that state on recovery from a failure, instead of starting from the initial state [10]. Such checkpointingrecovery techniques for a single processor system have been extensively studied in the literature [14]. In a distributed system, the states of processes depend on one another due to interprocess communication. So, when a process P rolls back after a failure, the processes that have states directly or transitively dependent on P 's state are forced to rollback. The use of checkpoints on a stable storage and rollback-recovery protocols are well established techniques for dealing with process failures in a distributed system as well. When a failure occurs, a rollback protocol uses the checkpoints and message logs to restore the system to a consistent global state [16]. By a consistent global state we mean that if the receipt operation of a message has been recorded in the global state, then the send operation of that message must also have been recorded. In the literature, several checkpointing schemes have been proposed for distributed systems. They can be broadly classi ed in to two categories { asynchronous and synchronous. In asynchronous checkpointing [3, 13, 23], processes take checkpoints periodically without any coordination with each other. To recover from a failure, processes communicate with each other to restore the system to a consistent set of local states. If their local states are causally related, sites that received messages which are responsible for causal dependencies, rollback to eliminate these causal dependencies. This process is repeated 2

until the local states of all the processes are free from causal dependencies. This approach allows maximum process autonomy for taking checkpoints and has low checkpointing overhead. However, this approach may suer from the domino eect, in which the processes rollback recursively while determining a consistent set of checkpoints. To reduce domino eect, Kim et al. [11] and Venkatesh et al. [21] use the dependency tracking and insert checkpoints before processing a new message that introduces dependency. Message logging [7, 9, 18, 20, 25] and message reordering [24] have been suggested in the literature to cope with the domino eect. Asynchronous checkpointing requires multiple checkpoints to be stored at each process. Thus, storage requirement may be large. This problem can be solved by periodically establishing a recovery line1 and deleting the checkpoints that precede the checkpoints in the recovery line. In synchronous checkpointing schemes, domino-free recovery is achieved by sacri cing process autonomy and incurring extra message overhead during checkpointing. In this approach, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system [6, 12, 15]. The storage requirement for the checkpoints is minimum because each process keeps only one checkpoint in the stable storage at any given time. Process execution may have to be suspended during the checkpointing coordination as in [10, 12], resulting in performance degradation.

Paper Objectives The main objective of the paper is to present a checkpointing algorithm which ensures the existence of a recovery line consistent with the latest checkpoint of any process all the time. The algorithm achieves this objective by allowing processes to take check1

A globally consistent set of checkpoints.

3

points asynchronously while employing communication-induced checkpointing coordination. Thus, the algorithm is neither synchronous nor asynchronous. It has the easeness and low overhead of asynchronous checkpointing and the recovery time advantages of synchronous checkpointing. The algorithm ensures the existence of a recovery line consistent with the latest checkpoint of any process, which helps in bounding rollback for rollback recovery. There is no extra message overhead involved in the checkpointing coordination. Only a sequence number is piggybacked with every computation message. We also present a global checkpoint collection algorithm based on our checkpointing algorithm. The global checkpoint collection is non-intrusive and has minimum overhead. When a process wants to collect a global checkpoint consistent with its latest checkpoint, it only needs to send a request to all the other processes and receive a reply from all the other processes. No explicit coordination is required for global checkpoint collection. Multiple processes can simultaneously initiate global checkpoint collection and succeed in collecting global checkpoint in a non-intrusive manner. The rest of the paper is organized as follows. In the next section, we present the system model. In Section 3, we present the checkpointing algorithm and the global checkpoint collection algorithm. We also prove the correctness of the global checkpoint collection in Section 3. In Section 4, we modify the checkpointing algorithm to reduce the checkpointing overhead further. We analyze the overhead involved in the modi ed checkpointing algorithm in Section 5. We compare the performance of the algorithm with the existing algorithms in Section 6. Section 7 concludes the paper.

4

2 System Model and Notations A distributed program consists of N sequential processes denoted by P1,P2, ,PN . The concurrent execution of all the processes on a network of processors is called a distributed computation. The processes do not share a global memory or a global physical clock. Message passing is the only way for processes to communicate with one another. The computation is asynchronous: each process evolves at his own speed and messages are exchanged through communication channels, whose transmission delays are nite but arbitrary. We assume that no messages are lost, altered or spuriously introduced. No assumption is made about the FIFO nature of the channels. Activity of each process is modeled by a sequence of events (i.e., executed actions). Three kinds of events are considered: internal, send, and receive events. At a given time, the local state of a process Pi is de ned by the values of the local variables managed by this process. Occurrence of an event generally causes a change of the local state of the process. The local state of a process saved in stable storage is called a checkpoint of the process. Each process takes checkpoints independently. Processes also take checkpoints as a result of the reception of some messages. Independently taken checkpoints are called basic checkpoints and those triggered by message receptions are called forced checkpoints.

Each checkpoint of a process is assigned a unique sequence number. The sequence number assigned to a checkpoint C is denoted by C:sn. The sequence numbers assigned to the checkpoints of a process increase monotonically. Each message is piggybacked with the sequence number of the latest checkpoint of the process sending the message. The sequence number piggybacked with message M is denoted by M:sn. The checkpoint with sequence number m of process Pi is denoted by Ci;m. The send and receive events of a 5

message M are denoted respectively by send(M ) and receive(M ). We say send(M ) 2 Ci;m if and only if message M was sent by process Pi before the checkpoint Ci;m was recorded. Also, we say receive(M ) 2 Ci;m if and only if message M was received and processed by

Pi before the checkpoint Ci;m was recorded.

De nition 1 A set S = fC1;m ; C2;m ; ; CN;mN g of N checkpoints, one from each process is said to be a global checkpoint2 if for any message M and for any integer i; 1 i N : receive(M ) 2 Ci;mi =) send(M ) 2 Cj;mj for some j; 1 j N . 1

2

3 The Checkpointing Algorithm

3.1 Basic Idea Checkpointing

Each process is allowed to take checkpoints asynchronously. In addition, processes are forced to take checkpoints as a result of the reception of some messages. When a process sends a message, it appends the sequence number of its latest checkpoint to the message. When a process Pi receives a message, if the sequence number accompanying the message is greater than the sequence number of the latest checkpoint of Pi , then before processing the message, it takes a checkpoint and assigns the sequence number that it received in the message to the checkpoint taken. This communication-induced checkpointing advances the recovery line which helps reducing rollback propagation in case of rollback recovery.

Global checkpoint collection When process Pi wants to establish a recovery line consistent with its latest local checkpoint, it sends a checkpoint request message along with the sequence number n of its latest 2

Also called a global snapshot or a recovery line.

6

local checkpoint to all the processes. On receiving this message, a process Pj sends the sequence number m of a local checkpoint, where m is the smallest positive integer greater than or equal to n; if such a checkpoint does not exist (i.e., all the checkpoints of Pj have sequence numbers < n), then Pj takes a checkpoint and assigns n as sequence number to the checkpoint taken and sends the number n to Pi. After Pi receives a response from all the processes with sequence numbers m1; m2; ; mN , it declares fC1;m1 ; C2;m2 ; ; CN;mN g as a global checkpoint. Thus, a process can always establish a recovery line consistent with its latest checkpoint in a non-intrusive manner. This is a very desirable feature for a rollback recovery in distributed systems. For example, when a process Pi fails, it can restart from its latest checkpoint and request the other processes to rollback to a checkpoint consistent with

Pi 's latest checkpoint.

3.2 Formal Description of the Algorithm

The Quasi-synchronous Checkpointing Algorithm Data Structures at Process Pi

sni : integer (:= 0);

fSequence number of the latest checkpoint, initialized to 0. This is updated every time a new checkpoint is takeng

When process Pi sends a message M

M:sn := sni ; fsequence number of the current checkpoint appended to the message M g send (M ); When it is time for process Pi to take a basic checkpoint sni := sni + 1; Take checkpoint C ; C:sn := sni ; Process Pj , upon receiving a message M from process Pi If snj < M:sn then snj := M:sn; Take checkpoint C ; C:sn := snj ; Process the message.

7

The Global Checkpoint Collection Algorithm When process Pi wants to collect a global checkpoint

send request check point(i; sni ) to all processes including process Pi ; After receiving reply (j; mj ) from each process Pj , Declare S = fCj;mj j 1 j N g as a global checkpoint; Process Pj , upon receiving request check point(i; m) from Process Pi If (m > snj ) then snj := m; Take checkpoint C ; C:sn := snj ; send reply (j; snj ) to process Pi ; Else Find the earliest checkpoint C such that C:sn m; send reply (j; C:sn) to Pi .

Now, we explain the checkpointing algorithm with an example. 0 P 1

P 2

P 3 P 4

[

2 [

1 [ M 1

1 [

0

[

3 [ M 2 2 [ M 3

0

1 [

0

1 [

[ [

4 [ * 3 [ *

4 [

M 4

Figure 1: Example to illustrate the algorithm

3.3 An Example We illustrate the checkpointing algorithm using the example in (Figure 1). The basic checkpoints are shown in the gure as \[" and the forced checkpoints are shown as \[". The sequence number assigned to a checkpoint is shown next to it. Message M3 forces process P4 to take a checkpoint with sequence number 3 before processing M3 because

M3:sn = 3 and sn4(= 1) < 3 while receiving the message. Similarly, message M4 also 8

forces P3 to take a checkpoint before processing the message. However messages M1 and

M2 do not force the receiving processes to take a checkpoint before processing them. Note that there may be gaps in the sequence numbers assigned to checkpoints. For example, process P3 does not have checkpoints with sequence numbers 2 and 3. If process P1 wants to establish a global checkpoint consistent with its local checkpoint

C1;3, it sends request check point(1; 3) to all the processes including itself. In response to this message, processes P1; P3, and P4 will send reply(1; 3); reply(3; 4), and reply(4; 3), respectively; however, process P2 will take a new checkpoint with sequence number 3 before sending reply(2; 3) to P1. After receiving all these replies, process P1 declares

fC1;3; C2;3; C3;4; C4;3g as a global checkpoint.

3.4 Correctness of Global Checkpoint Collection Observation 1: If Pi receives reply(j; mj ) from Pj (1 j N ) in response to its request check point(i; m) message, then 1. there exists a local checkpoint Cj;mj of process Pj such that mj m; 2. all checkpoints taken by Pj prior to checkpoint Cj;mj have sequence numbers less than m.

Observation 2: For any message M and for any i (1 i N ), send(M ) 2 Ci;mi () M:sn < mi.

Observation 3: For any message M and for any i (1 i N ), receive(M ) 2 Ci;mi =) M:sn < mi. Note that the converse is not true.

Observation 4: Pi processes a received message M only after it has taken a checkpoint with sequence number M:sn. 9

Safety Property: Theorem 1 If a process Pi declares S = fC1;m ; C2;m ; :::; CN;mN g as a global checkpoint, 1

2

then S indeed is a global checkpoint.

Proof: Let m = mi. Note that process Pi must have sent request check point(i; m) in order to declare S as global checkpoint. From Observation 1, we have

8j; 1 j N : mj m:

(1)

Suppose S is not a global checkpoint. Then, there exists a message M sent by some process Pj to some process Pk such that send(M ) 62 Cj;mj but receive(M ) 2 Ck;mk . So,

M:sn mj from Observation 2

(2)

M:sn < mk from Observation 3

(3)

From equations 1, 2, and 3 we get,

m mj M:sn < mk

(4)

From Observation 1, all the checkpoints taken by process Pk prior to checkpoint Ck;mk have sequence numbers less than m. Since M:sn m, Pk must have processed M only after a checkpoint with sequence number m had been taken (from Observation 4). Since receive(M ) 2 Ck;mk , message M must have been received by process Pk before the checkpoint Ck;mk was taken. Hence there exists a checkpoint Ck;mk of Pk that was taken 0

before Ck;mk such that mk m. This contradicts the fact that all the checkpoints taken 0

by process Pk prior to checkpoint Ck;mk have sequence numbers less than m. Hence our assumption that S is not a global checkpoint in wrong. Hence, the theorem. 2 The following corollary gives a sucient condition for a set of local checkpoints to be a part of a global checkpoint. 10

Corollary 1 Let S = fCi ;mi ; Ci ;mi ; :::; Cik;mik g be a set of local checkpoints from distinct processes. Let m = minfmi ; mi ; :::; mik g. Then, S can be extended to a global checkpoint if 8 l (1 l k), Cil;mil is the earliest checkpoint of Pil such that mil m. 1

1

2

2

1

2

Proof: Without loss of generality, we can assume that m = mi . Now, suppose process 1

Pi1 initiates a global checkpoint collection by sending request check point(i1; mi1 ) to all the processes and declares S = fC1;m1 ; C2;m2 ; :::; CN;mN g as a global checkpoint after re0

ceiving replies from all the processes. Claim: S S . 0

Proof of Claim: In response to the message request check point(i1; mi1 ) of process Pi1 , process Pil (1 l k) must have sent reply(il; mil ) because Cil;mil is the earliest checkpoint of Pil whose sequence number is m. Hence, for each l : (1 l k), Pi1 should have included Cil;mil as the checkpoint of process Pil in the set S . So, S S . Hence, S 0

0

can be extended to a global checkpoint. 2

Corollary 2 gives a sucient condition for a set of checkpoints, one for each process, to be a global checkpoint.

Corollary 2 Let S = fC1;m ; C2;m ; :::; CN;mN g be a set of local checkpoints one for each process. Let m = minfm1; m2; :::; mN g. Then, S is a global checkpoint if 8 i (1 i N ), Ci;mi is the earliest checkpoint of Pi such that mi m. 1

2

Proof: This is a special case of Corollary 1. 2

Liveness Property: Theorem 2 The global checkpoint collection algorithm terminates in nite time. Proof: When a process initiates global checkpoint collection, it sends a request check point message to all the processes. On receiving this message, a process either sends a reply 11

immediately or sends a reply after taking a checkpoint. Thus, assuming the message delay (i.e., maximum time taken by a message to reach from one process to another process) is bounded by d and the maximum time taken by a process to take a checkpoint is bounded by c, the initiating process would receive reply for its request from all the processes in 2 d + c time (we assume the processing time of the message is included in d). Thus, the global checkpoint collection algorithm terminates in nite time. 2

4 Reducing the Checkpointing Overhead In this section, we modify the quasi-synchronous checkpointing algorithm to reduce the checkpointing overhead. First, we motivate the necessity of the modi cation with examples. In asynchronous checkpointing, checkpoints are taken periodically. For example, a process may take checkpoints once every x time units or a process may take checkpoints after having done a predetermined amount of computation (which may depend on the number of messages received, number of messages sent, the amount of local computation done, a combination of one or more of these or some other parameter). We call the interval during which a process has to take a basic checkpoint, a checkpoint interval of that process. Figure 2(a) illustrates the checkpoints taken under an asynchronous checkpointing scheme. The communication pattern in Figure 2(a) renders all of the checkpoints useless for rollback recovery since the recovery line stays always at the beginning of the computation. When the quasi-synchronous checkpointing algorithm is used for the same computation and basic checkpointing pattern, the checkpoints taken are shown in Figure 2(b). As can be seen from Figure 2(b), the number of forced checkpoints is equal to the number of basic checkpoints. Thus, the quasi-synchronous algorithm results in 12

high checkpointing overhead for this communication pattern; however, the recovery line stays at the end. This high checkpointing overhead is incurred because none of the basic checkpoints taken is useful for advancing the recovery line and the quasi-synchronous checkpointing algorithm always tries to make sure that a recovery line consistent with the latest checkpoint of any process exists. 0 P 1

P 2

[

1

2

3

4

[

[

[

[

0

1

2

3

4

[

[

[

[

[ basic checkpoint

(a) 0 P 1

P 2

P 1

P 2

[

1

2

3

4

5

6

7

8

[ *

[

[ *

[

[ *

[

[ *

[

0

1

2

3

[

[

[ *

[

0

1

2

3

4

[ *

[ *

[ *

[ *

[

4

5

6

[ * (b)

[

[ *

7

8

[

0

1

2

3

4

[

[

[

[

[

forced checkpoint

[ *

(c)

Figure 2: Communication-induced checkpointing coordination: (a)asynchronous checkpointing and communication pattern; (b) quasi-synchronous checkpointing; (c) quasisynchronous checkpointing with modi cation To reduce the checkpointing overhead, we propose the following modi cation to the quasi-synchronous algorithm: if a process has taken a forced checkpoint, then it can skip taking the next basic checkpoint. The checkpoints taken using this new strategy for 13

0

[

P 1

1 [

0 P 2

2 [

3 [

4 [

1 [

[

[ [

1 [

2 [

3 [

4 [ 4 [

2 [

4 [

0 P 3

8 [ 4 [ 2 [

(a)

0 P 2

7 [

1 [

[

0 P 1

6 [ 3 [

2 [

0 P 3

5 [

[

5 [

6 [ 6 [

7 [

8 [ 8 [ 8 [

(b)

Figure 3: Communication-induced checkpointing coordination: (a)asynchronous checkpointing and communication pattern; (b) corresponding quasi-synchronous checkpointing with the new strategy; the same computation is shown in Figure 2(c). The total number of checkpoints taken in Figure 2(c) is same as the total number of checkpoints in Figure 2(a); however, the recovery line in Figure 2(a) stays in the beginning whereas the recovery line in Figure 2(c) stays at the end. Figure 3(a) illustrates a computation in which dierent processes take checkpoints at dierent pace. In this gure, P1 takes a basic checkpoint once every x time units, P2 takes a basic checkpoint once every 2 x time units, and P3 takes a basic checkpoint every 4 x time units. If process P3 fails after taking checkpoint C3;2, it would force all other processes to rollback to the checkpoint with sequence number 2 under rollback recovery; in particular, P1 has to rollback a great distance unnecessarily. For this computation and basic checkpointing pattern, even if we use our quasi-synchronous checkpointing al14

gorithm, the checkpoints taken by the processes would be the same and it will not help in advancing the recovery line. This happens because the processes which take checkpoints at a lower pace (i.e., the processes P2 and P3), are not aware of the checkpoints taken by the process (P 1) which takes checkpoints at a faster pace (i.e., the process which takes checkpoints at faster pace does not send any messages to the processes that take checkpoints at lower pace). If the recovery line consistent with the latest checkpoint of any process should stay close to the end of the computation, the sequence numbers of the latest checkpoints of all the processes should be close to each other. To keep the sequence numbers of the latest checkpoints of all the processes close to each other, we propose the following strategy: each process Pi keeps an integer variable

nexti, which is incremented periodically after every x time units, where x is the smallest of the lengths of the checkpoint intervals of all the processes (in fact, x could be smaller). When process Pi takes a basic checkpoint, it assigns nexti as the sequence number for the checkpoint taken. Since all the processes increment the variable nexti after every x time units, the values of the variable next of all the processes will be equal to each other if the local clocks do not drift. In general, the drift between the local clocks will be very small when compared to x. So, at any given time,

jnexti ? nextj j 1 8 i; j : 1 i; j N: When we use this strategy, the checkpoints taken for the computation of Figure 3(a) are shown in Figure 3(b). Even though no extra checkpoints are taken, because of the numbering of the checkpoints taken, if a process fails at any point, other processes do not have to rollback beyond 4 checkpoints under rollback recovery using our algorithm for determining global checkpoint. We prove later that no process will have to rollback 15

beyond Q checkpoints where Q is the maximum of the ratios of the checkpoint interval lengths of any two processes. In this example, Q = 4. Next, we present the modi ed quasi-synchronous algorithm incorporating the new strategies.

Modi ed Quasi-synchronous Checkpointing Algorithm We only present the actions that need to be taken when a process tries to take a basic checkpoint. All the other parts of the checkpointing algorithm remain the same. The global checkpoint collection algorithm remains the same. Data Structures at Process Pi sni := 0; fSequence number of the latest checkpoint, initialized to 0. This is updated every time a new checkpoint is taken.g nexti := 1; fSequence number to be assigned to the next basic checkpoint initialized to 1g

When it is time for process Pi to increment nexti nexti := nexti + 1; fnexti incremented at periodic time intervalsg When it is time for process Pi to take a basic checkpoint

If nexti > sni then fskips taking a basic checkpoint if nexti sni (i.e., if it had already sni := nexti ; taken a forced checkpoint C with C:sn nexti in this Take checkpoint C ; checkpoint interval).g C:sn := sni ;

The modi ed quasi-synchronous checkpointing algorithm, hereafter referred to as MQSA, allows a process to skip taking a basic checkpoint if it had already taken a forced checkpoint whose sequence number is greater than or equal to the expected sequence number of the next basic checkpoint, thus making the checkpointing overhead lower than the original algorithm. Since there is no change in the global snapshot collection algorithm, the correctness proof of global snapshot collection for the MQSA remains unchanged.

16

5 An Overhead Analysis In this section, we analyze the overhead involved in the MQSA. In checkpointing algorithms, generally three kinds of overheads exist. First is the extra control messages required for checkpoint coordination, second is the the control information sent on computation messages, and the third is the number of checkpoints that need to be taken (i.e., the checkpointing overhead). In the MQSA, each computation message is piggybacked with the sequence number of the latest checkpoint. There is no other message overhead. So, we only analyze the checkpointing overhead involved in the MQSA. The checkpointing overhead depends on the run-time communication pattern as well as the basic checkpointing pattern. Let basicnum denote the total number of checkpoints the processes would take for a basic checkpointing pattern; let modified quasinum denote the total number of checkpoints the processes would take for the same computation and basic checkpointing pattern when the MQSA is used. We de ne the induction ratio R for a computation as

quasinum R = modified basicnum We use R to analyze the checkpointing overhead of the MQSA. One would desire R to be as close to 1 as possible. If R = 1, then there is no extra checkpointing overhead induced by the MQSA. If R > 1, then there is additional checkpointing overhead. We analyze the checkpointing overhead of the MQSA under two given basic checkpointing patterns: (i) Each process takes basic checkpoints at xed time intervals periodically, the time interval being same for all processes. (ii) Each process takes basic checkpoints at its own pace. The results of the analysis in these two cases are given in Assertions 1 and 2, respectively below. 17

Assertion 1: Assume that under a basic checkpointing pattern, each process takes a basic checkpoint at the end of every x time units, and the local clocks of the processes can drift by at most where < 21 x. Then the MQSA has no additional checkpointing overhead for this basic checkpointing pattern. Thus, R = 1 and the algorithm ensures that the recovery line stays at the end of the computation.

Proof: The variable next of each process is incremented every x time units. Thus, the variable next is incremented in every checkpoint interval once. Without loss of generality, we can assume that all processes start simultaneously (If a process starts at a dierent time than other processes, it initializes the value of its variable next accordingly) and take their basic checkpoint in the later half of their checkpoint interval.

Claim 1: For any i; 1 i N , process Pi takes at most one forced checkpoint in any checkpoint interval. Suppose Pi takes two or more forced checkpoints in its tth checkpoint interval. Let C1 and C2 be any two such checkpoints taken in that order. Since each process Pi attempts to take a basic checkpoint once in each checkpoint interval, the sequence number of the latest checkpoint of process Pi at the beginning of tth checkpoint interval is t ? 1. Thus,

C1:sn t and C2:sn t + 1. This implies that Pi , in its tth checkpoint interval, had received a message M such that M:sn t + 1, which implies that there exists some process

Pk which had already taken a basic checkpoint with sequence number t + 1. Thus, Pk must have already passed the rst half of its (t + 1)th checkpoint interval while Pi is still in the tth checkpoint interval. This is impossible since the drift in the local clock values of the processes is < 21 x. This contradiction proves Claim 1. It is also clear that if a process Pi takes a forced checkpoint in a checkpoint interval, 18

it skips taking a basic checkpoint in that checkpoint interval since nexti sni . So, in each checkpoint interval a process either takes a basic checkpoint or a forced checkpoint but not both. Thus, after the tth checkpoint interval, each process would have taken exactly t checkpoints if the MQSA is used. Thus, the total number of checkpoints taken by processes using MQSA is same as the total number of checkpoints the process would have taken under the basic checkpointing pattern. Hence, R = 1. Moreover, since at the end of checkpoint interval t, the sequence number of the latest checkpoint of all the processes is t, the recovery line always stays at the end of the computation. 2

Assertion 2: If processes take basic checkpoints at their own pace, then the induction ratio R dQe where Q is the maximum of the ratios of the lengths of the basic checkpoint intervals of any two processes. Moreover, no process need to rollback to a distance more than dQe under rollback recovery.

Proof: The variable next of each process is incremented every x time units, where x is the smallest of the lengths of the checkpoint intervals of all the processes. We consider the checkpointing overhead in the worst case scenario. The worst case scenario occurs when each process broadcasts a message to every other process after taking a checkpoint and the message delay is 0. It is clear that the process with the smallest checkpoint interval will force all other processes to take a forced checkpoint after it takes a basic checkpoint. Thus, at any given time, the total number of checkpoints taken by any process is same as the total number of basic checkpoints the process with smallest checkpoint time interval would have taken under the basic checkpointing scheme. Thus, if Q denotes the maximum ratio of the lengths of the basic checkpoint intervals of any two processes, the induction ratio R dQe. Generally, R will be much less than 19

dQe because if the processes with smaller checkpoint intervals do not send messages to processes with larger checkpoint intervals frequently, then fewer forced checkpoints will be taken. Moreover, even if no messages are exchanged between the processes, jsni ? snj j

dQe at any given time, i.e., the sequence numbers of the latest checkpoints of the processes will dier by at most dQe. This is because the process that has the smallest checkpoint interval will take a basic checkpoint every x time units, whereas the process with the largest checkpoint interval, Q x, will take a basic checkpoint every Q x time units. Hence, the sequence number of the latest checkpoint of the process that has the largest checkpoint interval will catch up with that of the process with the smallest checkpoint interval after every Q x time units. This bounds the rollback distance to at most dQe in the case of a failure. This proves Assertion 2. 2 1

0 P 1

[

[

M

P 2

0

1

2

3

[

[

[

[ basic checkpoint

(a) 0 P 1

[

1

2

3

[ *

[ *

[ *

forced checkpoint

M

P 2

0

1

2

3

[

[

[

[ (b)

Figure 4: Communication-induced checkpointing coordination: (a)asynchronous checkpointing and communication pattern; (b)checkpointing using MQSA An example illustrating the case where each process takes basic checkpoints at its own pace is shown in Figure 4(a). Process P1 takes only one basic checkpoint during which 20

process P2 takes three basic checkpoints. If P1 fails after receiving message M but before taking checkpoint C1;1, it will rollback to the beginning and it will also force P2 to rollback to the beginning under rollback recovery, thus rendering the checkpoints taken by process

P2 useless. For the same computation and basic checkpointing pattern, Figure 4(b) shows the checkpoints taken by the processes using the MQSA. Even though P1 is forced to take more checkpoints than it would have taken otherwise, it helps the recovery line to progress as the processes progress in their computation. This helps in reducing rollback in the event of failure. For example, if P1 fails after receiving message M , it would rollback to checkpoint C1;3 and as a result P2 will be forced to rollback to C2;3.

6 Comparison with Existing Algorithms In this section, we compare the performance of the MQSA with the existing algorithms. The algorithms proposed in [12, 2] have a two-phase structure. This causes processes to suspend the normal computation for making checkpoint decisions which greatly increases the overhead during normal computation. The MQSA does not cause any such overhead and avoids domino eect completely during recovery. In [12], if multiple global checkpoint collections are initiated concurrently, all of them may have to be aborted in some situations. In our global checkpoint collection algorithm, multiple processes can concurrently initiate global checkpoint collection and all of them will succeed in collecting the global checkpoint without suspending the underlying computation. In [19], concurrent initiations of global checkpoint collections are handled by restricting the propagation of snapshot requests in the system. In our algorithm, several processes can simultaneously initiate and collect global snapshot successfully without restricting any other process from collecting global snapshot. The synchronous checkpointing algorithm 21

of Silva and Silva [6] requires a xed process to take a checkpoint and send request messages to all the other processes for taking a checkpoint consistent with its latest checkpoint; it also requires the sequence number of the latest checkpoint taken to be piggybacked with each computation message. This requires the extra overhead of control messages for each checkpoint taken, whereas the MQSA does not have any additional message overhead. In Acharya et al.'s [1] asynchronous checkpointing algorithm for mobile computing systems, a process takes a checkpoint whenever a message reception is preceded by a message transmission. This might force the processes to take as many checkpoints as the number of messages if the message reception and transmission are interleaved, which would result in high checkpointing overhead. The checkpointing overhead of the MQSA however does not depend much on the communication pattern; rather it depends on basic checkpointing pattern of the processes. Wang et al. [22] proposed lazy checkpoint coordination for bounding rollback propagation. Like our approach, their technique requires the checkpoint number being piggybacked on the computation messages so that the receiving processes can take an extra checkpoint when required. However, their approach has higher checkpointing overhead since processes do not skip basic checkpoints if they have taken forced checkpoints. Moreover, if all the processes do not take checkpoints at the same pace, the recovery line will not stay close to the end of the computation if the process that is taking checkpoints at faster pace does not send any messages during the computation. In addition, their algorithm does not ensure the existence of a recovery line consistent with the latest checkpoint of a process; only checkpoints with a particular sequence number form a global checkpoint. In many of the existing checkpointing algorithms in the literature (for example, [4, 5, 22

6, 22]), a set S of local checkpoints where

S = fCk;mk j1 k N and Ck;mk is a checkpoint of process Pk g is considered a global checkpoint if and only if 8 i; j; 1 i; j N : mi = mj . Thus, only the local checkpoints with the same sequence number form a global checkpoint. This requirement results in high message overhead (as in [5, 6]) and/or high checkpointing overhead (as in [4, 22]). Our approach does not require the sequence numbers of all the local checkpoints in a global checkpoint to be equal to each other; any set of local checkpoints satisfying the hypothesis of Corollary 2 forms a global checkpoint. Moreover, in our approach, since processes skip taking a basic checkpoint if they have taken a forced checkpoint with sequence number the sequence number to be assigned to the next basic checkpoint, checkpointing overhead is low.

7 Conclusion We presented a novel quasi-synchronous checkpointing algorithm. The algorithm has the easeness of asynchronous checkpointing and the advantages of synchronous checkpointing. Communication-induced checkpointing coordination guarantees the progression of recovery line. The algorithm has no additional control message overhead and has nominal checkpointing overhead. It also guarantees the existence of a recovery line consistent with the latest checkpoint of any process all the time. As a result, a failed process only need to rollback to its latest checkpoint and request the other processes to rollback to a checkpoint consistent with its latest checkpoint. Under rollback recovery, processes other than the failed process may have to rollback to a distance of at most dQe checkpoints, where Q is the maximum ratio of the basic checkpoint intervals of any two processes. Our 23

overhead analysis showed that the induction ratio, i.e., the total number of checkpoints taken by the processes under the MQSA divided by the total number of checkpoints the processes would have otherwise taken, is dQe. We believe that this checkpointing approach can be very useful in designing ecient recovery algorithms because it has the low overhead of asynchronous checkpointing and the recovery time advantages of synchronous checkpointing.

References [1] A. Acharya, B. R. Badrinath, and T. Imielinski. \Checkpointing Distributed Applications on Mobile Computing Systems". Technical report, Department of Computer Science, Rutgers University, 1994. [2] B. Bhargava and P. Leu. \Concurrent Robust Checkpointing and Recovery in Distributed Systems". In Proc. of 4th IEEE Int. Conf. Data Eng., pages 154{163, February 1988. [3] B. Bhargava and S. R. Lian. \Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems{An Optimistic Approach.". In Proc. 7th IEEE Symp. Reliable Distributed Syst., pages 3{12, 1988. [4] D. Briatico, Ciuoletti, and L. Simoncini. \A Distributed Domino-Eect free Recovery Algorithm". In Proc. of IEEE 4th Symposium on Reliability in Distributed Software and Database Systems, pages 207{215. IEEE, 1984. [5] F. Cristian and F. Jahanian. \A Timestamp-based Protocol for Long-lived Distributed Computations". In Proc. 10th Symp. Reliable Distributed Systems, pages 12{20, September 1991. [6] Luis Moura e Silva and Jouao Gabriel Silva. \Global Checkpointing for Distributed Programs". In Proc. Symp. Reliable Distributed Systems, pages 155{162, 1992. [7] D. B. Johnson and W. Zwaenepoel. \Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing". In Proc. of 17th IEEE Symp. FaultTolerant Comput., pages 14{19, June 1987. [8] T. T-Y. Juang and S. Venkatesan. \Ecient Algorithm for Crash Recovery in Distributed Systems". In 10th Conf. on Foundations on Software Technology and Theoretical Computer Science, pages 349{361, 1990. [9] T. T-Y. Juang and S. Venkatesan. \Crash Recovery with Little Overhead". In Proc. of 11th International Conf. on Distributed Comput. Syst., pages 454{461, 1991. 24

[10] Junguk L. Kim and Taesoon Park. \An Ecient Protocol for Checkpointing recovery in Distributed Systems". IEEE Trans. on Parallel and Distributed Systems, 4(8):955{ 960, August 1993. [11] K. H Kim. \A Scheme for Coordinated Execution of Independently Designed Recoverable Distributed Processes". In Proc. of 16th IEEE Symp. Fault-Tolerant Comput., pages 130{135, June 1986. [12] R. Koo and S. Toueg. \Checkpointing and Roll-back Recovery for Distributed Systems". IEEE Trans. on Software Eng., SE-13(1):23{31, January 1987. [13] K.Tsuruoka, A. Kaneko, and Y. Nishihara. \Dynamic Recovery Schemes for Distributed Process". In Proceedings of IEEE 2nd Symp. on Reliability in Distributed Software and Database Systems, pages 124{130, 1981. [14] P. L'Ecuyer and J. Malenfant. \Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems". IEEE Trans. Comput., 37(4):491{496, April 1988. [15] K. Li, J. F. Naughton, and J. S. Plank. \Checkpointing Multicomputer Applications". In Proc. 10th Symp. on Reliable Distributed Systems, pages 2{11, 1991. [16] S. L. Peterson and Phil Kearns. \Rollback Based on Vector Time". In Proc. 12th Symp. on Reliable Distributed Systems, pages 68{77, 1993. [17] S. Pilarski and T. Kameda. \Checkpointing for Distributed Databases: Starting from the Basics. IEEE Transactions on Parallel and Distributed Systems, 3(5):602{610, September 1992. [18] A. P. Sistla and J. L. Welch. \Ecient Distributed Recovery Using Message Logging". In Proc. of 8th ACM Symp. Principles Distributed Comput., pages 223{238, August 1989. [19] M. Spezialetti and J.P. Kearns. \Ecient Distributed Snapshots". In Proceedings of the 6th International Conference on Distributed Computing Systems, pages 382{388, 1986. [20] R. E. Strom and S. Yemini. \Optimistic Recovery in Distributed Systems". ACM Trans. Comput. Syst., 3(3):204{226, August 1985. [21] K. Venkatesh and T. Radhakrishnan. \Optimal Checkpointing and Local Encoding for Domino-free Rollback Recovery". Information Processing Letters, 25:295{303, July 1987. [22] Y. M. Wang and W. K. Fuchs. \Lazy checkpoint Coordination for Bounding Rollback Propagation". Technical Report CRHC-92-26, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1992. 25

[23] Y. M. Wang and W. K. Fuchs. \Optimistic Message Logging for Independent Checkpointing in Message Passing Systems". In Proc. Symp. on Reliable Distributed Systems, pages 147{154, October 1992. [24] Y. M. Wang and W. K. Fuchs. \Scheduling Message Processing for Reducing Rollback Propagation". In Proc. IEEE Fault-Tolerant Computing Symposium, pages 204{211, July 1992. [25] Y. M. Wang, Y. Huang, and W. K. Fuchs. \Progressive Retry for Software Recovery in Distributed Systems". In Proc. IEEE Fault-Tolerant Computing Symposium, pages 138{144, June 1993.

26

A Quasi-synchronous Algorithm for Checkpointing in ... - CiteSeerX

A Quasi-synchronous Algorithm for Checkpointing in ... - CiteSeerX

Suggest Documents

Algorithm-Based Diskless Checkpointing for Fault ... - CiteSeerX

Novel Checkpointing Algorithm for Fault ... - Stanford University

Stabilizers: A Modular Checkpointing Abstraction for ... - CiteSeerX

Implementing Checkpointing Algorithm in Alchemi ... - Thapar University

A Diskless Checkpointing Algorithm for Super ... - Semantic Scholar

Cooperative Checkpointing Theory - CiteSeerX

Grid Checkpointing Architecture - CiteSeerX

Checkpointing Imprecise Computation - CiteSeerX

Application-Level Checkpointing Techniques for Parallel ... - CiteSeerX

Clustered Checkpointing and Partial Rollbacks for ... - CiteSeerX

On-Line Scheduling for Checkpointing Imprecise ... - CiteSeerX

Adaptive Fault Tolerant Checkpointing Algorithm for ... - Science Direct

Algorithm-Based Diskless Checkpointing for Fault ... - Semantic Scholar

The Performance of Independent Checkpointing in ... - CiteSeerX

Checkpointing vs. Migration for Post-Petascale ... - CiteSeerX

Analysis of Checkpointing Schemes for Multiprocessor ... - CiteSeerX

Resource Management and Checkpointing for PVM - CiteSeerX

Compiler-Enhanced Incremental Checkpointing for ... - CiteSeerX

Checkpointing and Its Applications - CiteSeerX

A Low-Latency Checkpointing Scheme for Mobile ... - CiteSeerX

Adaptive Incremental Checkpointing for Massively Parallel ... - CiteSeerX

Application-Level Checkpointing Techniques for Parallel ... - CiteSeerX

Analysis of Checkpointing Schemes for Multiprocessor ... - CiteSeerX

Lightweight Checkpointing for Faster SOAP Deserialization - CiteSeerX