An Efficient and Scalable Checkpointing and Recovery Algorithm for Distributed Systems
K.P. Krishna Kumar and R.C. Hansdah
Dept. of Computer Science & Automation, Indian Institute of Science, Bangalore 560012, India
{krishna, hansdah}@csa.iisc.ernet.in
Abstract. In this paper, we describe an efficient coordinated checkpointing and recovery algorithm which works even when the channels are assumed to be non-FIFO and messages may be lost. Nodes are assumed to be autonomous, and they do not block while taking checkpoints. Based on local conditions, any process can request permission from the previous coordinator to initiate a new checkpoint. Allowing multiple initiators of checkpoints avoids the bottleneck associated with a single initiator, but the algorithm permits only a single instance of the checkpointing process at any given time, thus avoiding much of the overhead associated with multiple initiators in distributed algorithms.
1 Introduction
Eighty percent of failures in computer systems are of a non-catastrophic nature and are termed temporary faults. The checkpointing protocols described in the literature are mainly used to overcome these temporary faults. It is desirable that checkpointing and recovery protocols (i) do not depend on a single central coordinator, to avoid the bottleneck problem, (ii) do not assume that the communication channels are reliable and FIFO, (iii) are non-blocking at the time of taking checkpoints, (iv) permit nodes to take a checkpoint at any point in time, i.e., nodes are autonomous, (v) have low space and message overhead, (vi) have a low recovery cost, and (vii) are adaptive in the sense that the fewer the messages lost and the smaller the amount of interaction between the processes, the lower the space and message overhead.

Essentially, there are three categories of checkpointing and recovery protocols, viz., coordinated, communication induced, and uncoordinated. Communication induced and uncoordinated checkpointing protocols generally have a high recovery cost, and the stable storage space they require is also generally high. On the other hand, centralized checkpoint initiator schemes like the algorithms in [1, 2] suffer from the bottleneck associated with a centralized initiator. For other distributed initiator schemes like Prakash and Singhal’s algorithm [3], there may be multiple instances of checkpointing in the system at any given time. A detailed description of various checkpoint and rollback-recovery protocols can be found in [4, 5]. Our aim is to design a checkpointing algorithm with multiple initiators which has only one instance of checkpointing going on in the system at any point in time and has all of the above features.

The rest of the paper is organized as follows. Section 2 describes our checkpointing algorithm, and in Section 3 we describe the rollback and recovery algorithm. In Section 4, the details of the performance evaluation are given, and finally, we conclude the paper in Section 5.

S. Chaudhuri et al. (Eds.): ICDCN 2006, LNCS 4308, pp. 94–99, 2006. © Springer-Verlag Berlin Heidelberg 2006
2 The Checkpointing Algorithm
2.1 Informal Description
The system consists of a set of n processes P1, P2, ..., Pn which communicate with each other using messages only. The channels are assumed to be non-FIFO, and messages may be lost. The faults are assumed to be transient. The transmission delays are unpredictable but finite. The messages, if they are delivered at the receiver’s end, are delivered correctly. The processes are assumed to be asynchronous.

All the processes take a checkpoint before they start their computation. A process is initially nominated to be the coordinator. The coordinator for a checkpoint interval is responsible for coordinating the checkpointing activities for that checkpoint interval. Each checkpoint interval is given a unique checkpoint ID (CPID), and CPIDs are incremented sequentially. Application messages from the same node are also numbered sequentially. Both the sequence number of a message and the current CPID are piggybacked on the message. If a process receives a message with a higher CPID than its current CPID, it takes a checkpoint before receiving the message. The concepts of CPID and sequence numbers are the same as those used in [2].

After a checkpointing process is initiated, the coordinator receives from every process a report confirming that it has taken a checkpoint, together with a count of all the messages it has sent and received. The coordinator uses these reports to find out whether all the processes have successfully taken a checkpoint and whether any message sent in a particular checkpoint interval has been lost. For the next interval, some new processes may put in their claim to be the coordinator, and the present coordinator grants permission to one of them. It is assumed that the loss of control messages is handled using timeouts, and that the failure of the coordinator is handled using an election protocol.
2.2 Messages
There are two kinds of messages, viz., application messages and control messages. In this paper, the term message refers to an application message only, unless stated otherwise. The various types of messages are as follows.

application message: Sent by the user application.
control message: Sent either by the processes or by the coordinator as part of the checkpointing process. The control messages are the following:
    send balance report: Sent by the coordinator to a process if the balance report from that process has not reached it after a certain time interval.
    balance report: Forwarded by a process to the coordinator when it takes a checkpoint. It is the difference between the number of messages sent and the number of messages received by the process during a checkpoint interval. Each process also sends a dependency list which gives the individual balance count with each of the other processes.
    update report: Sent by a process to the coordinator when it receives a message pertaining to a previous checkpoint interval.
    reconcile message: Sent by the coordinator to a process which has received fewer messages than were sent to it by a certain sender.
    list of messages: On getting a reconcile message from the coordinator, the receiver process sends a list of the message sequence numbers it has received from the given sender.
    send request coord: Sent by a process requesting permission to initiate a new checkpointing process.
    grant permission coord: Sent by the coordinator while granting permission to a process to be the coordinator for the next checkpoint interval.
2.3 The Algorithm
Algorithm at each process

Initially take a checkpoint before computation;
On receipt of an application message:
    if (message.CPID > own CPID)
        take a checkpoint with new CPID;
        send balance report to the coordinator;
    if (message.CPID == own CPID)
        update the balance count;
    if (message.CPID < own CPID)
        update balance count;
        send update report to coordinator;
        log the message in stable storage;
On receipt of message send balance report:
    send the latest balance report to coordinator;
On receipt of message reconcile message:
    send the list of messages from the given process to the sender;
On receipt of message list of messages:
    send the missing messages to the receiver;
On requirement to initiate a checkpoint:
    send message send request coord to the present coordinator;
On receipt of message grant permission coord:
    declare self as coordinator if indicated in the message;
    initiate checkpointing process;
To send an application message:
    add the present CPID to the message header;
    log the message in volatile memory;
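The application-message rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `AppMessage`, `Process`, `outbox` and the tuple layout of the control messages are hypothetical, checkpointing is reduced to a change of CPID, and two details that the pseudocode leaves open are assumed here, namely that a message which forces a checkpoint is counted in the new interval, and that only late messages are logged in stable storage.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AppMessage:
    sender: int
    seq: int          # per-sender sequence number
    cpid: int         # CPID piggybacked by the sender
    payload: str = ""


@dataclass
class Process:
    pid: int
    cpid: int = 0
    next_seq: int = 0
    # balance[q] = messages sent to q minus messages received from q
    balance: dict = field(default_factory=lambda: defaultdict(int))
    volatile_log: list = field(default_factory=list)  # copies of sent messages
    stable_log: list = field(default_factory=list)    # late messages
    outbox: list = field(default_factory=list)        # control messages to coordinator

    def send(self, dest: int, payload: str = "") -> AppMessage:
        # Piggyback the current CPID and the sequence number on the message.
        msg = AppMessage(self.pid, self.next_seq, self.cpid, payload)
        self.next_seq += 1
        self.balance[dest] += 1
        self.volatile_log.append((dest, msg))
        return msg

    def take_checkpoint(self, new_cpid: int) -> None:
        # Saving state is omitted; report the closing interval's balance.
        self.outbox.append(("balance_report", self.cpid, dict(self.balance)))
        self.balance = defaultdict(int)
        self.cpid = new_cpid

    def receive(self, msg: AppMessage) -> None:
        if msg.cpid > self.cpid:
            # Checkpoint first, then count the message in the new interval
            # (assumption: the paper does not say where it is counted).
            self.take_checkpoint(msg.cpid)
            self.balance[msg.sender] -= 1
        elif msg.cpid == self.cpid:
            self.balance[msg.sender] -= 1
        else:
            # Late message from an earlier interval: tell the coordinator
            # and log the message.
            self.outbox.append(("update_report", msg.cpid, msg.sender))
            self.stable_log.append(msg)
```

In this sketch a process never blocks: a checkpoint is triggered as a side effect of receiving a message with a higher CPID, and the report for the closed interval is queued for the coordinator.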
Algorithm at the coordinator process

On receipt of message send request coord:
    if the present checkpointing process is completed
        send message grant permission coord to the process;
On receipt of message balance report from processes:
    update the overall balance count;
    update the overall process count;
    if (process count != 0 after scheduled timeout)
        send message send balance report to the concerned processes;
    if ((process count == 0) and (balance count != 0)) // all processes have taken fresh checkpoint.
        wait till time-out;
        send message request balance report to receiver processes with positive balance count;
    // completion of checkpointing process
    if ((process count == 0) and (balance count == 0))
        the checkpointing process is completed;
    // to cater for any lost update messages
    if balance count != 0 after receipt of fresh balance report
        send message reconcile message to the concerned receiver processes;
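The coordinator's bookkeeping can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: timeouts and the actual sending of control messages are left out, the names `BalanceReport`, `Coordinator` and `outstanding` are hypothetical, and each report is assumed to carry separate per-peer sent and received counts so that the direction of a missing message can be identified (the paper only speaks of a per-peer balance count).

```python
from dataclasses import dataclass


@dataclass
class BalanceReport:
    sent: dict      # sent[q] = messages sent to q in the interval
    received: dict  # received[q] = messages received from q in the interval


class Coordinator:
    def __init__(self, pids):
        self.pids = set(pids)
        self.reports = {}  # pid -> latest BalanceReport

    def on_balance_report(self, pid, report):
        # A fresh report from the same process replaces the old one.
        self.reports[pid] = report

    def process_count(self):
        # Number of processes that have not yet reported a checkpoint.
        return len(self.pids - set(self.reports))

    def balance_count(self):
        # Messages sent minus messages received, over all reports.
        return sum(
            sum(r.sent.values()) - sum(r.received.values())
            for r in self.reports.values()
        )

    def is_complete(self):
        return self.process_count() == 0 and self.balance_count() == 0

    def outstanding(self):
        # (sender, receiver, missing) for every pair where the receiver has
        # reported fewer messages than the sender sent; these receivers are
        # the candidates for a reconcile message.
        pairs = []
        for s in sorted(self.reports):
            for r, n_sent in sorted(self.reports[s].sent.items()):
                if r in self.reports:
                    missing = n_sent - self.reports[r].received.get(s, 0)
                    if missing > 0:
                        pairs.append((s, r, missing))
        return pairs
```

Keeping only the latest report per process means that a fresh balance report, sent after a late message arrives, is enough to bring the balance count back to zero without a separate update path.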
3 Rollback Recovery
The processes maintain at most two checkpoints in stable storage. One is the fully completed checkpoint, and the other is the current checkpoint, which is being coordinated. The checkpoints with the same CPID form a recovery line. In a modification of Silva and Silva’s algorithm [2], the rollback algorithm tries to minimize the number of processes which are required to roll back. On being informed of a process’s failure, the coordinator asks the processes to take a virtual checkpoint and to forward the list of processes to which they have sent messages since the checkpoint indicated in the coordinator’s message. This checkpoint is the one that the coordinator knows has been completed. The processes forward the list as asked and continue their computation. The virtual checkpoint does not involve saving the state of the process. The processes are allowed to receive messages, but they are not allowed to send messages until further confirmation is received from the coordinator. Using the lists provided by the processes, the coordinator calculates the dependencies between processes. It finds out which processes have communicated with the failed process(es), either directly or indirectly.

The rollback control messages are:
    request virtual cp: Sent by the coordinator to all the processes asking them to forward their dependency lists. After receiving it, the processes do not send any more application messages, but continue to receive messages.
    rollback request: Sent by the coordinator to the processes asking them to roll back to a given CPID and resume computation.
    resume request: Sent by the coordinator to the processes asking them to resume computation from the present state.

Algorithm at the coordinator process

On receipt of failure report of a process:
    take a virtual checkpoint;
    send request virtual cp message to all processes;
On receipt of message balance report from the processes:
    calculate dependencies;
    send rollback request message to dependent processes;
    send resume request message to non-dependent processes;
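The "calculate dependencies" step can be read as a reachability computation over the lists returned by the processes. The sketch below is one possible interpretation, not the authors' code: the function name `processes_to_roll_back` and the input layout are hypothetical, and communication is treated as symmetric, so that a process is dependent if it is connected to a failed process through any chain of message exchanges.

```python
from collections import deque


def processes_to_roll_back(sent_to, failed):
    """Return the set of processes that communicated with a failed process,
    directly or indirectly.

    sent_to[p] lists the processes to which p has sent messages since the
    last completed checkpoint; failed is an iterable of failed process ids.
    """
    # Build an undirected communication graph from the reported lists.
    neighbours = {}
    for p, dests in sent_to.items():
        neighbours.setdefault(p, set())
        for q in dests:
            neighbours[p].add(q)
            neighbours.setdefault(q, set()).add(p)

    # Breadth-first search starting from the failed processes.
    rollback = set(failed)
    queue = deque(rollback)
    while queue:
        p = queue.popleft()
        for q in neighbours.get(p, ()):
            if q not in rollback:
                rollback.add(q)
                queue.append(q)
    return rollback
```

For example, with `sent_to = {1: [2], 2: [3], 3: [], 4: [5], 5: []}` and process 3 failed, processes 1, 2 and 3 would receive a rollback request, while 4 and 5 would receive a resume request.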
Algorithm at each process

On receipt of message request virtual cp:
    take a virtual checkpoint;
    send balance report to the coordinator;
    accept any incoming application message;
    do not send any application message;
On receipt of message rollback request:
    rollback to the indicated CPID;
    resume computation;
On receipt of message resume request:
    resume computation from the present state;
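The process-side handlers above amount to a small state machine in which sending is suspended between the virtual checkpoint and the coordinator's decision. The following sketch shows one way this could look; the class `RecoveringProcess` and its fields are hypothetical, restoring the saved state is omitted, and holding outgoing messages in a queue until a resume request arrives is an assumption (the paper only states that messages are not sent).

```python
class RecoveringProcess:
    def __init__(self, pid, cpid=0):
        self.pid = pid
        self.cpid = cpid
        self.sent_to = set()      # destinations since the last completed checkpoint
        self.send_blocked = False
        self.held = []            # application messages deferred while blocked
        self.delivered = []       # (dest, payload) handed to the network

    def send(self, dest, payload):
        if self.send_blocked:
            self.held.append((dest, payload))
            return False
        self.sent_to.add(dest)
        self.delivered.append((dest, payload))
        return True

    def on_request_virtual_cp(self):
        # Virtual checkpoint: no state is saved, sending is suspended, and
        # the dependency list is returned for the coordinator.
        self.send_blocked = True
        return sorted(self.sent_to)

    def on_rollback_request(self, cpid):
        # Restoring the checkpointed state is omitted. Held messages belong
        # to the discarded computation, so they are dropped.
        self.cpid = cpid
        self.sent_to.clear()
        self.held.clear()
        self.send_blocked = False

    def on_resume_request(self):
        # Continue from the present state and flush the deferred messages.
        self.send_blocked = False
        pending, self.held = self.held, []
        for dest, payload in pending:
            self.send(dest, payload)
```

Dropping the held messages on rollback is a design choice of this sketch: the restored computation will regenerate any message it still needs to send.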
4 Performance Evaluation
Extensive simulation was carried out to test the algorithms and measure various parameters. The simulations were done using discrete event simulation with the number of processes varying from 10 to 50. The error percentage of the channel was also varied from 0 to 10 percent. An exponential distribution was used for the generation of messages. Each experiment was performed at least ten times, and the average values of all the measured parameters were taken. In terms of checkpointing overhead, we have compared our checkpointing algorithm with an uncoordinated algorithm that takes periodic checkpoints and with an uncoordinated algorithm that takes a checkpoint prior to a message receipt event provided a message send event has occurred in the same checkpoint interval. We have also compared our algorithm with communication induced checkpointing algorithms, viz., Briatico, Ciuffoletti and Simoncini’s (BCS) algorithm [6], Manivannan and Singhal’s (MS) algorithm [7] and Baldoni, Quaglia and Fornara’s (BQF) algorithm [8]. The comparison of the percentage overhead of the various classes of algorithm is shown in Fig. 1.

[Plot: Overhead (%), from 0 to 2, versus No of Processes, from 0 to 50, for BCS, MS, BQF, Uncoord (periodic), Uncoord (receive based) and Coord]
Fig. 1. Comparison of Checkpointing Overhead between different Algorithms

The saving in computation time, compared with the situation in which all processes have to roll back in the event of a failure as in [2], is from 3 to 6% when the number of processes is up to 30. For a further increase in the number of processes, the saving is just over 1%. The saving is due to the fact that only those processes which have communicated with the failed process(es) are required to roll back.
5 Conclusion
In this paper, we have described an efficient checkpointing and recovery algorithm which tries to combine the advantages of coordinated, uncoordinated and communication-induced protocols and which has all of the desirable features listed in Section 1. Simulation results indicate that it performs well in terms of checkpointing overhead compared to other algorithms.
References
1. Koo, R., Toueg, S.: Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering 13(1) (1987) 23–31
2. Silva, L.M., Silva, J.G.: Global checkpointing for distributed programs. In: Proceedings of the 10th Symposium on Reliable Distributed Systems. (1992) 155–162
3. Prakash, R., Singhal, M.: Low cost checkpointing and failure recovery in mobile computing systems. IEEE Transactions on Parallel and Distributed Systems 7(10) (1996) 1035–1048
4. Elnozahy, E., Johnson, D., Yang, Y.: A survey of rollback-recovery protocols in message passing systems. ACM Computing Surveys 34(3) (2002) 375–408
5. Manivannan, D., Singhal, M.: Quasi synchronous checkpointing: Models, characterization and classification. IEEE Transactions on Parallel and Distributed Systems 10(7) (1999) 206–216
6. Briatico, D., Ciuffoletti, A., Simoncini, L.: A distributed domino-effect free recovery algorithm. In: Proceedings of the IEEE International Symposium on Reliability, Distributed Software and Databases. (1984) 207–215
7. Manivannan, D., Singhal, M.: A low overhead recovery technique using quasi synchronous checkpointing. In: Proceedings of the IEEE International Conference on Distributed Computing Systems. (1996) 100–107
8. Baldoni, R., Quaglia, F., Fornara, P.: An index-based checkpointing algorithm for autonomous distributed systems. In: Proceedings of the IEEE International Conference on Distributed Computing Systems. (1999) 181–188