Checkpointing approach for computer clusters
Emir Imamagic University of Zagreb, University Computing Centre, Zagreb
Damir Danijel Zagar University of Zagreb, University Computing Centre, Zagreb
Branimir Radic University of Zagreb, University Computing Centre, Zagreb
Abstract: Checkpointing is the procedure of storing a process's state in a file, which is later used to reconstruct the process. This feature is important for fault-prone environments such as computer clusters. Besides fault tolerance, checkpointing is appealing for other purposes, such as preemption and job migration. However, few checkpointing implementations are available today. Additionally, cluster environments place special demands on a checkpointing library. In this paper, we elaborate on these demands and present state-of-the-art checkpointing implementations. Furthermore, we introduce an approach that reduces network load and increases checkpointing speed by utilizing the local file systems of cluster nodes.
Keywords: checkpointing, fault tolerance, cluster
1. INTRODUCTION
A computer cluster is a set of interconnected commodity computers, functioning and acting as a single system. In such an environment, failures of hardware and software components are common, so efficient mechanisms for failure recovery are needed. One such mechanism is checkpointing: the procedure of storing the current state of an active process in a file. The stored file is used to restart the process from that point. The file is usually stored on a reliable, shared file system, so that in case of node failure the process can be restarted on one of the active nodes. Checkpointing also enables applications to be suspended and later restarted, which is needed for process migration and preemption.

The rest of the paper is organized as follows. In Section 2, we describe checkpointing in detail and give a short overview of existing solutions. In Section 3, we describe the cluster environment and the use of checkpointing on clusters, together with the main features of the Sun Grid Engine system. Our approach to checkpointing on computer clusters is described in Section 4. In the last two sections, we conclude the paper and point out some future directions.
2. CHECKPOINTING
Checkpointing is the process of storing the current state of an active process in a file. The stored file (also called a checkpoint) contains all the information about the process that is necessary to recreate it from that point. The basic information stored in the file is the registers, stack and heap of the process. In the general case, the file should also contain pending signals, file descriptors, sockets, the status of all threads, etc.
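As a minimal illustration of the store-and-recreate idea, the following Python sketch saves a small state dictionary to a file and resumes a computation from it. It is only an application-level analogue: a real checkpoint also captures registers, stack, heap and the other items listed above, which ordinary application code cannot access. The function names, the state layout and the toy computation are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    # The rename is atomic on POSIX file systems, so a crash during the
    # write leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the stored state, or default when no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, steps, checkpoint_every):
    """Sum the integers 0..steps-1, resuming from the checkpoint if present."""
    state = load_checkpoint(path, {"next_step": 0, "total": 0})
    for step in range(state["next_step"], steps):
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

If the process is killed between two checkpoints, calling run again with the same path repeats only the steps made after the last stored checkpoint.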
Many open source checkpointing implementations exist today, but none is fully capable of checkpointing every application type. Especially problematic cases are parallel applications that consist of several processes and applications with open file handles and sockets.

2.1. Checkpointing implementations
There are three types of checkpointing implementations: kernel-level, user-level and application-level ([1], [2]). These types differ in the level of transparency, efficiency and the mechanism used to initiate checkpoint and restart. Transparency in this context means whether checkpointing can be done without modifying the application's source code.

Kernel-level or system checkpointing is the most transparent approach. In this case, the kernel performs checkpointing, and the user does not have to change the application at all (e.g. re-compile or re-link it). However, the kernel-level approach is the least efficient, because the system has no knowledge about the application. Instead, it simply dumps the whole memory footprint to a file. Although this is the most appealing approach for users, only a few operating systems fully implement checkpointing. An example is SGI's operating system IRIX [8]. There are some implementations in the form of kernel modules, but they have to be compiled for the kernel in use.

User-level checkpointing is achieved by using a checkpointing library. The user has to re-compile or re-link the application with the library in order to make it checkpointable. Checkpointing is initiated by sending a signal to the application. The main drawbacks of this approach are the need to re-compile or re-link the application and the lack of a dedicated signal for initiating checkpoints. Users often use third-party applications and do not have their source code. The lack of a dedicated signal can raise issues in the case of parallel applications. In terms of efficiency, the user-level approach is similar to the kernel-level one.

Application-level checkpointing is implemented by the application developer. As a part of the application, the developer implements a set of procedures that handle checkpointing and restart. This approach is the most efficient, because the developer has detailed knowledge about the application and can therefore store only the relevant data. On the other hand, this approach is the least transparent and demands great effort from the application developer. The main drawbacks are the need for the application's source code in order to implement checkpointing and the lack of a common mechanism for checkpoint and restart.

2.2. Berkeley Lab's Checkpoint/Restart
Berkeley Lab's Checkpoint/Restart (BLCR) [3] is a checkpointing library developed at Berkeley Lab. There are two BLCR implementations: kernel-level and user-level. Users can insert the BLCR kernel module into the kernel and checkpoint/restart their applications from the command line. The user-level implementation enables users to re-link their applications and make them checkpointable without altering the kernel. BLCR can checkpoint the following types of applications: single processes, multithreaded applications and parallel applications developed with the LAM/MPI library ([4], [5]). Furthermore, BLCR is capable of recovering file descriptors, signal handlers and pending signals. The drawbacks of the current version are that 64-bit architectures and checkpointing of process groups are not supported.

2.3. Condor checkpointing library
The Condor checkpointing library [6] is a part of the Condor system. Condor is a CPU harvesting system that enables execution of applications on idle workstations. In such an environment, checkpointing is extremely important, because a workstation can become occupied at any point and the system then needs to migrate jobs. The checkpointing library can be used separately from the Condor system.
Condor checkpointing is implemented at the user level. A process is checkpointed when it receives the SIGTSTP signal. In order to use checkpointing, the user only needs to re-link the program with the Condor library. This library enables checkpointing of single processes with open file handles, signal handlers and pending signals on various flavors of UNIX.
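The signal-initiated pattern used by user-level libraries can be illustrated with a small Python sketch. It is not the Condor API and is far less transparent than a real library: here the application itself decides when it is safe to write, and only an explicit state dictionary is saved. The class name, the method names and the JSON format are hypothetical, and a POSIX system is assumed because SIGTSTP is not available elsewhere.

```python
import json
import os
import signal


class SignalCheckpointer:
    """Records a checkpoint request when a chosen signal arrives."""

    def __init__(self, path, signum=signal.SIGTSTP):
        self.path = path
        self.signum = signum
        self.requested = False

    def install(self):
        # Python allows signal handlers to be set only from the main thread.
        signal.signal(self.signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only set a flag here; the write happens later at a safe point.
        self.requested = True

    def checkpoint_if_requested(self, state):
        """Write state to disk if a signal arrived; return True if written."""
        if not self.requested:
            return False
        with open(self.path, "w") as f:
            json.dump(state, f)
        self.requested = False
        return True
```

A job would call install() once at start-up and checkpoint_if_requested(state) at the end of each iteration of its main loop; sending the signal with kill from another process then triggers the write at the next safe point.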
3. CLUSTER ENVIRONMENT
A computer cluster is a set of commodity computers, interconnected with high-speed networks, functioning and acting as a single computer. This architecture provides computing power for demanding applications by using commodity, off-the-shelf components. Clusters complement expensive, proprietary supercomputers.

One of the most commonly used cluster architectures is the Beowulf cluster (shown in Fig. 1). A Beowulf cluster consists of a frontend computer and a set of nodes. The frontend hosts the central services of various middleware systems and the shared file system. One of the most important middleware systems is the job management system (JMS). The JMS is responsible for executing and managing users' applications (also called jobs) on nodes. Users access only the frontend computer, where they place their applications and data. By using the JMS, they start and control their jobs. Nodes are used solely for executing users' applications. Besides the shared file system placed on the frontend, every node has a local file system called scratch. Scratch is used for storing applications' temporary files.
Figure 1 Beowulf cluster
Checkpointing in a cluster environment is usually performed so that the checkpoint is stored on the shared file system on the frontend. Since the nodes are commodity computers and often fault-prone, the checkpoint can be used to restore the job after a node fails. This and other scenarios are described in Section 3.1.

Storing checkpoints on the shared file system has performance drawbacks. First, checkpoints can be very large (e.g. several gigabytes), which can easily congest the private network. This can cause performance loss for parallel applications that use the private network for communication. Second, the application is frozen while it is being checkpointed, so a long-lasting checkpointing process causes performance loss for the application itself.

3.1. Checkpointing usage
The main benefits of checkpointing in cluster environments are fault tolerance, preemption and job migration.
Fault tolerance is the ability of a system to recover automatically and transparently from failure. Fault tolerance of applications is achieved by periodic checkpointing. This feature is especially important for long-running jobs (e.g. jobs running for a few months), which are common in cluster environments. Over such a long period, there is a high probability of a hardware, network or OS failure, or of a failure caused by another user's program. For fault tolerance purposes, the checkpoint has to be stored on the shared file system. Otherwise, the application cannot be restored until the failed node becomes active again, or cannot be restored at all.

Preemption is the process of temporarily stopping one job in order to allocate resources to another. Preemption is necessary for implementing advanced scheduling features such as priorities, debugging of applications and advance reservations. For example, when developers want to test their application, preemption is necessary in order to avoid waiting for ordinary long-running jobs to finish. Another case is providing resources on demand: if one wants to provide a set of resources for a specific time period, jobs that are using those resources need to be preempted.

Job migration is the procedure of moving a job from one node to another. It is especially useful in case of preemption: preempted jobs can be moved to free resources instead of waiting for the initial ones to become free. Another use is load balancing: when a resource becomes overloaded, the JMS can migrate jobs to free nodes.

3.2. Sun Grid Engine
Sun Grid Engine (SGE) [7] is a product of Sun Microsystems. SGE provides a user-friendly graphical interface for executing jobs. It supports various job types: serial, parallel and interactive jobs, and job arrays. Furthermore, SGE provides support for job migration, load balancing and fault tolerance. An especially important part is its support for checkpointing: SGE enables users to develop their own modules that handle checkpoint, restart and migration of jobs.
4. CHECKPOINTING APPROACH
In this section, we propose our checkpointing approach for computer clusters. We use both the shared file system and the local node file system in order to decrease network load and checkpointing time. The basic idea is to have two checkpointing intervals: the shorter one is used for creating checkpoints on the local disk and the longer one for checkpoints on the shared file system. Special logic is used to handle job suspension initiated by the user and actual node failure.

In the practical implementation, we used the BLCR checkpointing library and the SGE system for managing jobs. We decided to use BLCR because it provides a kernel-level implementation, and in our environment many users use third-party software without source code available. The second reason is its support for parallel applications developed with LAM/MPI. We used SGE because it enables implementation of custom modules for handling checkpointing events.

In our approach, we distinguish three cases of checkpointing: fault-tolerant checkpointing, preemption without job migration and job migration. Each case is described in the following sections and illustrated in a figure. In the figures, solid lines are used for checkpointing and dashed lines for restart.

4.1. Fault-tolerant checkpointing
In order to achieve fault tolerance, the application needs to be checkpointed periodically. The period depends on the application itself. If the application runs for months, a reasonable period would be several hours. On the other hand, a shorter application would demand shorter periods. For a memory-demanding application the period has to be chosen carefully, because it can take a while for the checkpointing process to finish.
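The trade-off between checkpoint size and period can be made concrete with a rough estimate. The sketch below assumes a simple model that is not taken from the measurements in this paper: the application is frozen for the whole write, the write proceeds at a constant bandwidth, and the period is the running time between two checkpoints. The function name and all numbers are hypothetical.

```python
def checkpoint_overhead(size_mb, bandwidth_mb_per_s, period_s):
    """Return (checkpoint duration in seconds, fraction of time spent frozen)."""
    if bandwidth_mb_per_s <= 0 or period_s <= 0:
        raise ValueError("bandwidth and period must be positive")
    duration_s = size_mb / bandwidth_mb_per_s
    # One cycle consists of one period of useful work plus one checkpoint.
    frozen_fraction = duration_s / (period_s + duration_s)
    return duration_s, frozen_fraction


# Assumed figures: a 4000 MB image written at 100 MB/s.
hourly = checkpoint_overhead(4000, 100, 3600)
five_minutes = checkpoint_overhead(4000, 100, 300)
```

Under these assumed figures each checkpoint takes 40 seconds, so an hourly period freezes the job for about 1.1% of the time, while a five-minute period raises that to about 11.8%.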
Our solution for this case is shown in Fig. 2. We use two periods: TS and TL. TS is the shorter period; it defines the interval between two checkpoints stored on the local file system (1 in Fig. 2). TL is the longer period; it defines the interval between checkpoints stored on the shared file system (2 in Fig. 2). TL can also be expressed as a number of shorter periods. By using the local file system, we decrease the load on the network and increase the speed of checkpointing.

If a node failure occurs, we check how old the checkpoint stored on the shared file system is. If it is the most recent version, the job is restored from it on some available node (3 in Fig. 2). Otherwise, we wait for a certain period to see whether the failed node will recover. If the node recovers, the job is restarted on the initial node from the local checkpoint (4 in Fig. 2). If the node does not recover, the job is restored on an available node from the last checkpoint stored on the shared file system (3 in Fig. 2).
Figure 2 Checkpoint and restart for fault-tolerant case
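The two-period schedule and the recovery rules of Section 4.1 can be sketched as follows. This is an illustration rather than the implemented SGE module: TL is expressed as a whole number of TS periods (shared_every), checkpoints are identified by the index of the TS period in which they were taken, and the sketch assumes that a checkpoint falling on a TL boundary is written to the shared file system instead of the local one. All names are hypothetical.

```python
from enum import Enum


class Restart(Enum):
    SHARED_ON_OTHER_NODE = "restore from shared checkpoint on an available node"
    LOCAL_ON_SAME_NODE = "restore from local checkpoint on the initial node"


def checkpoint_target(tick, shared_every):
    """Return where the checkpoint taken at the end of TS period `tick` goes."""
    if tick < 1 or shared_every < 1:
        raise ValueError("tick and shared_every must be >= 1")
    return "shared" if tick % shared_every == 0 else "local"


def choose_restart(last_local_tick, last_shared_tick, node_recovered_in_time):
    """Pick the restart path after a node failure.

    node_recovered_in_time is consulted only when the local checkpoint is
    newer than the shared one, i.e. after the waiting period has elapsed.
    """
    if last_shared_tick >= last_local_tick:
        # The shared copy is the most recent checkpoint: no reason to wait.
        return Restart.SHARED_ON_OTHER_NODE
    if node_recovered_in_time:
        return Restart.LOCAL_ON_SAME_NODE
    return Restart.SHARED_ON_OTHER_NODE
```

With shared_every set to 4, three consecutive checkpoints stay on the node and every fourth one crosses the network, so at most three TS periods of work are lost when a failed node does not come back.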
4.2. Preemption without job migration
In case of preemption without job migration, the job is first checkpointed and then stopped. Afterwards, a new job is executed on that node. Once the new job finishes, the initial one is restored on the same node. An example of preemption where the job cannot be migrated to another node is when the user specifically requests that node.

In this case, we store the checkpoint only on the local file system (1 in Fig. 3). Once the new job finishes, the initial job is restored from the local file system (2 in Fig. 3). Since the job cannot be migrated to another node, there is no need to store the checkpoint on the shared file system. This way we significantly decrease network load. In addition, by increasing checkpointing speed we speed up the preemption process.
Figure 3 Checkpoint and restart for preemption without migration
4.3. Job migration
In case of job migration, the job is checkpointed, stopped and restored on another node (as shown in Fig. 4). As mentioned before, job migration can occur in case of preemption and load balancing. When migration occurs, we store the checkpoint only on the shared file system (1 in Fig. 4). Since the job is being restored on another node (2 in Fig. 4), the checkpoint only needs to be on the shared file system.
Figure 4 Checkpointing and restart for job migration
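The storage decisions for the three cases of Section 4 can be summarized in one small function. The sketch below only restates the rules from the text; the enum, the function name and the string results are hypothetical, and the fault-tolerant branch assumes, as a simplification, that a checkpoint on a TL boundary goes to the shared file system instead of the local one.

```python
from enum import Enum


class Case(Enum):
    FAULT_TOLERANT = "fault-tolerant checkpointing"
    PREEMPTION_NO_MIGRATION = "preemption without job migration"
    MIGRATION = "job migration"


def storage_for(case, tick=None, shared_every=None):
    """Return 'local' or 'shared' for a checkpoint taken in the given case.

    tick and shared_every (TL expressed as a number of TS periods) are
    needed only for the periodic, fault-tolerant case.
    """
    if case is Case.PREEMPTION_NO_MIGRATION:
        return "local"   # the job resumes on the same node
    if case is Case.MIGRATION:
        return "shared"  # the job resumes on a different node
    if tick is None or shared_every is None:
        raise ValueError("fault-tolerant case needs tick and shared_every")
    return "shared" if tick % shared_every == 0 else "local"
```

Only the migration case and every shared_every-th periodic checkpoint generate traffic on the private network; all other checkpoints stay on the node's scratch file system.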
5. CONCLUSION
Checkpointing is one of the key functionalities needed for implementing fault-tolerant systems. In cluster environments, checkpointing is also necessary for enabling other features such as preemption and job migration. In this paper, we describe the main issues with checkpointing on computer clusters: network load and the performance cost due to the time needed for checkpointing. Furthermore, we propose an approach for mitigating these issues by using the nodes' local file systems. Our solution is implemented using the BLCR checkpointing library and the SGE job management system, but it can also be used in any other environment.
6. FUTURE WORK
In the next stage of our work, we plan to investigate incremental checkpointing implementations. At this point, we reduce network load and increase checkpointing speed by storing checkpoints on the local instead of the shared file system. However, we still store the complete application image. With incremental checkpointing, we would store only the differences between two consecutive checkpoints. This way we could achieve significant further savings.
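One possible way to store only the differences between two checkpoints is to compare fixed-size blocks by hash. The sketch below is an illustration of that general idea, not a design we have evaluated: it works on in-memory byte strings that stand in for process images, and the block size, the delta layout and the function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # bytes; an arbitrary choice for this sketch


def block_hashes(image, block_size=BLOCK_SIZE):
    """Return the SHA-256 digest of every block of the image."""
    return [hashlib.sha256(image[i:i + block_size]).digest()
            for i in range(0, len(image), block_size)]


def make_delta(previous, current, block_size=BLOCK_SIZE):
    """Return the blocks of current that differ from previous."""
    old = block_hashes(previous, block_size)
    changed = {}
    for index, offset in enumerate(range(0, len(current), block_size)):
        block = current[offset:offset + block_size]
        if index >= len(old) or hashlib.sha256(block).digest() != old[index]:
            changed[index] = block
    return {"length": len(current), "blocks": changed}


def apply_delta(previous, delta, block_size=BLOCK_SIZE):
    """Rebuild the current image from the previous image and a delta."""
    length = delta["length"]
    # Truncate or zero-pad the old image to the new length, then patch it.
    image = bytearray(previous[:length].ljust(length, b"\0"))
    for index, block in delta["blocks"].items():
        start = index * block_size
        image[start:start + len(block)] = block
    return bytes(image)
```

When a job modifies only a small part of its memory between two checkpoints, the delta contains a few blocks instead of the complete image, at the cost of hashing the whole image on every checkpoint.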
7. REFERENCES
[1] Duell, J.; Hargrove, P.; Roman, E. (2002): Requirements for Linux Checkpoint/Restart, Berkeley Lab Technical Report (publication LBNL-49659)
[2] Roman, E. (2002): A Survey of Checkpointing/Restart Implementations, Berkeley Lab Technical Report (publication LBNL-54942)
[3] Duell, J.; Hargrove, P.; Roman, E. (2002): The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart, Berkeley Lab Technical Report (publication LBNL-54941)
[4] Sankaran, S.; Squyres, J. M.; Barrett, B.; Lumsdaine, A.; Duell, J.; Hargrove, P.; Roman, E. (2003): The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, LACSI Symposium, Santa Fe, New Mexico, USA
[5] LAM/MPI, www.lam-mpi.org
[6] Condor Checkpointing, http://www.cs.wisc.edu/condor/checkpointing.html
[7] Sun Grid Engine, http://gridengine.sunsource.net/
[8] SGI IRIX, http://www.sgi.com/developers/technology/irix/