Batch Queuing and Resource Management for PVM Applications in a Network of Workstations

Ursula Maier, Georg Stellner, Ivan Zoraja
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR-TUM)
Institut für Informatik
Technische Universität München
[email protected]
http://wwwbode.informatik.tu-muenchen.de
Abstract

A resource management system can effectively shorten the runtime of batch jobs in a network of workstations (NOW). This is achieved with load balancing mechanisms that distribute the load equally among the hosts. To avoid conflicts between interactive users and batch jobs, a resource management system must be able to migrate batch jobs from an interactively used host to an idle host. Common resource management systems offer process migration only for sequential jobs, not for parallel jobs. Within the SEMPA project, a resource management system with batch queuing functionalities including checkpointing and migration is designed and implemented. We focus on PVM applications because PVM offers dynamic task management and an interface to resource management systems.¹
1 Introduction

Parallel scientific computing applications, e.g. in computational fluid dynamics, require a large amount of CPU time and memory. Therefore, they are often run on massively parallel systems. However, networks of workstations (NOWs) often have computing capacities available that are sufficient for the computation of resource intensive applications. Especially smaller companies and research institutes use their NOWs for parallel applications as a low-cost alternative to massively parallel systems. A resource management system makes the use of a NOW transparent to the user and guarantees that the computational power of a NOW is utilized in the best possible way. To take advantage of a resource management system, resource intensive applications are executed as batch jobs. In the remainder of the paper, a parallel application is a PVM application submitted as a batch job to a resource management system. Checkpointing and migration of applications are important functionalities of a resource management system for reasons of fault tolerance and dynamic load balancing. Periodic checkpoints of long running applications are written to avoid losing the results computed so far if the application aborts unexpectedly, e.g. because of a hardware error. Process migration is a way to equalize the load in a NOW if the load situation is unbalanced, or to relocate processes during runtime.
¹ This work has been funded by the German Federal Department of Education, Science, Research and Technology, BMBF (Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie) within the research project SEMPA (Software Engineering Methods for Parallel Applications in Scientific Computing).
A NOW is primarily used for interactive work; batch jobs only utilize idle resources, and hence the interactive users have precedence over batch jobs. If an interactive user wants to work on a host running a process of a parallel application, the process must be migrated because it typically claims so many resources that the interactive user would experience unacceptable response times on the host. Existing resource management systems, e.g. Condor [LTBL97] and LSF [Pla96], offer checkpointing and migration only for sequential applications. For parallel applications, merely initial process placement is supported, i.e. the processes of a parallel application are mapped to appropriate hosts. The processes are bound to their hosts and cannot be migrated to other hosts at runtime, because checkpointing mechanisms for parallel applications with communicating processes are rarely available [ZB96]. The reason why existing resource management systems hardly support parallel applications is the lack of control over the processes of a parallel application. Without control over the processes, a resource management system is unable to kill, checkpoint or migrate a running process of a parallel application, or to observe resource limitations.

A major goal of the SEMPA project [LMRW96] is to design and implement a batch queuing and resource management system for sequential and parallel applications in a NOW. Available resources should always be utilized for the execution of batch jobs. A mechanism for checkpointing and migration of parallel applications must be provided to equalize the load in the NOW and to release hosts running processes of a parallel application if the hosts are needed by an interactive user. Our basic idea was to use existing batch queuing and resource management facilities and add new features supporting the efficient computation of parallel applications in a NOW. The SEMPA Resource Manager is based on the batch queuing and resource management system CODINE [GEN96] and on the checkpointing and migration capability for parallel applications of CoCheck [Ste95]. A PVM resource manager is implemented to control the parallel applications and to join the components and functions of CODINE and CoCheck.

The remainder of the paper is organized as follows. Section 2 describes the design concept of the SEMPA Resource Manager. The structure and functionalities of the basic components are explained in section 3. Section 4 shows some implementation details of the SEMPA Resource Manager. First performance measurements are presented in section 5. The paper closes with a brief summary and an outlook on further research.
2 The Design Concept of the SEMPA Resource Manager

An architectural design of a distributed resource management system for parallel applications in a NOW is introduced in [MS97]. The concept of the distributed resource management system comprises modular components for the main functionalities batch queuing, scheduling and load management, and includes defined interfaces between these components. The scheduling component is organized hierarchically, i.e. a global resource manager places a parallel application initially and then passes it to a local resource manager that is responsible for the parallel application until it has finished. The functions of the local resource manager are the management of hosts and processes and the remapping of the parallel application. The SEMPA Resource Manager is an implementation of the design concept presented in [MS97], based on CODINE, CoCheck and the PVM resource manager interface.
The architectural design of the SEMPA Resource Manager strongly depends on the structure and the components of CODINE and CoCheck, which should be retained as far as possible. An important issue in the design of the SEMPA Resource Manager is to define a communication model for the information exchange between the different components. One of the major functions of the SEMPA Resource Manager is to control the parallel applications, which means controlling each of their processes. This is the basis for further functions of the SEMPA Resource Manager that operate on single processes of a parallel application. Control over a parallel application is required to:
- suspend a running parallel application
- stop a running parallel application, e.g. on request of the job owner
- write periodic checkpoints of a parallel application
- migrate one or more processes of a parallel application
- observe resource limitations of a parallel application
- collect accounting information about a parallel application
3 Components of the SEMPA Resource Manager

The structure and functionalities of the main components of the SEMPA Resource Manager (CODINE, CoCheck and the PVM resource manager interface) are explained in the following sections.
3.1 CODINE

CODINE is a batch queuing and resource management system for NOWs [GEN96]. Users submit their jobs to CODINE, which queues the jobs until the required resources are available. A batch job is composed of an application and resource requirements specified by the user, e.g. machine architecture or size of memory. CODINE maps sequential and parallel applications to idle or lightly loaded hosts. CODINE is built up of several components that queue and schedule jobs and measure the load on the hosts in the NOW:

qmaster: The qmaster is the central component of CODINE and has control over all other components. It acts as a database server containing the information about hosts and jobs.

schedd: The schedd is the component that performs the scheduling algorithm. It gets information about hosts and jobs from the qmaster and computes the job order list.

commd: A communication daemon runs on every host that is controlled by CODINE. The commd implements the communication between the CODINE components over TCP sockets. Some connections are permanent, e.g. between qmaster and schedd; other connections are set up on demand and closed when the transmission is over.

execd: An execution daemon runs on every host that executes batch jobs. The execd starts and controls jobs and measures the load on its host. When a job has finished, the execd returns the accounting information about the job to the qmaster.

shepherd: The shepherd process is started by the execd and sets up the execution environment for a job. The execd does not start a job directly; it starts a shepherd, and the shepherd starts the job by forking a process. When the job has finished, the shepherd collects the accounting information about the job.
Figure 1 shows the components of CODINE and their relationships. The qmaster and the schedd usually run on the same host to minimize the communication overhead. Jobs run on execution hosts, and for every job there is a shepherd that controls the job.

[Figure 1: The structure of CODINE (qmaster and schedd on the master host; execd, commd, and one shepherd per job on each execution host)]

When a parallel job is submitted, additional resource requirements must be specified compared to a sequential batch job, e.g. the parallel programming environment or the minimum and maximum number of hosts. Parallel CODINE jobs can use PVM, MPI or EXPRESS as parallel programming environment. A job in CODINE is not started directly by an execd but by a shepherd process that is started by the execd. The shepherd is the parent of the started job and has control over the job, e.g. to suspend or kill the job during runtime or to collect accounting information about it. A shepherd can start only one job, whereas an execd can start several shepherd processes. In the current version of CODINE there is only a single shepherd for each parallel job, i.e. CODINE only has control over the process forked by the shepherd, but not over processes that are created by the parallel programming environment, e.g. spawned by PVM. Thus, operations such as resource limitation and the collection of accounting information can only be performed for the master process forked by the shepherd, but not for the spawned processes. One of the aims of the SEMPA Resource Manager is to overcome this deficiency.
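To make the shepherd mechanism more concrete, the following minimal C sketch (our own illustration, not CODINE source code) shows how a shepherd-like process might fork a job into its own process group, so that it can later be suspended or killed as a whole, and collect resource usage when the job exits. All names are hypothetical.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/resource.h>

    /* Fork the job, keep it in its own process group so the whole job can
       be suspended or killed via kill(-pid, ...), then wait for it and
       report its resource usage. */
    static int run_job(char *const argv[])
    {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {                 /* child: becomes the job         */
            setpgid(0, 0);              /* own process group              */
            execvp(argv[0], argv);
            _exit(127);                 /* exec failed                    */
        }

        int status;
        struct rusage usage;
        if (wait4(pid, &status, 0, &usage) < 0)  /* reap job, account it  */
            return -1;

        fprintf(stderr, "job used %ld.%06ld s user CPU time\n",
                (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec);
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }
        return run_job(&argv[1]);
    }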
3.2 CoCheck

CoCheck (Consistent Checkpoints) is an extension to message-passing libraries that allows the creation of checkpoints of parallel applications and the migration of processes. Implementations of CoCheck for PVM [Ste95] and MPI [Ste96] exist. For the remainder of the paper we will refer to the PVM version of CoCheck. Before the application can actually be started, the user must relink the PVM application with the CoCheck libraries to incorporate the code which implements checkpointing and migration. A PVM resource manager is provided [GBD+94] that receives and handles requests to checkpoint or restart an application or to migrate processes. An API has been defined to send such requests to the resource manager. After the PVM resource manager of CoCheck has received a request to checkpoint or migrate, it initiates the CoCheck checkpointing protocol. All processes of the currently executing application are informed about a pending checkpoint. In turn, all the processes start to exchange so-called "ready messages". These ready messages flush all communication channels between all the processes. Messages that were in transit at checkpoint time are thus forwarded to their destination and stored there. After restart, these messages are automatically retrieved from the buffers. When the processes are restarted, they get a new identifier. These identifiers are then sent to the CoCheck resource manager, which in turn sets up a mapping table from old to current identifiers. Within the wrappers for the communication calls, these current values are used to send and receive messages instead of the values that the application actually uses. Hence, checkpointing and migration are transparent to the application [Ste95].
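The channel-flushing step can be illustrated with a simplified C sketch against the standard PVM calls. The tag value, the fixed-size buffer, and the use of pvm_setrbuf() to detach in-transit messages are our own simplifying assumptions; the real CoCheck protocol code differs.

    #include <pvm3.h>

    #define READY_TAG   9999    /* assumed tag reserved for ready messages */
    #define MAX_PENDING 1024

    static int pending[MAX_PENDING];   /* detached receive buffers holding  */
    static int npending = 0;           /* messages that were still in transit */

    /* Detach the active receive buffer so the next pvm_recv() does not free
       it; after a restart its content would be handed back to the
       application (simplified compared to the real CoCheck code). */
    static void save_pending_message(void)
    {
        if (npending < MAX_PENDING)
            pending[npending++] = pvm_setrbuf(0);
    }

    /* Flush all communication channels before a checkpoint: send a ready
       message to every peer, then drain incoming messages until a ready
       message has arrived from every peer. */
    void flush_channels(const int *peers, int npeers)
    {
        int seen = 0;

        for (int i = 0; i < npeers; i++) {     /* announce: no further sends */
            pvm_initsend(PvmDataDefault);
            pvm_send(peers[i], READY_TAG);
        }

        while (seen < npeers) {                /* drain the channels         */
            int bufid = pvm_recv(-1, -1);      /* any source, any tag        */
            int bytes, tag, src;
            pvm_bufinfo(bufid, &bytes, &tag, &src);

            if (tag == READY_TAG)
                seen++;                        /* one more channel is empty  */
            else
                save_pending_message();        /* keep the in-transit message */
        }
        /* all channels are flushed; the process state can be checkpointed  */
    }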
3.3 The PVM Resource Manager Interface

PVM 3.3 offers a resource manager interface to define custom host and task management and new scheduling strategies [GBD+94]. Usually PVM calls are handled by the PVM daemons, but if a PVM resource manager is registered in the virtual machine, PVM calls concerning hosts and tasks, e.g. pvm_addhosts or pvm_spawn, are redirected to the PVM resource manager. The PVM resource manager provides handler functions to execute the redirected PVM calls. The handler functions in the PVM resource manager are not part of PVM; they must be explicitly written by the user according to a given PVM message framework. CoCheck uses the PVM resource manager interface to implement additional handler functions for checkpointing and migration. For the SEMPA Resource Manager, a complete PVM resource manager has been implemented with handler functions for all affected PVM calls; it joins the components of CODINE, CoCheck and PVM and realizes a local resource manager for every PVM application.
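As an illustration of this interface, the following C sketch registers a task as the PVM resource manager and dispatches the redirected requests. The SM_* message tags are defined in pvmsdpro.h of the PVM 3.3 distribution; the reply protocol for each request is documented there and only hinted at by comments. This is a hedged sketch, not the SEMPA handler code.

    #include <stdio.h>
    #include <pvm3.h>
    #include <pvmsdpro.h>   /* SM_* message tags of the RM/tasker interfaces */

    int main(void)
    {
        struct pvmhostinfo *hosts;

        /* Register as resource manager: from now on host and task related
           calls (pvm_spawn, pvm_addhosts, ...) issued by tasks in the
           virtual machine are redirected to this task as SM_* messages. */
        if (pvm_reg_rm(&hosts) < 0) {
            fprintf(stderr, "pvm_reg_rm failed\n");
            return 1;
        }

        for (;;) {
            int bufid = pvm_recv(-1, -1);      /* wait for the next request */
            int bytes, tag, src;
            pvm_bufinfo(bufid, &bytes, &tag, &src);

            switch (tag) {
            case SM_SPAWN:      /* a task called pvm_spawn()                */
                /* select a target host, forward the start request to the
                   tasker/pvmd there, and reply to the requesting task      */
                break;
            case SM_ADDHOST:    /* a task called pvm_addhosts()             */
                /* possibly request additional hosts from the batch system  */
                break;
            case SM_TASKX:      /* a task has exited                        */
                break;
            default:            /* remaining SM_* requests handled similarly */
                break;
            }
        }
    }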
4 Implementation Aspects of the SEMPA Resource Manager

In the previous sections the architectural design and the components of the SEMPA Resource Manager have been introduced. This section explains some functionalities of the SEMPA Resource Manager and shows some implementation details.
The main component of the SEMPA Resource Manager is the PVM resource manager with its handler functions for host and task management that initiate certain operations of CODINE, CoCheck or PVM. The data exchange between CODINE and PVM components is realized by PVM calls and a signal interface.
4.1 Starting a PVM Job by the SEMPA Resource Manager

Before a job can be started, hosts for the execution of the job must be selected and the parallel environment must be configured. In the SEMPA Resource Manager, the CODINE scheduler selects the hosts for the PVM application, as well as the master host where the application is started, according to the load on the hosts. Then the execd on the master host starts a shepherd, called the master shepherd. The master shepherd starts the master PVM daemon (pvmd) and the PVM resource manager. The PVM resource manager sets up the virtual machine with the hosts selected by the schedd, i.e. it starts a slave pvmd and a PVM tasker on each host belonging to the virtual machine. Due to PVM implementation constraints, the PVM resource manager must be started before hosts are added to the virtual machine. Now the virtual machine is built up completely, with the master pvmd and the PVM resource manager running on the master host and a slave pvmd and a PVM tasker on every other host in the virtual machine, as shown in Figure 2. As the next step, the application is started by the master shepherd, i.e. the first PVM task is started, which usually spawns further PVM tasks.
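A minimal sketch of the host setup step, assuming placeholder host names: it only shows the standard pvm_addhosts() call that a resource manager process could use for the hosts selected by the schedd, while starting the PVM taskers is merely indicated by a comment.

    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        /* hosts selected by the CODINE scheduler (placeholder names) */
        char *hosts[] = { "node01", "node02", "node03" };
        int   infos[3];
        int   nhosts = 3;

        /* start a slave pvmd on every selected host */
        int added = pvm_addhosts(hosts, nhosts, infos);
        if (added < nhosts) {
            for (int i = 0; i < nhosts; i++)
                if (infos[i] < 0)
                    fprintf(stderr, "could not add %s (error %d)\n",
                            hosts[i], infos[i]);
        }

        /* ... at this point the resource manager would start a PVM tasker
           on every host of the virtual machine before the master shepherd
           starts the first task of the application ... */

        pvm_exit();
        return 0;
    }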
4.2 Spawning a Task

As mentioned above, CODINE is intended to have control over all tasks spawned by PVM. The PVM tasker concept is used to implement the creation of a new task with a custom strategy. The PVM resource manager selects a host within the virtual machine where the new task is started. If no appropriate host is available in the virtual machine, the PVM resource manager requests a new host, possibly with specific hardware requirements, from the CODINE qmaster. The pvm_spawn call is sent to the PVM resource manager, which selects a host and sends a message to the PVM tasker on that host. Currently, PVM's round-robin strategy is used to map tasks to hosts; a strategy considering load information about the hosts will be implemented in the next phase of the project [SKS92]. It is not reasonable to specify a particular host in the pvm_spawn call because the resource manager selects a host for the task. If the pvm_spawn call fails, a corresponding PVM error is generated and the responsibility to handle the error is returned to the calling task. The PVM tasker implements a procedure that prevents the PVM tasker from forking the new task itself; instead, it causes the execd to start a shepherd that finally creates the task (see Figure 3). The task is spawned on a host belonging to the virtual machine, i.e. a slave pvmd and a PVM tasker are already running on that host. The spawned task is then under the control of both CODINE and PVM.
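The tasker side can be sketched as follows. The registration call pvm_reg_tasker() and the SM_STTASK request are part of the PVM 3.3 tasker interface; the unpacking shown here is simplified and the hand-off to the execd is only indicated by a comment, so this should be read as an assumption-laden sketch rather than the SEMPA tasker implementation.

    #include <stdio.h>
    #include <pvm3.h>
    #include <pvmsdpro.h>   /* SM_STTASK and related tags */

    int main(void)
    {
        /* Register as tasker: the local pvmd no longer forks spawned tasks
           itself but forwards each start request to this process. */
        if (pvm_reg_tasker() < 0) {
            fprintf(stderr, "pvm_reg_tasker failed\n");
            return 1;
        }

        for (;;) {
            pvm_recv(-1, SM_STTASK);    /* pvmd asks us to start a task    */

            int  tid, flags;
            char path[256];
            pvm_upkint(&tid, 1, 1);     /* task id assigned to the new task */
            pvm_upkint(&flags, 1, 1);
            pvm_upkstr(path);           /* executable to be started        */
            /* argument list and environment would be unpacked here as well */

            /* In the SEMPA Resource Manager the tasker does not fork 'path'
               itself; it notifies the execd, which starts a shepherd, and
               the shepherd finally forks the task, keeping it under
               CODINE's control.  When the task exits, an SM_TASKX message
               with exit status and resource usage is returned to the pvmd. */
            fprintf(stderr, "start request for tid t%x: %s\n",
                    (unsigned)tid, path);
        }
    }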
[Figure 2: Starting a PVM job by the SEMPA Resource Manager (master host with qmaster, schedd, execd, shepherd, master pvmd, PVM application and PVM resource manager; hosts 2 to n with execd, slave pvmd and PVM tasker, exchanging start, spawn and request messages)]
4.3 Exiting a Task

When a task exits, CODINE and the PVM resource manager must be notified. An exiting task sends a SIGCHLD signal to its parent process, which is a shepherd process. After receiving the SIGCHLD signal, the shepherd writes the accounting information about the task to a temporary file and sends a SIGCHLD signal to the PVM tasker to inform it that the task has exited. The shepherd then exits and thereby sends a SIGCHLD signal to its own parent process, the execd. When the PVM resource manager recognizes that all tasks have terminated, it stops PVM and exits.
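A minimal POSIX sketch of this exit handling, assuming a placeholder accounting file and record format: the shepherd installs a SIGCHLD handler, reaps its task with wait3() to obtain the resource usage, and writes it to a temporary file before notifying the other components.

    #include <stdio.h>
    #include <string.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/resource.h>

    static volatile sig_atomic_t child_exited = 0;

    static void on_sigchld(int sig)
    {
        (void)sig;
        child_exited = 1;            /* defer the real work to the main loop */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigchld;
        sigaction(SIGCHLD, &sa, NULL);

        pid_t task = fork();         /* stand-in for the real PVM task */
        if (task == 0) {
            sleep(1);
            _exit(0);
        }

        while (!child_exited)        /* shepherd waits for the task to exit */
            pause();

        int status;
        struct rusage usage;
        pid_t pid = wait3(&status, 0, &usage);    /* reap task, get usage */

        FILE *acct = fopen("/tmp/sempa_acct.tmp", "w");  /* placeholder path */
        if (acct) {
            fprintf(acct, "pid=%d exit=%d utime=%ld.%06ld\n",
                    (int)pid, WEXITSTATUS(status),
                    (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec);
            fclose(acct);
        }

        /* here the shepherd would signal the PVM tasker and then exit,
           which in turn delivers SIGCHLD to its parent, the execd */
        return 0;
    }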
5 Performance Measurements

Functionalities and performance of the SEMPA Resource Manager have been evaluated with ParTfC as a real-world test case. ParTfC is a computational fluid dynamics package to compute laminar and turbulent viscous flows in three-dimensional geometries. It has been parallelized within the SEMPA project according to the SPMD (single program, multiple data) paradigm [LMR+96]. The underlying grid is partitioned into smaller parts, and every partition is computed by a separate process.
[Figure 3: Spawning a PVM task by the SEMPA Resource Manager (PVM resource manager on the master host; PVM tasker, slave pvmd, execd and shepherd on the target host, exchanging spawn, request and start messages)]

The presented time measurements show the influence of a resource management system on the runtime of ParTfC. The following three measurement models were examined:

(M1) ParTfC in interactive mode
(M2) ParTfC started as a CODINE batch job without a PVM resource manager
(M3) ParTfC started as a batch job via the SEMPA Resource Manager

The time measurements were performed with two different grids:

(T1) A grid with 3150 grid nodes divided into 4 partitions.
(T2) A grid with 19200 grid nodes divided into 4 partitions.

The four processes of ParTfC were computed on two SGI Indigo 4400 workstations, so that two processes were running on each host. The two grids are relatively small, but they are sufficient to show that the overhead produced by CODINE or the SEMPA Resource Manager is negligible. Table 1 shows that the runtime of ParTfC hardly increases when ParTfC is started as a batch job in CODINE or in the SEMPA Resource Manager, compared to running ParTfC in interactive mode.
          (M1)     (M2)     (M3)
(T1)      190 s    194 s    197 s
(T2)      389 s    395 s    396 s

Table 1: Runtime of ParTfC for the three measurement models

The times for the start and stop scripts that CODINE and the SEMPA Resource Manager execute before starting and after finishing ParTfC are shown in Table 2. Compared to the runtime of ParTfC, these times are negligible. The start script in CODINE starts PVM and sets up the virtual machine. The execution of the start script of the SEMPA Resource Manager takes more time than the start script of CODINE because the PVM resource manager and the PVM tasker must be started in addition. The stop script of CODINE performs a pvm_halt to stop the virtual machine. The stop script of the SEMPA Resource Manager sends a signal to the PVM resource manager, which stops the virtual machine once all processes of the parallel application have finished.

               (M2)      (M3)
start script   100 ms    4.2 s
stop script    100 ms    60 ms

Table 2: Time for the start and stop scripts in CODINE and the SEMPA Resource Manager
6 Conclusion

The SEMPA Resource Manager provides batch queuing and resource management facilities for PVM applications in a NOW. Parallel applications are started as batch jobs, and each process of a parallel application is under the control of the SEMPA Resource Manager, so that e.g. resource limitation and migration can be performed for each process. The presented approach is restricted to PVM applications because PVM offers dynamic task management and features to define custom resource management services. The flexibility of the PVM concept avoids changes in the PVM code. Modifications in CODINE and CoCheck are necessary but have been reduced to a minimum. The implementation of the SEMPA Resource Manager has almost been completed, except for the integration of the CoCheck handler functions into the PVM resource manager. The next step after the integration of the migration facilities will be to improve the scheduling strategy of the PVM resource manager in order to decide about the mapping and remapping of processes more efficiently. Currently the round-robin method is used, which considers neither the different CPU and memory capacities of the hosts nor the actual load situation in the virtual machine and the NOW. The interface between the PVM resource manager and the CODINE qmaster must be extended to make scheduling information of CODINE available to the PVM resource manager.
References

[GBD+94] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing. Scientific and Engineering Computation. The MIT Press, Cambridge, MA, 1994.

[GEN96] GENIAS Software GmbH, Erzgebirgstr. 2B, D-93073 Neutraubling, Germany. CODINE Reference Manual, Version 4.0, 1996.

[LMR+96] Peter Luksch, Ursula Maier, Sabine Rathmayer, Friedemann Unger, and Matthias Weidmann. Parallelization of a State-of-the-Art Industrial CFD Package for Execution on Networks of Workstations and Massively Parallel Processors. In Third European PVM Users' Group Meeting, EuroPVM 96, München, October 1996.

[LMRW96] Peter Luksch, Ursula Maier, Sabine Rathmayer, and Matthias Weidmann. SEMPA: Software Engineering Methods for Parallel Scientific Applications. In International Software Engineering Week, First International Workshop on Software Engineering for Parallel and Distributed Systems, Berlin, March 1996.

[LTBL97] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Environment. Technical Report 1346, University of Wisconsin-Madison, April 1997.

[MS97] Ursula Maier and Georg Stellner. Distributed Resource Management for Parallel Applications in Networks of Workstations. In HPCN Europe 1997, volume 1225 of Lecture Notes in Computer Science, pages 462-471. Springer-Verlag, 1997.

[Pla96] Platform Computing Corporation, North York, Ontario, Canada. LSF Documentation, December 1996.

[SKS92] Niranjan G. Shivaratri, Phillip Krueger, and Mukesh Singhal. Load Distributing for Locally Distributed Systems. Computer, 25(12):33-44, December 1992.

[Ste95] Georg Stellner. Checkpointing and Process Migration for PVM. In Arndt Bode, Thomas Ludwig, Vaidy Sunderam, and Roland Wismüller, editors, Workshop on PVM, MPI Tools and Applications, number 342/18/95 A in SFB-Bericht, pages 44-48. Technische Universität München, Institut für Informatik, November 1995.

[Ste96] Georg Stellner. CoCheck: Checkpointing and Process Migration for MPI. In Proceedings of the International Parallel Processing Symposium, pages 526-531, Honolulu, HI, April 1996. IEEE Computer Society Press.

[ZB96] Avi Ziv and Jehoshua Bruck. Checkpointing in Parallel and Distributed Systems. In Albert Zomaya, editor, Parallel and Distributed Computing Handbook, Series on Computer Engineering, chapter 10, pages 274-302. McGraw-Hill, 1996.