Hector: Automated Task Allocation for MPI
Mississippi State University Technical Report MSSU–EIRS–ERC–95–6 September 19, 1995
Dr. Samuel H. Russ, Dr. Brian Flachs, Jonathan Robinson, Bjorn Heckel
NSF Engineering Research Center for Computational Field Simulation
Abstract—Many institutions already have networks of workstations that could be harnessed as a powerful parallel processing resource. A new, automatic task allocation system has been built on top of MPI, an environment that supports parallel programming in the message-passing paradigm and is implemented as extensions to C and FORTRAN. This system, known as "Hector", supports dynamic migration of tasks and automatic run-time performance optimization. MPI programs run unmodified under Hector, on existing networks of workstations. Hector therefore permits institutions to harness existing computational resources quickly and transparently.
TABLE OF CONTENTS
I. Introduction
II. Current Research
   A. Systems Based on Shared Data
   B. Message-Passing Standards and Systems
III. Desired Features
   A. Task Migration
   B. Fault Tolerance
   C. Optimal Resource Allocation
   D. Ease of Applications Development
IV. Existing Message-Passing-Based Task Allocators
V. Hector: An Automated Task Allocator
   A. Task Migration in MPI
   B. Performance Optimization
VI. Performance Results
VII. Future Plans
   A. State Transfer Mechanism
   B. Fault Tolerance
   C. Allocation across Heterogeneous Networks
VIII. Bibliography
I. INTRODUCTION

The abundance of workstations in many institutions provides a strong motivation to harness these available resources to work together on computationally intensive problems. Several obstacles have hindered these efforts, although the emergence of industry-standard operating systems, network connectivity, and file sharing has fostered architectural openness. This openness, combined with increasingly powerful workstations, provides the opportunity to build a powerful class of multiprocessor, the Network of Workstations (NOW). This article describes one approach to developing such a multiprocessor. Research at Mississippi State University's NSF Engineering Research Center has led to an automated task allocator, called "Hector", for running parallel tasks on a network of workstations. Hector is designed to automate the allocation and migration of MPI "tasks" during the course of running a parallel program.

In the context of this paper, a "task" is a section of sequential code that runs standalone as part of a larger parallel program; it is akin to a "process" in Hoare's Communicating Sequential Processes (CSP) model [10]. Each task is run as a separate Unix "process" in the current implementation. "Program" usually refers to a parallel program written in MPI. A "processor" or "node" is an independently functioning CPU, and a "platform" is a single computer, whether a workstation, a supercomputer, or a tightly coupled multiprocessor.

Current research in the area is described in section II. Section III describes the desired properties of an automatic task allocator, and section IV summarizes the capabilities of currently available allocation systems. The capabilities of the Hector system are discussed in section V and its measured performance in section VI. The paper concludes with a brief outline of future plans in section VII.
II. CURRENT RESEARCH

Parallel processing offers the ability to speed computation, usually at a lower incremental cost than accelerating a uniprocessor. Several models and paradigms have been developed over the years, and most can be divided into two categories. (The discussion here concerns the programming model, not the underlying hardware.)

A. Systems Based on Shared Data

In this paradigm, every part of a parallel program "sees" at least part of some shared data space. This permits tasks to communicate and share workload. Two approaches have been used to support this programming model.

First, some researchers have developed custom programming languages to express parallelism formally; Orca is one example [4]. While such languages offer the advantage of formalism, and therefore inherent correctness, most scientific applications programmers seem reluctant to learn an entirely new programming language.

Second, many systems implement shared data via language extensions. The ANL macros are a set of language extensions that support shared data and synchronization on shared-memory multiprocessors [11]. More intrusive extensions are required to present the shared-data paradigm when shared-memory hardware is not available; the Linda language extensions are an example [3]. In the Linda system, "tuples" are placed in or removed from "tuple-space", a shared data abstraction. The Piranha system, built using Linda, supports master-worker parallelism [5]. Other examples of shared-memory systems built with language extensions include Mermaid [17], which supports memory sharing across heterogeneous platforms.

The shared-memory model is widely used, and some argue that it simplifies parallel programming. Shared-memory multiprocessors are commercially available and naturally support this paradigm in hardware. However, networks of workstations have no hardware or operating system support for distributed shared memory, and performance penalties may result from the disparity between the underlying hardware and the programming paradigm.
B. Message-Passing Standards and Systems

The message-passing paradigm describes parallel programs as a collection of sequential programs that communicate through explicitly defined messages. As with shared data, two approaches are prevalent. First, custom languages have been developed to express parallelism formally; the Inmos Transputer is programmed in Occam, for example [12]. These custom languages share the same shortcomings as the custom shared-memory languages. Second, systems have been developed to support message passing through extensions to existing languages. Examples include PVM [8] and MPI [9]. Both are supported on a large and growing base of platforms and processors and are implemented as extensions to C and FORTRAN.

This research focuses on the message-passing paradigm because it more closely reflects the hardware of a network of workstations. On a NOW there is an especially large penalty for inter-processor communication; by explicitly controlling communication, the programmer can minimize its overhead and effects. The research also focuses on language extensions so that it can serve a wide variety of applications programmers, many of whom are unwilling to learn a new programming language.

III. DESIRED FEATURES

Examining current systems and existing resources makes it possible to identify desirable features in an automated task allocation system. Some of these desired features are described below.

A. Task Migration

A goal of this project is the ability to move tasks within an MPI program between processors at run time. Migration must not affect program behavior, should not require operating system modifications, and should be as fast and efficient as possible.
B. Fault Tolerance

Fault tolerance is a critical issue. For example, if each workstation has a 1% chance of failing overnight, then 100 workstations will all stay up only (1 - 0.01)^100 ≈ 37% of the time. A parallel program running on 100 workstations overnight will therefore fail to complete almost two thirds of the time.

C. Optimal Resource Allocation

The assignment of tasks to nodes should be done so as to minimize execution time. Resource allocation is dynamic: as operating conditions change, the optimal allocation may also change, and tasks may need to be migrated so that the actual allocation matches the new optimal allocation.

D. Ease of Applications Development

A difficult-to-use development environment will not gain wide acceptance. Building on existing languages and operating systems minimizes the burden of learning about and maintaining the development environment, and it is important that the allocation system be transparent to the parallel programmer.

IV. EXISTING MESSAGE-PASSING-BASED TASK ALLOCATORS
Figure 1 compares four existing task allocators, Condor/CARMI, Prospero, MIST, and DQS, against the following criteria: optimal processor allocation, optimal network allocation, optimal reallocation, stopping only the task under migration, global migration "tracking", fault tolerance, support for MPI, no source code modifications to existing MPI/PVM programs, and use of an unmodified operating system.

Figure 1: Comparison of Existing Task Allocators
Systems already exist that can allocate tasks across networks of workstations, and some of them can migrate tasks from one processor to another. Historically, the motivation for task migration has been to release a workstation for use by its interactive user. Figure 1 summarizes these systems.

One such system is a special version of Condor named CARMI [14]. CARMI can allocate PVM tasks across idle workstations and can migrate PVM tasks as workstations fail or become "non-idle". In order to support task migration, CARMI checkpoints all tasks to disk when one task needs to migrate [16]. This has the advantage of naturally supporting fault tolerance: when a node fails, the program can "roll back" to the most recent checkpoint and resume execution. Note that all tasks must be checkpointed at a consistent point in time for this to be correct. However, stopping all tasks slows program execution unnecessarily, since only the migrating task actually needs to be stopped. Further, every task must access the disk, which in a NOW environment is usually shared over the network, and these disk requests create another performance bottleneck.

Another automated allocation environment is the Prospero Resource Manager, or PRM [13]. Each parallel program run under PRM has its own job manager. Custom-written for each program, the job manager acts like a "purchasing agent" and negotiates the acquisition of system resources as additional resources are needed. A system manager provides a central interface point for each program's job manager, and each candidate processor has a "node manager" that observes its performance. PRM is scheduled to use elements of Condor to support task migration and checkpointing, and it may use information gathered from the job and node managers to reallocate resources. Note that PRM requires a custom allocation "program" for each parallel application, and future versions may require modified operating systems and kernels.

The MIST system is intended to integrate several development efforts and produce an automated task allocator [6]. It has improved migration performance relative to Condor in two ways.
First, it stops only the task under migration. Second, it transfers the program state directly over the network, avoiding the intermediate step of writing to disk; the MIST developers estimate that network state transfer is about 10 times faster than "core dump" transfer. To minimize the performance penalty of checkpointing for fault tolerance, every task "forks" a duplicate of itself, and the duplicate tasks write out their states to files.

MIST is built on top of PVM, and PVM's support for "indirect communications" can lead to a great deal of administrative overhead when a task has been migrated [7]. Indirect communication permits tasks to communicate in such a way that neither knows the host of the other. It is usually implemented by having a task send messages to the local PVM daemon (pvmd); the source pvmd forwards the message to the destination pvmd, which forwards it to the desired task. This communication paradigm has some advantages, but after a task has migrated, message routing can become problematic. The original host where the task was located before migration keeps a "forwarding address" on hand. If a message arrives for a migrated task, the host forwards the message and sends a message back to the sender notifying it of the "change of address". Each host maintains a "cache" of forwarding addresses, but may have to contact the original host if the cache is full. The list of forwarding addresses is duplicated on other hosts to support fault tolerance. The important point is that MPI, with a globally available list of task hosts, requires none of this overhead: every task sends its messages directly to the receiving task, and the only overhead required after a task has migrated is to notify every other task of the new location. This is discussed in more detail below.

The Distributed Queuing System, or DQS, is designed to manage jobs across multiple computers simultaneously [2]. It can support one or more queue masters, which process user requests for resources on a first-come, first-served basis. Users prepare small batch files describing the type of machine needed for a particular application (for example, a certain amount of memory), so resource allocation is performed as jobs are started.
While DQS has no built-in support for task migration or fault tolerance, it can issue commands to applications that can migrate and/or checkpoint themselves. It supports both PVM and MPI applications.

V. HECTOR: AN AUTOMATED TASK ALLOCATOR

The Hector automated task allocator is designed to allocate tasks to processors optimally and dynamically, to migrate tasks transparently and with minimal overhead, and to support automated allocation of unmodified MPI programs. The current version supports these features through a distributed architecture. A "master allocator" collects performance information and maintains a central relative-performance database. It controls the execution of MPI programs by means of "slave allocators", small processes running on each candidate platform. The slave allocators monitor the run-time performance of each CPU and carry out various task control functions. For example, slave allocators start new MPI tasks on the local processor. They probe the kernel for load averages, total memory usage, and other run-time statistics, and send the information back to the master allocator. Each MPI task communicates with its local slave allocator via signals and Unix sockets. Task control is implemented in the allocators and in the MPI library, and is therefore transparent to the programmer. This structure is diagrammed in Figure 2.
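As a rough illustration of the probing just described, the sketch below samples the local load average and reports it to the master allocator over a UDP socket. This is not Hector's code: the master's host name, port number, report format, and probe interval are invented for the example, and Hector also collects memory usage and other statistics not shown here.

/* Illustrative slave-allocator probe loop: sample the load average and
 * send a small text report to a (hypothetical) master allocator. */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MASTER_HOST "master.example.edu"   /* hypothetical master node */
#define MASTER_PORT 7001                   /* hypothetical report port */

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);        /* UDP report socket */
    struct hostent *h = gethostbyname(MASTER_HOST);
    if (sock < 0 || h == NULL) {
        perror("slave allocator setup");
        return 1;
    }

    struct sockaddr_in master;
    memset(&master, 0, sizeof master);
    master.sin_family = AF_INET;
    master.sin_port   = htons(MASTER_PORT);
    memcpy(&master.sin_addr, h->h_addr_list[0], h->h_length);

    char host[64], report[256];
    gethostname(host, sizeof host);

    for (;;) {                                         /* periodic probe loop */
        double load[3];
        if (getloadavg(load, 3) < 0)                   /* 1-, 5-, 15-minute averages */
            load[0] = load[1] = load[2] = -1.0;
        snprintf(report, sizeof report, "%s load %.2f %.2f %.2f",
                 host, load[0], load[1], load[2]);
        sendto(sock, report, strlen(report), 0,
               (struct sockaddr *)&master, sizeof master);
        sleep(5);                                      /* report every few seconds */
    }
}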
Figure 2 depicts the master allocator and its task-migration decision-maker on one node, a slave allocator on each computing node (1 through m), permanent sockets carrying performance and control information between the master and the slaves, and each slave allocator launching local MPI tasks via fork and exchanging control information with them through signals and/or temporary sockets.
Figure 2: Structure of the Task Allocator Running an MPI Program

A. Task Migration in MPI

The first step in task migration is to transfer a task's state completely. State transfer is currently accomplished by dumping core to disk, which the operating system supports automatically. (Future versions may transfer the state over the network.) To restart from a core dump, a special function is linked with the program and is invoked via command-line arguments. The entire data segment (including the heap) and the stack segment are read from the core file into an area of stack created with alloca. After using sbrk to create the correct amount of heap space, the "old" data segment is copied over the new task's data segment. All operations up to this point use local variables, which are on the stack, so as not to interfere with the newly reconstructed data segment. The function skips over the empty space between the end of the heap and the "pre-migration" stack pointer. Once the data has been reconstructed, the function uses global variables, which are in the data segment, to copy the image of the "old" stack over the new task's stack. A small assembly-language function then restores the registers and jumps to the location at which the old task exited.
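The address-space layout being navigated here can be pictured with a short, self-contained program (purely illustrative; the exact layout varies by platform): the data segment and heap end at the current break, and a large, mostly unused gap separates the break from the stack. It is this gap that the restart function skips over.

/* Tiny illustration (not Hector code) of the regions the restart function
 * has to reconstruct: data segment, heap up to the break, gap, stack. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int global_in_data_segment = 42;          /* lives in the data segment */

int main(void)
{
    int   local_on_stack = 7;             /* lives on the stack */
    char *heap_obj = malloc(1024);        /* lives on the heap  */
    void *brk_now  = sbrk(0);             /* current end of the heap */

    printf("data  : %p\n", (void *)&global_in_data_segment);
    printf("heap  : %p\n", (void *)heap_obj);
    printf("break : %p\n", brk_now);
    printf("stack : %p\n", (void *)&local_on_stack);
    printf("gap between break and stack: %ld bytes\n",
           (long)((char *)&local_on_stack - (char *)brk_now));
    free(heap_obj);
    return 0;
}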
Several key aspects of a Unix process's state live in structures maintained by the kernel and are therefore not present in the core dump. Rather than modifying the kernel, calls that alter the kernel's state are "trapped" by user-level wrapper functions. (The same approach is used by Condor and MIST.) These functions keep track of the relevant kernel state in the user's data segment, so that the migrated task can recreate the conditions under which it ran before migration. As in Condor and MIST, these structures contain the information necessary to re-open files and adjust file pointers. They also permit calls to mmap to be reconstructed and signal handlers to be re-installed. This reconstruction occurs after the core file has been read and before the old data segment has been copied.
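The wrapper idea can be sketched as follows. This is not Hector's library: the names and the fixed-size table are invented, and a real implementation would also wrap close, dup, and the other calls that affect kernel state. The point is simply that everything needed to re-open a file lives in the data segment, which the core-dump transfer already carries.

/* Illustrative user-level wrapper that records open() calls so the files
 * can be re-opened and repositioned after a migration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_TRACKED 64

struct tracked_file {            /* lives in the data segment, so it     */
    char   path[256];            /* travels with the core-dump transfer  */
    int    flags;
    mode_t mode;
    off_t  offset;
    int    in_use;
};
static struct tracked_file files[MAX_TRACKED];

/* Used in place of open(): do the real open, then remember how. */
int tracked_open(const char *path, int flags, mode_t mode)
{
    int fd = open(path, flags, mode);
    if (fd >= 0 && fd < MAX_TRACKED) {
        strncpy(files[fd].path, path, sizeof files[fd].path - 1);
        files[fd].flags  = flags;
        files[fd].mode   = mode;
        files[fd].in_use = 1;
    }
    return fd;
}

/* Called just before the task dumps core: snapshot the file offsets. */
void record_file_offsets(void)
{
    for (int i = 0; i < MAX_TRACKED; i++)
        if (files[i].in_use)
            files[i].offset = lseek(i, 0, SEEK_CUR);
}

/* Called in the restarted task after the data segment is restored:
 * recreate each descriptor on its original number and reposition it. */
void reopen_tracked_files(void)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (!files[i].in_use)
            continue;
        int flags = files[i].flags & ~(O_CREAT | O_TRUNC);  /* don't clobber data */
        int fd = open(files[i].path, flags, files[i].mode);
        if (fd < 0)
            continue;
        if (fd != i) {                    /* move it back to the original number */
            dup2(fd, i);
            close(fd);
        }
        lseek(i, files[i].offset, SEEK_SET);
    }
}

int main(void)                            /* tiny demonstration */
{
    int fd = tracked_open("/tmp/hector_wrapper_demo", O_RDWR | O_CREAT, 0600);
    write(fd, "hello", 5);
    record_file_offsets();                /* ...the task would dump core here    */
    close(fd);                            /* stand-in for the old process exiting */
    reopen_tracked_files();               /* ...and the restarted task resumes   */
    printf("restored offset: %ld\n", (long)lseek(fd, 0, SEEK_CUR));
    return 0;
}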
It is essential that message traffic be kept intact during the migration of a task. While MPI's global "process table" simplifies the bookkeeping, some effort is still required to ensure correctness, and a carefully ordered series of steps is followed. Once the master allocator has decided to migrate a task, all slave allocators are notified. (Recall that there is one slave allocator on every platform.) Every slave allocator sends a signal (for now, SIGUSR2) to the tasks on its machine that belong to the appropriate MPI program. (Multiple MPI programs may be running simultaneously.) When a task receives the signal, it opens a new socket and sends the new socket's port number back to the slave allocator; it can do so because every slave allocator maintains a permanent socket at a fixed port number. Once the slave allocator has received a task's new port number, it is ready to communicate with it. Communication between the slave allocator and the tasks under its control is extremely fast, because it is entirely local and does not involve the network.

The slave allocator then sends one of two messages to each task. The first type simply notifies the task that another task is migrating. The second type, sent only to the migrating task, tells it to prepare for migration. Notice that only the migrating task is actually halted. Each non-migrating task checks whether it has an open two-way communication channel to the task that is about to migrate; if so, it sends an "end-of-channel" message and closes the connection. Each task then marks the migrating task's state as "migrating" in its process table, and any attempt to communicate with that task from then on is blocked.

The migrating task waits for "end-of-channel" messages from every task communicating with it. This ensures that no additional messages are in transit, which is important because any messages left in the kernel's buffers would be lost. Once all of these messages have been received, it notifies the slave allocator, dumps core, and exits. Upon receiving this notification, the slave allocator uses wait to detect that the task has exited. It then notifies the master allocator, which in turn notifies the slave allocator on the destination node. Once that slave allocator has relaunched the task, using the state-reconstruction procedure described above, another round of conversations ensues: each slave allocator again signals its tasks, and each task opens a socket and reports to its slave allocator. The slave allocators notify each task of the migrated task's new location, and each task updates its global process table accordingly. Because the newly migrated task's entry is marked "not connected", the MPI library knows to reopen communication the next time a task tries to talk to it.
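A rough sketch of the bookkeeping on the non-migrating side appears below. The structure and function names are invented and greatly simplified relative to the MPI library's actual process table; the intent is only to show the end-of-channel send, the "migrating" mark that blocks further sends, and the update once the new location is announced.

/* Illustrative per-peer bookkeeping for migration notices. */
#include <string.h>
#include <unistd.h>

enum peer_state { PEER_CONNECTED, PEER_MIGRATING, PEER_NOT_CONNECTED };

struct peer_entry {
    enum peer_state state;
    int             fd;          /* socket to the peer, -1 if none        */
    char            host[64];    /* where the peer is currently running   */
};

#define MAX_TASKS 128
static struct peer_entry ptable[MAX_TASKS];

static const char END_OF_CHANNEL = 0x04;   /* marker byte, invented here */

/* The slave allocator has told us that task `rank` is about to migrate. */
void note_peer_migrating(int rank)
{
    if (ptable[rank].state == PEER_CONNECTED && ptable[rank].fd >= 0) {
        write(ptable[rank].fd, &END_OF_CHANNEL, 1);   /* drain our side    */
        close(ptable[rank].fd);
        ptable[rank].fd = -1;
    }
    ptable[rank].state = PEER_MIGRATING;    /* later sends block on this   */
}

/* The new location has been announced after the restart. */
void note_peer_moved(int rank, const char *new_host)
{
    strncpy(ptable[rank].host, new_host, sizeof ptable[rank].host - 1);
    ptable[rank].host[sizeof ptable[rank].host - 1] = '\0';
    ptable[rank].state = PEER_NOT_CONNECTED; /* reconnect lazily on next send */
    ptable[rank].fd = -1;
}

int main(void)                               /* minimal demonstration */
{
    ptable[3].state = PEER_CONNECTED;        /* pretend rank 3 was connected   */
    ptable[3].fd = -1;                       /* (no real socket in this demo)  */
    note_peer_migrating(3);
    note_peer_moved(3, "node7.example.edu");
    return 0;
}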
None of these mechanisms is visible to the applications programmer. The functions that start MPI tasks are called by the allocator, and all of the socket communication and signal handling is in the libraries linked with the application. Task allocation and migration are therefore "transparent" to the applications programmer.

B. Performance Optimization

The task allocator uses two sources of information. The first, which can be thought of as "static", includes each processor's relative CPU performance, main memory size, local disk speed, network connectivity, and "etiquette" information; the etiquette information may specify, for example, when platforms are available to run jobs. The second, which can be thought of as "dynamic", includes information about disk swapping, context switching, message latency, memory usage, and any interactive user activity. Each slave allocator probes the local kernel to gather this performance information. Having a local information-gathering process provides relatively rapid response to run-time events and to some kinds of faults. (For example, a processor that is thrashing severely can be harder to detect and work around than one that has crashed completely.)

The initial mapping of tasks to processors is currently based on each processor's current workload and its relative performance. The allocator attempts to minimize run time as resources become available: it uses faster computers as soon as they free up, and it starts parallel programs on the machines that are available rather than waiting for faster ones.
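The way the two kinds of information might be combined can be illustrated with a toy scoring rule (an invented example, not Hector's actual heuristic): derate each node's static relative speed by its current load average and place the next task on the node with the best result.

/* Toy placement rule: effective speed = relative speed / (1 + load). */
#include <stdio.h>

struct node_info {
    const char *name;
    double      relative_speed;  /* static benchmark figure             */
    double      load_average;    /* dynamic figure from the slave probe */
};

/* Return the index of the node with the highest effective speed. */
int pick_node(const struct node_info *nodes, int n)
{
    int best = 0;
    double best_score = nodes[0].relative_speed / (1.0 + nodes[0].load_average);
    for (int i = 1; i < n; i++) {
        double score = nodes[i].relative_speed / (1.0 + nodes[i].load_average);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}

int main(void)
{
    struct node_info nodes[] = {
        { "sparc40",   1.0, 0.1 },   /* idle but slow        */
        { "hyper70-a", 2.2, 1.9 },   /* fast but busy        */
        { "hyper70-b", 2.2, 0.2 },   /* fast and nearly idle */
    };
    printf("place next task on %s\n", nodes[pick_node(nodes, 3)].name);
    return 0;
}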
VI. PERFORMANCE RESULTS

An MPI program was run under two different scenarios to determine the performance benefit of "optimal" scheduling over "first-come, first-served" scheduling. The test program was a three-process matrix multiply written in C with MPI extensions. The primary task reads in two matrices, broadcasts the first matrix, and sends one third of the second matrix to each of the two other tasks. After computing results for its own share of the answer, the primary task receives the results from the other two tasks. The matrix size was 1000 x 1000. The machines used in the test ranged from a 40 MHz SPARC-based system to a four-processor 70 MHz Hyper-SPARC-based system. All were free of external loads, except as noted below.

First, tests were run to validate the advantage of using a processor's load as a criterion for task migration. The matrix multiply job was launched on three workstations, and two minutes into the run another, unrelated program was launched on the third workstation. The matrix multiply took 1316 seconds when the third workstation remained "loaded". The experiment was repeated, except that this time the task was migrated from the third workstation to an idle workstation; the run time dropped to 760 seconds. Dynamically balancing the load thus minimizes overall run time.

Second, tests were run to show the advantage of migrating tasks when faster workstations become available. The job was launched on three workstations and took 1182 seconds. The job was then launched on three workstations again; this time, faster machines became available two minutes into the computation, and the tasks running on the slower machines were migrated to the faster ones. The run time decreased to 724 seconds. Notice that the algorithms used by other task allocators would not have migrated the slow tasks once the program was launched. The performance penalty associated with task migration, currently on the order of 45 seconds, was more than offset by the resulting improvement.

One anomaly of interest occurred during the second test: the run time of one of the "slave" processes appeared to be longer than the total run time. It turned out that the system clocks on the source and destination machines differed by about three minutes, which affected the run-time calculation. This highlights some of the unexpected effects that can occur when tasks migrate in mid-computation.
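For reference, the overall structure of such a three-task multiply is sketched below. This is not the benchmark code used in the tests: the matrix size, the initialization values, and the transposed storage of the second matrix (chosen so that one-third column blocks are contiguous buffers) are conveniences for the example.

/* A minimal sketch of the three-task matrix multiply: rank 0 broadcasts A,
 * hands one third of B to each of ranks 1 and 2, computes its own third,
 * and collects the results.  B and C are stored transposed (column-major)
 * so that a one-third block of columns is one contiguous buffer. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 300                          /* matrix dimension, divisible by 3 */

/* ct[j][i] = sum_k a[i][k] * bt[j][k]  for j = 0 .. cols-1 */
static void multiply_block(const double *a, const double *bt,
                           double *ct, int cols)
{
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i * N + k] * bt[j * N + k];
            ct[j * N + i] = s;
        }
}

int main(int argc, char **argv)
{
    int rank, size, third = N / 3;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 3) {
        if (rank == 0) fprintf(stderr, "this sketch needs exactly 3 tasks\n");
        MPI_Finalize();
        return 1;
    }

    double *a  = malloc((size_t)N * N * sizeof *a);      /* first matrix    */
    double *bt = malloc((size_t)N * third * sizeof *bt); /* my third of B^T */
    double *ct = malloc((size_t)N * third * sizeof *ct); /* my third of C^T */

    if (rank == 0) {
        double *btfull = malloc((size_t)N * N * sizeof *btfull);
        double *cfull  = malloc((size_t)N * N * sizeof *cfull);
        for (int i = 0; i < N * N; i++) { a[i] = 1.0; btfull[i] = 2.0; }

        MPI_Bcast(a, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* one contiguous third of B (transposed) to each worker task */
        MPI_Send(btfull + 1 * third * N, third * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(btfull + 2 * third * N, third * N, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);

        multiply_block(a, btfull, ct, third);            /* my own third    */
        for (int i = 0; i < third * N; i++) cfull[i] = ct[i];
        MPI_Recv(cfull + 1 * third * N, third * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &st);
        MPI_Recv(cfull + 2 * third * N, third * N, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &st);
        printf("C[0][0] = %g (expect %g)\n", cfull[0], 2.0 * N);
        free(btfull); free(cfull);
    } else {
        MPI_Bcast(a, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Recv(bt, third * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &st);
        multiply_block(a, bt, ct, third);
        MPI_Send(ct, third * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(a); free(bt); free(ct);
    MPI_Finalize();
    return 0;
}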
VII. FUTURE PLANS

A. State Transfer Mechanism

There is a trade-off between "core dump" transfer and network transfer. Many tools exist to extract useful information from core dumps, so there is some motivation to keep dumping core. This must be weighed against the substantial speedup possible through direct network transfer. Network state transfer is faster for two reasons. First, it avoids the overhead of reading and writing disks, usually over a file-sharing mechanism such as NFS. Second, it transfers only those parts of the data segment and stack that are actually in use, whereas core files contain large unused spaces corresponding to the area into which the stack may grow. Network state transfer will be implemented in order to measure its benefits.

B. Fault Tolerance

Future versions of Hector will tolerate single-node failures; industry experience with NOW processing highlights this need. Initially, fault tolerance will be supported by checkpointing tasks and rolling them back in the event of a node fault. The ability to suspend and restart tasks also leads to a potentially useful feature: tasks could be "duplicated" and run simultaneously on different nodes. By observing the outgoing messages themselves, some form of "voting logic" could be applied to provide fault detection. Even if a task is duplicated only once, the duplicate could take over rapidly and almost transparently in the event of a node fault. To support task duplication, the communications library must transmit duplicate messages to redundant tasks and receive multiple copies of each message. Failure to receive a duplicate of a message, or reception of duplicates that differ, indicates a node fault.
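Because this feature is only planned, the following is a purely speculative sketch of what the "voting" comparison might look like: given the copy of a message received from a task and the copy received from its duplicate, a missing or differing copy is treated as evidence of a node fault. The names and types are invented.

/* Speculative duplicate-message comparison for fault detection. */
#include <stdio.h>
#include <string.h>

enum vote_result { VOTE_OK, VOTE_MISSING_COPY, VOTE_MISMATCH };

enum vote_result vote_on_copies(const char *primary, size_t primary_len,
                                const char *duplicate, size_t duplicate_len)
{
    if (primary == NULL || duplicate == NULL)
        return VOTE_MISSING_COPY;               /* one sender never delivered */
    if (primary_len != duplicate_len ||
        memcmp(primary, duplicate, primary_len) != 0)
        return VOTE_MISMATCH;                   /* the senders disagree       */
    return VOTE_OK;
}

int main(void)
{
    const char a[] = "partial result 42";
    const char b[] = "partial result 42";
    printf("vote: %d\n", vote_on_copies(a, sizeof a, b, sizeof b)); /* 0 = OK */
    return 0;
}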
C. Allocation across Heterogeneous Networks

Allocation strategies for heterogeneous networks are under development. Most institutions now have heterogeneous networks providing workstation connectivity; a single LAN may use ATM, FDDI, and various types of Ethernet, for example. Allocation strategies should be able to take advantage of high-speed sub-networks, ideally in an automated fashion.

VIII. BIBLIOGRAPHY
[1] "The Condor Distributed Processing System", Dr. Dobb's Journal, February 1995, pp. 40–48. Also available at http://www.cs.wis.edu:80/condor
[2] DQS User Manual, DQS Version 3.1.2.3, Supercomputer Computations Research Institute, Florida State University, June 1995.
[3] Sudhir Ahuja, Nicholas Carriero, and David Gelernter, "Linda and Friends", Computer, vol. 19, no. 8, August 1986, pp. 26–34.
[4] H. Bal, M. F. Kaashoek, and A. Tanenbaum, "Orca: A Language for Parallel Programming of Distributed Systems", IEEE Transactions on Software Engineering, vol. 18, no. 3, March 1992, pp. 190–205.
[5] N. Carriero, E. Freeman, D. Gelernter, and D. Kaminsky, "Adaptive Parallelism and Piranha", Computer, vol. 28, no. 1, January 1995, pp. 40–49.
[6] Jeremy Casas, Dan Clark, Phil Galbiati, Ravi Konuru, Steve Otto, Robert Prouty, and Jonathan Walpole, "MIST: PVM with Transparent Migration and Checkpointing", Proceedings of the Third Annual PVM User's Group Meeting, Pittsburgh, PA, May 1995. Also available via http://www.cse.ogi.edu/DISC/projects/mist
[7] Jeremy Casas, Dan Clark, Ravi Konuru, Steve W. Otto, Robert Prouty, and Jonathan Walpole, "MPVM: A Migration Transparent Version of PVM", Usenix Computing Systems Journal, February 1995. Also available via http://www.cse.ogi.edu/DISC/projects/mist
[8] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam, PVM: Parallel Virtual Machine, Cambridge, MA: The MIT Press, 1994.
[9] William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI, Cambridge, MA: The MIT Press, 1994.
[10] C. A. R. Hoare, "Communicating Sequential Processes", Communications of the ACM, vol. 21, no. 8, August 1978, pp. 666–677.
[11] Ewing Lusk, Portable Programs for Parallel Processors, New York: Holt, Rinehart and Winston, 1987.
[12] Phillip C. Miller, Charles E. St. John, and Stuart W. Hawkinson, "FPS T Series Parallel Processor", in Robert G. Babb, ed., Programming Parallel Processors, Reading, MA: Addison-Wesley, 1988, pp. 73–92.
[13] B. Clifford Neuman and Santosh Rao, "The Prospero Resource Manager: A Scalable Framework for Processor Allocation in Distributed Systems", Concurrency: Practice and Experience, vol. 6, no. 4, June 1994, pp. 339–355. Also available via http://nii-server.isi.edu/gost-group/products/prm/prm
[14] Jim Pruyne and Miron Livny, "Providing Resource Management Services to Parallel Applications", Workshop on Job Scheduling Strategies for Parallel Processing, Proceedings of the International Parallel Processing Symposium (IPPS 95), April 15, 1995.
[15] Luis M. Silva, Joao Silva, Simon Chapple, and Lyndon Clarke, "Portable Checkpointing and Recovery", Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing, IEEE Computer Society Press, August 1995, pp. 188–195.
[16] Georg Stellner, "Consistent Checkpoints of PVM Applications", Proceedings of the First European PVM User's Group Meeting, 1994. Also available at http://wwwbode.informatik.tu-muenchen.de/~stellner/CoCheck.html
[17] S. Zhou, M. Stumm, K. Li, and D. Wortman, "Heterogeneous Distributed Shared Memory", IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 5, September 1992, pp. 540–554.