Condor flocking: load sharing between pools of workstations Report 93-104
X. Evers J.F.C.M. de Jongh R. Boontje D.H.J. Epema R. van Dantzig
Faculty of Technical Mathematics and Informatics (Faculteit der Technische Wiskunde en Informatica), Delft University of Technology (Technische Universiteit Delft)
ISSN 0922-5641
Copyright © 1993 by the Faculty of Technical Mathematics and Informatics, Delft, The Netherlands. No part of this Journal may be reproduced in any form, by print, photoprint, microfilm, or any other means without permission from the Faculty of Technical Mathematics and Informatics, Delft University of Technology, The Netherlands. Copies of these reports may be obtained from the bureau of the Faculty of Technical Mathematics and Informatics, Julianalaan 132, 2628 BL Delft, phone +3115784568. A selection of these reports is available in PostScript form at the Faculty's anonymous ftp-site. They are located in the directory /pub/publications/tech-reports at ftp.twi.tudelft.nl.
1 Introduction

The Condor system, developed primarily by M.J. Litzkow, M. Livny, and M. Mutka at the Computer Science Department of the University of Wisconsin - Madison [4], is a batch queueing system for a pool of UNIX workstations connected by a network, which assigns compute-intensive jobs to idle workstations. When organizations that own such Condor pools cooperate, they may wish to share their (otherwise idle) computing power without actually integrating their pools into one large Condor pool. This sharing requires an extension of the Condor mechanism across the boundaries of Condor pools. A collection of Condor pools cooperating through such an extension has been named a Condor flock by M. Livny. In this paper we describe a first design and implementation of a Condor flock. The motivation was to be able to share computing power between NIKHEF (Nationaal Instituut voor Kernfysica en Hogere-Energie Fysica, the Dutch National Institute for Nuclear Physics and High-Energy Physics) and institutes with which it cooperates, such as CERN (Centre Européen pour la Recherche Nucléaire).

UNIX is a trademark of UNIX Systems Laboratories, Inc.

Condor is based on the premise that in some organizations, many compute-intensive jobs are generated while at the same time many workstations are idle (i.e., there is much wait-while-idle time). In a Condor pool, workstations periodically send information on their queue of waiting jobs and their idle/busy status to the Central Manager, which in turn makes the scheduling decisions. A brief summary of Condor can be found in Section 2. Condor flocking is based on the premise that there is also much wait-while-idle time across different Condor pools owned by different departments within an organization, or by different organizations that cooperate. Although standard Condor supports pools of workstations that contain WAN connections, there are at least three reasons for organizations to create a properly designed Condor flock rather than one large Condor pool consisting of their joint machines. First, a flock allows organizations to keep full authority over their own machines; for instance, they can choose the Condor configuration parameters themselves, and withdraw their machines from the flock in the same way as from a single pool. Second, a large pool may put too large a load on the single Central Manager. Finally, a flock is more fault tolerant: in a single pool, the failure of the Central Manager brings all of Condor down, while in a flock, the failure of the Central Manager of some pool leaves the rest of the flock essentially unaffected.

In Section 3, we discuss some design issues for Condor flocking, amongst which the feasibility of maintaining the remote-system-call mechanism across WANs. A restriction imposed on our design was to leave the Central Manager unmodified, because the existence of several versions of the Central Manager would make it very difficult for the Condor design team to provide support.
A new feature should first prove its usefulness before being incorporated in the Central Manager. In our design we have followed the design principles of Condor as much as possible. From a user viewpoint this means that remote jobs do not interfere with the workstation owner's activities, and that Condor flocking is transparent: a job submitted to Condor may be migrated to another pool in the flock. From a system viewpoint this means that remote access is made easy (the same mechanisms for checkpointing and remote system calls are employed, and Condor flocking is implemented completely outside the kernel). The main design decision was to introduce World Machines, one in each of the pools making up the flock, which exchange information on the status of the workstations in the pools, and through which requests for idle machines are exchanged. This information enables them to represent the other pools while behaving as ordinary machines towards the Central Managers. The latter can then give permission to an ordinary workstation to run a job on the World Machine, effectively allowing its transfer to another pool, or give permission to the World Machine to execute a job in its pool, effectively allowing the transfer of a job from another pool to its own pool. While standard Condor is organized rather centrally, our design of a Condor flock, which is described in Section 4, is truly distributed. In Section 5, we present some conclusions.
2 The Condor System

In this section we describe the Condor system as far as is needed for the purposes of this paper. Further details can be found in [3, 4]. There are three main principles that have driven the design of Condor. First, an owner of a workstation should always have exclusive access to it: jobs remotely running on a workstation should interfere as little as possible with the workstation owner's activities. Second, remote access should be easy: Condor is responsible for finding an idle machine, starting the job on the remote machine, handling system calls (for file access etc.), and checkpointing in order to preempt a job when the user of the workstation on which it runs returns and to be able to continue a job after a failure; in short, for executing the job without user intervention. Finally, Condor does not require special programming: a user only has to relink his object files with a special Condor library, and Condor runs outside the UNIX kernel, so it is portable.

Recently, IBM introduced a product based on Condor called LoadLeveler [2].
LoadLeveler is a trademark of International Business Machines Corporation, Inc.
In LoadLeveler, the remote-system-call mechanism is discarded, and all system calls that are allowed are executed locally; this requires the use of a network file system such as NFS or the Andrew File System. An important improvement over Condor is the possibility of having more than one LoadLeveler job on a single workstation. LoadLeveler does not have a facility for flocking.

NFS is a trademark of Sun Microsystems, Inc.
2.1 Condor Components

In a pool of workstations running Condor, one machine is designated as the Central Manager Machine. The Condor control software consists of two daemons on each member of the Condor pool, viz. the scheduler daemon Schedd and the starter daemon Startd, and two daemons on the Central Manager Machine, viz. the Collector and the Negotiator, jointly called the Central Manager. These daemons perform the following functions:
The Schedd maintains the local queue of jobs submitted to Condor, assigns priorities to these jobs, and periodically sends the number of jobs in queue to the Collector.
The Startd determines whether the machine on which it runs is idle, and is responsible for starting and managing the foreign Condor job (there can be only one) if one is running on its machine. The Startd periodically sends the state of its machine to the Collector. A machine is considered idle when it has a load average below some threshold, and its keyboard has not been used for a specific time interval (monitored by the keyboard daemon Kbdd on machines with an X server); a sketch of such an idleness test follows this list.
The Collector maintains information on the state of the pool, which it obtains from the Schedd and Startd daemons.
The Negotiator does the scheduling: it periodically assigns jobs to idle machines, based on information obtained from the Collector and its priority scheme. Subsequently, it sends information on its decisions to the machines in the pool, and the recomputed priorities of the machines to the Collector.
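To make the Startd's idleness test concrete, the following fragment sketches how such a test can be expressed with standard UNIX facilities. The load-average threshold, the keyboard-idle interval, and the use of the console device's access time are illustrative assumptions only; the real Startd relies on its own configuration parameters and on the Kbdd.

    /* Sketch of an idleness test in the spirit of the Startd.  The
     * thresholds and the device that is examined are illustrative
     * assumptions, not Condor's actual configuration.              */
    #include <stdlib.h>        /* getloadavg (BSD-style systems)    */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <time.h>

    #define LOAD_THRESHOLD    0.3        /* assumed load-average limit     */
    #define KBD_IDLE_SECONDS  (15 * 60)  /* assumed keyboard-idle interval */

    int machine_is_idle(void)
    {
        double      load[1];
        struct stat st;

        if (getloadavg(load, 1) != 1)
            return 0;                    /* cannot tell, so assume busy    */
        if (load[0] >= LOAD_THRESHOLD)
            return 0;                    /* machine is doing other work    */

        /* Approximate keyboard activity by the access time of the console
         * device; the real Startd uses the Kbdd and the X server instead. */
        if (stat("/dev/console", &st) != 0)
            return 0;
        if (time(NULL) - st.st_atime < KBD_IDLE_SECONDS)
            return 0;                    /* keyboard was used recently     */

        return 1;                        /* idle: may serve a Condor job   */
    }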
To illustrate how the daemons cooperate, we will follow a Condor job from the moment of its submission by a user on the initiating machine until it finishes on the execution machine, or until the owner of the latter returns (see Figure 1). By means of a special command, a user can submit a job as a Condor job, which means that it is a candidate for transfer by Condor to another machine. The local Schedd includes this job in its local queue. Periodically (usually every 5 minutes), the Negotiator tries to make as many scheduling decisions as possible, based on the information of the Collector. We will not go into the details of the scheduling algorithm used by the Negotiator (the Up-Down algorithm, see [7]), so suppose that our job is assigned to an idle machine. On the initiating machine, the Schedd spawns a Shadow process to support the remote job. The Shadow contacts the Startd on the execution machine, which spawns a process called the Starter, which in turn is responsible for starting and managing the remotely running job. The Shadow transfers the checkpoint file (see below) of the Condor job to the Starter, which spawns the remote job. The Starter periodically gives the job a "checkpoint" signal, causing the user job to save its file state (which is maintained by the Condor C-library) and make a core dump, which is used by the Starter to create a new checkpoint file. The Starter restarts the job from the new version of the checkpoint file, and sets a timer for the next time it has to give the job a checkpoint signal. When the job finishes, the Starter and the Shadow clean up, and the user is notified by mail. When the owner of the execution machine returns, the Startd sends a "suspend" signal to the Starter, which suspends the job. If the machine remains busy for some time, the Startd sends a "vacate" signal to the Starter, which will abort the job and return the last checkpoint file to the Shadow. No new checkpoint is made, in order not to interfere with the user, and all work done since the last checkpoint is lost.
Figure 1: Condor daemons and one Condor job. (The figure shows the Central Manager Machine running the Central Manager, the Initiating Machine running a Schedd, Startd, Kbdd, and a Shadow, and the Execution Machine running a Schedd, Startd, Kbdd, a Starter, and the User Job.)
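The walkthrough above mentions three commands that travel from the Startd to the Starter: "checkpoint", "suspend", and "vacate". The paper does not name the UNIX signals behind these commands, so the sketch below merely illustrates the pattern with an assumed signal mapping and invented handler names.

    /* Sketch of a Starter-like process reacting to the Startd's commands.
     * The signal numbers are assumptions made for illustration; the text
     * names the commands ("checkpoint", "suspend", "vacate"), not signals. */
    #include <signal.h>
    #include <sys/types.h>

    static pid_t job_pid;                 /* the user job being managed     */

    static void on_checkpoint(int sig)    /* periodic checkpoint request    */
    {
        (void)sig;
        kill(job_pid, SIGUSR1);           /* assumed: job's library saves its
                                             file state and dumps core      */
    }

    static void on_suspend(int sig)       /* the workstation owner returned */
    {
        (void)sig;
        kill(job_pid, SIGSTOP);
    }

    static void on_vacate(int sig)        /* owner stayed: abort the job    */
    {
        (void)sig;
        kill(job_pid, SIGKILL);           /* the last checkpoint file is
                                             returned to the Shadow later   */
    }

    void install_starter_handlers(pid_t job)
    {
        job_pid = job;
        signal(SIGALRM, on_checkpoint);   /* timer-driven checkpoints       */
        signal(SIGTSTP, on_suspend);      /* assumed "suspend" signal       */
        signal(SIGUSR2, on_vacate);       /* assumed "vacate" signal        */
    }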
2.2 Remote System Calls

Condor employs the Remote UNIX facility [3] to execute jobs on machines different from the ones on which they were started. A Condor job runs with the illusion that it is operating in the environment of the initiating machine. Most of the system calls that are allowed, which mainly have to do with file I/O, are directed back to the initiating machine, where they are handled by the Shadow.
This remote-system-call mechanism is implemented by means of a special version of the C-library. This library has a stub for each UNIX system call. These stubs either execute a request locally by mimicking the normal stubs, or pack the request into a message which is sent to the Shadow. The Shadow executes the system call on the initiating machine, packs the results, and sends them back to the stub. The stub then returns to the application program in exactly the same way the normal system call would have, had the call been local. The Shadow runs with the same user and group ids, and in the same directory, as the user process would have had it been executing on the initiating machine. If a network file system such as NFS is used, an important optimization is possible by accessing files directly from the execution machine.
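As an illustration of the stub pattern just described, the fragment below shows what a stub for the read system call could look like. The helper shadow_call and the flag executing_remotely are invented names standing in for Condor's real marshalling code; only the overall local-versus-remote structure follows the text.

    /* Schematic stub for the read system call in the spirit of Condor's
     * special C-library.  shadow_call and executing_remotely are invented
     * names standing in for the real request/reply plumbing.             */
    #include <unistd.h>
    #include <sys/syscall.h>

    extern int  executing_remotely;   /* nonzero when running under Condor */
    extern long shadow_call(int syscall_no, long a1, long a2, long a3);

    ssize_t read(int fd, void *buf, size_t count)
    {
        if (!executing_remotely)
            /* mimic the normal stub: trap into the local kernel           */
            return syscall(SYS_read, fd, (long)buf, (long)count);

        /* pack the request, send it to the Shadow on the initiating
         * machine, and return its result as if the call had been local
         * (the invented shadow_call also copies the data read into buf)   */
        return (ssize_t)shadow_call(SYS_read, fd, (long)buf, (long)count);
    }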
2.3 Checkpointing

Condor provides a transparent checkpoint mechanism which allows it to achieve the effect of process migration. When the owner of an execution machine returns, ideally the Condor job should be migrated. Condor simulates a real process-migration facility by periodically making checkpoints with enough information to restart jobs at a later time, possibly on a different machine. In UNIX, the kernel maintains much information on processes, such as open files and information relating to signals, timers, etc. In order to make checkpointing feasible, many UNIX system calls are prohibited; for example, the fork and exec system calls are not allowed, so Condor only supports single-process jobs, which can be checkpointed relatively easily. For the type of applications for which Condor was designed (compute-intensive work), this approach is reasonable. A checkpoint made by Condor is itself an executable file. Before a Condor job is started for the first time, its executable file is modified to look exactly like a checkpoint file. The text segment in every subsequent checkpoint is an exact copy of the text segment in the previous checkpoint. The data and stack segments are copied from the core file into the new checkpoint file. The setjmp/longjmp facilities of the C-library are used to save the register contents and the program counter. Information about currently open files is gathered by the stubs of the file system calls (such as open, close, and dup) in Condor's special C-library. The exact way in which Condor makes a checkpoint file can be found in [6].
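The register-saving part of this mechanism can be illustrated with a small sketch. The helper write_segments_to_checkpoint is an invented placeholder; the actual procedure for writing a checkpoint file is described in [6].

    /* Minimal sketch of how setjmp/longjmp capture and restore the register
     * contents and program counter during checkpoint and restart.  The
     * helper write_segments_to_checkpoint is an invented placeholder for
     * dumping the data and stack segments (which include ckpt_context).    */
    #include <setjmp.h>

    static jmp_buf ckpt_context;

    extern void write_segments_to_checkpoint(void);   /* hypothetical */

    void do_checkpoint(void)
    {
        if (setjmp(ckpt_context) == 0) {
            /* first return: save the context and write the checkpoint file */
            write_segments_to_checkpoint();
        } else {
            /* second return: we get here after a restart, once the restored
             * image has called longjmp(ckpt_context, 1); execution simply
             * continues where the checkpoint was taken                      */
        }
    }

    void do_restart(void)
    {
        /* called in the restarted executable after the data and stack
         * segments have been reloaded from the checkpoint file             */
        longjmp(ckpt_context, 1);
    }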
3 Design Issues for Condor Flocking

In this section we discuss some design alternatives for Condor flocking and the choices we made during the design, on which Section 4 provides more details. We call the pool where a Condor job is submitted the initiating pool, and the pool where the execution machine is situated the execution pool.
3.1 Remote Versus Local System Calls

The first design issue is whether, in a flock, system calls are to be executed on the initiating machine or on the execution machine. This depends largely on two issues: the average frequency of system calls made by Condor jobs, and the average delay involved in remote system calls. When the frequency of system calls or the delay of remote system calls is low, remote system calls can be used to preserve the local execution environment of a Condor job. Otherwise, it is necessary to execute system calls on the execution machine. The latter poses several problems, though, since the execution environment (file naming, time zone, etc.) will be different in the execution pool.

As an example of the difference in system call delays, we measured the delay of the gettimeofday system call. We compared five situations: a local system call on a machine in the NIKHEF pool, a remote system call between two machines in the NIKHEF pool, and a remote system call between a machine in the NIKHEF pool and one at either FWI (Faculteit der Wiskunde en Informatica, the Department of Mathematics and Computer Science of the University of Amsterdam), CERN, or the University of Wisconsin - Madison, which in each case was made part of the pool at NIKHEF. Because of the way our implementation of Condor flocking works (to be described in Section 4), these results do not differ from what would be found in actual flocking situations. The results are shown in Table 1. Each delay shown is the average over several measurements (each consisting of 1000 system calls) during various periods of the day.

    situation                     best     worst    mean
    local,  NIKHEF                0.038    0.110    0.061
    remote, NIKHEF -> NIKHEF      4.0      12.3     6.6
    remote, NIKHEF -> FWI         5.7      12.2     8.8
    remote, NIKHEF -> CERN        62       229      121
    remote, NIKHEF -> Wisconsin   164      1294     485

    Table 1: Delay of the gettimeofday system call in milliseconds.

Although the results only give an indication of the delays that may be encountered, the delays of remote system calls to the pools at CERN and Wisconsin are much higher than to the pool at FWI, where they are comparable to the delay of remote system calls within the pool at NIKHEF.
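The numbers in Table 1 are averages over runs of 1000 gettimeofday calls each. A measurement program in the spirit of these runs is sketched below; the exact measurement setup is not given in the paper, so this is only an assumed reconstruction. When such a program is linked with the Condor library and executed remotely, every call in the loop travels to the Shadow.

    /* Sketch of a timing run for Table 1: average the cost of 1000
     * gettimeofday calls.  When linked with the Condor library and run
     * remotely, each call in the loop is forwarded to the Shadow.        */
    #include <stdio.h>
    #include <sys/time.h>

    #define CALLS 1000

    int main(void)
    {
        struct timeval start, stop, dummy;
        double elapsed_ms;
        int i;

        gettimeofday(&start, NULL);
        for (i = 0; i < CALLS; i++)
            gettimeofday(&dummy, NULL);
        gettimeofday(&stop, NULL);

        elapsed_ms = (stop.tv_sec - start.tv_sec) * 1000.0 +
                     (stop.tv_usec - start.tv_usec) / 1000.0;
        printf("average delay: %.3f ms per call\n", elapsed_ms / CALLS);
        return 0;
    }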
We restrict ourselves to situations where it is feasible to use the remote-system-call mechanism of Condor. In some cases, however, performance will be poor, and we realize that flocking (in this form) cannot always be used.
3.2 Centralized Versus Distributed Approach

The second design issue is whether one machine in a flock should collect the status information of all the Condor pools and take the decisions, or whether the decision-making process is to be distributed among the Condor pools. In the latter case, it should be the Central Manager of a pool that allocates the idle machines of its pool to waiting Condor jobs, which may either have been submitted locally or originate from other pools. The most important advantage of making decisions centrally is simplicity. However, an important drawback of the centralized approach is the low reliability of such a system: failure of the machine that takes the decisions results in the collapse of the entire management system. We decided to distribute the authority to migrate Condor jobs among the pools in a flock.
3.3 Source-initiative Versus Server-initiative Approach

There are two possible ways in which Condor pools can decide to move a Condor job from one pool to another: a source-initiative and a server-initiative way. In the source-initiative approach, a Condor pool goes searching for a server when it has a Condor job that it wants to run on a machine of another pool. The initiating pool may decide that a job should be run on a machine of another pool when there are no idle machines in its own pool that can serve this job, or when it expects that a better (faster) server is available in one of the other pools. In the server-initiative approach, a Condor pool that has idle machines that may be used for jobs of the other Condor pools of the flock goes searching for work. Note that in ordinary Condor the approach is both source-initiative and server-initiative: both idle machines and machines with waiting Condor jobs send information to the Central Manager in order for the Central Manager to take the scheduling decisions. We have chosen the server-initiative approach, because in this way the Central Manager of a (source) pool knows which machines are available in the flock when it starts scheduling, since these machines were made available by the Central Managers of their own pools.
3.4 User Identity

Within a Condor pool, jobs run under the UID and GID of the owner of the job, so that files can be accessed via NFS. In a Condor flock it will normally not be possible to run jobs under the UID and GID of the owner, because the owner has no UID on the machines of the execution pool. A solution is to run Condor jobs under the UID "nobody" (this was the case in ordinary Condor before the NFS optimization) or under one of several special Condor flock UIDs. In this case, files that could be accessed via NFS from the execution machine now have to be accessed via remote system calls.
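Running a foreign job under an unprivileged identity can be done with standard UNIX calls, as the following sketch illustrates. The account name "nobody", the function's interface, and the error handling are illustrative; Condor's actual code is more involved.

    /* Sketch: execute a foreign Condor job under an unprivileged identity
     * such as "nobody" instead of the (non-existent) UID of the remote
     * owner.  Account name, arguments and error handling are illustrative. */
    #include <pwd.h>
    #include <stdlib.h>
    #include <unistd.h>

    void exec_foreign_job(const char *checkpoint_path, char *const argv[])
    {
        struct passwd *pw = getpwnam("nobody");

        if (pw == NULL)
            exit(1);                      /* no such account configured      */
        if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0)
            exit(1);                      /* refuse to run with privileges   */

        execv(checkpoint_path, argv);     /* the checkpoint file is itself an
                                             executable (see Section 2.3)    */
        exit(1);                          /* exec failed                     */
    }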
One of the security flaws of Condor is that it does not check whether a Condor job has been linked with the special Condor library. Such a job can therefore easily bypass the Remote UNIX mechanism. The philosophy behind the use of Condor in a single pool is that all users should be trusted in this matter. Clearly, this assumption may be too strong in a Condor flock, which strengthens the need to repair this security flaw.
4 A Design of a Condor Flock

We have designed and implemented a first version of a Condor flock for the situation where it is acceptable to use the remote-system-call mechanism between all the machines of all the pools that are part of the flock. No changes are required to the checkpointing mechanism. The following requirements were imposed on the design:
Nothing should be changed in the Central Manager.
It should be transparent to the user whether a job is executed in the initiating pool or in another Condor pool (except perhaps for a somewhat poorer performance).
It should be possible for a user to prevent one of his Condor jobs from being executed in another pool.
For every Condor pool in the flock, there is a set of pools that are allowed to run jobs on the machines of the pool, and a set of pools where this pool is allowed to run jobs. It should be possible to change these relations between pools.
Failure of daemons designed for Condor flocking or of Condor daemons in other pools of the flock should not affect the ordinary operation of Condor in a single pool.
4.1 The World Machine

The basic idea of our implementation of the flock is to add a virtual machine to each Condor pool, called the World Machine. The World Machine represents, in a Condor pool, all the other Condor pools that are part of the flock. To the Central Manager it looks like a normal machine that can offer jobs for execution and that can serve jobs of the machines of the pool. The World Machine consists of two daemons, the W-Schedd and the W-Startd. Figure 2 shows three Condor pools, which together form a flock.

In each pool there is a flock configuration file, which contains information about the relations between the pool and the other pools of the flock: the names of the pools that are part of the flock, the network addresses of their World Machines, the names of the pools that are allowed to run Condor jobs in this pool, and the names of the pools where this pool is allowed to run Condor jobs.
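The paper does not specify the syntax of the flock configuration file, so the following is a purely hypothetical layout, with invented keywords and host names, that only shows the four kinds of information listed above.

    # Hypothetical flock configuration file for one pool.  The keywords and
    # host names are invented; only the kinds of entries follow the text.
    FLOCK_POOLS          = NIKHEF, FWI, CERN
    WORLD_MACHINE.NIKHEF = world.nikhef.example
    WORLD_MACHINE.FWI    = world.fwi.example
    WORLD_MACHINE.CERN   = world.cern.example
    ALLOW_JOBS_FROM      = FWI, CERN   # pools allowed to run jobs in this pool
    MAY_RUN_JOBS_IN      = FWI         # pools where this pool may run jobs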
Figure 2: A Condor flock. (The figure shows three Condor pools connected by a WAN. Each pool has its own LAN with ordinary machines, each running a Schedd and a Startd, a Central Manager Machine running the Collector and the Negotiator, and a World Machine running the W-Schedd and the W-Startd.)
When the W-Startd and the W-Schedd start up, they read the flock configuration file. The SIGHUP signal is used to make the W-Startd or the W-Schedd re-read the flock configuration file. In this way it is possible to change the relations between the pools of the flock without having to restart the daemons of the World Machine.

Periodically, the W-Startd daemons of a flock exchange information about the idle machines in their pools. To this end, each W-Startd asks the Collector for the status of its pool, and the Collector replies by sending information about each machine. With this information, the W-Startd makes a list of the idle machines in the pool, which it sends to the W-Startd daemons of all pools that have the right to run jobs in this pool. If a W-Startd has not received a new list of idle machines from a pool in which this pool is allowed to run jobs, that pool is considered to be "down", and its current list of idle machines is deleted.
Periodically, the W-Startd chooses a machine from the lists of idle machines received and presents itself, like a normal Startd, to the Central Manager with the characteristics of this machine. This simple policy has been implemented because it is not known what kind of machine is wanted by the waiting jobs. If there are no idle machines, or all the pools are "down", the W-Startd presents itself as a machine that is unavailable. The W-Startd sends a message with status information to the Collector as if it were the machine "world" with the chosen characteristics. Condor allows a user to indicate that his Condor job should not be executed on a particular machine. By doing so for the machine "world", a user can prevent his job from being executed in another pool.
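The periodic behaviour of the W-Startd described in this subsection can be summarized in a sketch. All types and helper functions below are invented for illustration; only the order of the steps follows the text.

    /* Sketch of the W-Startd's periodic duties as described above.  All
     * types and helper functions are invented for illustration.           */
    struct machine_ad {
        char name[64];
        int  idle;                    /* nonzero if the machine is idle     */
        /* ... architecture, memory and other characteristics ...           */
    };

    /* hypothetical helpers standing in for the real protocol code          */
    extern int  collector_query(struct machine_ad *ads, int max);
    extern void send_idle_list_to_peers(const struct machine_ad *ads, int n);
    extern int  receive_idle_lists(struct machine_ad *ads, int max);
    extern void advertise_to_collector(const struct machine_ad *ad);

    void w_startd_period(void)
    {
        struct machine_ad local[128], foreign[512];
        struct machine_ad unavailable = { "world", 0 };
        int n;

        /* 1. tell the pools that may run jobs here which of our machines
         *    are idle                                                       */
        n = collector_query(local, 128);
        send_idle_list_to_peers(local, n);

        /* 2. choose one machine from the idle lists received from the other
         *    pools and present ourselves to our own Central Manager with
         *    that machine's characteristics                                 */
        n = receive_idle_lists(foreign, 512);
        if (n > 0)
            advertise_to_collector(&foreign[0]);   /* simple choice policy   */
        else
            advertise_to_collector(&unavailable);  /* no idle machines, or all
                                                      other pools are "down" */
    }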
4.2 Starting a Job in Another Pool

Figure 3 shows the situation where a machine receives permission from the Central Manager to run a job on the World Machine. The following actions are taken by the daemons to start the job in another pool (the numbers in the list below correspond to those in the figure):
Figure 3: Starting a job. (The figure shows the initiating pool with its Central Manager (Negotiator), the initiating machine running the Schedd and a Shadow, and the World Machine running the W-Startd, and the execution pool with its World Machine running the W-Schedd, its Central Manager (Negotiator), and the execution machine running the Startd and the Starter with the user job; the numbered arrows correspond to the steps below.)
1. The Schedd of the initiating machine receives permission from the Central Manager to run a job on the World Machine.

2. The Schedd marks the job as running and sends a GIVE MACHINE message, consisting of the job requirements, the job preferences, and the job owner, to the W-Startd. This constitutes a deviation from the normal procedure in a Condor pool, where the Schedd directly contacts the execution machine, and the Schedd had to be modified accordingly.

3. The W-Startd receives the GIVE MACHINE message and checks whether there is an idle machine in another pool that meets both the job requirements and the job preferences.
If no such machine is found, the W-Startd searches for an idle machine that meets only the job requirements. If still no machine is found, the W-Startd sends the message NO MACHINE to the Schedd, which marks the job as idle. If an idle machine is found, the W-Startd sends the message REQUEST MACHINE and the job information to the W-Schedd of the pool with the idle machine.

4. The W-Schedd tells its Collector that it wants to run a job and waits for the Negotiator to schedule. The W-Schedd behaves towards the Central Manager in the same way as a normal Schedd.

5. If the W-Schedd receives permission from the Negotiator, it sends the message FOUND MACHINE and the name of the execution machine to the W-Startd that requested the idle machine. If the W-Schedd does not receive permission, it sends the message NOT FOUND to the W-Startd.

6. If an idle machine is found, the W-Startd sends the message MACHINE and the name of the execution machine to the Schedd of the initiating machine; otherwise it sends the message NO MACHINE to this Schedd.

7. If the Schedd receives the message NO MACHINE, it marks the job as idle so that it can be scheduled again by the Central Manager. If the Schedd receives the message MACHINE, it starts a Shadow process for the job, with the job id and the name of the execution machine as arguments. The Shadow sends the message START FRGN JOB and the job information directly to the Startd of the execution machine. If the situation on the execution machine has not changed since its last update to its Central Manager, the Startd responds with an OK.

8. The job is now started by the Shadow and the Startd, and from this moment on it is executed remotely, in the same way as when the initiating machine and the execution machine are located in the same Condor pool. The only difference is that the Condor job does not run under the UID of its owner, but under a special UID.

All messages concerned with starting a Condor job contain the name of the initiating machine and the job id. This is necessary because the daemons of the World Machines may be working for many jobs of different machines and different pools at the same time.
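The messages named in the steps above must all carry enough context for the World Machine daemons to serve many jobs at once. The sketch below lists these messages as a C enumeration (written with underscores) together with an illustrative message structure; the encoding and the field layout are assumptions, while the message names and their stated contents come from the text.

    /* Sketch of the flocking messages used in the steps above.  The names
     * come from the text (written there with spaces); the underscored
     * encoding and the field layout are invented for illustration.        */
    enum flock_msg_type {
        GIVE_MACHINE,      /* Schedd   -> W-Startd: a job wants a machine    */
        NO_MACHINE,        /* W-Startd -> Schedd:   nothing available        */
        REQUEST_MACHINE,   /* W-Startd -> W-Schedd: ask a remote pool        */
        FOUND_MACHINE,     /* W-Schedd -> W-Startd: remote pool granted one  */
        NOT_FOUND,         /* W-Schedd -> W-Startd: remote pool refused      */
        MACHINE,           /* W-Startd -> Schedd:   name of execution host   */
        START_FRGN_JOB     /* Shadow   -> Startd:   start the foreign job    */
    };

    /* Every message carries the initiating machine and the job id, because
     * the World Machine daemons may serve many jobs at the same time.      */
    struct flock_msg {
        enum flock_msg_type type;
        char initiating_machine[64];
        int  job_id;
        char execution_machine[64];  /* used by FOUND_MACHINE and MACHINE     */
        char requirements[256];      /* used by GIVE_MACHINE, REQUEST_MACHINE */
        char preferences[256];
        char owner[32];
    };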
4.3 Limitations of the Design

The idea of the World Machine has allowed us to implement this first version of a flock without any changes to the Central Manager; only the Schedd and the Startd have been changed. We now take a look at the limitations of this design. The assumption has been made that all the machines of the pools in the flock have the same architecture and operating system. If there are different machine types in the flock, there is the possibility that the World Machine presents itself as a machine for which there are no waiting jobs. The restriction that nothing could be changed in the Central Manager implies the following limitations:
The World Machine has a priority that is calculated by the Central Manager in the same way as the priority of a normal machine. This means that the jobs of the World Machine are assigned before the jobs of a normal machine when the latter has a lower priority than the World Machine.
It is possible that jobs are run on the World Machine while there are idle machines in the initiating pool. This is caused by the first-fit allocation algorithm of Condor [1].
Because the W-Startd and W-Schedd use the same port numbers as the normal Startd and Schedd, a complete machine has to be set aside as the World Machine.
Because Condor allows at most one Condor job on a machine, the Negotiator will give permission to at most one machine per schedule interval to run a job on the World Machine. However, in the schedule interval after such a permission was granted, the World Machine can present itself again as an idle machine, and the Central Manager can again give a machine permission to start a job on the World Machine.
If a job from one pool is executed on a machine of another pool, the report returned by the program condor_status gives an inconsistent view of the states of the Condor pools. In the initiating pool, the initiating machine reports the job as running, but there is no corresponding machine that is serving the job. In the execution pool, the execution machine is serving a job while there is no corresponding machine with a remotely executing job. The problem can be solved by letting the World Machine present itself to the Central Manager as a machine which is serving several jobs, and which has several running jobs in its queue.
5 Conclusions

We have described a first design and implementation of Condor flocking, under the restriction that the Central Manager should not be modified. Although the result meets the basic goal of coupling Condor pools, we feel that in a full implementation, the Central Manager should know about the existence of the pools in a flock, and so should be changed. Either the World Machines may be retained, with the Central Manager knowing that they represent other pools or at least multiple machines, or the information exchange between pools may proceed directly between Central Managers instead of between World Machines, which can then be discarded. Because we still use the remote-system-call mechanism, our design may not be suitable for jobs that are relatively I/O-intensive, because of the long network delays. So far, our Condor flocking mechanism has been operational in a test situation with two pools, one at NIKHEF and one at FWI, where it has functioned without problems.
6 Acknowledgments

The work reported on in this paper was carried out at NIKHEF-K (NIKHEF Sectie Kernfysica, the Nuclear Physics section of NIKHEF), where Condor is employed in a pool of workstations. During the design and implementation, we benefitted much from discussions with M. Litzkow and M. Livny of the University of Wisconsin - Madison and with M. Litmaath of NIKHEF and CERN, and from discussions with and support from W. Heubers of NIKHEF and P.M.A. Sloot and other staff members of the Faculty of Mathematics and Informatics of the University of Amsterdam.
References

[1] A. Bricker, M.J. Litzkow and M. Livny, "Condor Technical Summary," Version 4.1b, University of Wisconsin - Madison, 1991.

[2] IBM LoadLeveler: User's Guide, Doc. No. SH26-7226-00, IBM Corporation, 1993.

[3] M.J. Litzkow, "Remote UNIX: Turning Idle Workstations into Cycle Servers," in Proc. of the 1987 Usenix Summer Conference, Phoenix, Arizona, USA, 1987.

[4] M.J. Litzkow, M. Livny and M.W. Mutka, "Condor - A Hunter of Idle Workstations," in Proc. of the 8th Int'l Conf. on Distributed Computing Systems, San Jose, California, USA, 1988, pp. 104-111.

[5] M.J. Litzkow and M. Livny, "Experience with the Condor Distributed Batch System," in Proc. of the IEEE Workshop on Experimental Distributed Systems, Huntsville, Alabama, USA, 1990.

[6] M.J. Litzkow and M. Solomon, "Supporting Checkpointing and Process Migration outside the UNIX Kernel," in Proc. of the 1992 Usenix Winter Conference, San Francisco, California, USA, 1992.
[7] M.W. Mutka and M. Livny, "Scheduling Remote Processing Capacity in a Workstation-Processor Bank Network," in Proc. of the 7th Int'l Conf. on Distributed Computing Systems, Berlin, Germany, 1987, pp. 2-9.

[8] M.W. Mutka and M. Livny, "Profiling Workstations' Available Capacity for Remote Execution," in Performance '87, Proc. of the 12th IFIP WG 7.3 Int'l Symp. on Computer Performance Modelling, Measurement and Evaluation, Brussels, Belgium, 1987, pp. 529-544.