A Literature Study on Scheduling in Distributed Systems

X. Evers

Supervision:

Nationaal Instituut voor Kernfysica en Hoge-EnergieFysica (NIKHEF)
P.O. Box 14882, 1009 DB Amsterdam, The Netherlands
R. van Dantzig (SMC)
W.P.J. Heubers (CSG)
R. Boontje (CSG)

Delft University of Technology
Department of Mathematics and Computing
Operating Systems and Distributed Systems Group
P.O. Box 356, 2600 AJ Delft, The Netherlands
I.S. Herschberg
D.H.J. Epema
J.F.C.M. de Jongh

October 1992

Contents

Preface

1 Introduction
1.1 Distributed systems
1.2 Network and distributed operating systems
1.3 Scheduling in distributed systems
1.4 Load sharing versus load balancing
1.5 Performance

2 A taxonomy of distributed scheduling
2.1 Hierarchical classification
2.2 Static scheduling algorithms
2.2.1 Optimal versus sub-optimal
2.2.2 Approximate versus heuristic
2.2.3 Optimal and sub-optimal-approximate techniques
2.3 Dynamic scheduling algorithms
2.3.1 Distributed versus non-distributed
2.3.2 Cooperative versus non-cooperative
2.3.3 Lower sub-branches of the dynamic branch
2.4 Flat classification characteristics
2.4.1 Adaptive versus non-adaptive
2.4.2 Bidding
2.4.3 Non-pre-emptive versus pre-emptive policies
2.4.4 Load balancing
2.5 Classification based on the level of information dependency
2.5.1 A dichotomy between source- and server-initiative approaches
2.5.2 Information dependency
2.5.3 Canonical scheduling algorithms
2.5.4 Performance metric
2.5.5 Performance analysis

3 Availability of nodes for remote execution
3.1 A simple analytic model
3.2 Profiling workstation usage
3.2.1 Technique of acquiring data
3.2.2 Usage patterns
3.2.3 Correlation of AV and NA periods
3.2.4 The availability of workstations

4 Condor
4.1 Design features
4.2 Remote system calls
4.3 Checkpointing
4.4 Control software
4.5 Priorities
4.6 User interface
4.7 Discussion

Bibliography

Preface

This report is the result of a two-month literature study on scheduling in distributed systems. The assignment was performed within the Computer Systems Group of NIKHEF (National Institute for Nuclear Physics and High-Energy Physics) under the supervision of R. van Dantzig (Spin Muon Collaboration) and D. Epema (Delft University of Technology).

Chapter 1 contains an introduction to distributed systems and distributed scheduling. Chapter 2 describes the taxonomies presented in papers by Casavant and Kuhl [5] and Wang and Morris [24]. These taxonomies are an attempt to provide a common terminology and can be used to classify scheduling algorithms.

Chapter 3 first describes an analytic model of a distributed system to determine the probability that a workstation is idle while another workstation is serving more than one job. The second part of Chapter 3 summarizes the article by Mutka and Livny [19], who obtained the profile of available and non-available periods of a group of workstations. Their work can be used for realistic workload simulation. They found that in a network of workstations, about 70% of the workstations are available for remote execution.

Condor is a facility for executing UNIX jobs on a pool of cooperating workstations. Jobs are queued and executed remotely on workstations at times when those workstations would otherwise be idle. A transparent checkpointing mechanism is provided, and jobs migrate from workstation to workstation without user intervention. Condor was developed by the Computer Science department of the University of Wisconsin-Madison. The Condor system is described in Chapter 4.

Amsterdam, Xander Evers, October 1992.


Chapter 1

Introduction

1.1 Distributed systems

"Distributed systems" is a term used to define a wide range of computer systems, from weakly-coupled systems such as wide area networks or local area networks to strongly-coupled systems such as multiprocessor systems. In this report we look at distributed systems that consist of a number of computers and peripheral resources (e.g. disks, printers, plotters) connected by a local area network. Figure 1.1 shows the architecture of such a distributed system. Each circle is an autonomous unit, consisting of a CPU, memory and peripherals. The physical memory available to each unit is independent of activity on other units. In this chapter such an autonomous unit will be called a node. This architecture is comparable to that of most distributed operating systems (such as the V-system, Amoeba, Mach, etc.).

Fig. 1.1: Architecture of a distributed system.

Distributed systems developed on the basis of local area networks have advantages as well as disadvantages. Some of the advantages are:

- The whole system can have a higher reliability and availability than a single time-sharing system; a few parts of the system can be inoperative without disturbing people using the other parts.
- It is possible to share a few expensive hardware resources such as printers, hard disks, plotters and, of course, processors in a very effective way, as well as sharing software resources.
- In principle a network can have a large total computing power.

Some of the problems of distributed systems based on local area networks are:

- In a network there is no central trusted operating system. Security may dictate that some data may not be transferred to another machine. Thus a user wanting to access these data must use the processor of the node where the data reside.
- The physical distribution of resources may not match the distribution of the demands for services. This implies that some resources may be idle while others are overloaded.
- Even though a node (e.g. a workstation or personal computer) may have significant computational capabilities, its power is often less than that expected of a large mainframe computer.

1.2 Network and distributed operating systems

Goscinski [10] distinguishes two kinds of operating systems for local area networks: network and distributed operating systems. A distributed operating system is one that looks to its users like an ordinary centralized operating system, but runs on multiple, independent nodes. A distributed operating system should:

- control resource allocation to allow their use in the most effective way;
- provide the user with a convenient virtual computer that serves as a high-level programming environment;
- hide the distribution of the resources;
- provide mechanisms for protecting system resources against access by unauthorized users; and
- provide secure communication.

Network operating systems are defined as a collection of operating systems of nodes connected to a network, incorporating modules to provide access to remote resources. Some of the characteristics that distinguish a network operating system from a distributed operating system are:

- Each node has its private operating system, instead of running part of a global, system-wide operating system.
- Users normally work on their own computer (node); use of a different node requires some kind of "remote login", instead of having the operating system dynamically assign processes to nodes.
- Often users must be aware of where each of their files is stored and must move their files between computers with explicit "file transfer" commands, instead of having file placement managed by the operating system. (Transparent distributed file service can be provided to networked UNIX systems by the Sun Network File System, NFS.)

The most important difference between network and distributed operating systems is transparency, that is, the extent to which users need to be aware of the fact that multiple computers are used.

1.3 Scheduling in distributed systems

In a distributed system it can occur that some nodes are idle or lightly loaded while others are heavily loaded. This leads to the opportunity of improving the performance of the distributed system as a whole by remote execution and by migrating jobs from heavily-loaded nodes to idle or lightly-loaded nodes. It is the task of the distributed scheduler to schedule processes to nodes (processors) in some optimal way. In the literature, there is often an implicit distinction between the terms scheduling and allocation. However, Casavant and Kuhl [5] argued that these terms are merely alternative formulations of the same problem. From the resources' point of view, the problem is how to allocate processors to processes; from the user's point of view, the problem is how to schedule processes to processors.

Scheduling for distributed systems is significantly more complex than for single-processor systems. A distributed scheduling policy for a general-purpose system can be logically divided into two components: a local scheduling discipline determines how the CPU resource at a single node is allocated among its resident processes, while a global (load) distributing policy spreads the system workload among the nodes through process migration and/or initial placement. A load scheduler has to allocate new processes to nodes, and may have to reschedule when processes leave the system. Figure 1.2 shows that every processor has its own set of processes, and the local scheduler of each processor determines, according to a local scheduling discipline, which of the runnable processes may run.

Fig. 1.2: Local and global scheduling in a distributed system.

The load scheduling policy can be divided into three parts:

- The transfer policy - to determine when it is necessary to transfer a process from one node to another.
- The selection policy - to determine which process of the selected node should be transferred.
- The location policy - to determine to which node a selected process should be transferred.

The load scheduling problem can be stated shortly as: which process should be moved to where and when, to improve overall performance.

1.4 Load sharing versus load balancing

Solutions to the distributed scheduling problem can be achieved through load sharing and load balancing. Allowing processes to get computation service on idle nodes is called load sharing. Load balancing algorithms strive to equalize the system workload among all nodes of a distributed system. The differences between these two classes of load-distributing algorithms are:

- The objective of a load sharing algorithm is to prevent any node in the system from being idle while processes wait for service. When successful, such an algorithm is called work-conserving. The objective of a load balancing algorithm goes beyond conservation of work. By balancing the workload among nodes, each process residing in the system perceives approximately the same level of contention. If the local scheduling discipline is processor sharing, each resident process receives service at approximately the same rate. The question of distributed scheduling objectives was addressed in a paper by Krueger and Livny [21].
- An important difference is that whereas load sharing is based on initial placement and migration of processes only to idle computers, load balancing may require migration of processes even when no computers are idle.

1.5 Performance

Scheduling has two important intertwined aspects, performance and efficiency. Performance is the part of a system's behavior that encompasses how well the resource (the processors) to be managed is being used to the benefit of all users of the system. Efficiency is concerned with the added cost (or overhead) associated with the distributed scheduler itself. Two factors largely determine the efficiency of the load-distribution facility:

- the communication cost of exchanging load-estimation messages, coordinating the processes involved in a potential process migration, exchanging negotiation messages, transferring the process, and transmitting the results back from the remote computer;
- the processor cost of load estimation based on measurement of the current load, and of packing data for transmission and unpacking it upon reception.

The performance of a load-distribution algorithm may be viewed from either the system or the user point of view. When considering performance from the user point of view, the metric involved is often one of minimizing individual process completion times (response or turn-around time). Alternatively, what counts from the system point of view in evaluating performance is the overall rate of process execution (throughput). Casavant and Kuhl [5] stated that there is an inherent conflict in trying to optimize both response and throughput.

Krueger and Livny [21] formulated the goal of a scheduling algorithm as allocating resources in such a way that the performance expectations of the users are most closely met. User performance expectation centers on the quality of service provided to the processes they initiate, and on fairness. Two users simultaneously initiating equivalent processes expect to receive about the same quality of service. Similarly, a user submitting the same job several times, under equivalent workloads, expects each run to receive about the same quality of service. To ensure fairness, the variance in quality of service under a given workload should be acceptably low, and the variation in quality of service received by processes should be strictly random. Both wait time and wait ratio are accepted measures of the quality of service received by a process. Wait time is defined as the total amount of time a process spends waiting for resources, while wait ratio is the wait time per unit of service.

Krueger and Livny state the following about the choice between wait time and wait ratio as a performance metric:

Which measure is used carries an implied assumption about the importance of fairness. The use of wait time implies that the important factor in assessing quality of service is the absolute amount of time one waits for a resource, regardless of one's service demand. A person wanting to check out a book from a library and a person requesting a complete library tour would be considered to have received equal service if each waited the same amount of time for the attention of the librarian. The use of wait ratio implies that the important factor is the amount of time one waits for a resource relative to one's service demand. In providing equal quality of service, the person requesting an exhaustive library tour would be expected to wait longer for that service than the person wanting to borrow a book.
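As a purely numerical illustration of the two measures (the jobs and the numbers below are invented, echoing the library example above):

# Toy computation of the two quality-of-service measures discussed above.
# wait time  = total time spent waiting for resources
# wait ratio = wait time per unit of service received
jobs = {
    "borrow_book":  {"wait": 10.0, "service": 1.0},    # short service demand
    "library_tour": {"wait": 10.0, "service": 60.0},   # long service demand
}
for name, j in jobs.items():
    ratio = j["wait"] / j["service"]
    print(f"{name}: wait time = {j['wait']:.1f}, wait ratio = {ratio:.2f}")
# Equal wait times (10.0) but very different wait ratios (10.00 versus 0.17):
# by the wait-time measure both jobs received equal service, by the wait-ratio
# measure the short job was treated far worse.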


Chapter 2

A taxonomy of distributed scheduling

There is a wide variety of approaches to the problem of scheduling in general-purpose distributed computing systems. This variety makes it difficult to compare different systems meaningfully, since there is no uniform means for qualitatively or quantitatively evaluating them. In a paper by Casavant and Kuhl [5] a taxonomy of distributed scheduling has been presented in an attempt to provide the common terminology and classification mechanism necessary for addressing this problem. A taxonomy is a classification according to a reasonably small set of fundamental distinguishing features. This chapter also discusses the taxonomy presented by Wang and Morris [24]. They classified distributed scheduling algorithms into source-initiative and server-initiative strategies, and also characterized algorithms according to the degree of information dependency involved.

The taxonomy proposed by Casavant and Kuhl [5] is a hybrid of a hierarchical classification scheme and a flat classification scheme. The taxonomy is hierarchical as long as possible in order to reduce the total number of classes, and flat where the descriptors of the system may be chosen in an arbitrary order. The levels in the hierarchy have been chosen in order to keep the description of the taxonomy itself small, and do not necessarily reflect any ordering of importance among characteristics. This point should be emphasized especially with respect to the positioning of the flat portion of the taxonomy near the bottom of the hierarchy.

2.1 Hierarchical classification

The structure of the hierarchical portion of the taxonomy is shown in Fig. 2.1. At the highest level, we may distinguish between local and global scheduling. Local scheduling is concerned with the assignment of processor time of a single processor to processes. Global scheduling is the problem of deciding where to execute a process. The job of local scheduling is left to the operating system of the processor to which the process is ultimately allocated. This gives processors (nodes) increased autonomy while reducing the responsibility (and consequently the overhead) of the global scheduling mechanism. This does not imply that global scheduling must be done by a single central authority, but rather that the problems of local and global scheduling can be seen as separate issues, and that (at least logically) separate mechanisms are at work solving each.

Fig. 2.1: Hierarchical classification (local versus global; global into static and dynamic; static into optimal and sub-optimal, with sub-optimal split into approximate and heuristic; dynamic into physically distributed and physically non-distributed, with the distributed branch split into cooperative and non-cooperative; optimal and sub-optimal-approximate methods into enumerative, graph theory, mathematical programming, and queueing theory).

The next level in the hierarchy (beneath global scheduling) is a choice between static and dynamic scheduling. This division is based on the time at which the scheduling decisions are made. Static scheduling means assigning processes to processors at compile time (or earlier). Dynamic scheduling means that processes are assigned to a processor when they begin execution (run-time assignment), and that they may be reassigned while they are running. Another difference is that a static scheduler makes decisions based only on information regarding the processes (expected execution time, I/O characteristics, etc.) and the static system (processor power, network configuration, etc.), while a dynamic scheduler takes into account the current state of the system (workload, queue lengths, etc.). Static scheduling algorithms are discussed in Section 2.2 and dynamic scheduling algorithms are discussed in Section 2.3.

2.2 Static scheduling algorithms

In the case of static scheduling, information regarding the total mix of processes in the system, as well as all the independent subtasks involved in a job, is assumed to be available by the time the program object modules are linked into load modules. A static scheduling algorithm assigns each executable image in a system to a particular processor, and each time that process image is submitted for execution, it is assigned to that processor. Over a period of time, the topology of the system may change, but the characteristics describing the collection of jobs remain the same. Hence, the scheduler may generate a new assignment of processes to processors to serve as the schedule until the topology changes again.

The principal advantage of static scheduling is its simplicity, because system state information need not be maintained. It is also effective when the workload can be sufficiently well described before making a decision. However, it fails to adjust to fluctuations in the system load.

2.2.1 Optimal versus sub-optimal

Because all information regarding the state of the system, as well as the resource needs of a process, is known, an optimal assignment can be made based on some criterion function. The following optimization measures are in common use:

- Minimizing total process completion time.
- Maximizing utilization of resources in the system.
- Maximizing system throughput.

Because of the size of a typical distributed system (a large number of processes, processors, and other resources that impose restrictions), static scheduling is a complex computational problem. Thus, obtaining optimal solutions can be very expensive and in many cases not feasible in a reasonable time period. In this event sub-optimal solutions may be tried. Within the realm of sub-optimal solutions to the scheduling problem, we may think of two general categories.

2.2.2 Approximate versus heuristic

The approximate approach uses the same formal computational model for the algorithm, but instead of searching the entire solution space for an optimal solution, we are satisfied when we find a 'good' one. These solutions are categorized as sub-optimal-approximate. The problem is how to determine that a solution is good enough. The factors which determine whether this approach is worth pursuing include:

- Availability of a function to evaluate a solution.
- The time required to evaluate a solution.
- The ability to judge, according to some metric, the value of an optimal solution.
- Availability of a mechanism for intelligently pruning the solution space.

The second category of sub-optimal algorithms is based on heuristic search strategies. This branch represents the solutions to the static scheduling problem which require the most reasonable amount of time and other system resources to perform their function. The most distinguishing feature of heuristic schedulers is that they make use of special parameters which affect the system in indirect ways. Often, such a parameter is correlated to system performance in an indirect rather than a direct way, and this alternate parameter is much simpler to calculate.

Casavant and Kuhl give as an example clustering groups of processes which communicate heavily on the same processor, while physically separating processes which would benefit from parallelism. This decreases the overhead involved in passing information between processors while reducing the interference among processes which may run without synchronization with one another; a sketch of such a heuristic is given below.
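The following is a minimal sketch of the clustering heuristic just described. The communication matrix, process names and the greedy pairing rule are invented for illustration and are not taken from [5].

# Greedy sketch of the heuristic mentioned above: place the most heavily
# communicating pair of processes on the same processor, then the next pair,
# and so on, while spreading unrelated processes for parallelism.
# All data below are hypothetical.
comm = {("A", "B"): 90, ("A", "C"): 5, ("B", "C"): 10, ("C", "D"): 80}
processors = {0: [], 1: []}
placed = {}

for (p, q), _volume in sorted(comm.items(), key=lambda kv: -kv[1]):
    for proc in (p, q):
        if proc not in placed:
            # co-locate with an already placed partner if possible,
            # otherwise put the process on the least loaded processor
            partner = q if proc == p else p
            if partner in placed:
                target = placed[partner]
            else:
                target = min(processors, key=lambda n: len(processors[n]))
            processors[target].append(proc)
            placed[proc] = target

print(processors)   # the heavy pairs (A,B) and (C,D) end up on separate processors

Note that the heuristic never evaluates total completion time directly; the communication volume is exactly the kind of indirect, cheap-to-compute parameter described above.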

2.2.3 Optimal and sub-optimal-approximate techniques

Regardless of whether a static solution is optimal or sub-optimal-approximate, there are four basic categories of task allocation algorithms which can be used to arrive at an assignment of processes to processors:

- Solution space enumeration and search.
- Graph theoretic.
- Mathematical programming.
- Queueing theoretic.

2.3 Dynamic scheduling algorithms

In the dynamic scheduling problem, the more realistic assumption is made that very little a priori knowledge is available about the resource needs of a process. It is also unknown in what environment the process will execute during its lifetime. In the static case, a decision is made for a process image before it is ever executed, while in the dynamic case no decision is made until the process begins its life in the dynamic environment of the system.

2.3.1 Distributed versus non-distributed

The next issue (beneath dynamic solutions) involves whether the status information of all processors and execution environments is to be collected at one location (physically non-distributed), or whether the decision-making process is to be physically distributed among the processors, which utilize information stored in many places. The most important feature of making decisions centrally is simplicity. However, such systems suffer from a number of drawbacks. The first drawback is that gathering information and maintaining an up-to-date system state in one central location can lead to a large time overhead, since there are transfer delays and messages may be lost. This is a serious problem because it is known from the theory of optimization that decisions based on information that is not up-to-date can be worse than no decisions at all. The second drawback is the low reliability of such systems: failure of the node on which the load scheduling is done results in the collapse of the entire management system. A globally distributed load-scheduling algorithm does not suffer from the above drawbacks. The bottleneck of collecting status information at a single site is avoided, and schedulers can react quickly to dynamic changes in the system state. If one computer running a scheduler fails, the others can continue making their decisions.

2.3.2 Cooperative versus non-cooperative

Within the realm of distributed dynamic global scheduling, there are mechanisms which involve cooperation between the distributed components (cooperative) and mechanisms in which the individual processors make decisions independent of the actions of the other processors (non-cooperative). The question here is one of the degree of autonomy of each processor in determining how its own resources should be used. In the non-cooperative case, individual processors make decisions regarding the use of their resources independent of the effect of their decisions on the rest of the system. In the cooperative case, all processors are working toward a common system-wide goal. In other words, each processor's local operating system is concerned with making decisions in concert with the other processors in the system in order to achieve some global goal, instead of making decisions based only on the way in which the decision will affect local performance.

2.3.3 Lower sub-branches of the dynamic branch

As in the static case, the taxonomy tree has reached a point where we may consider optimal, sub-optimal-approximate, and sub-optimal-heuristic solutions. Optimal solutions may be optimal locally or globally. When optimal solutions are computationally infeasible, it is necessary to use sub-optimal solutions. As for static algorithms, optimal and sub-optimal dynamic algorithms can be based on mathematical programming, queueing theory, graph theory, and solution space enumeration and search.

2.4 Flat classification characteristics

In addition to the hierarchical portion of the taxonomy, there are a number of other distinguishing characteristics of scheduling algorithms. This section deals with characteristics which do not fit uniquely under any particular branch of the tree-structured taxonomy given thus far, but which are important in the way they describe the behavior of a scheduler.

2.4.1 Adaptive versus non-adaptive

An adaptive solution to the scheduling problem is one in which the algorithms and parameters used to implement the scheduling policy change dynamically according to the previous and current behavior of the system, in response to previous decisions made by the scheduling system. In contrast to an adaptive scheduler, a non-adaptive scheduler is one which does not necessarily modify its basic control mechanism on the basis of the history of system activity.

The terms dynamic scheduling and adaptive scheduling are sometimes used in the literature in an inconsistent manner. In a dynamic situation, the scheduler takes into account the current state of affairs as it perceives it in the system. This is done during the normal operation of the system under a dynamic and unpredictable load. In an adaptive system, the scheduling policy itself reflects changes in its environment, the running system. Whereas a dynamic solution takes environmental inputs into account when making its decisions, an adaptive solution takes environmental stimuli into account to modify the scheduling policy itself. It is easy to see that it is impossible to have a static adaptive scheduler.

2.4.2 Bidding

In this class of policy mechanisms, a basic protocol framework exists which describes the way in which processes are assigned to processors. Each node in the network is responsible for two roles with respect to the bidding process: manager and contractor. The manager represents a process in need of a location to execute, and the contractor represents a node which is able to do work for other nodes. A single node takes on both roles; there are no nodes which are strictly manager or contractor alone. The bidding system works in the following way:

- The manager announces the existence of a process to be executed by broadcasting, multicasting, or directing a process-announcement (request for bid) message. At the same time its computer can receive announcement messages from other managers.
- Available contractors evaluate received announcements and submit bids on those for which they are suited.
- The manager evaluates received bids and awards contracts to the nodes it determines to be the most appropriate. The contracts are multicast to the bidders.

It can also happen that no bid is available for a given process; in this case no contract is awarded. According to the bidding mechanism, a contractor is allowed to partition a process and award contracts to other computers in order to improve performance; in this case the contractor becomes a manager. There are a number of important features of load-distribution algorithms based on bidding. The first is that the effectiveness and performance of these algorithms depend on the amount and type of information exchanged during negotiation. The second is that all computers have full autonomy: managers can decide to which computer to award a contract, and contractors have the freedom to respond to announcement messages by sending or not sending bids, and, if sending, to choose to which manager.

2.4.3 Non-pre-emptive versus pre-emptive policies

Global dynamic load-scheduling systems are non-pre-emptive (one-time assignment) if they assign only newly-created processes to processors, that is, processes are not reassigned once they commence execution. Pre-emptive (dynamic reassignment) load-scheduling systems are those in which processes in execution may be interrupted, moved to another processor, and resumed in a new computational environment. There are performance benefits offered by pre-emptive (migratory) load distribution beyond those offered by non-pre-emptive (non-migratory) load distribution. It has been emphasized that the ability of pre-emptive load-distribution systems to react to workload fluctuations allows both an increase in throughput and a decrease in process response time. But the costs of process migration are often high.

2.4.4 Load balancing

In Chapter 1, the solutions to the distributed scheduling problem were already divided into load-sharing and load-balancing solutions. The basic idea of load balancing is to attempt to balance (in some sense) the load on all processors in such a way as to allow all processes on all nodes to make progress at approximately the same rate. This solution is most effective when the nodes of the system are homogeneous, since this allows all nodes to know a great deal about the structure of the other nodes. Normally, information would be passed around the network periodically or on demand, in order to allow all nodes to obtain a local estimate of the global state of the system. Then the nodes act together in order to remove work from heavily loaded nodes and place it at lightly loaded nodes. This is a class of solutions which relies heavily on the assumption that the information at each node is quite accurate, in order to prevent processes from endlessly being circulated around the system without making much progress (processor thrashing).

2.5 Classification based on the level of information dependency

In a paper by Wang and Morris [24] a taxonomy was presented which is based on two other distinguishing features of distributed scheduling algorithms. They considered a distributed computer system with a configuration as shown in Figure 2.2.

Fig. 2.2: A logical model of a distributed computer system: sources 1 to N submit jobs over a communication medium to servers 1 to K.

Jobs enter the system via nodes called sources and are processed by nodes called servers. This configuration represents a logical view of the system; a single physical processor might be both a source and a server node. The authors assume that the jobs are individually executable, are logically independent of one another, and can be processed by any server. They only consider non-pre-emptive load-distribution strategies, i.e., strategies which distribute jobs to servers irrevocably. For simplicity, the authors assume that all servers have the same processing rate. They present their taxonomy as a taxonomy of load sharing, because a server serves only one job at a time, but it can also be used to classify load-balancing algorithms.

2.5.1 A dichotomy between source- and server-initiative approaches

The first distinction they draw demarcates the type of node that takes the initiative in the global searching. If the source node makes the determination as to where to route a job, the strategy is called source-initiative. On the other hand, if the server 'goes looking for work', i.e., determines which jobs at the different sources it will process, then the strategy is called server-initiative. In a source-initiative algorithm, queues tend to form at the servers, whereas in a server-initiative algorithm queues tend to form at the sources. Another difference is that in source-initiative algorithms, scheduling decisions are usually made at job arrival epochs (or a subset thereof), whereas in server-initiative algorithms, scheduling decisions are usually made at job departure epochs (or a subset thereof). One of the most important conclusions presented by Wang and Morris is that with the same level of information available, server-initiative algorithms have the potential to outperform source-initiative algorithms.

2.5.2 Information dependency

The other axis of classification the authors propose refers to the level of information dependency that is embodied in a strategy. By this is meant the degree to which a source node knows the status of servers, or a server knows the status of sources. Naturally, as the level of information available increases, we expect to be able to obtain strategies with improved performance. But as more information is exchanged, communication cost may increase and more sophisticated software or hardware mechanisms may become necessary. A source-initiative algorithm can be described by a function server = f(...), by which is meant that the server selected by the source to process a job is a function of the arguments of f. Similarly, server-initiative algorithms are described by a function source = f(...), which determines the source at which a server seeks its next job. The taxonomy is presented in Table 2.1. There are seven levels of information dependency, which range from static scheduling, where no information about the system status is required, to strategies where extensive information is required and algorithms such as global first come first served (FCFS), one ideal of multiserver behavior, are attained.
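As a small illustration of what the level of information dependency means for a source-initiative strategy, the two routing functions below follow the server = f(...) notation used above; the queue data are hypothetical and the code is only a sketch, not taken from [24].

# Two source-initiative routing rules at different information levels.
# Level 2: server = f(source, omega)                 -- randomization only
# Level 5: server = f(source, omega, server queue lengths)
import random

def route_level2_random_splitting(num_servers, rng=random):
    # Needs no knowledge of server state: pick a server uniformly at random.
    return rng.randrange(num_servers)

def route_level5_join_shortest_queue(queue_lengths, rng=random):
    # Needs the current queue length of every server (level 5 information).
    shortest = min(queue_lengths)
    candidates = [k for k, q in enumerate(queue_lengths) if q == shortest]
    return rng.choice(candidates)   # break ties randomly and uniformly

if __name__ == "__main__":
    random.seed(0)
    print(route_level2_random_splitting(3))
    print(route_level5_join_shortest_queue([4, 1, 2]))   # picks server 1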

Level 1. Source-initiative: server = f(source), e.g. source partition. Server-initiative: source = f(server), e.g. server partition.
Level 2. Source-initiative: server = f(source, ω), e.g. random splitting. Server-initiative: source = f(server, ω), e.g. random service.
Level 3. Source-initiative: server = f(source, ω, sequence state), e.g. cyclic splitting. Server-initiative: source = f(server, ω, sequence state), e.g. cyclic service.
Level 4. Source-initiative: server = f(source, ω, sequence state, server busy/idle status), e.g. join idle before busy queue (cyclic offer). Server-initiative: source = f(server, ω, sequence state, source queue emptiness), e.g. cyclic service without polling empty sources.
Level 5. Source-initiative: server = f(source, ω, sequence state, server queue lengths), e.g. join the shortest queue (JSQ). Server-initiative: source = f(server, ω, sequence state, source queue lengths), e.g. serve the longest queue (SLQ).
Level 6. Source-initiative: server = f(source, ω, sequence state, server queue lengths, departure epochs of completed jobs at servers), e.g. JSQ with ties broken by last departure epochs at each server. Server-initiative: source = f(server, ω, sequence state, arrival epochs of jobs at sources), e.g. FCFS.
Level 7. Source-initiative: server = f(source, ω, sequence state, departure epochs of completed and remaining jobs at servers), e.g. FCFS. Server-initiative: source = f(server, ω, sequence state, arrival epochs and execution times of jobs at sources), e.g. shortest job first.

Tab. 2.1: Taxonomy based on information dependency.

The authors explain that the number of levels of information is somewhat arbitrary and could have been aggregated into fewer or expanded into a finer classification. Note that the information levels are arranged so that the information of a higher level subsumes that of the lower levels. The parameter ω plays the role of randomization, which can be viewed as providing additional information dependency.

2.5.3 Canonical scheduling algorithms

The authors define and analyze the performance of some of the canonical scheduling algorithms which were cited in the taxonomy of Table 2.1. They consider the following algorithms, where N denotes the number of sources and K the number of servers. They assume that the local scheduling policies are FCFS.

1. Source Partition (N >= K): Level 1, Source-Initiative. The sources are partitioned into groups and each group is served by one server. Each source in a group sends its jobs to the assigned server.

2. Server Partition (N <= K): Level 1, Server-Initiative. The servers are partitioned into groups and each group of servers serves one source. A FCFS queue is formed at each source and jobs are removed one at a time by one of the available servers. This can be considered to be a dual of source partition.

3. Random Splitting: Level 2, Source-Initiative. Each source distributes jobs randomly and uniformly over the servers, i.e., it sends a job to one of the K servers with probability 1/K.

4. Random Service: Level 2, Server-Initiative. Each server visits sources at random and on each visit removes and serves a single job. After each visit it selects the next source to be visited randomly and uniformly. This can be considered a dual to random splitting. A simple variation on this algorithm which may reduce communication overhead would have the server remove a batch of up to B jobs on each visit.

5. Cyclic Splitting: Level 3, Source-Initiative. Each source assigns its ith arriving job to the (i mod K)th server.

6. Cyclic Service: Level 3, Server-Initiative. Each server visits the sources in a cyclic manner. When a server visits the ith source it removes and serves a batch of up to B jobs and may then return to revisit the ith source for up to a total of V visits, or until the source queue becomes empty. It then moves to the next source queue in a predetermined sequence. The algorithm has two parameters B and V and a visit sequence, which may be different for each server. The main cases considered by Wang and Morris have B = 1, V = 1; B = 1, V unbounded (often called exhaustive cyclic service); and B unbounded, V = 1 (a case of limited cyclic service). Another variation they considered was called 'B = 1, gated': the server revisits a source queue until the queue empties or the server finds a job next in queue that arrived at the source after the server first arrived at that source in a given cycle. This class of algorithms can be considered as a dual to cyclic splitting.

7. Join the Shortest Queue (JSQ): Level 5, Source-Initiative. Each source independently sends an arriving job to the server that has the least number of jobs (including the one in service). Ties are broken randomly and uniformly over the tied servers.

8. Serve the Longest Queue (SLQ): Level 5, Server-Initiative. A dual to JSQ is the algorithm in which, whenever a server becomes available, it serves a job from the longest source queue. Again ties are broken randomly and uniformly.

9. First Come First Served (FCFS): Level 6, Server-Initiative, or Level 7, Source-Initiative. All servers are applied to form an overall multiserver FCFS system. As shown in Table 2.1, it can be considered server-initiative or, alternatively, source-initiative if more information is made available. It is possible to implement FCFS as a source-initiative algorithm if we allow the sources to know which server has the least backlog of work (although this is seldom a practical proposition).

10. Shortest Job First (SJF): Level 7, Server-Initiative. Servers select the currently waiting job that has the smallest service time requirement.

2.5.4 Performance metric

The authors develop a performance metric called the Q-factor (quality of load sharing), which captures some important aspects of load-sharing performance. A good load-sharing algorithm will tend not to allow any server to be idle while there are jobs awaiting processing in the system (work conservation). It will also not discriminate against a job based on the particular source where the job arrives (fairness). One ideal of a global scheduler is to behave like a multiserver first come first served (FCFS) queue. Since jobs originating from a specific user may always arrive at a particular source, they may be subject to systematic discrimination, either favorable or unfavorable. Therefore we can view an overall performance level as being attained only if this level or better is attained by the jobs at every source. This leads to the definition of the Q-factor of an algorithm A as:

Q_A(\rho) = \frac{\text{mean response time over all jobs under FCFS}}{\sup_{\{\lambda_i\}} \max_i \ \text{mean response time at the } i\text{th source under algorithm } A},

where the supremum is taken over arrival rates \lambda_1, \ldots, \lambda_N with \sum_{i=1}^{N} \lambda_i = K\rho\mu, the response time of a job is defined to be the length of time from when the job arrives at the system until it departs from the system, and

N = number of sources,
K = number of servers,
\mu^{-1} = mean service time,
\lambda_i = arrival rate at the ith source,
\rho = aggregated utilization of the system, \rho = \sum_{i=1}^{N} \lambda_i / (K\mu).

This Q-factor provides a number, usually between zero and one, which describes how close the system comes to a multiserver FCFS queue as seen by every job stream. The larger the Q-factor, the better the performance. This measure does not take into account conditional (on service time requirement) or response-time distribution information, but instead attempts to expose the overall load-sharing behavior of the system.
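Once the mean response times are known, evaluating the Q-factor is a single division. The small sketch below uses invented numbers for one particular loading; in the definition above the denominator is the worst case over all loadings with the same aggregate utilization.

# Toy evaluation of the Q-factor for one hypothetical loading of the sources.
fcfs_mean_response = 2.0                      # mean response over all jobs, multiserver FCFS
per_source_mean_response = [2.4, 3.1, 2.7]    # under some algorithm A, one value per source

q_factor = fcfs_mean_response / max(per_source_mean_response)
print(f"Q_A = {q_factor:.2f}")                # 2.0 / 3.1 = 0.65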

2.5.5 Performance analysis

Wang and Morris analyze the performance of the algorithms using exact analysis and, in the cases where exact analysis was not possible, simulation. The complete performance analysis results can be found in their paper [24]. We will now discuss these results briefly.

The following assumptions about the transport delay of the communication medium and the node processor overhead involved in Inter-Process Communication (IPC) were made. The transport delay of the communication was assumed to be zero. The authors state that this assumption is reasonable for many applications and for current local area interconnect technology, which usually has ample bandwidth and provides rapid message passing between processors. The IPC overhead was assumed to consist of a per-message processing time of h_o and h_e at the source and server nodes, respectively. These burdens are supposed to take into account system call, context switch, and communication medium management. Values of h_o and h_e observed in practice are on the order of one or several milliseconds for typical microprocessor technology and are often relatively independent of message length.

Fig. 2.3: Mean response time versus utilization.

Figure 2.3 shows the mean response time over all the jobs for the various algorithms as a function of utilization for the case N = 2, K = 3, \lambda_1 = \lambda_2. The cyclic service (B = 1), random service, and SLQ algorithms produce the same mean response time as FCFS. This is because these algorithms simply result in an interchange of the order of processing relative to FCFS and thus do not change the mean response time. Of the other algorithms shown, SJF gives the best mean response, being the only algorithm considered that has knowledge of the jobs' service times. The poorer performance of JSQ compared to FCFS is attributable to the inadequacy of using the number of jobs in the queue as a load indicator; this creates the possibility that a job waits at one server while another server is idle. The phenomenon of jobs waiting while a server is idle, and the resulting degraded performance, is even more pronounced in the random and cyclic splitting algorithms, which have no knowledge of the servers' states. The cyclic splitting algorithm outperforms the random splitting algorithm because of the increased regularity of the job stream at the servers induced by the cyclic splitting.

If the load were unbalanced between the two sources, then the overall performance of the server-initiative and JSQ algorithms would stay the same, and the imbalance would have only a small effect on the random and cyclic splitting algorithms. It has a major effect on the server partition algorithm, which is unable to adapt to the imbalance, so that one server quickly becomes saturated. This effect can be reduced by repartitioning the servers; a scheduling algorithm that does the repartitioning automatically is called adaptive (see 2.4.1). Another interesting aspect of the failure of source-initiative algorithms is the effect of the service time distribution on overall performance. The overall performance of SLQ and cyclic service (B = 1) is not affected (when normalized to FCFS) by the service time distribution, but the performance of the random and cyclic splitting algorithms degrades rapidly as more variability is introduced into the service time distribution.

Fig. 2.4: Q-factors when the service time distribution is exponential.

Fig. 2.5: Q-factors when the service time distribution is deterministic.

In Figures 2.4, 2.5 and 2.6 the Q-factor is used to summarize performance as a function of \rho. Note that in computing the Q-factor we must find the combination of loadings (the \lambda_i's) which causes the worst performance. For most algorithms the Q-factors degrade rapidly as \rho approaches 1. The only source-initiative algorithm that holds up adequately well is JSQ. The cyclic service algorithms show reasonable Q-factors except at very high utilization, with the 'B = 1, gated' version showing the best performance. Note that algorithms that have identical performance when viewed over all jobs can have different Q-factors.

Fig. 2.6: Q-factors when the service time distribution is hyperexponential.

Figures 2.5 and 2.6 treat the deterministic and hyperexponential service time distributions, respectively, and show that the load-sharing performance of all algorithms degrades as service time variability increases. But the performance of the server-initiative algorithms is less sensitive to the service-time variability. The performance of source-initiative algorithms such as random and cyclic splitting also degrades as the number of servers becomes large. On the other hand, server-initiative algorithms such as cyclic service approach the ideal of load sharing.


Chapter 3

Availability of nodes for remote execution

In this chapter we look at the availability of (processor) nodes for remote execution. In Section 3.1, a group of N nodes connected by a network is viewed as an N*(M/M/1) or M/M/N queueing system. Section 3.2 summarizes the results of a paper by Mutka and Livny [19], who explored the patterns of activity of owners on their workstations. Their general conclusion is that in a network of workstations, about 70% of the workstations are available for remote execution.

3.1 A simple analytic model

A node can be viewed as a single-server queueing system with tasks arriving in accordance with a Poisson process having rate \lambda. In other words, the intervals between consecutive arrivals are independently and identically distributed according to an exponential distribution with mean 1/\lambda. Each task, upon arrival, goes directly into service if the server is free and, if not, joins the queue. When the server finishes serving a task, the task leaves the system, and the next task in line, if there is any, enters service. The tasks are served in a first come first served manner. The successive service times are also assumed to be exponentially distributed, with mean 1/\mu. The above is called the M/M/1 queue.

A group of N nodes connected by a network can be viewed as a system of N identical and independent M/M/1 queueing systems; see Figure 3.1. This model was used by Livny and Melman [17] to show that in a distributed system with more than ten servers, almost all the time a task is waiting for service while a server is idle. The assumption is made that all the servers have the same inter-arrival and service-time distributions. Let P_wi be the probability that the system is in a state in which at least one task waits for service and at least one server is idle; then

P_{wi} = \sum_{i=1}^{N} \binom{N}{i} Q_{N-i} H_i,                                    (3.1)

Fig. 3.1: An N*(M/M/1) system.

where

Q_i = (P_0)^i is the probability that a given set of i servers is idle,

H_i = (1 - P_0)^i - (P_0 (1 - P_0))^i is the probability that a given set of i servers is not idle and at one or more of the servers a task waits for service, and

P_0 = 1 - \rho is the probability that a server is idle.

Fig. 3.2: P_wi as a function of \rho, for N = 5, 10, 20.
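The probability in Eq. 3.1 is easy to evaluate numerically. The short sketch below is a convenience illustration (not code from [17]); it evaluates P_wi at the 65% utilization level that, as discussed next, the curves of Fig. 3.2 identify as the maximum.

# Evaluate P_wi of Eq. 3.1 for N independent M/M/1 servers at utilization rho.
from math import comb

def p_wi(rho, n):
    p0 = 1.0 - rho                                        # a server is idle
    q = lambda i: p0 ** i                                 # Q_i: a set of i servers all idle
    h = lambda i: (1 - p0) ** i - (p0 * (1 - p0)) ** i    # H_i: all busy, at least one task waiting
    return sum(comb(n, i) * q(n - i) * h(i) for i in range(1, n + 1))

if __name__ == "__main__":
    for n in (5, 10, 20):
        # the text states that P_wi peaks when servers are about 65% utilized
        print(n, round(p_wi(0.65, n), 3))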

Figure 3.2 shows the value of P_wi for various values of the server utilization \rho, \rho = 1 - P_0, and the number of servers N. The curves of the figure indicate that for practical values of \rho, P_wi is remarkably high. The high value of P_wi indicates that by distributing the instantaneous load of a multi-resource system, performance can be considerably improved. The shape of the curves shows that for a given number of servers, P_wi reaches its maximum value when the servers are utilized about 65% of the time. As the utilization of the servers increases past the level of 65%, P_wi decreases. This property of P_wi indicates that a 'good' load-sharing algorithm should work less when the system is heavily utilized.

Livny and Melman [17] used as performance measure the expected turnaround time, denoted by W. A reduction in P_wi of a multi-resource system will cause a decrease of W. P_wi can be reduced by transferring tasks from one queue to another, and the expected turnaround time will be minimal if P_wi is zero, in which case the system is work-conserving. In such a case the system will behave like an M/M/N (single queue, N servers) system. P_wi is zero when a task is transferred from one queue to another whenever one of the following events occurs:

- A task arrives at a busy server and there are fewer than N tasks in the system.
- A server completes the service of a task, no other tasks are waiting in its queue, and there are more than N tasks in the entire system.

Therefore a lower bound to the rate of task transfers in order to minimize W is given by

L_T = \sum_{i=1}^{N-1} i\lambda P_i + \sum_{i=1}^{N-1} (N - i)\mu P_{N+i},                                    (3.2)

where Pi is the probability of having i tasks in an M/M/N system:

P_i = \begin{cases} \dfrac{(\lambda/\mu)^i}{i!}\,P_0, & i \le N, \\[2ex] \left(\dfrac{\lambda}{N\mu}\right)^{i}\dfrac{N^N}{N!}\,P_0, & i > N, \end{cases} \qquad P_0 = \left[\sum_{j=0}^{N-1}\frac{(\lambda/\mu)^j}{j!} + \frac{(\lambda/\mu)^N}{N!}\,\frac{N\mu}{N\mu-\lambda}\right]^{-1}.                                    (3.3)

The first summand in Eq. 3.2 is the rate of transfers caused by arriving tasks (the first event). The second summand is the part of the transfer rate caused by departing tasks (the second event). Figure 3.3 gives L_T as a function of the number of servers for various arrival rates \lambda and a fixed service rate \mu = 1.0. The figure shows that in order to achieve P_wi = 0, a large fraction of the tasks has to be transferred.

Fig. 3.3: L_T versus the number of servers, for \lambda = 0.5, 0.7, 0.9 and \mu = 1.0.

3.2 Profiling workstation usage

In a paper by Mutka and Livny [19], the patterns of activity of owners on their workstations, and the extent to which capacity is available for sharing, were explored. A stochastic model of workstation utilization was developed. Their work gives the opportunity to use realistic workloads when evaluating load-sharing policies. Also, when usage patterns are understood, algorithms that take advantage of the patterns can be designed. This section gives a summary of the article; some parts of the article have been copied (almost) literally.

3.2.1 Technique of acquiring data

Mutka and Livny [19] monitored the usage patterns of 11 workstations over a period of ve months. The stations observed were owned by a variety of users; six were owned by the faculty where Mutka and Livny work, three by systems programmers, and two by graduate students. Two additional stations used by systems programmers were only monitored for three months. They obtained the pro le of available and non-available periods of each of the workstations. An available period, AV, is a period during which the workstation is in the available state. A non-available period, NA, is a period during which the workstation is in the non-available state. A workstation is in the non-available state when it is being used, or was recently used by its owner so that the average user cpu usage is above a threshold ( 14 %) or was above the threshold within the previous 5 minutes. The average user cpu usage follows the method the UNIX operating system uses for the calculation of the user load in a similar way as the ps(1) command (process status) computes the cpu utilization of user processes. This load is a decaying average that includes only the user processes, and not the system processes. The value of the threshold is chosen so that activities resulting from programs such as time of day clocks or graphical representations of system load do not generate user loads higher than the threshold. A workstation is in the available state whenever the workstation is not in the non-available state. The workstation usage patterns were obtained by means of a monitoring program on each 24

workstation. The monitor on each station executes as a system job and does not a ect the user load. When the workstation is in the non-available state, the monitor looks at the user's load every minute. When the user's load is below the low threshold for at least 5 minutes, the workstation's state becomes available. During this time the workstation's monitor will have its `screen server' enabled. The monitor looks at the user's load every 30 seconds when the workstation is in the available state. Any user activity, even a single stroke at the keyboard or mouse, will cause the `screen saver' to be disabled and all user windows on the workstation's screen to be redrawn. This activity brings the user load above the threshold, and causes the state to become non-available. If no further activity occurs, approximately seven minutes pass before the station's state changes to available. This is because it takes the average user cpu usage 2 minutes to drop below the threshold, and an imposed waiting time of 5 minutes. The waiting period is imposed so that users who stop working only temporarily are not disturbed by the `screen saver' reappearing as soon as they are ready to type another command. The user load with the imposed waiting time is used as a means of detecting availability because the station should not be considered a source of remote cycles if an owner is merely doing some work, thinking for a minute, and then doing some more work. Otherwise a station would be a source of remote cycles as soon as the owner stopped momentarily. The workstation's owner would su er from the e ect of swapping in and out of his/her processes, and the starting and stopping activities of the remote processes. 3.2.2

3.2.2 Usage patterns

In order to define a stochastic process it is necessary to know the distributions of the AV and NA periods, and how these periods are correlated. A graph of the cumulative relative frequency of the AV periods for all the stations during the entire time monitored is shown in Figure 3.4 (the solid line). For each time t on the horizontal axis, the corresponding percentage on the vertical axis is the percentage of AV periods less than t + 1 minutes. The solid curve in Figure 3.5 shows the cumulative relative frequency of NA periods. The graphs show that the percentage of periods longer than one hour is greater for AV than for NA periods. Figure 3.4 shows that there was a large number of AV periods more than 300 minutes long. This led the authors to believe that the AV periods are dominated by three types of periods: short, medium, and long. Short periods occurred when users did some work, and then stopped to think for a while before resuming the use of their workstations. Medium periods were the result of users leaving their desk for short intervals, or stopping to do other work during the day. Since users left their offices in the evenings and on weekends, scheduled long meetings, and taught or attended classes, long AV periods occurred. For the NA periods the authors identified two types of periods: short and long. The short component is the result of frequent short activities. The user typed some simple commands and then stopped to do something else. The user might have had some jobs that executed for short intervals even when he/she was not at the station. These short jobs contributed to the user load, which briefly made the station unavailable. The long components are the result of prolonged activity of the user. Long NA periods occurred when the user had long jobs to execute, which continued to execute after the user left the station.

Fig. 3.4 Distribution of AV periods.

Fig. 3.5 Distribution of NA periods.

The authors observed that the distributions of AV and NA periods appear to have exponential components, because they increase similarly to an exponential distribution and have long tails. A mixture distribution of exponentials was chosen to fit the observed data. This distribution is sometimes referred to as a k-stage hyperexponential distribution, when the distribution has k components. The k-stage hyperexponential distribution function is defined as

F(t) = \sum_{i=1}^{k} \alpha_i F_i(t), \quad \text{where } F_i(t) = 1 - e^{-\lambda_i t} \quad \text{and} \quad \sum_{i=1}^{k} \alpha_i = 1.   (3.4)

Component i of a k-stage distribution introduces two parameters that must be adjusted: \alpha_i and \lambda_i. On the one hand, the more components are introduced the better the fit is, but on the other hand it is more complex to assign values to a large number of parameters. For the AV relative frequency distribution, a good match was achieved by using a 3-stage hyperexponential distribution. The stages represent the small, medium and large AV periods. The components were assigned the expected values of 3, 25, and 300 minutes. Weights were assigned by using a least-squares fit for these components, which gives the following 3-stage distribution:

F_{AV}(t) = 0.32(1 - e^{-t/3}) + 0.44(1 - e^{-t/25}) + 0.24(1 - e^{-t/300}).   (3.5)

The small component contributes 32%, the medium component 44% and the large component 24% of the distribution. Less complexity was introduced when matching the NA periods, because their relative frequency distribution has fewer long intervals.

A good match for the NA periods was obtained by using a 2-stage hyperexponential distribution. The two components have the expected values of 7 and 55 minutes. Since each NA period lasted at least 7 minutes, the distribution was modified so that the probability that a period is less than 7 minutes is zero. The distribution of NA periods was defined as

F_{NA}(t) = \begin{cases} 0.68(1 - e^{-t/7}) + 0.32(1 - e^{-t/55}) & \text{if } t \geq 7, \\ 0 & \text{if } 0 \leq t < 7. \end{cases}   (3.6)

Fig. 3.6 Distribution of AV periods.

Fig. 3.7 Distribution of NA periods.

Figures 3.6 and 3.7 show the match of the cumulative distributions to the monitored traces for AV and NA periods smaller than 60 minutes. This is the region one must study to determine whether it is worthwhile to use workstations as a source for remote execution. Figures 3.4 and 3.5 show the match for AV and NA periods that were up to 600 minutes in length. The overall difference between the fitted distributions and the monitored relative frequency distributions is very small. The authors used the Kolmogorov-Smirnov test for curve matching to calculate the likelihood that their observed data could be generated from a random sequence of the fitted distributions. They found that the probability that a random sequence generated from the fitted distributions would deviate as much from the distributions as the observed data did, is high (around 80%). The authors therefore believe that F_{AV} and F_{NA} can serve as a means for artificially describing AV and NA period characteristics in studies involving remote allocation strategies of workstations.
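A minimal C sketch of such an artificial workload generator is given below; it draws alternating AV and NA period lengths from the fitted distributions (3.5) and (3.6). Clipping the NA samples to a minimum of 7 minutes reproduces the modification in equation (3.6); the function names and the use of drand48 are choices made for this example, not part of the authors' tooling.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw from an exponential distribution with the given mean (inverse-CDF method). */
static double exp_sample(double mean)
{
    return -mean * log(1.0 - drand48());
}

/* Draw from a k-stage hyperexponential: pick stage i with probability alpha[i],
 * then draw an exponential with mean mean[i].  This follows equation (3.4). */
static double hyperexp_sample(int k, const double *alpha, const double *mean)
{
    double u = drand48(), acc = 0.0;
    int i;
    for (i = 0; i < k - 1; i++) {
        acc += alpha[i];
        if (u < acc)
            break;
    }
    return exp_sample(mean[i]);
}

int main(void)
{
    /* Weights and expected values (in minutes) of the fitted distributions. */
    static const double av_a[] = { 0.32, 0.44, 0.24 }, av_m[] = { 3, 25, 300 };
    static const double na_a[] = { 0.68, 0.32 },       na_m[] = { 7, 55 };
    int i;

    srand48(1992);
    for (i = 0; i < 5; i++) {
        double av = hyperexp_sample(3, av_a, av_m);      /* equation (3.5) */
        double na = hyperexp_sample(2, na_a, na_m);
        if (na < 7.0)                                    /* equation (3.6): no   */
            na = 7.0;                                    /* NA period under 7 min */
        printf("AV %6.1f min   NA %6.1f min\n", av, na);
    }
    return 0;
}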

3.2.3 Correlation of AV and NA periods

It is necessary to know how the lengths of AV and NA periods correlate before an artificial workload generator can be built. Mutka and Livny analyzed pairs of NA and AV periods to determine whether such a correlation exists. AV periods were labeled as short (less than 9 minutes), medium (longer than 9 minutes and less than 75 minutes), or long samples (longer than 75 minutes). Similarly, the NA samples less than 21 minutes were classified as short, and the remaining samples as long.

Fig. 3.8 Conditional probability of AV-NA state changes.

With this labeling method, they found the conditional probability graph of Figure 3.8. It shows that 41% of all AV samples were short, 36% were medium, and 23% were long, and that of the NA samples, 74% were short and 26% were long. The conditional probability distribution is very close to the unconditional probability distribution. Therefore the authors concluded that there is no correlation between the lengths of the AV and NA periods. This observation was verified by computing the correlation coefficient of NA and AV periods, which has a very small positive value.

Fig. 3.9 Conditional probability of AV-AV state changes.

Although the AV and NA periods were uncorrelated, a correlation between pairs of periods of the same type was identified. An AV pair is two AV periods that are separated by a single NA period.

Fig. 3.10 Conditional probability of NA-NA state changes.

Likewise, an NA pair is two NA periods that are separated by a single AV period. A correlation was expected because of the way individuals use their workstations: users tend to have a cluster of short idle periods, or a cluster of several long idle periods. Figure 3.9 shows the conditional probability graph of AV pairs. Short AV intervals were more likely (64%) to be followed by short AV intervals. Medium AV intervals were more likely (52%) to be followed by medium AV intervals. Similarly, long AV intervals were more likely (48%) to be followed by long AV intervals. A correlation also existed for NA pairs, as shown in Figure 3.10, but it was much less significant.

3.2.4 The availability of workstations

The availability of individual stations varied from month to month, but the total system availability of workstations stayed steady for the entire 5 months: it remained within a 3% deviation from the average (71%). Mutka and Livny emphasize that the actual idle time of the workstations was larger than the availability time reported. A portion of the larger idle time is due to the fact that any user activity causes a workstation to be unavailable for at least seven minutes. If the seven-minute interval for each busy period did not occur, the system availability would increase by approximately 4%. Additional idle time occurred during the NA intervals between user activities. The availability of remote cycles varied during the course of a day, and it varied between the workweek and the weekend. Early in the morning, the system availability was large (75%), and then dropped to about 50% between 2 and 4 in the afternoon. Even at the busiest time of the day there was a large capacity to use. The availability on weekends was between 70% and 80%. Figure 3.11 shows the average hourly utilization of the workstations. A station's hourly utilization is the percentage of the hour the station was in the NA state. The figure shows that on average the hourly utilization of a workstation was less than 10% (NA for less than 6 minutes) for 53% of the time. For 21% of the time, the hourly utilization was greater than 90%. The only other significant frequency is the 10-20% hourly utilization, which is due primarily to single 7-minute NA intervals occurring during an hour period. Figure 3.11 shows that if each hour is observed individually, a station was either available for almost the entire hour, or busy for the whole hour. In addition to the utilization of individual stations, the utilization of the entire system is of interest to a remote capacity scheduler.

Fig. 3.11 System and individual station utilization.

The system utilization during an interval is the (average) number of stations in the NA state during the interval divided by the total number of stations. The relative frequency distribution of the system utilization during periods of length l, SU_l, can help a scheduler to estimate the fraction of the system capacity that is available for the next l minutes.

Utilization, u        SU_60   SU_30   SU_10   SU_1
 0%  ≤ u ≤  10%         6.0     8.4    13.0   18.0
10%  ≤ u ≤  20%        28.1    25.2    24.4   24.6
20%  ≤ u ≤  30%        28.2    29.1    24.7   22.7
30%  ≤ u ≤  40%        16.2    15.3    15.4   13.4
40%  ≤ u ≤  50%         8.7     8.8     8.9    7.8
50%  ≤ u ≤  60%         6.5     6.7     6.4    5.7
60%  ≤ u ≤  70%         4.4     4.4     4.3    4.0
70%  ≤ u ≤  80%         1.6     1.7     2.1    2.3
80%  ≤ u ≤  90%         0.3     0.4     0.8    1.1
90%  ≤ u ≤ 100%         0.0     0.1     0.1    0.4

Tab. 3.1 Relative frequency distribution of system utilization, SU_l.

Table 3.1 shows the relative frequency distribution of the system utilization for interval lengths of 60 minutes, 30 minutes, 10 minutes and 1 minute. The system utilization was less than 40% for almost 80% of all hour intervals. This means that the probability that at least 6 of the 11 stations were available is almost 80%. Figure 3.11 shows how, on an hourly basis, the system utilization and individual station utilization compare. Individual stations were likely to be either AV or NA for entire hours, while the system was likely to have 10-40% of the stations in the NA state.

Because individual stations are likely to be either AV or NA for the entire hour, the prediction of whether a station is available for an entire hour can be approximated by a Bernoulli distribution. A station can be viewed as having a probability p of being in the NA state, and 1 - p of being in the AV state. If we assume that the behavior of each station is independent, then the probability n_k that k out of N stations are in the NA state can be approximated by the binomial distribution

n_k = \binom{N}{k} p^k (1 - p)^{N-k}.   (3.7)

Figure 3.12 shows the Bernoulli density function for p = 0.3 and N = 11 and the corresponding binomial density function.

Fig. 3.12 Bernoulli and binomial probability density functions.

We use p = 0.3 because the hourly utilization of the individual stations was more than 50% for approximately 30% of the time. The similarity of the shapes of the graphs in Figures 3.11 and 3.12 indicates that the stations can be viewed as independent and that the system utilization can be approximated by a binomial distribution.
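As an illustration of how a scheduler might use this approximation, the following C sketch evaluates equation (3.7) for N = 11 and p = 0.3 and sums the terms for which at least 6 stations are available (that is, at most 5 are in the NA state). The program and its function names are only an example of applying the binomial model, not part of the original study.

#include <math.h>
#include <stdio.h>

/* Binomial coefficient N over k, computed iteratively to avoid factorial overflow. */
static double choose(int n, int k)
{
    double c = 1.0;
    int i;
    for (i = 1; i <= k; i++)
        c = c * (n - k + i) / i;
    return c;
}

/* Equation (3.7): probability that k out of N stations are in the NA state. */
static double binom(int N, int k, double p)
{
    return choose(N, k) * pow(p, k) * pow(1.0 - p, N - k);
}

int main(void)
{
    const int N = 11;
    const double p = 0.3;            /* probability that a station is NA */
    double at_least_6_avail = 0.0;
    int k;

    for (k = 0; k <= N; k++)
        printf("P(%2d of %d stations NA) = %.3f\n", k, N, binom(N, k, p));

    /* At least 6 stations available means at most 5 stations in the NA state. */
    for (k = 0; k <= 5; k++)
        at_least_6_avail += binom(N, k, p);
    printf("P(at least 6 of %d stations available) = %.3f\n", N, at_least_6_avail);

    return 0;
}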


Chapter 4

Condor

Many organizations own hundreds of powerful workstations which are connected by local area networks. These workstations are often allocated to a single user who exercises full control over the workstation's resources. Litzkow and Livny [15] stated that in such an environment one can find three types of users: casual users, who seldom utilize the full capacity of their machines; sporadic users, who for short periods of time fully utilize the capacity of the workstation they own; and frustrated users, who for long periods of time have computing demands that are beyond the power of their workstations. The throughput of these frustrated users is limited by the power of their workstations. In a paper by Mutka and Livny [19] it was shown that in a computing environment of workstations connected by a local area network, about 70% of the workstations are available for remote execution. Condor is a distributed batch system which was designed to meet the challenge posed by the frustrated users, namely to provide convenient access to unutilized workstations while preserving the rights of their owners. Condor has been developed by the Computer Science department of the University of Wisconsin - Madison. This chapter gives a summary of the documentation [3] [4] [13] [14] [15] [16] [18] on Condor.

4.1 Design features

Several principles have driven the design of Condor.

- Workstation owners should always have the resources of the workstation they own at their disposal. Workstation owners are generally happy to let somebody else compute on their machines while they are out, but they want their machines back promptly upon returning, and they do not want to have to take special action to regain control. Immediate response is the reason most people prefer a dedicated workstation over access to a time-sharing system. Condor handles this automatically.

- Remote capacity should be easy to access. The Condor software is responsible for locating and allocating idle workstations; Condor users do not have to search for idle machines. The local execution environment is preserved for remotely executing processes, so users do not have to worry about moving data files to remote workstations before executing programs there. Users of Condor may be assured that their jobs will eventually complete. If a user submits a job to Condor which runs on somebody else's workstation, but the job is not finished when the workstation's owner returns, the job will be checkpointed and restarted as soon as possible on another machine.

- No special programming should be required to use Condor. Condor is able to run normal UNIX programs, only requiring the user to relink them, not to recompile them or change any code. Condor does its work completely outside the kernel, and is compatible with the Berkeley 4.2 and 4.3 UNIX kernels and many of their derivatives. Because it requires no changes to the operating system, Condor is portable and can be used in environments where access to the internals of the system is not possible. Condor does pay a price for this flexibility in both the speed and completeness of its process migration [16].

4.2 Remote system calls

The user program is provided with the illusion that it is operating in the environment of the submitting machine. In some circumstances file I/O is redirected from the machine where execution actually takes place to the submitting machine. In other situations files on the submitting machine are accessed more efficiently by use of NFS. Every UNIX program, whether or not written in the C language, is linked with the `C' library. In the normal situation this library provides the interface between the user program and the UNIX kernel. This interface is implemented as stubs, which perform the system call on behalf of the user program. Figure 4.1 illustrates the normal UNIX system call mechanism.

Fig. 4.1 Normal UNIX system calls.

Figure 4.2 shows how the system call mechanism has been altered by providing a special version of the C library which performs system calls remotely. This special library, like the normal library, has a stub for each UNIX system call. These stubs either execute a request locally by mimicking the normal stubs, or pack the request into a message which is sent to the shadow process.

Fig. 4.2 Remote system calls.

The shadow executes the system call on the initiating machine, packs the results, and sends them back to the stub. The stub then returns to the application program in exactly the same way the normal system call would have, had the call been local. The shadow runs with the same user and group ids, and in the same directory, as the user process would have had it been executing on the submitting machine. If a network file system such as NFS is in use, an important optimization is possible. Performance can be increased in these cases by avoiding remote system calls and accessing the file directly. This mechanism works as follows. At the time of an open request, the stub sends a name translation request to the submitting machine. The shadow process responds with a translated pathname in the form of hostname:pathname, where hostname may refer to a file server, and pathname is the name by which the file is known on the server (which may be different from the pathname on the submitting machine, because of mount points and symbolic links). The stub then examines the mount table on the machine where it is executing, and if possible accesses the file without using the shadow process. Whenever a process is checkpointed and restarted on another machine, the name translation process is repeated, since access to remotely mounted files may vary among the execution machines.
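The following C fragment sketches the shape of such a stub for open(2). It is only an illustration of the mechanism described above; the message layout and the helper functions (send_to_shadow, lookup_local_mount) are invented for this example and do not correspond to Condor's actual source code. The helpers here only simulate the shadow's replies so that the sketch compiles and runs.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simulated helpers: in Condor these would talk to the shadow over the remote
 * system call connection and inspect the local mount table. */
static int send_to_shadow(const char *request, char *reply, size_t len)
{
    printf("RSC request to shadow: %s\n", request);
    if (strncmp(request, "NAME_TRANSLATE", 14) == 0) {
        /* Pretend the shadow translated the name to a file-server path. */
        snprintf(reply, len, "%s", "fileserver:/export/home/user/data");
        return 0;
    }
    snprintf(reply, len, "%d", 3);   /* pretend fd 3 was opened remotely */
    return 0;
}

static int lookup_local_mount(const char *host, const char *path,
                              char *localpath, size_t len)
{
    (void)host; (void)path; (void)localpath; (void)len;
    return -1;                       /* pretend the server is not mounted here */
}

/* Sketch of a remote-system-call stub for open(2). */
static int condor_open(const char *path, int flags, int mode)
{
    char request[1024], reply[1024], localpath[1024];

    /* 1. Ask the shadow to translate the name into hostname:pathname. */
    snprintf(request, sizeof request, "NAME_TRANSLATE %s", path);
    if (send_to_shadow(request, reply, sizeof reply) == 0) {
        char *colon = strchr(reply, ':');
        if (colon != NULL) {
            *colon = '\0';
            /* 2. If the file server is mounted locally, open the file directly
             *    and bypass the shadow for further I/O on this file. */
            if (lookup_local_mount(reply, colon + 1,
                                   localpath, sizeof localpath) == 0)
                return open(localpath, flags, mode);
        }
    }

    /* 3. Otherwise perform the open remotely: the request is packed into a
     *    message, the shadow executes open() on the submitting machine, and
     *    the resulting descriptor is handled remotely from then on. */
    snprintf(request, sizeof request, "OPEN %s %d %d", path, flags, mode);
    if (send_to_shadow(request, reply, sizeof reply) == 0)
        return atoi(reply);
    return -1;
}

int main(void)
{
    printf("condor_open returned %d\n", condor_open("data", O_RDONLY, 0));
    return 0;
}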

4.3 Checkpointing

Condor provides a transparent checkpointing mechanism which allows it to take a checkpoint of a running job, and to migrate that job to another workstation when the machine it is currently running on becomes busy with non-Condor activity. This allows Condor to return workstations to their owners promptly, yet provide assurance to Condor users that their jobs will make progress, and eventually complete. Ideally, checkpointing and restarting a process means storing the process state and later restoring it in such a way that the process can continue where it left off. In the most general case, the state of a UNIX process may include pieces of information which are known only to the kernel, or which may not be possible to recreate. Condor is meant for jobs whose state is simple enough that they can be checkpointed. In the current version of Condor only single-process jobs are supported. This means that the fork(2), exec(2), and similar calls are not implemented.

Signals and signal handlers (the signal(3), sigvec(2) and kill(2) calls) and interprocess communication (IPC) calls (socket(2), send(2), recv(2), etc.) are not supported. The state of a UNIX process includes the contents of memory (the text, data and stack segments), the processor registers and the status of open files. The approach of the designers of Condor to saving and restoring the state of a process is to rely on basic UNIX mechanisms, to keep Condor easy to port. The checkpoint file is itself a UNIX executable file. A new checkpoint file is created from pieces of the previous checkpoint and a core image. The text segment of the new checkpoint is an exact copy of the text segment of the old checkpoint. The data area and stack area are copied from the core file into the new checkpoint file. The setjmp/longjmp facilities of the C library have been used to save the register contents and program counter. Information about currently open files is gathered by the stubs of open, close and dup of Condor's special C library. The exact way in which Condor makes a checkpoint file can be found in a paper by Michael Litzkow and Marvin Solomon [16]. Before a Condor process is executed for the first time, its executable file is modified to look exactly like a checkpoint file, so that every checkpoint is done in the same way.
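The setjmp/longjmp building block that Condor relies on can be illustrated with the small, self-contained C program below. It only shows how the register contents and program counter are saved and later restored; the remaining steps of a real Condor checkpoint (combining the core image with the previous executable and restoring open files) are not shown.

#include <setjmp.h>
#include <stdio.h>

static jmp_buf ckpt;                 /* holds the saved registers and program counter */
static volatile int restarts = 0;

int main(void)
{
    /* setjmp() saves the current register contents and program counter into
     * ckpt and returns 0; a later longjmp() resumes execution here, making
     * setjmp() appear to return the value passed to longjmp(). */
    if (setjmp(ckpt) == 0)
        printf("first pass: execution state saved\n");
    else
        printf("resumed from the saved state (restart %d)\n", restarts);

    if (restarts < 2) {
        restarts++;
        longjmp(ckpt, 1);            /* "restart" the program from the saved state */
    }

    printf("done after %d restarts\n", restarts);
    return 0;
}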

4.4 Control Software

Condor includes control software consisting of two daemons which run on each member of the Condor pool, and two other daemons which run on a single machine called the central manager. The two daemons that run on the central manager machine are the collector and the negotiator. The collector and the negotiator are separate processes, but they can be viewed as one logical process called the central manager. On each machine in the Condor pool the schedd maintains a queue of Condor jobs, and negotiates with the central manager to get permission to run them. The schedd must prioritize its own jobs. The other daemon is the startd. The startd determines whether its machine is idle, and is responsible for starting and managing the foreign job if one is running on its machine. An additional daemon (kbdd) is necessary on machines running the X window system, to inform the startd about the keyboard and mouse idle time. The startd periodically informs the central manager whether the machine is available, and the schedd periodically informs the central manager about the number of jobs it wants to run. The central manager is responsible for allocating the idle machines to machines which have Condor jobs to run. To illustrate how the daemons work together, we will follow a Condor job from the moment of submission to the moment that it finishes or the owner of the hosting workstation returns (as in the Condor technical summary [3]). Figure 4.3 illustrates the situation when there are no Condor jobs running. When a machine becomes idle, its startd will report this to the central manager. The central manager will then decide which of the machines should execute one of its jobs remotely on the idle machine, and will give that initiating machine permission to run a job on the execution machine. The schedd on the initiating machine will select a job from its queue and spawn off a shadow process to serve it.

Fig. 4.3 Condor processes with no jobs running.

The shadow will then contact the startd on the execution machine to ask if the machine is still idle. The startd on the execution machine then spawns a process called the starter, which is responsible for starting and managing the remotely running job.

Fig. 4.4 Condor processes while starting a job.

The shadow on the initiating machine will transfer the checkpoint file to the starter on the execution machine, and the starter spawns off the remote job. The starter is responsible for periodically giving the user job a checkpoint signal, causing the user job to save its file state and stack, and then to make a core dump. A new checkpoint file is made by the starter and stored temporarily on the execution machine. The starter restarts the job from the new version of the checkpoint and sets a timer for the next time it has to give the user job a checkpoint signal. The shadow process on the initiating machine will handle the system calls for the user job. If the user job finishes, the starter and shadow clean up, and the user is notified by mail that the job has finished. If the owner of the execution machine returns, the startd on the execution machine will detect this, and it will send a "suspend" signal to the starter, which will temporarily suspend the user job.

Fig. 4.5 Condor processes with one job running.

If the execution machine remains busy for a certain period, the startd will send a "vacate" signal to the starter, which will abort the user job and return the latest checkpoint file to the shadow on the initiating machine. If the user job has not run long enough to reach a checkpoint, the job is just aborted. This is done because making a checkpoint is an I/O-intensive activity, and the returning owner should not notice any interference from Condor.

4.5 Priorities

Condor can be customized by changing the configuration files. There is one generic configuration file with definitions for all machines in a Condor pool, and every station has a local configuration file. A definition in the local configuration file overrules a definition in the generic configuration file. In this section we will look at some of the important expressions in the configuration file. The configuration file defines the interval at which the negotiator will update the priorities of the machines in the Condor pool, and negotiate with machines that want to run jobs. This interval is normally 300 seconds. The expression used by the negotiator to update the machine priority of active machines is

UPDATE-PRIO : Prio + Users - Running

\Prio" is the previous priority of the machine, \Users" is the number of di erent users that have Condor jobs in the local-queue, and \Running" is the number of Condor jobs the machine has currently running. If a machine is inactive (Users  0), the priority will be incremented by one if the priority is negative, and decremented by one if the priority is positive. This scheduling mechanism (the up-down algorithm) was presented in a paper by Mutka and Livny [18]. Machines which are running many jobs will tend to have low priorities, and machines which have jobs to run, but can't run them, will accumulate high priorities. The up-down 37

The schedd is responsible for prioritizing its own jobs. The following expression is used by the schedd to assign priorities to its local jobs:

PRIO : (UserPrio * 10) + Expanded - (Qdate / 1000000000)

\UserPrio" is de ned by the job owner in a similar(but opposite) way as the UNIX \nice" value. \Expanded" will be 1 if the job has already done some execution. This is done to preserve disk space, an expanded job is bigger than an unexpanded job. \Qdate" is the UNIX time the job was submitted. The constants make that \UserPrio" is the major criteria, \Expanded" is less important and \Qdate" is the minor criteria in determining job priority. The standard interval at which the startd checks if its state should be changed is 5 seconds. Every 120 seconds the negotiator is informed by the startd about the machine status. The status of a machine is NOJOB when the keyboard has been idle for 15 minutes and the load average is below 0.3. A Condor job will be suspended if the load average becomes higher than 1.5 or when the keyboard is touched. The maximum time a Condor job can be suspended is 10 minutes. A Condor job will be resumed if the load average is below 0.3 and the keyboard has not been touched for 5 minutes. If a Condor job is not resumed within 10 minutes, the job will be vacated (the checkpoint le will be moved from the execution host to the initiating host). If a job has not been vacated within 10 minutes, the Condor job is killed. All of the above mentioned intervals can be changed to meet a workstation owner's wishes. Figure 4.6 shows all the di erent states a machine can be in.

Fig. 4.6 States of Condor machines (NOJOB, JOB RUNNING, SUSPENDED, VACATING, CHECKPOINTING, KILLED, USER BUSY, CONDOR DOWN).
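The default thresholds described above translate into a simple decision procedure in the startd. The following C sketch shows one way those checks could be expressed; the function, the state identifiers and the timer bookkeeping are constructed for this example and do not mirror Condor's actual code.

#include <stdio.h>

enum state { NOJOB, JOB_RUNNING, SUSPENDED, VACATING, KILLED };

/* Default thresholds described in the text (seconds and load averages). */
#define KBD_IDLE_NOJOB   (15 * 60)  /* keyboard idle before the machine is NOJOB */
#define KBD_IDLE_RESUME  (5 * 60)   /* keyboard idle before a job is resumed     */
#define MAX_SUSPENDED    (10 * 60)  /* suspended this long: vacate the job       */
#define MAX_VACATING     (10 * 60)  /* vacating this long: kill the job          */
#define LOAD_LOW         0.3
#define LOAD_HIGH        1.5

/* Is the machine available for a foreign Condor job? */
static int machine_is_idle(double load, int kbd_idle)
{
    return kbd_idle >= KBD_IDLE_NOJOB && load < LOAD_LOW;
}

/* One periodic check (normally every 5 seconds) while a foreign job is present.
 * `kbd_idle' is the keyboard/mouse idle time in seconds (0 means the keyboard
 * was just touched) and `in_state' is how long the current state has lasted. */
static enum state next_state(enum state s, double load, int kbd_idle, int in_state)
{
    switch (s) {
    case JOB_RUNNING:
        if (load > LOAD_HIGH || kbd_idle == 0)
            return SUSPENDED;                 /* owner activity: suspend the job */
        return JOB_RUNNING;
    case SUSPENDED:
        if (load < LOAD_LOW && kbd_idle >= KBD_IDLE_RESUME)
            return JOB_RUNNING;               /* owner gone again: resume        */
        if (in_state >= MAX_SUSPENDED)
            return VACATING;                  /* move the checkpoint back        */
        return SUSPENDED;
    case VACATING:
        if (in_state >= MAX_VACATING)
            return KILLED;                    /* vacating takes too long: kill   */
        return VACATING;
    default:
        return s;
    }
}

int main(void)
{
    printf("idle machine?           %d\n", machine_is_idle(0.1, 16 * 60));
    printf("running, owner types -> %d\n", next_state(JOB_RUNNING, 0.2, 0, 60));
    printf("suspended for 11 min -> %d\n", next_state(SUSPENDED, 2.0, 0, 11 * 60));
    return 0;
}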

The starter is responsible for giving the Condor job a checkpoint signal at the appropriate time. The configuration file defines the minimum and maximum interval at which the starter should give a checkpoint signal. Normally the first checkpoint is made after 30 minutes. After the first checkpoint the interval is doubled, until it reaches the maximum interval (normally 2 hours).

This is done because the expectation of the period that a given machine will stay idle increases as the period that the machine has already been idle increases.
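The resulting checkpoint schedule can be computed as below; the 30-minute minimum and 2-hour maximum are the defaults mentioned above, and the small program is only an illustration of the doubling rule.

#include <stdio.h>

int main(void)
{
    const int min_interval = 30;        /* minutes until the first checkpoint    */
    const int max_interval = 2 * 60;    /* the interval is never doubled beyond this */
    int interval = min_interval;
    int elapsed = 0, i;

    /* Print the times at which the first few checkpoint signals are delivered. */
    for (i = 1; i <= 6; i++) {
        elapsed += interval;
        printf("checkpoint %d after %3d minutes (interval %3d min)\n",
               i, elapsed, interval);
        if (interval * 2 <= max_interval)
            interval *= 2;              /* double the interval after each checkpoint */
        else
            interval = max_interval;
    }
    return 0;
}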

4.6 User interface

The Condor user interface consists of the following programs:

condor_submit    Submit jobs to the Condor job queue.
condor_rm        Remove jobs from the Condor job queue.
condor_prio      Change the priority of jobs in the Condor job queue.
condor_q         Display the local Condor job queue.
condor_globalq   Display the Condor job queue of all machines in the pool.
condor_summary   Summarize recent Condor usage on the local machine.
condor_status    Examine the status of the Condor machine pool.

condor_submit reads a description file which contains commands that direct the queueing of jobs. It is possible to submit many Condor jobs at once, as a "job cluster". These jobs must share the same executable, but may have different input and output files, and different arguments. Submitting multiple jobs in this way is advantageous, because only one copy of the checkpoint file is needed until the jobs begin execution. The description file must contain the name of the executable, and the requirements which a remote machine must meet to execute the job. There are three kinds of requirements: "Memory", "Arch", and "OpSys". "Memory" is an estimate of the amount of physical memory that will be required, and "Arch" and "OpSys" describe the particular kind of UNIX platform for which the executable has been compiled. Another important item in the description file is the Condor job priority.

4.7 Discussion

A new performance measure called the leverage has been introduced by Litzkow, Livny and Mutka [14]. The leverage of a job is the ratio of the capacity consumed by the job on the executing machine to the capacity consumed on the initiating machine to support remote execution. The capacity on the initiating machine is the combination of capacity used to support placement, checkpointing and system calls. Litzkow et al. found that the average leverage of jobs in the Condor pool of their department was approximately 1300. This means that only one minute of local capacity was consumed for nearly 22 hours of remote capacity. With the command condor_summary the total leverage of a user (the remote CPU time of all jobs of a user divided by the local CPU time of all jobs of a user) can be found. In the Condor pool of NIKHEF-K (8 SPARC stations) the total leverage of users normally lies between 1500 and 4000.

In total, Condor has provided around 250 days of CPU time in a period of about 9 months. Litzkow and Solomon [16] described the following areas of future work on Condor.

- In some circumstances it would be better to transfer processes directly between execution sites rather than always sending a checkpoint file back to the originating site.

- Data compression could be used to reduce the volume of data transferred and stored.

- The stack and data segments could be read directly from the core file into a new instantiation of a process rather than converting the core file to an executable module (which includes copying the text segment of the old checkpoint file).

- Support for signals. This is a non-trivial feature because the information maintained by the kernel has to be checkpointed. This information is maintained in a way that varies among UNIX implementations.

Unfortunately, the last two optimizations work against portability: a solution to a problem will work on some platforms, but not on others. The following is a collection of other areas of development.

- At the moment Condor is machine oriented. The central manager is responsible for prioritizing the machines of a pool, and the schedd of each machine has the task of prioritizing the jobs submitted on that machine. This mechanism only works well if each user always submits his/her jobs from the same machine. In practice this is not always the case, due to multi-user systems and machines owned by more than one person. In the future, Condor will become user oriented. There are two possible architectures: each user has his own queue of Condor jobs, or there is one global queue. The last solution means that a new scheduling mechanism for the jobs in the global queue will be needed.

- An interesting idea is the coupling of Condor pools over wide area networks (flocking). Within a Condor pool there is a wide variation in the number of Condor jobs. Sometimes there is a large (virtual) queue of jobs, and sometimes there are no jobs in the queue while there are idle machines. By coupling Condor pools, work from over-loaded pools can be transported to pools with idle machines.

- Support for shell scripts. The possibility to run a shell script is interesting, because in this way one can have the script compile and link the job on the fly on the remote machine. This is useful in a multi-architecture environment. It should be possible to let the script end with the execution of a checkpointed program.


Bibliography

[1] J-P. Baud, F. Cane, F. Hemmer, E. Jagel, G. Lee, L. Robertson, B. Segal and A. Trannoy, "SHIFT, user guide and reference manual," CERN-CN, Geneva, Switzerland, 1991.

[2] K. Birman and K. Marzullo, "The ISIS distributed programming toolkit and the META distributed operating system," Sun Technology, Summer 1989, pp. 90-104.

[3] A. Bricker, M.J. Litzkow and M. Livny, "Condor technical summary," Version 4.1b, University of Wisconsin - Madison, 1991.

[4] A. Bricker and M.J. Litzkow, UNIX manual pages: condor_intro(1), condor(1), condor_q(1), condor_rm(1), condor_status(1), condor_summary(1), condor_config(5), condor_control(8), condor_master(8), Version 4.1b, University of Wisconsin - Madison, 1991.

[5] T.L. Casavant and J.G. Kuhl, "A taxonomy of scheduling in general-purpose distributed computing systems," IEEE Trans. on Softw. Eng., Vol. SE-14, no. 2, 1988, pp. 144-154.

[6] R.C.A.M. van Driel, P.G. Huisken and D.B. Reuhman, "Distributed job execution in a network of UNIX computers," Philips Research, 1992.

[7] D.L. Eager, E.D. Lazowska and J. Zahorjan, "Adaptive load sharing in homogeneous distributed systems," IEEE Trans. on Softw. Eng., Vol. SE-12, no. 5, 1986, pp. 662-675.

[8] D. Ferrari and S. Zhou, "An empirical investigation of load balancing applications," in Proc. of Performance '87, 1987, pp. 515-528.

[9] A. Goscinski and M. Bearman, "Resource management in large distributed systems," Operating Systems Review, Vol. 24, no. 4, 1990, pp. 7-25.

[10] A. Goscinski, Distributed Operating Systems - The Logical Design, Addison-Wesley Publishing Company, 1991.

[11] B.A. Kingsbury, "The network queueing system," preliminary draft, Sterling Software, Palo Alto, USA, 1991.

[12] W.E. Leland and T.J. Ott, "Load-balancing heuristics and process behavior," in Proc. of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1986, pp. 54-69.

[13] M.J. Litzkow, "Remote UNIX, turning idle workstations into cycle servers," in Proceedings of the 1987 Summer Usenix Conference, Phoenix, Arizona, 1987.

[14] M.J. Litzkow, M. Livny and M.W. Mutka, "Condor - A hunter of idle workstations," in Proceedings of the 8th International Conference on Distributed Computing Systems, San Jose, California, 1988, pp. 104-111.

[15] M.J. Litzkow and M. Livny, "Experience with the CONDOR distributed batch system," in Proceedings of the IEEE Workshop on Experimental Distributed Systems, Huntsville, AL, 1990.

[16] M.J. Litzkow and M. Solomon, "Supporting checkpointing and process migration outside the UNIX kernel," in Proceedings of the Usenix Winter Conference, San Francisco, California, 1992.

[17] M. Livny and M. Melman, "Load balancing in homogeneous broadcast distributed systems," in Proceedings of the ACM Computer Network Performance Symposium, 1982, pp. 47-55.

[18] M.W. Mutka and M. Livny, "Scheduling remote processing capacity in a workstation-processor bank network," in Proc. 7th Int. Conf. Dist. Comp. Systems, Berlin, West Germany, 1987, pp. 2-9.

[19] M.W. Mutka and M. Livny, "Profiling workstations' available capacity for remote execution," in Proceedings of Performance '87, The 12th IFIP W.G. 7.3 International Symposium on Computer Performance Modeling, Measurement and Evaluation, Brussels, Belgium, 1987, pp. 529-544.

[20] D.A. Nichols, "Using idle workstations in a shared computing environment," in Proceedings of the 11th ACM Symposium on Operating Systems Principles, Austin, Texas, USA, 1987, pp. 5-12.

[21] P. Krueger and M. Livny, "The diverse objectives of distributed scheduling policies," in Proc. 7th Int. Conf. Dist. Comp. Systems, 1987, pp. 242-249.

[22] A.S. Tanenbaum, M.F. Kaashoek, R. van Renesse and H.E. Bal, "The AMOEBA distributed operating system - A status report," Computer Communications, Vol. 14, no. 6, 1991, pp. 324-335.

[23] M.M. Theimer, K.A. Lantz and D.R. Cheriton, "Preemptable remote execution facilities for the V-System," in Proceedings of the 10th ACM Symposium on Operating Systems Principles, Orcas Island, Washington, USA, 1985, pp. 2-5.

[24] Y-T. Wang and R.J.T. Morris, "Load sharing in distributed systems," IEEE Trans. on Comp., Vol. C-34, no. 3, 1985, pp. 204-217.
