Migration Impact on Load Balancing - An Experience on Amoeba

Weiping Zhu
Department of Computer Science, The University of Queensland, QLD 4072, Australia
E-mail: weiping@cs.uq.edu.au

Piotr Socko, Bartek Kiepuszewski
Math, Informatics and Mechanics, Warsaw University, Poland

Abstract

Load balancing has been studied extensively by simulation, and positive results were obtained in most of this research. With the increasing availability of distributed systems, a few experiments have also been carried out on real systems. These experimental studies rely either on task initiation alone or on task initiation plus task migration. In this paper, we present the results of an experimental study of load balancing using a centralized policy to manage the load on a set of processors. All experiments were carried out on an Amoeba system composed of a set of 386s linked by 10 Mbps Ethernet. On one hand, the results indicate the necessity of a load balancing facility for a distributed system. On the other hand, they question the benefit of using process migration to increase system performance under the configuration used in our experiments.

1 Introduction

A distributed computer system with hundreds of computers experiences load imbalance from time to time, which degrades system performance in terms of average response time and resource utilization. In order to achieve high performance from a distributed system, the operating system must be equipped with an efficient task scheduling facility. This facility tries to assign tasks to processors evenly, an action called initiation. However, simple assignment cannot eliminate load imbalance. A load balancing facility in an operating system is not only responsible for task assignment, but also monitors load variation in the system. If the load difference between processors exceeds a threshold, it can move tasks from heavily loaded processors to lightly loaded ones. Therefore, task initiation and task migration are the possible actions that can be adopted by a load balancing facility. There is no doubt that a reasonable initiation policy can improve system performance remarkably. However, whether migration can further improve system performance is still an open question. In this paper, we investigate the suitability of using task migration to enhance system performance.

A considerable number of research projects on load balancing have been carried out over the last decade. The results obtained from simulation, analysis and experimental studies on one hand indicate the necessity of a load balancing facility for the performance of a distributed system. On the other hand, researchers disagree on a number of other issues, such as algorithm quality and load index selection. This is partially due to the different configurations and modeling techniques used in these studies (architectures, platforms, benchmarks, assumptions, etc.), which lead to quite different results, and the researchers are accordingly divided into different camps. For instance, some favor a non-preemptive strategy [2], while others prefer the preemptive approach [6].

In this paper, we present the results of an experimental study of load balancing on the Amoeba system, which is equipped with one of the fastest IPC facilities, FLIP [5]. Two centralized load balancing methods were used in the experiments: one based on task initiation alone, the other on task initiation plus migration. The results on one hand show the improvement in system performance when a load balancing facility is added; on the other hand, they question the impact of process migration on system performance.

This paper is organized into five sections. In the next section, we outline the design of a load balancing facility (called the load balancer) under the Amoeba distributed operating system. In Section 3, the load indices used in our study are discussed. Section 4 describes the experiment environment and the workload used in the experiments. Section 5 is devoted to the experimental results and conclusions.

2 Design Philosophy

The Amoeba system was chosen for this experimental study because of its ready availability and its high-performance IPC facility. One principle used in our design is to keep our modifications consistent with the Amoeba design philosophy, so that it will be possible to add our code to Amoeba in a future release.

2.1 Minimize Kernel Change

Amoeba is a processor-pool based distributed system. Logically it has two layers (user and kernel) and three kinds of processes: user processes, servers executed in user mode, and the server executed in kernel mode. Each pool processor has its own local memory and executes the Amoeba kernel. The kernel has two parts: 1) the micro-kernel and 2) the process server, which is the only server executed in kernel mode and was designed very carefully for high performance. The micro-kernel performs memory management and low-level communication functions, while the process server, a multi-threaded process with the highest priority, provides all process management functions. The process server can access the critical data structures kept in the micro-kernel, and even executes some of the micro-kernel code directly. This technique not only makes the kernel smaller, but also eliminates the system call overhead the process server would otherwise incur when accessing micro-kernel services.

Because of the potential complications caused by changing the kernel, we limited ourselves to minimal changes to the system kernel when adding our load balancing facility: without complete knowledge of the kernel, any change may be catastrophic. Except for a process migration facility added to the process server for performance reasons, little was changed in the kernel. The first version of the migration facility was implemented as a user process based on checkpointing. However, this implementation turned out to be inefficient: on average it took 1.5 seconds to migrate a 100 KB process, and the time taken to migrate the same process varied substantially. A careful analysis revealed the cause of this inefficiency and variation. Only two priorities are supported in Amoeba, user and kernel. When the migration facility is implemented as a user process, it needs to contact the local process server for the checkpoint and the remote process server to resume execution, so a number of RPCs are required for one migration. The round-robin scheduling used in Amoeba forces the migration facility to be queued a number of times; on a heavily loaded processor, the waiting time in the queue can be quite long and varies with the behavior of the other processes. A kernel process such as the process server, on the other hand, is scheduled immediately once it is ready. Therefore, instead of changing the scheduling policy of Amoeba, we moved the migration facility into the kernel and integrated it into the process server. Performance tests indicate that the kernel approach is at least six times faster than the previous implementation [10], with little variation.

On top of the kernel there are various processes, including server processes and user processes (the user layer), all of which use system calls to access kernel services. All processes in this layer communicate with each other by RPC. They also use RPC to ask the process server for process-related services, such as create, destroy and, after we implemented this facility, migration. The process server does not initiate an RPC to other processes, except when a process (the caller) asks it to suspend a process; in that case, after the process is suspended, the process server calls the owner of the suspended process (which may not be the caller) and passes a checkpoint to it.

In this way, Amoeba eliminates upcalls (from the kernel layer to the user layer) and avoids the priority inversion problem. We respect this principle and did not introduce any new upcalls when the load balancing facility was put into the system. Therefore, in order to obtain load information from the kernel, the load balancer has to send a request to the kernel and then wait for the reply. Following the principle of separating policy from mechanism, we implemented our load balancer at the user layer: it polls the kernel of each processor for workload information and sends requests to process servers for process migration.

2.2 Using the Same Interface as Before

There is a centralized server (called the run server) which provides the job initiation function in the Amoeba system. This server waits for user requests. On receiving a request, the server selects an appropriate processor from the pool processors and sends the capability of that processor to the user. Based on the returned capability, the user can start a process on that processor. To study the impact of process migration on system performance, and to keep consistency with the other parts of Amoeba, the same interface as the run server is used for our load balancer, but new functionality is added to it.

2.2.1 Interface

There is only one kind of request, find_processor, that is directed to the run server in Amoeba. This request asks the run server to provide a processor for a new task. In fact, a user process seldom sends such a request to the run server directly; instead, it normally uses a library function (exec_file) to start a new task. exec_file operates in three steps:

1. Read the file descriptor from a file server (called the bullet server in Amoeba).

2. Make a find_processor call to the run server and receive the capability of a processor.

3. Make a ps_exec call, with the file descriptor obtained in step 1 as a parameter, to the process server that corresponds to the returned processor capability. The process server then executes the program and returns the process's capability.

Figure 1: Order of events of executing exec_file

Steps 2 and 3 are shown in Figure 1. Note that a user process can skip step 2, or ignore the processor capability returned in step 2, and force a particular processor to start a new process. This may lead the run server to make inappropriate decisions in its subsequent operations, since it cannot maintain up-to-date workload information. In order to obtain up-to-date load information, the run server periodically polls the other processors and estimates their load from the information it obtains.
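To make this flow concrete, the three steps can be sketched in C as follows. This is a minimal sketch of ours, not the real Amoeba library interface: the capability type is an opaque stand-in, and the signatures of the three calls are simplified placeholders for the operations named above.

    /* Sketch of the exec_file flow; types and signatures are simplified
     * placeholders, not the real Amoeba API. */
    typedef struct { unsigned char bytes[16]; } capability; /* opaque stand-in */

    int bullet_read_descriptor(const char *path, capability *fd_cap); /* step 1 */
    int find_processor(capability *proc_cap);                         /* step 2 */
    int ps_exec(const capability *proc_cap, const capability *fd_cap,
                capability *proc_out);                                /* step 3 */

    int exec_file_sketch(const char *path, capability *process_cap)
    {
        capability fd_cap, proc_cap;

        if (bullet_read_descriptor(path, &fd_cap) != 0)  /* 1: descriptor from the bullet server */
            return -1;
        if (find_processor(&proc_cap) != 0)              /* 2: ask the run server (or load balancer) */
            return -1;
        return ps_exec(&proc_cap, &fd_cap, process_cap); /* 3: start the process on that processor */
    }

A user process that wants to pin a task to a particular processor simply skips the find_processor call and supplies its own processor capability to ps_exec, which is exactly the behavior that defeats the run server's bookkeeping.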

2.3 Structure and Functions

Apart from inheriting the same interface as the run server, we reused some of the code from the run server, changed some of it, and added a new thread to build our load balancer. The newly added thread is responsible for dynamic load balancing, that is, for migrating processes between processors. After these changes, the load balancer is composed of the following threads:

• Polling threads. A set of threads responsible for collecting workload information from all other processors on a regular basis (by default every 5 seconds).

• Checking threads. These threads check, at a much lower frequency than the polling threads, the reachability of the other processors. If a processor does not respond to a request, it is considered to be down.

• Server threads. These threads provide several services to other processes. The main service is to respond to a find_processor request with the capability of a chosen processor. Half a second after responding to a request, the thread checks the load on the selected processor to obtain its latest workload.

• Migration thread. This thread relies on the information gathered by the polling threads to detect overloaded and underloaded processors. If it finds any, it tries to balance the load by migration. To minimize the effect of using outdated information, it checks the load of the overloaded processor again just before migrating; if the processor is still overloaded, the migration proceeds, otherwise it is abandoned. After a successful migration, this thread polls both the source and the destination processors to update the load information.

The first three kinds of threads provide much the same function as the run server, but with more flexibility than before. The overall structure of the load balancer and its functionality are shown in Figure 2.

Figure 2: Structure and functionality of the load balancer

3 Load Information

The existing facility can provide some load information about each processor, which is:

• The CPU nominal speed, computed when a processor boots. This is static information used to distinguish slow architectures (or configurations) from fast ones.

• The average number of running threads t, which is in fact a weighted average of the number of threads running on the processor during the last 3.2 seconds. t is recalculated every 100 ms with the fixed-point formula t = (T * 2^16 + t_last * 2^5 - t_last) / 2^5, where t_last is the value of t calculated the previous time and T is the current number of threads running on the processor (an exponentially weighted moving average computed in fixed point). This value can be interpreted as the average CPU queue length, which is often used as a load index [4, 9].

• The amount of free memory.

These three pieces of information are only useful for process initiation: since they carry no information about the running processes, one cannot use them to choose processes for migration. To answer the question of when to move which process to where, additional information about both the processors and the processes in the system is needed. A minor change to the process server was therefore made. Besides gathering the above information, the server can now also provide the following:

• The average utilization of a processor, U_av, over a time interval T. U_av is given by U_av(n) = (1 - p) * U_av(n-1) + p * U_c, where p ∈ (0, 1] and U_c = Σ_{i∈P} B_i / (1000 * T). Here P is the set of processes running on the processor and B_i is the CPU time (in milliseconds) used by process p_i during the period T (in seconds). The values of p and T can be tuned (in our experiments we used p = 0.9 and T = 3). This information can be useful when determining whether a processor is underloaded [4] (this computation, together with the thread average above, is sketched in C after this list).

• The average CPU consumption of each process running on a specific processor, which indicates the association between a process and the processor. The average CPU utilization is the sum of the average CPU consumption of the processes running on the processor.

• The number of milliseconds elapsed since the creation of each process, which can be used to estimate the remaining execution time.

• The number of processes currently running on a specific processor, which can be used as a load index [11].

• The size of the memory occupied by each process, in bytes. This information is essential for estimating whether there is enough room for a new process on a destination processor.

• The number of threads in each process.
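To make the two processor-level indices concrete, the following C sketch shows how the fixed-point thread average and the smoothed utilization described above can be updated. This is our own minimal rendering of the two formulas; the variable and function names are ours, not Amoeba's.

    #include <stdint.h>

    /* Average number of running threads, updated every 100 ms.
     * t is kept scaled by 2^16; the decay factor of 31/32 per tick
     * corresponds to a window of about 32 ticks = 3.2 seconds. */
    static uint32_t t_fixed; /* t, scaled by 2^16 */

    void update_thread_average(uint32_t T) /* T: current number of running threads */
    {
        t_fixed = (T * (1u << 16) + t_fixed * (1u << 5) - t_fixed) >> 5;
    }

    /* Smoothed utilization: U_av(n) = (1 - p) * U_av(n-1) + p * U_c, where
     * U_c sums the per-process CPU times B_i (ms) over the period T (s). */
    static double u_av;

    void update_utilization(const long *cpu_ms, int nprocs, double period_s, double p)
    {
        double u_c = 0.0;
        for (int i = 0; i < nprocs; i++)
            u_c += cpu_ms[i];                /* B_i for each running process */
        u_c /= 1000.0 * period_s;            /* fraction of the period spent computing */
        u_av = (1.0 - p) * u_av + p * u_c;   /* exponential smoothing; we used p = 0.9 */
    }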

Note that the above information is more than enough for the experiments reported in this paper; some of it will be used at the next stage, when we test different dynamic load balancing policies, both centralized and distributed.

3.1 Load Balancing Policies

Based on the available information, our load balancer can support either a centralized task initiation policy or a centralized task initiation plus migration policy.

3.1.1 Task Initiation Only

Under this policy, the load balancer performs a function similar to that of the run server. The load balancer manages all the pool processors; its main task is to select processors for new tasks. In fact, without a priori knowledge about the processes and without accurate load information, it is almost impossible to achieve an optimal task assignment. A load balancer might take the following into account in its decision making: the available memory, memory fragmentation (especially since the current version of Amoeba cannot move segments), and whether the text and data segments are cached in the processor's memory. The load balancer uses the following formula to evaluate which processor is the best candidate for a process:

value(h, pd) = cpuspeed_h / runnable_h + memory(h, pd)

where h stands for a processor, pd for a process, cpuspeed_h is a static parameter giving the speed of the processor, runnable_h is the average number of runnable threads on it, and memory(h, pd) indicates whether processor h has enough memory to execute process pd. The load balancer compares these values and selects the processor with the greatest value for process pd. If several processors share the greatest value, the load balancer randomly chooses one of them.
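A direct rendering of this selection rule is given below as a minimal C sketch of ours; the structure fields, the treatment of the memory term as a 0/1 indicator, and the random tie-break are our own simplifications.

    #include <stdlib.h>

    struct host_info {
        double cpu_speed;  /* static nominal CPU speed */
        double runnable;   /* average number of runnable threads */
        int    has_memory; /* 1 if the host has room for the process */
    };

    /* value(h, pd) = cpuspeed_h / runnable_h + memory(h, pd).
     * We add 1 to the queue length purely to avoid division by zero. */
    static double value(const struct host_info *h)
    {
        return h->cpu_speed / (h->runnable + 1.0) + (h->has_memory ? 1.0 : 0.0);
    }

    /* Return the index of the host with the greatest value; ties are
     * broken uniformly at random, as the text describes. */
    int pick_host(const struct host_info *hosts, int n)
    {
        double best = -1.0;
        int pick = -1, ties = 0;
        for (int i = 0; i < n; i++) {
            double v = value(&hosts[i]);
            if (v > best) { best = v; pick = i; ties = 1; }
            else if (v == best && rand() % ++ties == 0) pick = i;
        }
        return pick;
    }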

3.1.2 Task Initiation plus Migration

With this policy, the migration thread of the load balancer is activated. It runs on a regular basis and checks the latest workload information to see whether a load imbalance has occurred. Once the thread detects an overloaded processor and some underloaded processors in the system, it selects a process for migration. Ideally, one would like to move a process that 1) has a long remaining execution time in comparison with the time spent on migration, 2) consumes the most CPU resource, and 3) is small, because the bigger a process is, the more time is consumed in moving it from one processor to another [1]. Nevertheless, big processes usually require more CPU time than small ones, so without a priori knowledge about process execution times, choosing a small process with a long remaining time is a difficult task. We have implemented three heuristic methods to select processes for migration:

• Most computing process selection. This policy selects the most computation-intensive process for migration.

• Oldest process selection. The process that has been running for the longest time so far is selected for migration, because that process probably has the longest remaining execution time, by the Markovian property.

• Combination selection. This policy combines the previous two and uses all the available information to select a process for migration: a weighted sum of CPU consumption and elapsed execution time is used as the selection measure (sketched below).
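The combination heuristic admits a simple rendering; the sketch below is ours, and in particular the weight w and the field names are hypothetical, since the paper does not give the actual weighting.

    #include <stddef.h>

    struct proc_info {
        double avg_cpu_ms; /* average CPU consumption reported by the process server */
        double age_ms;     /* milliseconds since the process was created */
    };

    /* Combination selection: score each candidate by a weighted sum of its
     * CPU consumption and its elapsed execution time; migrate the highest. */
    const struct proc_info *select_for_migration(const struct proc_info *p,
                                                 size_t n, double w)
    {
        const struct proc_info *best = NULL;
        double best_score = -1.0;
        for (size_t i = 0; i < n; i++) {
            double score = w * p[i].avg_cpu_ms + (1.0 - w) * p[i].age_ms;
            if (score > best_score) {
                best_score = score;
                best = &p[i];
            }
        }
        return best;
    }

Setting w = 1 degenerates to most computing process selection and w = 0 to oldest process selection, which is why the combination policy subsumes the other two.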

3.1.3 Source and Destination Selection

The migration thread depends on the information gathered by the other threads to identify an overloaded processor (the source) and an underloaded processor (the destination). Two different load indices can be used to measure the load of a processor: 1) the number of processes on the processor, and 2) the utilization of the processor. These two factors can also be merged to determine the load state of a processor. Currently the following strategies are used:

• Queue length. Under this strategy, the processor that has more processes than the others, where the number exceeds a threshold (MAX_NUMBER), becomes the source, while the processor that has the smallest number of processes, where the number is smaller than or equal to another threshold (MIN_NUMBER), is selected as the destination. If two processors have the same number of processes, one of them is selected.

• Utilization. Similarly to the previous strategy, a process is selected for migration from a processor whose utilization is greater than MAX_THRESHOLD to a processor whose utilization is smaller than MIN_THRESHOLD.

• Combination of the former two. In this approach, if more than one processor has the same number of processes, the one with the higher utilization factor is chosen as the source. The destination is selected in a similar fashion.

Both the strategies and the values of the thresholds are parameters of the load balancer, which can be tuned accordingly.

4 Experiment Environment

4.1 Artificial Workload

Instead of using a workload obtained from a real system, we decided to use an executable artificial workload for the performance evaluation (as defined in [3]), because artificial workloads are easier to reproduce and offer greater flexibility than real ones. All experiments were carried out at night or at the weekend to avoid possible network interference in the PC laboratory. This laboratory is equipped with a number of 40 MHz 80386 PCs with 8 MB of main memory, all connected to a 10 Mbps Ethernet segment by Etherlink2 cards. The experiment environment is shown in Figure 3: there are 6 pool processors, plus one processor that runs the load balancer and another that runs the loader server, which injects the artificial workload into the system during the experiments. A SPARC workstation or an X terminal is used as the console controlling the experiments.

Figure 3: The system configuration

4.1.1 Workload Parameters

The artificial workload used in our experiments was designed according to [3] and [7, 9]; it takes the following four factors into account:

• Mean inter-arrival time of processes, Am; the inter-arrival times follow an exponential distribution.

• Mean execution time of a process on an idle processor, Em; the execution times also follow an exponential distribution. For computation-intensive processes this equals the mean service time.

• Average memory requirements, defined by means of a granularity factor G. We use the formula Mem = E * 1000 * G * 1024 to obtain the memory size of a process, where E is the execution time of the process. Thus, in our environment the memory size of a task is proportional to its execution time. Based on the results reported in [7, 9], this is a natural assumption, because longer processes normally use more memory than short ones.

• Average CPU usage, Cav. This parameter indicates the character of the workload used in an experiment, i.e., whether it is CPU bound or I/O bound. It is the percentage of the execution time spent on computation (not I/O).

Using different values of Am, Em and Cav, we are able to generate various workloads while keeping the CPU utilization at a certain level. The utilization factor can be expressed with the following equation:

Usys = Sm / (Am * M) = (Em * Cav) / (Am * M)

where Sm is the mean service time and M is the number of processors participating in an experiment. In [9] Zhou shows that Am = 2.371 s and Usys = 0.62 resemble the natural load of an academic Unix environment traced for many hours.

4.1.2 Workload Generators

A program was created to produce the executable processes used in the experiments. It creates an executable process according to the following parameters:

• execution time E,

• processor utilization C,

• memory Mem.

The program first allocates a heap according to the parameter Mem. The execution time (E) is divided into equal segments of length 200 ms. Each segment has two parts: execution and sleep. The process executes for a portion of the segment and then sleeps for the rest of it; the execution portion is determined by the parameter C. An instruction inserted at the end of such a virtual process sends a message to the statistics server, which collects data for the performance analysis. Figure 4 shows the algorithm of the program; the pseudocode recovered from it is:

    getParameters(E, C, Mem); /* E - service time; C - CPU usage; Mem - heap size */
    cyc = E / 200;            /* number of cycles */
    run = 200 * C;            /* running period within a cycle */
    sleep = 200 - run;        /* sleeping period within a cycle */
    malloc(Mem);              /* create heap */
    while (cyc > 0) {
        do_something();       /* now compute something */
        sleep_for(sleep);     /* now we are sleeping */
        cyc = cyc - 1;        /* decrease the number of cycles */
    }
    reportFinish();           /* send message to the statistics server */

Figure 4: The procedure used to produce executable processes

Another two programs were developed to produce the workload for the experiments. One of them, called the tracer, uses the load parameters described in the last subsection (except Am) to create a stream of artificial processes. The created processes (the output of the tracer) are used later to conduct an experiment. The tracer also measures the response time of these artificial processes on an idle processor, and their average response time on an idle processor, E, is calculated. E is used in the final performance analysis to obtain the normalized response time, defined as the measured average response time Rm divided by E.

The third program, called the loader (or loadserver), is used during the experiments to inject the workload into the system. It has two threads. One thread reads the trace file created by the tracer and starts the artificial processes on the pool processors; this ensures that we are able to use the same workload to evaluate different load balancing policies. When starting a new task, the loader creates a record for the task, in which related information is recorded during its execution; this information is used in the performance analysis. The second thread of the loader collects the ending messages from terminating tasks; this is done by a short RPC call from the terminating process to the loader. The loader exits when all tasks have finished their execution. An experiment (i.e., the execution of two hundred processes on six processors) lasted from 5 to 20 minutes, depending on the mean inter-arrival time (which defines the load level when the same trace of processes is used). Each experiment was repeated several times to obtain more reliable results.
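For concreteness, the trace-replay thread of the loader might look like the following. This is our own sketch under stated assumptions: the trace record layout (arrival offset, E, C, Mem per line) and the helper start_artificial_process are hypothetical, not taken from the paper.

    #include <stdio.h>
    #include <unistd.h>

    /* One record of the trace produced by the tracer (hypothetical layout). */
    struct trace_rec {
        long   arrival_ms; /* arrival time offset from the start of the run */
        long   exec_ms;    /* execution time E */
        double cpu_frac;   /* CPU usage C */
        long   mem_kb;     /* heap size Mem */
    };

    int start_artificial_process(const struct trace_rec *r); /* wraps exec_file; hypothetical */

    /* Replay a trace, sleeping until each recorded arrival time, so that the
     * same workload can be injected identically under different policies. */
    void replay_trace(FILE *trace)
    {
        struct trace_rec r;
        long now_ms = 0;
        while (fscanf(trace, "%ld %ld %lf %ld",
                      &r.arrival_ms, &r.exec_ms, &r.cpu_frac, &r.mem_kb) == 4) {
            if (r.arrival_ms > now_ms) {
                usleep((useconds_t)(r.arrival_ms - now_ms) * 1000);
                now_ms = r.arrival_ms;
            }
            start_artificial_process(&r);
        }
    }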

5 Results and Conclusions

5.1 Impact of Polling Time

First we look into the impact of the polling time on the effectiveness of job initiation. For an algorithm that depends on regular polling to obtain load information, the polling frequency plays an important role: it determines the quality of the information and, consequently, the quality of the decisions made by the algorithm. The considerations are twofold. In order to keep the information up-to-date, one needs to increase the polling frequency; frequent polling, however, congests the network and increases the load on each processor, because every processor has to respond to the polls, which is a particular concern for a large system with hundreds of processors. On the other hand, less frequent collection means that the load balancer might make its decisions based on outdated information.

The results of using different polling frequencies with the same workload are shown in Figures 5 and 6. In these figures, the vertical axis is the ratio of the mean response time of the processes to their mean response time on an idle processor, while the horizontal axis is the average system utilization. Figure 5 presents the results of the job initiation only policy with different polling intervals. It clearly shows that more frequent polling leads to better performance, because the load balancer makes its decisions based on more up-to-date information, and the overhead incurred by the polling is very well compensated by appropriate process placement.

Figure 5: The effect of different polling times

On the other hand, when the polling frequency remains the same but the average service time of the processes increases (i.e., bigger processes are used), system performance improves markedly. Figure 6 shows the behavior of the system when we load it with average service times of 2, 3 and 5 seconds, respectively. This demonstrates that under the same workload but with longer processes, the load balancer can select processors for processes more accurately, because the information used in its decision making is relatively more stable.

Figure 6: The effect of process service time (polling time = 5s)

However, the results of these experiments have to be taken with caution. From them we cannot derive what exactly the correspondence should be between the polling frequency, the inter-arrival time and the average process service time. It seems to us that the polling interval should be no longer than, or at least comparable to, the inter-arrival time and the process service time. Nevertheless, in practice we do not have information about the inter-arrival time, etc.; therefore, what polling frequency should be used in practice remains an open question, although we know it has a remarkable impact on system performance.

5.2 Different Traces

A set of experiments was carried out to investigate the impact of process migration on system performance. Several policies were used for process selection and for source/destination selection; the differences between these policies were too small to draw any conclusions from, so we present the best results here. The polling interval used in the experiments was two seconds, and the migration thread also checked the state of the system every two seconds. The combination policy was used to choose the overloaded processors and the processes for migration. Three sets of experiments were conducted, using three different traces; the results are shown in Figures 7, 8 and 9.

The workload used in the first experiment is composed of computation-only processes with an exponential service time (mean 3 s). The results are shown in Figure 7. The figure illustrates that the load balancer markedly improves system performance. However, migration does not improve performance further; instead, it deteriorates performance when the average workload is over 70%.

Figure 7: Migration effect on computation oriented processes

In the second experiment we also used an exponential service time (mean 3 s), but some of the processes were computation-intensive, while the rest spent some fraction of their time merely sleeping. This trace allows us to differentiate between processes when deciding which process to migrate. Figure 8 indicates that when system utilization is below 70%, the load balancer with migration capability performs very similarly to initiation only. When the workload is over 75%, the initiation only load balancer performs better than the one equipped with migration. This indicates that when the average workload of a system approaches saturation, migration cannot improve system performance; the migrations instead make performance worse.

Figure 8: Migration effect on mixed processes

The third trace is composed of computation-only processes, but some of them are very long processes, accounting for 12% of the whole load. The result is illustrated in Figure 9. This time, when the average load is over 80%, the load balancer with migration outperforms the initiation only load balancer. This demonstrates that in the presence of very long processes, a load balancer can choose a good candidate for migration, and the overhead introduced by a migration can be compensated by the gain from using idle CPU cycles. The results clearly show that under certain conditions (the existence of long processes) a slight improvement can be gained from the migration mechanism, but the improvement is not very significant.

Figure 9: Migration effect on a trace with some long computation tasks

5.3 Gain from Migration

To see the real benefit of migration for load balancing, we designed a simple experiment to see how much can be gained when idle computers are added to the pool of processors during an experiment. We started several very long processes (1 min.) on three processors, and during the experiment three new processors were added to the system, one by one. The results are shown in Figure 10.

Figure 10: The impact of adding idle computers

It is worth mentioning that our load balancer began balancing the load right after the computers were added to the pool. In the case of one added computer, the load balancer moved one process to the new processor; after a while, the remaining process on the source processor finished, and one more migration became possible. In the case of two processors added to the pool, we observed two immediate migrations, and in the case of three processors, three processes from the overloaded processors were immediately migrated to the idle processors.

5.4 Conclusion

The results received from the experiments indicate that, when low thresholds and a small difference between the overload and underload thresholds are used, the load balancer equipped with migration (LBM) is very sensitive to load variation, which leads to many migrations in an experiment. The measured performance, however, is worse than that of the initiation only load balancer, because the total cost of these migrations exceeds the benefit received from them; at least some of the decisions were inappropriate. On the other hand, under the same conditions, increasing the thresholds and the margin between them makes LBM tolerate small load variations within a reasonable range, and the number of migrations per experiment drops. We then also observed a slight improvement in performance, because the cost of the migrations is compensated by the gain from using idle cycles.

All of this indicates that in the environment used in our experiments, process migration is an expensive operation. Our migration facility takes approximately 250 * X / 100 ms to move an X KB process (e.g., about 250 ms for a 100 KB process), which is comparable with other implementations such as Sprite [1] and Mach [8]. Therefore, one must be very cautious in using it for load balancing, i.e., it should be used only when it is certain that the gain exceeds the cost. Unfortunately, in most practical cases this is almost impossible to establish, for lack of knowledge about the processes. Of course, as network technology advances, the time and cost of migration will decline, so using migration for load balancing may have a more positive impact on system performance in the future. Nevertheless, migration will remain the last resort for high performance, since it will stay relatively expensive for the foreseeable future, as studies on parallel machines have indicated. On the other hand, our results show that migrating long, computation-intensive processes is worth considering. In addition, the dynamic addition of new computers to a system also makes migration an alternative for load balancing. In fact, it is common in current distributed systems for computers to be added to a load balancing pool, especially when the owner of a computer is free to decide when his or her computer joins or quits the load balancing scheme in order to maintain workstation autonomy, as we suggested in [10]. Under a configuration similar to the one used in our experiments, we conclude that process migration can be used for the following purposes:

• Limited load balancing

• Load balancing for newly added processors in the pool

• Fault tolerance

• Administrative purposes

References


[1] F. Douglis and J. Ousterhout. Transparent process migration for personal workstations. Technical Report UCB/CSD 89/540, University of California at Berkeley, Nov. 1989.

[2] D.L. Eager, E.D. Lazowska and J. Zahorjan. The limited performance benefits of migrating active processes for load sharing. Proc. of the 1988 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1988.

[3] D. Ferrari. Computer Systems Performance Evaluation. Englewood Cliffs, NJ: Prentice-Hall, 1978.

[4] D. Ferrari and S. Zhou. An empirical investigation of load indices for load balancing applications. 1988.

[5] M.F. Kaashoek, R. van Renesse, H. van Staveren and A.S. Tanenbaum. FLIP: An internetwork protocol for supporting distributed systems. ACM Trans. on Comp. Syst., Volume 11, Number 1, 1993.

[6] P. Krueger and M. Livny. A comparison of preemptive and non-preemptive load distributing. Proc. of the 8th International Conference on Distributed Computing Systems, June 1988.

[7] T. Kunz. The influence of different workload descriptions on a heuristic load balancing scheme. IEEE Transactions on Software Engineering, Volume 17, Number 7, July 1991.

[8] D.S. Milojičić. Load Distribution: Implementation for the Mach Microkernel. Verlag Vieweg, 1994.

[9] S. Zhou. A trace-driven simulation study of dynamic load balancing. IEEE Trans. on Software Eng., Volume 14, Number 9, 1988.

[10] W. Zhu et al. Load balancing and workstation autonomy on Amoeba. Australian Computer Science Communications, Volume 17, Number 1, 1995.

[11] W. Zhu and A. Goscinski. Load balancing in RHODOS. Technical Report CS90/8, University of New South Wales, Australia, March 1990.
