Parallel Continuous Randomized Load Balancing (Extended Abstract)


Petra Berenbrink*
Department of Mathematics and Computer Science, Paderborn University, Germany
Email: [email protected]

Tom Friedetzky and Ernst W. Mayr†
Institut für Informatik, Technische Universität München, Germany
Email: (friedetz|mayr)@informatik.tu-muenchen.de

Abstract

Recently, the subject of allocating tasks to servers has attracted much attention. There are several ways of distinguishing load balancing problems. There are sequential and parallel strategies, that is, placing the tasks one after the other or all of them in parallel. Another approach divides load balancing problems into continuous and static ones. In the continuous case new tasks are generated and consumed as time proceeds; in the static case the number of tasks is fixed. We present and analyze a parallel randomized continuous load balancing algorithm in a scenario where n processors continuously generate and consume tasks according to some given probability distribution. Each processor initiates load balancing actions only if its load exceeds a certain threshold T, and then tries to find a balancing partner for moving some of its tasks to this partner. We consider several randomized load generation and consumption schemes which yield an expected load of O(n) in the whole system. We show that the maximum load of any processor is upper bounded by (log log n)^2, with high probability. The expected number of load requests required for finding a balancing partner is constant.

* Supported in part by the DFG Sonderforschungsbereich 376 "Massive Parallelität: Algorithmen, Entwurfsmethoden, Anwendungen", by the SICMA Project funded by the European Community within the program on "Advanced Communication Technologies and Services", and by the EU ESPRIT Long Term Research Project 20244 (ALCOM-IT).
† Supported in part by the DFG Sonderforschungsbereich 342 "Werkzeuge und Methoden für die Nutzung paralleler Rechnerarchitekturen".

Additionally, we consider an adversarial scenario where each processor is allowed to generate up to O((log log n)^2) tasks in (log log n)^2 steps, and each processor consumes one task per step, provided that at least one is present. In this case, we unfortunately also need an upper bound T' on the current system load. The maximum number of tasks per processor is then bounded by O(T' + (log log n)^2), w.h.p. Compared to well-known balls-into-bins games, where each generated ball is placed on a randomly chosen processor, our algorithm saves on communication. It moves tasks generated on a processor P to another processor only if the load of P exceeds a certain limit. The costs of communication are saved at the expense of a slightly higher maximum load, that is, O((log log n)^2) compared to O(log log n). Another advantage of our algorithm is that it attempts to keep the tasks generated on the same processor together, which is very important if they are not independent.

1 Introduction

Load balancing is about distributing tasks (load units, jobs) among a set of processors while satisfying certain criteria. One criterion might be the smooth distribution of the load among all processors, or ensuring that no processor becomes idle, i.e., each processor is assigned at least one task. There are several ways to classify load balancing problems. First, there is static load balancing, where a set of tasks has to be assigned to the processors. The size m of this set is fixed (but may depend on n, the number of processors). Important representatives of this category are balls-into-bins type games. Here we deal with a set of balls (tasks) that have to be placed into bins (processors). The goal is to attain a load distribution that is as smooth as possible. Furthermore, one can distinguish between sequential and parallel strategies, that is, placing the tasks one after the other or all of them at the same time. Completely different from the above are continuous load balancing problems. Here, the number of tasks is no longer fixed; new tasks are generated (and consumed) as time passes. We consider such problems as if they had potentially infinite running

time. Usually one focuses on the long-term behavior of continuous systems. Furthermore, both the static and the continuous case can be divided into local and global load generation approaches. The first assumes that load is generated "in place" (that is, by the processors), whereas the latter models a scenario where load comes "from the outside" (this approach applies to balls-into-bins games). Often the local approach is referred to as load balancing, as opposed to task allocation in the global case. In this paper we examine continuous randomized load balancing problems where the load is generated and consumed such that the expected overall system load and the load generated per round are O(n), i.e., the processors are allowed to generate and consume a constant number of tasks per round. We present a load balancing algorithm where each processor initiates load balancing actions only if its load exceeds a certain threshold T, and then tries to find a balancing partner in order to move some of its tasks to this partner. We present bounds on the maximum load of the processors at an arbitrary time step under several load generation and consumption models. We additionally bound the expected amount of communication needed to find a balancing partner, as well as the time the tasks remain in the system.

1.1 Known Results

Let us first consider static balls-into-bins problems. It is well known that, if m = n balls are placed independently and uniformly at random (i.u.a.r.) into n bins, there is one bin receiving Θ(log n / log log n) balls with probability 1 − o(1). In [ABKU94], Azar, Broder, Karlin, and Upfal present a different (sequential) game where each ball is allowed to choose d ≥ 2 bins and is placed into the bin with the lowest load among the chosen bins. They show that w.h.p. the maximum load decreases exponentially to Θ(log log n / log d). In [CS97], Czumaj and Stemann investigate an adaptive process where the number of choices made in order to place a ball depends on the load of the previously chosen bins, a process allowing reallocations, and an off-line allocation process which knows the random choices in advance. Adler, Chakrabarti, Mitzenmacher, and Rasmussen extend these results in [ACMR95] to parallel settings and focus on the trade-off between the amount of communication and the maximum load. For a certain class of strategies, they show a lower bound of Ω((log n / log log n)^{1/r}) on the maximum load if a constant number r of rounds of communication is allowed. They present parallelizations of the sequential strategy of Azar et al. that asymptotically match the lower bound for two rounds of communication. Furthermore, they examine a strategy using a threshold T: in each of r communication rounds each non-allocated ball tries to access two bins chosen i.u.a.r., and each bin accepts up to T balls during each round. They show that with T = (log n / ((2r + o(1)) log log n))^{1/r} this algorithm terminates after r rounds with maximum load r · T, w.h.p. In [Ste96], Stemann extends the results to the case where the number of balls m is larger than the number n of bins.

For m = n, he analyzes a very simple class of algorithms achieving maximum load O((log n / log log n)^{1/r}) if r rounds of communication are allowed. For constant r, this matches the lower bound presented in [ACMR95], and for r = log log n he achieves constant maximum load. For m > n balls he achieves optimal load O(m/n) using O(log log n / log(m/n)) rounds of communication, w.h.p., or load max{(log n)^{1/r}, O(m/n)} using r rounds of communication, w.h.p. [BMS97] extends the above results in two directions. Berenbrink, Meyer auf der Heide, and Schröder generalize the lower bound to arbitrary r ≤ log log n, implying that Stemann's protocol is optimal for all r. Their main result is a generalization of Stemann's result to weighted balls. They present a process achieving a maximum load of at most

O((m/n) · W^A + W^M) using O(log log n / log(ε · (m/n) + 1)) rounds of communication. Here, W^A (W^M) denotes the average (maximum) weight of the balls, and ε = W^A / W^M. Their algorithm is optimal in the case of weighted balls for various degrees of uniformity ε. Another property of their protocol is that m, the number of balls, need not be known in advance. Apparently, somewhat less work has been spent on theoretically analyzing continuous load balancing. In [ABKU94], Azar, Broder, Karlin, and Upfal provide an additional result on an infinite version of their sequential process: in the case of a stationary distribution, the maximum load of any bin is less than log log n / log d + O(1), w.h.p., where n is both the number of balls and bins in the system. In [CS97] the rate of convergence of the infinite process to the stationary distribution is improved. In [RSU91] Rudolph, Slivkin-Allalouf, and Upfal present a simple strategy to equalize the load of two processors in one step and show that the expected load of any processor at any point of time is within a constant factor of the average load. They assume a load generation model where at each time step the load difference of any processor due to local generation and consumption is bounded by some constant. In [LM93], Lüling and Monien use a similar strategy. In their model each processor can generate and consume a constant number of load packets per step. A processor initiates a load balancing action if its load has doubled since its last balancing action. To balance load, the processors choose a constant number of processors at random and equalize their load. Lüling et al. show that the expected load difference of any two processors is bounded by a constant factor. Furthermore, they tightly bound the variance of the expected load of a processor. In [Lau95], Lauer presents a load balancing algorithm assuming the average load av of the system to be known.
Each processor becomes active as soon as its load differs from the average load by c · av for some constant c. In each round, an active processor P chooses a balancing partner at random until it finds an applicable one. A processor P' is called applicable if neither P nor P' will be active after equalizing their load. Lauer shows that with probability 1 − 2n^{1 − O(av / log n)} no processor has load exceeding c' · av for some c' ≥ c. Thus, he obtains a high probability result only for the case av = Ω(log n). Additionally, he presents techniques to

estimate the average load of the system and extends his results to this case. In [Mit96], Mitzenmacher introduces a so-called supermarket system and shows that, given constant running time T, w.h.p. no processor ever has load exceeding O(log log n). The balls are placed sequentially, and he assumes that the load generation is modelled by a Poisson stream of rate λn, λ < 1, with exponentially distributed service time with mean 1. In [Mit97] he extends his results to several different load generation and consumption schemes such as constant service time or different customer types. However, his results are only valid for constant running time. Several algorithms and complete tools for continuous load balancing have been developed and examined with the help of experimental studies, see for example [HS97] and [WHV95]; for an overview consult [SS97]. In [DHB97] the authors propose three novel load balancing algorithms which are implemented on an IBM SP2. They present a thorough experimental study with Poisson-distributed synthetic load; the study shows that their algorithms are very effective. In [MD96] the authors propose a randomized scheme for load balancing called random seeking (RS): source processors randomly seek out sink processors for load balancing by flinging probe messages. These probes do not only seek out sink processors for load balancing, but also collect load distribution information which is used to efficiently regulate load balancing activities. The authors also give a bound on the probability that a probe will be able to allocate a sink, and the average number of visits required by a probe to allocate a sink.
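The exponential drop from Θ(log n / log log n) to Θ(log log n / log d) achieved by the multiple-choice strategy of Azar et al. can be observed empirically. The following sketch (not part of the paper; the parameter values are illustrative) simulates the sequential d-choice game:

```python
import random

def greedy_max_load(n, d, seed=0):
    """Sequential d-choice process: each of n balls probes d bins
    chosen i.u.a.r. and is placed into the least loaded one.
    Returns the maximum bin load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        i = min((rng.randrange(n) for _ in range(d)), key=lambda j: bins[j])
        bins[i] += 1
    return max(bins)

# d = 1 is the classical single-choice game; d = 2 already shows
# the double-logarithmic behavior for moderate n.
print(greedy_max_load(10_000, 1), greedy_max_load(10_000, 2))
```

The two printed values illustrate the gap between the single-choice and two-choice maximum loads for one fixed random seed.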

1.2 New Results

To our knowledge, we present the first analysis of a continuous load balancing algorithm assuming local load generation that at the same time gives a high probability bound on the maximum load of each processor. We consider a set of n processors. At any step a processor can generate tasks, perform local computations, communicate with at most a constant number of other processors, and send some tasks to another processor. We assume that yet-to-be-performed tasks are stored in a FIFO-like manner. We mainly concentrate on the following load generation model:

Single: At each time step every processor generates a task with probability p and consumes one task with probability q = p + ε (provided there is at least one task) for arbitrary constant ε > 0. (The ε is needed because otherwise no steady-state statements can be made.) This models a scenario where the running time of tasks is geometrically distributed. In order to obtain a steady-state load distribution on the processors, the probability that a processor generates one task per step has to be smaller than the probability that a processor consumes one task per step.

For our load balancing algorithm we divide time into consecutive phases of length (1/16)T with T = (log log n)^2. Each processor initiates a load balancing action only if, at the beginning of a phase, its load exceeds a threshold of (1/2)T. During that phase such a processor tries to find a balancing partner with load at most (1/16)T (again, at the beginning of the phase) in order to transfer (1/4)T of its tasks to this partner. This process of finding balancing partners is carried

out by the so-called collision game, which is introduced in Section 2. We now formulate our main theorem:

Theorem 1 Fix an arbitrary point of time. Given the Single generation model, our algorithm assures that with high probability¹ (w.h.p.) the maximum load of any processor is bounded by (log log n)^2.

Note that the theorem also implies the stability of the system. We then show that the expected number of communication rounds needed to find a balancing partner is constant. Here, by a communication round we mean the execution of a collision game. There are several models closely related to Single. Our analysis has to be modified only slightly to give the very same result for any model with overall expected system load O(n), in which steady-state statements like in Lemma 2 can be made and a processor is allowed to generate up to O((log log n)^2) tasks in an interval of the same length. For example, for the following two models we can show the same results (the proofs are omitted due to space limitations):

Geometric: In one step each processor generates up to k tasks (k has to be a constant). For i ∈ {1, …, k} a processor generates i tasks with probability 2^{−(i+1)}, that is, with probability 1/4 it generates one task, with probability 1/8 two tasks, and so on. With the remaining probability (> 1/2) no task is generated. Furthermore, it deterministically consumes one task if at least one is present.

Multi: A processor may generate i < c tasks per step with some probability p(i), 0 ≤ i ≤ c, as long as c is a constant and the expected number of tasks generated per step is less than one. Each processor deterministically consumes one task per step if at least one is present.

In the above cases our maximum load will be bounded by k · (log log n)^2 and c · (log log n)^2, respectively. The two load generation schemes introduced above model constant running time for each task, but more than one task may be generated per time step.
For these models we additionally show that the tasks present in the system at an arbitrary fixed time step spend no more than O((log log n)^2) steps in the system, w.h.p. The last generation model that we consider is an adversarial one:

Adversarial: In T = (log log n)^2 steps a processor is allowed to change its load on its own by O(T) tasks in either direction. Additionally, an upper bound B on the overall system load has to be given for each point of time.

Provided that an upper bound on the total system load is given, this load generation scheme is capable of describing tree-like load generation schemes, where each task currently being performed is able to generate a constant number of new tasks. In this case we can prove that at an arbitrary point of time the maximum load of any processor is bounded by O(B + (log log n)^2), w.h.p. The expected number of communication rounds needed to find a balancing partner is constant. The proofs of these results are omitted for reasons of brevity, too.

¹ With high probability means with probability at least 1 − n^{−c} for some constant c.
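A minimal simulation of the Single model without any balancing (a sketch; p = 0.4 and ε = 0.1 are arbitrary sample values, the paper only requires ε > 0) illustrates that the total load stays at O(n) while individual loads remain small:

```python
import random

def simulate_single(n, p, eps, steps, seed=0):
    """Single model, no balancing: per step each processor generates
    a task w.p. p and consumes one w.p. p + eps (if it has any).
    Returns (total system load, maximum processor load)."""
    rng = random.Random(seed)
    load = [0] * n
    for _ in range(steps):
        for i in range(n):
            if rng.random() < p:
                load[i] += 1
            if load[i] > 0 and rng.random() < p + eps:
                load[i] -= 1
    return sum(load), max(load)

total, mx = simulate_single(n=1000, p=0.4, eps=0.1, steps=2000)
# In the steady state the total load concentrates around
# n * p_g / (p_l - p_g) tasks, where p_g and p_l are the gain and
# loss probabilities derived in Lemma 2.
```

This is the unbalanced baseline of Section 4.1; the balancing algorithm of Section 3 additionally suppresses the Θ(log n / log log n) outliers.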

Compared to balls-into-bins games, which spread tasks among different processors, our algorithm saves on communication costs. Parallel balls-into-bins games need at least O(n) communication messages to allocate the n tasks which are generated per step. We show that, at the beginning of a phase, the number of processors initiating a load balancing action is bounded by n/(log n)^{log log n}, w.h.p., and the overall number of communication rounds per phase is bounded by n/(log n)^{log log(n) − 1}. Thus, for our algorithm O(n/(log n)^{log log(n) − 1}) messages per whole phase (consisting of (log log n)^2 steps) are sufficient, w.h.p. A processor initiates a load balancing action only if its load is too high, and this can only be the case after the generation of Θ((log log n)^2) new tasks since its last balancing action. Another advantage of our algorithm is that it tries to keep the tasks generated on the same processor together. This is an important feature if these tasks are not independent.

2 The Collision Protocol

In this section we describe a very useful tool that we are going to apply in our balancing algorithm. The so-called (n, ε, a, b, c)-collision protocol originates in shared memory simulations (see [MSS95]), but we adapt it to determine an assignment of load balancing requests to processors. Suppose that we have n processors, of which εn are overloaded, for some 0 < ε < 1. We say that each of these processors initiates a request. For each request we now select a processors independently and uniformly at random (i.u.a.r.) and send a query to each of them. Thus, we have a queries belonging to one request. The main goal of the collision protocol is to find an assignment of queries to processors that satisfies two conditions, namely

1. no processor has to answer more than c queries, and
2. at least b < a of the queries that belong to a request are answered.

The protocol is briefly depicted in Figure 1. In [MSS95] it was shown that w.h.p. this process terminates with a valid assignment of queries (after log log n / log(c(a − b)) + 3 rounds of the For-loop), provided that we have

- n processors,
- εn requests for some 0 < ε < 1,
- 2 ≤ a ≤ √(log n) queries for each request,
- b < a accepted queries required for each request,
- a collision value c = O((√(log n)/(a − b))^{1/3}),

and that the following two conditions hold for some constant δ > 0:

    c^2 (a − b) / (c + 1) > 1 + δ                  (1)
    ε ≤ (1/(a − b)) · ((a − 2)/a)^b · c            (2)

For each request select i.u.a.r. a processors P_1, …, P_a, and send queries to these processors.
For r = 1 To log log n / log(c(a − b)) + 3 Do
  - Each processor with at most c queries accepts all queries and sends accept messages to the corresponding processors. If a processor gets more than c queries, it answers none of them (we call c the collision value).
  - Each requesting processor counts the number of arriving accept messages and stores the id of the sending processor (we call the corresponding queries accepted queries). If at least b accept messages arrived, further queries are cancelled and the processor leaves the game.
  - Each processor getting fewer than b accept messages (summed up over all rounds of the protocol) sends new queries to the processors which have not accepted the queries up to now (note that no new random choices are made).

Figure 1: The (n, ε, a, b, c)-collision protocol

Since the a queries are checked for success sequentially and an overloaded processor has to wait c steps to know whether a query was accepted, one round of the protocol needs a · c steps. This leads to a total running time of a · c · (log log n / log(c(a − b)) + 3) steps. Note that it can be shown that w.h.p. the protocol needs O(n/a) messages altogether. In order to simplify upcoming descriptions we now state the following lemma:

Lemma 1 Let a = 5, b = 2, c = 1, and ε < 1 suitably chosen. The (n, ε, a, b, c)-process determines a valid assignment of queries such that after 5 log log n steps there are two accepted queries for each request and each processor is assigned at most one query.

Proof We just have to plug in our parameters:



a · c · (log log n / log(c(a − b)) + 3) = 5 · (log log n / log 3 + 3) ≤ 5 log log n

for n large enough. Furthermore, it is easy to see that both conditions 1 and 2 from above are fulfilled. □
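The collision game with the parameters of Lemma 1 can be sketched in a few lines. This is an illustrative simulation, not the exact protocol: targets are sampled without repetition, rounds are capped, and the sequential checking of queries is not modelled.

```python
import random

def collision_game(n, m, a=5, b=2, c=1, seed=0, max_rounds=100):
    """Sketch of the (n, eps, a, b, c)-collision game with m = eps*n
    requests. Each request owns a processors chosen at random; in every
    round a processor that received at most c queries accepts them all,
    and a request leaves the game once b of its queries were accepted.
    Returns (rounds used, whether all requests finished)."""
    rng = random.Random(seed)
    targets = [rng.sample(range(n), a) for _ in range(m)]
    accepted = [set() for _ in range(m)]
    active = set(range(m))
    rounds = 0
    while active and rounds < max_rounds:
        rounds += 1
        inbox = {}
        for r in active:              # resend all not-yet-accepted queries
            for q in targets[r]:
                if q not in accepted[r]:
                    inbox.setdefault(q, []).append(r)
        for q, reqs in inbox.items():
            if len(reqs) <= c:        # no collision at this processor
                for r in reqs:
                    accepted[r].add(q)
        active = {r for r in active if len(accepted[r]) < b}
    return rounds, not active

rounds, done = collision_game(n=10_000, m=100)
# With few requests relative to n, almost every query is collision-free,
# so the game ends after very few rounds.
```

Note how the collision effect appears: a processor hit by more than c queries answers none of them, yet the game still terminates quickly because each request has a − b spare queries.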

3 The Balancing Algorithm

In this section we introduce our balancing algorithm. Let T = (log log n)^2. We divide time into consecutive phases of length (1/16)T. Processors with load at least (1/2)T at the beginning of a phase are called heavy, and processors with load

at most (1/16)T at the beginning of a phase are called light. During a phase each heavy processor tries to find one light processor in order to transfer (1/4)T tasks to it. In the following, we describe a balancing phase in more detail. We split each phase into rounds. During the first round, each processor P that is heavy at the beginning of the phase sends two balancing requests to two processors P^(1) and P^(2). In order to find these processors we apply the collision protocol globally, that is, over all requesting processors. Now either at least one of P^(1) and P^(2) was light at the beginning of the phase and was not "reserved" to be assigned load from some other overloaded processor in this phase, and therefore is able to accept tasks from P, or neither of them was. In the latter case they "support" P in finding a balancing partner and each send out two load requests during communication round two, to processors P^(1,1), P^(1,2) and P^(2,1), P^(2,2), respectively. Again, we apply the collision protocol for finding these processors. This approach continues in the next rounds, so that in the i-th round of the phase we have 2^i balancing requests for each heavy processor. In the first round a heavy processor sends a balancing request to two other processors; in the next round each of the two sends a balancing request to two others (if they are both heavy), and so on. One can imagine this as building a binary balancing request tree for each heavy processor. Now we are going to describe how to apply the collision protocol. In the following, we call the processors that participate in the collision game active. Each active processor chooses i.u.a.r. a = 5 processors and participates in the collision protocol with parameters c = 1 and b = 2, that is, each request consists of a queries, and at least b of these queries will be accepted at the end of the protocol.
We apply the collision protocol to distribute the queries between each pair of adjacent levels of our query trees. Recall that queries are doubled each round. We say that any processor can start a query (in case it is supporting some other processor), but only roots of query trees originate them. Note that the acceptance of a query does not necessarily mean that the corresponding processor is able to actually accept additional load; it merely means that the number of queries directed to it does not exceed c. We will see in Lemma 4 and Lemma 5 that there are at most O(n/(log n)^{log log n}) heavy processors, and it is sufficient for each heavy processor to ask at most Θ(log n / log log log n) processors at random to assure with high probability that at least one of them can accept additional load. Thus, this is the case if we build these query trees up to some depth o(log log n). Since a constant fraction of log log n is sufficient as depth of the query trees, we use (1/80) log log n to keep the following description simple. Note that in the last level of our query trees we therefore have O(n · log n/(log n)^{log log n}) queries, summed over all these trees, and the total number of queries of a phase can be bounded by O(n · log n/(log n)^{log log n}), too. The running time of a phase can be bounded by 5 log log n · (1/80) log log n = (1/16)(log log n)^2. After completing a collision round, each processor that has accepted a query and that is able to accept additional load sends an id message to the originator of the request. In case an originator receives more than one id message, an arbitrary one is selected. Note that our algorithm makes use of the collision effect, that is, independent of the number of

queries (messages) directed to a processor, only a constant number of them is actually evaluated. As noted earlier, in order to later bound the expected amount of communication, we assume that a processor forwards a request if and only if both it and its sibling in the query tree cannot be assigned additional load (since we have b = 2 there are two accepted queries per request; by the sibling of a query we mean the other one). Checking this is easy: they just have to communicate via their parent node in the tree. Furthermore, we assume that tasks are generally processed in a FIFO-like manner and that tasks transferred during a balancing action are both taken from and appended to the back of the corresponding queues. The algorithm can now be stated as in Figure 2. Note that, since we set c = 1, each processor has to answer at most one query.

4 The Analysis

We split the analysis into two parts. First, we consider the unbalanced system only, that is, a system without any load balancing actions. Second, we carry some of the results for the unbalanced system over to the balanced one.

4.1 The Unbalanced System

We now estimate probability bounds on the load of single processors as well as on the total system load.

Lemma 2 Fix an arbitrary time step. Given the Single generation model, in the unbalanced system a node will have load k with probability at most (1/c)^k, for some constant c > 1. Furthermore, the system load is O(n) with high probability.

Proof Fix an arbitrary processor. Suppose we have generation probability p and consumption probability p + ε with arbitrary ε > 0. It is easy to see that at any time step any node gains a packet with probability p_g := p · (1 − (p + ε)), loses a packet with probability p_l := (p + ε) · (1 − p), or remains at its level with probability p_s := 1 − p_g − p_l. We now construct a discrete-parameter Markov chain. This chain models the time-dependent load situation, that is, the different states of the chain correspond to the processor's load. We have the transition probability matrix P = (p_{ij}) where p_{ij} denotes the probability for the chain to reach state j directly from state i. Due to our generation model we have p_{ij} = 0 for |i − j| > 1 and p_{i,i+1} = p_g, p_{i,i−1} = p_l for i > 0, and p_{i,i} = p_s. We can compute the steady-state probability vector v = (v_0, v_1, …) with v_i as the probability of the chain being in state i (that is, of the processor having load i) as





v_i = (1 − p_g/p_l) · (p_g/p_l)^i
    = (1 − p_g/p_l) · ( p(1 − p − ε) / ((p + ε)(1 − p)) )^i
    = (1 − p_g/p_l) · ( (p − p^2 − pε) / (p − p^2 − pε + ε) )^i
    ≤ (1/c)^i    for 1 < c ≤ (p − p^2 − pε + ε) / (p − p^2 − pε).

/* We have the following variables for each processor P_i:
   sibling[i]  Sibling of P_i
   search[i]   Is 1 iff P_i starts a request in the current round
   boss[i]     Originator of a query accepted by P_i
   found[i]    Is j if P_i accepted a query of P_j, −1 otherwise
   light[i]    Is 1 iff P_i is light at the beginning of the phase
   heavy[i]    Is 1 iff P_i is heavy at the beginning of the phase
   assign[i]   Is 1 iff P_i was assigned load before this round */

For each processor P_i Do in parallel
    If heavy[i] Then search[i] = 1; boss[i] = i Else search[i] = 0
For i = 1 To (1/80) log log n Do
    "Play collision game with {P_i | search[i] = 1}"
    /* We assume that the collision game ends up with a valid
       assignment for the array found, as described earlier */
    For each P_i with found[i] > −1 Do in parallel
        boss[i] = boss[found[i]]
        If (light[i] = 0 Or assign[i] = 1) And
           (light[sibling(i)] = 0 Or assign[sibling(i)] = 1) Then
            /* Both me and my sibling cannot accept load.
               So we keep on searching. */
            search[i] = 1
        Else If light[i] = 1 Then
            search[i] = 0
            assign[i] = 1
            send id message to processor boss[i]

    End
End
For each P_i with heavy[i] = 1 Do in parallel
    If P_i received an id message Then
        send (1/4)T tasks to the corresponding processor and leave the game
End

Figure 2: A Phase of the Balancing Algorithm

Given this, it is easy to show that with high probability the total system load is bounded by O(n), using e.g. Chernoff-Hoeffding bounds (note that as long as we consider an unbalanced system we have independent events, so that Chernoff bounds apply). □

Note that similar calculations can be done for the other generation models that have been introduced. So far we have seen that the unbalanced system does not behave too badly as long as we only watch the total system load. Nevertheless, with probability 1 − o(1) there will be some node having load Θ(log n / log log n).
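The closed form for v_i derived in the proof of Lemma 2 can be checked numerically (a sanity check, not part of the paper; p = 0.3 and ε = 0.2 are sample values, and any p with p + ε < 1 behaves the same):

```python
# Steady state of the birth-death chain from the proof of Lemma 2.
p, eps = 0.3, 0.2
pg = p * (1 - p - eps)        # probability of gaining a task
pl = (p + eps) * (1 - p)      # probability of losing a task
rho = pg / pl                 # < 1 precisely because eps > 0

# v_i = (1 - pg/pl) * (pg/pl)^i, truncated far out in the tail
v = [(1 - rho) * rho ** i for i in range(200)]

assert abs(sum(v) - 1) < 1e-6             # a probability distribution
for i in range(50):                       # balance: v_i*pg = v_{i+1}*pl
    assert abs(v[i] * pg - v[i + 1] * pl) < 1e-12
```

The geometric decay with ratio p_g/p_l < 1 is exactly what yields the (1/c)^k tail exploited later in the proof of Lemma 4.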

4.2 Analyzing the Balanced System

It is easy to see that a balanced system does not behave worse (in terms of total system load) than an unbalanced one. To give an intuition, suppose we have a completely unbalanced system. Let, at some time step, two processors with load l_1 and l_2, where l_1 < l_2, perform a balancing action in which m tasks are moved. Without balancing and without generation of additional tasks, after l_1 steps only the higher loaded processor would continue to consume load, whereas with balancing both consume for l_1 + m steps, and after l_1 steps the total system load starts to become smaller than in the unbalanced case. Of course, the situation is slightly different when further tasks may be generated after balancing. But still the following lemma can be proven easily:

Lemma 3 Fix an arbitrary time step. In the balanced system, the system load is O(n) with high probability.

As a next step we want to estimate the numbers of heavy and light processors, respectively.

Lemma 4 Fix an arbitrary phase. With high probability there are no more than O(n/(log n)^{log log n}) heavy processors at the beginning of this phase. Furthermore, with high probability there are at least n(1 − 16c/T) light processors at the beginning of the phase, for some constant c.

Proof During a balancing action a processor is assigned (1/4)T tasks. Since we only balance with processors having load at most (1/16)T at the beginning of a phase, afterwards such a processor has load at most (6/16)T (the (1/4)T tasks assigned in correspondence to a balancing action, the at most (1/16)T tasks it already had, and at most another (1/16)T tasks generated by itself during the phase). Thus, a processor never exceeds load (6/16)T due to balancing actions in which it was assigned load. This means that a processor must have raised its load by at least (1/2)T − (6/16)T = (1/8)T tasks on its own in order to reach our threshold of (1/2)T, i.e., it stores at least (1/8)T self-generated tasks. It can be shown that the probability for a processor to have load (1/2)T can be upper bounded by the probability that a processor has load (1/8)T at a fixed time step in a process without any balancing actions at all. This can be done by observing a slightly different process in which processors prefer self-generated tasks over tasks received due to balancing actions (self-generated

tasks are not delayed by received tasks, and due to actively triggered load balancing actions the self-generated load only gets smaller). In this case the number of self-generated tasks can be bounded by observing the load of a processor that does not participate in balancing actions at all. Note that the load of the processors does not dependent on the order the tasks are performed in. Thus, we can apply Lemma 2. Let bi be the load of processor Pi . Then we have

p(b_i ≥ T/8) ≤ Σ_{k=T/8}^∞ p(b_i = k) ≤ Σ_{k=T/8}^∞ (1/c)^k = (c/(c−1)) · (1/c)^{T/8} = O(1/(log n)^{log log n}).

Since there are n processors, the expected number of processors with load at least T/2 is

O(n/(log n)^{log log n}).

Again, it is easy to show that this bound holds with high probability, using e.g. Chernoff-Hoeffding bounds. (Note that, since a processor must have raised its load on its own, we can ignore balancing actions and therefore the events are independent.) Since w.h.p. we have a total load of cn for some constant c at the beginning of an arbitrary phase, w.h.p. there can be at most 16cn/T processors with load at least T/16 at the beginning of an arbitrary phase. Thus, at the beginning of the phase, there are at least n − 16cn/T processors with load at most T/16. □
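The geometric tail identity used in this bound can be checked numerically; the values of c and K = T/8 below are illustrative stand-ins, not constants derived in the paper:

```python
# Sanity check of the geometric tail identity from the proof of Lemma 4:
#   sum_{k >= K} (1/c)^k = (c/(c-1)) * (1/c)^K.
# c and K are illustrative stand-ins, not constants from the analysis.
from math import isclose

def geometric_tail(c, K, terms=10_000):
    """Truncated partial sum of sum_{k >= K} (1/c)^k."""
    return sum((1.0 / c) ** k for k in range(K, K + terms))

c, K = 3.0, 8
closed_form = (c / (c - 1.0)) * (1.0 / c) ** K
assert isclose(geometric_tail(c, K), closed_form, rel_tol=1e-12)
```

For c > 1 the truncation error is negligible, so the partial sum matches the closed form to machine precision.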



Lemma 5 If we select Θ(log n / log log log n) = o(log n) processors with the help of the collision protocol, then with high probability at least one of them is light.

Proof The assignment of queries to processors is not truly random, since each processor is assigned at most one query (leading to some kind of partial permutation). But we can find a lower bound on the number of available light processors if we decrease the number of light processors given by Lemma 4 by upper bounds on both the number of queries already handled in this phase and the number of processors already reserved for balancing actions (cf. the variable assign in the algorithm). Therefore, for appropriate constants c, c1, c2, and c3, in order to determine the sought probability, we can bound the number of processors that are able to accept load from below by

n − 16cn/T − c1 · n/(log n)^{log log n} − c2 · n/(log n)^{log log n} ≥ n(1 − 16c3/T).

Given this bound, it is easy to verify that the lemma follows.

□
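To get a feel for the magnitudes in Lemma 5, the following back-of-the-envelope check treats the sampled processors as independent, each failing to be applicative with probability at most 1/4. Both the independence and the constant factor of 1 in the sample size are simplifying assumptions of this sketch, not claims of the proof:

```python
# Illustrative check for Lemma 5: if each of m = log n / log log log n sampled
# processors is non-applicative with probability at most eps, and we pretend
# the samples are independent, then all of them fail with probability at most
# eps^m, far below 1/n for the values chosen here.
from math import log

def miss_probability(eps, m):
    """Probability that m independent samples all miss a light processor."""
    return eps ** m

n = 2 ** 20
m = int(log(n) / log(log(log(n))))  # sample size from Lemma 5 (constant factor 1 assumed)
eps = 0.25                          # non-applicative probability, as in the proof of Lemma 7
assert miss_probability(eps, m) < 1.0 / n
```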

By Lemma 4 we now know that w.h.p. there are at most

O(n/(log n)^{log log n})

heavy processors; thus we can bound the number of requests per round by

O(n log n/(log n)^{log log n}).

Thus, the conditions of Lemma 1 are met for each round of the phase. The collision protocol is applicable and w.h.p. finds a valid assignment of queries to processors in each round. We state the following lemma:

Lemma 6 With high probability, after (1/16)(log log n)^2 steps each heavy processor has found a light one.

Proof As we have seen in Lemma 5, we need to build the query trees up to some depth o(log log n) in order to ensure that w.h.p. each heavy processor finds a light one (we had to select o(log n) processors). For reasons of simplification we build the trees to depth (1/80) log log n. Since each of the (1/80) log log n collision games w.h.p. finds a valid assignment after 5 log log n steps, the lemma follows. □

Remark: Assume a processor P with load T/2 − 1 at the beginning of a phase, which is therefore not overloaded. P may generate up to T/16 tasks during this phase. Thus, at the beginning of the next phase P is heavy and tries to find a balancing partner. Again, P may generate up to another T/16 tasks in this phase. Provided that the balancing was successful, P is left with load at most T/2 − 1 + 2T/16 − T/4 < 6T/16, and is therefore below the threshold of T/2. This implies that if a processor's first balancing attempt was successful, it cannot be heavy during the next phase.

We are now ready to prove the Main Theorem. In order to upper bound the probability that there exists a processor P whose load exceeds the threshold at an arbitrary point of time t, we bound the expected number of processors exceeding the threshold. Obviously, this expected value yields a bound for the sought probability. Fix an arbitrary point of time t at which the system will be examined, and assume that at this time some processor P has load exceeding the threshold. Then there must be a time step t − t0 (the beginning of one phase) in which P's load first reached this mark.
Between steps t − t0 and t, P always had too much load, and in each phase P either unsuccessfully tried to find a balancing partner, or found a partner but was not able to give away as much load as would have been needed to drop below the threshold. In the following we define three events.

• E_i is the event that P's load first exceeded the threshold in step t − i.
• F is the event that P is still above the threshold in step t.
• U is the event that the first balancing phase was unsuccessful.

We now have

p(P has too much load in step t) ≤ Σ_{i=1}^∞ p(F ∧ E_i) = Σ_{i=1}^∞ p(E_i) · p(F | E_i).

Since p(F | E_i) ≤ p(U) (note that an unsuccessful first balancing phase is necessary in our situation), we can write

p(P has too much load in step t) ≤ Σ_{i=1}^∞ p(E_i) · p(U) = p(U) · Σ_{i=1}^∞ p(E_i) ≤ p(U).
In the following we bound p(U), the probability that the first balancing step of P was unsuccessful. First we show that, given that there exists a processor P starting its first balancing action, the total system load can still be estimated as O(n). Let

• A be the event that we have ω(n) tasks in the system, and
• B be the event that P initiates a balancing action.

We now have

p(A | B) = p(A ∧ B)/p(B) ≤ p(A)/p(B).

We can upper bound p(A) ≤ 1/n^l for some constant l > 2, as usual, using Chernoff-Hoeffding bounds. We can easily lower bound p(B) by considering a modified protocol that, on all accounts, performs fewer balancing actions than the original one. We omit details, but it can be shown that p(B) ≥ 1/n. All conditions are met to apply Lemma 4, which shows that w.h.p. there are O(n/(log n)^{log log n}) heavy processors at the beginning of an arbitrary phase. Hence, there are not too many requests for the collision game during each round of the phase, and the bounds given for the collision game hold. Furthermore, w.h.p. there are at least n(1 − 16c/T) light processors at the beginning of the phase. Lemma 6 shows that w.h.p. each heavy processor finds a light one to which it can transfer load within T/16 steps. Thus, the probability that the load of P exceeds the threshold in step t can be bounded by n^{-(l+1)} for some constant l. Furthermore, the probability that there exists a processor exceeding the threshold can be upper bounded by n^{-l}. This concludes the proof of our Main Theorem.

4.3 Expected Behavior

We know that with high probability a heavy processor finds a light one after issuing o(log n) balancing requests. But one might also be interested in the expected number of queries. In the following we show that the expected number of requests performed for one heavy processor is bounded by a constant. This means that with high probability the expected depth of our query trees is constant, and the expected total number of messages communicated for each heavy processor is also constant. By this we mean that if the system load is bounded by O(n), which is true with high probability, the expectations are as given. Furthermore, we show that the waiting time of a task, i.e. the time a task spends in the system, is bounded by O((log log n)^2), w.h.p.

Lemma 7 Fix an arbitrary phase starting at time t. With high probability, the expected number of requests sent for a heavy processor in that phase is constant.

Proof In the following we call a processor applicative for a balancing action if it is light at the beginning of the phase and has not yet been reserved to receive load during this phase. To bound the expected amount of communication, recall that a node completes its subtree in the balancing request tree if and only if both itself and its sibling are not applicative. In the following we call a path from the root of a balancing tree to a node v on level i an active path if all the predecessors of v and their siblings correspond to requests sent to non-applicative processors. Thus, the path corresponds to 2(i − 1) requests sent to non-applicative processors. We know that a processor is applicative for a balancing action with probability about 1 − 16c/T, for some constant c. For ease of notation we assume that with probability at most 1/4 it is not applicative and therefore cannot accept new tasks. Now we have

p(∃ active node on level i) ≤ 2^i · p(∃ active path to this node) ≤ 2^i · (1/4)^{2(i−1)} = (1/2)^{3i−4}.

Each active processor of level i sends two balancing requests; this gives rise to

Expected number of balancing requests sent by an overloaded processor

≤ Σ_{i=1}^{log log n} 2^{i+2} · p(∃ active node on level i)
≤ Σ_{i=1}^{log log n} 2^{i+2} · (1/2)^{3i−4}
= Σ_{i=1}^{log log n} (1/2)^{2i−6}.
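Numerically (an illustration, not part of the proof), this geometric series converges quickly to 2^4/(1 − 1/4) = 64/3 ≈ 21.3, independently of the upper summation limit:

```python
# Evaluate sum_{i=1}^{L} 2^(i+2) * (1/2)^(3i-4) = sum_{i=1}^{L} (1/2)^(2i-6)
# for varying upper limits L; the series is geometric with ratio 1/4 and
# limit 64/3, so the bound is a constant independent of L (i.e., of n).

def expected_request_bound(levels):
    """Partial sum of the series bounding the expected number of requests."""
    return sum(2 ** (i + 2) * 0.5 ** (3 * i - 4) for i in range(1, levels + 1))

assert abs(expected_request_bound(50) - 64.0 / 3.0) < 1e-9
assert expected_request_bound(50) - expected_request_bound(10) < 1e-4  # already converged
```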

This is clearly bounded by some constant. □

Furthermore, in the case of constant running time per task, we can easily show that the time a task spends in the system is bounded by O((log log n)^2), w.h.p. (note that the expected time is constant). This holds because the processors perform the jobs in FIFO manner, and their load is bounded by O((log log n)^2), w.h.p. Furthermore, if a task is transferred due to a balancing action, its position in the receiver's queue is closer to the front than it was in the sender's queue (these tasks are appended to the end of the queue of the new processor in their old order). Thus, we can state the following corollary.

Corollary 1 Fix an arbitrary point of time t. If the lengths of the tasks are bounded by some constant, the waiting times of all requests in the system are bounded by O((log log n)^2), with high probability.

Nearly the same analysis can be applied to the two other randomized load generation schemes. In those cases the number of tasks generated during a phase can be bounded by k(log log n)^2 and c(log log n)^2, respectively. Thus, w.h.p. the maximum load can be bounded by k(log log n)^2 and c(log log n)^2, respectively. In the case of the adversarial generation model the proof will be very similar, too; only two additional modifications have to be taken into account. The load generated during a phase can be bounded by O((log log n)^2), yielding a maximum load of O(B + (log log n)^2), w.h.p., where B is the average load of the system. The first modification is that we have to prove that the average load of the system remains O(B + (log log n)^2) during a whole balancing phase. The other modification is that we have to estimate the number of queries which have to be handled by the collision game.
It is easy to see that we can modify our protocol in such a way that only processors which are not applicative, and processors which are applicative at the beginning of a phase but have already been chosen as balancing partners, have to forward balancing requests. Thus, their number gives us a bound on the number of queries. Unfortunately, this yields a large constant in our bound on the maximum load. To avoid this, we have to modify our load balancing algorithm. At the beginning of a phase all heavy processors send a balancing request to one i.u.a.r. chosen processor. If this processor is light and receives only one request, it initiates a balancing action. It can be shown that after O((log log n)^2) steps at most O(n/(log n)^{log log n}) heavily loaded processors are left. Then we can continue as before.
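The modified first round just described can be sketched as a small simulation (our own illustration; the parameter values and the helper `first_round_served` are assumptions of this sketch, not part of the paper's protocol):

```python
# Simulate one round of the modified protocol: every heavy processor sends a
# request to one processor chosen i.u.a.r.; a light processor that receives
# exactly one request initiates a balancing action with its sender.
# All parameter values below are illustrative.
import random

def first_round_served(n, heavy, light, rng):
    """Number of heavy processors whose request lands alone on a light processor."""
    targets = {}
    for p in heavy:
        t = rng.randrange(n)                  # i.u.a.r. choice of a partner
        targets.setdefault(t, []).append(p)
    return sum(1 for t, senders in targets.items()
               if t in light and len(senders) == 1)

rng = random.Random(42)
n = 1 << 14
heavy = range(n // 100)                       # a small fraction of heavy processors
light = set(range(n // 2, n))                 # half of the processors are light
served = first_round_served(n, heavy, light, rng)
assert 0 < served <= len(heavy)
```

With few heavy processors, collisions are rare, so most requests that land on a light processor succeed; the remaining heavy processors then proceed as before.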

5 Concluding Remarks

In order to present the basic ideas in a (hopefully) more intuitive way, we neglected some technical details. They may appear in an extended version of this paper. We have seen that our balanced system does not behave worse (in terms of overall system load) than an unbalanced one. Since we know that the latter recovers from worst case scenarios, this also holds for our system. Additionally, since during balancing actions we do not assign load to overloaded processors, our generation model ensures that they recover from worst case situations.
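The claim that the balanced system never carries more total load than the unbalanced one can be illustrated by a toy two-processor simulation (our own sketch with arbitrary loads, mirroring the intuitive argument given earlier; no new tasks are generated):

```python
# Toy illustration: two processors, each consuming one task per step, no new
# tasks generated. Moving m tasks from the heavier to the lighter processor
# never increases the total remaining load at any step. The loads l1 < l2
# and the value m are arbitrary illustrative choices.

def total_load_over_time(l1, l2, steps):
    """Total system load after each of `steps` steps."""
    loads = [l1, l2]
    history = []
    for _ in range(steps):
        loads = [max(0, x - 1) for x in loads]
        history.append(sum(loads))
    return history

l1, l2, m = 3, 10, 3
unbalanced = total_load_over_time(l1, l2, 12)
balanced = total_load_over_time(l1 + m, l2 - m, 12)

# For the first l1 steps both variants consume two tasks per step; after
# that only the balanced system keeps both processors busy.
assert all(b <= u for u, b in zip(unbalanced, balanced))
```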

It is not necessary to move a complete packet of O(T) tasks from one processor to another at once. W.h.p. at most O(T) tasks have to be transported, and this can be done in a stream-like manner during the next interval of length O(T), in which new balancing decisions are made. Note that a time step in our model actually consists of four sub-steps: a processor can generate load, consume load, perform balancing decisions, and actually move load. If there is no load to move, or no balancing decisions to be performed, this time can be used for local computation, that is, to speed up the processing of the tasks. Only a fraction of

O(n/(log n)^{log log n})

processors tries to find a balancing partner during a balancing phase, and the expected amount of communication needed to find a partner is constant. Thus, the practical load balancing overhead will be very small. Without using the collision protocol, or, in fact, without using any protocol at all, one could show a high probability bound of O(log n) on the maximum load of any processor just by using Lemma 2. We could easily have reduced the bound on the maximum load of any processor to O(log log n) if we had not focused on minimizing load flow: at the beginning of each interval of length log log n one could simply throw all load into the air and distribute it via the simple collision protocol. This would lead to load O(log log n) for all processors but, on the other hand, would mean that the load of a processor is spread among many other processors, instead of being transferred to just one as in our model. Note that the division of time into phases is just an analytical instrument and is by no means essential for the algorithm itself (though, of course, the collision protocol would have to be modified).

References

[ABKU94] Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations (extended abstract). In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 593-602, Montreal, Quebec, Canada, 23-25 May 1994.

[ACMR95] Micah Adler, Soumen Chakrabarti, Michael Mitzenmacher, and Lars Rasmussen. Parallel randomized load balancing. In Proceedings of the 27th Annual ACM Symposium on Theory of Computing, pages 238-247, New York, NY, USA, May 29-June 1 1995. ACM Press.

[BMS97] P. Berenbrink, F. Meyer auf der Heide, and K. Schröder. Allocating weighted balls in parallel. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 1997.

[CS97] Artur Czumaj and Volker Stemann. Randomized allocation processes. In Proceedings of the 38th IEEE Symposium on Foundations of Computer Science. IEEE, 1997.

[DHB97] S. K. Das, D. J. Harvey, and R. Biswas. Design of novel load balancing algorithms with implementations on an IBM SP2. In Proceedings of the 3rd EURO-PAR Conference, 1997.

[HS97] C.-J. Hou and K. G. Shin. Implementation of decentralized load sharing in networked workstations using the Condor package. Journal of Parallel and Distributed Computing, 40:173-184, 1997.

[Lau95] Thomas Lauer. Adaptive Dynamische Lastbalancierung. PhD thesis, Universität des Saarlandes, 1995.

[LM93] Reinhard Lüling and Burkhard Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 164-172. ACM Press, 1993.

[MD96] N. R. Mahapatra and S. Dutt. Random seeking: A general, efficient, and informed randomized scheme for dynamic load balancing. In Proceedings of the 10th International Parallel Processing Symposium (IPPS), pages 881-885, 1996.

[Mit96] M. Mitzenmacher. Load balancing and density dependent jump Markov processes. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science, pages 213-223. IEEE, October 1996.

[Mit97] Michael Mitzenmacher. On the analysis of randomized load balancing schemes. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 1997.

[MSS95] Friedhelm Meyer auf der Heide, Christian Scheideler, and Volker Stemann. Exploiting storage redundancy to speed up randomized shared memory simulations. In Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer Science, pages 267-278. Springer, 1995.

[RSU91] L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 237-245. ACM Press, 1991.

[SS97] T. Schnekenburger and G. Stellner. Dynamic Load Distribution for Parallel Applications. Teubner, 1997.

[Ste96] Volker Stemann. Parallel balanced allocations. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96), pages 261-269. ACM, 1996.

[WHV95] G. S. Wolffe, S. H. Hosseini, and K. Vairavan. Performance of an adaptive algorithm for dynamic load balancing. In Proceedings of the ISCA International Conference on Parallel and Distributed Computing Systems, pages 613-618, 1995.