Iterative Dynamic Load Balancing in Multicomputers

Cheng-Zhong XU and Francis C.M. LAU
Department of Computer Science, The University of Hong Kong

Abstract

Dynamic load balancing in multicomputers can improve the utilization of processors and the efficiency of parallel computations by migrating workload across processors at runtime. We present a survey and critique of dynamic load balancing strategies that are iterative: workload migration is carried out by transferring processes across nearest-neighbor processors. Iterative strategies have become prominent in recent years because of the increasing popularity of point-to-point interconnection networks for multicomputers.

Key words: dynamic load balancing, multicomputers, optimization, queueing theory, scheduling.

INTRODUCTION

Multicomputers are highly concurrent systems composed of many autonomous processors connected by a communication network [1, 2]. To improve the utilization of the processors, parallel computations in multicomputers require that processes be distributed to processors in such a way that the computational load is evenly spread among the processors. In static load balancing, this distribution of workload is done at the initialization phase of the computation. There have been numerous studies on static load balancing using techniques from graph theory, integer programming, queueing theory, and so on, as well as heuristic approaches [3-5]. Because it is performed only once, at the beginning of the computation, static load balancing is useful only for problems whose workload remains rather static among the processors throughout execution [6], and it is definitely not suitable for computations with a dynamically changing workload, such as branch-and-bound algorithms for solving combinatorial optimization problems [7]. To keep the workload balanced during the execution of this kind of computation, it is necessary to perform load balancing at various stages during runtime; this is called dynamic load balancing, and it is the subject of this paper.

* Journal of the Operational Research Society, Vol. 45, No. 7, July 1994, pp. 786-796.


The execution of a dynamic load balancing procedure requires some means of maintaining a global view of the system and some negotiation mechanism through which process migrations across processor boundaries can take place. Every dynamic load balancing strategy has to resolve the issues of when to invoke a balancing operation, who makes the load balancing decisions and according to what information, and how to manage process migrations between processors. Combining different answers to these questions yields a large space of possible designs of dynamic load balancing methods for multicomputers. In this study, we classify dynamic load balancing methods into iterative and direct methods according to the manner in which process migrations take place. With an iterative method, processes migrate "iteratively", that is, one step (to a neighboring processor) at a time, each step according to a local decision made by the intermediate processor concerned. In contrast, a processor executing a direct load balancing procedure makes decisions on the final destinations of the local processes it wants to migrate. Examples of direct methods include the bidding [8] and the drafting [9] algorithms. Direct methods, because of their need to match senders and receivers of workload efficiently, are most appropriate for systems equipped with a broadcast mechanism or a centralized monitor [10-13]. On the other hand, iterative methods, characterized by their one-decision-at-a-time behavior, are found to be most effective in multicomputers that are based on a point-to-point communication network. The study of iterative methods has thus become popular in the field of parallel and distributed computing in recent years because of the growing popularity of point-to-point networks. This paper presents a comprehensive survey of the existing, representative works on iterative methods for dynamic load balancing.

CLASSIFICATION OF ITERATIVE LOAD BALANCING

Iterative load balancing methods rely on successive approximations to a globally optimal workload distribution, and hence at each iteration need only be concerned with the direction of workload migration. Some methods select a single direction (hence one nearest neighbor) while others consider all directions (hence all the nearest neighbors). These various methods can be categorized into deterministic and stochastic. Deterministic iterative methods proceed according to certain predefined rules. Which neighbor to transfer extra workload to, and how much to transfer, depend on certain parameters to these rules, such as the states of the nearest-neighbor processors. With stochastic iterative methods, on the other hand, workloads are redistributed in some randomized fashion, subject to the objective of the load balancing. It is obvious that stochastic methods, without too many predefined rules, are simpler than deterministic methods. However, the behavior of stochastic methods is much harder to model and analyze in a formal way.

We survey three important kinds of deterministic methods in later sections: diffusion, dimension exchange, and the gradient model. The diffusion and the dimension exchange methods are closely related; they both examine all the direct neighbors in every step. With the diffusion method, a processor exchanges workload with all its neighbors simultaneously at every step. With dimension exchange, a processor goes around the table, exchanging workload with its neighbors one at a time; after an exchange with a neighbor, it uses the new workload for the exchange with the next neighbor, and so on. With gradient-based methods, workloads are restricted to being transferred along the direction of the most lightly loaded processor.

Stochastic iterative load balancing methods roll dice along the way in an attempt to drive the system into an equilibrium state with high probability. The simplest method is randomized allocation, in which any newly created process is transferred to a randomly selected (usually neighboring) processor, which, upon receiving the process and finding itself to be quite occupied already, may transfer the process to yet another randomly selected processor. Another approach is to use the technique of simulated annealing, which offers a bit more variety in the control of the randomness in the redistribution of processes. This control mechanism makes the process of load balancing less susceptible to being trapped in local optima, and therefore superior to other randomized approaches, which could produce locally optimal but not globally optimal results. Figure 1 summarizes our classification of iterative dynamic load balancing strategies in multicomputers.

Dynamic load balancing
  Iterative load balancing
    Deterministic iterative: Diffusion, Dimension Exchange, Gradient Model
    Stochastic iterative: Randomized Allocation, Simulated Annealing
  Direct load balancing

Figure 1: Classification of dynamic load balancing strategies


DETERMINISTIC ITERATIVE LOAD BALANCING

Diffusion

With the diffusion method, at each iteration step a processor "diffuses" fractions of its workload to some of its neighbors while simultaneously receiving workload from its other neighbors; by exchanging an appropriate amount of workload with the neighbors, the processor strives to enter a more balanced situation. Casavant and Kuhl gave a formal description of this method using a state transition model of communicating finite automata [14]. Under the synchronous assumption that a processor does not proceed to the next iteration until all the workload transfers of the current iteration have completed, Cybenko studied the dynamic behavior of the diffusion method by modeling it as an iterative process [15]. Specifically, let W^t = (w_1^t, w_2^t, ..., w_n^t) denote the workload distribution of the n nodes of the network at time t (i.e., w_i^t is the workload of processor i at time t), and let A(i) be the set of direct neighbors of processor i. Then the change of workload in processor i from time t to t+1 is modeled as

    w_i^{t+1} = w_i^t + \sum_{j \in A(i)} \alpha_{ij} (w_j^t - w_i^t) + \phi_i^{t+1} - \psi_i^{t+1},  \quad  1 \le i \le n,    (1)

where 0 < \alpha_{ij} < 1 is called the diffusion parameter of i and j, which determines the amount of workload to be exchanged between the two processors, and \phi_i^{t+1} and \psi_i^{t+1} denote the amounts of workload generated and finished, respectively, from time t to t+1. This equation corresponds to one iteration step of the diffusion process. Cybenko showed that under the quiescent assumption that no new workload is generated and no existing workload is completed during the execution of the load balancing procedure (i.e., \phi_i = \psi_i = 0), the diffusion method converges from any initial workload distribution [15]. Then, without the quiescent assumption, he showed that the diffusion method can control the growth of the variance of the unbalanced workload distribution of the processors and keep it bounded. Similar results for the diffusion method in hypercubes, generalized hypercubes [16], tori, and rings have also been obtained using probabilistic theory by Hong et al. [17] and Qian and Yang [18]. In fact, their works are a special case of Cybenko's, not only in terms of the multicomputer's structure, but also in the choice of the diffusion parameters. In their work, the diffusion parameters are set as \alpha_{ij} = 1/(|A(i)| + 1) for all i (a choice for which they did not give an explanation). For the hypercube, this choice of the diffusion parameters happens to coincide with the choice made by Cybenko [15], which he proved to be optimal (in the sense that it gives the load balancing procedure the fastest convergence rate). For structures other than the hypercube, it is possible to derive the optimal diffusion parameters using circulant matrix theory, as discussed in the work by Xu and Lau [19].
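To make the iteration concrete, here is a minimal sketch in Python of the synchronous diffusion step of equation (1) under the quiescent assumption. The ring network, the loads, and the uniform choice \alpha_{ij} = 1/(|A(i)| + 1) of Hong et al. [17] are illustrative assumptions, not part of the original formulation.

    # A sketch of synchronous diffusion (equation (1)) under the quiescent
    # assumption (no workload generated or finished while balancing).

    def diffusion_step(load, neighbors, alpha):
        """One synchronous iteration: processor i gains
        alpha[(i, j)] * (load[j] - load[i]) from each neighbor j."""
        return {
            i: load[i] + sum(alpha[(i, j)] * (load[j] - load[i])
                             for j in neighbors[i])
            for i in load
        }

    # A 4-node ring with the choice alpha_ij = 1/(|A(i)| + 1) of Hong et al.
    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    alpha = {(i, j): 1.0 / (len(neighbors[i]) + 1)
             for i in neighbors for j in neighbors[i]}
    load = {0: 17.0, 1: 26.0, 2: 0.0, 3: 9.0}
    for _ in range(50):
        load = diffusion_step(load, neighbors, alpha)
    print(load)  # converges towards the uniform distribution: 13 on every node

Because \alpha_{ij} = \alpha_{ji} here, the total workload is conserved at every step; the choice of the diffusion parameters affects only how fast the iteration converges.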

On the asynchronous track, the diffusion method was studied theoretically by Bertsekas and Tsitsiklis [20]. In an asynchronous environment, processes at a local processor do not have to wait at predetermined points for predetermined messages from other processors. Because of communication delays, the information maintained in a processor concerning its neighbors' workload can be outdated. Moreover, workloads that are still in transit must be accounted for in the modeling. This makes the theoretical study of the diffusion method in asynchronous systems a difficult one. Using linear system theory, Bertsekas and Tsitsiklis showed that the asynchronous version of the diffusion method converges provided that the communication delays are bounded. However, the problem of determining its convergence rate or the optimal diffusion parameters remains unsolved.

The theoretical study of the diffusion method has revealed its sound mathematical foundation. On the practical side, its benefits have been demonstrated in the context of distributed computation of branch-and-bound algorithms on the Intel iPSC/2 by Willebeek-LeMair and Reeves [21, 22], and on transputer networks with de Bruijn and ring topologies by Luling and Monien [23]. One disadvantage of the synchronous diffusion method is that it requires a synchronization phase prior to load balancing in order to shift all the processors into load balancing mode so that they perform local balancing simultaneously. This is a rather time-consuming procedure, especially in structures with large diameters. To reduce this cost, the implementation by Willebeek-LeMair and Reeves [21] allows some processors to bypass the load balancing: a processor participates in the load balancing only when its workload exceeds the average of its direct neighbors (including itself) by a certain threshold. Experimental results showed that such an implementation, when applied to the distributed computation of branch-and-bound algorithms, performs significantly better, especially in heavily loaded situations. The implementation by Luling and Monien [23] followed the idea of Willebeek-LeMair and Reeves [21], but included an extra control mechanism for tuning the threshold using feedback from previous load balancing decisions. This adaptive element proved to be useful, especially for networks with large diameters. Moreover, Willebeek-LeMair and Reeves distinguished between sender-initiated and receiver-initiated diffusion methods, and showed that the former performs better than the latter [22].

Note that processes in branch-and-bound applications are in general rather independent of one another, which makes studies of the diffusion method as applied to these applications on multicomputers quite tractable. For other applications where processes are less independent of one another, possibly because of the need for interprocess communication or synchronization, the suitability and performance of the diffusion method require further investigation. On the theory side, two problems remain open: one is the calculation of the optimal diffusion parameters for arbitrary structures, and the other is the analysis of the convergence rate of the asynchronous diffusion method. The work of Cybenko [15] and the work of Bertsekas and Tsitsiklis [20] can serve as the basis for further work along these lines.

The diffusion method is based on a communication model in which a processor can communicate with all its neighbors simultaneously. It therefore works best on hardware that supports truly parallel communications over the set of links of a processor. But even when this is possible, true parallelism is difficult to achieve, as the local process that carries out the exchanges over the links must execute its steps sequentially. When based instead on a model of serialized communications, the diffusion method, which is patterned after the Jacobi fashion of relaxation, becomes less effective. The alternative is one that is patterned after the Gauss-Seidel fashion of relaxation: the dimension exchange method.

Dimension exchange

The dimension exchange method was initially studied intensively in hypercube-structured multicomputers [24, 25]. In an n-dimensional hypercube, each processor compares and equalizes its workload with those of its neighbors one by one. A "sweep" of the iterative process corresponds to going through all the dimensions of the hypercube once. Since the set of neighbors corresponds exactly to the dimensions of the hypercube, the processor will have compared and exchanged load with every one of its neighbors after one sweep. Cybenko proved that regardless of the order of stepping through the dimensions, this simple load balancing method yields, after one sweep, a uniform distribution from any initial workload distribution [15]. He also revealed the superiority of the dimension exchange method over the diffusion method on hypercubes, in that the former yields a smaller variance of the workload distribution in non-quiescent situations. This theoretical result was supported in part by the experiments carried out by Willebeek-LeMair and Reeves [22], who used the dimension exchange method in distributed computations of branch-and-bound algorithms on the iPSC/2. Prior to load balancing, a global synchronization is required so that all processors become geared up for the execution of the load balancing. Unlike the implementation of the diffusion method discussed before, all the processors have to participate in this global synchronization. Nevertheless, the speedup of the dimension exchange method over no load balancing is better than the corresponding speedup of the diffusion method. Willebeek-LeMair and Reeves conjectured, however, that this would not be the case for larger systems.

The dimension exchange method is not limited to hypercube structures. Using edge-coloring of undirected graphs, Hosseini et al. analyzed the method as applied to arbitrary structures [26]. With edge-coloring [27], the edges of a given graph are colored with some minimum number of colors such that no two adjoining edges are of the same color. A "dimension" is then defined to be the collection of all edges of the same color. At each iteration, one particular color/dimension is considered, and only processors on edges with this color execute the dimension exchange procedure. Since no two adjoining edges have the same color, each processor needs to deal with at most one neighbor at each iteration; this matches perfectly with an underlying communication mechanism that supports only serialized communications. Figure 2(a) shows an example of a 4 x 4 mesh colored with four colors (the minimum), and hence four-dimensional. The four numbers in brackets correspond to the four colors. Suppose the workload distribution at some time instant is as in Figure 2(a), in which the number inside a processor represents the workload of the processor. Then, after a sweep of the dimension exchange procedure, the workload distribution changes to that in Figure 2(b).

[Figure 2: Workload distribution before and after a sweep of dimension exchange. Panel (a) shows the 4 x 4 mesh with its edges labeled by colors (1)-(4) and an initial workload inside each processor; panel (b) shows the workloads after one sweep.]

For an arbitrary structure, it is unlikely that the dimension exchange method would yield a uniform workload distribution in a single sweep. Nonetheless, Hosseini et al. showed that, given any arbitrary structure, the dimension exchange method eventually converges to a uniform distribution [26]. In all the works mentioned above, an exchange over an edge of the network invariably results in equal workloads in the two processors concerned. This "equal splitting" of workload, although it happens to be optimal for hypercubes, was found to be non-optimal for some other structures by Xu and Lau [19, 28]. They generalized the dimension exchange method using an exchange parameter \lambda to control the splitting of workload between a pair of directly connected processors. Consider a structure colored with k colors (hence k-dimensional), such as the one in Figure 2(a). If processors i and j are direct neighbors, and the edge (i, j) has color c, 1 \le c \le k, then the change of workload in processor i at time t with c = (t mod k) + 1 is modeled as

    w_i^{t+1} = (1 - \lambda) w_i^t + \lambda w_j^t,    (2)

where 0 < \lambda < 1. Note that a single \lambda is used for the entire network. This generalized dimension exchange method reduces to the ordinary dimension exchange method when \lambda = 1/2. Xu and Lau found that the optimal exchange parameter for a structure is closely related to its topology and size [19]. For the chain, ring, mesh and torus structures, they derived the optimal exchange parameters that lead to the fastest convergence.
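The sweep structure is easy to express in code. The sketch below runs one sweep of the generalized dimension exchange method of equation (2) over a pre-computed edge coloring; the coloring, the ring network, and the loads are illustrative assumptions (with lam = 0.5 it reduces to ordinary equal splitting).

    # A sketch of one sweep of generalized dimension exchange (equation (2)).
    # colored_edges lists the dimensions; within one dimension no two edges
    # share a processor, so those pairwise exchanges could run concurrently,
    # while successive dimensions reuse the updated loads (Gauss-Seidel style).

    def gde_sweep(load, colored_edges, lam):
        load = dict(load)
        for dimension in colored_edges:
            for i, j in dimension:
                wi, wj = load[i], load[j]
                load[i] = (1 - lam) * wi + lam * wj
                load[j] = (1 - lam) * wj + lam * wi
        return load

    # Example: a 4-node ring 0-1-2-3-0, which is 2-colorable, with lam = 0.5.
    colored_edges = [[(0, 1), (2, 3)], [(1, 2), (3, 0)]]
    load = {0: 17.0, 1: 26.0, 2: 0.0, 3: 9.0}
    print(gde_sweep(load, colored_edges, 0.5))  # uniform: 13.0 on every node

On this small ring, equal splitting happens to balance the load in a single sweep; on larger structures many sweeps are needed, and the optimal \lambda depends on the topology and size [19].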

A variant of the dimension exchange method was proposed for the hypercube structure by Hong et al. [29], which they called cyclic load balancing. With this method, a processor equalizes its workload with one neighbor of some dimension, returns to the execution of the application, and then some time later equalizes its workload again with another neighbor of another dimension, and so on. The length of the time interval between two workload equalizations is adjustable for different desired degrees of balancing. When it is set to zero, the cyclic load balancing method reduces to the ordinary dimension exchange method. There have also been studies of the dimension exchange method in SIMD systems. Plaxton derived both lower and upper bounds for the time complexity of the method on hypercubes with serialized communications [30]. JaJa and Ryu gave similar results for shuffle-exchange networks, cube-connected cycles and butterfly networks [31], and concluded that the load balancing procedure takes longer to complete in these structures than in the hypercube.

The dimension exchange method in hypercube structures has been thoroughly examined from both the theoretical and the practical point of view. It outperforms the diffusion method in small and medium scale multicomputers, in which the cost of the necessary global synchronization phase is not so significant. In large scale multicomputers, however, its performance is not yet known and requires further work. Hypercubes aside, the dimension exchange method on other structures still appears to have been given too little attention, in particular the problem of calculating the optimal exchange parameters for arbitrary structures.

Gradient model

The idea of gradient-based methods is to maintain a contour of the gradients formed by the differences in workload across the network. Workloads at high points of the contour (heavily loaded processors) flow naturally, following the gradients, to the lower regions (lightly loaded processors). Two examples of gradient-based methods are the Gradient Model (GM for short) by Lin and Keller [32] and Contracting Within a Neighborhood (CWN for short) by Shu and Kale [33]. The contour in the GM case is called a gradient surface, which is in terms of the proximities of heavily or moderately loaded processors to the lightly loaded processors. Since maintaining accurate proximity values in a distributed environment is very costly, they used an approximation instead, called the pressure surface. The pressure surface is simply a collection of propagated pressures of all processors and is maintained dynamically according to the workload distribution. The propagated pressure of a processor i, P_i, is defined as

    P_i = 0                                                 if processor i is lightly loaded,
    P_i = 1 + min{ P_j : processor j is a neighbor of i }   otherwise.

Figure 3 shows an example of a pressure surface derived from the workload distribution in Figure 2(a). In Figure 3, the processors with propagated pressure equal to 0 are lightly loaded; the rest are heavily loaded. Based on the pressure surface, workload in a heavily loaded processor migrates towards the nearest lightly loaded processor along the steepest gradient.

[Figure 3: An example of a pressure surface. Each processor of the 4 x 4 mesh of Figure 2(a) is labeled with its propagated pressure; the processors with pressure 0 are the lightly loaded ones.]
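The pressure surface can be computed by a simple relaxation of the defining recurrence. The sketch below is a minimal illustration; the network and the set of lightly loaded processors are assumptions made for the example.

    # A sketch of computing the GM pressure surface: P_i = 0 for a lightly
    # loaded processor, and 1 + min of the neighbors' pressures otherwise,
    # obtained here by relaxing until the recurrence is satisfied.

    def pressure_surface(neighbors, lightly_loaded):
        INF = float("inf")
        P = {i: 0 if i in lightly_loaded else INF for i in neighbors}
        changed = True
        while changed:
            changed = False
            for i in neighbors:
                if i in lightly_loaded:
                    continue
                p = 1 + min(P[j] for j in neighbors[i])
                if p < P[i]:
                    P[i], changed = p, True
        return P

    # Example: a 4-node chain 0-1-2-3 where only processor 3 is lightly loaded.
    neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(pressure_surface(neighbors, {3}))  # {0: 3, 1: 2, 2: 1, 3: 0}

In a real multicomputer each processor would update its own P_i from its neighbors' advertised values, so the surface settles in a number of rounds proportional to the distance to the nearest lightly loaded processor.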

Chowkwanyun and Hwang implemented the GM algorithm in their hybrid load balancer, which combined the gradient method with a direct strategy based on the drafting algorithm [34]. Their application of the hybrid load balancer to concurrent Lisp execution on tree and ring structures demonstrated the superiority of the GM algorithm over the drafting algorithm. In the GM approach, workload migration occurs only between heavily loaded and lightly loaded processors. When there are no lightly loaded processors in the system, no workload migration takes place; in other words, the algorithm does not try to transfer workload between heavily loaded and moderately loaded processors (the moderately loaded processors are also labeled heavily loaded in the GM approach). Therefore, when a large portion of the moderately loaded processors suddenly turn lightly loaded, much commotion results. As a remedy, Luling et al. proposed an improved version of the GM algorithm, the Extended Gradient Model (X-GM) [35]. In X-GM, a suction surface is added, which is based on the (estimated) proximities of non-heavily-loaded processors to heavily loaded processors. Then, in addition to the workload migration from heavily loaded processors to lightly loaded processors driven by the pressure surface, the suction surface causes workload migration from heavily loaded processors to nearby local minima, which may be moderately loaded processors. Its advantage over the GM algorithm is evident from the results of the simulation experiments carried out by the authors.

With the CWN algorithm [33], each processor needs only to maintain the workload indexes of its direct neighbors. A processor executing the CWN algorithm migrates its extra workload (such as a newly created process) to the neighbor with the least workload. A processor that receives the process keeps it for execution if it is the most lightly loaded compared with all its neighbors; otherwise, it forwards the process to its least loaded neighbor. Thus, a newly created process travels along the steepest load gradient to a local minimum. The analogy is water flowing from the top of a mountain and settling in a nearby basin.
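As a minimal sketch of this forwarding rule (without the minimum- and maximum-distance controls discussed below); the network and loads are illustrative assumptions:

    # A sketch of the CWN forwarding rule: a newly created process travels
    # hop by hop down the steepest load gradient until it reaches a local
    # minimum, where it is kept for execution.

    def cwn_place(origin, load, neighbors):
        here = origin
        while True:
            best = min(neighbors[here], key=lambda j: load[j])
            if load[best] >= load[here]:
                return here       # local minimum: keep the process here
            here = best           # one more hop down the gradient

    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    load = {0: 17, 1: 26, 2: 0, 3: 9}
    print(cwn_place(1, load, neighbors))  # the process settles at processor 2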

To avoid the "horizon effect" (water stagnating on a flat surface), the algorithm imposes a minimum distance a process is required to travel. It also imposes a maximum distance a process is allowed to travel, in order to cut down on the traveling cost. If these parameters are allowed to be tunable at runtime, the algorithm becomes Adaptive Contracting Within a Neighborhood (ACWN for short). Kale compared ACWN and GM as applied to parallel computations of divide-and-conquer and Fibonacci programs through simulation [36], and concluded that ACWN performs better than GM in most cases because of the agility of ACWN in spreading workload around during load balancing. The same conclusion applies to the iterative deepening A* search algorithm, which Kale implemented on an iPSC/2. By fixing the minimum and maximum distances a process is allowed to travel at zero and one respectively, Baker and Milner demonstrated on a mesh-connected network of transputers the benefits of CWN over the version with no load balancing [37]. The same restricted version of CWN, coupled with a threshold parameter to classify the states of the processors, has also been used by Su et al. in the parallel execution of logic programs [38].

All gradient-based methods can be considered a form of relaxation, in which the single-hop migration of workload is a successive approximation towards a global balancing of the system. In the works reviewed above, the issues addressed include which processors to migrate workload to, how far a migrating workload can travel, the correctness of the methods (i.e., that they indeed lead to a balanced state), and their performance in practical implementations. What seems to be lacking is a more exact characterization and analysis of these algorithms as applied to different structures. For example, the question of how much workload to migrate should be, but has not been, addressed in the above works; it is a very important parameter to consider in the study of the optimality of these and other algorithms based on the gradient idea.

STOCHASTIC ITERATIVE LOAD BALANCING

Randomized allocation

Randomization has long been used as a tool in algorithm design. Stochastic techniques like Monte Carlo and Las Vegas are among the most standard methods for solving certain kinds of problems. The routing chip for the latest generation of transputers (the T9000) has incorporated the randomized routing algorithm by Valiant and Brebner [39]. In randomized load balancing, to migrate some workload, a random choice of a processor is made. Two issues need to be resolved: one is whether processes are allowed to be transferred more than once, and the other is whether the randomly selected processor is restricted to a nearest neighbor. Note that even if a process is allowed to be transferred to a remote processor, the load balancing can still operate in an iterative fashion, as it is not necessary that


this remote processor be the destination processor for the process. Eager et al. addressed the issue of whether a process should be allowed to be transferred multiple times [10]. They used a parameter, the transfer limit, to control the number of times a process may be transferred. They showed that randomized load balancing algorithms without a transfer limit are unstable, in the sense that processors could end up devoting all of their time to transferring processes and little time to actually executing them. They also demonstrated that setting the transfer limit to one is a satisfactory choice when the system load is high, and that it introduces little extra traffic to the system. Randomized load balancing algorithms with a transfer limit of one randomly assign newly generated processes to processors for execution. To study the performance of these algorithms theoretically, Ben-Asher et al. modeled their behavior as the random throwing of a variable number of balls into a fixed number of holes [40]. Each ball is weighted with some value corresponding to the processing requirement of the process modeled by the ball. Under the assumption that the balls' weights are probabilistically distributed with a known expected value as well as minimum and maximum values, they derived an upper bound on the number of balls needed (i.e., the number of processes generated at runtime) for the algorithm to achieve optimal or near-optimal load balancing with very high probability.

Luling et al. addressed the issue of whether a process should be migrated only through direct neighbors [35]. Algorithms that always pick a random processor among the direct neighbors are called local random algorithms; algorithms that can pick among remote processors are called global random algorithms. They compared local random and global random algorithms through simulation and showed the superiority of the global random algorithms. Their global random algorithm was implemented on top of the Chare kernel [36]. Although it was less efficient than the ACWN algorithm of Shu and Kale [33], it performed better than the more complicated gradient model in various applications running on the iPSC/2. Randomized load balancing algorithms, although rather simple (their analyses are not), can lead to substantial performance improvement over no load balancing [10]. Rabin articulated the benefits of randomization in algorithm design in his classic paper [41]; these benefits are yet to be fully seen in future proposals of randomized load balancing algorithms. Ben-Asher et al.'s formal analysis of a simple case is an example of the theoretical approaches needed to obtain a better understanding of the behavior of randomized load balancing algorithms [40]. Similar analyses need to be carried out for other, more complicated cases and designs.
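A minimal sketch of local randomized allocation with a transfer limit, in the spirit of the scheme analyzed by Eager et al. [10]; the network, the load threshold, and the loads are illustrative assumptions.

    # A sketch of randomized allocation with a transfer limit: an overloaded
    # processor ships a newly created process to a randomly chosen neighbor
    # (a "local random" algorithm), which may forward it again until the
    # transfer limit is exhausted.
    import random

    def random_place(origin, load, neighbors, threshold, transfer_limit=1):
        here = origin
        transfers = 0
        while load[here] > threshold and transfers < transfer_limit:
            here = random.choice(neighbors[here])   # one random hop
            transfers += 1
        load[here] += 1                             # process executes here
        return here

    neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    load = {0: 17, 1: 26, 2: 0, 3: 9}
    print(random_place(1, load, neighbors, threshold=15))  # lands on 0 or 2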

Simulated annealing

Simulated annealing is a general and powerful technique for combinatorial optimization problems, borrowed from crystal annealing in statistical physics [42, 43]. The technique is Monte Carlo in nature: it simulates the random movements of a collection of vibrating atoms in a process of cooling. Atoms are more vibrant at high temperatures than at low temperatures. The objective is to arrive at an optimally low energy configuration of the atoms at some low temperature. At each temperature T, a set of random and permissible proposals of atom configurations are evaluated in terms of the change in energy, \Delta H. A configuration of the atoms is accepted if it represents a decrease in energy (i.e., \Delta H < 0); otherwise, it is accepted with a conditional probability of exp(-\Delta H / T), which helps to avoid local minima. As the temperature decreases, fewer and fewer increases in energy are accepted, and the atoms eventually settle on a low energy configuration that is very close, if not identical, to the optimal one. It is conceivable that the slower the annealing process, the more configurations can be explored at each temperature, and the higher the probability of arriving at an optimal solution at the end. When applied to load balancing, a configuration corresponds to a state of the system (i.e., a global distribution of workload), and the final configuration at some low temperature corresponds to the result, hopefully a balanced state, of the execution of the load balancing procedure.

Bollinger and Midkiff applied simulated annealing to the static mapping of processes onto processors and improved upon previous results obtained using deterministic methods [44]. Its application to dynamic load balancing was first carried out by researchers at Caltech [45-47]. Their implementation was a centralized one: during dynamic load balancing, a processor which has been designated the main balancer goes through the simulated annealing exercise and arrives at a decision on how the global workload should be redistributed; it then instructs the other processors to implement the decision. This centralized algorithm suffers from a sequential bottleneck due to the main balancer. In fact, simulated annealing, because of its Monte Carlo nature, can be extremely slow, especially in solving a complex problem involving many variables. The current trend is to parallelize the annealing process and execute it on parallel machines [48]. There are even proposals for building parallel machines specifically for simulated annealing, for example the design by Abramson [49].

To distribute the sequential decision making of simulated annealing among a number of processors, so that each processor can offer partial proposals of process redistribution and take part in the evaluation of energy, it is necessary to deal with the problem of "collisions". A collision occurs when the proposed changes made by one processor conflict with those made by another processor. Rollback is one technique for solving this problem [6]. With rollback, processors make redistribution decisions in parallel, and then check whether there is any collision. If there is, the processors undo enough of the changes until there is no collision. Fox et al. [6] demonstrated this technique in the simulation of fishes and sharks living, moving, breeding, and eating in a two-dimensional ocean (the WaTor problem [50]). However, since it incurred a great deal of overhead in maintaining the history states of the system needed for rollbacks, its performance was not satisfactory.
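The acceptance rule described above is easy to state in code. Below is a minimal, centralized sketch in which the "energy" of a configuration is taken to be the variance of the processor loads; the energy function, the move set, and the cooling schedule are illustrative assumptions, not the Caltech implementation.

    # A sketch of simulated annealing applied to load balancing. A
    # configuration maps each process to a processor; a move reassigns one
    # random process, and is accepted if it lowers the energy, otherwise
    # with probability exp(-dH/T) (the Metropolis rule).
    import math
    import random

    def energy(assignment, n_procs):
        loads = [0] * n_procs
        for p in assignment.values():
            loads[p] += 1
        mean = sum(loads) / n_procs
        return sum((l - mean) ** 2 for l in loads)

    def anneal(assignment, n_procs, t=10.0, t_end=0.01, cooling=0.95):
        h = energy(assignment, n_procs)
        while t > t_end:
            for _ in range(100):                  # proposals per temperature
                proc = random.choice(list(assignment))
                old = assignment[proc]
                assignment[proc] = random.randrange(n_procs)
                new_h = energy(assignment, n_procs)
                dh = new_h - h
                if dh < 0 or random.random() < math.exp(-dh / t):
                    h = new_h                     # accept (uphill moves allowed)
                else:
                    assignment[proc] = old        # reject: undo the move
            t *= cooling                          # cool down
        return assignment

    # Example: 20 processes initially all on processor 0 of a 4-processor system.
    final = anneal({p: 0 for p in range(20)}, 4)
    print(energy(final, 4))  # near 0: a near-uniform distribution (5, 5, 5, 5)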

Instead, Williams allowed collisions to occur in his application of the simulated annealing method to the load balancing of unstructured mesh calculations on an NCUBE/10 multicomputer of 16 processors [51]. In his method, collisions occur along the boundaries of clusters of mesh nodes. Different clusters are handled by different processors during an iteration, and at the end of each iteration, neighboring processors exchange boundary values to maintain consistency. Note that although allowing collisions did not pose a serious problem in this case, it might lead to incorrect results in other applications. Williams tried various starting temperatures and cooling rates. Despite the possibility of some performance loss due to collisions, the results indicated that with sufficiently slow cooling, the parallel annealing process eventually produced high quality solutions to the load balancing problem. Improvements in the cooling rate will have to rely on successful choices of optimal initial values for the various parameters of the process.

Finally, apart from simulated annealing, two other optimization methods that borrow ideas from the natural sciences have also been adapted to load balancing: genetic algorithms and neural networks. Fox et al. [45, 52, 53] have tried out various sequential versions in the context of static load balancing. The performance of parallel versions is still under investigation.

REFERENCES

1. W.C. Athas and C.L. Seitz (1988) Multicomputers: message-passing concurrent computers. IEEE Computer, 21(8), 9-24.
2. G. Trew and A. Wilson (1991) Past, Present and Parallel: A Survey of Available Parallel Computer Systems. Springer-Verlag.
3. S. Sofianopoulou (1992) The process allocation problem: a survey of the application of graph-theoretic and integer programming approaches. Journal of the Operational Research Society, 43(5), 407-413.
4. K.W. Ross and D.D. Yao (1991) Optimal load balancing and scheduling in a distributed computer system. Journal of the Association for Computing Machinery, 38(3), 676-690.
5. E. Silva and M. Gerla (1991) Queueing network models for load balancing in distributed systems. Journal of Parallel and Distributed Computing, 12, 24-38.
6. G.C. Fox, M.A. Johnson, G.A. Lyzenga, S.W. Otto, J.K. Salmon, and D.W. Walker (1988) Solving Problems on Concurrent Processors, Vol. I. Prentice Hall.
7. E.L. Lawler and D.E. Wood (1966) Branch and bound methods: a survey. Operations Research, 14, 699-719.
8. J.A. Stankovic and I.S. Sidhu (1984) An adaptive bidding algorithm for processes, clusters and distributed groups. In Proceedings of the 4th International Conference on Distributed Computer Systems, pp. 49-59.
9. L.M. Ni, C.-W. Xu, and T.B. Gendreau (1985) A distributed drafting algorithm for load balancing. IEEE Transactions on Software Engineering, 11(10), 1153-1161.
10. D.L. Eager, E.D. Lazowska, and J. Zahorjan (1986) Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, 12(5), 662-675.
11. P. Krueger and M. Livny (1987) Load balancing, load sharing and performance in distributed systems. Technical Report TR 700, Computer Science Department, University of Wisconsin at Madison.
12. T.L. Casavant and J.G. Kuhl (1988) A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 14(5), 141-154.
13. Y.-T. Wang and R.J.T. Morris (1985) Load sharing in distributed systems. IEEE Transactions on Computers, 34(3), 204-217.
14. T.L. Casavant and J.G. Kuhl (1990) A communicating finite automata approach to modeling distributed computation and its application to distributed decision-making. IEEE Transactions on Computers, 39(5), 628-639.
15. G. Cybenko (1989) Load balancing for distributed memory multiprocessors. Journal of Parallel and Distributed Computing, 7, 279-301.
16. L.N. Bhuyan and D.P. Agrawal (1984) Generalized hypercube and hyperbus structures for a computer network. IEEE Transactions on Computers, 33(4), 323-333.
17. J.-W. Hong, X.-N. Tan, and M. Chen (1988) From local to global: an analysis of nearest neighbor balancing on hypercube. In Proceedings of ACM SIGMETRICS, pp. 73-82.
18. X.-S. Qian and Q. Yang (1991) Load balancing on generalized hypercube and mesh multiprocessors with LAL. In Proceedings of the 11th International Conference on Distributed Computing Systems, pp. 402-409.
19. C.Z. Xu and F.C.M. Lau (1992) The generalized dimension exchange method on some specific structures. Technical Report TR-92-02, Department of Computer Science, The University of Hong Kong.
20. D.P. Bertsekas and J.N. Tsitsiklis (1989) Parallel and Distributed Computation: Numerical Methods. Prentice-Hall.
21. M. Willebeek-LeMair and A.P. Reeves (1989) Distributed dynamic load balancing. In Proceedings of the Conference on Hypercube Concurrent Computers and Applications, pp. 609-612.
22. M. Willebeek-LeMair and A.P. Reeves (1990) Local vs. global strategies for dynamic load balancing. In Proceedings of the International Conference on Parallel Processing, Vol. I, pp. 569-570.
23. R. Luling and B. Monien (1992) Load balancing for distributed branch and bound algorithms. In Proceedings of the 6th International Parallel Processing Symposium, pp. 543-548.
24. S. Ranka, Y. Won, and S. Sahni (1988) Programming a hypercube multicomputer. IEEE Software, 5, 69-77.
25. Y. Shih and J. Fier (1989) Hypercube systems and key applications. In Parallel Processing for Supercomputers and Artificial Intelligence (K. Hwang and D. DeGroot, Eds.), pp. 203-243. McGraw-Hill.
26. S.H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan (1990) Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10, 160-166.
27. S. Fiorini and R.J. Wilson (1978) Edge-coloring of graphs. In Selected Topics in Graph Theory (L.W. Beineke and R.J. Wilson, Eds.), pp. 103-125. Academic Press.
28. C.Z. Xu and F.C.M. Lau (1992) Analysis of the generalized dimension exchange method for dynamic load balancing. Journal of Parallel and Distributed Computing, Dec. 1992. To appear.
29. J.-W. Hong, X.-N. Tan, and M. Chen (1989) Dynamic cyclic load balancing on hypercube. In Proceedings of the Conference on Hypercube Concurrent Computers and Applications, pp. 595-598.
30. C.G. Plaxton (1989) Load balancing, selection and sorting on the hypercube. In Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pp. 64-73.
31. J. JaJa and K.-W. Ryu (1990) Load balancing on the hypercube and related networks. In Proceedings of the 1990 International Conference on Parallel Processing, Vol. I, pp. 203-210.
32. F.C.H. Lin and R.M. Keller (1987) The gradient model load balancing method. IEEE Transactions on Software Engineering, 13(1), 32-38.
33. W. Shu and L.V. Kale (1989) A dynamic scheduling strategy for the Chare kernel systems. In Proceedings of Supercomputing, pp. 389-398.
34. R. Chowkwanyun and K. Hwang (1989) Multicomputer load balancing for concurrent Lisp execution. In Parallel Processing for Supercomputers and Artificial Intelligence (K. Hwang and D. DeGroot, Eds.), pp. 325-365. McGraw-Hill.
35. R. Luling, B. Monien, and F. Ramme (1991) Load balancing in large networks: a comparative study. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pp. 686-689.
36. L.V. Kale (1990) The Chare kernel parallel programming language and system. In Proceedings of the International Conference on Parallel Processing, Vol. II, pp. 17-25.
37. S.A. Baker and K.R. Milner (1991) A process migration harness for dynamic load balancing. In Proceedings of the World Transputer User Group Conference, pp. 52-61.
38. S.C. Su, P. Biswas, and R. Krishnaswamy (1989) Experiments in dynamic load balancing of parallel logic programs. In Proceedings of the Conference on Hypercube Concurrent Computers and Applications, pp. 623-626.
39. L.G. Valiant and G.J. Brebner (1981) Universal schemes for parallel communication. In Proceedings of the ACM Symposium on Theory of Computing, pp. 263-277.
40. Y. Ben-Asher, A. Cohen, A. Schuster, and J.F. Sibeyn (1992) The impact of task-length parameters on the performance of the random load-balancing algorithm. In Proceedings of the 6th International Parallel Processing Symposium, pp. 82-85.
41. M.O. Rabin (1976) Probabilistic algorithms. In Algorithms and Complexity: New Directions and Recent Results (J.F. Traub, Ed.), pp. 21-39. Academic Press.
42. S. Kirkpatrick, C. Gelatt, and M. Vecchi (1983) Optimization by simulated annealing. Science, 220(4598), 671-680.
43. R.H.J. Otten and L.P.P.P. van Ginneken (1989) The Annealing Algorithm. Kluwer Academic Publishers.
44. S.W. Bollinger and S.F. Midkiff (1991) Heuristic technique for processor and link assignment in multicomputers. IEEE Transactions on Computers, 40(3), 325-333.
45. G.C. Fox, A. Kolawa, and R. Williams (1987) The implementation of a dynamic load balancer. In Hypercube Multiprocessors (M.T. Heath, Ed.), pp. 114-121. SIAM.
46. J. Koller (1989) The MOOS II operating system and dynamic load balancing. In Proceedings of the Conference on Hypercube Concurrent Computers and Applications, pp. 599-602.
47. W.I. Williams. Load balancing and hypercubes: a preliminary look. In Proceedings of the 2nd Conference on Hypercube Multicomputers, pp. 108-113.
48. R. Diekmann, R. Luling, and J. Simon (1992) A general purpose distributed implementation of simulated annealing. In Proceedings of the 4th IEEE Symposium on Parallel and Distributed Processing, pp. 94-101.
49. D. Abramson (1992) A very high speed architecture for simulated annealing. IEEE Computer, 25(5), 27-36.
50. A.K. Dewdney (1984) Computer recreations. Scientific American, December 1984.
51. R.D. Williams (1991) Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5), 451-481.
52. G.C. Fox, W. Furmanski, J. Koller, and P. Simic (1989) Physical optimization and load balancing algorithms. In Proceedings of the Conference on Hypercube Concurrent Computers and Applications, pp. 591-594.
53. N. Mansour and G.C. Fox (1992) Allocating data to multicomputer nodes by physical optimization algorithms for loosely synchronous computations. Concurrency: Practice and Experience, 4(7), 557-574.