A Distributed Diffusion Method for Dynamic Load Balancing on Parallel Computers

E. Luque, A. Ripoll, A. Cortés, T. Margalef
Departament d'Informàtica, Universitat Autònoma of Barcelona, 08193-Bellaterra (Barcelona), SPAIN
Phone No.: +34-3-581 13 56. Fax No.: +34-3-581 24 78. e-mail: [email protected]

Abstract. A parallel application can be divided into tasks that can be executed simultaneously, and a mechanism for assigning these tasks to the processors is required. The objective is to minimize the overall execution time of a single application running in parallel on a multicomputer system. We propose a new dynamic load balancing algorithm based on the diffusion approach which employs overlapping balancing domains (a processor and its neighbors) to achieve global balancing. Since current diffusion methods consider discrete units, the algorithms may produce solutions which, although they are locally balanced, prove to be globally unbalanced. Our method solves this problem by taking into account the maximum load difference between two processors within each domain, providing a more efficient load balancing process. The method is performed in a distributed fashion and can easily be scaled to support highly parallel machines. The algorithm has been applied to different interconnection networks and the results obtained are very encouraging.

I.- Introduction. Multiprocessor systems have been shown to be very efficient at solving problems with uniform computation and communication patterns [Hoc88]. However, there exists a large class of non-uniform problems with uneven and unpredictable requirements. To solve non-uniform problems efficiently on multiprocessor systems, more complex methods have to be devised in order to achieve a well-balanced overall load of the system. Two kinds of load balancing schemes have been proposed and reviewed in the literature [Cas88]. The first one, static load balancing, is used when the computational and communication requirements of a problem are known a priori. In this case, the assignment of tasks to processors is performed once, before the parallel application starts its execution. By contrast, dynamic load balancing is applied in situations where no a priori estimation of the load distribution is possible. It is only during the actual program execution that it becomes apparent how much work is being assigned to individual processors. In order to retain efficiency, the imbalance must be detected and an appropriate dynamic load balancing strategy must be devised. Our interest is focused on this second approach.

This work was supported by CICYT under contract number TIC 92/0547.

Some dynamic strategies that use local information in a distributed memory architecture have been proposed in the literature. These strategies describe rules for migrating tasks from overloaded processors to underloaded processors in a network of a given topology [Lin87]. Tradeoffs exist between achieving the goal of balancing the computational load and the communication costs associated with migrated tasks. In this paper a new strategy based on the diffusion approach is described to solve the dynamic load balancing problem. We propose a fully distributed load balancing algorithm that is present on each processor of a given topology and can easily be scaled to support highly parallel machines. No assumption has to be made about the underlying architecture of the parallel system. We show that, by interchanging load between processors within a domain, a globally balanced load is achieved. The problem statement and different works related to diffusion schemes are discussed in section II. In section III the new strategy for the dynamic load balancing process is described. Some examples are introduced to show how our algorithm corrects the imbalance obtained using other diffusion approaches. Experimental results are presented in section IV and the conclusions are given in section V.

II.- The problem statement and related work. The load balancing process can be divided into four phases [Will93]:
a) processor load evaluation: A load value is estimated for each processor in the system in order to use the concept of load unit in the second and third phases.
b) load balancing profitability determination: This phase evaluates the load imbalance factor and decides whether or not load balancing is profitable at that time.
c) task migration strategy: Sources and destinations for load migration are determined in this phase.
d) task selection strategy: The source processors select the most suitable tasks for efficient and effective load balancing and send them to the appropriate destinations.
Our interest is focused on the second and third phases. The second phase of dynamic load balancing strategies can be guided in two ways depending on the information used: centralized or distributed approaches. This classification is based on the scope of the domain applied, where a domain is the group of processors whose load is used to execute the load balancing strategy. Centralized schemes [Lin92] tend to be more accurate because they use global domains, in which the load information of all the processors in the system is collected. Nevertheless, these approaches tend to incur more overhead and do not scale well when the number of processors increases. Alternatively, distributed approaches produce less overhead at the cost of less accurate decisions, because they apply local domains [Sue92].

The goal of the task migration strategy phase is to obtain the amount of load to migrate and the target processors involved in the movement. For this purpose, one can define two different methodologies. One of them consists of describing heuristic rules for migrating load units from overloaded processors in the network to underloaded ones. A threshold parameter determines the triggering condition for these rules. This value can be fixed [Zho88] or variable, in which case it is tuned at execution time [Eag86]. These approaches try to find a tradeoff between achieving the goal of complete load balancing and the communication costs associated with migrating tasks [Cho82]. The other methodology corresponds closely to the iterative methods used for the solution of diffusion problems, which are developed to involve the minimum possible energy. In these methods, the surplus load can be interpreted as diffusing through the processors to reach a steady balanced state. To achieve this situation, a portion of the excess load of the overloaded processor is exchanged. Since this approach will not, in general, provide an immediately balanced solution, the process is iterated until the load difference between any two processors is smaller than a specified value. Approaches of this kind are known as diffusion methods. In order to compare different strategies based on diffusion schemes, we introduce the concept of exchange unit, which can be identified with a single processor, when overlapped domains are used, or with an entire domain, when independent domains are applied. This concept allows us to classify the different algorithms depending on the scope of their second phase, load balancing profitability determination. Different works include some refinements or simple modifications to the diffusion method [Sal90][Hor93][Boi90][Cyb89][Will93].
Diffusion methods with overlapped domains have, however, two disadvantages which result from the local nature of the load information used. Firstly, the number of iterations required by the load balancer may be high, and is unknown a priori. Secondly, since the work is packaged into discrete units, the algorithms may produce solutions which, although they are locally balanced, prove to be globally unbalanced. Such a situation is shown in the example of figure 1, where the work load of five processors connected in a linear array is obtained as a balanced solution when applying the previous algorithms.

Figure 1.- A globally unbalanced situation.

We propose an alternative dynamic load balancing strategy that uses the diffusion approach with overlapped domains to obtain a more accurate and stable situation, in which the maximum load difference between any two processors in the system is smaller than the one obtained by applying the previous algorithms.

III.- An enhanced diffusion method. The load balancing algorithm proposed in this section belongs to the family of diffusion methods. In section III.1 we describe the simple diffusion method using real and integer numbers. This method is extended in section III.2.

III.1 Simple diffusion method (SDM)

Essentially, the simple diffusion method (SDM) is as follows: each processor pi compares its current load average $\bar{L}_i$ with the load of each of its neighbors in turn and transfers enough work units to achieve a local load balance. This process is repeated until all processors detect the load to be locally balanced. In order to describe the behavior of this algorithm in a formal way, we introduce the mathematical descriptions and notations shown in table 1. Ni is the group of immediate neighbor processors of pi. Li represents the load of the processor pi in terms of load units. The load average of the processor pi, $\bar{L}_i$, is calculated by adding Li to the load of all the processors belonging to Ni and dividing this value by the number of neighbors plus one (#Ni+1). The difference between Li and $\bar{L}_i$ represents the excess load E(pi) of the processor pi. The deficit D(pji) of a neighbor processor pj with respect to the processor pi is obtained by subtracting the load of the processor pj from the load average $\bar{L}_i$. It is possible to define a subgroup of Ni, NiLOW, which includes the neighbor processors whose load is under $\bar{L}_i$. Finally, the portion Pij of the load excess to be migrated to each processor pj belonging to NiLOW is evaluated by dividing D(pji) by the sum of the load deficits of all the processors belonging to NiLOW.

Ni: processor's neighbors; $N_i = \{\, p_j \mid d(p_i, p_j) = 1 \,\}$.
Li: processor pi's load; ---
$\bar{L}_i$: load average of the processor pi; $\bar{L}_i = \dfrac{\sum_{\forall j \in N_i} L_j + L_i}{\#N_i + 1}$.
E(pi): pi's excess load compared to $\bar{L}_i$; $E(p_i) = L_i - \bar{L}_i$.
D(pji): processor pj's load deficit compared to $\bar{L}_i$; $D(p_{ji}) = \bar{L}_i - L_j$.
NiLOW: processor pi's neighbors with a load value under $\bar{L}_i$; $N_i^{LOW} = \{\, p_j \mid p_j \in N_i \wedge L_j < \bar{L}_i \,\}$.
Pij: load portion to send from pi to pj; $P_{ij} = \dfrac{D(p_{ji})}{\sum_{\forall j \in N_i^{LOW}} D(p_{ji})}$.

Table 1: SDM Parameters.

The formal description of the behavior of the SDM is as follows: each processor pi collects its own load and the load of its immediate neighbors in order to evaluate the load average of its domain ($\bar{L}_i$) and its potential load excess (E(pi)). Afterwards, if the processor detects that its load excess is positive (E(pi)>0) and that some of its neighbors have a load under $\bar{L}_i$ (#NiLOW ≠ 0), it calculates the amount of load to migrate (Lsendj) to each underloaded processor pj by multiplying E(pi) by Pij, the load portion to be migrated. This algorithm is described in figure 2.

begin
  if (#NiLOW ≠ 0) && (E(pi) > 0) then
    for j := 1 to #NiLOW do
    begin
      Lsendj = E(pi) * Pij
      migrate (Lsendj load) to pj
    end
end

Figure 2.- SDM using real numbers.
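The rule of figure 2 can be sketched for a single processor as follows. This is only an illustrative reading of table 1, not the authors' implementation: the function name `sdm_sends`, the dictionary representation of the neighbor loads and the sample values are assumptions introduced here.

```python
def sdm_sends(load_i, neighbor_loads):
    """Real-valued SDM decision of one processor p_i (hypothetical helper).

    neighbor_loads maps a neighbor id to its current load.
    Returns {neighbor id: amount of load to send}.
    """
    # Load average of the domain: own load plus neighbor loads over (#N_i + 1).
    avg = (load_i + sum(neighbor_loads.values())) / (len(neighbor_loads) + 1)
    excess = load_i - avg  # E(p_i)
    # Deficits D(p_ji) of the neighbors below the domain average (N_i^LOW).
    deficits = {j: avg - load for j, load in neighbor_loads.items() if load < avg}
    if excess <= 0 or not deficits:
        return {}
    total_deficit = sum(deficits.values())
    # Lsend_j = E(p_i) * P_ij, with P_ij = D(p_ji) / total deficit.
    return {j: excess * d / total_deficit for j, d in deficits.items()}


# Illustrative domain: p_i holds 10 units, its neighbors hold 2, 6 and 14.
sends = sdm_sends(10, {"a": 2, "b": 6, "c": 14})
print(sends)
```

With these sample values the domain average is 8, so the excess of 2 units is split 3:1 between the two neighbors below the average, and the neighbor above the average receives nothing.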

If this algorithm is applied to the example of figure 3a, where the ideal case of a fully connected network is shown, we can observe that the equilibrium is reached in only one step. In this case, the load balancing decisions are based on information belonging to all the processors of the system and, therefore, each processor can evaluate the total load within the system and calculate the average load among the processors ($\bar{L}_0 = \bar{L}_1 = \bar{L}_2 = \bar{L}_3 = 27.5$ in the example). In figure 3a processors p0 and p1 are underloaded, and processors p2 and p3 are overloaded, with load excesses E(p2)=7.5 and E(p3)=12.5 respectively. The load migrated from the overloaded processor p2 to p0 is Lsend0=P20*E(p2)=4.68, which represents a portion of its excess load. This amount of load is proportional to the deficit detected in processor p0 and is calculated using the expression Pij defined in table 1 (P20=12.5/20 in the example). Once the overloaded processors (p2, p3) have executed this algorithm, the load of the system is balanced directly (figure 3b), and only one step is required. This algorithm is completely distributed and uses overlapped domains, where a domain includes all the processors in the system and the exchange unit is defined as a single processor.
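The arithmetic of this example can be rechecked by hand. The individual loads are not printed in the text; the values L0=15, L1=20, L2=35 and L3=40 used below are inferred from the stated average (27.5), the two excesses (7.5 and 12.5) and the portion P20=12.5/20, so they should be read as an assumption.

```latex
\begin{align*}
\bar{L} &= \tfrac{1}{4}(15 + 20 + 35 + 40) = 27.5 \\
E(p_2) &= 35 - 27.5 = 7.5, \qquad E(p_3) = 40 - 27.5 = 12.5 \\
D(p_0) &= 27.5 - 15 = 12.5, \qquad D(p_1) = 27.5 - 20 = 7.5 \\
P_{20} &= P_{30} = \tfrac{12.5}{20} = 0.625, \qquad P_{21} = P_{31} = \tfrac{7.5}{20} = 0.375 \\
p_2 &\to p_0:\ 7.5 \times 0.625 = 4.6875, \qquad p_2 \to p_1:\ 7.5 \times 0.375 = 2.8125 \\
p_3 &\to p_0:\ 12.5 \times 0.625 = 7.8125, \qquad p_3 \to p_1:\ 12.5 \times 0.375 = 4.6875 \\
L_0' &= 15 + 4.6875 + 7.8125 = 27.5, \qquad L_1' = 20 + 2.8125 + 4.6875 = 27.5
\end{align*}
```

The first transfer, 4.6875, is the value quoted in the text as 4.68. Since p2 and p3 give away exactly their excesses, all four processors end at 27.5 after a single step.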

3.a Before balancing

3.b After balancing

Figure 3.- Fully connected system balanced in one step

The main constraint of this algorithm is that it represents an unrealistic situation. In message passing multiprocessor systems, each processor is limited to load information from within its own domain and cannot evaluate the average load within the system. However, the same algorithm can be applied considering the domain of every processor. With this strategy, every processor tries to balance its load with the load of the other processors of its domain. In such a situation, the strategy still balances the load within the system, although the number of steps required depends on the initial distribution of the load among the processors and on the interconnection network of the system. Figure 4 shows the behavior of the SDM algorithm applied to a system that is not fully connected. In this case it is not possible to reach the equilibrium in one step, because we are restricted to using local information instead of global information.

Figure 4.- SDM in a non-fully connected system

The SDM considers the load of each processor as a real number. However, in contrast to genuine diffusion problems, the work in a multiprocessor system is packaged into discrete units and, therefore, the amount of load to be migrated from one processor to another must be truncated to an integer number of load units. Taking this consideration into account, the algorithm may produce solutions which, although they are locally balanced, prove to be globally unbalanced. In order to consider this new requirement, it is necessary to include a simple modification in the algorithm described in figure 2. The amount of load to migrate will be the integer part of Lsendj (⌊Lsendj⌋) instead of the initial real number. This version of the simple diffusion method is shown below:

begin
  if (#NiLOW ≠ 0) && (E(pi) > 0) then
    for j := 1 to #NiLOW do
    begin
      Lsendj = E(pi) * Pij
      migrate (⌊Lsendj⌋ load) to pj
    end
end

Figure 5.- Discrete SDM
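A small sketch shows how truncation can freeze a globally unbalanced state. The linear array of five processors with loads 2, 3, 4, 5, 6 is a hypothetical example chosen here, not the data of any figure, and `discrete_sdm_sends` is an illustrative helper name.

```python
import math


def discrete_sdm_sends(load_i, neighbor_loads):
    """Discrete SDM decision of one processor: real SDM amounts, truncated."""
    avg = (load_i + sum(neighbor_loads.values())) / (len(neighbor_loads) + 1)
    excess = load_i - avg
    deficits = {j: avg - load for j, load in neighbor_loads.items() if load < avg}
    if excess <= 0 or not deficits:
        return {}
    total_deficit = sum(deficits.values())
    # Integer part of Lsend_j = E(p_i) * P_ij.
    return {j: math.floor(excess * d / total_deficit) for j, d in deficits.items()}


# Hypothetical linear array p0 - p1 - p2 - p3 - p4 with a load gradient.
loads = [2, 3, 4, 5, 6]
neighbors = {
    i: [j for j in (i - 1, i + 1) if 0 <= j < len(loads)]
    for i in range(len(loads))
}

all_sends = {
    i: discrete_sdm_sends(loads[i], {j: loads[j] for j in neighbors[i]})
    for i in neighbors
}
moved = sum(sum(s.values()) for s in all_sends.values())
print(moved, max(loads) - min(loads))
```

Every interior processor sits exactly on its domain average, and the only processor with a positive excess (p4, excess 0.5) truncates its transfer to zero, so no unit moves even though the extreme loads differ by four units.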

If we apply this discrete version to the same example used in figure 4, we obtain the results shown in figure 6. It is interesting to observe that the final state in this case does not represent the perfect equilibrium, because the maximum difference between any two processors in the system is over one.

Figure 6.- Discrete-SDM in a non-full connected system

This problem can produce two different situations. One of them, the gradient effect, is due to the propagation of a difference of one load unit, which arises when it is not possible to obtain exactly the same load value in each processor because the load is not a multiple of the number of processors. This situation is represented in figure 7a. The other problematic situation is shown in figure 7b. In this case there exists a local maximum which cannot distribute its excess load.

7.a Gradient effect problem.

7.b Local maximum problem.

Figure 7.- Problematic situations.

To solve these problems, we propose an alternative algorithm which is able to detect unbalanced domains and to correct the global imbalance of the system.

III.2 DASUD (Diffusion Algorithm Searching Unbalanced Domains) algorithm. This algorithm uses overlapped domains where every domain includes all the immediate neighbors of the underlying processor, so that the exchange unit is identified with a single processor. The triggering condition for the second phase is the load average of each domain, which is similar to the SID algorithm described in [Will93]. The mathematical descriptions and notations required to describe the proposed algorithm are shown in table 2, and have the following meaning. Ldifi is the maximum difference between any two processors in the domain of the processor pi (Ldifi = max(Lk-Lj), ∀k, j ∈ Ni, k ≠ j).

The Lvi value is a new neighborhood load average, which is calculated by dividing the sum of the loads of all the neighbor processors by the number of neighbors (#Ni).

σv is defined as the accumulation of the absolute difference value between every neighbor's load and Lvi. σv is called the deviation in the neighborhood of the processor pi.

Ldifi: maximum difference between any two processors in the domain; $Ldif_i = \max(L_k - L_j),\ \forall k, j \in N_i,\ k \neq j$.
Lvi: load average of the neighbor processors of pi; $Lv_i = \dfrac{\sum_{\forall j \in N_i} L_j}{\#N_i}$.
σv: neighborhood deviation; $\sigma_v = \sum_{\forall j \in N_i} \left| Lv_i - L_j \right|^2$.

Table 2.- DASUD parameters
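The three parameters can be computed for one domain as in the following sketch. Several points are assumptions made here: the domain used for Ldifi is taken to include pi itself (the text speaks of the domain of pi, and a local maximum could not be detected otherwise), σv follows the prose definition as a sum of absolute differences (the algorithm only tests whether it is zero), and the processor's own loads 3 and 6 are hypothetical. The neighbor loads are chosen to match the values Lv=3, σv=2 and Lv=4, σv=0 quoted later for figure 7.

```python
def dasud_parameters(load_i, neighbor_loads):
    """Return (Ldif, Lv, sigma_v) for the domain of one processor."""
    # Assumption: the domain is p_i together with its immediate neighbors.
    domain = [load_i] + list(neighbor_loads)
    ldif = max(domain) - min(domain)
    # Neighborhood load average: neighbors only, p_i excluded.
    lv = sum(neighbor_loads) / len(neighbor_loads)
    # Neighborhood deviation as described in the prose.
    sigma_v = sum(abs(lv - load) for load in neighbor_loads)
    return ldif, lv, sigma_v


gradient = dasud_parameters(3, [2, 4])   # domain on a load gradient
local_max = dasud_parameters(6, [4, 4])  # processor above equal neighbors
print(gradient, local_max)
```

In both cases Ldif is 2, so the domain is flagged as unbalanced; σv then separates the gradient case (non-zero) from the local maximum case (zero).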

Taking into account the meaning of these parameters, our algorithm has the following characteristics:
- A local diffusion approach is defined which employs overlapping balancing domains to achieve global balancing.
- The algorithm is purely distributed and asynchronous. Each processor acts independently, distributing load excess to deficient neighbors.
- Balancing is performed by each processor whenever it detects that it is overloaded (E(pi)>0).
- If an overloaded processor has underloaded neighbors (#NiLOW ≠ 0 and E(pi)>0), it distributes its excess load proportionally to each neighbor's deficit. The load amount migrated from processor pi to processor pj will be Lsendj, defined as the integer part of the product Pij*E(pi).
- If Lsendj 1.
- If Ldifi>1, then it is necessary to calculate Lvi and σv. Depending on these values two situations can be found:
  - If σvi=0, all the neighbors have the same load and the underlying processor can distribute its extra load one unit at a time, in sequential order, among the processors within its domain. The processor stops when its load is one unit more than Lvi.
  - If σvi≠0 and the highest load in the neighborhood is higher than the load of the underlying processor, then one load unit is migrated to the processor with the lowest load.
All the previous situations describe the behavior of the proposed algorithm, which is shown in figure 8.

begin
  if (#NiLOW ≠ 0) && (E(pi) > 0) then
    for j := 1 to #NiLOW do
    begin
      Lsendj = ⌊E(pi) * Pij⌋;
      migrate (Lsendj load units) to pj;
    end
  else if Ldifi > 1 then
  begin
    calculate Lvi and σvi;
    case σvi in
      0: distribute [Li - (Lvi+1)] load units among neighbors;
      *: migrate one load unit to lowest load neighbor
    end;
  end;
end.

Figure 8.- DASUD (Diffusion Algorithm Searching Unbalanced Domains)

If this new algorithm is applied to the two examples shown in figure 7, we obtain the situations shown in figure 9. In both cases, the state reached is a stable situation where the maximum difference between any two processors is at most one. This propagation of load through the overlapped domains of the system provides a better globally balanced system. In the example of figure 7a, processor p1 detects that its domain is unbalanced due to a gradient effect (Lv1=3 and σv1=2). This processor migrates one load unit to its lowest load neighbor (processor p0). Afterwards, when processor p2 executes the balancing algorithm, it migrates one load unit to processor p1, providing the final globally balanced situation shown in figure 9a. When the unbalanced domain is due to a local maximum, as shown in figure 7b (Lv2=4 and σv2=0), the processor with a load excess (p2 in the example) detects this state and distributes its excess load units one by one among its neighbors. The result obtained by applying our algorithm to the example of figure 7b is shown in figure 9b.
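The two activations described for the gradient case can be replayed with a minimal sketch of one DASUD activation. It is a reading of figure 8 under stated assumptions, not the authors' code: the second stage is entered whenever the truncated SDM stage moves nothing, the domain used for Ldifi includes the processor itself, and the three-processor array with loads 2, 3, 4 is hypothetical (chosen so that Lv1=3 and σv1=2). The names `dasud_sends` and `activate` are illustrative.

```python
import math


def dasud_sends(i, loads, neighbors):
    """Units that processor i sends to each neighbor in one activation."""
    nbrs = neighbors[i]
    avg = (loads[i] + sum(loads[j] for j in nbrs)) / (len(nbrs) + 1)
    excess = loads[i] - avg
    low = [j for j in nbrs if loads[j] < avg]
    sends = {}
    if excess > 0 and low:
        # First stage: discrete SDM (integer part of E(p_i) * P_ij).
        total_deficit = sum(avg - loads[j] for j in low)
        for j in low:
            units = math.floor(excess * (avg - loads[j]) / total_deficit)
            if units > 0:
                sends[j] = units
    if sends:
        return sends
    # Second stage (assumption: reached when the first stage moves nothing).
    domain = [loads[i]] + [loads[j] for j in nbrs]
    if max(domain) - min(domain) <= 1:
        return {}
    nbr_loads = [loads[j] for j in nbrs]
    lv = sum(nbr_loads) / len(nbrs)
    sigma_v = sum(abs(lv - load) for load in nbr_loads)
    if sigma_v == 0:
        # Local maximum: hand out L_i - (Lv_i + 1) units one by one.
        for k in range(int(loads[i] - (lv + 1))):
            j = nbrs[k % len(nbrs)]
            sends[j] = sends.get(j, 0) + 1
        return sends
    if max(nbr_loads) > loads[i]:
        # Gradient: one unit to the neighbor with the lowest load.
        return {min(nbrs, key=lambda j: loads[j]): 1}
    return {}


def activate(i, loads, neighbors):
    """Apply the decision of processor i to the load vector in place."""
    for j, units in dasud_sends(i, loads, neighbors).items():
        loads[i] -= units
        loads[j] += units


neighbors = {0: [1], 1: [0, 2], 2: [1]}  # linear array p0 - p1 - p2
loads = [2, 3, 4]
activate(1, loads, neighbors)  # p1 sees the gradient, sends one unit to p0
after_p1 = list(loads)
activate(2, loads, neighbors)  # p2 now has an underloaded neighbor p1
print(after_p1, loads)
```

The same function handles a local maximum: for loads 4, 6, 4 on this array, the first stage truncates both transfers to zero and the second stage sends the single unit above Lv+1 to one neighbor.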

9.a Gradient situation

9.b Local maximum situation

Figure 9. Solutions obtained applying DASUD algorithm

IV.- Experimental Results. In order to evaluate the quality of the DASUD algorithm, a simulation environment was developed. The Discrete-SDM algorithm and the DASUD algorithm were evaluated on different interconnection topologies. Figure 10 includes a representative set of the experiments performed, using interconnection networks such as wraparound mesh, hypercube and linear array. The initial load values are represented in the network model, which includes a node for each available processor and one edge for each connection in the system. In this model, the shaded nodes indicate the initially overloaded processors. The two following columns represent the final state reached using the Discrete-SDM algorithm and the DASUD algorithm respectively. The highlighted numbers in the final state represent the maximum load difference between any two processors in the system. The last column is used to compare the final situations using the following parameters:

σn: the standard deviation;
steps: the number of steps needed to achieve the final state;
σn min: the minimum standard deviation for each particular case.

The minimum value of σn (σn min) depends on the global load in the system. When the global load is not a multiple of the number of processors, it is not possible to obtain a stable state where each processor in the system has the same load value. The most accurate equilibrium in these cases is the one where the maximum difference between any two processors is one load unit. In such situations, the minimum standard deviation cannot be zero (σn min ≠ 0). By contrast, if the global load is a multiple of the number of processors, the minimum value of σn must be zero (σn min=0). In figure 10 three different examples are shown.

[Figure 10 columns: Network (initial state); Discrete-SDM (final state); DASUD (final state); Results.]

(a) Discrete-SDM: σn = 1.964, steps = 5. DASUD: σn = 0.78, steps = 5. σn min = 0.33.
    Final-state load values (Discrete-SDM and DASUD together): 26 24 26 26 20 23 24 23 23 23 24 23 25 23 25 24
(b) Discrete-SDM: σn = 2.66, steps = 5. DASUD: σn = 1.21, steps = 8. σn min = 0.46.
    Discrete-SDM final state: 40 38 34 38 / 38 34 29 34 / 35 37 35 32 / 38 37 37 35
    DASUD final state: 37 35 35 34 / 37 37 36 34 / 34 37 34 36 / 37 35 37 36
(c) Discrete-SDM: σn = 1.09, steps = 3. DASUD: σn = 0, steps = 4. σn min = 0.
    Discrete-SDM final state: 7 6 4 6 7
    DASUD final state: 6 6 6 6 6

Figure 10.- Experimental Results.

Examples 10a and 10b give a value of σn min different from zero because, in both cases, the global load is not a multiple of the number of processors. If we compare the values of σn obtained with the DASUD algorithm and with the Discrete-SDM algorithm against σn min, we can observe that in each example our algorithm achieves better results than the Discrete-SDM algorithm. Example 10c is a situation where it is possible to reach the perfect equilibrium, because the global load in the system is a multiple of the number of processors. The DASUD algorithm reaches this perfect situation using only one step more than Discrete-SDM, which ends in a poor final state. The results obtained show that the DASUD algorithm provides a better balanced final state than the Discrete-SDM algorithm, using a similar number of steps.
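The two comparison metrics can be recomputed from the printed values. The sketch below assumes that σn is the population standard deviation of the processor loads (this reproduces the printed figures) and derives σn min from the most even integer distribution, in which r = total mod n processors hold one unit more than the others; the function names are illustrative.

```python
import math


def sigma_n(loads):
    """Population standard deviation of the processor loads."""
    mean = sum(loads) / len(loads)
    return math.sqrt(sum((load - mean) ** 2 for load in loads) / len(loads))


def sigma_n_min(total_load, n):
    """Smallest sigma_n reachable with integer loads on n processors."""
    r = total_load % n  # processors that must hold one extra unit
    return math.sqrt(r * (n - r)) / n


# Example (c): Discrete-SDM final state 7 6 4 6 7, DASUD final state 6 6 6 6 6.
print(sigma_n([7, 6, 4, 6, 7]), sigma_n([6, 6, 6, 6, 6]), sigma_n_min(30, 5))

# Example (b): the sixteen final-state values add up to 571.
print(sigma_n_min(571, 16))
```

For example (c) this gives about 1.095 for Discrete-SDM (printed as 1.09) and 0 for DASUD, with σn min = 0; for example (b) it gives about 0.4635, in line with the printed σn min = 0.46.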

V. Conclusions. The proposed DASUD algorithm for load balancing offers a good alternative to previously proposed load diffusion techniques. This load balancing strategy is distributed over the whole processor network and does not need any global synchronization. Synchronization is restricted to local interactions between neighboring processors. The algorithm seems to be well suited to any network, because it can adapt itself to each new processor topology. Since current diffusion methods consider discrete units, the algorithms may produce solutions which, although they are locally balanced, prove to be globally unbalanced. Our algorithm solves this problem, and the experimental results obtained show that such situations are corrected and a globally balanced state is achieved. The results obtained using the DASUD algorithm confirm our expectations and are very encouraging.

References

[Boi90] J.E. Boillat and P.G. Kropf, A fast distributed mapping algorithm, in CONPAR 90 - VAPP IV, Proceedings, Springer LNCS 457, 1990.
[Boi90] J.E. Boillat, Load Balancing and Poisson equation in a graph, Concurrency: Practice and Experience, vol. 2(4), December 1990, pp. 289-313.
[Cas88] T.L. Casavant and J.G. Kuhl, A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems, IEEE Transactions on Software Engineering, vol. 14, No. 2, February 1988, pp. 141-154.
[Cho82] T.C.K. Chou and J.A. Abraham, Load balancing in distributed systems, IEEE Trans. Software Engrg., 8, 1982, pp. 401-412.
[Cyb89] G. Cybenko, Dynamic Load Balancing for Distributed Memory Multiprocessors, J. Parallel Distributed Comput. 7 (1989), pp. 279-301.
[Eag86] Derek L. Eager, Edward D. Lazowska and John Zahorjan, Adaptive Load Sharing in Homogeneous Distributed Systems, IEEE Trans. Software Engrg., vol. SE-12, No. 5, May 1986, pp. 662-675.
[Fox88] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon and D. Walker, Solving Problems on Concurrent Processors I, Prentice-Hall, Englewood Cliffs, 1988.
[Hoc88] R.W. Hockney and C.R. Jesshope, Parallel Computers 2, Adam Hilger, Bristol, 2nd ed., 1988.
[Hor93] G. Horton, A Multi-level Diffusion Method for Dynamic Load Balancing, Parallel Computing 19 (1993), pp. 209-218.
[Lin92] Hwa-Chun Lin and C.S. Raghavendra, A Dynamic Load-Balancing Policy With a Central Job Dispatcher (LBC), IEEE Trans. Software Engrg., vol. 18, No. 2, February 1992, pp. 148-158.
[Sal90] Vikram A. Saletore, A Distributed and Dynamic Load Balancing Scheme for Parallel Processing of Medium-Grain Tasks, Proc. Fifth Distributed Memory Comput. Conference, 1990, pp. 994-999.
[Sue92] T.T.Y. Suen and J.S.K. Wong, Efficient Task Migration Algorithm for Distributed Systems, IEEE Transactions on Parallel and Distributed Systems, vol. 3, No. 4, July 1992, pp. 488-499.
[Will93] Marc H. Willebeek-LeMair and Anthony P. Reeves, Strategies for Dynamic Load Balancing on Highly Parallel Computers, IEEE Transactions on Parallel and Distributed Systems, vol. 4, No. 9, September 1993, pp. 979-993.
[Zho88] Songnian Zhou, A Trace-Driven Simulation Study of Dynamic Load Balancing, IEEE Trans. Software Engrg., vol. 14, No. 9, September 1988, pp. 1327-1341.