A New Load Balancing Algorithm in Parallel Computing

2010 Second International Conference on Communication Software and Networks

Kamran Zamanifar, Naser Nematbakhsh, Razieh Sadat Sadjady
Department of Computer Engineering
Islamic Azad University, Najafabad Branch
Isfahan, Iran
[email protected], [email protected], [email protected]

Abstract—Due to the outstanding progress in computer technology and an ever-rising demand for high-speed processing able to support the distributed mode, there is an increasing trend towards the use of parallel and distributed systems. One of the important stages of any system utilizing parallel computing is the load balancing stage, which aims to balance the workload among all of the system's processors. In this article, load balancing in parallel systems is studied; then a new load balancing algorithm with new capabilities is introduced. Among these capabilities is its independence from a separate route-finding algorithm between the load-receiving and load-sending nodes. Finally, the results of the simulation of this algorithm are presented.


Keywords- Parallel computing; load balancing

I. INTRODUCTION

With increasing advances in scientific endeavor and the need for high-speed processing, which may even tend toward distribution, the demand for parallel and distributed systems is growing rapidly. The presence of a number of processors in these systems highlights the necessity of distributing the workload uniformly among them, since studies have shown that in such systems the probability of one processor being idle while other processors have queues of tasks at hand is very high. The issue can, in fact, be presented thus: the use of parallel and distributed systems is valuable for the speed they add to processing tasks, but the capital needed to elevate a system to the parallel type seems logical only on condition that the workload of the system be distributed suitably among the processors. This aim becomes practical in parallel and distributed systems through the implementation of a certain type of algorithm called a "load balancing algorithm". In this article, we explain load balancing in parallel computation, present a new algorithm with new capabilities, study the structure of this algorithm based on the categorization in [8], simulate the algorithm, and finally present the simulation results.

978-0-7695-3961-4/10 $26.00 © 2010 IEEE DOI 10.1109/ICCSN.2010.27

II. LOAD BALANCING

As was mentioned, one of the issues considered in any parallel or distributed system is load balancing. In general, load balancing serves many goals, including minimizing execution time and maximizing resource utilization [1]. Load balancing algorithms are divided into two major groups: static and dynamic. In the former type, tasks are assigned to processors at compile time, based on an estimate of the time needed to complete each task, and no decision is made in this type about shifting a task from one processor to another during execution. In dynamic load balancing (DLB) algorithms, by contrast, the load status at any given moment is used to decide on task shifts between processors [1] [4] [8] [9]. Random, Central and Rendez-vous are among the existing load balancing algorithms, which can be seen in [2] [3] [8]. Load balancing algorithms can also be compared with respect to quality parameters, among which their nature and associated overhead can be mentioned [1]. Principally, in relation to load balancing algorithms, processors in parallel and distributed systems are divided into three groups based on their workload level:
• Processors with a large number of tasks waiting to be done, called heavily loaded (or sometimes overloaded) processors.
• Processors with a small number of tasks waiting to be done, referred to as lightly loaded (or underloaded) processors.
• Processors with no tasks to be done, named idle processors [1] [3].
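The three-way classification above can be sketched as follows. This is a minimal illustration only: the concrete queue-length thresholds are assumptions made here, since the text does not fix numeric boundaries for "heavy" and "light".

```python
# Illustrative sketch of the three processor groups described above.
# HEAVY_THRESHOLD and LIGHT_THRESHOLD are assumed values, not from the paper.

HEAVY_THRESHOLD = 10  # assumed: at least this many waiting tasks => heavily loaded
LIGHT_THRESHOLD = 3   # assumed: at most this many waiting tasks => lightly loaded

def classify(waiting_tasks: int) -> str:
    """Classify a processor by the number of tasks waiting in its queue."""
    if waiting_tasks == 0:
        return "idle"
    if waiting_tasks >= HEAVY_THRESHOLD:
        return "heavily loaded"
    if waiting_tasks <= LIGHT_THRESHOLD:
        return "lightly loaded"
    return "normal"

print(classify(0), classify(12), classify(2))  # idle heavily loaded lightly loaded
```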
Dynamic load balancing algorithms are categorized in [8] based on a structure in which algorithms are studied from various perspectives. For example, one of the criteria for algorithm categorization in this structure is "initiation", which concerns how an algorithm begins work and whether it is carried out periodically or is event-driven; this categorization can be seen in [8]. Various strategies have also been proposed for categorizing load balancing algorithms, each defining a different range of load balancing algorithms; examples of these strategies can be seen in [5] [7]. Another aspect which should be considered in relation to load balancing algorithms is the variety of parallel and

distributed operational environments. For example, the resources in a system may be of the same kind with the same capacity (homogeneous systems) or of various types with varying capacities (heterogeneous systems). Likewise, the kind of architecture used in the system (multi-data/multi-task) can affect the execution or even the definition of the algorithm. Other examples of this kind can be found in [1]. In the next part, after presenting the new algorithm, it will be studied based on the categorization in [8].

III. THE NEW LOAD BALANCING ALGORITHM

To begin the explication of this algorithm, it should be stated that, with respect to the variety of topologies used to connect processors, each processor has a definite maximum number of neighbors. For example, in a system in which the processors are connected by a mesh topology, each processor has at most four neighbors. We can define a field for each processor whose values identify each of its neighboring processors; for the sake of explanation, let us call this characteristic "direction". As an example, in the system represented by fig. 1, each processor has at most four neighbors; therefore "direction" in each processor can take four values, i.e. {left, right, up, down} or {1, 2, 3, 4}.

The idea of this algorithm stems from the observation that a processor may be neither idle nor overloaded but have both idle and overloaded processors neighboring it, and can therefore serve to relate them; in other words, the relating processor can send its neighbors the message that, while not idle itself, it does have an idle neighboring processor.

The algorithm is as follows: as soon as a processor falls idle, it sends a message to its neighbors. This message includes the number of the idle processor, the message number, a counter and a field indicating the validity of the message. The number of the processor is, in fact, its id. Any processor may fall idle many times; the message number is used to determine that a message is valid, i.e. not a previously expired one. The counter is incremented by one each time the message is conveyed from one processor to another, and thus records the distance of the message from the idle processor from which it originated. The neighboring receiver processor saves the message, complete with all its related information, together with the direction from which it came.
If a processor is at the underloaded level, it chooses, from the messages received from neighboring processors, the one with the highest priority (i.e. the closest) and forwards it to its own neighboring processors. Eventually an overloaded processor chooses, from among its received messages, the one with the highest priority and sends a portion of its load to the processor from which that message originated.
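The message flow described above can be sketched as follows. The message fields (idle processor id, message number, hop counter, validity flag) mirror the text; the Processor class, its method names and the wiring are hypothetical illustrations, not the paper's pseudo-code.

```python
# Hypothetical sketch of the idle-message propagation described above.

from dataclasses import dataclass

@dataclass
class Message:
    idle_id: int      # id of the processor that fell idle
    msg_number: int   # distinguishes successive idle episodes of the same processor
    counter: int      # hops travelled from the originating idle processor
    valid: bool = True

def opposite(direction):
    # In a mesh, a message sent "right" arrives from the receiver's "left".
    return {"left": "right", "right": "left", "up": "down", "down": "up"}[direction]

class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.neighbors = {}   # direction -> neighboring Processor
        self.inbox = {}       # direction -> latest Message received from there

    def fall_idle(self, msg_number):
        # As soon as a processor falls idle it announces itself to all neighbors.
        for direction, nb in self.neighbors.items():
            nb.receive(opposite(direction), Message(self.pid, msg_number, 1))

    def receive(self, direction, msg):
        # The receiver saves the message together with the arrival direction.
        self.inbox[direction] = msg

    def best_message(self):
        # The highest-priority message is the one closest to its idle origin.
        valid = [(d, m) for d, m in self.inbox.items() if m.valid]
        return min(valid, key=lambda dm: dm[1].counter) if valid else None

    def relay(self):
        # An underloaded processor forwards its best message onward,
        # incrementing the counter by one per hop.
        choice = self.best_message()
        if choice is None:
            return
        from_dir, msg = choice
        for direction, nb in self.neighbors.items():
            if direction != from_dir:
                nb.receive(opposite(direction),
                           Message(msg.idle_id, msg.msg_number, msg.counter + 1))

# Three processors in a row: p1 - p2 - p3.
p1, p2, p3 = Processor(1), Processor(2), Processor(3)
p1.neighbors["right"], p2.neighbors["left"] = p2, p1
p2.neighbors["right"], p3.neighbors["left"] = p3, p2

p1.fall_idle(msg_number=1)   # p1 falls idle and notifies p2
p2.relay()                   # p2 (underloaded) forwards the message to p3
print(p3.best_message())
```

After the relay, p3 holds a message whose counter is 2, recording that the idle processor p1 is two hops away in the "left" direction.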

Figure 1. Specific amounts of direction centered on the p1 processor

A part of the pseudo-code of this algorithm is shown in fig. 2 and 3. In this pseudo-code, d is the highest number of neighbors that any processor within a system with a specific topology can have; n is the number of processors in the system. The last_message_number array is an n*2 array in which last_message_number(p,1) holds the number of the last message sent by processor p when in idle status and last_message_number(p,2) holds the current status of that message: if it is 0 the message is no longer valid; if 1 the message is valid; and if 2 the message has been received by an overloaded processor that is ready to transfer load to the idle processor. After the load transfer, this field is reset to 0. Another array which is used is receive_message, an n*d*4 array. This array should, in fact, be defined as a separate d*4 array for each processor, but we have, for the sake of simplicity, defined it as noted. In this array, receive_message(p,d1) has four fields which save the message received from the d1 direction of processor p; as mentioned, the message includes: 1) the number of the idle processor, 2) the message number, 3) the counter and 4) the message validity field. The message validity field is 1 in normal status and falls to 0 after the message expires. Also, the path array is an n*n*n array; after the execution of the algorithm, for any overloaded processor i which finds an idle processor j to receive load, the path of load transfer from processor i to processor j is saved in path(i,j). It should be noted that this type of addressing differs from the typical one in that, instead of the processors' ids, the directions of transfer along the transfer course are stored in this array. This array, too, should be defined separately for each processor as a local n*n array, but we have used it in this form for simplicity of code definition.
The get_processorid(p,d1) function returns the number of the processor on the d1 side of processor p. After an overloaded processor chooses an idle processor, the workload of the overloaded processor is divided between the two processors according to their processing capacities; if the system is a multi-task system the workload is measured in terms of the number of tasks, and if multi-data, in terms of the data amount. It is then divided between the two processors.
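The capacity-proportional split described above can be illustrated as follows. The function name split_load is an assumption made for illustration; the paper's own pseudo-code for this step is in the figures.

```python
# A hedged sketch of the capacity-proportional load split described above.
# "Load" is a task count in a multi-task system and a data amount in a
# multi-data system.

def split_load(total_load: int, cap_overloaded: int, cap_idle: int):
    """Divide the load in proportion to the two processors' capacities."""
    transferred = total_load * cap_idle // (cap_overloaded + cap_idle)
    return total_load - transferred, transferred

print(split_load(12, 1, 1))  # (6, 6): equal capacities, half the load moves
print(split_load(12, 1, 3))  # (3, 9): the idle processor is three times faster
```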


Figure 2. The first part of the new load balancing algorithm pseudo-code

In relation to algorithm execution time, it should be said that each processor in any status (idle, overloaded, underloaded) can carry out its duty toward the load balancing algorithm independently, but it is better if, in the load transfer stage, that is, the transfer of load between the idle and overloaded processors, the execution is harmonized and simultaneous. After presenting the algorithm, we now wish to call the readers' attention to a number of points.
1. One of the capabilities of this algorithm is its ability to find the linking route between the idle and overloaded processors.
2. A threshold can be considered for the message counter; this is, in fact, another reason for using it. This threshold can lie between one and the maximum distance between two processors, so that any value surpassing the threshold automatically renders the message invalid. In addition, a threshold enables us to control the overhead resulting from continual repetition of message sending, to change the threshold level based on circumstances, and, through adjustment of the threshold level, to switch between various communication policies (local, uniform, global) even during execution; that is, we can allow workload transfer for any processor only within a certain circumference of its locality.

3. The algorithm attempts, as far as possible, to solve the workload problem of an idle processor locally, and to incur as little communication cost as possible.
4. The disadvantage of this method can be its high level of overhead in case of unsuitable system management. To overcome this liability, time ranges for the execution of the algorithm and an appropriate threshold level are necessary.
5. Considering the fact that this algorithm is a dynamic load balancing algorithm, we analyze it based on the categorization in [8]. In terms of initiation, the algorithm can be activated by both periodic and event-driven initiation; as previously mentioned, the load transfer stage is better controlled periodically. In relation to load balancer location, the algorithm operates in a distributed and asynchronous manner, meaning that all processors take part in the algorithm's execution but there is no need for simultaneous operation. Concerning information exchange, processor decisions can be based on global information, but this depends on the threshold level. The communication topology is uniform, meaning that a processor's neighbors do not change at any moment during execution; it is also worth mentioning that the workload exchanges can be global, and this, again, depends on the threshold level.
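The counter threshold of point 2 can be sketched as follows. The mapping from policy names to threshold values is an assumption made for illustration; the text only states that the threshold can vary between one and the maximum distance between two processors.

```python
# Illustrative sketch of the message-counter threshold from point 2 above.

def is_valid(counter: int, threshold: int) -> bool:
    """A message whose hop counter exceeds the threshold is expired."""
    return counter <= threshold

def threshold_for(policy: str, max_distance: int) -> int:
    # Assumed mapping: "local" confines messages to immediate neighbors,
    # "global" lets them reach any processor, and "uniform" sits in
    # between (taken here as half the maximum distance).
    return {"local": 1,
            "uniform": max(1, max_distance // 2),
            "global": max_distance}[policy]

# On an 8*8 torus the maximum distance between two processors is 8
# (at most 4 hops in each dimension), so a "global" policy uses threshold 8.
t = threshold_for("global", 8)
print(is_valid(3, t), is_valid(9, t))  # True False
```

Raising or lowering the threshold at run time is what lets the same mechanism switch between local, uniform and global communication policies during execution.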


Figure 3. The second part of the new load balancing algorithm pseudo-code

IV. SIMULATION

Since it was necessary to study the efficiency of the algorithm in various parallel environments, we considered certain aspects when simulating it. First, simulation was carried out for two topologies: a 2-d mesh with wraparound links (2-d torus 8*8) and a ring with 64 processors. In addition, in order to study the system's efficiency in homogeneous and heterogeneous environments, the experiments were conducted both for environments in which processors were equal in terms of processing capacity and for those in which the capacities differed within the range [1,3]; in the latter environment it was supposed that a processor with a processing capacity of 3 has a processing speed three times that of a processor whose capacity equals 1. Another issue at hand was the parallel system's architecture. We considered two architectures: SPMD (single program, multiple data) and MIMD (multiple instruction stream, multiple data stream). For the SPMD architecture we chose the data quantities from a [80,240] byte range with a uniform random distribution, and for the MIMD architecture from a [6,202] Kbyte range with a normal random distribution (µ=44 Kbyte, σ=400). Also, for the MIMD architecture, we chose the task computation time from a range of [64,768] msec with an exponential distribution (λ=0.006) for each unit of data. These numerical ranges and their distributions were chosen with guidance from the simulation presented in [7] and also considering the simulation presented in [6]. For each of the ensuing operational environments, experimentation was repeated several times in order to validate the results; eventually, the result was the simulated outcome for each of the eight characteristically varying environments. Instances of these simulation results can be seen in fig. 4 and fig. 5. For example, fig. 4 shows the result of simulation for an environment with SPMD architecture and homogeneous 2-d torus topology. The simulation results reflect processor status over the entire duration of the simulation; these figures therefore do not show processor status at any particular moment. The characteristics of each experimental environment and the average speedup (serial runtime / parallel runtime) of the experiments for that environment are shown in table I.
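The workload parameters above could be sampled as in the following sketch, using Python's standard library; the simulator itself is not described in the text, so this is illustrative only. The σ=400 value is taken from the text, which does not state its unit; bytes are assumed here, and clamping samples into the stated ranges is also an assumption.

```python
# Illustrative sampling of the simulation workload parameters described above.

import random

def spmd_data_size() -> float:
    # SPMD: data quantity uniform in [80, 240] bytes.
    return random.uniform(80, 240)

def mimd_data_size() -> float:
    # MIMD: data amount normal with mu = 44 Kbyte (sigma = 400, unit
    # assumed to be bytes), clamped to the [6, 202] Kbyte range.
    return min(max(random.gauss(44_000, 400), 6_000), 202_000)

def mimd_unit_time() -> float:
    # MIMD: per-data-unit computation time, exponential with
    # lambda = 0.006, clamped to the [64, 768] msec range.
    return min(max(random.expovariate(0.006), 64), 768)

random.seed(0)  # reproducible samples for the example
print(spmd_data_size(), mimd_data_size(), mimd_unit_time())
```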


Figure 4. The results for the homogeneous torus topology with SPMD architecture

Figure 5. The results for the heterogeneous ring topology with MIMD architecture

TABLE I. THE CHARACTERISTICS OF EACH EXPERIMENTAL ENVIRONMENT

number  Topology                 Homogeneous/Heterogeneous  SPMD/MIMD  Processing capacity  Speedup
1       2-D torus 8*8            Homogeneous                SPMD       64                   50.0774
2       2-D torus 8*8            Homogeneous                MIMD       64                   37.9567
3       2-D torus 8*8            Heterogeneous              SPMD       128                  56.1752
4       2-D torus 8*8            Heterogeneous              MIMD       128                  59.0516
5       ring with 64 processors  Homogeneous                SPMD       64                   49.1170
6       ring with 64 processors  Homogeneous                MIMD       64                   35.2917
7       ring with 64 processors  Heterogeneous              SPMD       128                  54.0600
8       ring with 64 processors  Heterogeneous              MIMD       128                  56.2839

Note that the number of processors in all environments is 64, but the sum of the processor capacities differs between environments; therefore the maximum speedup for the heterogeneous environments is 128. As is evident from the table, the best speedup result belongs to the homogeneous torus topology environment with SPMD architecture.

V. CONCLUSION

This article studied load balancing in parallel systems and presented a new load balancing algorithm with new capabilities, the most significant of which is its independence from a separate route-finding algorithm between the workload sender and receiver nodes. Further analysis of the algorithm revealed its advantages and disadvantages, as well as its capability to work in various operational environments. Finally, the simulation process and its results were presented. Future work is directed towards simulating other existing load balancing algorithms and comparing their operation with that of this new algorithm.

REFERENCES
[1] A. Chhabra, G. Singh, E. Waraich, B. Sidhu, G. Kumar, "Qualitative parametric comparison of load balancing algorithms in parallel and distributed computing environment," Proc. World Academy of Science, Engineering and Technology (PWASET), vol. 16, November 2006.
[2] C. Fonlupt, P. Marquet, J. Dekeyser, "Analysis of synchronous dynamic load balancing algorithms," Parallel Computing: State-of-the-Art Perspective (PARCO'95), Advances in Parallel Computing, Elsevier Science Publishers, September 1995.
[3] C. Fonlupt, P. Marquet, J. Dekeyser, "Data-parallel load balancing strategies," West Team, High Performance Computing, Laboratoire d'Informatique Fondamentale de Lille, Université de Lille 1, December 1996.
[4] A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second Edition, Addison Wesley, January 2003.
[5] H. Kuchen, A. Wagener, "Comparison of dynamic load balancing strategies," RWTH Aachen, Department of Computer Science, Aachener Informatik-Berichte (AIB), 1990.
[6] A. Legrand, H. Renard, Y. Robert, F. Vivien, "Mapping and load-balancing iterative computations," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 546-558, June 2004.
[7] R. Lüling, B. Monien, F. Ramme, "A study on dynamic load balancing algorithms," Proc. 3rd IEEE Symposium on Parallel and Distributed Processing (SPDP), pp. 686-689, 1992.
[8] A. Osman, H. Ammar, "Dynamic load balancing strategies for parallel computers," International Symposium on Parallel and Distributed Computing (ISPDC), Scientific Annals of Cuza University, vol. 11, pp. 110-120, 2002.
[9] M. Wu, "On runtime parallel scheduling for processor load balancing," IEEE Transactions on Parallel and Distributed Systems, vol. 8, pp. 173-186, February 1997.