A Dynamic Load Dispersion Algorithm for Load-Balancing in a Heterogeneous Grid System

David Solomon Acker, Sarvesh Kulkarni
Electrical and Computer Engineering Department, Villanova University, Villanova, Pennsylvania 19085
Email: {[email protected], [email protected]}

Abstract - The ever-changing demands on computational resources have information systems managers looking for more flexible solutions. Using a "bigger box" with more and faster processors, more permanent storage, or more random access memory (RAM) is not a viable solution, because system usage patterns vary: for such a system to handle the peak load adequately, it must go underutilized much of the time. A grid-based distributed system can solve this problem by allowing multiple independent jobs to run over a network of heterogeneous computers. Applications can be based on several parallel jobs, several sequential jobs, or a single job. Keeping the workload represented by these jobs balanced over the network of computers requires network-aware scheduling algorithms that are dynamic, transparent, scalable, and quick. We present such an algorithm, which handles load-balancing of jobs submitted at any point in the grid. Our algorithm accommodates jobs with differing CPU and I/O requirements and load-balances them over varying grid loads and varying network latencies.

I. INTRODUCTION
Distributed systems are a popular way of handling the variable demands of back office applications. Grid computing, in particular, is being targeted as the solution for dealing with a load that varies in both the number of clients and the complexity of work required per client. One definition of a grid comes from Buyya [1], which states that a grid is a parallel and distributed system that enables the sharing of autonomous and geographically diverse computing resources. One difficult issue is how to assign the workload among the various systems within the grid, with the primary goal of balancing the load over all the systems. A slightly different approach might be to assign the load such that jobs finish quickly. The two goals may diverge, especially when there is a large variance in the performance of the systems comprising the grid. Load-balancing algorithms have been developed that try to resolve this problem by limiting some parameters in order to reduce complexity, particularly in high performance computing (HPC). In HPC, the systems used are usually either multiprocessor shared-memory systems or multicomputer systems in which the individual subsystems have similar architectures, computational speed, and I/O capabilities. The network latency and bandwidth, long-term storage capabilities, and random access memory capacity are all assumed to be equal. A truly flexible system would allow for a completely heterogeneous environment with varying architectures and varying computational and I/O speeds. The challenge is to achieve balance between the computational and I/O workload on a grid composed of disparate systems.
A. Assumptions and Simplifications in Existing Algorithms
The simplest system to load-balance is a shared-memory, shared-storage multiprocessor system, in which all the processors are connected by high-speed, low-latency processor interconnects. Xu and Hwang [7] use a traditional multiprocessor shared-memory system, the iPSC/2 hypercube. Their model has a central host machine that controls the submission of tasks and periodically samples and distributes the load information to be used by the other nodes for load-balancing purposes. This approach requires considerable a priori job-specific information. In addition, the algorithm allows an overloaded processor to migrate a process to another processor, although the migration can only occur within its immediate neighborhood. The homogeneity of the system and the centralized nature of the host make this algorithm unsuitable for grid computing. Another problem simplification is offered by Ghose et al. [3]. In this model, the communication startup time, the message-passing latency, and the latency in returning processed results are ignored. Furthermore, it is assumed that communication and computation can occur simultaneously through a communication co-processor. These assumptions pose significant problems when applied to current clusters, which often use sockets-based connections over TCP/IP on gigabit Ethernet. In such an environment, TCP/IP imposes CPU and memory overheads that cannot be ignored. Buzen and Chen [2] employ non-linear optimization for load-balancing among nodes with different processing speeds. However, their solution ignores the effect of communication delays. They assume that each node has an independent channel to the memory, which is reasonable only when higher-end network switches are used. The algorithm also relies on a master node to distribute the load, making it susceptible to system failure should the master node be incapacitated.
The authors also assume that a job arrives into a queue that processes one job at a time; modern operating systems can run multiple threads and processes at the same time, so new jobs run concurrently with older jobs. Lüling and Monien [6] provide a dynamic solution to distributed load-balancing, with the load defined as the number of jobs currently running. However, they ignore the effects of network latency. Liu et al. [4] use an agent-based system, where work is always moved from the busiest nodes to the least busy nodes. However, they assume that the grid is homogeneous.
B. Motivation for a Network-Aware Job Scheduling Algorithm
Motivated by the shortcomings of these earlier algorithms, we propose in this paper a new distributed network-aware job scheduling algorithm that is robust as well as flexible enough to be (i) distributed in operation, to enhance robustness, (ii) load-balancing in nature with built-in estimation of network latency and available bandwidth, to minimize job completion time, (iii) able to handle a variable number of heterogeneous nodes, for increased flexibility, and (iv) capable of dynamically adapting to changing operating parameters. The algorithm should handle a grid of computers on the same local area network. The rest of this paper is organized as follows. Section II presents our new 'Dynamic Job Dispersion Algorithm', its state machine, and a discussion of its messaging overhead. Section III describes the simulation setup, and Section IV explains the experimental tests along with the associated results. Section V concludes the paper.

II. THE DYNAMIC JOB DISPERSION ALGORITHM
We propose a distributed load-balancing algorithm that is dynamic, decentralized, and able to handle systems that are heterogeneous in terms of processor speed, architecture, and networking speed. The system allows individual nodes to leave and join the grid at any time and have jobs assigned to them as they become available. Our model is a grid-based system of nodes with software agents running on all nodes in the system. The nodes can vary in performance and architecture. Architectural differences are hidden by the use of interpreted languages or dynamic compilation. The amount of random access memory (RAM) in each node and the network performance of each node can vary. The nodes discover neighboring nodes through the control protocol, defined below.
Each node saves information about its neighbors, including the network bandwidth available between the local node and its neighbor in bytes per second, the neighbor's current CPU utilization, and its current I/O utilization. Knowing this, each node can choose when to send jobs to its neighbors. A job can be described as a task with a CPU-to-I/O ratio requirement. The CPU-to-I/O ratio characterizes the I/O-bound or CPU-bound nature of the job. If known, it represents the actual number of time units of I/O and CPU required to process the job; however, this information is not strictly required a priori. The size of the job, in bytes, represents the number of bytes that need to be transported across the network if the job has to be transferred to another node. Thus, the job size is a measure of the overhead involved in job migration from one node to another. The number of hops traversed so far is the number of nodes that have already refused to run this job. The I/O utilization is a measure of how often jobs are actively performing I/O or waiting on I/O requests. The algorithm is divided into two parts: the control protocol and the scheduling algorithm (i.e., the scheduler).
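As a concrete sketch, the per-job and per-neighbor state described above might be represented as follows. The field names are ours, chosen for illustration; the paper specifies only the quantities themselves.

```python
from dataclasses import dataclass

@dataclass
class Job:
    cpu_time: float   # time units of CPU required (may be an estimate)
    io_time: float    # time units of I/O required (may be an estimate)
    size: int         # size in bytes: the cost of migrating the job
    hops: int = 0     # number of nodes that have already refused the job

    @property
    def cpu_io_ratio(self) -> float:
        # Characterizes the job as CPU-bound (>1) or I/O-bound (<1).
        return self.cpu_time / self.io_time

@dataclass
class Neighbor:
    bandwidth: float  # available bandwidth to this neighbor, bytes/second
    cpu_util: float   # neighbor's last reported CPU utilization, 0..1
    io_util: float    # neighbor's last reported I/O utilization, 0..1

job = Job(cpu_time=5.0, io_time=2.5, size=100_000)
assert job.cpu_io_ratio == 2.0   # twice as much CPU as I/O: CPU-bound
```

A scheduler holding a `Neighbor` record per known node has everything Equation (1), below, needs to cost out a migration.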
A. Control Protocol
The control protocol obtains information about the neighboring nodes that will be used by the scheduler to decide where jobs should be run. The control algorithm avoids slowing down the scheduler or the processing of jobs. We assume that the network supports the delivery of Ethernet multicast messages and that all nodes can and will join multicast groups. In the event that multicasting services are not available, the list of nodes is preset by the administrator, and a node then uses a series of unicast messages to each node in the list instead of a single multicast message. All messages are carried over UDP/IP unless otherwise specified. Periodically, nodes send multicast messages containing the following node information: the current CPU utilization and the current I/O utilization, computed using exponential forgetting. When a node (N2) receives node information, it adds the sending node (N1) to its list of neighbors and stores the sender's node information. If N2 has not sent a multicast message within the preceding R units of time, it then sends a reply message to the multicast group. This message includes the replying node's information. On reception of this message, the original sender, N1, now knows the approximate latency of the network between itself and N2, by measuring the delay in the response. It adds this information to the node information it received from N2 within its reply packet. N1 then sends a final unicast reply back to N2. Upon reception of this message, N2 also knows the approximate latency of the network between N1 and N2. When other nodes receive N2's initial multicast reply of node information for N1, they treat it as a request message. Thus, if N3 has not sent a multicast message within the preceding R units of time, N3 will send a multicast response with its information. Upon receiving N3's response, N2 sends a unicast final reply so that both N3 and N2 know the latency of the network between them.
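The paper does not spell out the exponential-forgetting computation for the advertised utilizations. A conventional exponentially weighted moving average would look like the sketch below; the smoothing factor `alpha` is our assumption, not a value from the paper.

```python
def update_utilization(previous: float, sample: float, alpha: float = 0.2) -> float:
    """Exponential forgetting: recent samples dominate, old history decays.

    `alpha` is an assumed smoothing factor; the paper does not specify one.
    """
    return alpha * sample + (1.0 - alpha) * previous

# A node whose CPU was idle and then becomes fully busy converges upward:
util = 0.0
for _ in range(10):
    util = update_utilization(util, 1.0)
assert 0.85 < util < 0.95   # after 10 samples, 1 - 0.8**10 ≈ 0.893
```

The same update can serve for both the CPU and I/O utilizations that a node multicasts to its neighbors.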
The parameter R is used to control the rate of multicast messages. A node will not send more than one multicast message in the time period R. This prevents the multicast storm that would otherwise occur. The parameter T is the average time period between request messages within the grid. On average, a node sends request messages once in every time period T*k, where k is the number of neighbors that a node currently knows about from receiving replies or requests from them. Thus, on average, each node will receive a request from one other node in time period T. If no neighbors have been discovered, then the node sends request messages once in every time period T. As the number of nodes increases, the value of T can be increased so that the average rate of multicast messages within the system is decreased. When a node initiates a request message, it is in the 'requester' mode. If a node is merely responding to a request from another node, it is in the 'responder' mode. Table I lists the states that a node may enter and the mode in which each state may be entered. Table II lists the actions that may be taken by a node and the mode in which those actions are taken. The corresponding transition diagram is shown in Figure 1.
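A minimal sketch of this timer logic, with T, R, and k as defined above. The uniform jitter on the request interval is our assumption here (the simulation section later uses the interval [½T, 1½T] for the base case).

```python
import random

def next_request_interval(T: float, known_neighbors: int) -> float:
    """Draw the interval until this node's next request message.

    With k known neighbors a node requests once per T*k on average, so the
    grid as a whole sees roughly one request per period T. With no known
    neighbors, the node requests once per T.
    """
    k = max(1, known_neighbors)
    mean = T * k
    return random.uniform(0.5 * mean, 1.5 * mean)   # assumed jitter

class MulticastRateLimiter:
    """Allow at most one multicast per period R, preventing multicast storms."""
    def __init__(self, R: float):
        self.R = R
        self.last_sent = float("-inf")

    def may_send(self, now: float) -> bool:
        if now - self.last_sent >= self.R:
            self.last_sent = now
            return True
        return False

rl = MulticastRateLimiter(R=1.0)
assert rl.may_send(0.0) is True
assert rl.may_send(0.5) is False   # still within R of the last multicast
assert rl.may_send(1.5) is True
```

A node consults the rate limiter before every multicast, whether it is acting as requester or as responder.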
TABLE I
TYPES OF STATES

State            | Mode
-----------------|---------------------
quiet            | requester, responder
request sent     | requester
reply sent       | responder
reply received   | requester
request received | responder
done sent        | requester

TABLE II
ACTIONS FOR DIFFERENT STATE TYPES

Action                         | Mode
-------------------------------|----------
request timer                  | requester
receive request                | responder
receive local reply            | requester
receive remote responder reply | responder
receive done                   | responder

TABLE III
MESSAGES PER TIME T

Message            | Sent per Node | Total Sent | Received per Node | Total Received
-------------------|---------------|------------|-------------------|---------------
Multicast request  | 1/N           | 1          | (N-1)/N           | N-1
Multicast response | (N-1)/N       | N-1        | N-2+1/N           | (N-1)(N-1)
Unicast response   | (N-1)/N       | N-1        | (N-1)/N           | N-1
Total              | 2-1/N         | 2N-1       | N-1/N             | N^2-1

When N = 5:

Message            | Sent per Node | Total Sent | Received per Node | Total Received
-------------------|---------------|------------|-------------------|---------------
Multicast request  | 0.2           | 1          | 0.8               | 4
Multicast response | 0.8           | 4          | 3.2               | 16
Unicast response   | 0.8           | 4          | 0.8               | 4
Total              | 1.8           | 9          | 4.8               | 24

After one full exchange, node N1 knows every node's information; nodes N2 through NN know node N1's information; and nodes N2 through NN know each other's information except for link speed. An average node must be able to send 2 - 1/N packets and receive N - 1/N packets per round. For 5 nodes, an average node must be able to send 1.8 packets per T and receive 4.8 packets per T.
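The symbolic entries in Table III can be sanity-checked numerically; the sketch below evaluates the per-row formulas and confirms the column totals for N = 5.

```python
def messages_per_T(N: int):
    """Return Table III's rows for a grid of N nodes.

    Each row is (sent per node, total sent, received per node, total received)
    averaged over one period T.
    """
    rows = {
        "multicast request":  (1 / N,       1,     (N - 1) / N,   N - 1),
        "multicast response": ((N - 1) / N, N - 1, N - 2 + 1 / N, (N - 1) * (N - 1)),
        "unicast response":   ((N - 1) / N, N - 1, (N - 1) / N,   N - 1),
    }
    # Column-wise totals, matching Table III's "Total" row.
    totals = tuple(sum(col) for col in zip(*rows.values()))
    return rows, totals

rows, totals = messages_per_T(5)
expected = (2 - 1 / 5, 2 * 5 - 1, 5 - 1 / 5, 5**2 - 1)   # 1.8, 9, 4.8, 24
assert all(abs(a - b) < 1e-9 for a, b in zip(totals, expected))
```

Note that total messages sent grows linearly in N, while total messages received grows as N², since every multicast is received by all other nodes.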
Figure 1: State Transitions
Table III shows that the use of multicasting allows the sending and receiving of control messages to scale quite well (linearly). On average, a host sends approximately 2 messages per time interval T. Let N be the total number of nodes in the system at a given time. A node has to send a maximum of N messages when it is the requester and one message when it is the responder. The receiving side is more complicated. On average, a node receives N - 1/N packets per request interval. A requester must handle up to N - 1 messages at once. The responder can receive up to N messages, where one message is the initial request, N - 2 messages are responses from other nodes, and one message is the final 'done' message from the requester.

B. Scheduler
When a node comes online, an initial list of other nodes may be provided by the system administrator, or the list may be discovered through the control protocol. A mix of both approaches can also be accommodated. Nodes that are in the initial list are assumed to have a very fast response with no CPU or I/O usage. Jobs can be submitted to the system through any node. The node evaluates the cost of running the job on each available node, including itself. It knows the CPU and I/O utilization of any node that has responded to a request message or made requests to it.
The suitability of a node to execute a job is evaluated with a cost function fc, such that

fc = CJ * CN + IJ * IN + SJ / SN    (1)

where CJ is the CPU time required by the job, CN is the current CPU utilization of the candidate node, IJ is the I/O time required by the job, IN is the I/O utilization of the candidate node, SJ is the size of the job in bytes, and SN is the application bandwidth, in bytes per second, between the node at which the job arrives and the candidate node. SN is determined from control protocol response times and by timing the responses to job requests. The suitability of the local node to execute the job is also evaluated, by setting the size of the job to zero. The node with the lowest cost is sent the job for processing. If the local node has the lowest cost, the job is added to its job list for processing. If a remote node has the lowest cost, it is sent a message that includes the sending node's current CPU and I/O utilizations, the job description (including the updated number of hops), and the job data itself. On receipt of the job request, the remote node responds with its own CPU and I/O utilizations and performs its own cost computation. A limit is set on the number of hops to prevent a job from bouncing around the network forever. In this fashion, nodes learn about each other. If an initial list of nodes is provided, nodes will dynamically discover the properties of other nodes without having to rely on the control protocol. The algorithm is completely decentralized: all information gathering is done dynamically, without a special master node. Control protocol polling improves the system and makes the initial list unnecessary, but it is not required. The system accommodates various network setups. For example, a node could be used as a gateway between two separate networks. This node could move jobs judiciously between the networks based on its knowledge of the cost involved in migrating those jobs.
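The node-selection step built on Equation (1) can be sketched as follows. The function and variable names are illustrative; note that the local node is evaluated with the job size set to zero, so the SJ/SN transfer term vanishes.

```python
def cost(cj: float, cn: float, ij: float, io_n: float,
         sj: float, sn: float) -> float:
    """Eq. (1): fc = CJ*CN + IJ*IN + SJ/SN."""
    return cj * cn + ij * io_n + sj / sn

def choose_node(job_cpu: float, job_io: float, job_size: float,
                local_util: tuple, neighbors: dict) -> str:
    """Return the id of the lowest-cost node ('local' means this node).

    `local_util` is (cpu_util, io_util) of the local node;
    `neighbors` maps node id -> (cpu_util, io_util, bandwidth_bytes_per_s).
    """
    # Local evaluation: job size treated as zero, so no transfer cost.
    best_id = "local"
    best_cost = job_cpu * local_util[0] + job_io * local_util[1]
    for node_id, (cn, io_n, bw) in neighbors.items():
        c = cost(job_cpu, cn, job_io, io_n, job_size, bw)
        if c < best_cost:
            best_id, best_cost = node_id, c
    return best_id

# An idle remote node wins despite the transfer cost of a 100 kB job
# over a 10 MB/s path (0.01 cost units), while a busier remote loses:
neighbors = {"N2": (0.0, 0.0, 1e7), "N3": (0.9, 0.9, 1e7)}
assert choose_node(5.0, 5.0, 100_000, (0.8, 0.8), neighbors) == "N2"
```

Because the transfer term is a size divided by a bandwidth, a large job sent over a slow link can lose to a moderately loaded local node, which is exactly the trade-off the cost function is meant to capture.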
We could also force all nodes into becoming each others' neighbors by subscribing all of them to the same multicast group.
However, as the network increases in size, the time required to determine the node to which a job should be migrated also increases. Similar load-balancing can be achieved by using smaller groups in which each node is a member of more than one group. For example, node N1 may have nodes N2 through N4 as neighbors, while node N2 has nodes N3 through N5 as neighbors. This scheme allows us to pass jobs to any node while keeping the next-hop computation short.

III. SIMULATION EXPERIMENTS
We simulated our dynamic load dispersion algorithm using the ns2 network simulator [5]. To limit the generated request and response messages, we set the T and R parameters appropriately, as mentioned earlier. In particular, when a node comes online, it sends its first request message in the time interval [0, T]. Thereafter, the time interval between successive request messages is a uniformly distributed random number between ½T and 1½T. Job arrivals are Poisson: jobs arrive asynchronously at random nodes within the grid. They have a random size, and random CPU and I/O run times. The cost computation for scheduling accounts for the amount of CPU and I/O time a job would be allocated on a node while sharing resources with other jobs on that node. In all simulation experiments, the average job size is 100,000 bytes, with job sizes varied uniformly from 50,000 bytes to 150,000 bytes. These sizes represent the approximate average size of a binary executable on the Linux CentOS 4 system used for the simulations. Each node has a 100 megabit per second link to a central switch. All the experiments use a grid of 20 nodes. A job requires an average of 5 seconds of CPU time and 5 seconds of I/O time, with the CPU and I/O times for each job uniformly distributed between 2.5 seconds and 7.5 seconds. The CPU and I/O times required by a given job are not necessarily equal.
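The workload just described, and the average per-node load figures reported below, can be reproduced with a short sketch. The distributions are those stated in the text; the function and field names are ours.

```python
import random

def make_job(rng: random.Random) -> dict:
    """One simulated job: uniform size and uniform CPU/I/O times."""
    return {
        "size_bytes": rng.uniform(50_000, 150_000),   # mean 100,000 bytes
        "cpu_s": rng.uniform(2.5, 7.5),               # mean 5 s of CPU
        "io_s": rng.uniform(2.5, 7.5),                # mean 5 s of I/O
    }

def avg_load_per_node(interarrival_s: float, nodes: int = 20,
                      mean_work_s: float = 10.0) -> float:
    """Average per-node load: resources demanded over resources available.

    mean_work_s is the mean CPU + I/O time per job (5 s + 5 s = 10 s).
    """
    return mean_work_s / (interarrival_s * nodes)

# The four arrival rates used in the experiments:
assert round(avg_load_per_node(0.6), 2) == 0.83
assert round(avg_load_per_node(0.5), 2) == 1.0
assert round(avg_load_per_node(0.4), 2) == 1.25
assert round(avg_load_per_node(0.35), 2) == 1.43
```

This sketch covers the job-size and run-time distributions and the load arithmetic; the Poisson arrival process and the per-node resource sharing are handled by the ns2 simulation itself.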
The job arrival rate is varied in each experiment to simulate the effect of varying load on the system. We define the load on any node in the system as the ratio of the resources (in terms of CPU and I/O time) required by the jobs being processed to the resources actually available at that node. We perform simulation runs for low as well as high values of network latency. Network latency is the sum of all queuing and propagation delays along a path, but does not include the transmission delay. We consider a latency of 0.1 milliseconds (ms) to be a low value and 1 ms to be high; these numbers are reasonable approximations of latencies on a lightly loaded and a somewhat congested 100 megabit per second network, respectively. We simulated two versions of our load-balancing algorithm. In the first version, jobs are dispersed over a maximum of two hops from the job-originating node. In the second version, the dispersion is carried out over a maximum of five hops. The first experiment examines the performance of both versions of our algorithm when the network latency is low. The job arrival rate is 1 every 0.6 seconds. With 20 nodes over which to distribute the workload, this represents an average load per node of 0.83. Next, we increase the
average load per node to 1.0 by increasing the job arrival rate to 1 job every 0.5 seconds and record the results. As a third step, we overload the grid, with jobs arriving every 0.4 seconds, for an average load per node of 1.25. Finally, we create a very overloaded grid, with jobs arriving every 0.35 seconds, to yield an average load per node of 1.43. The two versions of our algorithm are compared against a local scheduling algorithm. In local scheduling, jobs originate at random nodes in the grid and are processed at the originating node itself. The results are plotted in Figures 2 and 4. The second experiment is similar to the first, except that the simulation is run under the assumption of high network latency. The results are plotted in Figures 3 and 5. Both experiments yield measurements of the number of jobs completed over the simulation run and the maximum number of jobs running at any single node, as a function of the load. The number of jobs completed over a simulation run represents the performance of the grid. The maximum number of jobs running on any node represents how unbalanced the grid is in terms of load distribution: a high maximum number of jobs at a given node is an indicator of ineffective load distribution within the grid. Consequently, the mean and the variance of the expected job completion delay increase, resulting in poor response times for jobs. The third experiment varies the load by keeping the job arrival rate at a steady 1 per second and varying the average job size itself, from 10 seconds of CPU and I/O to 28 seconds of CPU and I/O. The network latency is assumed to be low. The results are shown in Figure 6.

IV. SIMULATION RESULTS
Each data point in the graphs shown in Figures 2 through 5 is the average of 100 runs of 200 seconds each. The figures show how the two versions of our load-balancing algorithm compare with the local scheduling algorithm across a wide range of loads and network latencies.
Figures 2 and 3 demonstrate that as the load increases, both versions (2-hop and 5-hop) of our load-balancing algorithm outperform the local node selection algorithm, in spite of the extra cost incurred in 'shipping' jobs across a network. The number of hops allowed has very little effect on the performance of the load-balancing algorithm. We also see that regardless of the network latency, the overall performance of the grid can be improved by using load-balancing. Figures 4 and 5 show the average maximum number of jobs for any single node during the simulation across a wide range of loads, from underloaded to heavily overloaded. Across all ranges, load-balancing greatly improves the equitable distribution of load within the grid and thus decreases the variance of the job completion times and the expected response time. Furthermore, we observe that the maximum number of jobs at any single node is more sensitive to latency (Figures 4 and 5) than the overall performance of the grid is (Figures 2 and 3). Increasing the number of hops from 2 to 5 slightly decreases the maximum number of jobs for any single node but has little effect on overall grid performance.
[Figure 2: Average no. of jobs completed vs. Load (low network latency)]
[Figure 3: Average no. of jobs completed vs. Load (high network latency)]
[Figure 4: Maximum no. of jobs assigned to a node vs. Load (low network latency)]
[Figure 5: Maximum no. of jobs assigned to a node vs. Load (high network latency)]
We also note from Figures 2 through 5 that, when load-balancing, job dispersion over 5 hops is better than job dispersion over 2 hops, but not by much. Dispersing jobs over multiple hops incurs larger costs in terms of network latency and job transfer times. Figure 6 shows the effect of increasing load by increasing the job size while keeping the job arrival rate constant. For any method, the number of jobs completed over 200 seconds drops as we increase the size of each job, but load-balancing shows less of a decrease in jobs completed. The load-balancing algorithm makes a greater difference as job size increases than it does when the job arrival rate increases. At the highest load of 1.4, load-balancing with 5 hops completed 12.5% more jobs.

[Figure 6: Average no. of jobs completed vs. Load (increasing job size)]

V. CONCLUSION
In this paper we present our Dynamic Job Dispersion Algorithm, which distributes jobs arriving into a computing grid equitably over all nodes. We present preliminary simulation results which indicate that load-balancing can improve the performance of a grid regardless of the network latencies and the degree of loading. The algorithm allows the nodes to be used for other work while being engaged in the grid: if the non-grid-related work takes up resources, the algorithm will be less likely to use that node. The algorithm is scalable in the sense that a series of long-duration jobs will result in lower overhead for the algorithm itself. We found the algorithm to be somewhat sensitive to latency, especially with respect to the maximum number of jobs on any node. Although we did not simulate it here, the algorithm is flexible enough to be run over heterogeneous grids as well. In a heterogeneous grid, a node with fewer resources will appear busier running the same job than a node with greater resources; this allows our load-balancing algorithm to operate at a coarse-grained level and compensate appropriately.

REFERENCES
[1] R. Buyya, "Grid Computing Information Centre: FAQs," http://www.gridcomputing.com/gridfaq.html, Oct. 2005.
[2] J. P. Buzen and P. P. S. Chen, "Optimal load balancing in memory hierarchies," Information Processing '74, pp. 271-275, Amsterdam, North Holland, 1974.
[3] D. Ghose, H. Joong Kim, and T. Hoon Kim, "Adaptive Divisible Load Scheduling Strategies for Workstation Clusters with Unknown Network Resources," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 10, pp. 897-907, Oct. 2005.
[4] J. Liu, X. Jin, and Y. Wang, "Agent-Based Load Balancing on Homogeneous Minigrids: Macroscopic Modeling and Characterization," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 7, pp. 586-598, Jul. 2005.
[5] "NSNAM," http://nsnam.isi.edu/nsnam/
[6] R. Lüling and B. Monien, "A Dynamic Distributed Load Balancing Algorithm with Provable Good Performance," Proc. of the 5th ACM Symposium on Parallel Algorithms and Architectures (SPAA '93), pp. 164-173, 1993.
[7] J. Xu and K. Hwang, "Heuristic Methods for Dynamic Load Balancing in a Message-Passing Supercomputer," Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pp. 888-897, 1990.