A Fault-Tolerant Load-Balancing Protocol for Distributed Multiserver Queuing Systems

Alexander Kostin and Gurcu Oz
Department of Computer Engineering, Eastern Mediterranean University
Magusa, KKTC, via Mersin 10, Turkey
{Alexander.Kostin, Gurcu.oz}@emu.edu.tr

Abstract

A simple and efficient scheme to achieve fault tolerance in a new load-balancing protocol for a distributed multiserver queuing system is proposed. It is assumed that the distributed queuing system consists of a job producer and a number of independent servers, or workers, that compete for the produced jobs. All communications between the job producer and the workers in the underlying network are based on reliable multicast. The proposed scheme is empirically investigated on a cluster of computers in a LAN of Ethernet type.

1. Introduction

Load balancing is one of the active topics of research in distributed computing. In general, the goal of load balancing is to dynamically distribute a set of tasks, or jobs, among a number of processing nodes in a way that provides the highest performance of the underlying system and fairness to tasks. The most popular approaches to load balancing use some sort of coordinator that dynamically distributes tasks among a set of processing nodes [1], [2], [3]. These approaches exploit one of two basic schemes. The first is a "coordinator-initiated" scheme, in which the coordinator polls the processing nodes to collect information about their current load and uses it to assign tasks to selected nodes. The second is a "worker-initiated" scheme, in which each processing node requests a task from the coordinator every time it becomes idle. A serious drawback of both schemes is the single point of failure and the bottleneck created by the coordinator.

In our paper [4], a novel protocol for load balancing in distributed multiserver queuing systems was proposed. The protocol does not require any coordinator and is based on anonymous and reliable multicast communication in a network consisting of a job producer and a group of servers, or workers. The investigation of the protocol, carried out with a detailed simulation and a prototype system implemented on a cluster of computers in a LAN of Ethernet type, showed that, in both the simulation and the prototype system, the response time and the average number of jobs in the distributed multiserver queuing system are quite close in their behavior to the respective performance measures of an ideal centralized queuing system. With the proposed probabilistic assignment of jobs by workers, the communication complexity of the protocol, in terms of the average number of messages sent per executed job, can be kept quite low. Moreover, the anonymity of multicast communication considerably simplifies the implementation, since no worker in the distributed system needs to know the exact size of the group of involved workers.

This paper presents the results of a further study of the protocol described in [4]. The study is focused on the fault-tolerance capability of the protocol, which was not addressed in the initial, basic scheme. The proposed fault-tolerance solution takes into account possible crashes of workers, which can happen when a worker is idle or running a job. The proposed scheme is based on the use of an additional time-out by all non-crashed workers, with a reasonably estimated value and some random margin for each job.

The rest of the paper is organized as follows. In Section 2, an outline of the load-balancing protocol described in [4] is given. Section 3 describes the proposed fault-tolerant extension to that protocol. In Section 4, an implementation of the proposed fault-tolerant scheme in a prototype system on a cluster of computers in a LAN of Ethernet type is considered. Section 5 concludes the paper.

2. Basic Load-Balancing Protocol

This section outlines the load-balancing protocol described in detail and investigated with the use of simulation and a prototype implementation in [4] and [13]. For subsequent references, that protocol will be called the basic one. Its new version, which is the subject of this paper and which provides a fault-tolerance capability, will be called the extended protocol.


In this outline, we focus only on those aspects of the basic protocol that are essential for understanding its extended version.

The load-balancing protocol under study is intended for use in a distributed multiserver queuing system consisting of a job producer and a group of independent, autonomous worker nodes, or workers, connected to a network. The job producer represents a client which generates a random flow of jobs and multicasts them to all workers in the group. Each worker can be viewed as a server with the resources necessary to perform any job generated and multicast by the job producer. With this protocol, each worker maintains a replica of the common waiting FIFO queue to store all multicast jobs. The grain of work for each worker is a complete job. All messages of the protocol, as well as jobs, are assumed to be transmitted in a reliable multicast mode. Different approaches to the implementation of reliable multicast can be found in the literature [5], [6].

The protocol is "worker-initiated" and contention-based. This means that idle workers compete for new jobs which are generated and multicast by the job producer. Even though each multicast job enters the waiting queue of each worker, a worker may take a job from its waiting queue for execution only after winning the competition for it with other idle workers. This is done as a result of exchanging protocol messages in the underlying network.

During the competition for jobs, conflicts between workers are quite possible. A conflict is a situation in which two or more workers intend to perform the same job J, and each of them multicasts an assignment, or "intention", message A(J) containing the identifier of job J. From the point of view of each worker, there are two types of conflicts: local and remote. A conflict is considered "local" if the worker itself competes for the same job with other workers in the group. A conflict is "remote" if the worker only detects it as a conflict between some other workers but does not compete for this job itself. If worker W, which sent an assignment message with the identifier of job J, receives an assignment message from at least one other worker with the identifier of the same job J before its conflict window T1 expires, it concludes that there is a local conflict related to job J. In this case, each worker involved in the conflict de-assigns job J in its queue and starts a random delay T3, after which it may again compete for jobs in its queue.

A simulation model and a prototype implementation of the basic load-balancing protocol in [4] were designed with the assumption that the whole system is an M/M/n queuing system, with n probabilistically identical workers. This assumption was made only to compare the behavior of the distributed multiserver queuing system, with the proposed load-balancing protocol, to that of a multiserver queuing system M/M/n with an ideal, centralized scheduler and well-known analytical expressions for its performance measures [7]. However, the proposed protocol is capable of functioning properly with arbitrary probability distributions of the job interarrival time and the service time, which can be the same or different for different workers.
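
As an illustration of the competition step, the sketch below shows how an idle worker might announce an assignment message A(J), watch for conflicting assignments during T1, and back off for a random delay T3 on a local conflict. This is only a simplified sketch, not the implementation of [4]; the multicast helpers, function names and time-out values are hypothetical.

```cpp
#include <chrono>
#include <random>
#include <string>
#include <thread>

// A(J): multicast "intention" message carrying the identifier of job J.
struct AssignMsg { std::string jobId; };

// Placeholders for the reliable-multicast layer assumed by the protocol.
bool multicast_send(const AssignMsg&) { return true; }
bool conflicting_assignment_seen(const std::string&, std::chrono::milliseconds) {
    // Would listen for A(J) messages from other workers during the window T1.
    return false;
}

// Try to win the competition for job `jobId`; returns true if this worker
// may take the job from its queue replica and start executing it.
bool try_to_assign(const std::string& jobId) {
    using namespace std::chrono;
    const milliseconds T1{50};                    // conflict window (illustrative value)

    multicast_send(AssignMsg{jobId});             // announce the intention A(J)

    if (!conflicting_assignment_seen(jobId, T1))  // no other A(J) for the same job within T1
        return true;                              // competition won

    // Local conflict: de-assign the job and back off for a random delay T3
    // before competing again for jobs in the queue.
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> T3(10, 100);   // back-off range in ms, illustrative
    std::this_thread::sleep_for(milliseconds(T3(rng)));
    return false;
}
```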

3. Extended Load-Balancing Protocol with Fault Tolerance

The basic load-balancing protocol proposed in [4] and outlined in the previous section does not provide any fault tolerance. However, this capability is highly important, since any real-world system is susceptible to different kinds of faults. Some of these faults can make the system completely impractical if they are not handled properly.

In general, the problem of fault tolerance is quite challenging, especially for distributed systems, which can adhere to different modes of behavior and use different mechanisms of failure detection. Some surveys of the problem can be found in the literature [8], [9], [10]. Its detailed analysis is beyond the scope of this paper. It should be remarked only that the results of much research in the area show that there is no universal solution to this problem; a concrete solution must be sought with the particular system and the types of failures to be handled in it taken into account.

The extension of the load-balancing protocol under consideration is aimed first of all at handling failures caused by crashes of workers in the queuing system. It can be shown that, without this capability, the crash of a worker will result not only in degradation of the system's performance but also in a complete failure of the system.

In the distributed queuing system under study, a worker can crash in any of its states. These states are idling, negotiating for a job, and running a job. The effect of a crash is different for these states.

If a worker crashes when it is idle, the only effect is a subsequent degradation of the system performance measures, such as the mean response time per job and the mean number of jobs in the system. After recovering, the worker will compete only for new jobs. Since the previously generated jobs in the queues of all other workers will "die out" with time as a result of their execution by the remaining workers, after some transient period the queue of the recovered worker will acquire the same contents as the queues of the other workers, and the system will return to its pre-crash performance level.
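
For concreteness, the worker states and the per-job bookkeeping referred to above could be represented roughly as follows. This is an illustrative data layout only, with hypothetical names, not the structures of the actual prototype.

```cpp
#include <cstdint>
#include <deque>
#include <string>

// States in which a worker crash may occur.
enum class WorkerState { Idle, Negotiating, Running };

// State of a job in a worker's replica of the common queue.
enum class JobState {
    Waiting,        // multicast by the producer, not yet assigned
    BeingExecuted   // some worker won the competition and is running the job
};

struct JobEntry {
    std::string id;                  // job identifier carried in A(J) and F(J) messages
    JobState    state = JobState::Waiting;
    uint64_t    estimatedRunMs = 0;  // duration announced in A(J) (extended protocol)
};

// Every worker keeps its own replica of the FIFO queue of outstanding jobs.
using JobQueueReplica = std::deque<JobEntry>;
```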


[Figure 1 (timing diagram): workers W1, W2 and W3 exchange assignment messages A(Jj) and A(Jk) and the finish message F(Ji) over time instants t1 to t36, governed by time-outs T1, T2, T3 and T4; worker W3 crashes at t20 while running job Jj and later restarts, and jobs Jj and Jk are eventually started.]

Fig. 1. A protocol scenario with three workers, one of which crashes during execution of a job.


The case when a worker crashes during execution of a job is much more complex. According to the protocol, when a worker starts a job, this job is marked as "is being executed" in the queues of all other workers, without any time restriction. Without the fault-tolerance capability, the crash of the worker in this case, in addition to degrading the system's performance, has two serious negative effects. Firstly, due to the crash, the job which was being executed by the worker at the moment of the crash will never be finished. Secondly, the crashed worker will not be able to multicast, to all other workers, a notifying message about the completion of this job. Therefore, this job will remain in the queues of all other workers with the flag "is being executed", and no live worker may assign it for repeated execution, since the live workers do not know that the worker that was busy with this job has crashed. After recovering, the worker has an empty queue, so the marked and uncompleted job will stay forever in the queues of all other workers.

Finally, the case when a worker crashes during a negotiation for a job can be considered intermediate between the two previous cases. There are two possibilities in this case: the first involves a conflict, and the second is conflict-free. If, during the negotiation, a conflict arises between this worker W and some other workers, and worker W crashes before the conflict is resolved, this logically means that worker W simply quits the system until it recovers. The remaining workers continue the competition for the job. Thus, logically, this situation is similar to the first case, when a worker crashes in its idle state.

The proposed handling of the most challenging case, a crash of a worker during execution of a job, is as follows. Every time a worker W wants to choose some job J for execution, it multicasts, in its assignment message, not only the identifier of job J (as is done in the basic protocol) but also the estimated duration needed to perform this job. If this time is not known to W, its best estimate, or some upper bound on it, can be used instead. If worker W wins the competition for job J, all other workers not only mark job J as "is being executed" in their queues but also start a separate time-out interval T4(J) related to job J, with T4(J) = t(J) + δ, where t(J) is the estimated time to execute job J by worker W and δ is a small random margin. Now, with the use of time-out T4, the crash of worker W is handled in the following way. In the normal mode, worker W will finish job J and multicast a notification message F(J) to all other workers, which will then remove job J from their queues before time-out T4(J) elapses. However, if worker W crashes during execution of job J, the notification message will not be sent, so time-out T4 will elapse in the other, live workers. Because of the additional random margin δ in T4, the moments at which T4 elapses are slightly different in different workers. Each idle worker in which T4(J) elapses will compete for job J with other workers, within a "conflict window" T1, in the same way as for any new job.

Since, in general, workers know only an approximate value of T4, it is possible that worker W did not crash but T4 elapsed in some other worker prematurely. That worker will then immediately multicast an assignment message A(J), with the intention to execute job J. In this case, however, the alarm is false: worker W is still running job J and will respond with a multicast message E(J), informing all other workers that job J is still being executed. This message also includes the estimated remaining time to execute job J. Having received this message, all other workers restart their time-out T4(J) with the modified duration. If, on the other hand, worker W has really crashed, all idle workers with an elapsed time-out T4(J) will send an assignment message A(J) and start competing for the interrupted job J in the usual way, as for any other job, so that sooner or later one of the workers wins the competition and starts J anew.

There is an important aspect related to a job that is started anew after the crash of a worker. This is not a problem if the effect of the interrupted job is cancelled by the crash, or if the job is idempotent, i.e. if it can be performed repeatedly by different workers with the same effect as if it had been executed only once [11]. In general, however, this is not the case. A possible solution to this problem is to use some rollback scheme [12].

Fig. 1 illustrates one possible scenario of the protocol with three workers W1, W2, and W3. Initially, workers W1 and W3 are idle and worker W2 is running job Ji. At time t1 a new job Jj arrives at each worker, but only workers W1 and W3 start to compete for it, each sending an assignment message A(Jj). The conflict between W1 and W3 is resolved by the use of time-out T1 and random delay T3. Even though worker W2 does not participate in the competition for job Jj, it starts its own time-out T2 after receiving the first message A(Jj) at t3 and, after receiving the second message A(Jj) during delay T2, detects a remote conflict related to job Jj. After delay T3 elapses at time t10, worker W3 repeats its attempt to assign job Jj for execution, wins the competition and, at time t15, starts this job. At time t16 worker W1, which already knows that job Jj is being executed, starts its time-out T4. Note that, at time t14, worker W2 finished job Ji and sent the notifying message F(Ji) to all other workers. This worker also
knows that job Jj is under execution, and it starts its time-out T4 at t17. At time t20 worker W3 crashes. After time-out T4 elapses at t21, worker W1 sends an assignment message A(Jj) related to the interrupted job Jj and starts this job at time t25. Note that time-out T4 in worker W2 was interrupted at time t22 by the message A(Jj) received from worker W1. The crashed worker restarts at t24 and, at time t29, after the arrival of a new job Jk, begins to compete with worker W1 for this job.

An analysis of time-outs T1, T2 and T3 and guidelines for the choice of their values, together with their sensitivity analysis, are given in [4].
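
The role of time-out T4 can be summarized in a small sketch. The fragment below only illustrates the rule T4(J) = t(J) + δ and the reactions to F(J), E(J) and an elapsed T4; the class name and the margin range are hypothetical, and message handling itself is left out.

```cpp
#include <chrono>
#include <random>

using Clock = std::chrono::steady_clock;

// Illustrative watchdog kept by a worker for a job that another worker is
// executing. On F(J) the job is simply removed; on E(J) the time-out is
// restarted with the remaining time; if T4 elapses without either message,
// the watching worker suspects a crash and competes for the job again.
struct Watchdog {
    Clock::time_point deadline;

    // Start T4(J) = t(J) + delta, where delta is a small random margin.
    void start(std::chrono::milliseconds estimatedRun) {
        static thread_local std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<int> margin(20, 80);   // delta range, illustrative
        deadline = Clock::now() + estimatedRun
                 + std::chrono::milliseconds(margin(rng));
    }

    // Called when E(J) reports that the job is still running, with a new estimate.
    void restart(std::chrono::milliseconds remaining) { start(remaining); }

    // True when T4(J) has elapsed without F(J) or E(J): suspect a crash and
    // multicast A(J) to compete for the interrupted job as for any new job.
    bool suspectCrash() const { return Clock::now() >= deadline; }
};
```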

4. Implementation and Performance Study of a Prototype System with the Extended Load-Balancing Protocol

To investigate the extended load-balancing protocol in a real-world environment, a prototype distributed queuing system M/M/n was designed and implemented on a group of Pentium-based PCs connected to a 10 Mbps Ethernet LAN switch. The LAN used for the prototype system is part of a university campus network. To evaluate the protocol performance in a realistic environment, the prototype system was run concurrently with other applications in the LAN.

In the prototype system, the job producer and each worker ran on a separate computer under Windows 2000. Each worker was implemented as a multithreaded C++ program. To multicast messages, the socket mechanism of interprocess communication was used, with the UDP transport protocol, which was found to be quite reliable in the LAN environment. Job interarrival times, service times, and time-outs were simulated with the Win32 API functions Sleep() and WaitForSingleObject(). To control the workers' load, the mean execution time of jobs by each worker, Te, was fixed at 2000 ms, and the mean interarrival time of jobs at the job producer, Tg, was varied to change the per-worker load U in the range from 0.1 to 0.9. Thus, for a desired load U, the corresponding value of Tg was calculated from the relation U = Te / (nTg), where n is the number of identical workers.
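
For example, solving the relation U = Te / (nTg) for Tg gives Tg = Te / (nU); with Te = 2000 ms, n = 5 and U = 0.5, this yields Tg = 800 ms. The short check below (illustrative only, not part of the prototype) tabulates Tg for several target loads.

```cpp
#include <cstdio>

int main() {
    const double Te = 2000.0;                    // mean job execution time, ms
    const int    n  = 5;                         // number of identical workers
    for (double U : {0.1, 0.3, 0.5, 0.7, 0.9}) { // target per-worker loads
        double Tg = Te / (n * U);                // required mean interarrival time, ms
        std::printf("U = %.1f  ->  Tg = %.0f ms\n", U, Tg);
    }
    return 0;
}
```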

[Figure 2 (plot): sliding average number of jobs in the system (y-axis, 0 to 12) versus the number of jobs processed (x-axis, 0 to 1400), for a run with no crash and a run with a crash of two workers; the crash intervals of workers W1 and W2 are marked.]

Fig. 2. Average number of jobs in the system vs. number of jobs processed by a worker, in the prototype-system experiments, with five workers, when two workers crash and then recover.

Crashes of workers were simulated in a straightforward way by switching off the corresponding computer. This was done during the interval in which the worker to be crashed was running a job.

To see the effect of worker crashes, the average number of jobs in the system was used as the performance measure. This measure was calculated in a sliding window consisting of 100 executed jobs.
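
Such a sliding average over the last 100 completed jobs can be maintained with a simple running window, as in the illustrative sketch below; this is not the instrumentation code of the prototype, and the class name and window parameter are hypothetical.

```cpp
#include <cstddef>
#include <deque>

// Sliding-window average of an observed quantity, e.g. the number of jobs
// in the system sampled at each job completion, over the last `window` jobs.
class SlidingAverage {
public:
    explicit SlidingAverage(std::size_t window = 100) : window_(window) {}

    void addSample(double jobsInSystem) {
        samples_.push_back(jobsInSystem);
        sum_ += jobsInSystem;
        if (samples_.size() > window_) {    // drop the oldest sample
            sum_ -= samples_.front();
            samples_.pop_front();
        }
    }

    double value() const {
        return samples_.empty() ? 0.0 : sum_ / samples_.size();
    }

private:
    std::size_t window_;
    std::deque<double> samples_;
    double sum_ = 0.0;
};
```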


In the distributed prototype system, a number of experiments were carried out, with a varying number of workers and different worker loads in the range (0.1, 0.9). A detailed description of these experiments and their results is presented in [13]. Fig. 2 shows the behavior of the sliding average number of jobs in the distributed queuing system M/M/n, for two experiments with n = 5 workers under an offered load U = 0.5. The first experiment corresponds to a system run without crashes of workers. In this case, the sliding average number of jobs in the system varies slightly around its exact theoretical value of 2.63 jobs. In the second experiment, two workers crashed during overlapping intervals. The first worker crashed during processing of the 380th job in the system and restarted after 860 jobs had been executed. The second worker crashed during execution of the 540th job and restarted after 1020 jobs had been performed.

As could be expected, there is a considerable increase in the average number of jobs in the system when workers crash. This is especially noticeable during the overlapping crash interval of the two workers. However, the protocol continues to operate properly during the crash interval. After both crashed workers restart, the system gradually recovers to its pre-crash behavior. It is worth noting that each recovered worker starts with an empty job queue, but gradually its queue acquires the same contents as the job queues of all other active workers. Thus, the protocol does not require any special initialization of a worker when it restarts after recovering from a crash.

Experiments were also carried out with a crash of the job producer. In this simple case, the workers sooner or later execute all available jobs in their queues and become idle, waiting for new jobs from the job producer.
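
The reference value of 2.63 jobs is the mean number of jobs in an M/M/n system with n = 5 and per-worker load U = 0.5, obtained from the standard Erlang-C results (see, e.g., [7]). The illustrative snippet below reproduces this value; it is only a check of the theoretical baseline, not part of the prototype.

```cpp
// Mean number of jobs L in an M/M/n queue via the Erlang-C formula:
// for offered load a = n*U,
//   P_wait = (a^n / (n!(1-U))) / ( sum_{k=0}^{n-1} a^k/k! + a^n / (n!(1-U)) )
//   L      = a + P_wait * U / (1 - U)
#include <cstdio>

int main() {
    const int    n = 5;     // number of workers (servers)
    const double U = 0.5;   // per-worker load
    const double a = n * U; // offered load in Erlang

    double term = 1.0, sum = 1.0;           // a^k / k!, accumulated for k = 0..n-1
    for (int k = 1; k < n; ++k) {
        term *= a / k;
        sum += term;
    }
    double tail  = term * a / n / (1.0 - U); // a^n / (n! * (1 - U))
    double pWait = tail / (sum + tail);      // Erlang-C probability of waiting
    double L     = a + pWait * U / (1.0 - U);// mean number of jobs in the system

    std::printf("L = %.2f jobs\n", L);       // prints approximately 2.63
    return 0;
}
```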

5. Conclusion

An extended version of the basic load-balancing protocol for a distributed multiserver queuing system described in [4] is proposed, with a fault-tolerance capability with respect to crashes of servers, or workers. The protocol was implemented in a prototype system running in a real-life network environment on a group of computers, and its performance was investigated.

We did not address, in this paper, other types of failures which can happen in a distributed system, such as network partitions or Byzantine failures. Even though a possible approach to handling non-idempotent jobs was mentioned, no detailed scheme to implement this approach has been elaborated. These aspects deserve a separate investigation and constitute an interesting topic for future study.

6. References

[1] C.N. Nicolaou and L. Richter, "Special Issue on Load Balancing in Distributed Systems – Introduction," Information Sciences, Vol. 97, No. 1–2, 1997, pp. 1–3.
[2] Y. Azar, B. Kalyanasundaram, S. Plotkin, K.R. Pruhs, and O. Waarts, "Online Load Balancing of Temporary Tasks," Journal of Algorithms, Vol. 22, No. 1, 1997, pp. 93–110.
[3] X.T. Deng, H.N. Liu, J.S. Long, and B. Xiao, "Competitive Analysis of Network Load Balancing," Journal of Parallel and Distributed Computing, Vol. 40, No. 2, 1997, pp. 162–172.
[4] A.E. Kostin, I. Aybay, and G. Oz, "A Randomized Contention-Based Load-Balancing Protocol for a Distributed Multiserver Queuing System," IEEE Transactions on Parallel and Distributed Systems, Vol. 11, No. 12, 2000, pp. 1252–1273.
[5] H. Garcia-Molina and A. Spauster, "Ordered and Reliable Multicast Communication," ACM Transactions on Computer Systems, Vol. 9, No. 3, 1991, pp. 242–271.
[6] S. Paul, K.K. Sabnani, J.C.H. Lin, and S. Bhattacharyya, "Reliable Multicast Transport Protocol," IEEE Journal on Selected Areas in Communications, Vol. 15, No. 3, 1997, pp. 407–421.
[7] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, John Wiley & Sons, 1991.
[8] A. Agarwal and J.W. Atwood, "A Unified Approach to Fault-Tolerance in Communication Protocols Based on Recovery Procedures," IEEE/ACM Transactions on Networking, Vol. 4, No. 5, 1996, pp. 785–795.
[9] F.C. Gartner, "Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments," ACM Computing Surveys, Vol. 31, No. 1, 1999, pp. 1–26.
[10] R.A. Bazzi and G. Neiger, "Simplifying Fault-Tolerance – Providing the Abstraction of Crash Failures," Journal of the ACM, Vol. 48, No. 3, 2001, pp. 499–554.
[11] G. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems: Concepts and Design, 3rd ed., Addison-Wesley, 2001.
[12] B. Janssens and W.K. Fuchs, "Ensuring Correct Rollback Recovery in Distributed Shared Memory Systems," Journal of Parallel and Distributed Computing, Vol. 29, No. 2, 1995, pp. 211–218.
[13] G. Oz, Organization and Assessment of Distributed Load-Balancing on a Cluster of Computers in a Local Area Network, PhD Thesis, Eastern Mediterranean University, Dept. of Computer Engineering, 2001.

