
Maximizing Service Reliability in Distributed Computing Systems with Random Failures: Theory and Implementation

Sagar Dhakal, Associate Member, IEEE, Jorge E. Pezoa, Student Member, IEEE, and Majeed M. Hayat, Senior Member, IEEE

Abstract—Service reliability in a heterogeneous distributed computing system (DCS) can be assessed by means of the probability of successfully serving all the tasks queued at the server nodes. Moreover, randomness in the delays imposed by the communication medium adds to the uncertainty in the information shared among these failure-prone nodes, which complicates the dynamics of the DCS. A novel approach based on stochastic regeneration is used to obtain recurrence equations that characterize the service reliability provided by the DCS. The theory is applied to develop load-balancing policies that maximize the service reliability of a two-node DCS. This solution is then extended to obtain a computationally efficient algorithm that maximizes the service reliability of a multi-node system. We have tested the validity of our theoretical model using both simulations and experimental data collected from a small-scale DCS testbed.

Index Terms—renewal theory, queuing theory, reliability, distributed computing, load balancing


1 INTRODUCTION

• S. Dhakal is currently with Nortel Networks Inc., Richardson, TX, 75080; E-Mail: [email protected].
• M. M. Hayat and J. E. Pezoa are with the Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA; E-Mail: {hayat,jpezoa}@ece.unm.edu.
• M. M. Hayat is also with the Center for High Technology Materials, University of New Mexico, Albuquerque, NM, USA.

A distributed computing system (DCS) allows its users to process large, time-consuming workloads in a cooperative fashion. To achieve this goal, each workload has to be divided into smaller units, called tasks, redistributed to appropriate computational elements, and subsequently processed independently at any computational element (or node) in the DCS. These tasks have to be intelligently allocated to the nodes in order to use the resources available in the system efficiently. Such task allocation is referred to in the literature as load balancing (LB). LB in distributed computing (DC) is of great importance because it is well known that the performance of a given DCS strongly depends upon the distribution of the tasks in the system [1]. Furthermore, the LB problem belongs to a more general class of problems in resource allocation. These problems appear not only in DC, but also in routing in wireless networks, telecommunications, data replication in hard-drive arrays, and other problems in computer science and operations research [2]–[7]. Issues of LB in DC have been studied considering several performance metrics, such as the average response time of an entire workload [1], [8], the probability of successfully serving a workload (PSSW) [9]–[13], the PSSW within a given amount of time [14], the average

queue-length of a node [15], [16], and the total sum of communication and service times [5], [17]. In addition, the problem has been studied under static and dynamic scenarios. In static LB, a centralized entity allocates the tasks offline, i.e., tasks are allocated prior to their execution in the DCS [10], [11], [15], [18], while in dynamic load balancing (DLB), tasks are queued at the nodes and LB is triggered online whenever there is imbalance in the DCS [8], [17], [19], [20]. The problem is particularly challenging in the dynamic case because information has to be exchanged among the nodes in order to estimate the state of the DCS and detect whether or not the DCS is imbalanced. In addition, if an LB action is needed, then the same information is utilized to calculate both the appropriate number of tasks that needs to be reallocated to other nodes and the appropriate set of nodes receiving the tasks. However, unavoidable speed and bandwidth limitations in practical communication networks introduce uncertainty in the estimate of the DCS state, and consequently, uncertainty in the ability of a DCS to detect imbalances. Moreover, such limitations also introduce random fluctuations in the redistribution times of tasks. In most of the existing literature, the transfer time per task is either ignored or considered deterministic [16]–[18], [21]. Clearly, this is not the case for DCSs comprising slow and/or time-varying links, such as wireless channels. The dynamics of the DCS become further complicated in volatile or harsh environments in which nodes are prone to fail. In such cases, messages have to be broadcasted to the working nodes in order to detect and isolate the faulty nodes. Once again, due to network communication delays, information available to each node about


the number of functional nodes in the DCS may not be current; as such, DLB policies, as well as methods for retrieving tasks from the queues of faulty nodes, must be analyzed employing a statistical framework. When LB is employed to reduce the effect of node failures on the execution of a workload, the objective is to maximize the PSSW while the response time of the workload is simultaneously minimized. To date, existing analytical solutions to this problem have been based upon multi-objective optimization approaches. Some approaches have assumed deterministic communication delays [22]–[27] while introducing task and/or hardware redundancy to compensate for the delays [28], [29]. Other solutions either exploit a priori information on the network configuration [30] or provide computationally fast solutions by using heuristic algorithms such as genetic algorithms [31] and simulated annealing [32], [33]. Most relevant to this paper are the recent works by Dai et al. [10], [11], in which the authors solve the static LB problem by using a centralized entity that allocates tasks in the DCS in order to maximize the PSSW. In these works, the authors have considered random communication delays as well as random server failures. Additionally, in an earlier work we studied the effect of node failure and recovery on the average response time of a workload executed by a two-node DCS [9]. In this paper we consider the problem of LB for maximizing the service reliability of a DCS. Unlike Dai et al., we address a DLB problem and propose an online, decentralized solution. In our study we consider a heterogeneous DCS, where the server nodes provide random task service-times and can fail permanently at any random time. Moreover, it is assumed that the communication network imposes random, non-negligible transfer times. A novel theory based on stochastic regeneration is developed to rigorously analyze the complex dynamics of the DCS.
The theory allows us to obtain a set of recurrence equations characterizing the PSSW for the workload assigned to the DCS. We then use the recurrence equations to systematically devise LB policies that maximize the PSSW. Our study is motivated by the need to assess and maximize reliability in scenarios such as DC in wireless sensor networks and network resilience in harsh environments, like those resulting from stress inflicted by weapons of mass destruction. This paper is organized as follows. In Section 2 we build the regeneration-based stochastic theory for analyzing the performance of DCSs. In Section 3 we apply the theory to devise LB strategies that maximize the service reliability of a DCS. Their performance is tested both theoretically and experimentally, and our conclusions are given in Section 4.

2 THEORY

2.1 Problem statement

The problem addressed in this paper is concerned with maximizing the probability of successfully serving a set

of M tasks, comprising a workload (or application), which are to be processed in parallel by a DCS whose server nodes can permanently fail at any random time. The DCS is composed of n nodes connected over a network of arbitrary topology. Let m_j ≥ 0 be an integer representing the number of tasks queued at the jth node at time t = 0, with \sum_{j=1}^{n} m_j = M. The service time of each task as well as the failure times of all functional nodes are assumed to be random. We further assume that there is no future arrival of external tasks (i.e., tasks that are not present in the system at t = 0) to the DCS. In addition, all the nodes are assumed to be functioning at t = 0. At that time the queue-length information (QI) is broadcasted by each node to all other nodes. The QI takes a random amount of time to reach the destination nodes. In order to maximize the service reliability of the workload, LB is performed at time t_b ≥ 0 so that each functional node, say the jth node, transfers a positive amount, L_{jk}, of tasks to each kth node (j ≠ k) that is functioning according to the knowledge of the jth node. Naturally, these task exchanges over the network take random transfer times. For analytical tractability, the following assumptions are imposed on the random times that characterize the DCS generated at t = 0.

Assumption A1 (Exponential distribution of the random times): For any j ≠ k, the following random variables are exponentially distributed: (i) W_{ki}: the service time of the ith task at the kth node (with rate λ_{d_k}); (ii) Y_k: the failure time of the kth node (with rate λ_{f_k}); (iii) X_{jk}: the transfer time of the QI sent from the jth to the kth node (with rate λ_{jk}); (iv) X^F_{jk}: the transfer time of the failure-notice (FN) sent from the jth to the kth node (with rate λ^F_{jk}); and (v) Z_{ik}: the transfer time of the ith group of tasks sent to the kth node (with rate λ̃_{i,k}).

Assumptions on the exponential distribution of the service and failure times are commonly adopted in the literature [1], [10], [15], [32]. Regarding the transfer times, our assumptions are justified by our prior work [8], [9] and by the empirical data obtained from the experiments conducted over the DC architecture discussed in Section 3. In addition, we have assumed that the mean transfer time of the ith group of tasks being transferred from the jth to the kth node follows the first-order approximation λ̃^{-1}_{i,k} = a_{jk} L^i_{jk} + b_{jk}, where a_{jk} and b_{jk} are positive constants (in seconds per task and seconds, respectively) that depend upon the communication channel connecting the jth and the kth nodes, and L^i_{jk} is the number of tasks in the ith group. This first-order approximation captures the linear dependence of the mean transfer time on: (i) the number of tasks to be transferred; (ii) the end-to-end transmission time per task, through the parameter a_{jk}, which is related to the bandwidth; and (iii) the combined effects of the absolute minimum end-to-end propagation time and any delays resulting from queuing (due to congestion), which can be represented by a single parameter, b_{jk}.
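As a concrete illustration, the first-order transfer-time model can be coded directly. The function names below are our own illustrative choices, not code from the paper:

```python
def mean_transfer_time(a_jk, b_jk, num_tasks):
    """Mean transfer time of a group of num_tasks tasks sent from node j
    to node k: 1/rate = a_jk * L + b_jk, where a_jk (seconds per task)
    reflects the link bandwidth and b_jk (seconds) lumps together the
    propagation time and queuing delays."""
    return a_jk * num_tasks + b_jk

def transfer_rate(a_jk, b_jk, num_tasks):
    """Rate parameter of the (assumed exponential) group transfer time."""
    return 1.0 / mean_transfer_time(a_jk, b_jk, num_tasks)
```

For example, on a link with a_jk = 0.1 s/task and b_jk = 0.5 s, a 10-task group has a mean transfer time of 1.5 s.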


Assumption A2 (Independence of the random times): All the random variables listed in Assumption A1 are mutually independent.

Assumption A3 (Backup nodes): Each node in the DCS is equipped with a backup system that performs the following duties in the event of node failure: (i) If a node becomes faulty, its backup system broadcasts an FN, which takes a random time to reach the destination nodes. (ii) If a node becomes faulty, its backup system evenly distributes all unserved tasks among the functional nodes. (iii) Upon the reception of tasks by a faulty node (due to task reallocation performed by other nodes before they learn of the failed state of the faulty node), the backup system evenly distributes those tasks back to the functional nodes.

Clearly, an even redistribution of tasks upon failure is not the best approach the backup system can take, especially in heterogeneous DCSs. However, in order to use simple and inexpensive backup equipment, we have assumed that the backup nodes have very limited capabilities, so the duties performed by them must be kept simple.

Convention C1 (Degenerate cases): The following random times are set to infinity: (i) the service time at any node with no task; (ii) the service time at a faulty node; (iii) the failure time of a faulty node; (iv) the QI arrival time when there is no QI in transit; (v) the FN arrival time when there is no FN in transit; and (vi) the task arrival time when there is no group of tasks in transit.

2.2 Methods for allocating loads

First, we determine the total amount of tasks to be reallocated. At time t_b ≥ 0, when the nodes simultaneously execute LB, a node, say j, computes its excess load by comparing its local load to the average load of the system. Let Q_j(t_b) be the number of tasks queued at the jth node at time t_b, and let Q̂_{l,j}(t_b) be the estimate of the number of tasks queued at the lth node as perceived by the jth node at time t_b. Here we assume that Q̂_{l,j}(t_b) = m_l if the QI has been received by node j
and Q̂_{l,j}(t_b) = 0 otherwise. The excess load at time t_b, L^{ex}_j(t_b), which is random, is defined by

L^{ex}_j(t_b) = Q_j(t_b) − (Λ_j / \sum_{l=1}^{n} Λ_l) \sum_{l=1}^{n} Q̂_{l,j}(t_b),    (1)

where the Λ_j's are parameters that can be defined in several ways in order to establish different balancing criteria. For example, if the Λ_j's are associated with the processing speeds, namely Λ_j = λ_{d_j}, then the imbalance in the DCS is determined by the relative computing powers of the nodes. Alternatively, if the Λ_j's are associated with the reliability of the nodes, namely Λ_j = λ^{-1}_{f_j}, then the reliability of the nodes determines the amount of imbalance. Yet another option is to define the Λ_j's so that we simultaneously transfer fewer tasks to the less-reliable nodes and larger numbers of tasks to the faster processors. With this criterion in mind, we can define Λ_j = λ_{d_j}(1 − λ_{f_j}(\sum_{k=1}^{n} λ_{f_k})^{-1}). Note that in the case of an extremely reliable node (λ_{f_j} ≈ 0), the parameter Λ_j is approximately equal to the processing rate of the node. On the contrary, for an unreliable node the parameter Λ_j is only a reduced fraction of its processing rate.

Second, we have to determine the amount of tasks to reallocate to each node. Let us define the collection V of candidate sender nodes in the DCS as all those nodes that, at the balancing instant, perceive themselves as overloaded with respect to their perceived fair share of the total workload of the system. Mathematically, we define V ≜ {j : L^{ex}_j(t_b) > 0}. Similarly, for each candidate sender node j, we define the collection U_j of candidate receiver nodes as all those nodes that, at time t_b, are perceived by the sender node as underloaded with respect to their own perceived fair shares of the total workload; namely, U_j ≜ {k : L^{ex}_{k,j}(t_b) < 0}, where j ∈ V and L^{ex}_{k,j}(t_b) is the excess load at the kth node as perceived by the jth node, defined as L^{ex}_{k,j}(t_b) = Q̂_{k,j}(t_b) − Λ_k (\sum_{l=1}^{n} Λ_l)^{-1} \sum_{l=1}^{n} Q̂_{l,j}(t_b), with the understanding that Q̂_{j,j}(t_b) = Q_j(t_b). The jth node partitions its excess load among all those nodes that are estimated to be below the average load of the DCS, and the partition p_{jk} is defined as p_{jk} ≜ L^{ex}_{k,j}(t_b)/(\sum_{l∈U_j} L^{ex}_{l,j}(t_b)) whenever k ∈ U_j, and 0 otherwise. Note that in the special case when the task transfer times are negligible, the partitions p_{jk} will maximize the PSSW under all node-failure rates [8]. This is due to the fact that upon the occurrence of a failure, the unserved tasks of the failed node can instantly join the queues of the other surviving nodes. In general, however, the above partitions p_{jk} may not be effective and must be adjusted in order to compensate for the effects of the random transfer times.

The load to be transferred from the jth to the kth node must be adjusted according to what is called the load-balancing gain [8], [9], [20], denoted as K_{jk}, yielding L_{jk}(t_b) = ⌊K_{jk} p_{jk} L^{ex}_j(t_b)⌋. (⌊x⌋ is the greatest integer less than or equal to x.) The remainder of this section focuses on deriving recurrence equations that characterize the PSSW. We begin by introducing some necessary definitions of key system variables.

2.3 System-information, system-function and network states

At any time, the DCS can be described by the following quantities: (i) the number of tasks queued at each node; (ii) the functional or dysfunctional state of each node in the system; and (iii) the amount of tasks in transit over the communication network. Therefore, based upon these quantities, we have constructed a discrete-state, continuous-time model to analyze the dynamics of the DCS. For each node and for each instant of time, we assign an n-tuple binary vector that describes the information state that a node has about the queue-lengths of all nodes. Let us denote by i_k the vector associated with the kth node. More precisely, a “1” entry for the jth component



of i_k (j ≠ k) indicates that the kth node has received the QI from the jth node. (By definition, the kth component of i_k is always “1.”) Clearly, at t = 0, when the QIs have just been broadcasted, all the entries of i_k (except for the kth component) are set to 0. We define the system-information state as the concatenated vector I ≜ (i_1, . . . , i_n). For example, in a two-node DCS (n = 2), the state I = (10, 11) at time t corresponds to the configuration for which the first node has not received the QI from the second node (i_1 = (10)) while the second node has received the QI from the first node (i_2 = (11)) by the time t. Similarly, for each kth node and each instant of time, we assign a binary vector f_k of size n, where a “1” (correspondingly, “0”) entry in the jth component at time t indicates that the jth node is functioning (correspondingly, faulty) as perceived by the kth node at time t. We define the concatenated vector F ≜ (f_1, . . . , f_n) as the system-function state. As all nodes are assumed to be functioning at t = 0, F has all its entries set to “1” at t = 0. As in the case of the QI, the random transfer times of FNs introduce uncertainty in the functioning state that a node perceives about the other nodes in the DCS. In addition, due to the random transfer times in the communication network, each group of tasks being transferred over the network has a random arrival time. Let g_k ∈ {0, 1, 2, 3, ...} represent the number of different groups of tasks that are simultaneously in transit from different nodes to the kth node at time t. We can assign a vector c_k of size g_k + 1 such that each component of c_k (except the first component) represents the number of tasks in a particular group in transit, while the first component of c_k is always set to g_k. We now define the network state as the concatenated vector C ≜ (c_1, . . . , c_n).
For example, in a three-node DCS, C = ([2 10 1], [1 5], [0]) at time t corresponds to the network state for which two different groups of tasks (10 tasks in the first group and 1 task in the second group) are being transferred to the first node (c_1 = [2 10 1]), one group of 5 tasks is being transferred to the second node (c_2 = [1 5]), and no transfer is being made to the third node (c_3 = [0]). Finally, it should be noted that the implicit dependence of I, F and C on the time t becomes evident as we progress deeper into the analysis of the DCS.

Let T^{I,F}_{m_1,...,m_n}(t_b; C) denote the random time taken by the DCS to serve all the tasks in the system if LB is performed by all functioning nodes at time t_b, and the initial system condition at t = 0 is as specified by I, F and C while m_k tasks (k = 1, . . . , n) are in the queue of the kth node. Our objective is to calculate the probability of success in serving all tasks, denoted as P{T^{I,F}_{m_1,...,m_n}(t_b; C) < ∞}, for C = ([0], [0], . . . , [0]), I = (10...0, 01...0, . . . , 00...1) and F = (11...1, 11...1, . . . , 11...1), with t_b as a free parameter. (That is, we assume a null-information initial state, in which all nodes are functioning and no tasks are in transit in the communication network.)
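The three state vectors can be encoded directly as nested sequences. The following sketch is our own illustrative encoding, not code from the paper; it reproduces the three-node example above:

```python
# i_k: node k's knowledge of the queue-length info; own entry is always 1.
I = ((1, 0, 0), (0, 1, 0), (0, 0, 1))   # t = 0: no QI received yet

# f_k: node k's view of which nodes are functioning; all "1" at t = 0.
F = ((1, 1, 1), (1, 1, 1), (1, 1, 1))

# c_k: [g_k, size of group 1, ..., size of group g_k] in transit to node k.
# The paper's example: groups of 10 and 1 tasks headed to node 1, one
# group of 5 tasks headed to node 2, and nothing headed to node 3.
C = ([2, 10, 1], [1, 5], [0])

def tasks_in_transit(C):
    """Total number of tasks currently traversing the network."""
    return sum(sum(c_k[1:]) for c_k in C)
```

Here `tasks_in_transit(C)` returns 16 for the example state, since 10 + 1 + 5 tasks are in transit.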

2.4 Regeneration time

We will undertake an approach based on the concept of stochastic regeneration to obtain recurrence equations that characterize the PSSW. Our key idea is to introduce a special random variable, called the regeneration time, τ, which is defined as the minimum of the following five random variables: the time to the first task service by any node, the time to the first occurrence of a failure at any node, the time to the first arrival of a QI at any node, the time to the first arrival of an FN at any node, and the time to the first arrival of a group of tasks at any node. The key property of this regeneration time τ is that upon the occurrence of the regeneration event {τ = s}, a fresh copy of the original stochastic process (from which the random variable T^{I,F}_{m_1,...,m_n}(t_b; C) is defined) will emerge at time s with a new initial system configuration that transpires from the regeneration event. More precisely, the occurrence of the regeneration event {τ = s} gives birth to a new DCS at t = s whose random times satisfy Assumptions A1 and A2 while having its own initial system configuration. The new initial system configuration can be one of the following: (i) a new initial task distribution if the regeneration event is the service of a task; (ii) a new F and a new C if the regeneration event is a node failure; (iii) a new I if the regeneration event is the arrival of a QI; (iv) a new F if the regeneration event is the arrival of an FN; or (v) a new task distribution and a new C if the regeneration event is the arrival of a group of tasks. An illustration of our regeneration idea and the associated change in the initial system configuration in the special case of a two-node system is shown in Fig. 1.
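The regeneration argument rests on a standard property of independent exponential random variables: their minimum is again exponential with the summed rate, and the ith clock "wins the race" with probability rate_i divided by the total rate. A minimal sketch (our own, not the paper's implementation):

```python
import random

def event_probabilities(rates):
    """For independent exponential clocks with the given rates, the
    probability that clock i fires first is rates[i] / sum(rates); the
    minimum itself is exponential with rate sum(rates). This property
    underlies the regeneration-time decomposition."""
    total = sum(rates)
    return [r / total for r in rates]

def simulate_first_event(rates, rng):
    """Sample which clock fires first (returns its index)."""
    times = [rng.expovariate(r) for r in rates]
    return times.index(min(times))

# Illustrative Monte Carlo check: with service rate 2 and failure rate 1,
# a task should complete before the node fails about 2/3 of the time.
rng = random.Random(0)
wins = sum(simulate_first_event([2.0, 1.0], rng) == 0 for _ in range(20000))
```

The empirical fraction `wins / 20000` agrees with `event_probabilities([2.0, 1.0])[0] = 2/3` up to sampling error.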

Fig. 1. An illustration of the concept of stochastic regeneration in a two-node DCS. The picture shows all the possible regeneration events occurring in the system, and how they change the initial system configuration.

2.5 Renewal equations for a two-node DCS

Consider the case of a two-node system (n = 2). Let R^{I,F}_{m_1,m_2}(t_b; C) ≜ P{T^{I,F}_{m_1,m_2}(t_b; C) < ∞}. Trivially, R^{I,F}_{0,0}(t_b; [0], [0]) = 1 for all balancing instants and for any I and F, since there are no tasks to be completed


in the DCS. Also, if there is at least one task to be served in the system and both nodes have failed, then the PSSW is zero. Mathematically, if m_1 > 0 or m_2 > 0, then R^{I,(0i,j0)}_{m_1,m_2}(t_b; [0], [0]) = 0 for any I and for i, j ∈ {0, 1}.

Our main result is given in Theorems 1 and 2. Theorem 1 characterizes the probability R^{I,F}_{m_1,m_2}(t_b; C) in the form of a set of coupled difference-differential equations, and Theorem 2 provides the initial conditions required to solve Theorem 1.

Theorem 1: For n = 2, m_1, m_2 ∈ Z^+ and t_b ≥ 0, the probability R^{I,F}_{m_1,m_2}(t_b; C) satisfies (2)–(10) shown below:

(d/dt_b) R^{I_{1,1},F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) = λ_{d_1} R^{I_{1,1},F_{3,3}}_{m_1−1,m_2}(t_b; C_{0,0}) + λ_{d_2} R^{I_{1,1},F_{3,3}}_{m_1,m_2−1}(t_b; C_{0,0}) + λ_{12} R^{I_{1,3},F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) + λ_{21} R^{I_{3,1},F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) + λ_{f_1} R^{I_{1,1},F_{1,3}}_{0,m_2}(t_b; C_{0,m_1}) + λ_{f_2} R^{I_{1,1},F_{3,2}}_{m_1,0}(t_b; C_{m_2,0}) − λ R^{I_{1,1},F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}),    (2)

(d/dt_b) R^{I_{1,1},F_{3,2}}_{m_1,0}(t_b; C_{m_2,0}) = λ_{d_1} R^{I_{1,1},F_{3,2}}_{m_1−1,0}(t_b; C_{m_2,0}) + λ_{12} R^{I_{1,3},F_{3,2}}_{m_1,0}(t_b; C_{m_2,0}) + λ_{21} R^{I_{3,1},F_{3,2}}_{m_1,0}(t_b; C_{m_2,0}) + λ^F_{21} R^{I_{1,1},F_{2,2}}_{m_1,0}(t_b; C_{m_2,0}) + λ̃_{1,1} R^{I_{1,1},F_{2,2}}_{m_1+m_2,0}(t_b; C_{0,0}) − λ_1 R^{I_{1,1},F_{3,2}}_{m_1,0}(t_b; C_{m_2,0}),    (3)

(d/dt_b) R^{I_{1,1},F_{1,3}}_{0,m_2}(t_b; C_{0,m_1}) = λ_{d_2} R^{I_{1,1},F_{1,3}}_{0,m_2−1}(t_b; C_{0,m_1}) + λ_{12} R^{I_{1,3},F_{1,3}}_{0,m_2}(t_b; C_{0,m_1}) + λ_{21} R^{I_{3,1},F_{1,3}}_{0,m_2}(t_b; C_{0,m_1}) + λ^F_{12} R^{I_{1,1},F_{1,1}}_{0,m_2}(t_b; C_{0,m_1}) + λ̃_{1,2} R^{I_{1,1},F_{1,1}}_{0,m_1+m_2}(t_b; C_{0,0}) − λ_2 R^{I_{1,1},F_{1,3}}_{0,m_2}(t_b; C_{0,m_1}),    (4)

R^{I_{1,2},F_{3,2}}_{m_1,0}(t_b; C_{m_2,0}) = R^{I_{1,3},F_{3,2}}_{m_1,0}(t_b; C_{m_2,0}),    (5)

R^{I_{2,1},F_{1,3}}_{0,m_2}(t_b; C_{0,m_1}) = R^{I_{3,1},F_{1,3}}_{0,m_2}(t_b; C_{0,m_1}),    (6)

R^{I_{1,1},F_{2,2}}_{m_1,0}(t_b; C_{m_2,0}) = λ_{d_1} λ_4^{−1} R^{I_{1,1},F_{2,2}}_{m_1−1,0}(t_b; C_{m_2,0}) + λ̃_{1,1} λ_4^{−1} R^{I_{1,1},F_{2,2}}_{m_1+m_2,0}(t_b; C_{0,0}),    (7)

R^{I_{1,1},F_{1,1}}_{0,m_2}(t_b; C_{0,m_1}) = λ_{d_2} λ_3^{−1} R^{I_{1,1},F_{1,1}}_{0,m_2−1}(t_b; C_{0,m_1}) + λ̃_{1,2} λ_3^{−1} R^{I_{1,1},F_{1,1}}_{0,m_1+m_2}(t_b; C_{0,0}),    (8)

R^{I_{1,1},F_{2,2}}_{m_1+m_2,0}(t_b; C_{0,0}) = (λ_{d_1}(λ_{d_1}+λ_{f_1})^{−1})^{m_1+m_2},    (9)

R^{I_{1,1},F_{1,1}}_{0,m_1+m_2}(t_b; C_{0,0}) = (λ_{d_2}(λ_{d_2}+λ_{f_2})^{−1})^{m_1+m_2},    (10)

where λ = \sum_{k=1}^{2}(λ_{d_k} + λ_{f_k} + \sum_{j≠k} λ_{kj}), λ̃^{−1}_{1,1} = a_{21} m_2 + b_{21}, λ̃^{−1}_{1,2} = a_{12} m_1 + b_{12}, λ_1 = λ_{d_1} + λ_{12} + λ_{21} + λ^F_{21} + λ̃_{1,1} + λ_{f_1}, λ_2 = λ_{d_2} + λ_{12} + λ_{21} + λ^F_{12} + λ̃_{1,2} + λ_{f_2}, λ_3 = λ_{d_2} + λ_{f_2} + λ̃_{1,2}, λ_4 = λ_{d_1} + λ_{f_1} + λ̃_{1,1}, I_{1,1} = (1 i_1, i_2 1), I_{1,2} = (1 i_1, 0 1), I_{2,1} = (1 0, i_2 1), I_{3,1} = (1 i_1, 1 1), I_{1,3} = (1 1, i_2 1), i_1, i_2 ∈ {0, 1}, F_{1,1} = (01, 01), F_{1,3} = (01, 11), F_{2,2} = (10, 10), F_{3,2} = (11, 10), F_{3,3} = (11, 11), C_{0,0} = ([0], [0]), C_{m_2,0} = ([1 m_2], [0]), C_{0,m_1} = ([0], [1 m_1]), and m_k − 1 is set to 0 when m_k = 0.

The proof of Theorem 1 is lengthy and technical. We provide here a sketch of the proof and defer the remaining details to the Appendix.

Sketch of the proof: The key idea is to first calculate the probability of successfully serving the tasks conditional on knowledge of the regeneration time τ, and then average the conditional probability with respect to the known distribution of τ. In calculating the conditional probability, we exploit the definition of the regeneration time to decompose the conditional probability according to all possible, disjoint regeneration events. Next, we exploit the independence of the random variables defined in Assumption A1, the memoryless property of the exponential distribution, and Convention C1 to prove that, conditional on the time of occurrence τ of the regeneration event, the following three observations hold: (i) a fresh copy of the underlying stochastic process emerges at the regeneration time τ but with a new initial system configuration; (ii) the emergent stochastic process is independent of the original process and satisfies Assumptions A1 and A2; and (iii) the independence of the new process allows us to shift the time origin to t = τ. Then, we apply these observations to all the possible system-information states, considering that initially all the nodes in the DCS are functioning and no tasks are in transit among the nodes. Equation (2) is obtained for the case when both nodes are functioning, while equations (3) and (4) regard situations when only one of the nodes is functioning. Equations (5)–(10) provide relationships between the probabilities for the special case when only one node is functioning. For instance, equations (5) and (6) show that when a node has failed, some system-state information becomes redundant. Equations (7) and (8) are the expressions to compute the PSSW when only the jth node is functioning and m_k tasks are in transit to the jth node due to node failure. Finally, equations (9) and (10) determine the PSSW when one of the nodes is working and there are no tasks being transferred in the network.

Remark:
Equations (2)–(4) actually denote four coupled difference-differential equations (one per system-information state) that have to be solved recursively. These equations characterize the rate of change in the PSSW as a function of the LB instant for any initial task allocation, for any system-information state, for network states where tasks are in transit due to node failure, and for system-function states where at least one node is functioning.

In order to solve the recurrence equations (2)–(10) of Theorem 1, we first need to calculate their initial conditions, corresponding to t_b = 0. According to the methods for allocating tasks, if the jth node has a positive excess load, then it must transfer L_{jk}(0) = ⌊K_{jk} p_{jk} L^{ex}_j(0)⌋ tasks to the kth functional node, while r_j = m_j − L_{jk}(0) tasks remain queued at the jth node. Note that, for a given information state i_j, the excess load at node j as well as the partitions p_{jk} are fixed quantities; the amount of tasks to reallocate to each node is determined by the LB gain K_{jk}. In this paper, the LB gains are regarded as parameters that have to be optimally selected in order to maximize the PSSW. Given that the quantity p_{jk} L^{ex}_j(0)


defines the maximum number of tasks to be exchanged from node j to node k, we have assumed here that the LB gains are real numbers in the interval [0, 1].

Before presenting Theorem 2 we introduce some necessary definitions. For simplicity, let L_{jk} denote the amount of tasks to be exchanged from node j to node k, with the understanding that this quantity is computed at t = t_b. Let T̃^F_{r_1,r_2}(C; K) be the time to serve all the tasks in the DCS given that the system-function state and the network state at t = 0 are F and C, respectively; let K = [K_{12} K_{21}] denote the LB gains; and let r_1 and r_2 denote the number of tasks queued at the first and second node, respectively. By construction, we have the following relationships for the PSSW before and after the LB instant: R^{I,F_{3,3}}_{m_1,m_2}(0; C_{0,0}) = P{T̃^{F_{3,3}}_{r_1,r_2}([1 L_{21}], [1 L_{12}]; K) < ∞}, R^{I,F_{3,2}}_{m_1,0}(0; C_{m_2,0}) = P{T̃^{F_{3,2}}_{r_1,0}([1 m_2], [1 L_{12}]; K) < ∞} and R^{I,F_{1,3}}_{0,m_2}(0; C_{0,m_1}) = P{T̃^{F_{1,3}}_{0,r_2}([1 L_{21}], [1 m_1]; K) < ∞}.

Theorem 2: For n = 2, r_1, r_2, L^i_{jk} ∈ Z^+ and integer g_j ≥ 0, the probability P{T̃^F_{r_1,r_2}(C; K) < ∞} = R̃^F_{r_1,r_2}(C; K) satisfies the simple recursions described in (11)–(15):

R̃^{F_{3,3}}_{r_1,r_2}(C_{g_1,g_2}; K) = λ_1^{−1} ( λ_{d_1} R̃^{F_{3,3}}_{r_1−1,r_2}(C_{g_1,g_2}; K) + λ_{d_2} R̃^{F_{3,3}}_{r_1,r_2−1}(C_{g_1,g_2}; K) + λ_{f_1} R̃^{F_{1,3}}_{0,r_2}(C_{g_1,g_2+1}; K) + λ_{f_2} R̃^{F_{3,2}}_{r_1,0}(C_{g_1+1,g_2}; K) + \sum_{i=1}^{g_1} λ̃_{i,1} R̃^{F_{3,3}}_{r_1+L^i_{21},r_2}(C_{g_1−i,g_2}; K) + \sum_{i=1}^{g_2} λ̃_{i,2} R̃^{F_{3,3}}_{r_1,r_2+L^i_{12}}(C_{g_1,g_2−i}; K) ),    (11)

R̃^{F_{3,2}}_{r_1,0}(C_{g_1,g_2}; K) = λ_2^{−1} ( λ_{d_1} R̃^{F_{3,2}}_{r_1−1,0}(C_{g_1,g_2}; K) + \sum_{i=1}^{g_1} λ̃_{i,1} R̃^{F_{3,2}}_{r_1+L^i_{21},0}(C_{g_1−i,g_2}; K) + \sum_{i=1}^{g_2} λ̃_{i,2} R̃^{F_{3,2}}_{r_1,0}(C_{g_1+i,g_2−i}; K) ),    (12)

R̃^{F_{1,3}}_{0,r_2}(C_{g_1,g_2}; K) = λ_3^{−1} ( λ_{d_2} R̃^{F_{1,3}}_{0,r_2−1}(C_{g_1,g_2}; K) + \sum_{i=1}^{g_1} λ̃_{i,1} R̃^{F_{1,3}}_{0,r_2}(C_{g_1−i,g_2+i}; K) + \sum_{i=1}^{g_2} λ̃_{i,2} R̃^{F_{1,3}}_{0,r_2+L^i_{12}}(C_{g_1,g_2−i}; K) ),    (13)

R̃^{F_{3,2}}_{r_1,0}(C_{0,0}; K) = (λ_{d_1}(λ_{d_1}+λ_{f_1})^{−1})^{r_1},    (14)

R̃^{F_{1,3}}_{0,r_2}(C_{0,0}; K) = (λ_{d_2}(λ_{d_2}+λ_{f_2})^{−1})^{r_2},    (15)

where λ̃^{−1}_{i,k} = a_{jk} L^i_{jk} + b_{jk}, λ_1 = \sum_{k=1}^{2}(λ_{d_k} + λ_{f_k}) + \sum_{i=1}^{g_1} λ̃_{i,1} + \sum_{i=1}^{g_2} λ̃_{i,2}, λ_2 = λ_{d_1} + λ_{f_1} + \sum_{i=1}^{g_1} λ̃_{i,1} + \sum_{i=1}^{g_2} λ̃_{i,2}, λ_3 = λ_{d_2} + λ_{f_2} + \sum_{i=1}^{g_1} λ̃_{i,1} + \sum_{i=1}^{g_2} λ̃_{i,2}, C_{g_1,g_2} = ([g_1 L^1_{21} . . . L^{g_1}_{21}], [g_2 L^1_{12} . . . L^{g_2}_{12}]), C_{g_1+1,g_2} = ([g_1+1 L^1_{21} . . . L^{g_1}_{21} r_2], [g_2 L^1_{12} . . . L^{g_2}_{12}]), C_{g_1,g_2+1} = ([g_1 L^1_{21} . . . L^{g_1}_{21}], [g_2+1 L^1_{12} . . . L^{g_2}_{12} r_1]), C_{g_1−i,g_2} = ([g_1−1 L^1_{21} . . . L^{i−1}_{21} L^{i+1}_{21} . . . L^{g_1}_{21}], [g_2 L^1_{12} . . . L^{g_2}_{12}]), C_{g_1,g_2−i} = ([g_1 L^1_{21} . . . L^{g_1}_{21}], [g_2−1 L^1_{12} . . . L^{i−1}_{12} L^{i+1}_{12} . . . L^{g_2}_{12}]), C_{g_1+i,g_2−i} = ([g_1+1 L^1_{21} . . . L^{g_1}_{21} L^i_{12}], [g_2−1 L^1_{12} . . . L^{i−1}_{12} L^{i+1}_{12} . . . L^{g_2}_{12}]), C_{g_1−i,g_2+i} = ([g_1−1 L^1_{21} . . . L^{i−1}_{21} L^{i+1}_{21} . . . L^{g_1}_{21}], [g_2+1 L^1_{12} . . . L^{g_2}_{12} L^i_{21}]), and r_j − 1 is set to 0 when r_j = 0.
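Equations (14) and (15) (like (9) and (10) in Theorem 1) have a direct interpretation: once a single functioning node holds all r remaining tasks and nothing is in transit, each task must be served before the node fails, which by the memoryless property happens independently with probability λ_d/(λ_d + λ_f) per task. A one-line sketch (function name is ours):

```python
def single_node_pssw(lam_d, lam_f, r):
    """PSSW boundary condition of eqs. (14)-(15): a lone functioning node
    with service rate lam_d and failure rate lam_f serves each of its r
    queued tasks before failing with probability lam_d / (lam_d + lam_f),
    independently for each task, hence the r-th power."""
    return (lam_d / (lam_d + lam_f)) ** r
```

For example, with λ_d = 9, λ_f = 1 and r = 10 remaining tasks, the PSSW is 0.9^10 ≈ 0.35.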

We omit the proof of Theorem 2 since it is similar to that of Theorem 1. (We refer the reader to [34] for a similar type of proof.)

2.6 Extension to a multi-node DCS

Renewal equations characterizing the PSSW for the case of a multi-node DCS can be obtained following the same principles as those presented for the case of a two-node DCS. However, the complexity grows exponentially with the number of nodes, and the solution soon becomes intractable. Nevertheless, a suboptimal solution can be efficiently computed by decomposing the n-node system into two-node DCSs and invoking Theorems 1 and 2 every time a load is exchanged between a pair of nodes. Suppose that the jth node is a sender node and let U_j be the collection of candidate receiver nodes as perceived by the jth node. Let K*_{jk} denote the LB gain associated with the load transfer from the jth node to the kth node, and let U'_j denote the set containing those recipient nodes k for which K*_{jk} has already been calculated. In addition, assume that U'_j is initially empty and that K_{jk} = 1 for all k ∈ U_j and any j. (Namely, we have assumed that the jth node can send full load partitions to the recipient nodes.) The suboptimal multi-node solution for the DCS is described by the following steps.

Step 1: Select a node, say the kth node, from the collection U_j, and consider a two-node system comprising nodes k and j. As a result, upon the execution of LB at t_b, the kth and the jth nodes will have loads Q̂_{k,j}(t_b) and Q_j(t_b) − (\sum_{i∈U_j\{k}} ⌊p_{ji} L^{ex}_j(t_b)⌋ + ⌊K_{jk} p_{jk} L^{ex}_j(t_b)⌋), respectively, while ⌊K_{jk} p_{jk} L^{ex}_j(t_b)⌋ tasks are assumed to be in transit from the jth node to the kth node. Theorems 1 and 2 can now be invoked to compute K*_{jk}. Once this gain is known, we update the set U'_j by including the kth node in it, at which point U'_j = {k}.

Step 2: Select another recipient node, say the lth node, from the collection U_j \ {k}.
Then, we utilize the assumption Kji = 1 for all i ∈ Uj \ {k, l} as well as the ∗ value of Kjk (which was obtained in the previous step) ∗ ∗ to calculate the LB gain Kjl . The LB gain Kjl is obtained by considering a two-node system as well. Thus, upon the execution of LB at tb the lth and the jth nodes have ˆ l,j (tb ) and Qj (tb ) − (P bpji Lex loads Q j (tb )c + i∈(U j \{k,l}) P ∗ ex ex bK p L (t )c + bK p L (t )c), respectively, 0 ji b jl jl b ji j j i∈Uj while bKjl pjl Lex j (tb )c tasks are in transit from the jth to the lth node. We now update the set Uj0 and obtain Uj0 = {k, l}. Repeat: The second step is repeated until the gains of all the nodes in Uj are calculated. In particular, at time a new node is selected from the set Uj \ Uj0 , the corresponding pairwise gains are calculated using twonode systems, and the set Uj0 is augmented to include the new node. This step terminates when Uj0 = Uj . Termination condition: Given that the algorithm described so far turns out to be dependent on the order selection of the nodes, we iterate over steps 1 and 2 to recalculate all the LB gains until either all of the LB gains


converge or a predefined maximum number of iterations is reached. The announced LB gains are those obtained once either of these two termination conditions is met.
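The iterative gain-computation procedure above can be sketched in a few lines. This is a minimal illustration only: `pairwise_gain(j, k, gains)` is a hypothetical stand-in for the two-node solution of Theorems 1 and 2, returning the optimal gain $K^*_{jk}$ for the pair $(j,k)$ given the current estimates for the other receivers.

```python
def compute_lb_gains(j, receivers, pairwise_gain, max_iters=10, tol=1e-3):
    """Iteratively compute the LB gains K_jk for sender node j.

    pairwise_gain(j, k, gains) is assumed to solve the two-node problem
    (Theorems 1 and 2 in the text) for the pair (j, k), given the current
    gain estimates for all other receivers.
    """
    gains = {k: 1.0 for k in receivers}  # start with full load partitions
    for _ in range(max_iters):
        max_change = 0.0
        for k in receivers:              # Steps 1-2: one receiver at a time
            new_gain = pairwise_gain(j, k, dict(gains))
            max_change = max(max_change, abs(new_gain - gains[k]))
            gains[k] = new_gain
        if max_change < tol:             # termination: gains have converged
            break
    return gains
```

The outer loop implements the termination condition: passes over the receiver set are repeated until the gains stop changing or the iteration budget is exhausted, which removes the dependence on the order in which receivers are visited.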

3 MAXIMIZING SERVICE RELIABILITY

3.1 Distributed computing architecture

We have implemented a small-scale DCS testbed to experimentally validate the theoretical results presented in this paper. The hardware architecture consists of the computing nodes, the backup nodes and the communication network. The set of computing nodes comprises heterogeneous processors, such as Pentium II- and Pentium 4-based computers. Some of the computing nodes are dedicated machines, while others serve as lightly loaded web, mail and database servers. Given that the occurrence of a failure at any node is simulated by software, the set of backup nodes is the same as the set of working nodes. Upon the occurrence of a failure, a node is switched from the so-called working state to the backup state. A node in the backup state is not allowed to continue processing tasks. The communication network employed in our architecture is the Internet, where the final links connecting the computing nodes are either wired or wireless. On one hand, some communication links connect nodes separated geographically by a large distance; hence, they naturally exhibit a considerable communication delay. On the other hand, for those nodes in the DCS connected by high-speed links, we have introduced artificial latency by employing traffic-shaper applications. Such applications allow us to reduce the actual transfer speed of the network interfaces to rates as low as 1024 or 512 kbps. The software architecture of the DCS is divided into three layers: application, task allocation and communication. The layers are implemented in software using POSIX threads. The application layer executes the application selected to illustrate the distributed processing: matrix multiplication. We have defined the service of a task as the multiplication of one row by a static matrix, which is duplicated in all nodes.
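A toy version of this unit of service can be sketched as follows. The sketch randomizes each row's arithmetic precision with an exponential distribution, as the application layer does; the mean of 64 bits is an illustrative value we chose for the sketch, not a figure from the paper.

```python
import random

def make_task(mean_precision_bits=64, rng=random):
    """One task = multiplying one row by the static matrix.

    The row's arithmetic precision (and hence its service time) is drawn
    from an exponential distribution, mimicking the application layer's
    randomization of the nodes' effective processing speed.
    """
    precision = max(1, int(rng.expovariate(1.0 / mean_precision_bits)))
    return {"precision_bits": precision}
```

Because the precision is exponentially distributed, the per-task service times inherit the memoryless character assumed by the analytical model.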
To achieve variability in the processing speed of the nodes, randomness is introduced in the size of each row by independently choosing its arithmetic precision from an exponential distribution. In addition, the application layer updates the QI of each node and determines the failure instants of each node. As part of the latter task, the application layer also switches the state of a node from working to backup. The task-allocation layer executes the LB policy defined for each type of experiment conducted. This layer schedules and triggers the LB instants when task exchanges are performed. It also: (i) determines if a node is overloaded with respect to the other nodes in the system; (ii) selects which nodes are candidate receiving nodes; and (iii) computes the amount of tasks to transmit

to the receiver nodes by solving the recursions (11)–(15). In addition, when a node is in the backup state, this layer executes the reallocation of tasks to all the surviving nodes as described in Section 2. Finally, the communication layer of each node handles the transfer of tasks as well as the transfer of QI and FN among the nodes. Each node uses the UDP transport protocol to transfer either a QI or an FN to the other nodes. The TCP transport protocol is used to transfer tasks between the nodes.

3.2 Maximizing the service reliability of a two-node DCS

The model detailed in Section 2 completely characterizes the stochastic dynamics of a DCS. By exploiting this model, we can now devise a one-shot LB policy that maximizes the service reliability of a two-node DCS. The objective is to find the optimal LB policy, defined by the choice of the optimal $t_b$ together with the optimal LB gains, that maximizes the service reliability, i.e., the PSSW assigned to the DCS. In symbols, the optimal LB policy is defined by $\operatorname{argmax}_{(t_b, K_{12}, K_{21})} R^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0})$.

Example 1: We have conducted experiments using a dedicated (node 1) and a non-dedicated computer (node 2). The nodes are separated by a large geographic distance and communicate through the Internet. The free parameters of the system, namely, the initial workload and the average failure times, were set to $m_1 = 50$ tasks and $m_2 = 25$ tasks, $\lambda^{-1}_{f_1} = 300$ s and $\lambda^{-1}_{f_2} = 100$ s. The remaining system parameters were estimated by conducting experiments on the two-node DCS: (i) the estimated service rates of the nodes are $\lambda_{d_1} = 0.1682$ tasks per second (tps) and $\lambda_{d_2} = 0.4978$ tps; (ii) the mean arrival times of QI packets and FN packets are $\lambda^{-1}_{12} = \lambda^{F,-1}_{12} = 4.6402$ s and $\lambda^{-1}_{21} = \lambda^{F,-1}_{21} = 1.6659$ s; and (iii) the parameters of the first-order approximation of the average transfer time of tasks are $a_{12} = 0.243$ seconds per task, $b_{12} = 1.971$ s, $a_{21} = 0.339$ seconds per task and $b_{21} = 1.652$ s.
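The channel parameters $a_{jk}$ and $b_{jk}$ were estimated from measured transfer times. A simple way to obtain such estimates (a sketch, not necessarily the authors' exact procedure) is an ordinary least-squares fit of transfer time against the number of tasks transferred:

```python
def fit_transfer_model(num_tasks, times):
    """Least-squares fit of t = a*L + b to measured transfer times.

    Returns (a, b): seconds per task and the fixed per-transfer overhead,
    i.e., the parameters of the first-order delay model in the text.
    """
    n = len(num_tasks)
    mean_l = sum(num_tasks) / n
    mean_t = sum(times) / n
    cov = sum((l - mean_l) * (t - mean_t) for l, t in zip(num_tasks, times))
    var = sum((l - mean_l) ** 2 for l in num_tasks)
    a = cov / var
    b = mean_t - a * mean_l
    return a, b

# Synthetic check: noiseless data generated with a = 0.243 s/task, b = 1.971 s,
# the values reported for the channel from node 1 to node 2 in Example 1.
loads = [10, 20, 30, 40, 50]
times = [0.243 * l + 1.971 for l in loads]
a, b = fit_transfer_model(loads, times)
```

With real measurements the dots scatter around the fitted line, as in Fig. 2(a); the fit remains valid as long as the transferred group sizes stay within the measured range.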
Figure 2(a) shows the results of the experiments conducted on our two-node DCS testbed for the communication channel linking node 2 to node 1. In the figure, dots represent measurements of the transfer time of a group of tasks between the nodes, while the straight line represents the first-order approximation of the average transfer time. We observe that, for the amounts of tasks exchanged in the experiments conducted in this paper, a first-order approximation of the average transfer time is valid for the communication channel. Let us first look at the solution for the initial condition corresponding to $t_b = 0$, when $I = (10,01)$, $F = (11,11)$ and $C_{0,0} = ([0],[0])$. In order to explore all the possible amounts of tasks to exchange between the nodes, we use the formula $L_{jk}(0) = \lfloor K_{jk} m_j \rfloor$ with $0 \le K_{jk} \le 1$. Note that for the case of a two-node DCS, nodes do not have to partition their excess load because there is only one recipient node. Now, we can precisely solve the recursions listed in Theorem 2 to calculate $R^{(10,01),F_{3,3}}_{50,25}(0; C_{0,0})$.

In Fig. 2(b), the PSSW under two choices of $K_{21}$ is plotted as a function of $K_{12}$. On one hand, small values of $K_{12}$ imply that node 1 (the slower node) keeps most of its initial load. Consequently, the load distribution remains unfair even after LB is performed; the time required to serve the workload therefore becomes "large" and the success probability "small." On the other hand, when $K_{12}$ approaches 1, the first node transfers most of its initial load to the second node. Hence, almost all the tasks are queued and served at the less reliable node until it fails. Upon failure, the remaining tasks are transferred back to node 1, if it is functioning; thereby, the success probability is reduced by an excessive queuing of tasks in the communication network. In addition to the theoretical predictions, Fig. 2(b) shows Monte-Carlo (MC) simulations as well as experimental results obtained for the one-shot LB policy. In our simulations, each success probability is calculated by averaging outcomes (failures or successes) over 4000 independent realizations of the policy. The simulation results are in good agreement with our theoretical predictions, and, remarkably, experiments conducted on the two-node DCS also show fairly good agreement with the theoretical curves. In the experiments, the success probability is calculated by averaging the results of 500 independent trials for each policy shown in Fig. 2(b).

Next, we look at the solution of the equations given in Theorem 1. Figure 2(c) shows theoretical predictions and MC simulation results for the PSSW as a function of the balancing instant, for some representative selections of pairs of LB gains. Note that an improper selection of the LB gains can produce a marked reduction in the PSSW, as is depicted for the case of choosing $K_{12} = 0.1$ and $K_{21} = 0.6$. Finally, by solving the differential equations of Theorem 1, we obtain the optimal LB policy: $(t_b^*, K_{12}^*, K_{21}^*) = \operatorname{argmax}_{(t_b, K_{12}, K_{21})} R^{(10,01),F_{3,3}}_{50,25}(t_b; C_{0,0}) = (0, 0.6, 0)$, which provides a theoretical optimal success probability of 0.6006. For this policy, the PSSW estimated by means of 4000 independent simulation trials is 0.6136, while for 500 independent experiments conducted on the DCS testbed the estimated PSSW is 0.6011.

Fig. 2. (a) Dots are realizations of the transfer time of tasks in a two-node DCS. The solid line represents the first-order approximation of the average transfer time. (b) The PSSW as a function of the LB gain of node 1, when LB is performed at time t = 0. (c) The PSSW as a function of the balancing instant for some representative values of the LB gains.
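The MC estimates above can be reproduced in spirit with a small event-driven simulator. The sketch below is a heavily simplified stand-in for the simulator used in the paper, with assumptions stated in the docstring: transfer delays are deterministic, QI/FN traffic is ignored, and tasks in transit toward an already-failed node are counted as lost (the model in the text re-routes them instead). The rates are those of Example 1.

```python
import heapq
import random

# Rates and channel parameters from Example 1 (two-node testbed).
LD = (0.1682, 0.4978)             # service rates lambda_d, tasks/s
LF = (1 / 300.0, 1 / 100.0)       # failure rates lambda_f, 1/s
A = ((0.0, 0.243), (0.339, 0.0))  # a_jk: transfer seconds per task
B = ((0.0, 1.971), (1.652, 0.0))  # b_jk: fixed transfer overhead, s

def simulate_once(m1, m2, k12, k21, rng):
    """One realization of a *simplified* two-node DCS with one-shot LB at
    t_b = 0: deterministic transfer delays, no QI/FN exchange, and tasks
    in transit to a failed node are lost rather than returned."""
    total = m1 + m2
    if total == 0:
        return True
    l12, l21 = int(k12 * m1), int(k21 * m2)
    q = [m1 - l12, m2 - l21]
    served = 0
    alive = [True, True]
    events = []  # (time, kind, node, n_tasks)
    if l12:
        heapq.heappush(events, (A[0][1] * l12 + B[0][1], "arrive", 1, l12))
    if l21:
        heapq.heappush(events, (A[1][0] * l21 + B[1][0], "arrive", 0, l21))
    for k in (0, 1):
        heapq.heappush(events, (rng.expovariate(LF[k]), "fail", k, 0))
        if q[k]:
            heapq.heappush(events, (rng.expovariate(LD[k]), "serve", k, 0))
    while events:
        t, kind, k, n = heapq.heappop(events)
        if kind == "fail":
            alive[k] = False
            other = 1 - k
            if q[k] and alive[other]:
                # backup system re-sends the unserved tasks to the survivor
                delay = A[k][other] * q[k] + B[k][other]
                heapq.heappush(events, (t + delay, "arrive", other, q[k]))
            elif q[k]:
                return False                 # both nodes down: tasks stranded
            q[k] = 0
        elif kind == "arrive":
            if not alive[k]:
                return False                 # receiver already failed
            if q[k] == 0:                    # idle queue: restart service
                heapq.heappush(events, (t + rng.expovariate(LD[k]), "serve", k, 0))
            q[k] += n
        elif kind == "serve" and alive[k] and q[k] > 0:
            q[k] -= 1
            served += 1
            if served == total:
                return True                  # whole workload served
            if q[k] > 0:
                heapq.heappush(events, (t + rng.expovariate(LD[k]), "serve", k, 0))
    return False

def estimate_pssw(m1, m2, k12, k21, n=2000, seed=0):
    """Monte-Carlo estimate of the PSSW for the given one-shot LB gains."""
    rng = random.Random(seed)
    return sum(simulate_once(m1, m2, k12, k21, rng) for _ in range(n)) / n
```

Because of the simplifications, the numbers produced by this sketch need not match the 0.6006/0.6136 values reported above; the point is the estimation procedure itself, i.e., averaging success indicators over independent realizations.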

3.3 Maximizing the service reliability of a multi-node DCS

In this section we maximize the service reliability of a multi-node DCS by utilizing the algorithm presented in Section 2.6. We devise several decentralized one-shot LB policies considering different balancing criteria. For simplicity of exposition, we have assumed that $t_b = 0$; the LB policies are therefore determined by the balancing criterion as well as the values of the LB gains. The scenario considered in the following examples comprises a five-node DCS, to which a workload of $M = 150$ tasks is provided. We have assumed that the average failure times of the nodes are $\lambda^{-1}_{f_1} = 350$ s,


$\lambda^{-1}_{f_2} = 10$ s, $\lambda^{-1}_{f_3} = 20$ s, $\lambda^{-1}_{f_4} = 200$ s, and $\lambda^{-1}_{f_5} = 300$ s. The service rates, estimated using some training sets of tasks on our DCS testbed, are $\lambda_{d_1} = 0.16823$ tps, $\lambda_{d_2} = 0.49784$ tps, $\lambda_{d_3} = 0.25869$ tps, $\lambda_{d_4} = 0.25361$ tps, and $\lambda_{d_5} = 0.18356$ tps. In this scenario, the nodes dedicated solely to computing our tasks are the first, the fourth and the fifth, while the remaining two are non-dedicated nodes. The channel-dependent parameters, also estimated from data collected using our testbed, are listed in Table 1. For brevity, we provide only the minimum and the maximum values of the estimated mean arrival times of FN packets, namely, $\min_{j,k} \lambda^{F,-1}_{jk} = 0.343$ s and $\max_{j,k} \lambda^{F,-1}_{jk} = 7.727$ s, for $j, k \in \{1, \ldots, 5\}$.

TABLE 1
The parameters $a_{jk}$ and $b_{jk}$ used in the first-order approximation of the average transfer delay per task for the case of a five-node DCS.

a_jk    k=1     k=2     k=3     k=4     k=5
j=1     —       0.898   0.838   0.706   0.751
j=2     0.336   —       0.335   0.273   0.350
j=3     0.541   0.665   —       0.677   0.617
j=4     0.248   0.532   0.408   —       0.273
j=5     0.219   0.355   0.298   0.234   —

b_jk    k=1     k=2     k=3     k=4     k=5
j=1     —       1.970   2.219   2.000   2.199
j=2     1.651   —       1.993   1.876   1.667
j=3     5.001   4.997   —       5.203   5.557
j=4     4.131   7.604   5.862   —       7.604
j=5     3.009   2.887   2.731   2.943   —

Example 2: We devise and discuss three LB policies that share the same balancing criterion. The balancing criterion utilized by the policies is based upon the reliability of the nodes; thus, we have set the $\Lambda_j$ parameters in (1) to $\Lambda_j = \lambda^{-1}_{f_j}$ for $j = 1, \ldots, 5$. The three LB policies investigated are: (i) the Null LB policy, where all the LB gains employed by the policy are equal to zero; (ii) the Full LB policy, where all the LB gains are equal to one; and (iii) the Maximal-Service LB policy, where the LB gains are computed using the algorithm presented in Section 2.6. Note that the Null LB policy determines the service reliability inherently provided by the DCS, i.e., it defines the PSSW when no LB is performed by the nodes in the system. Therefore, the Null LB policy establishes the minimal PSSW that can be demanded of any effective LB policy acting on the DCS. The theoretical predictions obtained for the three LB policies under study, and for different initial task allocations, are listed in Table 2. In addition, the column labelled Experiments presents the results obtained after averaging 500 realizations of experiments conducted on our DCS testbed. The first five rows of Table 2 list the results for cases where the system is totally imbalanced. The sixth row presents the case of an initially uniform distribution of tasks. The seventh, eighth and ninth rows correspond to cases where tasks are initially allocated according to the reliability of the nodes, the processing

rate of the nodes, and a combination of the latter two parameters, respectively. Finally, the last row represents a case of an arbitrary task distribution. We can see from Table 2 that the Maximal-Service LB policy outperforms the other two policies in all the cases considered. We can also note that the Maximal-Service LB policy effectively increases the inherent service reliability provided by the DCS. This increase in the service reliability can be attributed mainly to two factors: (i) the Maximal-Service LB policy trades off network queuing times and node idle times by computing appropriate LB gains; and (ii) the Maximal-Service LB policy effectively exploits the extra balancing action provided by the backup system of a faulty node. To support these statements, we discuss a representative case from Example 2. Consider the case where all the tasks are queued at the fourth node (fourth row in Table 2). If no LB action is performed, then on average the fourth node fails at t = 200 s, by which time, on average, the following events have occurred in the DCS: (i) the second and third nodes have failed; (ii) the fourth node has been informed about the failures of the second and third nodes; and (iii) the fourth node has served 50 tasks. Upon the failure of the fourth node, its backup system evenly reallocates the remaining 100 tasks to the first and fifth nodes. So, we clearly note that the first and the fifth nodes have remained idle for long periods of time and, worse still, that the second and third nodes were never used to serve any task. On the contrary, if the Full LB policy is employed, then at $t_b = 0$ the fourth node decides to transfer 59, 1, 3, and 50 tasks to the first, second, third, and fifth nodes, respectively, while 37 tasks remain queued at the fourth node. As such, we can deduce from the discussion that the Full LB policy is advantageous over the Null LB policy, as evidenced by the PSSW shown in Table 2.
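The back-of-the-envelope figures in this discussion follow directly from the rates quoted earlier; a quick numerical check (rates from Example 2, node indices 1-based as in the text):

```python
# Average number of tasks node 4 serves before its mean failure time.
ld4 = 0.25361        # service rate of node 4, tasks/s
mean_fail4 = 200.0   # mean failure time of node 4, s
expected_served = ld4 * mean_fail4   # about 50.7, i.e., roughly 50 tasks

# Full LB policy transfers from node 4 (all 150 tasks initially there):
transfers = [59, 1, 3, 50]           # to nodes 1, 2, 3 and 5
remaining = 37                       # tasks kept at node 4
assert sum(transfers) + remaining == 150  # whole workload accounted for
```

The same bookkeeping explains why, on average, 100 unserved tasks remain at node 4 when it fails under the Null policy: 150 queued tasks minus the roughly 50 it manages to serve.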
Notably, the Maximal-Service LB policy makes an even better decision at the balancing instant by exploiting one extra mechanism. After executing the proposed algorithm, the policy obtains the following LB gains: $K^*_{41} = 0.661$, $K^*_{42} = K^*_{43} = 1$, and $K^*_{45} = 0.824$; this implies that the fourth node has to transfer 38, 1, 3, and 41 tasks to the first, second, third, and fifth nodes, respectively. Unlike under the Full LB policy, a total of 67 tasks remain queued at the fourth node. Note that by sending fewer tasks to the first and fifth nodes, the Maximal-Service LB policy reduces the idle time of these nodes as compared to the Full LB policy. In addition, we can note that on average: (i) the fourth node is able to process only 50 tasks before it fails; and (ii) by the average failure time of the fourth node, the first and fifth nodes are still busy serving the tasks reallocated at the balancing instant. Therefore, the task reallocation performed upon the failure of the fourth node does not introduce any idle time at the receiving nodes. It can be observed from Table 2 that caution must be exercised in selecting the amount of tasks to reallocate at the balancing time; otherwise, we can devise policies


TABLE 2
The PSSW under different LB policies. The balancing criterion utilized is based on the reliability of the nodes.

Initial load                      PSSW
(m1,...,m5)         Null     Full     Max-Service
                    Theo.    Theo.    Theo.    Exp.
(150,0,0,0,0)       0.0831   0.2287   0.2338   0.242
(0,150,0,0,0)       0.2514   0.2388   0.2853   0.296
(0,0,150,0,0)       0.2342   0.2095   0.2868   0.274
(0,0,0,150,0)       0.1990   0.2310   0.2915   0.268
(0,0,0,0,150)       0.1737   0.2472   0.2545   0.302
(30,30,30,30,30)    0.2964   0.2490   0.2953   0.298
(59,2,4,34,51)      0.2579   0.2579   0.2579   0.246
(18,55,29,27,21)    0.2943   0.2538   0.2943   0.302
(26,30,28,38,28)    0.3185   0.2555   0.3185   0.326
(40,15,40,35,20)    0.2722   0.2552   0.2860   0.280

that perform worse than taking no LB action! Consider for instance the case where all the tasks are queued at the second node. When the Full LB policy is employed, the policy determines that 59, 3, 34, and 51 tasks have to be transferred to the first, third, fourth, and fifth nodes. The three tasks that remain queued at the second node are, on average, served by the node. In addition, the average transfer-plus-service times of the tasks assigned to the third and fourth nodes are about 15 and 145 s, respectively. We now note that the inappropriate task reallocation performed by the Full LB policy forces the fourth node to remain idle for 55 s before it fails. If we perform a similar analysis for the Null LB policy, we can conclude that, due to the failures of the second and third nodes and the task exchanges performed by their backup systems, the fourth node is kept busy until it fails at t = 200 s, and its average idle time is only 24 s. Therefore, it can be concluded that the Full LB policy performs worse than the Null LB policy. In light of the previous discussions, we can now comprehend the counterintuitive behavior shown in Table 2. It can be noted that, for cases where all the load is queued at a single node and no LB action is taken, the service reliability is better when tasks are initially allocated at the less reliable nodes. This is justified because, by initially allocating the workload at less reliable nodes, we can exploit both the computing power of the unreliable servers and the task-reallocation action executed by the backup systems of the faulty nodes. Example 3: We now study the effect of the selection of various balancing criteria on the PSSW. We have considered three LB policies, each having a different balancing criterion but sharing the same algorithm to compute the LB gains. The Maximal-Service LB policy balances the DCS according to the reliability of the nodes.
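This criterion and the two defined next (processing rate, and the combined criterion) can be compared numerically. A sketch using the rates of the five-node scenario; note that how the $\Lambda_j$'s enter the partition rule is defined by Eq. (1) in the text, not here:

```python
# Balancing-criterion parameters Lambda_j for the five-node DCS.
lf = [1 / 350, 1 / 10, 1 / 20, 1 / 200, 1 / 300]    # failure rates, 1/s
ld = [0.16823, 0.49784, 0.25869, 0.25361, 0.18356]  # service rates, tps

max_service = [1.0 / f for f in lf]   # Maximal-Service: Lambda_j = 1/lambda_fj
proc_speed = list(ld)                 # Processing-Speed: Lambda_j = lambda_dj
total_lf = sum(lf)
# Complete: Lambda_j = lambda_dj * (1 - lambda_fj / sum_k lambda_fk)
complete = [d * (1 - f / total_lf) for d, f in zip(ld, lf)]
```

The three vectors make the tension visible: node 2 dominates under the processing-rate criterion yet ranks last under the reliability criterion, which is exactly why the Processing-Speed policy can over-commit tasks to fast but failure-prone servers.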
The Processing-Speed LB policy balances the DCS based upon the processing rate of the nodes, i.e., $\Lambda_j = \lambda_{d_j}$ in (1). Finally, the Complete LB policy uses a balancing criterion that combines both processing and failure rates; specifically, the Complete LB policy defines $\Lambda_j = \lambda_{d_j}\bigl(1 - \lambda_{f_j}\bigl(\sum_{k=1}^{n} \lambda_{f_k}\bigr)^{-1}\bigr)$. Additionally, we have

conducted an MC-based exhaustive search over the LB gains in order to estimate the optimal probability of success for each case considered. The results of our evaluations are listed in Table 3. Note that the fastest servers in this example are also the least reliable ones. Consequently, the balancing criterion employed by the Processing-Speed LB policy appears inappropriate for maximizing the PSSW. However, it can be seen from Table 3 that, in most of the cases, the three policies achieve approximately the same performance, which shows the strength of our approach. For example, in the case when all the tasks are initially queued at the fourth node, the Processing-Speed LB policy dictates that 54 tasks have to be transferred to the second node. However, the LB gain computed by our algorithm reduces that amount to only 11 tasks. Finally, it can be seen from Table 3 that the PSSW achieved by the policies is within 70% of the optimal probability of success in each case; in fact, the optimum is achieved in some cases.

TABLE 3
The PSSW achieved by three LB policies, which have different balancing criteria. For comparison purposes, we list the optimal value obtained for each case.

Initial load                       PSSW
(m1,...,m5)         Max-Service  Proc-Speed  Complete  Optimum
(150,0,0,0,0)       0.2338       0.1845      0.2223    0.2632
(0,150,0,0,0)       0.2853       0.2908      0.2653    0.2908
(0,0,150,0,0)       0.2868       0.2678      0.2760    0.2867
(0,0,0,150,0)       0.2915       0.2965      0.2978    0.2978
(0,0,0,0,150)       0.2545       0.2940      0.3125    0.3125
(30,30,30,30,30)    0.2953       0.3105      0.3045    0.3170
(59,2,4,34,51)      0.2579       0.2583      0.2868    0.3183
(18,55,29,27,21)    0.2943       0.3098      0.3053    0.3148
(26,30,28,38,28)    0.3185       0.2978      0.3040    0.3185
(40,15,40,35,20)    0.2860       0.2873      0.2845    0.2963

4 CONCLUSIONS

We have undertaken a novel approach to analyze the stochastic dynamics and the service reliability of DCSs in the presence of communication uncertainty. We have rigorously modeled the random service time taken by a DCS to serve a collection of tasks queued at its nodes. Our model takes into account the heterogeneity of the computing resources, the random communication and transfer delays in the network, and the uncertainty associated with the number of functional nodes in the DCS. We have introduced three fundamental random vectors to track the underlying point processes associated with the DCS. At any given time, these vectors store information about the task distribution among the nodes, the available functional nodes, and the tasks in the communication network. A regeneration argument has been established, yielding a set of coupled recurrence equations that analytically characterize the PSSW. Our approach can easily be generalized to calculate other performance metrics, such as computing speed-up and the sojourn time of workloads.


Based on our model, we have devised an optimal DLB policy for maximizing the service reliability of a two-node DCS and used it to develop an effective algorithm for devising DLB policies in multi-node DCSs. We have evaluated the performance of such policies and observed that they effectively improve the inherent service reliability of the DCS. We have also developed a small-scale DCS testbed to conduct real-time experiments. In general, we have found that the experiments confirm our theoretical predictions as well as our MC simulations. Through experimentation, we have also observed that the computational overhead introduced by our algorithm, which is mainly due to the calculations associated with the recurrence equations, is negligible compared to the time to serve the tasks. Our theory enables us to understand the effectiveness of task reallocation in delay-infested DCSs while offering an algorithm for generating task-reallocation policies that maximize the PSSW. The interplay between the task transfer time and the idle time of the nodes has been discussed, and we have noted that the service reliability can be improved if the idle times of the nodes are reduced as much as possible. The LB gain parameter plays a key role in this regard because it trades off idle times against task transfer times. Future work will consider relaxing the Poisson assumption on the random task-transfer and task-execution times. To this end, we will undertake an age-dependent regeneration-based approach, for which an auxiliary "age variable" is introduced in the analysis.

APPENDIX

For clarity, we first introduce some useful definitions and then present Lemmas 1–4, which will be used in the proof of Theorem 1. Recall the definition of the regeneration time and note that it can be written precisely as $\tau \triangleq \min\bigl(\min_k(W_{k1}), \min_k(Y_k), \min_{j \neq k}(X_{jk}), \min_{j \neq k}(X^F_{jk}), \min_{k,i}(Z_{ki})\bigr)$. Note that in light of Assumptions A1, A2, and Convention C1, it is straightforward to see that $\tau$ is an exponentially distributed random variable. For the DCS emerging at the regeneration time $\tau$, let the random times (all measured from $\tau$) $W'_{ki}$, $Y'_k$, $X'_{jk}$, $X^{F\prime}_{jk}$ and $Z'_{ik}$, respectively, be the service time of the ith task at the kth node, the failure time of the kth node, the arrival time of the QI sent from the jth node to the kth node, the arrival time of the FN sent from the jth node to the kth node, and the arrival time of the ith group of tasks sent to the kth node. In addition, on $\{\tau \le t_b\}$, we define $T'^{I,F}_{m_1,m_2}(t_b; C)$ as the time taken by the new DCS emerging at $\tau$ to serve all the tasks in the system if LB is performed by all functioning nodes at time $t_b$, provided that the system condition at $t = \tau$ is specified by $I$, $F$, $C$ while $m_j \ge 0$ tasks ($j = 1, 2$) are queued at the jth node. To prove that the DCS is regenerated upon the occurrence of $\{\tau = s\}$, it suffices to show that the conditional distributions of $W'_{ki}$, $Y'_k$, $X'_{jk}$, $X^{F\prime}_{jk}$ and $Z'_{ki}$, given that the event $\{\tau = s\}$ has occurred, satisfy Assumptions A1 and A2.

Lemma 1: For $s \le t_b$, $P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = W_{11}\} = P\{T^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b - s; C_{0,0}) < \infty\}$.

Proof: Note that the regeneration event $\{\tau = s, \tau = W_{11}\}$ is precisely service to the first task at the first node before any other activity takes place in the DCS. Thus, no FN has been sent and no task redistribution (required only upon failure) has been made in the DCS that emerges prior to time $t = s$. Therefore, according to Convention C1, we obtain $P\{X^{F\prime}_{jk} = \infty \mid \tau = s, \tau = W_{11}\} = 1$ and $P\{Z'_{ki} = \infty \mid \tau = s, \tau = W_{11}\} = 1$. Next, upon the occurrence of $\{\tau = s, \tau = W_{11}\}$, the state vectors remain the same, i.e., $I = (10,01)$, $F = (11,11)$ and $C = ([0],[0])$, while $m_1 - 1$ tasks are now queued at the first node and $m_2$ are queued at the second node. Therefore, by construction, $P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = W_{11}\} = P\{\tau + T'^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = W_{11}\}$. The proof is complete once we establish that $P\{T'^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = W_{11}\} = P\{T^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b - s; C_{0,0}) < \infty\}$. Next, by definition, $W'_{k1} = W_{k1} - \tau$, $Y'_k = Y_k - \tau$ and $X'_{jk} = X_{jk} - \tau$ for $k, j \in \{1,2\}$, $j \neq k$. Moreover, it is shown below in the Appendix that the conditional distribution of $W'_{21}$ is

$$P\{W'_{21} \le t \mid \tau = s, \tau = W_{11}\} = (1 - e^{-\lambda_{d_2} t})\, u(t), \qquad (16)$$

where $u(\cdot)$ is the unit step function. Similarly, we get $P\{Y'_k \le t \mid \tau = s, \tau = W_{11}\} = (1 - e^{-\lambda_{f_k} t}) u(t)$ and $P\{X'_{jk} \le t \mid \tau = s, \tau = W_{11}\} = (1 - e^{-\lambda_{jk} t}) u(t)$ for $j \neq k$, $j, k \in \{1, 2\}$, while for all $i \ge 2$, $W'_{ki}$ and $W_{ki}$ have identical distributions. Therefore, conditional upon the occurrence of $\{\tau = s, \tau = W_{11}\}$, all random times of the newly emerging DCS satisfy Assumption A1. The conditional independence of $W'_{21}$ and $Y'_1$ is proved later in this Appendix. Similarly, it can also be shown that, conditional upon the occurrence of $\{\tau = s, \tau = W_{11}\}$, the times $W'_{ki}$, $Y'_k$, and $X'_{jk}$ are mutually independent. Therefore, upon the occurrence of $\{\tau = s, \tau = W_{11}\}$, all random times of the emerging DCS satisfy Assumption A2. In summary, we have shown that, conditional on the occurrence of $\{\tau = s, \tau = W_{11}\}$, the random times characterizing the DCS at time $s$ satisfy Assumptions A1 and A2. Therefore, by shifting the time origin from $t = 0$ to $t = s$, we can think of the emergent DCS as the original system but with $m_1 - 1$ tasks in the queue of the first node, while the other initial conditions of the system remain the same. In addition, due to the shift of origin, the LB instant is now at $t_b - s$ units of time from the new origin. Hence, we conclude that $P\{T'^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = W_{11}\} = P\{T^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b - s; C_{0,0}) < \infty\}$, which completes the proof of Lemma 1. □

Lemmas 2 and 3 are presented without proof as they follow similar principles as those of Lemma 1.

Lemma 2: For $s \le t_b$, $P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = X_{21}\} = P\{T^{(11,01),F_{3,3}}_{m_1,m_2}(t_b - s; C_{0,0}) < \infty\}$.

Lemma 3: For $s \le t_b$, $P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = Y_1\} = P\{T^{(10,01),F_{1,3}}_{0,m_2}(t_b - s; C_{0,m_1}) < \infty\}$.

Lemma 4: $P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau > t_b\} = P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(0; C_{0,0}) < \infty\}$.

Proof: The occurrence of the event $\{\tau > t_b\}$ implies that the system condition of the DCS at time $t_b$ is exactly the same as the initial system condition of the original system. For $\{\tau > t_b\}$, let $T''^{I,F}_{m_1,\ldots,m_n}(t_b; C)$ be the time taken by the new DCS emerging at $t_b$ to serve all tasks if LB is performed by all functioning nodes at time $t_b$, provided that the system condition at $t = t_b$ is specified by $I$, $F$, $C$ while $m_k \ge 0$ tasks are queued at the kth node. Therefore, by definition, $P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau > t_b\} = P\{t_b + T''^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau > t_b\}$. Let the random times characterizing the DCS emerging at $t_b$ be $W''_{ki}$, $Y''_k$, $X''_{jk}$, $X^{F\prime\prime}_{jk}$ and $Z''_{ik}$, all measured from $t_b$. Note that we have maintained the same symbols (with the addition of double primes) to characterize the same random times. Clearly, no FN has been sent and no task redistribution has been made in the DCS that emerges prior to time $t = t_b$. Therefore, based on Convention C1, $P\{X^{F\prime\prime}_{jk} = \infty \mid \tau > t_b\} = 1$ and $P\{Z''_{ki} = \infty \mid \tau > t_b\} = 1$. For the ith task of the kth node, $W''_{ki} = W_{ki} - t_b$, and for $k, j \in \{1,2\}$, $j \neq k$, $Y''_k = Y_k - t_b$ and $X''_{jk} = X_{jk} - t_b$. Based on Assumptions A1 and A2, it is straightforward to show that $P\{W''_{ki} \le t \mid \tau > t_b\} = (1 - e^{-\lambda_{d_k} t}) u(t)$ and $P\{W''_{ki} \le t_1, Y''_k \le t_2 \mid \tau > t_b\} = P\{W''_{ki} \le t_1 \mid \tau > t_b\} P\{Y''_k \le t_2 \mid \tau > t_b\}$. Similarly, conditional on the occurrence of $\{\tau > t_b\}$, the conditional distributions of $W''_{ki}$, $Y''_k$, $X''_{jk}$, $X^{F\prime\prime}_{jk}$ and $Z''_{ki}$ can be shown to satisfy A1 and A2. Consequently, nothing has changed in the initial condition or in the statistics of the random times characterizing the DCS while $t_b$ units of time have elapsed. Therefore, we can shift the origin by $t_b$ units of time, which places the LB instant at $t = 0$ for the new DCS. Thus, $P\{t_b + T''^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau > t_b\} = P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(0; C_{0,0}) < \infty\}$. □

Proof of Theorem 1: We begin by proving Equation (2) for the case $I = (10,01)$. Conditioning the PSSW on the regeneration time we have that
$$P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty\} = \int_0^{t_b} f_\tau(s)\, P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s\}\, ds + \int_{t_b}^{\infty} f_\tau(s)\, P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s\}\, ds. \quad (17)$$

Moreover, we can further condition the first integrand on the right side of (17) on all the possible, disjoint regeneration events occurring at the time $\tau = s$ as
$$P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s\} = \sum_{k=1}^{2} P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = W_{k1}\}\, P\{\tau = W_{k1} \mid \tau = s\} + \sum_{k=1}^{2}\sum_{j \neq k} P\{\tau = X_{jk} \mid \tau = s\}\, P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = X_{jk}\} + \sum_{k=1}^{2} P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s, \tau = Y_k\}\, P\{\tau = Y_k \mid \tau = s\}. \quad (18)$$

Also, note that $\int_{t_b}^{\infty} P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau = s\}\, f_\tau(s)\, ds = P\{T^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) < \infty \mid \tau > t_b\}\, P\{\tau > t_b\}$. We now apply Lemmas 1–3 to (18), Lemma 4 to the latter result, and substitute in (17) to obtain
$$R^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) = \int_0^{t_b} \Big( P\{\tau = W_{11} \mid \tau = s\}\, R^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b - s; C_{0,0}) + P\{\tau = W_{21} \mid \tau = s\}\, R^{(10,01),F_{3,3}}_{m_1,m_2-1}(t_b - s; C_{0,0}) + P\{\tau = X_{12} \mid \tau = s\}\, R^{(10,11),F_{3,3}}_{m_1,m_2}(t_b - s; C_{0,0}) + P\{\tau = X_{21} \mid \tau = s\}\, R^{(11,01),F_{3,3}}_{m_1,m_2}(t_b - s; C_{0,0}) + P\{\tau = Y_1 \mid \tau = s\}\, R^{(10,01),F_{1,3}}_{0,m_2}(t_b - s; C_{0,m_1}) + P\{\tau = Y_2 \mid \tau = s\}\, R^{(10,01),F_{3,2}}_{m_1,0}(t_b - s; C_{m_2,0}) \Big)\, f_\tau(s)\, ds + P\{\tau > t_b\}\, R^{(10,01),F_{3,3}}_{m_1,m_2}(0; C_{0,0}).$$

Next, observe that in conjunction with A1, A2, and C1, it is straightforward to show that $f_\tau(t) = \lambda e^{-\lambda t} u(t)$, where $\lambda = \sum_{k=1}^{2} \big(\lambda_{d_k} + \lambda_{f_k} + \sum_{j \neq k} \lambda_{jk}\big)$. Moreover, using basic concepts from probability theory, we can show that $P\{\tau = W_{k1} \mid \tau = s\} = \lambda_{d_k}\lambda^{-1}$, $P\{\tau = X_{jk} \mid \tau = s\} = \lambda_{jk}\lambda^{-1}$, and $P\{\tau = Y_k \mid \tau = s\} = \lambda_{f_k}\lambda^{-1}$. Therefore, the last equation becomes
$$R^{(10,01),F_{3,3}}_{m_1,m_2}(t_b; C_{0,0}) = e^{-\lambda t_b}\, R^{(10,01),F_{3,3}}_{m_1,m_2}(0; C_{0,0}) + \int_0^{t_b} \Big( \lambda_{d_1} R^{(10,01),F_{3,3}}_{m_1-1,m_2}(t_b - s; C_{0,0}) + \lambda_{d_2} R^{(10,01),F_{3,3}}_{m_1,m_2-1}(t_b - s; C_{0,0}) + \lambda_{12} R^{(10,11),F_{3,3}}_{m_1,m_2}(t_b - s; C_{0,0}) + \lambda_{21} R^{(11,01),F_{3,3}}_{m_1,m_2}(t_b - s; C_{0,0}) + \lambda_{f_1} R^{(10,01),F_{1,3}}_{0,m_2}(t_b - s; C_{0,m_1}) + \lambda_{f_2} R^{(10,01),F_{3,2}}_{m_1,0}(t_b - s; C_{m_2,0}) \Big) e^{-\lambda s}\, ds. \quad (19)$$
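The passage from (18) to (19) rests on two standard facts about a race of independent exponential clocks: the minimum $\tau$ is exponential with rate $\lambda$ equal to the sum of the individual rates, and each clock wins the race with probability proportional to its rate, independently of the value of $\tau$. A short Monte Carlo sketch (the six rates below are arbitrary illustrative values, not parameters from the paper's testbed) checks both facts:

```python
import random

# Race among the six exponential clocks of the two-node model:
# service completions (W11, W21), transfer arrivals (X12, X21),
# and failures (Y1, Y2).  Rates are illustrative only.
random.seed(1)
rates = {"W11": 1.0, "W21": 0.5, "X12": 0.2, "X21": 0.3, "Y1": 0.1, "Y2": 0.4}
lam = sum(rates.values())  # total rate of the race

n = 200_000
wins = {k: 0 for k in rates}
tau_sum = 0.0
for _ in range(n):
    draws = {k: random.expovariate(r) for k, r in rates.items()}
    winner = min(draws, key=draws.get)  # which clock fires first
    wins[winner] += 1
    tau_sum += draws[winner]            # realized value of tau

# Empirical check of f_tau(t) = lam * exp(-lam * t) via E[tau] = 1/lam,
# and of P{tau = W11} = lambda_d1 / lambda.
assert abs(tau_sum / n - 1 / lam) < 0.01
assert abs(wins["W11"] / n - rates["W11"] / lam) < 0.01
```

These are exactly the constants $\lambda_{d_k}\lambda^{-1}$, $\lambda_{jk}\lambda^{-1}$, and $\lambda_{f_k}\lambda^{-1}$ that replace the conditional probabilities inside the integral of (19).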

Finally, by differentiating (19) with respect to $t_b$ and rearranging terms we obtain (2). Proofs of Equations (2)–(4) for any given $I$ can be achieved in a similar fashion. Next, recall that according to the LB policy, no LB action is taken by a faulty node, say the $k$th faulty node, whereby the information state vector $i_k$ becomes redundant and does not affect the PSSW. This observation leads to the proof of (5) and (6). Also, note that in the case of a two-node model, whenever a failure is detected, viz., $F = (10,10)$ or $F = (01,01)$, no LB action is taken by the only remaining functional node. Consequently, $t_b$ and $I$ have no role in determining the success probability in these cases (but we do not drop $t_b$ and $I$ from the notation just to maintain consistency). Therefore, we obtain Equations (7)–(10) by exploiting the following conditional probabilities, which can be proved similarly to Lemmas 1–4: $P\{T^{I,F_{1,1}}_{0,m_2}(t_b; C_{0,m_1}) < \infty \mid \tau = W_{21}\} = P\{T^{I,F_{1,1}}_{0,m_2-1}(t_b; C_{0,m_1}) < \infty\}$, $P\{T^{I,F_{1,1}}_{0,m_2}(t_b; C_{0,m_1}) < \infty \mid \tau = Z_{21}\} = P\{T^{I,F_{1,1}}_{0,m_1+m_2}(t_b; C_{0,0}) < \infty\}$, and $P\{T^{I,F_{1,1}}_{0,m_2}(t_b; C_{0,m_1}) < \infty \mid \tau = Y_2\} = 0$. This completes the proof of Theorem 1. □
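The recursive structure established by Theorem 1 lends itself to direct numerical evaluation. The sketch below exercises a deliberately simplified instance, not the full model of the paper: two nodes with no task reallocation and no transfer delays, with hypothetical rates, so that each regeneration event is either a service completion or a failure at a busy node and a failure loses that node's queue. Under this simplification the regeneration recursion admits the closed form $R(m_1,m_2) = p_1^{m_1} p_2^{m_2}$ with $p_k = \lambda_{d_k}/(\lambda_{d_k} + \lambda_{f_k})$, which serves as a sanity check:

```python
from functools import lru_cache

# Illustrative rates (not from the paper's testbed).
lam_d = {1: 2.0, 2: 1.5}  # service rates
lam_f = {1: 0.2, 2: 0.5}  # failure rates

@lru_cache(maxsize=None)
def R(m1, m2):
    """PSSW under the simplified model: probability that both queues
    drain before their respective nodes fail."""
    if m1 == 0 and m2 == 0:
        return 1.0
    # Only busy nodes contribute clocks to the regeneration race.
    total = sum(lam_d[k] + lam_f[k] for k, m in ((1, m1), (2, m2)) if m > 0)
    r = 0.0
    if m1 > 0:
        r += (lam_d[1] / total) * R(m1 - 1, m2)  # node 1 serves a task
    if m2 > 0:
        r += (lam_d[2] / total) * R(m1, m2 - 1)  # node 2 serves a task
    # A failure event contributes probability 0: queued tasks are lost.
    return r

p1 = lam_d[1] / (lam_d[1] + lam_f[1])
p2 = lam_d[2] / (lam_d[2] + lam_f[2])
assert abs(R(2, 3) - p1**2 * p2**3) < 1e-12  # matches the closed form
```

In the full model of Theorem 1, the recursion additionally branches on transfer events and carries the $t_b$-dependent convolution of (19), but the memoized traversal of the state space is the same idea.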

Proof of Equation (16): Let us look at the conditional distribution of $W'_{21}$: $P\{W'_{21} \le t \mid \tau = s, \tau = W_{11}\} = P\{W_{21} - \tau \le t \mid \tau = s, \tau = W_{11}\} = P\{W_{21} \le t + s \mid \tau = s, \tau = W_{11}\}$. Note that $\{\tau = s, \tau = W_{11}\} = \{W_{11} = s, W_{21} > s, Y_1 > s, Y_2 > s, X_{12} > s, X_{21} > s\}$. Therefore, the latter equation becomes $P\{W'_{21} \le t \mid \tau = s, \tau = W_{11}\} = P\{W_{21} \le t + s \mid W_{11} = s, W_{21} > s, Y_1 > s, Y_2 > s, X_{21} > s, X_{12} > s\}$. Exploiting Assumption A2, we obtain $P\{W'_{21} \le t \mid \tau = s, \tau = W_{11}\} = P\{W_{21} \le t + s \mid W_{21} > s\} = (1 - e^{-\lambda_{d_2} t})u(t)$. □

Proof of the conditional independence of $W'_{21}$ and $Y'_1$: Recall that $W'_{21} = W_{21} - \tau$ and $Y'_1 = Y_1 - \tau$. Therefore, for any $t_1, t_2 \in (-\infty, \infty)$,
$$P\{W'_{21} \le t_1, Y'_1 \le t_2 \mid \tau = s, \tau = W_{11}\} = P\{W_{21} \le t_1 + s, Y_1 \le t_2 + s \mid \tau = s, \tau = W_{11}\} = P\{W_{21} \le t_1 + s, Y_1 \le t_2 + s \mid W_{11} = s, W_{21} > s, Y_1 > s, Y_2 > s, X_{21} > s, X_{12} > s\} = P\{W_{21} \le t_1 + s, Y_1 \le t_2 + s \mid W_{21} > s, Y_1 > s\} = P\{W_{21} \le t_1 + s, Y_1 \le t_2 + s, W_{21} > s, Y_1 > s\}\big(P\{W_{21} > s, Y_1 > s\}\big)^{-1} = P\{s < W_{21} \le t_1 + s\}\big(P\{W_{21} > s\}\big)^{-1}\, P\{s < Y_1 \le t_2 + s\}\big(P\{Y_1 > s\}\big)^{-1},$$
where the last equality follows from Assumption A2: $W_{21}$ and $Y_1$ are mutually independent. Therefore, we get $P\{W'_{21} \le t_1, Y'_1 \le t_2 \mid \tau = s, \tau = W_{11}\} = P\{W'_{21} \le t_1 \mid W_{21} > s\}\, P\{Y'_1 \le t_2 \mid Y_1 > s\}$, which concludes the proof by noting that $P\{W'_{21} \le t_1 \mid \tau = s, \tau = W_{11}\} = P\{W'_{21} \le t_1 \mid W_{21} > s\}$ and $P\{Y'_1 \le t_2 \mid \tau = s, \tau = W_{11}\} = P\{Y'_1 \le t_2 \mid Y_1 > s\}$. □
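Both results above can be checked empirically. The Monte Carlo sketch below (with illustrative rates, not the paper's) conditions on the event $\{\tau = W_{11}\}$ and verifies that the residual clock $W'_{21} = W_{21} - \tau$ has mean $1/\lambda_{d_2}$, as an $\mathrm{Exp}(\lambda_{d_2})$ variable should, and that $W'_{21}$ and $Y'_1$ are uncorrelated, consistent with their conditional independence:

```python
import random

random.seed(7)
# Illustrative rates for the six clocks of the two-node model.
lam_d1, lam_d2, lam_f1, lam_f2, lam_12, lam_21 = 1.0, 0.5, 0.2, 0.3, 0.4, 0.6

n = 200_000
res_w, res_y = [], []
for _ in range(n):
    w11 = random.expovariate(lam_d1)
    w21 = random.expovariate(lam_d2)
    y1 = random.expovariate(lam_f1)
    y2 = random.expovariate(lam_f2)
    x12 = random.expovariate(lam_12)
    x21 = random.expovariate(lam_21)
    tau = min(w11, w21, y1, y2, x12, x21)
    if tau == w11:  # keep only realizations of the event {tau = W11}
        res_w.append(w21 - tau)  # residual clock W'21
        res_y.append(y1 - tau)   # residual clock Y'1

m = len(res_w)
mean_w = sum(res_w) / m
mean_y = sum(res_y) / m
cov = sum((a - mean_w) * (b - mean_y) for a, b in zip(res_w, res_y)) / m

assert abs(mean_w - 1 / lam_d2) < 0.05  # W'21 behaves as Exp(lam_d2)
assert abs(cov) < 0.5                   # near-zero sample covariance
```

Zero covariance is of course weaker than independence, but together with the memorylessness check of the residual mean it illustrates the two properties the proofs establish analytically.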

ACKNOWLEDGMENT This work was supported by the Defense Threat Reduction Agency (Combating WMD Basic Research Program) and in part by the National Science Foundation (award ANI-0312611).

REFERENCES

[1] R. Shah, B. Veeravalli, and M. Misra, "On the design of adaptive and decentralized load balancing algorithms with load estimation for computational grid environments," IEEE Trans. Parallel and Dist. Systems, vol. 18, pp. 1675–1686, 2007.
[2] L. Tassiulas and A. Ephremides, "Stability properties of constrained queuing systems and scheduling policies for maximum throughput in multihop radio networks," IEEE Trans. Automatic Control, vol. 37, pp. 1936–1948, 1992.
[3] M. Neely, E. Modiano, and C. Rohrs, "Dynamic power allocation and routing for time varying wireless networks," in Proc. IEEE INFOCOM, San Francisco, CA, 2003.
[4] D. Burman and D. Smith, "A light traffic theorem for multi-server queues," Mathematics of Op. Research, vol. 8, pp. 15–25, 1983.
[5] G. Koole, P. Sparaggis, and D. Towsley, "Minimizing response times and queue lengths in systems of parallel queues," Journal of Applied Probability, vol. 36, pp. 1185–1193, 1999.
[6] L. Golubchik, J. Lui, and R. Muntz, "Chained declustering: Load balancing and robustness to skew and failures," in Proc. Research Issues on Data Engineering, 1992, pp. 88–95.
[7] A. Brandt and M. Brandt, "On a two-queue priority system with impatience and its application to a call center," Methodology and Computing in Applied Probability, vol. 1, pp. 191–210, 1999.
[8] S. Dhakal, M. Hayat, J. Pezoa, C. Yang, and D. Bader, "Dynamic load balancing in distributed systems in the presence of delays: A regeneration-theory approach," IEEE Trans. Parallel and Dist. Systems, vol. 18, pp. 485–497, 2007.
[9] S. Dhakal, M. Hayat, J. Pezoa, C. Abdallah, J. Birdwell, and J. Chiasson, "Load balancing in the presence of random node failure and recovery," in Proc. IEEE IPDPS '06, Rhodes, Greece, 2006.
[10] Y.-S. Dai and G. Levitin, "Optimal resource allocation for maximizing performance and reliability in tree-structured grid services," IEEE Trans. Reliability, vol. 56, pp. 444–453, 2007.
[11] Y.-S. Dai, G. Levitin, and K. Trivedi, "Performance and reliability of tree-structured grid services considering data dependence and failure correlation," IEEE Trans. Computers, vol. 56, pp. 925–936, 2007.
[12] G. Attiya and Y. Hamam, "Reliability oriented task allocation in heterogeneous distributed computing systems," in Proc. Ninth Int. Symp. Computers and Comms., 2004, pp. 68–73.
[13] C.-I. Chen, "Task allocation and reallocation for fault tolerance in multicomputer systems," IEEE Trans. Aerospace and Electronic Systems, vol. 30, pp. 1094–1104, 1994.
[14] V. Shestak, J. Smith, A. Maciejewski, and H. Siegel, "Stochastic robustness metric and its use for static resource allocations," Journal Parallel Dist. Computing, vol. 68, pp. 1157–1173, 2008.
[15] M. Trehel, C. Balayer, and A. Alloui, "Modeling load balancing inside groups using queuing theory," in Proc. 10th Int. Conf. on Parallel and Distributed Computing Systems, New Orleans, LA, 1997.
[16] C. Hui and S. Chanson, "Hydrodynamic load balancing," IEEE Trans. Parallel and Dist. Systems, vol. 10, pp. 1118–1137, 1999.
[17] Z. Lan, V. Taylor, and G. Bryan, "Dynamic load balancing for adaptive mesh refinement application," in Proc. ICPP '01, Valencia, Spain, 2001.
[18] G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors," Journal Parallel Dist. Computing, vol. 7, pp. 279–301, 1989.
[19] M. Hayat, S. Dhakal, C. Abdallah, J. Birdwell, and J. Chiasson, "Dynamic time delay models for load balancing. Part II: Stochastic analysis of the effect of delay uncertainty," in Advances in Time Delay Systems, ser. LNCS, vol. 38. Springer-Verlag, 2004, pp. 355–368.
[20] S. Dhakal, B. Paskaleva, M. Hayat, E. Schamiloglu, and C. Abdallah, "Dynamical discrete-time load balancing in distributed systems in the presence of time delays," in Proc. IEEE CDC '03, Maui, HI, 2003, pp. 5128–5134.
[21] T. Casavant and J. Kuhl, "A taxonomy of scheduling in general-purpose distributed computing systems," IEEE Trans. Soft. Eng., vol. 14, pp. 141–154, 1988.
[22] H. Lee, S. Chin, J. Lee, D. Lee, K. Chung, S. Jung, and H. Yu, "A resource manager for optimal resource selection and fault tolerance service in grids," in Proc. Int. Symp. Cluster Comp. and the Grid, Chicago, IL, 2004.
[23] M. Litzkow, M. Livny, and M. Mutka, "Condor – a hunter of idle workstations," in Proc. 8th Int. Conf. Dist. Comp. Systems, 1988, pp. 104–111.
[24] E. Gelenbe, D. Finkel, and S. K. Tripathi, "On the availability of a distributed computer system with failing components," ACM SIGMETRICS Perf. Evaluation Review, vol. 13, pp. 6–13, 1985.
[25] R. Sheahan, L. Lipsky, and P. Fiorini, "The effect of different failure recovery procedures on the distribution of task completion times," in Proc. IEEE DPDNS '05, Denver, CO, 2005.
[26] M. Neuts and D. Lucantoni, "A Markovian queue with N servers subject to breakdowns and repairs," Management Science, vol. 25, pp. 849–861, 1979.
[27] J. Palmer and I. Mitrani, "Empirical and analytical evaluation of systems with multiple unreliable servers," in Proc. Int. Conf. Dependable Syst. and Nets., Philadelphia, PA, 2006, pp. 517–525.


[28] S. Shatz and J.-P. Wang, "Models & algorithms for reliability-oriented task-allocation in redundant distributed-computer systems," IEEE Trans. Reliability, vol. 38, pp. 16–27, 1989.
[29] V. Ravi, B. Murty, and J. Reddy, "Nonequilibrium simulated-annealing algorithm applied to reliability optimization of complex systems," IEEE Trans. Reliability, vol. 46, pp. 233–239, 1997.
[30] S. Srinivasan and N. Jha, "Safety and reliability driven task allocation in distributed systems," IEEE Trans. Parallel and Dist. Systems, vol. 10, pp. 238–251, 1999.
[31] D. Vidyarthi and A. Tripathi, "Maximizing reliability of a distributed computing system with task allocation using simple genetic algorithm," Journal of Systems Architecture, vol. 47, pp. 549–554, 2001.
[32] G. Attiya and Y. Hamam, "Task allocation for maximizing reliability of distributed systems: a simulated annealing approach," Journal Parallel and Dist. Computing, vol. 66, pp. 1259–1266, 2006.
[33] Y. Hamam and K. Hindi, "Assignment of program tasks to processors: a simulated annealing approach," European Journal of Operational Research, vol. 122, pp. 509–513, 2000.
[34] S. Dhakal, J. E. Pezoa, and M. Hayat, "A regeneration-based approach for resource allocation in cooperative distributed systems," in Proc. ICASSP, 2007, pp. III-1261–III-1264.