Load Balancing Process Allocation in Fault-Tolerant Multicomputers Heejo Lee, Jong Kim and Sunggu Lee Dept. of Computer Science and Engineering Pohang University of Science and Technology Pohang 790-784, KOREA Technical Report CS-95-001 January, 1995
Load Balancing Process Allocation in Fault-Tolerant Multicomputers
Jong Kim*, Heejo Lee*, and Sunggu Lee**
*Dept. of Computer Science and Engineering  **Dept. of Electrical Engineering
Pohang University of Science and Technology
San 31 Hyoja Dong, Pohang 790-784, Republic of KOREA
Phone : +(82)-562-279-2257  Fax : +(82)-562-279-2299
E-mail : {jkim, heejo, [email protected]
ABSTRACT
In this paper, we consider a load-balancing process allocation method for fault-tolerant multicomputer systems that balances the load before as well as after faults start to degrade the performance of the system. In order to be able to tolerate a single fault, each process (primary process) is duplicated (i.e., has a backup process). The backup process executes on a different processor from the primary, checkpointing the primary process and recovering the process if the primary process fails due to the occurrence of a fault. In this paper, we first formalize the problem of load-balancing process allocation and show that it is an NP-hard problem. Next, we propose a new heuristic process allocation method and analyze the performance of the proposed allocation method. Simulations are used to compare the proposed method with a process allocation method that does not take into account the different load characteristics of the primary and backup processes. While both methods perform well before the occurrence of a fault in a primary process, only the proposed method maintains a balanced load after the occurrence of such a fault.
Keywords: Backup process, checkpointing, fault-tolerant multicomputer, load balancing, process allocation.
This research was supported in part by the Korea Science and Engineering Foundation under Grant 941-0900055-2.
1 Introduction
The application areas of multi-computer systems are continuously expanding. At the same time, the role of the computer system in the application is becoming more important. In critical application areas, a computer failure can cause an enormous financial loss to the user or the service provider. Hence, it is necessary to have a fault-tolerant computing environment. One way to achieve fault-tolerance in multi-computer systems is by duplicating processes running on the system. A fault on one processor (node) only affects processes running on the faulty node. Duplicated processes running on other nodes can take over the role of the processes running on the faulty node. Process allocation is important for multi-computer systems since it affects the performance of the system. In multi-computer systems, a process is no longer a single independent job. Either it has message dependencies with other processes or it blocks the execution of other processes by allocating useful system resources such as cpu, memory, and disks. A careful allocation of processes can minimize the overhead of inter-process interactions. In a system with duplicated processes, process allocation affects the performance and reliability of the system [1-6]. Not only does the system's performance degrade because of the increase in the number of running processes, but the system's reliability also depends on the placement of duplicated processes in the system. Process allocation in fault-tolerant multi-computer systems has been studied by several researchers [1, 2, 6]. Nieuwenhuis [1] studied transformation rules which transform an allocation of non-duplicated processes into an allocation of duplicated processes. The transformed allocation is proven optimal in terms of reliability. Shatz and Wang [2] proposed a process allocation algorithm which maximizes the reliability of non-homogeneous systems. Bannister and Trivedi [6] proposed a process allocation algorithm which evenly distributes the load of
the system to all nodes.
A common assumption of all of the above research works is that each duplicated process is not only a complete replica of the original process, but also has the same execution load as the original. This kind of fault-tolerant process is called an active process replica [7]. The fault-tolerant computing process model considered in this paper is the primary-backup process model, which is commonly used in experimental and commercial distributed systems such as Delta-4 and Tandem [8-10]. In this model, there is a backup copy for each process in the system. However, only one process in a pair is running actively at any one time. The active process is called the primary process and the non-active process is called the backup (or secondary) process. The active process regularly checkpoints its running state to the backup process. During normal operation, the non-active backup process is either waiting for a checkpointing message or saving a received checkpointing message. When the node on which the primary process is running becomes faulty, the backup process takes over the role of the primary process. Thus, in order to be able to tolerate faults, primary and backup processes should not be executed on the same node. This kind of fault-tolerant process is called a passive process replica [9]. The difference between the computing model considered here and the models considered in other research is in the number and role of the backup processes. In the previous research, the number of backup processes (r) is given as r ≥ 1, whereas we use only one backup process (as is done in most commercial fault-tolerant systems). Also, it is assumed there that backup processes are exact copies of the primary process, so that the load of the backup process is exactly the same as that of the primary process, whereas the load of the backup process in our model is much less than that of the primary process, i.e., 5 to 10 percent of the load of the primary process.
In this paper, we study the problem of static load-balancing process allocation for fault-tolerant multi-computer systems. Process allocation is called static when all processes are allocated at the beginning by the system. Dynamic process allocation assumes that processes are allocated as soon as they arrive. In the general computing environment where the occurrence, load, and duration of processes cannot be predicted in advance, dynamic process allocation is more suitable than static process allocation since dynamic process allocation allocates processes as they arrive, taking into consideration the current load of each processor. Static process allocation can be used most effectively in those systems in which the computational requirement of each process is known beforehand. Examples of such systems include on-line transaction processing and real-time systems, in which most processes are running continuously or repeating in a periodic manner. The previously proposed process allocation algorithms [2, 6], which were proposed for the active replica computing environment, exhibit poor performance in the passive replica computing environment since they were designed either to have maximum reliability or to run with the assumption that the load of the backup process is exactly the same as that of the primary. To achieve load-balancing with the given computing model, we have to consider the distribution of primary and backup processes and the load increment to be added to each node in the event of a fault.
This paper is organized as follows. The next section introduces static task allocation and presents an overview of related research work. Section 3 presents mathematical formulations for the allocation problem. In Section 4, basic primitives for allocating processes are addressed and a heuristic process allocation algorithm is given to solve the problem defined in Section 3. Section 5 analyzes the expected performance of the proposed allocation algorithm and compares the performance of the algorithm with another allocation method. Finally, in Section 6, we summarize and discuss the significance of our results.
2 Scope and Related Work
The main objective of this paper is to develop an efficient process allocation algorithm for fault-tolerant multi-computer systems. The process allocation problem is different from the process scheduling (more commonly referred to as task scheduling) problem. In this paper, process allocation is used to refer to the allocation of processes in an on-line transaction environment in which processes are running continuously or repeatedly in a periodic or aperiodic manner. By contrast, process scheduling is used to allocate processes that execute only once. Moreover, it is assumed that there are constraining factors in the system, such as memory capacity or processor speed, in the allocation problem. Hence, the allocation problem is usually formulated as a constrained optimization problem. In contrast, the only constraining factor in the scheduling problem is the number of nodes.
Process allocation is called static when all processes are allocated at the beginning by the system. Dynamic process allocation assumes that processes are allocated as soon as they arrive. In the general computing environment where the occurrence, load, and duration of processes cannot be predicted in advance, dynamic process allocation is more suitable than static process allocation since dynamic process allocation allocates processes as they arrive, taking into consideration the current load of each processor. Static process allocation can be used most effectively in those systems in which the computational requirement of each process is known beforehand. Examples of such systems include on-line transaction processing and real-time systems, in which most processes are running continuously or repeating in a periodic manner.
Three approaches that have been used to solve the process allocation problem are graph theoretic, integer programming, and approximation methods [6]. In the graph theoretic approach, the problem is first represented by a graph and then solved using common graph
techniques. In the integer programming approach, the process allocation problem is transformed into an integer programming problem and then solved using branch-and-bound or enumeration techniques. The third approach, and the one adopted in this paper, is to solve the process allocation problem by finding a good approximation to the optimal solution.
Previously proposed allocation algorithms can be classified into two types depending on the objective of the algorithm. In the first type, the main objective of process allocation is to enhance the reliability of the system [1, 2]. Nieuwenhuis studied optimal allocation of replicated processes with respect to reliability [1]. The study shows a simple transformation rule which leads to an optimal allocation of replicated processes from an allocation of a given non-replicated system. Also, the study shows another transformation rule which generates an r-replicated system with fewer links and equal reliability. Shatz and Wang [2] also approached the process allocation problem with the objective of maximizing the reliability of non-homogeneous systems.
In the second type of allocation algorithm, processes are allocated to balance the load among nodes in the system. Bannister and Trivedi assumed in their research that there are replicated copies of each process [6]. Also, it is assumed that all replicas have the same load as their primaries. The proposed algorithm allocates each process to the node with the lightest load after sorting all processes in descending order. This method balances the load not only before, but also after the occurrence of a fault in a system with an active replica of each process. Balaji [11] proposed a process allocation method in which faults are tolerated by re-executing the processes running on the faulty node instead of having backup processes. In this method, the processes which should have been executed by the faulty node are reallocated to fault-free nodes. Although Balaji's method does not provide fast fault recovery, it has the advantages of low processing overhead and relatively easy allocation of processes. In [12, 13], the process allocation problem was studied with the objective of meeting deadlines in real-time fault-tolerant systems.
The main difference between the work presented in this paper and previously reported research is in the fault-tolerant process model. Most previously proposed primary-backup process allocation algorithms assume that the backup processes are exact replicas of their respective primary processes, with identical loads. This implies that backup processes have the same load as their primary processes before as well as after the failure of a primary process. This kind of backup process is referred to as an active replica. The biggest advantage of using active replicas is that there is no loss of computation when a fault occurs. The main disadvantage is that each active replica consumes the same amount of resources as its primary. With passive replicas, the overhead due to the backup processes is significantly reduced. In our research, we assume that the backup processes are passive replicas which have different load characteristics when compared to the primary processes. The Tandem Nonstop system [8, 10] is an example of a commercial system that uses the passive replica method for fault-tolerant computing.
3 Load-Balancing Process Allocation Problem
In this section, we formalize the load-balancing process allocation problem in fault-tolerant multi-computer systems. First, we present the fault-tolerant multi-computer system model considered in this paper and provide an informal description of the problem to be solved. Next, we describe the notation used in the paper. Finally, we formulate the problem mathematically.
[Figure 1: Primary-backup processes running in the fault-tolerant multi-computer system. Nodes 1 through n each hold a mix of primary processes (P_i, P_j, ...) and backup processes (B_i, B_j, ...) whose primaries reside on other nodes.]
3.1 System Model
The fault-tolerant multi-computer system considered in this paper consists of n nodes (processors). To tolerate a fault, each process is replicated and executed as a pair, referred to as a primary-backup process pair. In this paper, we shall consider the possibility of a single node failure only. However, the model can easily be extended to allow more than one fault by generating more backup processes. Primary processes can be allocated to any node. However, there is one restriction on the placement of backup processes: the primary and backup processes of a pair cannot be allocated to the same node. It is assumed that there are m primary processes running in the system and that the cpu loads of the primary and backup processes are known in advance. This assumption about the load requirement is not unrealistic since many on-line transaction systems run the same processes continuously [8]. Fig. 1 shows a fault-tolerant multi-computer system with processes running on the system.
The primary-backup process model considered in this paper is actually used in experimental and commercial systems [8-10] such as the Tandem Nonstop system [8]. In the Nonstop system, every process has an identical backup process which is allocated to a different node.
The backup process is not executed concurrently with the primary process, but is in an inactive mode, prepared to assume the function of the primary process in the event of a primary process failure. To guarantee that the backup process has the information necessary to execute the required function, the primary process sends periodic checkpoint messages to its backup process. In the fault-free situation, the cpu load of the backup processes is much less than that of their respective primaries. The actual load of a backup process is determined by the interval and number of messages checkpointed by the primary process. Each backup process has a different percentage of the primary process' cpu load. When a fault occurs, the backup processes of the primary processes which were running on the faulty node take over the role of the primary processes. The backup process is executed continuously starting from the last point at which it received a valid checkpointing message from its primary process. Therefore, the cpu load of the backup process becomes the same as that of its primary.
Load-balancing is required to utilize system resources evenly, thereby enhancing the performance of the system. It has been reported that the approximation algorithm proposed by Bannister and Trivedi [6] has a near-optimal load-balancing result if backup processes have the same cpu load as their primaries. However, in the situation where the cpu load of a backup process is different before and after the occurrence of a fault, the system does not maintain a balanced load after a fault as the cpu loads of the newly activated backup processes increase. The load-balancing process allocation problem considered here is to find a static process allocation algorithm which balances the cpu load of every node in the fault-free situation and also balances the cpu load when there is a fault in the system. The algorithm should consider the cpu load increment of each node in the event of a fault. In general, it is impossible for all nodes to have exactly the same cpu load. Thus, the quality of the load-balancing is measured either by the deviation from the average of all processors' loads or by the load difference between the node with the highest load and the node with the lightest load. As these
parameters approach zero, the system approaches a balanced load. We consider a process allocation to be optimal (with respect to load-balancing) if any change in the placement of processes leads to a larger value of the load-balancing measure, i.e., to a worse balance.
3.2 Notation The following notation is used to formulate the allocation problem.
n : number of nodes.
m : number of primary processes (m backup processes also exist).
p_i : load of primary process i.
b_i : load of backup process i.
x_ij : 1 if primary process i is allocated to node j, 0 otherwise.
y_ij : 1 if backup process i is allocated to node j, 0 otherwise.
α_j : 0 if node j has failed, 1 otherwise.
Δ_j : load increment to be added to node j when a fault occurs.
B_jk : the k-th group of backup processes of the primary processes assigned to node j.
G_jk : the sum of the load differences of the backup group B_jk.
P(j) : total load of node j.
P_fj : set of processor loads when node j has failed.
P_non : set of processor loads with no faulty nodes.
P_faulty : set of processor loads with one faulty node.
σ(P) : standard deviation of load set P.
δ(P) : difference between the maximum and the minimum load of load set P.
Φ : the objective cost function to be used.
3.3 Formal Problem Description
The load-balancing process allocation problem is represented as a constrained optimization problem. Let us assume that there are n nodes and m processes in the system. Because each primary process has a backup process, the total number of primary and backup processes is 2m. One prominent constraint is on the allocation of primary-backup process pairs. To be able to tolerate a fault in a node, a primary process and its backup should not be allocated to the same node. Thus,

    Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij = m,    Σ_{i=1}^{m} Σ_{j=1}^{n} y_ij = m,    Σ_{i=1}^{m} Σ_{j=1}^{n} x_ij y_ij = 0        (3.1)
At this point, let us define a set A with n+1 elements as A = {A_0, A_1, ..., A_n}. Each element A_i (0 ≤ i ≤ n) is a processor status vector of the form [α_1 α_2 ... α_n], where α_j ∈ {0, 1} for all 1 ≤ j ≤ n. A_0 is the all-1 vector (all nodes are fault-free) and A_k is the vector with α_k = 0 and α_i = 1 for i ≠ k (only node k is faulty). The load increment to be added to node j in the event of a fault in node k can be represented as Σ_{i=1}^{m} x_ik y_ij (p_i − b_i). Thus, the total load increment for node j (Δ_j) is

    Δ_j = Σ_{k=1}^{n} (1 − α_k) Σ_{i=1}^{m} x_ik y_ij (p_i − b_i)        (3.2)
The load of node j, denoted as P(j), is the sum of the load before the fault occurrence and the load increment (Δ_j) to be incurred upon the occurrence of a fault.

    P(j) = Δ_j + Σ_{i=1}^{m} (x_ij p_i + y_ij b_i)        (3.3)
The commonly used metric for evaluating load-balance between nodes is the standard deviation of the processor load [6]. The standard deviation (σ) of the load when node k is faulty (A_k) can be represented as follows:

    σ(A_k) = √( (1/(n−f)) Σ_{i=1}^{n} [ α_i P(i) − (1/(n−f)) Σ_{j=1}^{n} α_j P(j) ]² )        (3.4)
where f represents the number of faulty nodes. The second term in the brackets represents the average of P(j). The number of backup processes becomes less than m when a fault occurs; therefore, the first and second expressions in Eq. (3.1) are no longer valid when a fault occurs. In Eq. (3.4), the average load of the live nodes is computed by dividing the total load by the number of live nodes (n − f).
Another metric that can be used in the load-balancing process allocation problem is the load difference between the node with the heaviest load and the node with the lightest load. The load difference, denoted as δ, for a given set of processor loads P can be represented as follows:

    δ(P) = max(P) − min(P)

The set of processor loads before a failure is represented as P_non and the set of processor loads after any node failure is represented as P_faulty. In this paper, the load difference is used as the objective cost function for two reasons. First, the load difference reflects the goal of load-balancing more closely than the standard deviation. This is illustrated by the following two scenarios. In the first scenario, one node has an exceptionally large load and all others are slightly less loaded than the average of all nodes. In the second scenario, some nodes are somewhat more loaded than the average and other nodes are somewhat less loaded than the average. Suppose that the standard deviations of both scenarios are the same, while the load differences are different. In such a case, the latter scenario is preferable to the former, since there is one heavily overloaded node in the former scenario. The δ measure will indicate that the latter scenario is better. The second reason is that the standard deviation requires more computation than the load difference.
The load-balancing process allocation problem is the problem of finding values of x_ij and y_ij for all possible i and j which minimize the two cost functions δ(P_non) and δ(P_faulty) subject to the constraint given in Eq. (3.1). There are two main approaches to solving an optimization problem that involves multiple objective functions [14]. One approach is to solve the problem a number of times, once for each objective in turn. When solving the problem using one of the objective functions, the other objective functions are treated as constraints. The other approach is to build a suitable linear combination of all the objective functions and optimize the combined function. In this case, it is necessary to attach a weight to each objective function depending on its relative importance. In this paper, the second approach is used to formulate the load-balancing process allocation problem. We define a new objective function as

    Φ = w1 δ(P_non) + w2 δ(P_faulty),        (3.5)

where w1 and w2 are the relative weights of importance before and after a fault occurrence, respectively. If we assume that the weights have the same value, i.e., w1 = w2 = 1, the objective function is

    Φ = δ(P_non) + δ(P_faulty)        (3.6)
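To make these definitions concrete, the following Python sketch evaluates δ(P_non), δ(P_faulty), and Φ for a candidate allocation under the single-fault model of Section 3.1. It is only an illustration of the cost functions: the data layout, the function names, and the choice of taking the worst case over all single-node faults for δ(P_faulty) are assumptions, not part of the original formulation.

def delta(loads):
    # Load difference of a set of node loads: max(P) - min(P).
    return max(loads) - min(loads)

def node_loads(n, p, b, prim_node, back_node, faulty=None):
    # Per-node load of an allocation.
    #   p[i], b[i]                 : loads of primary and backup process i
    #   prim_node[i], back_node[i] : nodes holding the primary and the backup of pair i
    #   faulty                     : index of the failed node, or None for the fault-free case
    load = [0.0] * n
    for i in range(len(p)):
        if faulty is None or prim_node[i] != faulty:
            load[prim_node[i]] += p[i]      # primary still running
            load[back_node[i]] += b[i]      # backup only receives checkpoints
        else:
            load[back_node[i]] += p[i]      # backup has taken over the primary's load
    if faulty is not None:
        del load[faulty]                    # only live nodes are compared
    return load

def objective(n, p, b, prim_node, back_node, w1=1.0, w2=1.0):
    # Phi = w1 * delta(P_non) + w2 * delta(P_faulty), as in Eqs. (3.5)/(3.6);
    # delta(P_faulty) is taken here as the worst case over single-node faults.
    d_non = delta(node_loads(n, p, b, prim_node, back_node))
    d_faulty = max(delta(node_loads(n, p, b, prim_node, back_node, faulty=j))
                   for j in range(n))
    return w1 * d_non + w2 * d_faulty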
3.4 Complexity of Related Allocation Problems
The objective function of the process allocation problem, Φ, is defined as the sum of the load differences before and after the occurrence of a fault. An optimal allocation is an assignment of processes that minimizes the objective cost function Φ. In this subsection, we analyze the decision and optimization problems related to the load-balancing primary-backup process allocation problem.
Let us consider the following set assignment problem. Assume that there are n sets and each set has k elements, each with a positive weight. From these sets, k groups are formed by choosing one element from each set. The grouping problem, denoted as k-GP, is the problem of deciding whether there exists an assignment of elements such that the sum of the weights of the n elements of each group is the same. When k is two, we denote the grouping problem as 2-GP.
k-Grouping Problem (k-GP): Given n sets with k positively weighted elements in each set, does there exist an assignment of elements to k groups such that no two elements in the same set are assigned to the same group, and each group has the same weight sum?
Theorem 1: 2-GP is NP-Complete.
Proof: (i) 2-GP ∈ NP.
It can be easily verified that 2-GP is in NP. There is a nondeterministic algorithm that forms two groups such that no two elements from the same set are in the same group. Given two such groups, it can be decided in polynomial time whether both weight sums are equal. Thus, 2-GP is in NP.
(ii) PP ≤_p 2-GP.
We use the partition problem, a known NP-Complete problem [15], to show that 2-GP is NP-Complete. The partition problem, denoted as PP, is the decision problem that determines the existence of a subset A whose weight sum is equal to the weight sum of the subset S − A for a given set S. Let the set S have n elements, i.e., {a_1, a_2, .., a_n}. The set S is transformed to n sets of two elements {a_1, 0}, {a_2, 0}, .., {a_n, 0}. It can be easily verified that a solution of the PP problem for the given set S can be found if the 2-GP problem for the transformed sets has a solution. Then, since the above set transformation can be done in polynomial time, 2-GP is NP-Complete. □
Corollary 1: k-GP is NP-Complete.
Let us consider a variant of the grouping problem. When some elements of the n sets have a fixed assignment to certain groups, the problem of finding groups with equal weight sums is called the restricted grouping problem. We are especially interested in the case when there are n elements in each set and one element from each set is preassigned to a different group. We denote this as the restricted grouping problem (RGP).
Restricted Grouping Problem (RGP): Given n sets A_i = {a_{i1}, a_{i2}, ..., a_{in}}, i = 1..n, where a_{ii} is assigned to the group G_i, does there exist an assignment of elements to n groups, G_g = {a_{1g_1}, a_{2g_2}, .., a_{ng_n}}, g = 1..n, such that no two elements from the same set are assigned to the same group and each group has the same weight sum?
Theorem 2: RGP is NP-Complete.
Proof: (i) RGP ∈ NP.
It can be easily verified that RGP is in NP by using the same method as in the 2-GP problem.
(ii) 2-GP ≤_p RGP.
A 2-GP problem with n sets {a_11, a_12}, {a_21, a_22}, .., {a_n1, a_n2} is transformed to two RGP problems as follows:
a) {a_11, a_12, S/2, ..., S/2}, {a_21, a_22, 0, ..., 0}, ..., {a_n1, a_n2, 0, ..., 0}
b) {a_12, a_11, S/2, ..., S/2}, {a_21, a_22, 0, ..., 0}, ..., {a_n1, a_n2, 0, ..., 0}
where S = Σ_{i=1}^{n} Σ_{j=1}^{2} a_ij. Assume that the diagonal elements in RGP are preassigned to different groups. Hence, a_11 and a_22 of a) cannot be joined into the same group. If there is an assignment in which every group has the same weight sum, no non-zero element will be assigned to the groups which received an element S/2 from the first set A_1, because of the way S is defined. Note, however, that the two elements a_11 and a_22 of 2-GP can be assigned to the same group in 2-GP. Hence, we alternatively assign the diagonal element of the second set in 2-GP as indicated in case b). We can find a solution for a 2-GP problem if a solution to one of the two corresponding RGP problems can be found, and if there is a solution to a transformed RGP problem, then there is a corresponding solution to the 2-GP problem. Since the transformation from 2-GP to RGP is done in polynomial time, RGP is an NP-Complete problem. □
The problems of assigning elements of n sets to k groups have an analogy with the load-balancing process allocation problem. Consider n as the number of processes, k as the number of nodes, and the weight of each element as the load of each process. Then the grouping problem is a load-balancing process allocation problem. The problem of minimizing δ(P_non) is the optimization problem based on the k-GP decision problem. Thus, it follows that the load-balancing process allocation problem is an NP-hard problem. Hence, we propose a heuristic approximation algorithm which is cost-effective and results in a well balanced processor load before and after the occurrence of a fault.
4 Heuristic Process Allocation Algorithm
In this section, we first present basic primitives for process allocation which form the core of the algorithm. A heuristic algorithm is presented with an example. Then the complexity of the proposed algorithm is analyzed.
4.1 Basic Primitives of Process Allocation
In this section, it is assumed that processes are allocated one by one and that, once allocated, processes are not reallocated to other nodes. When we allocate processes one by one, there are three questions we have to consider. First, is it better to assign a process to a lightly loaded node rather than a heavily loaded node in order to minimize δ? Second, to minimize δ, should a process with a larger load be allocated before a process with a smaller load? Third, when there are two ordered sets with n elements and the two sets are combined into one set by adding one element from each set, does combining the highest element from one set with the lowest element from the other set, for all elements, minimize δ? The following lemmas address these questions one by one.
Lemma 1: If 0 ≤ P(1) ≤ P(2) ≤ .. ≤ P(n) and x > 0, then δ(P(1)+x, P(2), .., P(n)) ≤ δ(P(1), P(2), .., P(i)+x, .., P(n)) for all i such that 1 < i ≤ n.
Proof: Let us rewrite the two quantities:

    δ(P(1)+x, P(2), .., P(n)) = max(P(n), P(1)+x) − min(P(1)+x, P(2))
    δ(P(1), .., P(i)+x, .., P(n)) = max(P(n), P(i)+x) − P(1).

Since max(P(n), P(i)+x) ≥ max(P(n), P(1)+x) for any i > 1 and P(1) ≤ min(P(1)+x, P(2)), it follows that δ(P(1)+x, P(2), .., P(n)) ≤ δ(P(1), P(2), .., P(i)+x, .., P(n)). □

Lemma 1 implies that a process should be allocated to the node with the minimum load first in order to minimize δ.
The function Alloc(P, x) represents the allocation of x to the most lightly loaded node P(1) when P is an ordered set of n elements, i.e., P = {P(i) | P(i) ≤ P(j) for 1 ≤ i ≤ j ≤ n}:

    Alloc(P, x) = {P(1)+x, P(2), .., P(n)}
Lemma 2: For a given ordered set P of n elements and x ≥ y ≥ 0,

    δ(Alloc(Alloc(P, x), y)) ≤ δ(Alloc(Alloc(P, y), x)).

Proof: There are three cases depending on the allocation of x and y to P. These are i) P(1)+y ≤ P(1)+x ≤ P(2), ii) P(2) ≤ P(1)+y ≤ P(1)+x, and iii) P(1)+y ≤ P(2) ≤ P(1)+x. We show that, in each case, δ(Alloc(Alloc(P, x), y)) ≤ δ(Alloc(Alloc(P, y), x)).

Case 1: P(1)+y ≤ P(1)+x ≤ P(2)
In this case, both x and y are assigned to P(1), since P(1)+x is still the lowest load after the allocation of x. Hence, δ(Alloc(Alloc(P, x), y)) = δ(Alloc(Alloc(P, y), x)) = max(P(1)+x+y, P(n)) − min(P(1)+x+y, P(2)).

Case 2: P(2) ≤ P(1)+y ≤ P(1)+x
Let us find the maximum and minimum after the allocation of x and y.

    max(Alloc(Alloc(P, x), y)) = max(P(n), P(1)+x, P(2)+y)        (4.1)
    min(Alloc(Alloc(P, x), y)) = min(P(1)+x, P(2)+y, P(3))        (4.2)
    max(Alloc(Alloc(P, y), x)) = max(P(n), P(2)+x)                (4.3)
    min(Alloc(Alloc(P, y), x)) = min(P(1)+y, P(3))                (4.4)

If P(n) ≥ P(2)+x, then Eqs. (4.1) and (4.3) become P(n). If P(n) < P(2)+x, then Eq. (4.3) becomes P(2)+x, which is larger than any value in Eq. (4.1). Hence, max(P(2)+x, P(n)) ≥ max(P(n), P(1)+x, P(2)+y), which consequently means that max(Alloc(Alloc(P, y), x)) ≥ max(Alloc(Alloc(P, x), y)). Now, let us consider the minimum part of each allocation. If P(1)+y ≥ P(3), then Eqs. (4.2) and (4.4) become P(3). If P(1)+y < P(3), then Eq. (4.4) becomes P(1)+y, which is less than any value in Eq. (4.2). Hence, min(P(1)+y, P(3)) ≤ min(P(1)+x, P(2)+y, P(3)), which consequently means that min(Alloc(Alloc(P, y), x)) ≤ min(Alloc(Alloc(P, x), y)). Therefore, δ(Alloc(Alloc(P, x), y)) ≤ δ(Alloc(Alloc(P, y), x)).

Case 3: P(1)+y ≤ P(2) ≤ P(1)+x
The maximum and the minimum of Alloc(Alloc(P, x), y) are the same as in Eqs. (4.1) and (4.2). Hence, we only need to find the maximum and minimum of Alloc(Alloc(P, y), x).

    max(Alloc(Alloc(P, y), x)) = max(P(n), P(1)+x+y)        (4.5)
    min(Alloc(Alloc(P, y), x)) = min(P(2), P(1)+x+y)        (4.6)

Since P(2) ≤ P(1)+x ≤ P(1)+x+y, Eq. (4.6) becomes P(2). We classify this case in more detail depending on the values of P(n), P(2)+x, and P(1)+x+y.
i) P(n) ≥ P(2)+x
As discussed in Case 2, Eq. (4.1) becomes P(n), which is less than or equal to max(P(n), P(1)+x+y). Also, any value in Eq. (4.2) is larger than or equal to P(2). Therefore, δ(Alloc(Alloc(P, x), y)) ≤ δ(Alloc(Alloc(P, y), x)) when P(1)+x ≥ P(2), P(1)+y < P(2) and P(n) ≥ P(2)+x.
ii) P(2)+x > P(n) ≥ P(1)+x+y
Eqs. (4.1) and (4.5) become P(n), and min(P(1)+x, P(2)+y, P(3)) ≥ P(2). Therefore, δ(Alloc(Alloc(P, x), y)) ≤ δ(Alloc(Alloc(P, y), x)).
iii) P(n) < P(2)+x and P(n) < P(1)+x+y
Eq. (4.5) becomes P(1)+x+y, which is larger than or equal to max(P(n), P(1)+x, P(2)+y). Also, min(P(1)+x, P(2)+y, P(3)) is greater than or equal to P(2). Thus, δ(Alloc(Alloc(P, x), y)) ≤ δ(Alloc(Alloc(P, y), x)).
As shown in all cases, δ(Alloc(Alloc(P, x), y)) ≤ δ(Alloc(Alloc(P, y), x)). □
Lemma 2 implies that processes with larger loads should be allocated before processes with smaller loads in order to minimize δ.
Lemma 3: Given two ordered sets of n elements, P_1 and P_2, δ is maximally reduced when the two sets are merged by adding the element with the largest weight from one set to the element with the smallest weight from the other set, i.e., when the merged unordered set P is P = {P_1(1)+P_2(n), P_1(2)+P_2(n−1), .., P_1(n)+P_2(1)}. Then, δ(P) ≤ max(δ(P_1), δ(P_2)).
Proof: The first part follows from Lemmas 1 and 2. For the second part, δ(P_1) = P_1(n) − P_1(1) and δ(P_2) = P_2(n) − P_2(1). Without loss of generality, let us assume that P(i) = P_1(i) + P_2(n−i+1) has the maximum value. Then, since P_1 and P_2 are ordered sets, the minimum value of any merged term before P(i) is larger than or equal to P_2(n−i+1) + P_1(1), and the minimum value of any merged term after P(i) is larger than or equal to P_1(i) + P_2(1). Therefore, δ(P) ≤ P_1(i) + P_2(n−i+1) − min(P_2(n−i+1)+P_1(1), P_1(i)+P_2(1)). Hence, δ(P) ≤ max(P_1(i)−P_1(1), P_2(n−i+1)−P_2(1)) ≤ max(δ(P_1), δ(P_2)). □
The above lemma implies that merging the two ordered sets as proposed in the lemma does not increase the δ of the two ordered sets.
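As a quick numeric illustration of these primitives, the following Python sketch checks the claims of Lemmas 2 and 3 on made-up values; the Alloc helper and the sample numbers are assumptions chosen only for this example.

def alloc(P, x):
    # Alloc(P, x): add load x to the most lightly loaded node and keep the set ordered.
    P = sorted(P)
    P[0] += x
    return sorted(P)

def delta(P):
    return max(P) - min(P)

P = [3, 5, 8, 9]                 # illustrative node loads
x, y = 6, 2                      # x >= y >= 0

# Lemma 2: allocating the larger load first never yields a worse delta.
assert delta(alloc(alloc(P, x), y)) <= delta(alloc(alloc(P, y), x))

# Lemma 3: merging two ordered sets largest-with-smallest keeps delta within
# the larger of the two original deltas.
P1, P2 = [2, 4, 7, 10], [1, 3, 5, 6]
merged = [a + b for a, b in zip(P1, reversed(P2))]
assert delta(merged) <= max(delta(P1), delta(P2))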
4.2 Two-stage Allocation
The proposed heuristic algorithm, which satisfies the allocation primitives discussed above, works in two stages. In the first stage, primary processes are allocated using a standard load balancing algorithm. We used a greedy method which allocates the process with the highest load to the node with the lowest load. As shown in Lemmas 1 and 2, this method minimizes δ. In the second stage, backup processes are allocated considering the load increment to be added to each node in the event of a fault. Let us assume that the algorithm is currently working on node j. The backup processes whose primary processes are assigned to node j are divided into (n−1) groups having approximately equal incremental load in the event of a fault in node j. These (n−1) groups of backup processes are allocated to the (n−1) nodes excluding node j. The algorithm is formally described below.
Two-Stage Algorithm
Stage 1: Allocate primary processes
1. Sort the primary processes in descending order of cpu load.
2. Allocate each primary process, from the highest load to the lowest, to the node with the minimum load.

Stage 2: Allocate backup processes
Do the following steps for each node.
1. Compute the load difference between each primary process and its backup process for all primary processes assigned to the node.
2. Sort the backup processes in descending order of this load difference.
3. Divide the backup processes into (n−1) groups having approximately equal incremental load by assigning each backup process, in the order of the sorted list from the previous step, to the group with the smallest load.
4. Compute the actual backup process load of each group.
Do the following for the n(n−1) backup groups generated by the above steps.
1. Sort all n(n−1) backup groups in descending order of their actual loads.
2. Sort the n nodes in ascending order of their current loads.
3. Allocate each backup group to the node with the minimum load. However, if the backup group has a corresponding primary process on this node, or if one of the backup groups already allocated to this node comes from the same node as the backup group to be allocated, choose the node with the next-to-the-minimum load.

The allocation of backup processes is the most important part of this algorithm. For each node, the backup processes are first divided into (n−1) groups using the load difference between each primary process and its backup. This load difference is the amount of load increment to be incurred upon the occurrence of a fault. Next, the algorithm computes the actual load of each group using the actual loads of the backup processes. The total number of backup groups is n(n−1). Each group is assigned to a node depending on its actual load. When we allocate each backup group, we check whether there is a pre-allocated backup group which comes from the same node as the backup group to be allocated. In such a case, we select the node with the next-to-the-minimum load. The purpose of dividing the backup processes into (n−1) groups for each node is to guarantee that each node receives an approximately equal amount of load increment. Hence, the system will have a balanced load when a fault occurs. The purpose of computing the actual load of each group and assigning groups based on their actual loads is to guarantee that each processor's load is balanced before the occurrence of a fault. Therefore, we can balance the processor load before as well as after the occurrence of a fault. The allocation of the n(n−1) backup process groups for load-balancing is shown in Fig. 2. In this figure, B_jk denotes one of the groups of backup processes for the primary processes assigned to node j. Hence, B_jk (k = 1..n, k ≠ j) can be allocated to any node except node j in order to tolerate a fault on node j. The problem of allocating the grouped backup processes is the RGP problem discussed in Section 3.4.
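One possible realization of the two-stage procedure is sketched below in Python. This is a reconstruction for illustration, not the authors' implementation: the data structures, the tie-breaking rules, and the way a backup group is moved to the next-lightest node on a conflict are assumptions.

from collections import defaultdict

def two_stage_allocate(n, p, b):
    # Allocate m primary/backup pairs to n nodes.
    #   p[i], b[i] : loads of primary and backup process i (b[i] < p[i])
    # Returns (prim_node, back_node): the node index of each primary and backup.
    m = len(p)
    prim_node, back_node = [None] * m, [None] * m
    load = [0.0] * n                          # current load of each node

    # Stage 1: greedy primary allocation, heaviest process to lightest node.
    for i in sorted(range(m), key=lambda i: -p[i]):
        j = min(range(n), key=lambda j: load[j])
        prim_node[i] = j
        load[j] += p[i]

    # Stage 2a: for each node, split its backups into n-1 groups with roughly
    # equal incremental load p_i - b_i, using the same greedy rule.
    groups = []                               # (actual load, owner node, member processes)
    for j in range(n):
        procs = sorted((i for i in range(m) if prim_node[i] == j),
                       key=lambda i: -(p[i] - b[i]))
        inc = [0.0] * (n - 1)                 # incremental load of each group
        members = [[] for _ in range(n - 1)]
        for i in procs:
            k = min(range(n - 1), key=lambda k: inc[k])
            inc[k] += p[i] - b[i]
            members[k].append(i)
        for k in range(n - 1):
            if members[k]:
                groups.append((sum(b[i] for i in members[k]), j, members[k]))

    # Stage 2b: place groups by actual backup load, heaviest first, skipping the
    # owner node and any node that already holds a group from the same owner.
    owners_on = defaultdict(set)              # node -> owner nodes already placed there
    for actual, owner, members in sorted(groups, reverse=True):
        for j in sorted(range(n), key=lambda j: load[j]):
            if j != owner and owner not in owners_on[j]:
                break
        owners_on[j].add(owner)
        load[j] += actual
        for i in members:
            back_node[i] = j
    return prim_node, back_node

With the tie-breaking used here, the sketch reproduces the allocation worked out in the example of Section 4.3, but other equally valid tie-breaking choices may give different allocations.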
[Figure 2: Backup groups allocation problem. Each node j contributes backup groups B_jk (k = 1..n, k ≠ j) that must be allocated among the other nodes; for n = 4 the groups are B_12, B_13, B_14, B_21, B_23, B_24, B_31, B_32, B_34, B_41, B_42, B_43.]

Table 1: Processes and their loads.
    i   :   1     2     3     4     5     6     7
    p_i :  27    21     8    17    22    10    23
    b_i :   2.5   1.0   0.5   1.0   2.0   1.0   2.0

Table 2: Allocated primary processes.
    P1 : p1(27), p6(10), p3(8)    total 45
    P2 : p7(23), p4(17)           total 40
    P3 : p5(22), p2(21)           total 43
4.3 Example
For simplicity, let us assume that there are 7 primary processes (m = 7) and the same number of backup processes with the loads shown in Table 1. Also, let us assume that the number of nodes is 3 (n = 3). Our two-stage allocation algorithm works as follows.
Stage 1: Allocate primary processes
1. Sort the primary processes in descending order. The result is p1(27), p7(23), p5(22), p2(21), p4(17), p6(10), p3(8).
2. Allocate each process, in the sorted sequence, to the node with the minimum load. After allocating the primary processes, the results are as shown in Table 2.
Table 3: Grouping backup processes.
             Group   Processes                          Difference (Load)
    From P1: B12     (p1 − b1) = 24.5                   24.5 (2.5)
             B13     (p6 − b6) = 9.0, (p3 − b3) = 7.5   16.5 (1.5)
    From P2: B21     (p7 − b7) = 21                     21 (2.0)
             B23     (p4 − b4) = 16                     16 (1.0)
    From P3: B31     (p5 − b5) = 20                     20 (2.0)
             B32     (p2 − b2) = 20                     20 (1.0)
Stage 2: Allocate backup processes
1. To form the backup groups of the processes whose primaries are allocated to P1, sort the backup processes using the load difference between the primary and the backup. The result is (p1 − b1) = 24.5, (p6 − b6) = 9.0, (p3 − b3) = 7.5. Divide these backup processes into 2 (= n − 1) groups.
2. Repeat the above step for all nodes. The 6 groups and their actual loads for the 3 nodes are shown in Table 3.
3. The backup groups are sorted as (B12, B21, B31, B13, B23, B32) using their actual loads.
4. Allocate each backup group to the node with the minimum load. B12 is allocated to P2, which has the minimum load. The current loads of all nodes after the allocation of B12 are shown in Table 4.
5. The backup group B21 is allocated to P3 rather than P2, since B21 comes from the node P2.
6. The backup group B13 is allocated to P3 rather than P2, since B12 (which also comes from P1) is already allocated to the node P2.
The final result after the allocation of all primary and backup processes is shown in Table 5.
Table 4: After the allocation of B12.
    P1 : p1(27), p6(10), p3(8)               45
    P2 : p7(23), p4(17), b1(2.5)             42.5
    P3 : p5(22), p2(21)                      43

Table 5: Final result of allocation.
    P1 : p1(27), p6(10), p3(8), b4(1.0), b2(1.0)      47.0
    P2 : p7(23), p4(17), b1(2.5), b5(2.0)             44.5
    P3 : p5(22), p2(21), b7(2.0), b6(1.0), b3(0.5)    46.5

Table 6: Loads after a fault at P1.
    P1 : (faulty)                                     0
    P2 : p7(23), p4(17), p1(27), b5(2.0)              69
    P3 : p5(22), p2(21), b7(2.0), p3(8), p6(10)       63
Stage 3: After a fault occurrence on P1
When a fault occurs in node P1, the backup processes whose primary processes were running on P1 become active and their loads become the same as those of their primary processes. Each processor's load after the fault occurrence is shown in Table 6.
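As a check on this example, the following lines reuse the two_stage_allocate and node_loads sketches from earlier sections (both are illustrative helpers, not the paper's code) to recompute the per-node loads before the fault and after a fault at P1; with the tie-breaking used there, the values match Tables 5 and 6.

p = [27, 21, 8, 17, 22, 10, 23]
b = [2.5, 1.0, 0.5, 1.0, 2.0, 1.0, 2.0]
prim, back = two_stage_allocate(3, p, b)

print(node_loads(3, p, b, prim, back))            # fault-free loads: [47.0, 44.5, 46.5]
print(node_loads(3, p, b, prim, back, faulty=0))  # after a fault at P1: [69.0, 63.0]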
4.4 Algorithm Complexity
The time complexity of each stage of our algorithm is analyzed as follows.
Primary allocation: We can use a general sorting algorithm whose running time is O(m log m) to sort the primary processes. Allocating m primary processes to n nodes requires O(m log n) time (this is a process of updating the minimum value m times in a priority queue).

Backup allocation: Let us consider the allocation with the worst execution time, in which most of the primary processes are assigned to one node. Sorting the m backup processes takes O(m log m) time. Dividing the backup processes into (n−1) groups having equal incremental load and computing their actual loads takes O(m log n + m) time. The above grouping is repeated n times, which requires O(n(m log m + m log n + m)) time. Sorting the n(n−1) backup groups requires O(n² log n) time. Sorting the n nodes and finding the proper node for one of the n(n−1) groups requires O(n log n + n) time. Thus, allocating the n(n−1) groups requires O(n(n−1)(n log n + n)) time. The time complexity of this stage is thus O(n(m log m + m log n + m) + n² log n + n(n−1)(n log n + n)) = O(nm log nm + n³ log n).

Hence, the total time complexity of the two-stage allocation algorithm is

    O(m log m + m log n + nm log nm + n³ log n) = O(nm log nm + n³ log n)        (4.7)

Assuming that the number of processes is much larger than the number of nodes (m > n²), the execution time is bounded by O(nm log nm). This is a reasonable execution time for process allocation. The time complexity obtained above is based on the worst-case scenario in which all primary processes are allocated to one node. If we assume that all primary processes are evenly distributed, the time complexity becomes O(m log m).
5 Performance Analysis
In this section, the performance of the two-stage algorithm is estimated by analysis and compared with related work using simulation.
5.1 Expected performance
The proposed allocation algorithm consists of three parts for load-balancing. First, primary processes are allocated to nodes. Second, the backup processes of each node are grouped on the basis of their primary-backup load differences. Third, the backup process groups are allocated to each node using their actual loads. The following lemmas and theorems estimate the performance of the proposed algorithm.

[Figure 3: Before and after allocating p_k. The node loads P(1), .., P(n) are shown on a Load axis; the dark area marks the remaining room on the other nodes below the level P(1) + p_k.]
Lemma 4: Given m primary processes in decreasing order of load, the load difference between the node with the heaviest load and the node with the lightest load after the allocation of all primary processes by the proposed algorithm is not more than the load of the n-th smallest primary process, p_{m−n+1}.
Proof: Let us assume that the node with the heaviest load has more than one primary process after the allocation of primary processes. Then let q_i be the node with the heaviest load and p_k be the process which was allocated last to q_i. Note that the node q_i is the most lightly loaded node before the allocation of p_k and q_i is also the most heavily loaded node after the allocation of p_k. Let P(i) be the ordered load of all nodes before the allocation of p_k, i.e., 0 ≤ P(1) ≤ P(2) ≤ ... ≤ P(n). When p_k is allocated to the node with load P(1), that node becomes the most heavily loaded node with load P(1) + p_k. The processes which are not allocated yet, i.e., p_{k+1} to p_m, should be allocated to the dark area in Fig. 3 to keep P(1) + p_k at the maximum load. Since this area is less than (n−1) p_k, the following equation results:

    Σ_{i=k+1}^{m} p_i ≤ (n−1) p_k        (5.1)

From the condition that p_k ≥ p_i (k+1 ≤ i ≤ m), we can form the following equation:

    Σ_{i=k+1}^{m} p_i ≤ (m−k) p_k        (5.2)

The value of k which satisfies Eqs. (5.1) and (5.2) is m−n+1, since n−1 = m−k. Therefore, p_k is the n-th smallest process among the m primary processes. After allocating the primary processes using the proposed algorithm, the difference between the minimum and maximum load is not more than p_{m−n+1}. □
Lemma 5: The load difference between the node with the heaviest load and the node with the lightest load in the allocation of backup processes is bounded by

    (β/(1−β)) ( 1/(n²(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i) ),

where β is the maximum ratio of a backup process load to the load of its primary.
Proof: The number of primary processes in node j after the allocation of primaries using the proposed algorithm is represented by m_j. Note that there are m primaries which were allocated to n nodes:

    m = Σ_{j=1}^{n} m_j

The load difference between each primary and backup process pair is denoted as d_i. D represents the load differences of all processes, i.e., D = {d_1, d_2, .., d_m}. The set of load differences of the primary processes allocated to node j is represented as D_j = {d_{j,1}, d_{j,2}, .., d_{j,m_j}}, which is a subset of D. Let G_jk represent the sum of the load differences of the k-th group after dividing D_j into n−1 groups by the proposed algorithm. Then, on the basis of Lemma 4, G_jk has the following bound when m_j ≥ n−1:

    δ_{k=1..n−1}(G_jk) ≤ d_{j, m_j−(n−1)+1} = d_{j, m_j−n+2} ≤ max(D_j)

When m_j < n−1, the number of primary processes allocated to node j is less than the number of groups (n−1). Thus,

    δ_{k=1..n−1}(G_jk) = max(D_j)

Also, the maximum value in D_j cannot be larger than the maximum in D. Since j can be any node, we obtain the following equation:

    δ_{k=1..n−1}(G_jk) ≤ max(D) = max_{i=1..m}(p_i − b_i)        (5.3)

If we know the maximum of G_jk, then it is possible to estimate the maximum actual load of a backup group, since the maximum actual load is less than or equal to (β/(1−β)) max(G_jk). To estimate max(G_jk) from δ_{k=1..n−1}(G_jk), we should first find the relationship among the maximum, minimum, average, sum, and δ when a set S is divided into n groups:

    (n−1) max(S) + min(S) ≥ Σ(S) ≥ max(S) + (n−1) min(S)        (5.4)

    max(S) = min(S) + δ(S) ≤ (1/n) Σ(S) + δ(S)        (5.5)

From Eq. (5.4), the maximum of G_jk is bounded as follows:

    Σ_{k=1}^{n−1} G_jk ≥ max_{k=1..n−1}(G_jk) + (n−1) min_{k=1..n−1}(G_jk) = n max_{k=1..n−1}(G_jk) − (n−1) δ_{k=1..n−1}(G_jk)

    max_{k=1..n−1}(G_jk) ≤ (1/n) { Σ_{k=1}^{n−1} G_jk + (n−1) δ_{k=1..n−1}(G_jk) }        (5.6)

In Eq. (5.3), δ_{k=1..n−1}(G_jk) is estimated as being less than the maximum of all processes' load differences. From Eq. (5.5), Σ_{k=1}^{n−1} G_jk is bounded as follows:

    Σ_{k=1}^{n−1} G_jk ≤ 1/(n(n−1)) Σ_{j=1}^{n} Σ_{k=1}^{n−1} G_jk + max_{i=1..m}(p_i − b_i) = 1/(n(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i)        (5.7)

Hence, the maximum of G_jk is bounded by Eqs. (5.3), (5.6), and (5.7) as follows:

    max_{k=1..n−1}(G_jk) ≤ (1/n) { 1/(n(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i) + (n−1) max_{i=1..m}(p_i − b_i) }
                         = 1/(n²(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i)

Therefore, the load difference of the backup allocation, δ(B_jk), is bounded as

    δ(B_jk) ≤ (β/(1−β)) max_{k=1..n−1}(G_jk) ≤ (β/(1−β)) ( 1/(n²(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i) ).  □
Theorem 3: The expected performance of the proposed algorithm before the occurrence of a fault is bounded as

    δ(P_non) ≤ max( p_{m−n+1}, (β/(1−β)) ( 1/(n²(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i) ) ).

Proof: By Lemmas 4 and 5, the performance of allocating the primary and the backup processes is estimated by means of the objective function δ. The combined load difference after the allocation of all primary and backup processes by the proposed algorithm is not larger than the maximum of the individual primary and backup allocations, as stated in Lemma 3. Hence, δ(P_non) is bounded by the maximum of the δ of the primary and backup process allocations. □
Theorem 4: The objective cost function of the proposed algorithm is bounded as

    Φ ≤ 2 max( p_{m−n+1}, (β/(1−β)) ( 1/(n²(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i) ) ) + max_{i=1..m}(p_i − b_i).

Proof: After the occurrence of a fault in node j, the loads of all non-faulty nodes are increased by their corresponding load increments G_jk. The load difference after the occurrence of a fault is not increased by more than max_{j=1..n}(δ_{k=1..n−1}(G_jk)). Hence, δ(P_faulty) is less than the sum of max_{j=1..n}(δ_{k=1..n−1}(G_jk)) and δ(P_non). Since δ_{k=1..n−1}(G_jk) is bounded by max_{i=1..m}(p_i − b_i) by Eq. (5.3), the objective function is bounded as follows:

    Φ = δ(P_non) + δ(P_faulty)
      ≤ δ(P_non) + δ(P_non) + max_{j=1..n}(δ_{k=1..n−1}(G_jk))
      ≤ 2 max( p_{m−n+1}, (β/(1−β)) ( 1/(n²(n−1)) Σ_{i=1}^{m} (p_i − b_i) + max_{i=1..m}(p_i − b_i) ) ) + max_{i=1..m}(p_i − b_i).  □

In the next subsection, we will show by comparison with simulation that this bound can serve as a quick approximation to the actual value.
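For a quick numerical reading of Theorems 3 and 4, the small Python helper below evaluates the stated bound for a given load profile. The symbol beta and the exact constants follow the reconstruction of Lemma 5 above, so the function should be treated as an approximation of the bound rather than a definitive formula.

def theorem4_bound(n, p, b):
    # Upper bound on Phi = delta(P_non) + delta(P_faulty) as stated in Theorem 4
    # (using the reconstructed constants of Lemma 5).
    m = len(p)
    diffs = [pi - bi for pi, bi in zip(p, b)]
    beta = max(bi / pi for pi, bi in zip(p, b))             # max backup-to-primary load ratio
    backup_term = (beta / (1 - beta)) * (sum(diffs) / (n * n * (n - 1)) + max(diffs))
    primary_term = sorted(p)[n - 1] if m >= n else max(p)   # p_{m-n+1}: n-th smallest primary load
    return 2 * max(primary_term, backup_term) + max(diffs)

For small examples such as the one in Section 4.3 the bound is dominated by p_{m−n+1} and is therefore loose; it tightens as the number of processes per node grows, which is consistent with the behaviour of the upper-bound curve in Fig. 8.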
5.2 Performance Comparison
In this subsection, we analyze the performance of the two-stage algorithm using simulations. Note that there has been no previous research on load-balancing with passive replicas. The closest research of this kind is Bannister and Trivedi's work on load-balancing with active replicas [6]. Thus, our algorithm will be compared to Bannister and Trivedi's process allocation (BT) algorithm. Process allocation by the BT algorithm is proven to be near-optimal under the assumption that backup processes have exactly the same load as their primaries [6].
The BT algorithm is applied to the problem of load-balancing with passive replicas as follows. First, all primary and backup processes are sorted in descending order of their cpu load. Starting from the process with the highest load, each process is allocated to the node with the minimum load. When the node with the minimum load in allocating a backup process is the node on which the corresponding primary process has already been allocated, the backup process is allocated to the node with the next smallest load.
The environment parameters used in the simulation are as follows. To keep the total load of each node below 100%, the loads of the primary processes are chosen randomly in the range of 0.2 to 2 times 100(n−1)/m based on a uniform distribution. The loads of the backup processes are chosen randomly between 5 and 10% of the loads of their primaries, also based on a uniform distribution.
Fig. 4 shows each node's load when processes are allocated using the BT algorithm. It is assumed that the number of nodes (n) is 8. The horizontal dotted line in the figure represents the average load of a faulty system and the solid line represents the average load of a fault-free system. The figure also shows the difference between the maximum and the minimum load when the number of primary processes varies from 50 to 300. Fig. 5 shows the simulation results using the two-stage allocation algorithm with the same parameters. When the number of primary processes is 150, the load difference between the maximum and the minimum load is approximately 2 to 3% in Fig. 5. By contrast, Fig. 4 shows that the load difference between the maximum and the minimum load is almost 25% of the total load when the BT algorithm is used. The reason that the BT algorithm shows such poor performance is that the allocation of backup processes is done without considering the load variation after the occurrence of a fault. The load differences between the minimum and the maximum load before and after the occurrence of a fault are shown in Fig. 6 for each algorithm.
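For reference, a minimal Python sketch of the BT-style allocation described above and of the load generation used in the simulations follows. The parameter names, the uniform sampling, and the symmetric primary/backup conflict check are assumptions drawn from the description in the text; this is not the original simulator.

import random

def bt_allocate(n, p, b):
    # BT-style allocation applied to passive replicas: sort all primary and backup
    # processes together by load and place each on the least-loaded node, moving a
    # process to the next-lightest node if its partner already occupies that node
    # (the text states this check for backups; applying it to both directions keeps
    # every pair on distinct nodes).
    m = len(p)
    prim_node, back_node = [None] * m, [None] * m
    load = [0.0] * n
    items = [(p[i], i, True) for i in range(m)] + [(b[i], i, False) for i in range(m)]
    for w, i, is_primary in sorted(items, reverse=True):
        partner = back_node[i] if is_primary else prim_node[i]
        for j in sorted(range(n), key=lambda j: load[j]):
            if j != partner:
                break
        load[j] += w
        if is_primary:
            prim_node[i] = j
        else:
            back_node[i] = j
    return prim_node, back_node

def generate_loads(n, m, backup_ratio=(0.05, 0.10)):
    # Primary loads uniform in 0.2..2 times 100(n-1)/m percent; backup loads a
    # uniform 5-10% fraction of their primaries, following the setup described above.
    base = 100.0 * (n - 1) / m
    p = [random.uniform(0.2 * base, 2.0 * base) for _ in range(m)]
    b = [pi * random.uniform(*backup_ratio) for pi in p]
    return p, b

The two allocators can then be compared by feeding the output of generate_loads to bt_allocate and two_stage_allocate and evaluating the objective function sketched in Section 3.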
[Figure 4: Minimum and maximum load using the BT algorithm (Load versus Number of Processes, 50-300, for the non-fault and one-fault cases).]

[Figure 5: Minimum and maximum load using our algorithm (Load versus Number of Processes, 50-300, for the non-fault and one-fault cases).]
These results were obtained with n = 8 nodes. The upper two lines show δ(P_faulty) for the two algorithms after a fault has occurred in one node of the system and the lower two lines show δ(P_non) before the occurrence of the fault. It can be seen that before the occurrence of a fault, δ(P_non) using the two-stage allocation algorithm is similar to that of the BT algorithm. However, after the occurrence of a fault, δ(P_faulty) using the two-stage allocation algorithm is significantly less than that using the BT algorithm.
Next, we experimented with the effect of the number of nodes in the system. The simulation was conducted by varying the number of nodes while keeping the number of processes fixed. The simulation results are shown in Fig. 7. In this and the following figures, the simulation results before and after the fault occurrence are combined and represented as Φ, as shown in Eq. (3.6). The upper two lines show the simulation results when the number of processes is 200 and the lower two lines show the simulation results when the number of processes is 400. Φ increases as the number of nodes increases. This result was expected since the number of processes allocated to each node decreases as the number of nodes increases when the number of processes is fixed. This figure again shows that the two-stage allocation
[Figure 6: δ before and after a fault occurrence (δ(P_non) and δ(P_faulty) for the BT algorithm and for our algorithm, versus the number of processes).]

[Figure 7: Φ affected by the number of nodes (200 and 400 processes, BT algorithm and our algorithm).]
algorithm is better than the BT algorithm.
Fig. 8 shows the simulation results when the number of processes is varied from 50 to 300. The number of nodes is fixed at 8. The upper line shows Φ when the BT algorithm is used in the simulation, the middle line is the upper bound of our algorithm based on Theorem 4, and the lower line shows the simulated result for the two-stage algorithm. As the number of processes increases, Φ decreases since the number of processes per node increases. From this figure, we can conclude that it is easier to balance the load of a system when there are many processes with small loads rather than a few processes with large loads.
In the previous figures, we assumed that the backup processes have 5 to 10% of the load of their primary processes. Next, we compare the performance of the two algorithms by varying the load of the backup processes from 10% to 100% of their primaries. For each backup process load ratio in Fig. 9, the load of a backup process was chosen randomly between 50 and 100% of that ratio times the primary process load, in order to provide some variance in the backup process load. We tested for 200 processes with 8 nodes, and for 400 processes with 16 nodes. The two-stage allocation algorithm shows better performance in Fig. 9. However, as the load of
[Figure 8: Φ affected by the number of processes (n = 8; curves for the BT algorithm, the upper bound of Theorem 4, and our algorithm).]

[Figure 9: Φ affected by varying the load ratio of backup processes (200 processes with 8 nodes and 400 processes with 16 nodes, for the BT algorithm and our algorithm).]
the backup processes approaches 100% of the primary process load, both algorithms have similar performance. In our simulation, 100% means that a backup process has 50% to 100% of the load of its primary process. This result implies that the BT algorithm performs better when a backup process has the same load as its primary process.
6 Summary and Conclusion
A load-balancing static process allocation algorithm for fault-tolerant multi-computer systems is proposed and analyzed. The fault-tolerant process model considered in this paper is the passive process replica model. The passive replica of a process is inactive during normal operation and becomes active only when the main active process becomes faulty or the node on which the active process was running becomes faulty. The load of a passive replica varies before and after the occurrence of a fault. The load-balancing problem with passive replicas is formalized as a constrained optimization problem. Since optimal process allocation is an NP-hard problem, a heuristic approximation algorithm is proposed. The proposed algorithm is compared, using simulations, with the load-balancing algorithm for active replicas proposed by Bannister and Trivedi [6]. The simulation results show that the proposed algorithm has significantly better performance in the passive replica model. Also, it is shown that the system achieves a better load balance if we can divide a "big load" process into many "small load" processes.
The main contribution of this paper is the presentation and analysis of a new static load-balancing algorithm for a passive replica fault-tolerant process model. The proposed static process allocation algorithm is applicable to on-line transaction processing and real-time systems. We are currently working on extending this algorithm to handle the dynamic situation. Also, we plan to study the problem of finding an allocation algorithm which guarantees optimal performance and reliability at the same time.
References
[1] L. J. M. Nieuwenhuis, "Static allocation of process replicas in fault-tolerant computing systems," in Proc. of FTCS-20, pp. 298-306, June 1990.
[2] S. M. Shatz, J. P. Wang, and M. Goto, "Task allocation for maximizing reliability of distributed computer systems," IEEE Trans. on Computers, pp. 1156-1168, 1992.
[3] X. Qian and Q. Yang, "Load balancing on generalized hypercube and mesh multiprocessors with LAL," in Proc. of 11th ICDCS, pp. 402-409, May 1991.
[4] M. Livingston and Q. F. Stout, "Fault tolerance of allocation schemes in massively parallel computers," in Proc. of 2nd Massively Parallel Computers Conference, pp. 491-494, 1989.
[5] Y. Oh and S. H. Son, "Scheduling hard real-time tasks with 1-processor-fault-tolerance," Tech. Rep. CS-93-27, University of Virginia, May 1993.
[6] J. A. Bannister and K. S. Trivedi, "Task allocation in fault-tolerant distributed systems," Acta Informatica, vol. 20, pp. 261-281, 1983.
[7] M. Chereque, D. Powell, et al., "Active replication in Delta-4," in Proc. of FTCS-22, pp. 28-37, June 1992.
[8] D. P. Siewiorek and R. S. Swartz, Reliable System Design: The Theory and Practice. New York: Digital Press, 1992.
[9] N. Speirs and P. Barrett, "Using passive replicates in Delta-4 to provide dependable distributed computing," in Proc. of FTCS-19, pp. 184-190, June 1989.
[10] A. Nangia and D. Finkel, "Transaction-based fault-tolerant computing in distributed systems," in 1992 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, pp. 92-97, July 1992.
[11] S. Balaji et al., "Workload redistribution for fault-tolerance in a hard real-time distributed computing system," in Proc. of FTCS-19, pp. 366-373, June 1989.
[12] Y. Oh and S. H. Son, "Scheduling hard real-time tasks with tolerance of multiple processor failures," Tech. Rep. CS-93-28, University of Virginia, May 1993.
[13] C. Krishna and K. G. Shin, "On scheduling tasks with a quick recovery from failure," IEEE Trans. on Computers, vol. C-35, pp. 448-454, May 1986.
[14] H. P. Williams, Model Building in Mathematical Programming. John Wiley & Sons Ltd., 2nd ed., 1985.
[15] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.