Approximation and Complexity in Numerical Optimization: Continuous and Discrete Problems (P. M. Pardalos, Editor), pp. 209-244

© 2000 Kluwer Academic Publishers

Machine Partitioning and Scheduling under Fault-Tolerance Constraints

Dimitris A. Fotakis (1,2) ([email protected])
Paul G. Spirakis (1,2) ([email protected])

(1) Department of Computer Engineering and Informatics, University of Patras, 265 00 Rion, Greece
(2) Computer Technology Institute, Kolokotroni 3, 262 21 Patras, Greece

Abstract

We consider the problem of computing fault-tolerant, redundant assignments of jobs to faulty parallel machines with the objective to minimize the maximum machine load. In particular, we are given a set M of faulty parallel machines, each having an integer speed v_i and failing independently with probability f_i. We are also given a set of jobs to be processed on M, and a fault-tolerance constraint (1 − ε), and we seek a redundant assignment σ that minimizes the maximum machine load L_∞(σ), subject to the constraint that, with probability no less than (1 − ε), all the jobs have a copy on at least one active machine. We present a polynomial-time 4-approximation algorithm for identical speed machines and arbitrary job sizes, and a 2⌈ln(|M|/ε)/ln(1/f_max)⌉-approximation algorithm for related speed machines and unit size jobs. Both algorithms are based on computing a collection {M_1, ..., M_λ} of disjoint machine subsets such that, with probability no less than (1 − ε), at least one machine is active in each subset. The objective is to maximize the sum of the minimum subset speeds. Since the exact version of this problem is NP-complete, we provide a 2-approximation algorithm for identical speeds, and a polynomial-time (8 + o(1))-approximation algorithm for arbitrary speeds.

Keywords: Fault-Tolerant Scheduling, Polynomial-Time Algorithms.

This work was partially supported by ESPRIT LTR Project no. 20244 (ALCOM-IT).


1 Introduction

In many practical applications involving design with faulty components (e.g., fault-tolerant network design, fault-tolerant scheduling), a combinatorial structure, such as a graph, should be optimized to best tolerate random and independent faults with respect to a given property, such as connectivity or non-existence of isolated points (e.g., [12]). For instance, let us consider some jobs to be processed on a set of faulty parallel machines. Each machine has an integer speed v_i, fails independently with probability f_i, and, in case of failure, it processes no jobs. Then, redundancy can be employed to increase the probability that all the jobs have a copy on at least one active machine, i.e., that all the jobs are processed on some active machine. On the other hand, redundancy also increases the maximum machine load. A natural question arising from this setting is whether it is possible to assign, with probability no less than (1 − ε), all the jobs to at least one active machine, without assigning jobs of total size more than v_i to any machine i ∈ M. The answer requires the construction of the most reliable redundant assignment σ* not violating the capacity constraint, and the computation of σ*'s reliability, that is, the probability that all the jobs get a copy on at least one active machine. In this work, we investigate how fault-tolerant redundant assignments can be efficiently computed, and what the structure of the most reliable assignment σ* looks like.

In this paper, we consider a set M of faulty parallel machines, each having an integer speed v_i and failing independently with probability f_i, for some rational 1 > f_i > 0. We are also given a set J of jobs, each of an integer size s_j, and a rational fault-tolerance constraint (1 − ε). Each job j ∈ J of size s_j causes a load s_j/v_i when assigned to a machine i ∈ M of speed v_i. We seek a redundant assignment σ : J → 2^M that tolerates the faults (i.e., all the jobs get a copy on at least one active machine) with probability no less than (1 − ε), and minimizes the maximum machine load L_∞(σ). We distinguish the case of identical speed machines, where all the speeds v_i are equal, and the case of related speed machines, where the speeds v_i can be arbitrary integer numbers.

Obviously, the problem of determining the Minimum Fault-Tolerant Maximum Load is NP-hard. Moreover, since the verification of a feasible solution involves the computation of a #P-complete function, Minimum Fault-Tolerant Maximum Load does not seem to belong to NP. We identify a complexity class that provably contains Minimum Fault-Tolerant Maximum Load, and we show that it also contains the whole Polynomial Hierarchy PH.

A natural strategy for constructing a redundant, (1 − ε)-fault-tolerant assignment σ is to compute a collection M = {M_1, ..., M_λ} of disjoint machine subsets such that, with probability no less than (1 − ε), at least one machine is active in each subset M_j. Then, each subset M_j can be thought of as a reliable effective machine of speed V(M_j) = min_{i∈M_j} {v_i}, and any algorithm for non-redundant scheduling on M can be used for computing σ. A reasonable objective for a reliable collection M is to maximize the total effective speed V(M) = Σ_{l=1}^λ V(M_l). We show that the problem of computing an optimal, (1 − ε)-fault-tolerant collection of disjoint machine subsets is NP-complete even for identical speeds. Then, we present a polynomial-time 2-approximation algorithm for partitioning a set of identical speed machines. In case of related speeds, we obtain a simple 2⌈ln(|M|/ε)/ln(1/f_max)⌉-approximation algorithm, and, for any constant δ > 0, a polynomial-time (8 + δ)-approximation algorithm.

As for the approximability of Minimum Fault-Tolerant Maximum Load, in case of identical speeds, we prove that near optimal assignments can be obtained by partitioning M into an optimal number of reliable effective machines. This proof is based on a technical lemma of independent interest that provides a tight upper bound on the reliability of any redundant assignment. This lemma also bounds from above the probability that no isolated nodes appear in a (not necessarily connected) hypergraph whose edges fail randomly and independently. Then, we show that Minimum Fault-Tolerant Maximum Load can be approximated within a factor of 4 in polynomial time. In case of related speeds, we restrict our attention to unit size jobs, and we present a simple 2⌈ln(|M|/ε)/ln(1/f_max)⌉-approximation algorithm.

To the best of our knowledge, similar optimization problems, concerning the computation of minimum maximum load redundant assignments subject to the constraint of tolerating random and independent faults with a given probability, have not been studied so far. The definition of Minimum Fault-Tolerant Maximum Load does not assume any upper bound on the number of faulty machines, and since no kind of reaction to machine failures is allowed, a non-trivial lower bound on the optimal maximum load should identify the most reliable redundant assignments. On the other hand, unlike other on-line fault-tolerant scheduling problems (e.g., [7, 8]), Minimum Fault-Tolerant Maximum Load is an off-line problem, and any algorithm with sufficient computational power will eventually come up with the optimal solution.

The fault-tolerant versions of some routing problems, such as minimizing the congestion of Virtual Path layouts in a complete ATM network [4], have been studied in an off-line setting. In [4], redundancy is also employed to overcome faulty links, but mainly the worst-case fault model, where each layout must tolerate any configuration of at most f faulty links, is considered. In case of random and independent faults, they only provide a trivial logarithmic upper bound. Also, the graphs and hypergraphs that maximize and minimize the probability of remaining connected under random and independent edge faults have been studied in [12], where tight upper and lower bounds are derived.

Next, Section 2 is devoted to the introduction of the notation used throughout this paper, the formal definitions of Minimum Fault-Tolerant Maximum Load and Maximum Fault-Tolerant Partition, and the discussion of their complexity. In Section 3, we present a polynomial-time 2-approximation algorithm for Maximum Fault-Tolerant Partition of identical speed machines, which is also used in Section 5

to obtain an approximation algorithm for Minimum Fault-Tolerant Maximum Load. Section 4 is devoted to the presentation of two approximation algorithms for Maximum Fault-Tolerant Partition of related speed machines. In particular, in Section 4.1, we present a simple, logarithmic approximation algorithm, which is also used in Section 6 to obtain a logarithmic approximation algorithm for Minimum Fault-Tolerant Maximum Load. In Section 4.2, we present a more sophisticated, constant factor approximation algorithm. Section 5.1 is devoted to the proof of a lower bound for Minimum Fault-Tolerant Maximum Load on identical speed machines, which is employed in the analysis of the constant factor approximation algorithm presented in Section 5. Section 6 is devoted to the presentation and the analysis of the logarithmic approximation algorithm for Minimum Fault-Tolerant Maximum Load, in case of unit size jobs and related speed machines. Finally, some directions for further research are discussed in Section 7.

2 Preliminaries

Let M be a set of machines, such that each machine i ∈ M has an integer speed v_i ≥ 1 and fails independently with probability f_i, for some rational 1 > f_i > 0. For any subset M' ⊆ M, let Pr[M'] denote the reliability of M', that is, the probability that at least one machine of M' is active:

Pr[M'] = 1 − ∏_{i∈M'} f_i

Also, f_max = max_{i∈M} {f_i} denotes the failure probability of the most unreliable machine. Let J = {1, ..., n} be a set of jobs, where each job j ∈ J must be processed for s_j time units on at least one active machine of M. Let S_tot = Σ_{j∈J} s_j denote the total size of J, and let s_max = max_{j∈J} {s_j} denote the size of the largest job.

A redundant assignment σ : J → 2^M is a function that assigns each job j ∈ J to a non-empty set of machines σ(j) ⊆ M. An assignment σ is feasible for a set of machines M' ⊆ M if, for all j ∈ J, σ(j) ∩ M' ≠ ∅. Given an assignment σ, Pr[σ] denotes the reliability of σ, that is, the probability that σ is feasible over the machine availability distribution defined by the failure probabilities f_i over M:

Pr[σ] = Σ_{M'⊆M : σ is feasible for M'} ( ∏_{i∈M'} (1 − f_i) · ∏_{i∈M−M'} f_i )

Given a redundant assignment σ : J → 2^M, a minimal feasible set of machines for σ is any subset M' ⊆ M such that σ is feasible for M', but σ is not feasible for any M'' ⊂ M'. A minimum feasible set of machines for σ is a minimal feasible set for σ of minimum cardinality. Moreover, MF(σ) denotes the cardinality of any minimum feasible set for σ. Given an assignment σ that is feasible for M, L_∞(σ) denotes the maximum load assigned by σ to the machines of M:

L_∞(σ) = max_{i∈M} { Σ_{j : i∈σ(j)} s_j / v_i }
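For intuition, the two quantities just defined can be evaluated by brute force on toy instances (exact evaluation of Pr[σ] is #P-complete in general, as discussed in Section 2.1). The following sketch is purely illustrative; the dictionary-based encoding is an assumption of this example, not notation from the paper.

```python
from itertools import combinations
from fractions import Fraction

def reliability(sigma, fail):
    """Pr[sigma]: probability that every job has a copy on at least one
    active machine, summed over all machine availability outcomes.
    Exponential in |M|; exact evaluation is #P-complete in general.
    sigma: job -> set of machines; fail: machine -> failure probability."""
    machines = list(fail)
    total = Fraction(0)
    for r in range(len(machines) + 1):
        for alive in map(set, combinations(machines, r)):
            if all(copies & alive for copies in sigma.values()):
                p = Fraction(1)
                for i in machines:
                    p *= (1 - fail[i]) if i in alive else fail[i]
                total += p
    return total

def max_load(sigma, size, speed):
    """L_inf(sigma): the maximum machine load, counting every redundant copy."""
    load = {i: Fraction(0) for i in speed}
    for j, copies in sigma.items():
        for i in copies:
            load[i] += Fraction(size[j], speed[i])
    return max(load.values())
```

For instance, with two unit-speed machines failing with probability 1/2 each, assigning job a to both machines and job b only to machine 2 is feasible exactly when machine 2 is active, so Pr[σ] = 1/2 while machine 2 carries load 2.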

Definition 2.1 (Minimum Fault-Tolerant Maximum Load)

Instance: A set of machines M = {(f_1, v_1), ..., (f_m, v_m)}. Each machine i ∈ M has an integer speed v_i ≥ 1 and fails independently with probability f_i, for some rational 1 > f_i > 0. A set of jobs J = {s_1, ..., s_n} to be processed on M. Each job j ∈ J has an integer size s_j ≥ 1 and must be processed on at least one active machine. A fault-tolerance constraint (1 − ε), for some rational 1 > ε ≥ ∏_{i=1}^m f_i.

Solution: A (1 − ε)-fault-tolerant redundant assignment σ : J → 2^M, i.e., an assignment of each job j ∈ J to a non-empty set of machines σ(j) ⊆ M, such that Pr[σ] ≥ 1 − ε.

Objective: Minimize L_∞(σ) = max_{i∈M} { Σ_{j : i∈σ(j)} s_j / v_i }.

In this paper, we distinguish the identical speed machines case, where all the machines have unit speed (i.e., v_i = 1), and the related speed machines case, where each machine can have an arbitrary integer speed. In the related speeds case, we further assume that the machines are numbered in non-increasing order of their speeds, i.e., v_1 ≥ ... ≥ v_m ≥ 1.

The problem of Maximum Fault-Tolerant Partition arises from the following natural strategy for computing a (1 − ε)-fault-tolerant, redundant assignment σ:

(1) Compute a collection of reliable effective machines by partitioning some M' ⊆ M into disjoint groups M = {M_1, ..., M_λ}, such that the probability of at least one machine being active in each group is at least (1 − ε).

(2) Use an appropriate algorithm for scheduling the job set J on the set M of reliable effective machines. For all jobs j ∈ J scheduled on the effective machine M_l, set σ(j) = M_l.

The first step of this approach actually determines an upper bound on the amount of redundancy necessary for satisfying the fault-tolerance constraint. Moreover, if we set the effective speed V(M_l) equal to the minimum speed of the corresponding group, V(M_l) = min_{i∈M_l} {v_i}, 1 ≤ l ≤ λ, then the makespan of the non-redundant schedule obtained in the second step equals the maximum load of the redundant assignment σ.

The redundant assignments that can be produced by this approach are called partition assignments. In particular, an assignment σ : J → 2^M is called a λ-partition assignment if, for any pair j_1, j_2 ∈ J, either σ(j_1) = σ(j_2) or σ(j_1) ∩ σ(j_2) = ∅, and σ assigns the jobs of J to exactly λ disjoint machine subsets {M_1, ..., M_λ}. Since there exist many efficient algorithms for the implementation of the second step (e.g., see Chapter 1 of [5]), we focus on the design and analysis of approximation algorithms for the first step, that is, the computation of a (1 − ε)-fault-tolerant collection {M_1, ..., M_λ} of disjoint machine subsets.
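As an illustration of step (2) of the strategy above, the sketch below schedules jobs on the effective machines with greedy LPT (Largest Processing Time first); the paper only requires "an appropriate algorithm" for this step, so the choice of LPT, and all names here, are assumptions of this example. Expanding each effective machine back into its machine group yields a λ-partition assignment.

```python
import heapq

def partition_assignment(groups, sizes):
    """Step (2) of the partition strategy, sketched with greedy LPT.

    groups: list of (set_of_machines, effective_speed) effective machines.
    sizes:  dict job -> integer size.
    Returns sigma: dict job -> set of machines (a lambda-partition assignment:
    two jobs get either the same machine set or disjoint ones).
    """
    # Min-heap of (current_load, group_index) over effective machines.
    heap = [(0.0, l) for l in range(len(groups))]
    heapq.heapify(heap)
    sigma = {}
    # Largest Processing Time first: place each job on the least loaded
    # effective machine, and give a copy to every machine in its group.
    for j in sorted(sizes, key=sizes.get, reverse=True):
        load, l = heapq.heappop(heap)
        machines, speed = groups[l]
        sigma[j] = set(machines)
        heapq.heappush(heap, (load + sizes[j] / speed, l))
    return sigma
```

The makespan of this non-redundant schedule over the effective machines equals the maximum load L_∞(σ) of the resulting redundant assignment, as noted above.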

Definition 2.2 (Maximum Fault-Tolerant Partition)

Instance: A set of machines/items M = {(f_1, v_1), ..., (f_m, v_m)}. Each item i ∈ M has an integer speed/profit v_i ≥ 1 and fails independently with probability f_i, for some rational 1 > f_i > 0. A fault-tolerance constraint (1 − ε), for some rational 1 > ε ≥ ∏_{i=1}^m f_i.

Solution: A partition of a subset M' ⊆ M into disjoint groups M = {M_1, ..., M_λ} such that:

Pr[M] = Pr[M_1 ∧ ... ∧ M_λ] = ∏_{l=1}^λ Pr[M_l] = ∏_{l=1}^λ ( 1 − ∏_{i∈M_l} f_i ) ≥ 1 − ε

Objective: Maximize the total effective speed of the partition M:

V(M) = Σ_{l=1}^λ V(M_l) = Σ_{l=1}^λ min_{i∈M_l} {v_i}

Notice that the term "partition" is somewhat abused, because the definition of Maximum Fault-Tolerant Partition allows M' = ∪_{l=1}^λ M_l ⊆ M. This is crucial in the related speeds case, because there exist many instances where the optimal solution is a partition of a strict subset M' ⊂ M. However, in the identical speeds case, where the objective is simply to maximize the number of groups λ, the addition of some items to M' cannot decrease V(M). Hence, in the identical speeds case, we can always assume that ∪_{l=1}^λ M_l = M.

The Fault-Tolerant Partition problem can be thought of as a version of Bin Covering [1], where, instead of a threshold on the total size of each separate bin, we have to cover a constraint on the product of the total bin sizes. Therefore, any feasible solution of a Bin Covering instance can be mapped to a feasible solution of a corresponding instance of Fault-Tolerant Partition. On the other hand, there may exist feasible Fault-Tolerant Partitions (including the optimal one) that are not feasible solutions of the corresponding Bin Covering instance, because they contain some underfilled bins.

2.1 The Complexity of Fault-Tolerant Maximum Load

Since it is NP-complete to determine the minimum makespan for scheduling a set of jobs on reliable (i.e., f_i = 0) identical machines, it is NP-hard to determine the optimal fault-tolerant L_∞, even for instances consisting of unit speed machines with identical failure probabilities. Moreover, given a set of identical speed machines M, each failing independently with probability f = 1/2, a set J of n unit size jobs, and a redundant assignment σ : J → 2^M, it is #P-complete to exactly compute Pr[σ], because it is equivalent to #Monotone-Sat, originally shown #P-complete by Valiant [15]. In particular, we can associate a boolean variable x_i to each machine i ∈ M, and a clause C_j = ∨_{i∈σ(j)} x_i to each job j ∈ J. Clearly, the formula F = ∧_{j∈J} C_j is satisfied by a truth assignment A to the variables x_i iff the schedule σ is feasible for the set M_A = {i ∈ M : A(x_i) = true}. Thus, Pr[σ] equals the number of truth assignments satisfying F divided by 2^{|M|}.

It is straightforward to verify that Minimum Fault-Tolerant Maximum Load is in PSPACE, but we do not know whether it belongs to the Polynomial Hierarchy PH (e.g., see [13]). Obviously, Minimum Fault-Tolerant Maximum Load can be included in a class containing all the languages L that can be decided by a polynomial-time non-deterministic Turing machine T reducing L to a single call of a function g ∈ #P. Moreover, after calling the oracle g once, the only additional computation that T needs to perform is an arithmetic comparison involving the outcome of g. We denote this class by NP^{#P[1,comp]}. In particular, the class NP^{#P[1,comp]} restricts the operation of the non-deterministic Turing machine T as follows:

(a) Initially, given an input x, T is allowed to perform an arbitrary polynomial-time non-deterministic computation in order to compute a valid input y_x for a function g ∈ #P, and an arbitrary integer number N.

(b) Then, T either rejects or calls g(y_x) and gets the outcome n = g(y_x).

(c) The only computation that T is allowed to perform after getting n is to compare n with N. The particular kind of comparison (e.g., equality, less than) depends on the machine. T accepts x iff the comparison of n with N succeeds.

To the best of our knowledge, the complexity class NP^{#P[1,comp]} has not been defined and studied so far. In addition to Minimum Fault-Tolerant Maximum Load, a stochastic version of Knapsack defined in [10] can be shown to belong to this class. It can be shown that NP^{#P[1,comp]} contains the whole Polynomial Hierarchy PH.
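The equivalence with counting satisfying assignments is easy to check by brute force on a toy instance (the 3-machine, 2-job instance below is made up for illustration): with f_i = 1/2 for every machine, Pr[σ] equals the number of satisfying truth assignments of F divided by 2^|M|.

```python
from itertools import product
from fractions import Fraction

def count_satisfying(sigma, m):
    """Count truth assignments over x_0..x_{m-1} satisfying
    F = AND_j (OR_{i in sigma(j)} x_i): one variable per machine and one
    monotone clause per job, as in the reduction in the text."""
    return sum(
        1
        for bits in product([False, True], repeat=m)
        if all(any(bits[i] for i in copies) for copies in sigma.values())
    )

sigma = {'j1': {0, 1}, 'j2': {1, 2}}   # toy instance, not from the paper
m = 3
# With f_i = 1/2, Pr[sigma] = (#satisfying assignments) / 2^m.
pr_sigma = Fraction(count_satisfying(sigma, m), 2 ** m)
```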

Lemma 2.3 PH ⊆ NP^{#P[1,comp]}.

Proof. Given any language L ∈ PH, we can decide if an input x is in L by asking an appropriate function in #PH once. Since #PH ⊆ FP^{#P[1]} [14], there exists a polynomial-time deterministic Turing machine T_d that, on input x, computes an input y_x for a function g ∈ #P, calls g(y_x), and performs a deterministic computation after getting n = g(y_x) in order to decide if x ∈ L. Next, we show that L also belongs to NP^{#P[1,comp]}, i.e., there exists a polynomial-time non-deterministic Turing machine T that reduces L to a function g ∈ #P and fulfills the restrictions (a)-(c). The non-deterministic machine T works as follows:

1. It simulates the computation of T_d by guessing a value n̂ instead of calling g(y_x).
2. If T_d rejects x with oracle answer n̂, then T rejects. Otherwise, T calls g ∈ #P and gets the value n = g(y_x).
3. T accepts iff n = n̂.

Clearly, T fulfills the restrictions (a)-(c) and accepts an input x iff T_d accepts x. □

Moreover, an application of Cook's Theorem [2] (see also Section 17.2 of [13]) implies that the following problem is complete for NP^{#P[1,comp]}: Given a boolean formula F(X_1, X_2) with boolean variables partitioned into two sets X_1 and X_2, does there exist a partial truth assignment A for X_1 such that the remaining formula F(A(X_1), X_2) has at least (1 − ε)2^{|X_2|} satisfying partial truth assignments for X_2?

2.2 The Complexity of Fault-Tolerant Partition

In this section, we show that, given a set M = {f_1, ..., f_m} of rational failure probabilities 1 > f_i > 0, and a rational ε > 0, it is NP-complete to decide if M can be partitioned into two sets M_1, M_2 such that Pr[M_1] · Pr[M_2] ≥ 1 − ε.

Lemma 2.4 Fault-Tolerant Partition into two groups is NP-complete, even for identical speed machines.

Proof. Clearly, Fault-Tolerant Partition is in NP. We show that it is NP-complete by a simple transformation from Subset Product. The problem of Subset Product is, given a finite set A = {s_1, ..., s_n}, s_i ∈ ℕ*, and a bound B ∈ ℕ*, to decide if there exists a subset A' ⊆ A such that ∏_{i∈A'} s_i = B. This problem is reported NP-complete in [3], problem SP14. Clearly, the NP-completeness result also holds for rational s_i's and B, 1 > s_i > 0, 1 > B > 0, since, for any set A, sizes s_i ∈ ℕ*, and bound B ∈ ℕ*, ∏_{i∈A'} s_i = B iff ∏_{i∈A'} (1/s_i) = 1/B.

Given a Subset Product instance I_S consisting of (A = {f_1, ..., f_m}, B), for some rational 1 > f_i > 0 and 1 > B > 0, we can construct the following instance I_P of Fault-Tolerant Partition into two groups in polynomial time:

1. P = ∏_{i∈A} f_i. W.l.o.g. we can assume that P^2 < P < B.
2. M = {f_1, ..., f_m, f_{m+1} = P^2/B, f_{m+2} = PB}.
3. ε = 2P^2 − P^4.

We conclude the proof by showing that I_P is a yes-instance iff I_S is a yes-instance. Since 1 − ε = (1 − P^2)^2 and ∏_{i=1}^{m+2} f_i = P^4, any (1 − ε)-fault-tolerant partition into two groups M_1 and M_2 must have Pr[M_1] = Pr[M_2] = 1 − P^2. Thus, for any (1 − ε)-feasible partition, the items f_{m+1} and f_{m+2} cannot belong to the same group, because f_{m+1} f_{m+2} = P^3 < P^2. Hence, if f_{m+1} ∈ M_1 and A' = M_1 − {f_{m+1}}, then A' ⊆ A and ∏_{i∈A'} f_i = P^2/(P^2/B) = B. Conversely, if there exists an A' ⊆ A such that ∏_{i∈A'} f_i = B, then for M_1 = A' ∪ {f_{m+1}} and M_2 = (A − A') ∪ {f_{m+2}}, it is Pr[M_1] = Pr[M_2] = 1 − P^2. □
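The construction in the proof can be verified mechanically with exact rational arithmetic. The sketch below builds I_P from a made-up yes-instance of Subset Product (A' = {1/2, 1/3} with product B = 1/6); all names are illustrative.

```python
from fractions import Fraction

def reduce_subset_product(A, B):
    """Build the Fault-Tolerant Partition instance of Lemma 2.4 from a
    Subset Product instance (A, B) with rational 0 < f_i < 1, 0 < B < 1."""
    P = Fraction(1)
    for f in A:
        P *= f
    items = list(A) + [P * P / B, P * B]   # f_{m+1} = P^2/B, f_{m+2} = P*B
    eps = 2 * P**2 - P**4                  # so that 1 - eps = (1 - P^2)^2
    return items, eps

def pr(group):
    """Reliability of a group: 1 minus the product of failure probabilities."""
    p = Fraction(1)
    for f in group:
        p *= f
    return 1 - p

# A yes-instance: A' = {1/2, 1/3} has product B = 1/6.
A = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]
B = Fraction(1, 6)
items, eps = reduce_subset_product(A, B)
P = Fraction(1, 24)                        # = prod of A
M1 = [Fraction(1, 2), Fraction(1, 3), items[3]]   # A' plus f_{m+1}
M2 = [Fraction(1, 4), items[4]]                   # A - A' plus f_{m+2}
```

Both groups end up with reliability exactly 1 − P², so their product meets the (1 − ε) constraint with equality, as the proof predicts.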

3 Fault-Tolerant Partition of Identical Speeds

The similarity of Fault-Tolerant Partition and Bin Covering suggests that it may be possible to design an approximation algorithm for the former problem using the ideas of some algorithm for the latter. In this section, we present a 2-approximation algorithm for Maximum Fault-Tolerant Partition of identical speed machines based on the Next Fit algorithm for Bin Covering [1], and an additional idea for handling oversized/over-reliable items.

The analysis of Next Fit is based on the trivial fact that the total size of the items divided by the threshold T provides an upper bound on the optimal number of bins. In case of Bin Covering, this upper bound is a reasonable one, because, since any feasible solution must cover all the bins up to the given threshold T, there is no need to consider oversized items, that is, items of size greater than T. Similarly, in case of Fault-Tolerant Partition, given a set M = {f_1, ..., f_m}, the solution x_0 to the following equation provides a trivial upper bound on the optimal number of groups:

[ 1 − ( ∏_{i=1}^m f_i )^{1/x_0} ]^{x_0} = 1 − ε      (1)

Equation (1) implies that if non-integral bins/groups and placement of items were allowed, then an optimal solution would consist of x_0 groups, each of reliability 1 − α_0, where α_0 = (∏_{i=1}^m f_i)^{1/x_0}. Assume that the items are sorted in non-increasing order of reliability, and f_1 < α_0. Then, since each item must be placed into a single group, f_1 can contribute at most 1 to any optimal solution, i.e., the solution contains a group M_1 = {f_1}. However, f_1 contributes more than 1 to x_0 defined by Equation (1). In this setting, f_1 is an over-reliable item, because it is more reliable than the average reliability of a partition into x_0 groups, and Equation (1) may provide a poor upper bound on the optimal number of groups, because the contribution of f_1 is over-estimated.
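Numerically, x_0 can be recovered from Equation (1) by bisection, using the fact that the left-hand side is decreasing in x_0 (splitting the same total failure mass into more equal groups lowers the overall reliability). A float-based sketch, with made-up names:

```python
import math

def x0(fail_probs, eps, tol=1e-12):
    """Solve [1 - (prod f_i)^(1/x)]^x = 1 - eps for x by bisection.

    Works in logs for numerical stability; illustrative only.
    """
    log_F = sum(math.log(f) for f in fail_probs)   # log prod f_i (< 0)
    target = math.log(1 - eps)

    def g(x):
        # log of [1 - (prod f_i)^(1/x)]^x, decreasing in x
        return x * math.log1p(-math.exp(log_F / x))

    lo, hi = 1e-9, 1.0
    while g(hi) > target:          # grow hi until it brackets the root
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, four machines with f_i = 1/2 and ε = 0.4375 give x_0 = 2, since (1 − (1/16)^{1/2})^2 = 0.5625 = 1 − ε.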

Algorithm Next Fit Decreasing (NFD)

Input: M = {f_1, ..., f_m}, failure probabilities 1 > f_i > 0.
    Fault-tolerance constraint (1 − ε), 1 > ε ≥ ∏_{i∈M} f_i.
Output: A (1 − ε)-fault-tolerant partition of M into λ disjoint groups.

(1) Sort the items of M so that f_1 ≤ f_2 ≤ ... ≤ f_m, i.e., in non-increasing order of reliability.
(2) Compute the first index l, 0 ≤ l < m, such that f_{l+1} > α_l, where α_l = F_l^{1/x_l}, and x_l, F_l are defined by the following equation:

    [ 1 − F_l^{1/x_l} ]^{x_l} = (1 − ε)/P_l ,   F_l = ∏_{i=l+1}^m f_i ,   P_l = ∏_{i=1}^l (1 − f_i)      (2)

(3) For j = 1, ..., l, M_j = {f_j}, i.e., the group M_j consists only of the item f_j.
(4) The set {f_{l+1}, ..., f_m} is partitioned using Next Fit [1] with threshold 1 − α_l:

    j = l + 1; λ = l + 1; M_λ = ∅;
    while j ≤ m do
        if Pr[M_λ] = 1 − ∏_{i∈M_λ} f_i < 1 − α_l then   /* If M_λ is not yet filled, place f_j into M_λ */
            M_λ = M_λ ∪ {f_j};
        else                                            /* Else place f_j into a new group */
            λ = λ + 1; M_λ = {f_j};
        j = j + 1;
    end while;
    if Pr[M_λ] < 1 − α_l then λ = λ − 1; M_λ = M_λ ∪ M_{λ+1};

Figure 1: The Algorithm Next Fit Decreasing (NFD).
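Purely as an illustration of Figure 1, the sketch below implements NFD in floating-point arithmetic. It locates the over-reliable prefix by linear scan instead of the binary search used in the complexity analysis, solves Equation (2) by bisection, and uses a small tolerance to guard the float comparisons; all names are mine, not the paper's.

```python
import math

def nfd(fail, eps, tol=1e-9):
    """Float-based sketch of Next Fit Decreasing (Figure 1).

    fail: failure probabilities, eps: fault-tolerance parameter.
    Returns a list of groups (lists of failure probabilities).
    """
    f = sorted(fail)                      # non-increasing reliability
    m = len(f)

    def alpha(l):
        # alpha_l = F_l^(1/x_l), where [1 - F_l^(1/x_l)]^(x_l) = (1-eps)/P_l
        F = math.prod(f[l:])
        P = math.prod(1 - fi for fi in f[:l])
        target = (1 - eps) / P
        if target >= 1:
            return 1.0                    # the prefix alone already suffices
        lo, hi = 1e-9, float(m)
        for _ in range(200):              # [1 - F^(1/x)]^x decreases in x
            mid = (lo + hi) / 2
            if (1 - F ** (1 / mid)) ** mid > target:
                lo = mid
            else:
                hi = mid
        return F ** (1 / lo)

    l = 0                                 # step (2): over-reliable prefix
    while l < m - 1 and f[l] < alpha(l):
        l += 1
    a = alpha(l)

    groups = [[fi] for fi in f[:l]]       # step (3): singleton groups
    cur = []                              # step (4): Next Fit, threshold 1-a
    for fi in f[l:]:
        cur.append(fi)
        if 1 - math.prod(cur) >= 1 - a - tol:   # group filled
            groups.append(cur)
            cur = []
    if cur:                               # unfilled last group: merge back
        if len(groups) > l:
            groups[-1].extend(cur)
        else:
            groups.append(cur)
    return groups
```

On four machines with f_i = 1/2 and ε = 0.4375, there are no over-reliable items, α_0 = 1/4, and NFD returns two groups of two machines, each with reliability 3/4, matching x_0 = 2.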

On the other hand, an optimal solution will take advantage of the over-reliable group containing f_1 to relax the average reliability of the remaining groups. In particular, given that f_1 contributes 1 to the optimal number of groups, the remaining items can contribute at most x_1 groups to any optimal solution, where x_1 is defined by the following equation:

[ 1 − ( ∏_{i=2}^m f_i )^{1/x_1} ]^{x_1} = (1 − ε)/(1 − f_1)

It is easy to see that α_1 = (∏_{i=2}^m f_i)^{1/x_1} > α_0, because f_1 < α_0. This process must go on until no more over-reliable items exist, i.e., until we find the first index l such that f_{l+1} > α_l. The algorithm Next Fit Decreasing (NFD, Figure 1) initially computes a collection of l over-reliable items placed into the single item groups M_1, ..., M_l (steps (1)-(3)). The remaining items are partitioned using Next Fit with threshold 1 − α_l. The following theorem shows that the approximation ratio of Next Fit Decreasing is actually determined by the approximation ratio of Next Fit.

Theorem 3.1 Next Fit Decreasing (Figure 1) runs in time O(log m · (m − log ∏_{i∈M} f_i)) and is a 2-approximation algorithm for Maximum Fault-Tolerant Partition of identical speed machines.

Proof. We start by showing that the partitions produced by NFD are indeed (1 − ε)-fault-tolerant. Let P_l = ∏_{i=1}^l (1 − f_i) = Pr[M_1 ∧ ... ∧ M_l] be the probability that all the first l groups M_1, ..., M_l, each containing a single item, are active. Therefore, each of the remaining groups M_{l+1}, ..., M_λ must contain an active item with probability at least (1 − ε)/P_l. Since, for i = l+1, ..., λ, Pr[M_i] ≥ 1 − α_l and α_l = F_l^{1/x_l}, the number of groups M_{l+1}, ..., M_λ cannot be more than x_l, i.e., λ − l ≤ x_l. Hence,

Pr[M_{l+1} ∧ ... ∧ M_λ] = ∏_{i=l+1}^λ Pr[M_i] ≥ (1 − α_l)^{λ−l} ≥ (1 − α_l)^{x_l} = (1 − ε)/P_l , and

Pr[M_1 ∧ ... ∧ M_λ] ≥ P_l · (1 − ε)/P_l = 1 − ε .

Performance: Initially, by an argument similar to one used in the analysis of Next Fit for Bin Covering [1], we show that NFD always produces more than l + (x_l − 1)/2 groups, where x_l is defined by Equation (2). In particular, we show that NFD partitions the set

M_r = {f_{l+1}, ..., f_m} = M − ∪_{i=1}^l M_i

into at least (x_l − 1)/2 disjoint groups such that Pr[M_{l+1} ∧ ... ∧ M_λ] = ∏_{i=l+1}^λ Pr[M_i] ≥ (1 − ε)/P_l. Since, for all f_j ∈ M_r, f_j > α_l, for all but possibly the last group containing items of M_r, 1 − α_l ≤ Pr[M_i] < 1 − α_l². Moreover, the last group has Pr[M_λ] < 1 − α_l³. Therefore,

F_l = ∏_{i=l+1}^λ (1 − Pr[M_i]) > α_l^{2(λ−l)+1}

The definition of α_l = F_l^{1/x_l} implies that λ − l > (x_l − 1)/2, and, since the number of groups must be an integer, λ ≥ l + ⌊x_l/2⌋.

Next, we show that any optimal partition cannot have more than l + x_l groups. We consider the most reliable, (1 − ε)-fault-tolerant, optimal partition M* = {M*_1, ..., M*_{λ*}}. Let MRI(M*_i) = min_{j∈M*_i} f_j be the Most Reliable Item of M*_i, and assume that the groups M*_i are numbered in non-decreasing order of their MRI(M*_i) values, i.e., MRI(M*_1) ≤ MRI(M*_2) ≤ ... ≤ MRI(M*_{λ*}). Therefore, the first l groups in M* contain the items f_1, ..., f_l, i.e., the l most reliable items, and the following inclusion holds:

∪_{i=1}^l M_i ⊆ ∪_{i=1}^l M*_i      (3)

Let P*_l = ∏_{i=1}^l Pr[M*_i], M*_r = M − ∪_{i=1}^l M*_i, and F*_l = ∏_{j∈M*_r} f_j. Equation (3) implies that P_l ≤ P*_l and F_l ≤ F*_l. In case (3) holds with equality, the set M*_r cannot contribute more than x_l groups to any optimal partition. Otherwise, each group would have reliability less than 1 − α_l and, since the number of groups would be greater than x_l, the reliability of such a partition would be less than (1 − ε)/P_l.

Then, we assume that the most reliable, optimal solution M* corresponds to M*_r ⊂ M_r, and M*_r is partitioned into x*_l = λ* − l > x_l groups of total reliability at least (1 − ε)/P*_l, and we show that this contradicts the selection of M* as the most reliable, optimal partition. Since ∪_{i=1}^l M_i ⊂ ∪_{i=1}^l M*_i, some of the groups M*_1, ..., M*_l of the optimal solution M* must contain more than one item. Let M*_{z_1}, 1 ≤ z_1 ≤ l, be the first such group of M*. Therefore, the item f_{z_1} belongs to M*_{z_1}. Also, let f'_{z_1} be an item other than f_{z_1} placed into M*_{z_1} by the optimal solution M*. Clearly, Pr[M*_{z_1} − {f'_{z_1}}] ≥ 1 − α_l, because f_{z_1} ≤ α_l. Furthermore, since F_l ≤ F*_l, x*_l > x_l, and α_l = F_l^{1/x_l}, M* must contain another group M*_{z_2}, l + 1 ≤ z_2 ≤ λ*, such that Pr[M*_{z_2}] < 1 − α_l. Since Pr[M*_{z_2}] < Pr[M*_{z_1} − {f'_{z_1}}], the partition obtained by removing f'_{z_1} from M*_{z_1} and adding it to M*_{z_2} is also feasible, optimal and strictly more reliable than M*. This contradicts the selection of M* and implies that any set M*_r ⊆ M_r cannot contribute more than x_l groups to any optimal solution.

Since the optimal number of groups λ* ≤ l + x_l must be an integer, the above discussion

implies that 2λ ≥ λ* + l, and the number of groups obtained by NFD is at least half the optimal number of groups.

Then, we show that there exists a family of instances for which 2λ = λ*. Let x ≥ 1 be an integer and let α, ε be rational numbers, 1 > α > 0, 1 > ε > 0, that fulfill (1 − α²)^{2x} = 1 − ε. Also, let δ be any small rational that fulfills 0 < δ < min{√α, (1/α)^{1/2x} − 1}. Consider an instance of Fault-Tolerant Partition consisting of 4x items, where 2x of them, f_1, ..., f_{2x}, have failure probability equal to α(1 + δ), and the remaining 2x items, f_{2x+1}, ..., f_{4x}, have failure probability equal to α/(1 + δ), and let the fault-tolerance constraint be equal to (1 − ε). By the choice of x, α and δ, the optimal partition consists of the 2x groups {f_i, f_{2x+i}}, i = 1, ..., 2x. Furthermore, since f_1 > α_0, NFD places the first 2x items into x groups and all the remaining 2x items into the last group, because (1 − (α(1 + δ))^{2x}) < 1 − α_0 by the choice of δ.

Complexity: The complexity of the algorithm is dominated by steps (1) and (2). O(m log m) time is needed for step (1). As for step (2), the value of l can be decided using binary search because, by the definition of l, for all l' > l, f_{l'+1} > α_{l'} and, for all l' < l, f_{l'+1} ≤ α_{l'}. Additionally, each iteration of binary search can be implemented in time O(m + Σ_{i∈M} log(1/f_i)), because:

1. The function g(y) = (1 − y)^{ln(F_l)/ln(y)} is monotone decreasing with respect to y, and
2. Even though α_l can be a real number, we only need to determine the first Σ_{i=1}^m ⌈log(1/f_i)⌉ bits of α_l in order to (correctly) perform the subsequent comparisons. □

In the sequel, we extensively use the upper bound (l + x_l) on the number of reliable groups that can be obtained from a set of identical speed machines. In particular, given a set M of identical speed machines and a fault-tolerance constraint (1 − ε), IUB(M, 1 − ε) = l + x_l bounds from above the number of groups that can be produced from M with constraint (1 − ε). The bound IUB(M, 1 − ε) = l + x_l consists of the integer l, denoting the number of over-reliable items, and the real x_l, denoting the optimal non-integral number of groups that can be obtained from the instance (M_r, (1 − ε)/P_l), if non-integral placement of items is allowed.

4 Fault-Tolerant Partition of Related Speeds

4.1 A Simple Logarithmic Approximation Algorithm

The Safe Partition (SP) algorithm (Figure 2) combines two simple approaches to approximate Maximum Fault-Tolerant Partition of related speed machines within a logarithmic factor. Safe Partition starts by applying Next Fit with threshold equal to 1 − ε/m. Since any feasible solution cannot have more than m groups, the resulting partition is always (1 − ε)-fault-tolerant. Then, Safe Partition computes the largest effective speed, (1 − ε)-fault-tolerant group consisting of the first d + 1 machines, where d is the largest index such that ∏_{i=1}^d f_i > ε. The Safe Partition algorithm returns the best of these two solutions. The analysis of Safe Partition is simple and based on the facts

Algorithm Safe Partition { SP Input: M = f(f1 ; v1) : : : ; (fm ; vm )g, failure probabilities 1 > fi > 0,

speeds vi, v1    vm. Fault-tolerance constraint (1 ? ), 1 >   Qi2M fi. Output: A (1 ? )-fault-tolerant partition of M into disjoint groups.

j = 1;  = 1; M = ;; while j  m do Q if Pr[M ] = 1 ? i2M fi < 1 ? m else  =  + 1; M = f(fj ; vj )g; j = j + 1;

then

M = M [ f(fj ; vj )g;

end while; if Pr[M] < 1 ? m then  =  ? 1; V (M) = Pl=1 V (Ml) = Pl=1 mini2Ml fvig; Q Let d  0 be the last index such that di=1 fi > . If vd+1  V (M) then return ff(f1; v1); : : : ; (fd+1 ; vd+1 )gg; else return M = fM1; : : : ; M g;

Figure 2: The Algorithm Safe Partition (SP). that all the groups of both the aforementioned partitions have cardinality at most ln(m=) m, and any (1 ? )-fault-tolerant partition cannot have e ective speed more ln(1=fmax) than Pmi=d+1 vi.
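The control flow of SP can be sketched as follows in Python (a hypothetical implementation under simplifying assumptions: machines are already sorted by nonincreasing speed, and an unreliable trailing group is merged into its predecessor; all names are ours):

```python
import math

def safe_partition(machines, eps):
    """Sketch of Safe Partition (SP): `machines` is a list of (f_i, v_i)
    pairs with v_1 >= ... >= v_m; returns a (1-eps)-fault-tolerant
    partition, or a single group on the first d+1 machines."""
    m = len(machines)
    groups, current = [], []
    for f, v in machines:                 # Next Fit with threshold 1 - eps/m
        current.append((f, v))
        if 1.0 - math.prod(fi for fi, _ in current) >= 1.0 - eps / m:
            groups.append(current)
            current = []
    if current:                           # trailing group may be unreliable:
        if groups:
            groups[-1].extend(current)    # merge it into the last closed group
        else:
            groups = [current]
    v_groups = sum(min(v for _, v in g) for g in groups)
    # d = largest index with f_1 * ... * f_d > eps (0 if none)
    d, prod = 0, 1.0
    for f, _ in machines:
        prod *= f
        if prod > eps:
            d += 1
        else:
            break
    v_single = machines[d][1]             # speed of the (d+1)-st machine
    if v_single >= v_groups:
        return [machines[: d + 1]]        # one highly reliable group
    return groups
```

The input guarantee $\epsilon \ge \prod_{i \in M} f_i$ ensures the single-group alternative on the first $d+1$ machines is itself $(1-\epsilon)$-fault-tolerant.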

Lemma 4.1 Safe Partition (SP) (Figure 2) is a polynomial-time $2\left\lceil \frac{\ln(m/\epsilon)}{\ln(1/f_{\max})} \right\rceil$-approximation algorithm for Maximum Fault-Tolerant Partition of related speed machines, where $f_{\max} = \max_{i \in M} \{f_i\}$.

Proof. We start by observing that the output of SP is always $(1-\epsilon)$-fault-tolerant. Then, if $M_{\kappa+1} = M - \bigcup_{l=1}^{\kappa} M_l$, for all $l = 1, \ldots, \kappa+1$, the group $M_l$ cannot contain more than
$$ m_\epsilon = \left\lceil \frac{\ln(m/\epsilon)}{\ln(1/f_{\max})} \right\rceil $$
machines, because
$$ 1 - f_{\max}^{m_\epsilon} \ \ge\ 1 - \frac{\epsilon}{m}. $$
Since the effective speed of each group $M_l$ cannot be less than any $v_i \in M_{l+1}$, we obtain that, for all $l = 1, \ldots, \kappa$,
$$ m_\epsilon\, V(M_l) \ \ge\ |M_{l+1}|\, V(M_l) \ \ge\ \sum_{i \in M_{l+1}} v_i. $$
Similarly, $m_\epsilon\, v_{d+1} \ge |M_1|\, v_{d+1} \ge \sum_{i=d+1}^{d+|M_1|} v_i$.

To obtain an upper bound on the optimal effective speed, notice that, by the definition of $d$ as the largest index such that $\prod_{i=1}^{d} f_i > \epsilon$, any $(1-\epsilon)$-fault-tolerant partition must place each machine $(f_j, v_j)$, $j = 1, \ldots, d$, into the same group with some of the machines $(f_i, v_i)$, $i = d+1, \ldots, m$. Therefore, no $(1-\epsilon)$-fault-tolerant partition can have effective speed more than $\mathrm{Speed}(d+1) = \sum_{i=d+1}^{m} v_i$. Hence, the approximation ratio can be derived from the following inequality:
$$ m_\epsilon \left( v_{d+1} + \sum_{l=1}^{\kappa} V(M_l) \right) \ \ge\ \mathrm{Speed}(d+1). \qquad \square $$

4.2 A Constant Factor Approximation Algorithm

In this section, we present Speed Class Partition (SCP) (Figure 3), a constant factor approximation algorithm for Maximum Fault-Tolerant Partition of related speed machines. The algorithm divides the original instance into classes $I_j$ of almost identical speed machines. The analysis of Speed Class Partition is based on a technical lemma stating that there exists an allocation of portions $(1-\epsilon)^{\alpha_j^*}$ of the fault-tolerance constraint to the classes $I_j$, with $\sum_j \alpha_j^* \le 1$, so that the total effective speed of an optimal solution can be bounded from above by the sum, over all $j$, of $I_j$'s speed times the upper bound $IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right)$ on the number of groups obtained from the class $I_j$ with fault-tolerance constraint $(1-\epsilon)^{\alpha_j^*}$.

In order to approximate the values $\alpha_j^*$, the Speed Class Partition algorithm computes an appropriately selected set of samples $\alpha_j(i)$ in the interval $[0, 1]$, and, for each sample $\alpha_j(i)$, evaluates the number of groups $\kappa_j(i) = NFD\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$ produced by NFD from the speed class $I_j$ with fault-tolerance constraint $(1-\epsilon)^{\alpha_j(i)}$. For all the classes $I_j$, the profit-size pairs $(\kappa_j(i), \alpha_j(i))$ form an instance of Generalized Knapsack, whose solution suggests a near-optimal allocation of portions $(1-\epsilon)^{\alpha_j(i_j^*)}$ of the fault-tolerance constraint to each class $I_j$. Then, a feasible solution $M$ consists of the union of the partial solutions produced by NFD on the instances $\left(I_j, (1-\epsilon)^{\alpha_j(i_j^*)}\right)$. The Speed Class Partition algorithm returns the best of $M$ and the largest effective speed, $(1-\epsilon)$-fault-tolerant group consisting of the first $d+1$ machines. The following theorem shows that this approach indeed yields a constant factor approximation algorithm.

Theorem 4.2 For any constant $\delta > 0$, Speed Class Partition (SCP) (Figure 3) is a polynomial-time $(8+\delta)$-approximation algorithm for Maximum Fault-Tolerant Partition of related speed machines. Moreover, the time complexity of SCP is polynomial in $\frac{1}{\delta}$.

Proof. We start by observing that, wlog., we can assume that $v_i = v_{d+1}$ for all $i = 1, \ldots, d$, because, by the definition of $d$, all the machines with index no more than

Algorithm Speed Class Partition (SCP)
Input: $M = \{(f_1, v_1), \ldots, (f_m, v_m)\}$, failure probabilities $1 > f_i > 0$, speeds $v_i$, $v_1 \ge \cdots \ge v_m$. Fault-tolerance constraint $(1-\epsilon)$, $1 > \epsilon \ge \prod_{i \in M} f_i$.
Output: A $(1-\epsilon)$-fault-tolerant partition of $M$ into disjoint groups.

(1) Let $d \ge 0$ be the last index such that $\prod_{i=1}^{d} f_i > \epsilon$, and let $\beta = \lfloor \log v_{d+1} \rfloor$. For all $i = 1, \ldots, d$, set $v_i = v_{d+1}$. For all $j = 0, \ldots, \beta$, let $I_j = \{f_i : (f_i, v_i) \in M \wedge 2^j \le v_i < 2^{j+1}\}$ be the class of machines whose speeds belong to $[2^j, 2^{j+1})$. In the following, we assume that, for all $f_i \in I_j$, $v_i = 2^j$.

(2) For each class $I_j$, compute a set $\Gamma_j$ of pairs $(\mu_j(i), \alpha_j(i))$, where $1 \ge \alpha_j(i) \ge 0$, and the $\mu_j(i)$'s are defined by $\mu_j(i) = IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$. Each $\Gamma_j$ must contain pairs $(\lambda, \alpha_j(i))$, for all integers $\lambda = 0, 1, \ldots, \lfloor IUB(I_j, 1-\epsilon) \rfloor$, and, for all $i \ge 0$, $\mu_j(i+1) \ge \mu_j(i) - 1$.

(3) For each class $I_j$, compute a set $\hat{\Gamma}_j$ of pairs $(\kappa_j(i), \alpha_j(i))$ as follows: for all $(\mu_j(i), \alpha_j(i)) \in \Gamma_j$, use NFD to compute the pair $(\kappa_j(i), \alpha_j(i))$, where $\kappa_j(i) = NFD\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$. A detailed implementation of steps (2) and (3) is described in Figure 4.

(4) For each class $I_j$, select exactly one pair $(\kappa_j(i_j^*), \alpha_j(i_j^*))$ from $\hat{\Gamma}_j$ so as to maximize the function
$$ \sum_{j=0}^{\beta} 2^j \kappa_j(i_j^*), \quad \text{subject to} \quad \sum_{j=0}^{\beta} \alpha_j(i_j^*) \le 1. $$

(5) If $v_{d+1} \ge \sum_{j=0}^{\beta} 2^j \kappa_j(i_j^*)$, then return a single group $\{(f_1, v_1), \ldots, (f_{d+1}, v_{d+1})\}$. Otherwise, return $M = \bigcup_{j=0}^{\beta} M_j$, where $M_j = \{M_1^j, \ldots, M_{\kappa_j(i_j^*)}^j\}$ is the partition produced by $NFD\!\left(I_j, (1-\epsilon)^{\alpha_j(i_j^*)}\right)$.

Figure 3: The Algorithm Speed Class Partition (SCP).
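Step (1) of SCP is just a power-of-two bucketing of the machine speeds; a minimal sketch (our own helper, with speeds rounded down to $2^j$ as in the analysis):

```python
import math

def speed_classes(machines):
    """Bucket (f_i, v_i) pairs into classes I_j with v_i in [2**j, 2**(j+1)),
    rounding each speed down to 2**j (sketch of step (1) of SCP)."""
    classes = {}
    for f, v in machines:
        j = int(math.floor(math.log2(v)))
        classes.setdefault(j, []).append((f, 2 ** j))
    return classes
```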

$d$ must be grouped together with machines of speed at most $v_{d+1}$ by any $(1-\epsilon)$-fault-tolerant partition. Therefore, this assumption cannot decrease the effective speed of an optimal solution.

In steps (2) and (3), SCP computes appropriately selected samples $\alpha_j(i)$ used for the formulation of the Generalized Knapsack instance in step (4). In particular, the $\alpha_j(i)$'s are computed so that the corresponding $\mu_j(i)$'s are integers, and do not differ too much from each other, i.e., $1 \ge \mu_j(i) - \mu_j(i+1) \ge 0$. In Section 4.2.1, we provide a detailed implementation of steps (2) and (3) (Figure 4), and we prove that an appropriate set of samples $\alpha_j(i)$ can be computed in polynomial time.

Feasibility: Clearly, if a single group of speed $v_{d+1}$ is returned, then $\prod_{i=1}^{d+1} f_i \le \epsilon$. Otherwise, the feasibility of NFD implies that $\Pr[M_j] \ge (1-\epsilon)^{\alpha_j(i_j^*)}$, for all $j = 0, \ldots, \beta$, and $\Pr[M] = \prod_{j=0}^{\beta} \Pr[M_j] \ge 1 - \epsilon$, since $\sum_{j=0}^{\beta} \alpha_j(i_j^*) \le 1$.

Performance: In the following, we only consider machine speeds $v_{d+1}, \ldots, v_m$ that are integer powers of 2. Obviously, the original speeds $v_{d+1}, \ldots, v_m$ can be rounded down to the nearest integer power of 2 by only losing a factor of 2 in the approximation ratio. Therefore, the speeds of all the machines belonging to each class $I_j$ are assumed to be equal to $2^j$. The performance analysis of the SCP algorithm is based on the following technical lemma, whose proof is deferred to Section 4.2.2. This lemma implies that Fault-Tolerant Partition of related speeds can be reduced to Fault-Tolerant Partition of identical speeds by appropriately allocating portions $(1-\epsilon)^{\alpha_j^*}$ of the fault-tolerance constraint to the speed classes $I_j$.

Lemma 4.3 There exist reals $0 \le \alpha_j^* \le 1$, $j = 0, \ldots, \beta$, with $\sum_{j=0}^{\beta} \alpha_j^* \le 1$, such that the objective value $V(M^*)$ of an optimal partition $M^*$ fulfills the following inequality:
$$ V(M^*) \ \le\ \sum_{j=0}^{\beta} 2^j\, IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right). $$

Since the $\kappa_j(i)$'s are computed by NFD, for all $i \ge 0$, $\kappa_j(i) \ge \frac{\lfloor \mu_j(i) \rfloor}{2}$. Thus, for the optimal pairs $(\kappa_j(i_j^*), \alpha_j(i_j^*)) \in \hat{\Gamma}_j$, we have
$$ \sum_{j=0}^{\beta} 2^j \kappa_j(i_j^*) \ \ge\ \sum_{j=0}^{\beta} 2^j\, \frac{\lfloor \mu_j(i_j^*) \rfloor}{2}, \qquad (4) $$
where $\mu_j(i_j^*) = IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i_j^*)}\right)$.

Then, we consider the largest valued sample $\alpha_j(i)$ for the class $I_j$ that does not exceed $\alpha_j^*$, and we show that the corresponding $\mu_j(i) = IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$ satisfies the inequality $\lfloor \mu_j(i) \rfloor \ge IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right) - 1$. To see this, recall that the samples $\alpha_j(i)$ are computed so as to satisfy the following properties (see also step (2) of SCP, Figure 3, and the implementation of steps (2) and (3), Figure 4):

(a) $\Gamma_j$ contains pairs $(\lambda, \alpha_j(i))$, for all integers $\lambda = 0, 1, \ldots, \lfloor IUB(I_j, 1-\epsilon) \rfloor$, and

(b) for all $i \ge 0$, $\mu_j(i+1) \ge \mu_j(i) - 1$.

Therefore, if the sample $\alpha_j(i)$ corresponds to an integer valued $\mu_j(i)$, then $\mu_j(i)$ cannot be less than $IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right) - 1$, because of (b). Otherwise, the smallest sample $\alpha_j(i')$ exceeding $\alpha_j^*$ must correspond to a $\mu_j(i')$ such that $\mu_j(i') \le IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right) \le \lceil \mu_j(i) \rceil$, because of property (a). Hence, the inequality $\lceil \mu_j(i) \rceil \ge IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right)$ implies that $\lfloor \mu_j(i) \rfloor \ge IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right) - 1$. Therefore, by the inequality (4),
$$ \sum_{j=0}^{\beta} 2^j \kappa_j(i_j^*) \ \ge\ \frac{1}{2} \sum_{j=0}^{\beta} 2^j \left( IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right) - 1 \right) \ \ge\ \frac{1}{2} \left( V(M^*) - \sum_{j=0}^{\beta} 2^j \right) \ \ge\ \frac{1}{2} \left( V(M^*) - 2 v_{d+1} \right), $$
and hence $V(M^*) \le 2 \sum_{j=0}^{\beta} 2^j \kappa_j(i_j^*) + 2 v_{d+1}$.

However, the problem of computing an optimal collection $(\kappa_j(i_j^*), \alpha_j(i_j^*)) \in \hat{\Gamma}_j$, $j = 0, \ldots, \beta$, corresponds to the following generalization of Knapsack: Given $\beta + 1$ sets $\hat{\Gamma}_j$, each containing items of profit $\hat{\lambda}_j(i) = 2^j \kappa_j(i)$ and size $\alpha_j(i)$, with $(0, 0) \in \hat{\Gamma}_j$, select exactly one item $i_j$ from each $\hat{\Gamma}_j$ so as to maximize $\sum_{j=0}^{\beta} \hat{\lambda}_j(i_j)$, subject to $\sum_{j=0}^{\beta} \alpha_j(i_j) \le 1$. Obviously, this problem is NP-complete, because it is a generalization of Knapsack. In Section 4.2.3, we prove that the FPTAS for ordinary Knapsack (e.g., see Section 9.3 of [5], or [6, 11]) can be generalized to an FPTAS for Generalized Knapsack. Hence, since the solution returned by SCP has objective value no less than $v_{d+1}$ and no less than $\frac{1}{1+\delta} \sum_{j=0}^{\beta} 2^j \kappa_j(i_j^*)$, for any constant $\delta > 0$, SCP is a $(4+\delta)$-approximation algorithm for instances of Fault-Tolerant Partition with machine speeds equal to integer powers of 2. Furthermore, for any constant $\delta > 0$, SCP is an $(8+\delta)$-approximation algorithm for Fault-Tolerant Partition of arbitrary speed machines.

Complexity: Steps (1)-(3) can be implemented in time polynomial in the size of the input (see also Section 4.2.1). Moreover, step (4) can be implemented in time polynomial in the size of the input and $\frac{1}{\delta}$, for any $\delta > 0$. The SCP algorithm only uses the pairs $(\mu_j(i), \alpha_j(i)) \in \Gamma_j$ to decide an appropriate collection of instances $\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$ for which the performance of NFD must be computed. Therefore, a polynomial number of bits suffices for storing the values $(1-\epsilon)^{\alpha_j(i)}$, since NFD only needs to compare the value of the fault-tolerance constraint with rational numbers.

Let $M_j(i)$ be the partition that NFD produces on input $\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$. Then, since we can approximate Generalized Knapsack within any constant $\delta > 0$ in time polynomial in $\frac{1}{\delta}$, we can use approximate values for $\alpha_j(i) \approx \log_{(1-\epsilon)} \Pr[M_j(i)]$ without significantly decreasing the overall performance of SCP. We can also use precise values by replacing the $\alpha_j(i)$'s with the corresponding rational numbers $\Pr[M_j(i)]$, which can be stored using at most $2 \sum_{\tau \in I_j} \log(1/f_\tau)$ bits. Then, the algorithm for Generalized Knapsack must store the products of the $\Pr[M_j(i)]$'s, and check whether these products are at least $(1-\epsilon)$. $\square$

4.2.1 Implementation of the SCP Algorithm: Steps (2) and (3)

In this section, we show that the sets $\Gamma_j$ and $\hat{\Gamma}_j$ computed by the algorithm described in Figure 4 fulfill the properties required by the SCP algorithm. In each iteration, the variable $l$ represents the number of single-item groups consisting of over-reliable items for the current value $(1-\epsilon)^{\alpha_j(i)}$ of the fault-tolerance constraint. Additionally, the variable $x_j(i)$ represents the number of the remaining groups, computed by evenly distributing the failure probability $F_l$ in order to fulfill the fault-tolerance constraint $\frac{(1-\epsilon)^{\alpha_j(i)}}{P_l}$. Clearly, the value of $l$ either remains the same or decreases as the value of $(1-\epsilon)^{\alpha_j(i)}$ increases towards 1. By construction, $1 \ge \alpha_j(i) \ge 0$.

We show that, for all $i \ge 0$, $\mu_j(i) \ge \mu_j(i-1) - 1$. Steps (2.3)-(2.5) correspond to the case that the value of $l$ remains the same. If $l$ does not decrease, then the initial estimate $l + x_j(i)$ for the new value of $\mu_j(i)$ equals the corresponding upper bound $IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$. Moreover, if $\mu_j(i-1)$ is an integer, then $\mu_j(i) = \mu_j(i-1) - 1$. Otherwise, $\mu_j(i) = \lfloor \mu_j(i-1) \rfloor \ge \mu_j(i-1) - 1$.

In case the value of $l$ decreases as a result of decreasing $\alpha_j(i)$, the failure probabilities of some of the $l$ single-item groups should be greater than the failure probabilities of the remaining $x_j(i)$ groups. Therefore, a partition into at least $l + x_j(i)$ groups can be obtained by placing the items belonging to some of the less reliable single-item groups together with the remaining items $f_{l+1}, \ldots, f_{m_j}$, where $m_j = |I_j|$. Hence, the corresponding upper bound $\mu_j(i) = IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$ cannot be less than the initial estimate $l + x_j(i)$, which equals either $\lfloor \mu_j(i-1) \rfloor$ or $\mu_j(i-1) - 1$. Then, we compute the right index so that the $\mu_j(i)$ values form a non-increasing sequence.

From the discussion above, it also becomes clear that $\Gamma_j$ contains pairs $(\lambda, \alpha_j(i))$ for all integers $\lambda = 0, 1, \ldots, \lfloor IUB(I_j, 1-\epsilon) \rfloor$. Also, since, for each value $\alpha_j(i)$, $\mu_j(i) = IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$, the analysis of NFD implies that $\kappa_j(i) \ge \frac{\lfloor \mu_j(i) \rfloor}{2}$.

Complexity: At any point in time, neither $l$ nor $x_j(i)$ can be more than $m_j$. Also, in each iteration, except for the iteration that follows a decrease of $l$, either $l$ or $x_j(i)$ decreases by 1. Hence, the algorithm terminates after at most $m_j^2$ iterations. Moreover, the analysis of NFD implies that each iteration can be performed in polynomial time. Additionally, notice that we do not need to calculate the values of $\alpha_j(i)$, since we only use the corresponding values $(1-\epsilon)^{\alpha_j(i)}$ of the fault-tolerance constraint. Since the SCP algorithm only uses the $\kappa_j(i)$'s computed by NFD for appropriately selected

(1) Compute $IUB(I_j, 1-\epsilon) = l_j + x_l^j$. Set $\alpha_j(0) = 1$, $\mu_j(0) = IUB(I_j, 1-\epsilon)$, $x_j(0) = x_l^j$, $l = l_j$, $i = 0$, $m_j = |I_j|$. Set $\kappa_j(0) = NFD(I_j, 1-\epsilon)$.

(2) while $\mu_j(i) > 1$ do
  (2.1) $i = i + 1$. If $x_j(i-1)$ is an integer, then $x_j(i) = x_j(i-1) - 1$. Otherwise, $x_j(i) = \lfloor x_j(i-1) \rfloor$.
  (2.2) Compute $\alpha_j(i)$ from the following equation:
$$ F_l = \prod_{\tau=l+1}^{m_j} f_\tau, \qquad P_l = \prod_{\tau=1}^{l} (1 - f_\tau), \qquad \left( 1 - F_l^{1/x_j(i)} \right)^{x_j(i)} = \frac{(1-\epsilon)^{\alpha_j(i)}}{P_l}. $$
  (2.3) If $l = 0$, then $\mu_j(i) = x_j(i)$, $\kappa_j(i) = NFD\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$, and go to (2).
  (2.4) Compute $IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right) = l_j + x_l^j$.
  (2.5) If $l = l_j$, then $\mu_j(i) = l + x_j(i)$, $\kappa_j(i) = NFD\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$, and go to (2).
  (2.6) If $l > l_j$, then $\mu_j(i) = IUB\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$, $x_j(i) = x_l^j$, $l = l_j$, $\kappa_j(i) = NFD\!\left(I_j, (1-\epsilon)^{\alpha_j(i)}\right)$.
  (2.7) If $\mu_j(i) < \mu_j(i-1)$, then go to (2).
  (2.8) Otherwise, find the smallest index $t$, $i \ge t \ge 0$, such that $\mu_j(i) > \mu_j(t)$ and $\mu_j(t-1) \ge \mu_j(i)$. Then, set $\alpha_j(t) = \alpha_j(i)$, $x_j(t) = x_j(i)$, $\mu_j(t) = \mu_j(i)$, $\kappa_j(t) = \kappa_j(i)$, $i = t$, and go to (2).

(3) Compute $\alpha_j(i+1) = \log_{(1-\epsilon)}\left( 1 - \prod_{\tau=1}^{m_j} f_\tau \right)$.
  (3.1) If $\alpha_j(i+1) < 1$, then set $\mu_j(i+1) = 1$, $\kappa_j(i+1) = 1$, $\alpha_j(i+2) = 0$, $\mu_j(i+2) = 0$, $\kappa_j(i+2) = 0$.
  (3.2) Otherwise, set $\alpha_j(i+1) = 0$, $\mu_j(i+1) = 0$, $\kappa_j(i+1) = 0$.

Figure 4: An implementation of steps (2) and (3) of SCP.

values of the fault-tolerance constraint, it is sufficient to store the first $\sum_{\tau=1}^{m_j} \log(1/f_\tau)$ bits of each $(1-\epsilon)^{\alpha_j(i)}$. $\square$

4.2.2 The Proof of Lemma 4.3

Lemma 4.3 states that there exists a real number $\alpha_j^* \ge 0$ for each class $I_j$, with $\sum_{j=0}^{\beta} \alpha_j^* \le 1$, such that the total effective speed $V(M^*)$ of an optimal partition $M^*$ can be bounded from above by $\sum_{j=0}^{\beta} 2^j\, IUB\!\left(I_j, (1-\epsilon)^{\alpha_j^*}\right)$, where all the machines of the class $I_j$ are assumed to have speed equal to $2^j$. The proof actually shows how to calculate an appropriate set of real numbers from the most reliable, optimal partition $M^*$.

Proof. Let $M^* = \{M_1^*, \ldots, M_\kappa^*\}$ be the most reliable, optimal, $(1-\epsilon)$-fault-tolerant partition of $M$, and let $V(M^*) = \sum_{i=1}^{\kappa} V(M_i^*) = \sum_{i=1}^{\kappa} \min_{\tau \in M_i^*} \{v_\tau\}$ be the effective speed of $M^*$. Given such an optimal solution $M^*$, we show how to calculate an appropriate set of $\beta + 1$ real numbers $\alpha_j$. In order to calculate the $\alpha_j$'s, we examine the contribution of each class $I_j$ to the groups $M_i^*$. For each group $M_i^*$, we calculate the contribution $c_j^i$ of the $I_j$ items to $F^i = \prod_{\tau \in M_i^*} f_\tau = 1 - \Pr[M_i^*]$. Then, we calculate the contribution $\alpha_j^i$ of the $I_j$ items to the portion $\log_{(1-\epsilon)} \Pr[M_i^*]$ of the fault-tolerance constraint that has been devoted to $M_i^*$.

For technical reasons, we have to distinguish two cases. For the groups $M_i^*$ entirely consisting of items belonging to the class $I_j$, the contributions $c_j^i$ and $\alpha_j^i$ are accumulated into $c_j^{(1)}$ and $\alpha_j^{(1)}$, respectively. Both $c_j^{(1)}$ and $\alpha_j^{(1)}$ are initially equal to 0. For the groups $M_i^*$ not entirely consisting of items belonging to a single class $I_j$, the contributions $c_j^i$ and $\alpha_j^i$ are accumulated into $c_j^{(2)}$ and $\alpha_j^{(2)}$, respectively. Also, both $c_j^{(2)}$ and $\alpha_j^{(2)}$ are initially equal to 0. For each class $I_j$, the real number $\alpha_j$ is the sum of $\alpha_j^{(1)}$ and $\alpha_j^{(2)}$.

For each group $M_i^*$ entirely consisting of items belonging to the class $I_j$, we increase $c_j^{(1)}$ by $c^i = c_j^i = 1$, and $\alpha_j^{(1)}$ by $\alpha^i = \alpha_j^i = \log_{(1-\epsilon)} \Pr[M_i^*]$. Clearly, the contribution of each such group $M_i^*$ to the objective value of $M^*$ is exactly $2^j$. Let $I_j^1$ contain all the items of $I_j$ belonging to some group $M_i^* \subseteq I_j$. The quantity $c_j^{(1)}$ cannot be more than the optimal number of groups $\kappa_j^1$ obtained from $I_j^1$ with fault-tolerance constraint $(1-\epsilon)^{\alpha_j^{(1)}}$, because the groups $M_i^*$, $M_i^* \subseteq I_j$, form a $(1-\epsilon)^{\alpha_j^{(1)}}$-fault-tolerant partition of $I_j^1$. Moreover, by the analysis of NFD,
$$ c_j^{(1)} \ \le\ \kappa_j^1 \ \le\ IUB\!\left(I_j^1, (1-\epsilon)^{\alpha_j^{(1)}}\right). \qquad (5) $$

Let $Z_j \subseteq M^*$ contain all the groups $M_i^*$ such that $M_i^* \cap I_j \neq \emptyset$ and $M_i^*$ does not entirely consist of $I_j$ items. For each $M_i^* \in Z_j$, let $F^i = 1 - \Pr[M_i^*]$ and $F_j^i = \prod_{\tau \in M_i^* \cap I_j} f_\tau$. Then, for each $M_i^* \in Z_j$, we increase $c_j^{(2)}$ by $c_j^i = \log_{F^i} F_j^i$ and $\alpha_j^{(2)}$ by $\alpha_j^i = c_j^i \left( \log_{(1-\epsilon)} \Pr[M_i^*] \right) = c_j^i\, \alpha^i$. Obviously, the contribution of each $M_i^*$ to $V(M^*)$ is bounded by
$$ V(M_i^*) \ \le\ \sum_{j : I_j \cap M_i^* \neq \emptyset} 2^j c_j^i. $$

Let $I_j^2$ contain all the items of $I_j$ belonging to groups $M_i^* \in Z_j$. Then, we show that $c_j^{(2)}$ cannot be more than the upper bound $IUB\!\left(I_j^2, (1-\epsilon)^{\alpha_j^{(2)}}\right)$ on the optimal number of groups obtained from $I_j^2$ with fault-tolerance constraint $(1-\epsilon)^{\alpha_j^{(2)}}$.

Assume that, for some class $I_j$, $c_j^{(2)} > IUB\!\left(I_j^2, (1-\epsilon)^{\alpha_j^{(2)}}\right)$, and let $F_j^2 = \prod_{\tau \in I_j^2} f_\tau = \prod_{M_i^* \in Z_j} (F^i)^{c_j^i}$. It should be clear that the number $x \ge 0$ defined by the equation
$$ \left( 1 - (F_j^2)^{1/x} \right)^x = (1-\epsilon)^{\alpha_j^{(2)}} $$
cannot be less than $c_j^{(2)}$, because of the following inequality (see also Proposition 4.4):
$$ \left[ 1 - \left( \prod_{M_i^* \in Z_j} (F^i)^{c_j^i} \right)^{1/c_j^{(2)}} \right]^{c_j^{(2)}} \ \ge\ \prod_{M_i^* \in Z_j} \left( 1 - F^i \right)^{c_j^i} \ =\ (1-\epsilon)^{\sum_{M_i^* \in Z_j} c_j^i \alpha^i} \ =\ (1-\epsilon)^{\alpha_j^{(2)}}. \qquad (6) $$

Therefore, it must be the case that $c_j^{(2)} > l_j^2 \ge 0$, where $l_j^2$ is the number of over-reliable items included in the instance $\left(I_j^2, (1-\epsilon)^{\alpha_j^{(2)}}\right)$. Hence, by the analysis of NFD, there must exist a group $M_{z_1}^* \in Z_j$ and an item $f_{z_1} \in M_{z_1}^* \cap I_j^2$, such that $f_{z_1} < (F_j^2)^{1/x}$. This implies that
$$ 1 - f_{z_1} \ >\ 1 - (F_j^2)^{1/x} \ =\ (1-\epsilon)^{\alpha_j^{(2)}/x} \ \ge\ (1-\epsilon)^{\alpha_j^{(2)}/c_j^{(2)}}, $$
because $x \ge c_j^{(2)}$. Additionally, since $c_j^{(2)} = \sum_{M_i^* \in Z_j} c_j^i$ and $\alpha_j^{(2)} = \sum_{M_i^* \in Z_j} c_j^i \alpha^i$, where $\alpha^i = \log_{(1-\epsilon)} \Pr[M_i^*]$, there must exist another group $M_{z_2}^* \in Z_j$ being allocated no more than the average portion of the fault-tolerance constraint. Hence,
$$ \Pr[M_{z_2}^*] \ =\ (1-\epsilon)^{\alpha^{z_2}} \ \le\ (1-\epsilon)^{\alpha_j^{(2)}/c_j^{(2)}}. $$

Then, we consider the partition $M'$ obtained from $M^*$ by replacing the group $M_{z_1}^*$ by $M'_{z_1} = \{f_{z_1}\}$, and the group $M_{z_2}^*$ by $M'_{z_2} = M_{z_2}^* \cup \left( M_{z_1}^* - \{f_{z_1}\} \right)$. Since $1 - f_{z_1} > \Pr[M_{z_2}^*]$, by an argument similar to the one used in the analysis of NFD, $M'$ is strictly more reliable than $M^*$. Moreover, it is not hard to verify that $V(M'_{z_1}) + V(M'_{z_2}) \ge V(M_{z_1}^*) + V(M_{z_2}^*)$. This is a contradiction to the selection of $M^*$ as the most reliable, optimal, $(1-\epsilon)$-fault-tolerant partition of $M$. Therefore, there always exists an optimal solution that, for all speed classes $I_j$, satisfies the following inequality:
$$ c_j^{(2)} \ \le\ IUB\!\left(I_j^2, (1-\epsilon)^{\alpha_j^{(2)}}\right). \qquad (7) $$
Obviously, since $I_j^1 \cap I_j^2 = \emptyset$ and $I_j^1 \cup I_j^2 \subseteq I_j$, the inequalities (5) and (7) imply that $\sum_i c_j^i = c_j^{(1)} + c_j^{(2)} \le IUB\!\left(I_j, (1-\epsilon)^{\alpha_j}\right)$. Therefore, since, for all $j = 0, \ldots, \beta$, $\sum_{i=1}^{\kappa} c_j^i = c_j^{(1)} + c_j^{(2)}$, the following holds for the real numbers $\alpha_j$:
$$ \sum_{i=1}^{\kappa} V(M_i^*) \ \le\ \sum_{i=1}^{\kappa} \sum_{j=0}^{\beta} c_j^i\, 2^j \ =\ \sum_{j=0}^{\beta} 2^j \left( c_j^{(1)} + c_j^{(2)} \right) \ \le\ \sum_{j=0}^{\beta} 2^j\, IUB\!\left(I_j, (1-\epsilon)^{\alpha_j}\right). $$
By the definition of $\alpha_j$, it is straightforward that $\alpha_j \ge 0$, and, since $\Pr[M^*] \ge 1-\epsilon$,
$$ \sum_{i=1}^{\kappa} \alpha^i \ =\ \sum_{i=1}^{\kappa} \sum_{j=0}^{\beta} \alpha_j^i \ =\ \sum_{j=0}^{\beta} \alpha_j \ \le\ 1. \qquad \square $$

In the proof of Lemma 4.3, Equation (6), we have used the following inequality, which holds because the function $g(y) = \ln(1 - e^y)$ is concave for all $y < 0$.

Proposition 4.4 For any $F_i$, $1 > F_i > 0$, and $c_i > 0$, $i = 1, \ldots, n$, the following inequality holds:
$$ \left[ 1 - \left( \prod_{i=1}^{n} F_i^{c_i} \right)^{1/C} \right]^{C} \ \ge\ \prod_{i=1}^{n} (1 - F_i)^{c_i}, \qquad (8) $$
where $C = \sum_{i=1}^{n} c_i$.

Proof. The proof is by induction on $n$. For $n = 1$, (8) trivially holds with equality. For $n = 2$, let $F = (F_1^{c_1} F_2^{c_2})^{1/(c_1+c_2)}$. Also, if $z_1 = \ln(F_1)$ and $z_2 = \ln(F_2)$, then
$$ z = \ln(F) = \frac{c_1 z_1 + c_2 z_2}{c_1 + c_2}. $$
Moreover, the function $g(y) = \ln(1 - e^y)$ is concave in $(-\infty, 0)$, because $g''(y) = -e^y (1 - e^y)^{-2} < 0$ for all $y < 0$. This implies that
$$ \ln(1 - F) = \ln(1 - e^z) = g(z) \ \ge\ \frac{c_1 g(z_1) + c_2 g(z_2)}{c_1 + c_2} \ =\ \frac{c_1 \ln(1 - e^{z_1}) + c_2 \ln(1 - e^{z_2})}{c_1 + c_2} \ =\ \frac{c_1 \ln(1 - F_1) + c_2 \ln(1 - F_2)}{c_1 + c_2}. $$
Hence,
$$ \left[ 1 - (F_1^{c_1} F_2^{c_2})^{1/(c_1+c_2)} \right]^{c_1+c_2} = (1 - F)^{c_1} (1 - F)^{c_2} \ \ge\ (1 - F_1)^{c_1} (1 - F_2)^{c_2}. $$
We inductively assume that (8) is true for some integer $n \ge 2$, and we prove it for $n+1$. Let $C(n) = \sum_{i=1}^{n} c_i$ and $F(n) = \left( \prod_{i=1}^{n} F_i^{c_i} \right)^{1/C(n)}$. Then,
$$ \left[ 1 - \left( \prod_{i=1}^{n+1} F_i^{c_i} \right)^{1/\sum_{i=1}^{n+1} c_i} \right]^{\sum_{i=1}^{n+1} c_i} = \left[ 1 - \left( F(n)^{C(n)}\, F_{n+1}^{c_{n+1}} \right)^{1/(C(n)+c_{n+1})} \right]^{C(n)+c_{n+1}} \ \ge\ (1 - F(n))^{C(n)} (1 - F_{n+1})^{c_{n+1}} \ \ge\ \prod_{i=1}^{n+1} (1 - F_i)^{c_i}, $$
where we first use (8) for $n = 2$ and then the inductive hypothesis. $\square$
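Proposition 4.4 is easy to sanity-check numerically; the following throwaway script (ours, not part of the paper) compares both sides of (8) on random instances:

```python
import math
import random

def lhs(F, c):
    # [1 - (prod F_i^{c_i})^{1/C}]^C  with C = sum c_i
    C = sum(c)
    G = math.prod(Fi ** ci for Fi, ci in zip(F, c)) ** (1.0 / C)
    return (1.0 - G) ** C

def rhs(F, c):
    # prod (1 - F_i)^{c_i}
    return math.prod((1.0 - Fi) ** ci for Fi, ci in zip(F, c))

# spot-check inequality (8) on random instances
random.seed(0)
for _ in range(1000):
    n = random.randint(1, 5)
    F = [random.uniform(0.01, 0.99) for _ in range(n)]
    c = [random.uniform(0.1, 3.0) for _ in range(n)]
    assert lhs(F, c) >= rhs(F, c) - 1e-12
```

For $n = 1$ the two sides coincide (up to floating-point rounding), matching the base case of the induction.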

4.2.3 The Approximability of Generalized Knapsack

In this section, we study the approximability of Generalized Knapsack. Notice that Generalized Knapsack is NP-complete, since it is a generalization of ordinary Knapsack.

Definition 4.5 (Generalized Knapsack)
Instance: Sets $\hat{\Gamma}_j = \{(\hat{\lambda}_j(0), \alpha_j(0)), \ldots, (\hat{\lambda}_j(\eta_j), \alpha_j(\eta_j))\}$, $j = 0, \ldots, \beta$, for some integers $\hat{\lambda}_j(i) \ge 0$ and rationals $1 \ge \alpha_j(i) \ge 0$, $i = 0, \ldots, \eta_j$. Each $\hat{\Gamma}_j$ contains the item $(0, 0)$.
Solution: A collection of exactly one pair $(\hat{\lambda}_j(i_j), \alpha_j(i_j))$ from each $\hat{\Gamma}_j$, such that $\sum_{j=0}^{\beta} \alpha_j(i_j) \le 1$.
Objective: Maximize the total profit $\sum_{j=0}^{\beta} \hat{\lambda}_j(i_j)$.

Lemma 4.6 There exists a polynomial-time 2-approximation algorithm for Generalized Knapsack.

Proof. Consider the straightforward Linear Programming relaxation for the Generalized Knapsack problem shown in Figure 5. Any basic feasible solution to this LP has the property that the number of positive variables is at most the number of rows of the constraint matrix. Therefore, in any optimal basic solution, at most $\beta + 2$ variables $y_j^i$ are positive. Moreover, since every class $j$ has at least one positive variable associated with it, there exists at most one class $\nu$ such that $1 > y_\nu^{i_1}, y_\nu^{i_2} > 0$, for some $0 \le i_1, i_2 \le \eta_\nu$. Clearly, each of the remaining classes $j \neq \nu$ contains exactly one variable $y_j^{i_j^*} = 1$, while the remaining $y_j^i$'s are equal to 0.

maximize $\quad \sum_{j=0}^{\beta} \sum_{i=0}^{\eta_j} y_j^i\, \hat{\lambda}_j(i)$
subject to $\quad \sum_{j=0}^{\beta} \sum_{i=0}^{\eta_j} y_j^i\, \alpha_j(i) \le 1$
$\qquad\qquad \sum_{i=0}^{\eta_j} y_j^i = 1, \quad j = 0, \ldots, \beta$
$\qquad\qquad y_j^i \ge 0, \quad j = 0, \ldots, \beta;\ i = 0, \ldots, \eta_j$

Figure 5: A Linear Programming Relaxation for Generalized Knapsack.

Wlog. assume that $\alpha_\nu(i_1) < \alpha_\nu(i_2)$, and set $i_\nu^* = i_1$, $y_\nu^{i_1} = 1$, and $y_\nu^{i_2} = 0$. Obviously, since $\alpha_\nu(i_1) \le y_\nu^{i_1} \alpha_\nu(i_1) + y_\nu^{i_2} \alpha_\nu(i_2)$, the resulting solution is feasible. If
$$ \sum_{j=0}^{\beta} \hat{\lambda}_j(i_j^*) \ >\ \hat{\lambda}_\nu(i_2), $$
the algorithm outputs the items $i_j^*$, for each $j = 0, \ldots, \beta$. Otherwise, the algorithm only selects to include in the knapsack the maximum profit item, $\hat{\lambda}_{\max} = \max_{j,i} \{\hat{\lambda}_j(i)\}$. Clearly, this is a 2-approximation algorithm, since $\hat{\lambda}_{\max} + \sum_{j=0}^{\beta} \hat{\lambda}_j(i_j^*)$ cannot be less than the fractional optimum of the LP relaxation. $\square$

Lemma 4.7 There exists a Fully Polynomial-Time Approximation Scheme (FPTAS) for Generalized Knapsack based on a pseudo-polynomial dynamic programming exact algorithm.

Proof. The FPTAS is a generalization of the FPTAS for ordinary Knapsack. In the sequel, we follow the presentation of Section 9.3 of [5]. Let $Vol_j(\lambda)$ denote the smallest knapsack volume that yields an objective function value of exactly $\lambda$ and only involves items from the classes $\{0, \ldots, j\}$. Since all the sets $\hat{\Gamma}_j$ contain the pair $(0, 0)$, we can initialize $Vol_j(0) = 0$, for all $j = 0, \ldots, \beta$. The dynamic programming algorithm is based on the following recursive formula:
$$ Vol_j(\lambda) \ =\ \min_{0 \le i \le \eta_j} \left\{ Vol_{j-1}\!\left(\lambda - \hat{\lambda}_j(i)\right) + \alpha_j(i) \right\}, $$
which can be used for computing the values $Vol_j(\lambda)$ in increasing order of the objective values $\lambda$. The dynamic programming algorithm returns the solution corresponding to the largest value of $\lambda$ such that $Vol_\beta(\lambda) \le 1$. In particular, let $\lambda^*$ be the optimal value, and let $2\bar{\lambda}$, with $2\lambda^* \ge 2\bar{\lambda} \ge \lambda^*$, be an upper bound on the optimal objective value obtained by means of the 2-approximation algorithm of Lemma 4.6. Iteratively, for $\lambda = 1, \ldots, 2\bar{\lambda}$, we compute $Vol_j(\lambda)$, for all $j = 0, \ldots, \beta$. Let $\Gamma = \sum_{j=0}^{\beta} |\hat{\Gamma}_j| = \sum_{j=0}^{\beta} (\eta_j + 1)$. Since, for each value of $\lambda$, the computation takes $O(\Gamma)$ time, the dynamic programming algorithm needs $O(\lambda^* \Gamma)$ time for finding the optimal solution.

Then, given any constant $\delta > 0$, in order to find a solution of objective value $\tilde{\lambda} \ge \frac{\lambda^*}{1+\delta}$, we scale down the profit values to $\tilde{\lambda}_j(i) = \left\lfloor \frac{\hat{\lambda}_j(i)}{t} \right\rfloor$, where $t \ge 1$ is the largest integer not exceeding $\frac{\delta \bar{\lambda}}{(\beta+1)(1+\delta)}$. This implies that $\tilde{\lambda} \ge \lambda^* - t(\beta+1) \ge \frac{\lambda^*}{1+\delta}$, for any constant $\delta > 0$. Moreover, if $T(\bar{\lambda})$ is the time for computing a 2-approximate solution of value $\bar{\lambda}$, then the running time is bounded by $O\!\left(\frac{\lambda^* \Gamma}{t} + T(\bar{\lambda})\right) = O\!\left(\frac{(\beta+1)\Gamma}{\delta} + T(\bar{\lambda})\right)$, which is polynomial in the size of the input and $\frac{1}{\delta}$. One can also use the refinements proposed by Ibarra and Kim [6] and Lawler [11] for ordinary Knapsack in order to obtain a more efficient FPTAS. $\square$
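A minimal sketch of the pseudo-polynomial routine underlying this FPTAS (a profit-indexed `Vol` table, exact when profits are small integers; our own illustrative code, omitting the profit scaling):

```python
def generalized_knapsack(classes, capacity=1.0):
    """Pick exactly one (profit, size) pair from each class so that the
    total size is at most `capacity`, maximizing total profit.  Each
    class is assumed to contain the pair (0, 0), as in Definition 4.5."""
    INF = float("inf")
    top = sum(max(p for p, _ in cls) for cls in classes)
    vol = [0.0] + [INF] * top        # vol[p] = least size reaching profit p
    for cls in classes:
        new = [INF] * (top + 1)
        for p, v in enumerate(vol):
            if v == INF:
                continue
            for profit, size in cls:
                q = p + profit
                if q <= top and v + size < new[q]:
                    new[q] = v + size
        vol = new
    return max(p for p, v in enumerate(vol) if v <= capacity)
```

The table update is exactly the recursion for $Vol_j(\lambda)$ above, processed class by class.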

5 Assignments on Identical Speed Machines

In this section, we present NFD-LS, a polynomial-time 4-approximation algorithm for Minimum Fault-Tolerant Maximum Load on identical speed machines. Given a set $M$ of faulty, parallel, identical speed machines, a set $J$ of jobs to be processed on $M$, and a fault-tolerance constraint $(1-\epsilon)$, the NFD-LS algorithm works as follows:

1. It invokes Next Fit Decreasing (NFD) on the instance $(M, 1-\epsilon)$ to compute a $(1-\epsilon)$-fault-tolerant partition into $\kappa$ disjoint groups, $M = \{M_1, \ldots, M_\kappa\}$.
2. It invokes List Scheduling (LS) (e.g., see Section 1.1 of [5]) to compute a non-redundant schedule $\sigma'$ of the job set $J$ on $\kappa$ reliable, identical speed machines.
3. For all jobs $j \in J$, if $\sigma'(j) = l$, for some integer $1 \le l \le \kappa$, NFD-LS sets $\sigma(j) = M_l$.

The algorithm returns the redundant assignment $\sigma$. It should be clear that the reliability of the resulting assignment $\sigma$ equals the reliability of the underlying partition $M$ produced by NFD, $\Pr[\sigma] = \Pr[M] \ge 1 - \epsilon$. Additionally, since all the machines are of identical speed, the maximum load of $\sigma$ equals the makespan of the non-redundant assignment $\sigma'$ produced by List Scheduling, $L_\infty(\sigma) = \mathrm{Makespan}(\sigma')$.

The analysis of NFD-LS is based on the following technical lemma, whose proof is deferred to Section 5.1.1. This lemma states that the optimal Fault-Tolerant Maximum Load $L_\infty^*$ cannot be less than the total size $S_{tot}$ of the jobs divided by the ceiling of $IUB(M, 1-\epsilon)$, which bounds from above the optimal number of groups produced from $M$ with fault-tolerance constraint $(1-\epsilon)$.

Lemma 5.1 Given a set $M = \{f_1, \ldots, f_m\}$ of identical speed machines, $n$ unit size jobs, and a fault-tolerance constraint $(1-\epsilon)$, the optimal Fault-Tolerant Maximum Load $L_\infty^*$ cannot be less than $n\, \lceil IUB(M, 1-\epsilon) \rceil^{-1}$.

Based on Lemma 5.1 and the analyses of Next Fit Decreasing and List Scheduling, we can show that the partition assignment $\sigma$ produced by NFD-LS approximates Minimum Fault-Tolerant Maximum Load within a factor of 4.

Theorem 5.2 The redundant assignment $\sigma$, produced in polynomial time by NFD-LS, approximates Minimum Fault-Tolerant Maximum Load on identical speed machines within a factor of 4.

Proof. Given an instance $(M, J, 1-\epsilon)$ of Minimum Fault-Tolerant Maximum Load, let $\kappa + 1 = \lceil IUB(M, 1-\epsilon) \rceil$. Obviously, Lemma 5.1 implies that the optimal Fault-Tolerant Maximum Load $L_\infty^*$ cannot be less than $\max\!\left\{ \frac{S_{tot}}{\kappa+1}, s_{\max} \right\}$. If $\kappa = 1$, then the NFD algorithm on instance $(M, 1-\epsilon)$ produces at least 1 reliable effective machine, and, since List Scheduling is optimal for a single machine, we obtain that $L_\infty(\sigma) \le 2 L_\infty^*$. For $\kappa \ge 2$, the analysis of NFD implies that $NFD(M, 1-\epsilon) \ge \frac{\kappa}{2}$, and, by the analysis of List Scheduling, we obtain that
$$ L_\infty(\sigma) \ \le\ \frac{2 S_{tot}}{\kappa} + s_{\max} \ \le\ \frac{3 S_{tot}}{\kappa+1} + s_{\max} \ \le\ 4 L_\infty^*. \qquad \square $$
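Given the groups already produced by NFD, steps 2 and 3 of NFD-LS amount to List Scheduling over one virtual machine per group; a short sketch (our own code, with `groups` standing for the NFD output):

```python
def nfd_ls_schedule(groups, job_sizes):
    """List Scheduling of jobs onto len(groups) virtual machines; each job
    is then replicated on every machine of the group it was scheduled on."""
    loads = [0.0] * len(groups)
    assignment = []                     # assignment[j] = group of job j
    for s in job_sizes:
        i = min(range(len(groups)), key=loads.__getitem__)
        loads[i] += s
        assignment.append(groups[i])    # a copy of the job on every machine
    return assignment, max(loads)
```

Since all machines have identical speed, the maximum machine load of the resulting redundant assignment equals the returned makespan.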

5.1 A Lower Bound on Fault-Tolerant Maximum Load

The proof of Lemma 5.1, which bounds from below Minimum Fault-Tolerant Maximum Load on identical speed machines, is based on the following combinatorial lemma, which applies to identical machines, that is, machines of both identical speed and identical failure probability. This lemma states that the reliability of the most reliable $\kappa$-partition assignment $\sigma_\kappa$ bounds from above the reliability of any redundant assignment $\sigma$, with $L_\infty(\sigma) \le \rho$, of $\kappa\rho$ unit size jobs to at least $\kappa\rho$ identical machines.

Lemma 5.3 Given an integer $\rho \ge 1$, for any integer $\kappa \ge 1$, let $\sigma$ be any redundant assignment of $\kappa\rho$ unit size jobs to a set $M$ of identical machines, each of failure probability $f$. If each job is assigned to exactly $\rho$ machines, $|M| \ge \kappa\rho$, and $L_\infty(\sigma) \le \rho$, then
$$ \Pr[\sigma] \ \le\ (1 - f^\rho)^\kappa \ =\ \Pr[\sigma_\kappa]. $$
Proof. Clearly, the most reliable $\kappa$-partition assignment $\sigma_\kappa$ has $\Pr[\sigma_\kappa] = (1 - f^\rho)^\kappa$. Then, we show that $\sigma_\kappa$ is at least as reliable as any other assignment $\sigma$ that fulfills the hypothesis. Initially, we assume that $MF(\sigma) = \kappa$.

Next, we adopt a time-evolving representation of random machine subsets. Similar time-evolving representations of random edge subsets have been used in [12, 9] for proving lower bounds on graph reliability. We introduce non-negative time t and de ne a time-dependent random subset M (t)  M representing the set of active machines by the time t. Initially, M (0) = ;. Each machine is given an arrival time chosen independently from the exponential distribution with mean 1, and each time a machine arrives, it is included in M (t). In particular, the time ti at which machine i is attached to M (t), has a probability distribution function 1 ? e?t, and the times ti, i 2 M , are independent. Each time a machine i is included in M (t), all the copies of the jobs assigned to i are removed from the set of remaining machines M ? M (t), because, since the set M (t) represents the set of active machines, the execution of these jobs is assured by the machine i. Then, (t) denotes the assignment at time t, that is obtained from  by removing all the copies of the jobs assigned to M (t) by . Also, all the machines of M ? M (t) assigned zero load by (t) can be included in M (t), because they do not further a ect the stochastic process due to the exponential distribution of the machine arrival times. Then, m(t) = jM ? M (t)j denotes the number of machines whose arrival can alter (t), that are all the machines assigned non-zero load by (t). The random variable Tr (), r = 1; : : : ; , is de ned as the time t at which the arrival of a machine causes (t) to become an assignment of MF((t)) = r ? 1 for the very rst time. We will show that T1() stochastically dominates T1(), that is

8t  0; Pr[T1()  t]  Pr[T1(k )  t] : Clearly, this implies the lemma, because Pr[T1()  t] is the probability that  has become feasible by the time t. Additionally, let T+1() = 0, and let the random variables r() = Tr () ?

Tr+1(), r = 1; : : : ; , denote the length of time interval for which MF((t)) is equal to r. Obviously, r () is the time it takes for decreasing the cardinality of a minimum feasible set for (t) from r to r ? 1.P Moreover, since a machine arrival can decrease MF((t)) by at most one, T1 () = r=1 r (). Due to the memoryless property of the exponential distribution, the time it takes for the next machine arrival is independent of the previous machine arrivals and the time t. Therefore, once we have conditioned on the value of M (t), the time for the next arrival has probability distribution function 1 ? e?m(t)t. In particular, it is shown in [9] that, since the time of arrival has no impact on which of the machines of M ? M (t) is the rst to arrive, the time for the next arrival has the right exponential distribution, regardless of the values of M (t) we condition on. As Lomonosov observed [12], we can imagine that M (t) = M corresponds to a state  of a discrete Markov chain on 2M . From , we can only visit the states  corresponding to M = M [ fig, for some i 2 M ? M , with transition probabilities 1=m , m = jM ? M j = m(t). Then, we can rst chose a path ! of this Markov chain with transition probabilities 1=m out of each state , and then traverse this path 236

with a sojourn time in each state π exponentially distributed with mean 1/m_π. Hence, the probability distribution function D_ω(t) of the time for traversing the path ω can be obtained by the convolution of exponential distributions with mean values 1/m_0 < 1/m_1 < ··· < 1/m_{l−1}, where l is the length of the path. In particular,

D_ω(t) = Pr[τ_ω ≤ t] = Σ_{j=0}^{l−1} (1 − e^{−m_j t}) ∏_{0 ≤ i ≤ l−1, i ≠ j} m_i / (m_i − m_j).
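As a sanity check, this hypoexponential distribution function can be evaluated numerically and compared against a direct simulation of the sum of independent exponentials. The sketch below is illustrative only; the rates are arbitrary values chosen for the demonstration.

```python
import math
import random

def hypoexp_cdf(rates, t):
    """CDF of a sum of independent exponentials with distinct rates
    m_0, ..., m_{l-1}:
        D(t) = sum_j (1 - e^{-m_j t}) * prod_{i != j} m_i / (m_i - m_j)."""
    total = 0.0
    for j, mj in enumerate(rates):
        coeff = 1.0
        for i, mi in enumerate(rates):
            if i != j:
                coeff *= mi / (mi - mj)
        total += (1 - math.exp(-mj * t)) * coeff
    return total

# Rates 5 > 3 > 2, i.e. sojourn means 1/5 < 1/3 < 1/2 as in the text.
rng = random.Random(1)
rates = [5.0, 3.0, 2.0]
t = 1.0
samples = 200_000
hits = sum(sum(rng.expovariate(m) for m in rates) <= t
           for _ in range(samples))
print(abs(hits / samples - hypoexp_cdf(rates, t)) < 0.01)
```

Since the partial-fraction coefficients sum to 1, the formula tends to 1 as t → ∞, as a distribution function must.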

Therefore, the probability distribution function of each random variable τ_r(σ) is

D_r(σ, t) = Σ_ω p_ω D_ω(t),

where ω ranges over the paths (sequences of machine arrivals) that cause MF(σ(t)) to decrease from r to r − 1, and p_ω denotes the probability of choosing the path ω. Notice that the random variables τ_r(σ), r = 1, …, ρ, are mutually independent, because they are weighted sums of disjoint sets of mutually independent random variables. Additionally, it is well known that, if X_1, X_2, Y_1, Y_2 are independent random variables such that X_1 dominates Y_1 and X_2 dominates Y_2, then X_1 + X_2 also dominates Y_1 + Y_2. Therefore, in order to prove the lemma, it suffices to show that, for all r = 1, …, ρ, τ_r(σ) dominates τ_r(σ_ρ). Notice that, at some time t, the stochastic process may be in a state π such that MF(π) = r (i.e. the minimum feasible set cardinality of the assignment σ(t) corresponding to π is equal to r) and m(t) = m_π > r. The stochastic process is said to be in a bad state π if MF(π) = r and the arrival of more than r machines leads to some state π′ that is either bad or has MF(π′) = r − 1. Wlog. we can assume that a transition to a bad state π with MF(π) = r is equivalent to an immediate transition to a state π′ with MF(π′) = r − 1. This eliminates the need to consider a transition from a bad state π to another bad state π′ with MF(π) = MF(π′). The stochastic process is said to be in a good state π if MF(π) = r and the arrival of at most r machines leads to a state π′ that is either bad or has MF(π′) = r − 1. Let π be any state reached at time t for which MF(σ(t)) becomes equal to r for the very first time, and let m_π > r. Notice that there always exists a path ω = (π, π_1, …, π_x) such that MF(π) = MF(π_1) = ··· = MF(π_x) = r and m_{π_x} ≤ r. The path ω corresponds to the arrival of a machine set M_good ⊆ M − M(t), |M_good| ≥ m_π − r, that does not change MF(σ(t)), but reduces m(t) to r or less. Moreover, the removal of M_good from σ(t) causes the state π_x to be reached, regardless of the order in which the machines of M_good actually arrive.
Since π_x is a good state, all the states π, π_1, …, π_{x−1} are also good. Therefore, for all r = 1, …, ρ, all the states for which MF(σ(t)) becomes equal to r for the very first time are good ones. Let r, 1 ≤ r ≤ ρ, be any integer, and assume that some machine arrival has just caused σ(t) to be an assignment with MF(σ(t)) = r for the first time. Hence,

the corresponding state π is a good one. We then show that the random variable τ_r(σ) dominates a random variable exponentially distributed with mean 1/r. One possibility for π to be a good state is m(t) = m_π ≤ r. Then, the sojourn time in π is exponentially distributed with mean at least 1/r, and therefore τ_r(σ) dominates an exponentially distributed random variable with mean 1/r. Next we consider the case m_π > r. Since π is a good state, with probability at most r/m_π, the stochastic process moves into a state π̄ with MF(π̄) = r − 1, while with probability at least 1 − r/m_π, it enters some other good state π′. (Recall that a move into a bad state π̂ with MF(π̂) = r is also considered a move into a state π̄ with MF(π̄) = r − 1.) Additionally, regardless of the new state, the sojourn time in π is exponentially distributed with mean 1/m_π. In case the new state is a good one, this situation goes on until m_{π′} ≤ r, when wlog. we can assume that MF(σ(t)) becomes r − 1 with probability 1. Therefore, we can inductively assume that the random variable τ_r(π′), which denotes the time it takes to get from π′ to a state with minimum feasible set cardinality r − 1, dominates an exponentially distributed random variable with mean 1/r. Since

(r/m_π)(1 − e^{−m_π t}) + ((m_π − r)/m_π)(1 + (r e^{−m_π t} − m_π e^{−rt})/(m_π − r)) = 1 − e^{−rt},

τ_r(σ) also dominates an exponentially distributed random variable with mean 1/r. Therefore, the random variable τ_r(σ) dominates the random variable τ_r(σ_ρ), which is exponentially distributed with mean 1/r. Finally, let σ′ be any redundant assignment of ν unit size jobs such that σ′ assigns all the jobs to exactly λρ machines, L1(σ′) ≤ τ, and MF(σ′) > ρ. It should be clear that σ′ cannot be more reliable than σ_ρ, because the stochastic process corresponding to σ′ must also pass through a good state π with MF(π) = ρ. ⊓⊔

Remark. Lemma 5.3 also applies to the probability that isolated nodes do not appear in a (not necessarily connected) hypergraph with m "faulty" hyperedges of cardinality τ.
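The algebraic identity used in the induction step can be checked numerically. The sketch below is illustrative only: it evaluates the left-hand side, i.e. the Exp(m_π) sojourn CDF mixed with the hypoexponential Exp(m_π) + Exp(r) CDF, for a few values of m_π > r, and confirms it equals 1 − e^{−rt}.

```python
import math

def lhs(m_pi, r, t):
    """Direct transition (prob. r/m_pi) with an Exp(m_pi) sojourn, plus the
    complementary branch whose duration is Exp(m_pi) + Exp(r)."""
    direct = (r / m_pi) * (1 - math.exp(-m_pi * t))
    indirect = ((m_pi - r) / m_pi) * (
        1 + (r * math.exp(-m_pi * t) - m_pi * math.exp(-r * t)) / (m_pi - r))
    return direct + indirect

ok = all(abs(lhs(m, r, t) - (1 - math.exp(-r * t))) < 1e-9
         for m in (5, 8, 13) for r in (1, 2, 3) for t in (0.1, 0.7, 2.5))
print(ok)
```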
In particular, assume that |M| = m, that σ assigns each job to exactly λ machines, and that the load of each machine is exactly τ. Given such an assignment σ, we can construct a hypergraph H(N, E), where N consists of ν nodes corresponding to the unit size jobs, and E consists of m hyperedges corresponding to the machines of M. Moreover, each e ∈ E consists of the jobs assigned to the corresponding machine of M by σ. Therefore, the cardinality of each e ∈ E is exactly τ. Moreover, for any node v ∈ N, let deg_H(v) = |{e ∈ E(H) : v ∈ e}|. Clearly, an assignment σ is feasible for an M′ ⊆ M iff the removal of the hyperedges corresponding to M − M′ does not create any isolated nodes in H, that is, nodes of degree 0. Lemma 5.3 implies that the hypergraph corresponding to the most reliable ρ-partition assignment σ_ρ achieves the greatest probability of not having any isolated nodes under random and independent edge faults. Figure 6 depicts two different assignments of 4 unit size jobs, J = {1, 2, 3, 4}, to 6 identical machines, M = {a, b, c, d, e, f}, for ρ = 2 and τ = 2, and the corresponding graphs.

[Figure 6 here: the ρ = 2 partition assignment, a, b, c → {1, 3} and d, e, f → {2, 4}, and the "clique" assignment, a → {1, 2}, b → {1, 3}, c → {1, 4}, d → {2, 3}, e → {2, 4}, f → {3, 4}, together with the corresponding graphs.]
Figure 6: Two different assignments of 4 unit size jobs to 6 identical machines (ρ = 2). If all the machines fail with probability f, 0 < f < 1, then the failure probability of the most reliable 2-partition assignment σ_2 is 2f^3 − f^6, while the failure probability of the "clique" assignment σ is 4f^3 − 6f^5 + 3f^6. It is straightforward to verify that 4f^3 − 6f^5 + 3f^6 > 2f^3 − f^6, for all 0 < f < 1. Additionally, the probability distribution function of T_1(σ) is D(σ, t) = 1 − 3e^{−6t} + 6e^{−5t} − 4e^{−3t}, while the probability distribution function of T_1(σ_2) is D(σ_2, t) = 1 + e^{−6t} − 2e^{−3t}. ⊓⊔

The following is an immediate consequence of Lemma 5.3 and applies to the more general situation in which the total number of copies of all the ν jobs is equal to λν.
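Both failure probabilities quoted above can be verified by brute force over all 2^6 active/failed machine patterns. In the sketch below (illustrative only), machine indices 0–5 stand for a–f, a hypothetical encoding of the two Figure 6 assignments.

```python
from itertools import product

def survival_prob(assignment, num_machines, f):
    """Pr[every job keeps a copy on an active machine], summing over all
    2^m active/failed patterns; machines fail independently w.p. f."""
    total = 0.0
    for alive in product([False, True], repeat=num_machines):
        p = 1.0
        for a in alive:
            p *= (1 - f) if a else f
        if all(any(alive[i] for i in machines)
               for machines in assignment.values()):
            total += p
    return total

# The rho = 2 partition and the "clique" (K4-edge) assignment of Figure 6.
partition = {1: [0, 1, 2], 3: [0, 1, 2], 2: [3, 4, 5], 4: [3, 4, 5]}
clique = {1: [0, 1, 2], 2: [0, 3, 4], 3: [1, 3, 5], 4: [2, 4, 5]}

f = 0.3
fail_part = 1 - survival_prob(partition, 6, f)
fail_clique = 1 - survival_prob(clique, 6, f)
print(abs(fail_part - (2 * f**3 - f**6)) < 1e-9)
print(abs(fail_clique - (4 * f**3 - 6 * f**5 + 3 * f**6)) < 1e-9)
print(fail_part < fail_clique)
```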

Corollary 5.4 Given an integer λ ≥ 1, for any integer ν ≥ 1, let σ be any redundant

assignment of ν unit size jobs to a set M of identical machines of failure probability f. If each job j, j = 1, …, ν, is assigned to exactly λ_j machines so that Σ_{j=1}^{ν} λ_j = λν, |M| ≥ λρ, and L1(σ) ≤ τ, then Pr[σ] ≤ (1 − f^λ)^ρ = Pr[σ_ρ]. Proof. It is easy to verify (e.g. by the inclusion-exclusion formula) that there always exists a redundant assignment σ′ of ν unit size jobs to M, such that σ′ assigns each job to exactly λ machines, L1(σ′) = L1(σ), and Pr[σ] ≤ Pr[σ′]. ⊓⊔

5.1.1 The Proof of Lemma 5.1

We now prove Lemma 5.1, which states that the optimal Fault-Tolerant Maximum Load L1* cannot be less than the total load divided by the ceiling of IUB(M, 1 − ε). In order to handle machines of different failure probabilities, we replace each machine by an appropriate number of identical parallel machines (see also [9]). Then, we apply Corollary 5.4 to show that any (1 − ε)-fault-tolerant redundant assignment cannot use more than ⌈IUB(M, 1 − ε)⌉ effective reliable machines.

Proof. Let σ be any (1 − ε)-fault-tolerant redundant assignment of n unit size jobs to M. At first, we assume that the l-component of the upper bound IUB(M, 1 − ε), which corresponds to the number of over-reliable items, is equal to 0, and we set ρ = ⌈IUB(M, 1 − ε)⌉ = ⌈x⌉, where x is defined by

(1 − F^{1/x})^x = 1 − ε,   where F = ∏_{i=1}^{m} f_i.

Therefore, at most ρ groups can be obtained from the Fault-Tolerant Partition instance (M, 1 − ε), each of failure probability at least F^{1/x}. In order to handle different failure probabilities, we choose a sufficiently small real number δ and replace each machine i of failure probability f_i by a "bundle" of m_i = ⌈− ln f_i / δ⌉ parallel machines, each of failure probability f = 1 − δ (see also [9]). Furthermore, the jobs assigned to machine i are assigned to all m_i parallel machines. Therefore, the bundle of m_i parallel machines contains no active machine with probability (1 − δ)^{⌈− ln f_i / δ⌉}. Since this quantity converges to f_i as δ → 0, the reliability of the assignment σ̂ obtained from σ by applying this transformation tends to Pr[σ]. Corollary 5.4 implies that we only have to consider an assignment σ that assigns each job j to a set of machines M_j of reliability Pr[M_j] ≥ 1 − F^{1/x}. Therefore, σ̂ assigns each job to exactly λ machines, where λ = ⌈− ln F / (xδ)⌉. Lemma 5.3 implies that, if L1(σ) < n/ρ, then Pr[σ̂] < 1 − ε. In case l > 0, we assume that the machines are indexed in non-increasing order of reliability. Then, the failure probability of each of the first l most reliable machines is less than the failure probability F_l^{1/x_l} of the remaining x_l groups, where F_l = ∏_{i=l+1}^{m} f_i (see also the analysis of NFD for the definitions of l and x_l). Therefore, wlog. we can assume that σ assigns each job j to either exactly one of the first l most reliable machines, or a subset M_j of M_r = M − {1, …, l} of reliability Pr[M_j] ≥ 1 − F_l^{1/x_l}. If L1(σ) < n/(l + ⌈x_l⌉) = n/⌈IUB(M, 1 − ε)⌉, then the reliability of the partial assignment to the set M_r must be strictly less than (1 − ε)/P_l, where P_l = ∏_{i=1}^{l}(1 − f_i). ⊓⊔
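The convergence of the bundle's failure probability to f_i can be illustrated numerically. The sketch below is only a demonstration of the limit, with an arbitrary value f_i = 0.4 and decreasing δ.

```python
import math

def bundle_failure(f_i, delta):
    """Failure probability of the bundle replacing machine i: it consists of
    m_i = ceil(-ln(f_i) / delta) machines, each failing w.p. 1 - delta, and
    fails only if all of them fail."""
    m_i = math.ceil(-math.log(f_i) / delta)
    return (1 - delta) ** m_i

f_i = 0.4
for delta in (0.1, 0.01, 0.001):
    print(round(bundle_failure(f_i, delta), 4))
```

As δ → 0, (1 − δ)^{⌈− ln f_i / δ⌉} → e^{ln f_i} = f_i, so the printed values approach 0.4.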

6 Assignments on Related Speed Machines

In this section, we present SP-OPT, a logarithmic approximation algorithm for Minimum Fault-Tolerant Maximum Load in the case of unit size jobs and related speed machines. The redundant assignments produced by SP-OPT are based on optimal, non-redundant schedules of the unit size job set J on the related speed, reliable, effective machines obtained by the Safe Partition algorithm. In particular, given an instance (M, J, 1 − ε), SP-OPT works as follows:

1. It calls Safe Partition (SP) on the instance (M, 1 − ε) to obtain a (1 − ε)-fault-tolerant partition.

2. If SP returns a single group consisting of the first d + 1 machines, where d is the largest index such that ∏_{i=1}^{d} f_i > ε, then SP-OPT assigns all the jobs to the first d + 1 machines, i.e. for all j ∈ J, σ(j) = {(f_1, v_1), …, (f_{d+1}, v_{d+1})}, and returns σ.

3. If SP returns a partition into ρ disjoint groups, M = {M_1, …, M_ρ}, then SP-OPT computes an optimal, non-redundant schedule σ′ of J on ρ reliable, parallel, related machines, each of speed V(M_l) = min_{i∈M_l}{v_i}, l = 1, …, ρ. Then, for all jobs j ∈ J, if σ′(j) = l for some integer 1 ≤ l ≤ ρ, SP-OPT assigns σ(j) = M_l, and returns the assignment σ.

Since the partitions computed by Safe Partition are always (1 − ε)-fault-tolerant, the reliability of the resulting assignment σ cannot be less than (1 − ε). The analysis of SP-OPT is based on the analysis of Safe Partition, and on the fact that any (1 − ε)-fault-tolerant assignment must place at least one copy of every job on some machine of index greater than d.
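Step 3 can be sketched as follows. The code below is an illustrative reading of step 3 only, on a hypothetical instance: it schedules the unit size jobs non-redundantly on the group speeds V(M_l) with a greedy list schedule on completion times (which attains the optimal makespan for unit size jobs on related machines), and then replicates each job over its whole group.

```python
import heapq

def sp_opt_assign(groups, jobs):
    """Sketch of SP-OPT step 3: groups is a list of groups M_l, each a list
    of (failure probability, speed) pairs; jobs is a list of unit size jobs.
    Returns the redundant assignment sigma mapping each job to a group."""
    speeds = [min(v for (_f, v) in g) for g in groups]  # V(M_l) = min speed
    # Heap entries: (completion time if one more unit job is added, l, load).
    heap = [(1.0 / v, l, 0) for l, v in enumerate(speeds)]
    heapq.heapify(heap)
    sigma = {}
    for j in jobs:
        _t, l, load = heapq.heappop(heap)
        sigma[j] = list(groups[l])  # replicate the job on the whole group M_l
        heapq.heappush(heap, ((load + 2) / speeds[l], l, load + 1))
    return sigma

# Hypothetical instance: two groups with effective speeds 2 and 1.
groups = [[(0.1, 3), (0.2, 2)], [(0.1, 1), (0.3, 4)]]
sigma = sp_opt_assign(groups, list(range(6)))
# With speeds in ratio 2:1, four jobs land on group 0 and two on group 1.
loads = [sum(1 for s in sigma.values() if s == list(g)) for g in groups]
print(loads)
```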

Theorem 6.1 The algorithm SP-OPT is a polynomial-time 2⌈ln(m/ε)/ln(1/f_max)⌉-approximation algorithm for Minimum Fault-Tolerant Maximum Load, in case of unit size jobs and related speed machines.

Proof. Since it is straightforward that the reliability of the resulting assignment σ cannot be less than (1 − ε), we focus on the justification of the claimed approximation ratio. By the definition of d as the largest index such that ∏_{i=1}^{d} f_i > ε, if all the copies of a job j ∈ J are assigned only to the first d machines, the probability that j has a copy on some active machine is less than (1 − ε). Hence, such an assignment cannot be (1 − ε)-fault-tolerant. Therefore, if σ̂ denotes the optimal, non-redundant schedule of |J| unit size jobs on the reliable, parallel machines of speeds {v_{d+1}, …, v_m}, and Makespan(σ̂) = B ≥ |J|/Speed(d + 1) denotes the makespan of σ̂, then B is a lower bound on the optimal Fault-Tolerant Maximum Load L1*. To prove the approximation ratio, recall that the Safe Partition algorithm returns either a single group {(f_1, v_1), …, (f_{d+1}, v_{d+1})} consisting of the first d + 1 machines, or a partition M = {M_1, …, M_ρ} consisting of ρ groups, and that, in both cases, the cardinality of all the groups is at most m* = ⌈ln(m/ε)/ln(1/f_max)⌉. At first, we consider the case that v_{d+1} ≥ V(M), and SP returns a single reliable effective machine of speed v_{d+1}. Then, all the jobs assigned by σ̂ to the machines {v_{d+1}, …, v_{m*−1}} can cause a load of at most m*·B on a machine of speed v_{d+1}. Additionally, let b = ⌊m/m*⌋, and, for all j = 0, …, m* − 1, let M^{(j)} = {(f_i, v_i) ∈ M : i = y·m* + j, y = 1, …, b − 1}. Since the partition M consists of ρ ≥ b − 1 groups, the


analysis of SP implies that, for all j = 0, …, m* − 1, v_{d+1} ≥ V(M) ≥ Σ_{i∈M^{(0)}} v_i ≥ Σ_{i∈M^{(j)}} v_i. Therefore, since there exist exactly m* sets M^{(j)}, and

∪_{j=0}^{m*−1} M^{(j)} = {(f_{m*}, v_{m*}), …, (f_m, v_m)},

all the jobs scheduled by σ̂ to the machines {v_{m*}, …, v_m} can cause a load of at most m*·B on a machine of speed v_{d+1}. Hence, since v_{d+1} ≥ V(M), L1(σ) ≤ 2m*·B. Then, we consider the case that V(M) > v_{d+1} and SP returns a partition M = {M_1, …, M_ρ} of effective speeds V(M_l) = min_{i∈M_l}{v_i}. The analysis of SP implies that, for all l = 1, …, ρ, m*·V(M_l) ≥ Σ_{i∈M_{l+1}} v_i, where M_{ρ+1} = M − ∪_{l=1}^{ρ} M_l. Therefore, all the jobs assigned by σ̂ to the machines {v_i : i ∈ M_{l+1}}, which correspond to the machines of the group M_{l+1}, can cause a load of at most m*·B, if they are assigned to a machine of speed V(M_l). Additionally, since we only consider unit size jobs, for all i = d + 1, …, m_1, where m_1 = |M_1| ≤ m*, the jobs assigned by σ̂ to the machine v_i can be assigned to a set of machines of total speed equal to V(M) > v_i so as to cause a maximum load of at most B. Therefore, we can obtain a redundant assignment σ with L1(σ) ≤ 2m*·B. ⊓⊔

In the proof of Theorem 6.1, the restriction to unit size jobs is necessary only for bounding the maximum load that the jobs assigned by σ̂ to the machines d + 1, …, m_1 can cause on the machines of speeds {V(M_1), …, V(M_ρ)}, where Σ_{l=1}^{ρ} V(M_l) > v_i, for all i = d + 1, …, m_1. Even though it is possible to replace the unit size assumption by a more general one on the relation between the job sizes and the speeds v_i and V(M_l), we do not know how to completely avoid such a restriction.

The effective machine configurations computed by SCP are expected to be much more efficient than the configurations computed by SP. However, we do not know how to obtain a lower bound on L1* by relating it with either the effective speed V(M*) of an optimal partition, or the maximum value of the function Σ_{j≥0} 2^j · IUB(I_j, (1 − ε)^{ε_j}), subject to ε_j ≥ 0 and Σ_{j≥0} ε_j ≤ 1.

7 Open Problems

The first open question is whether Minimum Fault-Tolerant Maximum Load, especially in the case of related speed machines, is a complete problem for the complexity class NP^{#P[1,comp]}, which is known to include the whole Polynomial Hierarchy PH. Another direction for further research is to derive a non-trivial lower bound for Minimum Fault-Tolerant Maximum Load in the case of related speed machines. Such a lower bound might be combined with the SCP algorithm in order to obtain a constant factor approximation algorithm.

Additionally, the fault-tolerant generalizations of some fundamental graph optimization problems, such as shortest path or connectivity, have not been studied so far

under random and independent faults. In particular, the fault-tolerant generalization of connectivity is the following: given a graph G(V, E), where each edge e ∈ E fails independently with probability f_e, and a fault-tolerance constraint (1 − ε), compute the minimum (w.r.t. the number of edges) subgraph G′(V, E′), E′ ⊆ E, that remains connected with probability at least (1 − ε).
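For very small instances, this fault-tolerant connectivity problem can at least be solved by exhaustive search, which may help in forming conjectures. The sketch below (brute force over a hypothetical 4-cycle instance) enumerates edge subsets by size and evaluates their reliability exactly.

```python
from itertools import combinations, product

def connected(n, edges):
    """Union-find connectivity over vertices 0..n-1."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in range(n)}) == 1

def reliability(n, edges, fail):
    """Pr[the surviving subgraph is connected] under independent edge faults."""
    total = 0.0
    for alive in product([False, True], repeat=len(edges)):
        p = 1.0
        for e, a in zip(edges, alive):
            p *= (1 - fail[e]) if a else fail[e]
        if connected(n, [e for e, a in zip(edges, alive) if a]):
            total += p
    return total

def min_fault_tolerant_subgraph(n, edges, fail, eps):
    """Smallest E' (by edge count, brute force) whose surviving subgraph is
    connected with probability at least 1 - eps."""
    for k in range(n - 1, len(edges) + 1):
        for sub in combinations(edges, k):
            if reliability(n, list(sub), fail) >= 1 - eps:
                return list(sub)
    return None

# Hypothetical 4-cycle with uniform edge failure probability 0.1: a spanning
# tree survives w.p. 0.9^3 = 0.729, the full cycle w.p. 0.9477, so for
# eps = 0.2 all four edges are needed.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
fail = {e: 0.1 for e in edges}
print(len(min_fault_tolerant_subgraph(4, edges, fail, eps=0.2)))
```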

References

[1] S.F. Assmann, D.S. Johnson, D.J. Kleitman, and J.Y.-T. Leung (1984), "On a Dual Version of the One-Dimensional Bin Packing Problem", Journal of Algorithms 5, pp. 502–525.
[2] S.A. Cook (1971), "The Complexity of Theorem-Proving Procedures", Proc. of the 3rd ACM Symposium on Theory of Computing, pp. 151–158.
[3] M.R. Garey and D.S. Johnson (1979), Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, San Francisco.
[4] L. Gasieniec, E. Kranakis, D. Krizanc, A. Pelc (1996), "Minimizing Congestion of Layouts for ATM Networks with Faulty Links", Proc. of the 21st Mathematical Foundations of Computer Science, pp. 372–381.
[5] D.S. Hochbaum (ed.) (1997), Approximation Algorithms for NP-hard Problems, PWS Publishing.
[6] O.H. Ibarra and C.E. Kim (1975), "Fast Approximation Algorithms for the Knapsack and Sum of Subset Problems", Journal of the Association for Computing Machinery 22, pp. 463–468.
[7] B. Kalyanasundaram and K.R. Pruhs (1994), "Fault-Tolerant Scheduling", Proc. of the 26th ACM Symposium on Theory of Computing, pp. 115–124.
[8] B. Kalyanasundaram and K.R. Pruhs (1997), "Fault-Tolerant Real-Time Scheduling", Proc. of the 5th European Symposium on Algorithms, pp. 296–307.
[9] D.R. Karger (1995), "A Randomized Fully Polynomial Time Approximation Scheme for the All Terminal Network Reliability Problem", Proc. of the 27th ACM Symposium on Theory of Computing, pp. 11–17.
[10] J. Kleinberg, Y. Rabani, E. Tardos (1997), "Allocating Bandwidth for Bursty Connections", Proc. of the 29th ACM Symposium on Theory of Computing, pp. 664–673.
[11] E. Lawler (1979), "Fast Approximation Algorithms for Knapsack Problems", Mathematics of Operations Research 4, pp. 339–356.

[12] M.V. Lomonosov (1974), \Bernoulli Scheme with Closure", Problems of Information Transmission 10, pp. 73{81. [13] C.H. Papadimitriou (1994), Computational Complexity, Addison-Wesley. [14] S. Toda and O. Watanabe (1992), \Polynomial-time 1-Turing reductions from #PH to #P ", Theoretical Computer Science 100, pp. 205{221. [15] L.G. Valiant (1979), \The Complexity of Enumeration and Reliability Problems", SIAM Journal on Computing, 8(3), pp. 410{421.
