Scheduling with Task Replication on Desktop Grids: Theoretical and Experimental Analysis∗

Eduardo C. Xavier†
Institute of Computing, University of Campinas
[email protected]

Robson R. S. Peixoto
Institute of Computing, University of Campinas
[email protected]

Jefferson L. M. da Silveira
Institute of Computing, University of Campinas
[email protected]

March 12, 2015

Abstract

Our main objective in this paper is to study the possible benefits of using task replication when generating a schedule in a desktop grid. We consider the problem of constructing a schedule in an environment where machines' speeds may be unpredictable and the objective is to minimize the makespan of the schedule. Another considered objective is to minimize the Total Processor Cycle Consumption (TPCC). First we provide a theoretical study of a well known algorithm that makes use of replication: the WQRxx algorithm. We prove approximation ratios for this algorithm in different scenarios, and also show that such ratios are tight. We propose a simple interface and show how to add replication to any scheduling algorithm using this interface. We also prove approximation ratios for any algorithm that uses this interface. We then extend several well known algorithms to generate schedules with replication via the interface, and present computational simulations comparing the quality of the solutions of these algorithms with the addition of replication.

Key Words: Desktop Grid Scheduling, Approximation Algorithms, Task Replication.

1 Introduction

In this paper we consider the problem of scheduling n independent tasks in a computational desktop grid, which can be seen as a large distributed system composed of m machines where, in general, the speed of the machines is unpredictable. This is a typical scenario in a desktop grid, where resources are donated by their owners and the computational power of a processor greatly varies over time. We are interested in algorithms that make use of task replication when constructing solutions for the scheduling problem. Several heuristics for scheduling problems in large distributed systems have been proposed, and most of them rely on the existence of accurate information about resources (see [10]).

∗ This work was supported by CNPq and FAPESP.
† Corresponding Author: [email protected]. Av. Albert Einstein 1251, Institute of Computing, UNICAMP, Campinas-SP, Brazil. Fax: (+55) (19) 3521-5847


However, accurate information about the resource capacities of the machines is hard to obtain, and in most cases all that can be done is to predict this information with some degree of reliability. When one cannot rely on such predictions, the use of task replication to tolerate bad assignments seems to be a good idea, given the lack of information. The hope is that one of the replicas of the task will be executed on a fast machine.

Several replication schedulers have been proposed in the past [1, 2, 6, 8, 7]. Charlotte [2] is a Java-based grid whose scheduler provides the option to assign a task repeatedly to different resources until one of its replicas is completed. MapReduce [8] also has a replication scheduler targeting large cluster environments. The WQR algorithm [7] works in a Round-Robin fashion: all tasks are enqueued, and once a machine becomes available, a task is dequeued and assigned to it. When all tasks are assigned, the unfinished tasks can be replicated: every time a machine becomes available it receives a replica of some unfinished task to execute. When some task is finished, all its replicas are killed, except the one that finished. WQR has the option of limiting the maximum number of replicas of a task: WQR2x allows two replicas, WQR3x three, and so on (WQRxx allows an unlimited number of replicas). The work of Cirne et al. [5] compares, via simulations, the task replication scheduler WQRxx with some other algorithms that do not use replication but have some information about the machines' processing capacities. Through their comparisons, one can see that the replication algorithm WQRxx outperforms schedulers that have information about the resources' speeds, but with an increase in the resources used due to replication: the extra processing capacity used, compared to an optimal allocation, can vary from 25% to 300% depending on the scenario.

Algorithms that generate solutions with a provable solution value guarantee, also known as approximation algorithms, have also been proposed for some scheduling problems in computational grids. The list scheduling with Round Robin order Replication algorithm (hereafter denoted by RR) [12] is, to the authors' knowledge, the first approximation algorithm proposed for a desktop grid. Fujimoto and Hagihara [12], considering the TPCC metric (see Section 1.1), proved that the RR algorithm has an approximation factor of $(1 + \frac{m(\ln(m-1)+1)}{n})$ when tasks have the same size, where m is the number of machines and n is the number of tasks. It is worth noting that the RR algorithm is akin to the WQRxx algorithm, except that it considers tasks of identical length and the unfinished tasks are stored in a ring. In this work we refer to RR and WQRxx as the same algorithm, since we implemented the WQRxx algorithm using the ring structure. In [14], a 2-approximation algorithm is presented for the TPCC metric considering a uniform parallel machine whose processor speeds vary over time and are predictable. Schwiegelshohn et al. [19] presented a 5-competitive non-clairvoyant algorithm for minimizing the makespan on a grid system that consists of a large number of identical processors. In [11] Fujimoto studied the problem of scheduling a coarse-grained workflow, defined as an extension of the classical precedence constrained scheduling problem on a uniform parallel machine with processor speed fluctuation, and proved that unless P = NP the problem cannot be approximated within a factor of 1.5.
Some works considered hierarchical scheduling on computational grids. The schedule is divided into stages: in the first stage a global scheduler allocates tasks to parallel machines, and then, in the second stage, each parallel machine does a local scheduling of the assigned tasks. In [20], Tchernykh et al. considered the online problem of scheduling in grid environments in two stages. The objective is to minimize the makespan, and they prove that the proposed algorithm is (2e + 1)-competitive. Bougeret et al. [3] considered a clairvoyant hierarchical scheduling problem where jobs are first allocated to some parallel machines and then a local scheduler proceeds to schedule jobs on local processors. They considered the off-line version of the problem with the objective of minimizing the makespan, and proposed a (5/2)-approximation algorithm for it. In [18], Quezada-Pina et al. studied the online non-clairvoyant hierarchical problem considering admissible allocations, i.e., they present policies that reduce the set of machines where a job can be allocated. They present an algorithm with competitive ratio equal to 11.

Our main interest in this paper is to provide theoretical and experimental analysis of algorithms that use


replication. We are also interested in checking the possible benefits of extending well known algorithms, which were not originally designed to use replication, to make use of task replication. From the theoretical point of view we present an analysis of the WQR algorithm [5, 13], proving approximation factors for it in different scheduling scenarios. We also show that such approximation ratios are tight. We note that for the unpredictable case where tasks have the same length, Fujimoto and Hagihara [12] proved an approximation ratio for the WQRxx algorithm which is asymptotically the same as the one we prove in this paper; in this work we close the analysis by proving that this result is tight. We also propose an extension of the algorithms Sufferage [4], Min-min [16] and DFPLTF [17, 7] to make use of replication. The extension is made by using an interface that we propose. We implemented all these algorithms and then performed an extensive simulation to compare the quality of the schedules generated by them. A preliminary version of this work appeared in [21].

1.1 Notations and Definitions

In all problems we consider m machines (M1, ..., Mm) with different speeds that vary over time, and n tasks (T1, ..., Tn) to be scheduled, where the size of each task Tj is measured by its number of instructions and is denoted by Lj. For each machine Mi and time interval [t, t + 1) we denote by sit the number of instructions that this machine can execute in this time interval, and we call this the processing capacity (or processing speed) of the machine in this time interval. We say that one machine has more processing power than another when it has more expected processing capacity for each given time interval. In the case of tasks of the same size, we consider parameter-sweep applications where a set of multiple experiments must be executed, each one with a distinct set of parameters [4]. We also consider the case where coarse-grained tasks have different lengths, and we assume that communication delays are negligible. The objective functions considered are: to minimize the makespan Cmax, or to minimize the Total Processor Cycle Consumption (TPCC). The makespan is defined as the first time when all tasks finish executing in the schedule. The TPCC was proposed by Fujimoto and Hagihara [12] and is defined as:
$$\sum_{i=1}^{m} \sum_{t=0}^{\lfloor C_{max}\rfloor - 1} s_{it} \;+\; \sum_{i=1}^{m} \left(C_{max} - \lfloor C_{max}\rfloor\right) s_{i\lfloor C_{max}\rfloor}.$$

The TPCC accounts for all processing capacity of all machines until the makespan time. An example of a schedule and its TPCC is given in Figure 1.

Figure 1: There are five tasks represented by the ellipses. For each unit time interval and each machine, the value in the square represents the processing capacity. The schedule has makespan 7.5, and the total processing capacity consumed by these tasks until the makespan time is TPCC = 43. Notice that in the TPCC, 3 cycles out of 6 (on the last time interval of machine M3) are counted even though no task uses them.
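To make the TPCC definition concrete, the following minimal sketch (ours, not from [12]) computes it from a matrix of per-interval processing capacities; the function and variable names are illustrative. Applied to the instance of Figure 1 it returns 43.

```python
import math

def tpcc(speeds, makespan):
    """Total Processor Cycle Consumption of a schedule.

    speeds[i][t] = instructions machine i can execute in interval [t, t+1);
    all capacity up to the makespan is counted, including the fraction of
    the last, partially used interval.
    """
    whole = math.floor(makespan)
    frac = makespan - whole
    total = 0.0
    for row in speeds:
        total += sum(row[:whole])        # full intervals [0, floor(Cmax))
        if frac > 0:
            total += frac * row[whole]   # fractional interval up to Cmax
    return total

# The instance of Figure 1: three machines, makespan 7.5, TPCC 43.
speeds = [[3, 5, 7, 0, 0, 0, 0, 2],
          [8, 5, 1, 0, 0, 0, 0, 4],
          [5, 2, 1, 0, 0, 0, 0, 6]]
print(tpcc(speeds, 7.5))  # 43.0
```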


In the case of tasks of the same length, we use the notation Lj = L to indicate that all tasks have the same length L, where L is the number of instructions needed by each task. Using the notation α|β|γ of [15], we first consider the problem Q; sit |Lj = L|Cmax, which, as we will see, is equivalent to the Q; sit |Lj = L|TPCC problem, where α = (Q; sit) means that the machines have different speeds that vary over time, β = (Lj = L) means that all tasks have the same length L, and the last term is the objective function. We consider the predictable and unpredictable versions of these problems depending on whether it is possible to compute beforehand the machines' speeds over time or not. We also consider the versions of these problems with tasks of different lengths; in this case β = Lj.

From the theoretical point of view, we are interested in polynomial time algorithms with worst-case performance guarantees. Such algorithms are known as approximation algorithms. We say that a polynomial time algorithm A has approximation α, or is an α-approximation, if for every instance I of the problem we have A(I)/OPT(I) ≤ α, where A(I) is the objective function value of the solution returned by the algorithm and OPT(I) is the objective function value of an optimal solution. The problems considered are off-line, which means that the set of jobs to be scheduled is known a priori. An optimal solution for an instance I of the problem is denoted by OPT(I) and its objective function value is denoted by OPT.

When dealing with replication algorithms, i.e., algorithms that can run, at the same time, copies of the same task in the system, we use the term task when only one copy of a task is running in the system. When several copies of a task are running in the system, we refer to all of them as replicas of the task. For instance, the first assignment of a task is referred to as a task while there is only this copy running in the system. At the moment a new copy of it is assigned to the system, we refer to all of them as replicas of the task, including the first assignment.

1.2 Organization

In Section 2 we show that under the TPCC metric, the WQRxx algorithm is an optimal algorithm when all jobs have the same length and the speed of machines is predictable. In the unpredictable case, we show that the algorithm is a $(1 + \frac{3m/2 + m\ln(m/2)}{n})$-approximation. We note that this is asymptotically the same approximation shown in [12], but we show that this result is tight, i.e., it is the best approximation that can be proved for the algorithm. In Section 3 we study the problem where jobs have different lengths. We show that the WQRxx algorithm is an O(m)-approximation and this result is also tight when the objective is to minimize the TPCC. When the objective is to minimize the makespan we show that no approximation algorithm exists unless P=NP, even if the speed of machines is predictable. In Section 4 we present a simple interface that can be used to add replication capability to scheduling algorithms, and then prove approximation results for any algorithm using this interface. Finally, in Section 5 we compare, via simulation, the WQRxx algorithm and other well known algorithms with their extensions to use replication. We then discuss the possible benefits of the use of replication in these other algorithms.

2 Results for the problems Q; sit |Lj = L|Cmax and Q; sit |Lj = L|TPCC

In this section we consider the problem of scheduling n tasks with equal length L in m machines with processor speeds that vary over time. The objective is to construct a schedule of the tasks that minimizes the Total Processor Cycle Consumption (TPCC) or the Cmax . In Subsection 2.1 we present the WQRxx algorithm and discuss the use of the TPCC as a good metric for the total completion time of a schedule.

2.1 Background

The TPCC metric was proposed by Fujimoto and Hagihara [12]. Roughly speaking, the TPCC accounts for the total processing capacity of the machines from time 0 until the makespan time, when the last task is finished. The idea is that the longer the makespan is, the larger the TPCC is, and the larger the TPCC is, the longer the makespan is. An optimal makespan schedule is an optimal TPCC schedule, and the converse is also true.

Property 1 A schedule S is an optimal makespan schedule if and only if it is an optimal TPCC schedule.

It is not difficult to see the validity of this property, since the TPCC computes all machines' processing capacities until the time of the last executed task. Fujimoto and Hagihara [12] proposed the TPCC metric because it is hard to approximate the makespan in problems where machines' speeds vary over time. For any given value of the approximation factor α, we can construct instances of the problem where, after the optimal makespan time OPT, all machines are shut down and restarted a long time later ((α + 1)OPT, for example), so that any schedule different from the optimum will not meet the desired approximation factor. Although this is true in the general case, we will see that using replication one can compute an optimal makespan schedule when tasks have the same length and machines' speeds are predictable.

The WQRxx algorithm works as follows. First all n tasks are enqueued. When a machine becomes available, a task is dequeued and assigned to this machine. The algorithm works in this way until all tasks are assigned to some machine. At this moment there are at most m tasks running in the system, one on each machine. These unfinished tasks are then put in a data structure called a ring, and there is a head positioned at the next task to be replicated in the schedule. At this moment all tasks are already scheduled and the ring is used to generate replicas of uncompleted tasks. When a machine becomes available, the task at the head of the ring is replicated and this replica is assigned to the machine; the head then moves to the next task. This way, we may have several replicas of one task running in the system. When some replica of some task is finished, all its replicas are killed, the task is removed from the ring, and the machines that become available can receive new tasks from the ring. An example of a schedule generated by the WQRxx algorithm is given in Figure 2. It is not difficult to see that from the starting time to the completion time of the schedule generated by the WQRxx algorithm, there is no idle processor (a code sketch of this dispatch logic follows Figure 2).

Figure 2: WQRxx schedule for five tasks. After T5 is allocated to machine M3, the replication phase starts. The light ellipses correspond to replicas that are killed since another replica of the task has finished.
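The following is a minimal sketch of this dispatch logic, assuming an event-driven simulator that invokes the two callbacks below whenever a machine becomes idle or a replica completes; the class and method names are ours, not part of the WQR implementations cited above. For simplicity each task enters the ring when it is first assigned, which should produce the same round-robin dispatch order as placing the unfinished tasks in the ring once the queue empties.

```python
from collections import deque

class WQRxxDispatcher:
    """Sketch of WQRxx: FIFO assignment first, then ring replication."""

    def __init__(self, tasks):
        self.queue = deque(tasks)   # tasks never assigned yet
        self.ring = deque()         # unfinished tasks; ring[0] is the head
        self.replicas = {}          # task -> machines currently running it

    def on_machine_idle(self, machine):
        """Return the task whose replica should start on `machine`."""
        if self.queue:
            task = self.queue.popleft()
            self.ring.append(task)       # joins the ring once assigned
        elif self.ring:
            task = self.ring[0]          # task under the ring head
            self.ring.rotate(-1)         # head advances to the next task
        else:
            return None                  # everything has finished
        self.replicas.setdefault(task, set()).add(machine)
        return task

    def on_replica_finished(self, task):
        """A replica of `task` completed: report replicas to be killed."""
        self.ring.remove(task)
        return self.replicas.pop(task)
```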

2.2 The predictable case

Our main contribution in this section is to show that when the speed of machines is predictable it is easy to construct a schedule with optimal value for both objective functions Cmax and TPCC. In fact, the WQRxx algorithm produces a schedule with optimal value and can be modified to not make use of replication.


First notice that there is always an optimal schedule where no replication is needed: given an optimal schedule where replication is used, we can remove extra replicas of tasks, leaving just one copy of each task, and this does not increase the makespan.

We now provide a result about the characteristics of an optimal makespan schedule. First we define an extended optimal schedule, which has the makespan of an optimal schedule but allows replicas of tasks and, in addition, keeps all machines busy until the makespan time.

Definition 1 We define an optimal extended schedule EOPT(I) as one that satisfies these criteria:
1. Its makespan is equal to the makespan of an optimal schedule OPT(I).
2. On each machine Mi there is no idle time between two tasks assigned to it.
3. All machines are processing tasks until the optimal makespan time.

We show that every instance of the Q; sit |Lj = L|Cmax problem has an extended schedule EOPT(I).

Lemma 2 For every instance I of the Q; sit |Lj = L|Cmax problem, there is an optimal extended schedule EOPT(I).

Proof. Let OPT(I) be an optimal makespan schedule with finishing time OPT. We show how to modify OPT(I) in order to obtain EOPT(I). Suppose that there is some machine Mi in OPT(I) with idle processing time between two tasks assigned to it, i.e., after task Tj finishes there is some idle time and then Tj′ is processed. Moving Tj′ to start just after the finishing time of Tj does not increase the makespan. We can also add replicas of some jobs to the idle machines until the makespan time OPT. Let Tj be the last task finished in OPT(I). On each machine, after the point where its last task was executed in OPT(I), put replicas of Tj (of length L) until the makespan time. Notice that on some machines the last assigned replica may not be completely executed by the makespan time; in this case the remaining part of the replica (the part after time OPT) is killed. This new schedule EOPT has optimal makespan time and also has the property that from the starting time of the schedule to its completion time there is no idle processor. □

Notice that an optimal schedule for the Q; sit |Lj = L|Cmax problem can be seen as an optimal schedule for the same problem that allows tasks to be replicated. Given an optimal schedule OPT(I) for the problem with tasks of the same length, we consider each task in the schedule as a slot where any other task can be assigned, since all tasks have the same length. In the extended schedule EOPT(I), besides these slots, we have extra replicas (which may not finish L instructions by the OPT time). We call slots the tasks in EOPT(I) that finish L instructions by the OPT time, and notice that there are at least n of these.

Property 3 Let I be an instance of the problem with n tasks. Then the OPT time is the first time when n tasks finish L instructions each.

This is valid since, if at an earlier time n tasks could have finished L instructions each, then OPT would not be the optimal time. Also notice that in EOPT(I) more than n tasks can finish L instructions by the OPT time, since replication is allowed.

Lemma 4 There is a polynomial time algorithm that computes the schedule EOPT(I).

Proof. The algorithm works like the WQRxx algorithm but does not kill replicas. It keeps assigning a task of L instructions to an available machine until, for the first time, n different tasks finish executing L instructions


each. At this time, it kills the remaining tasks. To compute the schedule, the algorithm assigns n tasks and then starts the replication phase. At most m tasks are running when the replication phase starts, and clearly each unfinished task can have at most m replicas. Therefore the algorithm performs O(n + m²) assignments, which is polynomial in n and m. Let S be the schedule generated by this algorithm. To see that S is identical to EOPT(I) (i.e., the same number of replicas executing on each machine), notice that on each machine in EOPT(I), as well as in S, there is a sequence of replicas, each one executing L instructions, except perhaps the last assigned task, which may be killed due to the finishing of n tasks. So each machine has an identical scheduling in S and EOPT(I), and then both schedules must finish n different tasks by the same time. □

In Algorithm 1 we present pseudo-code for the algorithm that computes an optimal schedule for the problem Q; sit |Lj = L|Cmax in the predictable case. The input to the algorithm is a list M of machines with their processing capacities, and a set T of tasks, each one with L instructions. The algorithm uses a variable F that counts the number of replicas that finish executing, and variables ti, one for each machine Mi, that record the time when the next replica can be assigned to Mi. The algorithm creates a fake task T′ with L instructions, which is allocated to machines until, for the first time, n replicas of it finish. The algorithm starts by assigning to each machine a replica of T′. In the while loop, the algorithm keeps assigning replicas of T′ until, for the first time, n replicas of it finish executing. This time is ti∗ by the end of the while loop. The algorithm then views the replicas of T′ as slots and assigns each task to one of these n slots that finish before time ti∗.

Algorithm 1 Alg(M = (M1, ..., Mm), T = (T1, ..., Tn))
  F ← 0
  t1 ← t2 ← ... ← tm ← 0
  Create a task T′ with L instructions
  for i = 1, ..., m do
    Assign a replica of T′ to Mi
    Let t be the first time Mi executes L instructions
    ti ← t
  end for
  while True do
    i∗ ← argmin{ti : 1 ≤ i ≤ m}
    F ← F + 1
    if F = n then
      Break
    else
      Assign a replica of T′ to Mi∗
      Let t be the first time Mi∗ executes L instructions starting from ti∗
      ti∗ ← t
    end if
  end while
  Kill replicas that have not finished by time ti∗
  Consider the replicas executed by time ti∗ as slots
  for j = 1, ..., n do
    Assign Tj to the available slot with the earliest starting time
  end for
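For concreteness, here is a sketch of Algorithm 1 in Python, assuming predictable speeds given as per-interval instruction counts (as in Figure 1); the helper names are ours, and the speed vectors are assumed to be long enough for all n tasks to finish.

```python
def finish_time(speed, start, L):
    """First time a machine with per-interval capacities `speed` completes
    L instructions when starting at (possibly fractional) time `start`."""
    t, done = start, 0.0
    while True:
        cell = int(t)
        cap = speed[cell] * (cell + 1 - t)   # capacity left in [cell, cell+1)
        if done + cap >= L:
            return t + (L - done) / speed[cell]
        done += cap
        t = float(cell + 1)

def algorithm1(speeds, n, L):
    """Returns the n slots (machine, start, finish) of the schedule EOPT."""
    m = len(speeds)
    # Assign a replica of the fake task T' to every machine.
    t = [finish_time(speeds[i], 0.0, L) for i in range(m)]
    start = [0.0] * m
    slots = []
    while len(slots) < n:
        i = min(range(m), key=lambda k: t[k])   # machine finishing next
        slots.append((i, start[i], t[i]))       # this replica becomes a slot
        if len(slots) < n:                      # chain one more replica on Mi
            start[i] = t[i]
            t[i] = finish_time(speeds[i], t[i], L)
    # Replicas still running at this point are killed; each recorded slot
    # receives one of the n real tasks (any order, all lengths equal L).
    slots.sort(key=lambda s: s[1])
    return slots
```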

In the next theorem we show that if the speed of machines is predictable, then Algorithm 1 produces an optimal schedule for both the Q; sit |Lj = L|Cmax and Q; sit |Lj = L|TPCC problems.

Theorem 5 Algorithm 1 computes makespan and TPCC optimal schedules when all tasks have the same


length L and processors' speeds are predictable. Moreover, the schedule is generated in polynomial time and no replication is needed.

Proof. Let EOPT be the optimal extended schedule satisfying Definition 1. Knowing the processors' speeds, the algorithm constructs the schedule EOPT beforehand via simulation, according to Lemma 4. This is done by assigning replicas of the fake task T′ to the machines until, for the first time, n replicas of it finish. Consider each replica that finishes L instructions in EOPT as a slot where it is possible to assign any task. The algorithm then assigns to each slot in EOPT a different task. Since all tasks have the same length, the order in which they are assigned to slots does not interfere with the makespan/TPCC, as long as n different tasks are assigned. Also notice that, having computed the simulated schedule, no replication is needed. □

2.3 The unpredictable case

In this section we show that in the unpredictable case, the WQRxx algorithm generates a schedule in which, by the optimal makespan time, at most m/2 tasks may still be running. With this result we prove that the WQRxx algorithm has an approximation factor of $(1 + \frac{3m}{2n} + \frac{m\ln(m/2)}{n})$. We then show that this result is tight; the idea is to construct a class of instances where, by the optimal time, there are m/2 tasks yet to execute in the schedule generated by the WQRxx algorithm.

When processors' speeds are unpredictable, the problem of determining an optimal makespan/TPCC schedule cannot be solved deterministically. Consider the simple example of Figure 3: by the optimal makespan time OPT, we know that three tasks are completed. When task T1 finishes, tasks T2 and T3 are still executing and we have one slot on machine M1 after task T1. We know for sure that either T2 or T3 is going to finish by the makespan time, and suppose that only one of them will finish. This way, it is impossible to decide which of T2 or T3 to schedule after T1 in order to obtain an optimal schedule.

Figure 3: It is impossible to deterministically compute an optimal schedule if processors' speeds are unpredictable. In this case exactly one of T2 or T3 is going to finish by the OPT time. We have to determine which one to schedule after T1.

In the unpredictable case we can guarantee that, by the optimal makespan time, at most m/2 tasks may still be running in a schedule generated by the WQRxx algorithm.

Lemma 6 Consider the schedule S generated by the WQRxx algorithm for an instance of the unpredictable case of the problem Q; sit |Lj = L|TPCC. Then replicas of at most ⌊m/2⌋ different tasks may be running in S after the optimal makespan time OPT.

Proof. To prove this lemma, we consider two cases depending on the number of tasks n and machines m. We do not consider the case n ≤ m/2 since in this case the lemma is trivially true. The two considered cases are:


1. (m/2 < n < m): We know that by the optimal time all n tasks can be finished. Consider these n slots in this optimal schedule. Since n < m, at time t0 = 0 all tasks T1, ..., Tn are allocated and some of them, say T1, ..., Tx, are duplicated, where 1 ≤ x < m/2. All tasks are in the ring at this moment and the next task to be duplicated is task Tx+1, where the head of the ring is positioned. First of all, notice that the number of machines is m = n + x and x tasks have two replicas, hence there are (n − x) tasks with just one replica. When all n slots are wasted (by the OPT time), how many tasks can still be running? We will show that when all tasks have two replicas, at most n slots have been wasted (so this happens before the OPT time) and there will be at most m/2 tasks left to finish. If a task that already has two replicas finishes, then we waste at most two slots, and two other tasks, with one replica each, come to have two replicas running in the system. Notice that in this case one task (the one that finished) is removed from the ring and the head of the ring jumps over two tasks, since two machines become available. If a task with one replica finishes, one slot is wasted and then another task with one replica is duplicated. In this last case it is as if the head moved two positions, since the finished job is removed from the ring and another task is duplicated. Either way, for each task that finishes we move two tasks along the ring. Since there are n − x tasks with just one replica at time t0, it is necessary for ⌈(n−x)/2⌉ tasks to finish before, for the first time, all tasks have at least two replicas. The number of slots wasted up to this point is at most 2⌈(n−x)/2⌉ ≤ n − x + 1 ≤ n, so this happens before the OPT time; and since there are m machines and each remaining task then has at least two replicas, at most m/2 tasks remain to finish.

2. (n ≥ m): Let t1 be the time when the WQRxx algorithm starts the replication phase. Just before t1 there were m different tasks running, and at time t1 one of these tasks finishes. Notice that by time t1, (n − (m − 1)) different tasks have finished and all machines are busy, so the optimal schedule cannot do any better. Since all tasks have the same length, the only difference from the optimal schedule is the order in which the tasks were assigned to the slots. Then we know that there are (m − 1) slots at time t1 that are able to finish the (m − 1) remaining tasks by the optimal time. This way we reach a situation similar to the previous case, with n′ = m − 1 tasks and m machines with n′ slots available. The result follows similarly.

From (1) and (2) we conclude that there are at most m/2 unfinished tasks running after time OPT, and since the number of tasks running must be an integer, we can bound the value by ⌊m/2⌋ tasks. □

Consider the schedule generated by the WQRxx algorithm after the OPT time. From the previous result there are at most ⌊m/2⌋ tasks running. The following result is direct from ([12], Lemma 1).

Property 7 Let t be any time after the OPT time in the schedule generated by WQRxx, and let X be the set of tasks running at time t. If X has at least two tasks, then for any two tasks Tj and Tj′ in X, the difference in the number of replicas of these two tasks at time t is at most 1.


It is not hard to see that, due to the way the algorithm assigns replicas using a ring, if there is a set X of tasks running at time t, some of them will have ⌊m/|X|⌋ replicas while others will have ⌈m/|X|⌉ replicas. With these results we can prove the following Theorem 8, from which we will provide classes of instances proving the tightness of the algorithm.

Theorem 8 Let I be an instance of the unpredictable case of the problem Q; sit |Lj = L|TPCC. The TPCC of the schedule generated by the WQRxx for I is at most $OPT + L\sum_{i=1}^{m/2} \left\lceil \frac{m}{i} \right\rceil$.

Proof. Let S be the schedule generated by the algorithm for instance I, and let the i-th last task be the task whose order of completion is i-th from last in the schedule S. From Property 7, at the moment just before the i-th last task finishes, there are at most ⌈m/i⌉ replicas of it, since there are i tasks running. Notice that until the OPT time, the optimal schedule OPT(I) and the schedule S have the same TPCC, with value OPT. After the OPT time, from Lemma 6, there are at most ⌊m/2⌋ tasks running in S. From Property 7 these tasks will increase the TPCC by at most $L\sum_{i=1}^{m/2} \lceil \frac{m}{i} \rceil$ instructions. □

From Theorem 8 we obtain the following result, which is asymptotically the same as the one shown in [12].

Corollary 9 The WQRxx is a $(1 + \frac{3m}{2n} + \frac{m\ln(m/2)}{n})$-approximation algorithm for the unpredictable case of the problem Q; sit |Lj = L|TPCC.

Proof. Notice that the optimal TPCC counts at least nL instructions executed. The approximation ratio of the algorithm can be bounded by:
$$\begin{aligned}
\mathrm{WQRxx}(I)/OPT &\le \frac{OPT + L\sum_{i=1}^{m/2}\lceil m/i\rceil}{OPT}\\
&= 1 + \frac{L\sum_{i=1}^{m/2}\lceil m/i\rceil}{OPT}\\
&\le 1 + \frac{L\sum_{i=1}^{m/2}\left(m/i + 1\right)}{nL}\\
&\le 1 + \frac{m/2 + m\sum_{i=1}^{m/2} 1/i}{n}\\
&\le 1 + \frac{m/2 + m(\ln(m/2) + 1)}{n}\\
&\le 1 + \frac{3m/2 + m\ln(m/2)}{n}. \qquad \square
\end{aligned}$$

We can prove that this result is tight for the WQRxx algorithm by showing classes of instances where ⌊m/2⌋ tasks remain executing after the OPT time. Consider the example given in Figures 4 and 5. There are 9 machines, and when the last task is allocated there are exactly 9 slots available (Figure 4). The remaining time intervals not marked as slots can be assumed to have processor speed 0, from the time the n-th task is allocated until the OPT time. The execution of the WQRxx algorithm to assign the last 9 tasks occurs as in Figure 5. Notice that after the OPT time there are ⌊m/2⌋ = 4 tasks remaining to be executed ({T4, T6, T8, T9}). From now on, assume that the task Tj, which has ⌈m/nj⌉ replicas, finishes all its copies at the same time, where nj is the number of tasks running just before Tj finishes. In the example, the first such task is T4. This way we will have an additional $L\sum_{i=1}^{\lfloor m/2\rfloor} \left\lceil \frac{m}{i} \right\rceil$ instructions executed after the OPT time.


Figure 4: When the last task is allocated there are m slots available. In this figure we are only presenting the slots, i.e., they do not represent tasks. The schedule generated by the WQRxx algorithm with the assignment of tasks in these slots is presented in Figure 5.

Figure 5: The schedule generated by the WQRxx algorithm, and the last 9 tasks scheduled.

We can then conclude with the following result:

Theorem 10 The ratio given in Theorem 8 is tight.

Proof. The example in Figures 4 and 5 can be extended to any odd number m of machines. There will always be ⌊m/2⌋ tasks executing after time OPT, and each i-th last task has ⌈m/i⌉ replicas that are finished at the same time, consuming L⌈m/i⌉ instructions. This way the TPCC in the schedule computed by the algorithm WQRxx is
$$OPT + L\sum_{i=1}^{\lfloor m/2\rfloor} \left\lceil \frac{m}{i} \right\rceil. \qquad \square$$

3 Results for the problems Q; sit |Lj |Cmax and Q; sit |Lj |TPCC

In this section we consider the problem of scheduling n tasks, each task Tj with length Lj , in m machines with processor speeds that vary over time. Our main results are: we show that the problem in the predictable case when minimizing the makespan is NP-Hard and there is no approximation algorithm for it unless P=NP. When


minimizing the TPCC we show that the WQRxx is an O(m)-approximation algorithm and that this result is also tight, closing the WQRxx analysis for the problem with tasks of different sizes.

In the previous section we showed an algorithm that computes an optimal schedule for both the makespan and the TPCC when machines' speeds are predictable. Now we show that when the tasks have different lengths we cannot approximate an optimal makespan even if machines' speeds are predictable.

Theorem 11 Consider an instance of the problem Q; sit |Lj |Cmax with n tasks to be scheduled, each task Tj with size Lj. Suppose that the processors' speeds are predictable. Then unless P=NP there is no approximation algorithm for the problem.

Proof. Let I = (S, D) be an instance of the Partition Problem (PP), a well known NP-hard problem, where S = {s1, ..., sn} is a set of n natural numbers and $D = (\sum_{j=1}^{n} s_j)/2$ is a natural number. In the PP problem we have to determine if the set S can be partitioned into two sets such that the sum of the numbers in each set is D. Suppose there is an α-approximation algorithm for the Q; sit |Lj |Cmax problem. Given the instance I of the PP problem, we construct an instance I′ of the Grid-Scheduling problem Q; sit |Lj |Cmax with two processors, each one with processor speed of 1 instruction/s during the first D seconds, and 0 instructions/s from this time on until time (αD + 1), from which point the processors' speeds are set to any value. We also have n tasks, each task Tj with length Lj = sj, j = 1, ..., n. If the instance I of the PP problem admits a partition, then the optimal solution of the Grid-Scheduling problem has a makespan equal to D. On the other hand, if I does not admit a partition, then the optimal solution of the Grid-Scheduling problem has a makespan of at least (αD + 1). This way, any α-approximation algorithm can be used to decide if the PP problem admits a partition, since an α-approximation algorithm returns a solution with a value of at most αD if and only if instance I admits a partition. □

In light of the result of Theorem 11 we analyze the WQRxx algorithm only for the problem Q; sit |Lj |TPCC. We consider the version of the problem with unpredictable machines' speeds.

Theorem 12 The WQRxx algorithm is an O(m)-approximation algorithm for the problem Q; sit |Lj |TPCC even when machines' speeds are unpredictable.

Proof. Since the optimal solution executes all tasks, we have the following lower bound for OPT:
$$\sum_{j=1}^{n} L_j \le OPT.$$
When the algorithm allocates the last task it starts the replication phase. The number of extra replicas of each task is at most m − 1 and at most m − 1 tasks are replicated. Let (Tj1, ..., Tjm−1) be the replicated tasks. The number of instructions executed by the extra replicas is bounded by
$$(m-1)\sum_{i=1}^{m-1} L_{j_i} \le (m-1)\,OPT.$$
The approximation ratio of the algorithm can be bounded by
$$\frac{\sum_{j=1}^{n} L_j + (m-1)\,OPT}{OPT} \le \frac{m\,OPT}{OPT} \le m. \qquad \square$$

Theorem 13 The approximation ratio O(m) of the WQRxx algorithm for the problem Q; sit |Lj |TPCC is tight.

Proof. In order to show this result, we provide a class of instances where the ratio between the number of instructions executed by the algorithm and the OPT is Ω(m). Consider a list of n = m tasks (T1, ..., Tm) where the first m − 1 tasks have length L and the last task has length (m − 1)L. We have m machines and the optimal schedule finishes at time t1. From time 0 until t1 one machine executes (m − 1)L instructions and each one of the other machines executes L instructions. Clearly in the optimal schedule the task Tm is scheduled on the fastest machine, and all the other tasks on the remaining machines. Since the processors' speeds are unpredictable to the algorithm, it may happen that task Tm is not scheduled on the fastest machine. Consider the following example with six tasks (T1, ..., T6), assigned in this order. Task T6 has length 5L and the other ones have length L. Machine M1 is the fastest, with execution capacity of 5L instructions until time t1. Figure 6 shows the schedule generated by the algorithm on this instance. An optimal solution would finish at time t1, having T6 scheduled on the first machine and the other tasks on the other machines.

Figure 6: An example of a worst case instance.

In the general case, consider m = 2 + 2^k for k ≥ 1; assume the algorithm schedules the tasks (T1, ..., Tm) in this order; consider Tm the largest task and w.l.o.g. that M1 is the fastest machine. For m = 2 + 2^k, Tm is assigned to M1 only after the OPT time t1 (as in the example of Figure 6). To see this, notice that after T1 and T2 finish their execution on M1, a pair of tasks is always allocated, one on the now available machine M1 and the other on the machine executing the task that just finished (starting with the finishing of T2). There are 2^k tasks [T3, T4, ..., Tm] and the odd-index tasks in this list are replicated with one copy on M1. After Tm is replicated for the first time, and the next task is finished on M1, there will be 2^{k−1} tasks to be replicated in the list, and again only the odd-index tasks in this list will have a copy on M1. Since each one of the other machines can execute L instructions until t1, at most L instructions of Tm were executed on other machines. From time t1 on, the processors' speeds are set such that all replicas of Tm finish together. This way, in this schedule, the TPCC will be 2(m − 1)L until time t1, and after t1 it will execute (m − 1)L (the remaining length of Tm on M1), plus (m − 1)((m − 1)L − L) (the remaining length of Tm on the other machines) instructions. The ratio of this solution to the optimal solution satisfies
$$r \ge \frac{2(m-1)L + (m-1)L + (m-1)^2 L - (m-1)L}{2(m-1)L} = \frac{m+1}{2}. \qquad \square$$

4 Allowing Replication

In this section we propose a simple interface that can be used to add replication to any scheduling algorithm. We then prove approximation ratios for any scheduling algorithm that makes use of this interface on problems where machines' speeds are unpredictable. In the next section we will compare the performance of some well known algorithms with their versions that make use of replication.

Let A(T, M, C) be a scheduling algorithm that receives as parameters a set of tasks T to be scheduled on a set M of machines, with some initial configuration C (this configuration shows which machines are executing jobs at the moment). Every time the system has a free machine, it can make a call to the algorithm A(T, M, C), which then chooses a free machine and a task (using some rule), allocates the task on the machine, and updates T and C. To generate a complete schedule, it is only necessary to make successive calls to A until all tasks are allocated. To add replication capability to such an algorithm, one just needs to create an interface that calls A until all tasks are allocated. After that, the interface keeps track of which tasks are still executing in a list R. Every time a machine becomes free the interface just needs to call A(R, M, C). When a replicated task is finished, the interface must kill its replicas. This interface is executed until all tasks are finished. (A code sketch of this interface is given at the end of this section.)

We now show performance guarantees for any scheduling algorithm that allows replication using this method. Of course the algorithms considered do not allow processors to become idle, i.e., every time a machine becomes available an unscheduled task is assigned to it.

Theorem 14 Let A be a scheduling algorithm for the problem Q; sit |Lj = L|TPCC or Q; sit |Lj |TPCC. The version of A with replication has approximation ratio at most $\min\{1 + \frac{(m-1)^2}{n},\; m\}$ for the Q; sit |Lj = L|TPCC problem and at most m for the Q; sit |Lj |TPCC problem.

Proof. For the Q; sit |Lj = L|TPCC problem, notice that when the algorithm schedules the last task, the n tasks account for nL executed instructions. In the replication phase, each one of the (m − 1) remaining tasks will have at most m − 1 replicas. So the approximation ratio is bounded by
$$\frac{nL + (m-1)^2 L}{nL} = 1 + \frac{(m-1)^2}{n}.$$
To prove the approximation bound m for both problems, we can use a proof similar to the one in Theorem 12. □
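The promised sketch of the interface follows. It is ours, not a published implementation: the class and callback names are illustrative, A(T, M, C) follows the signature described above (and is assumed here to return the task it chose), and the simulator driving the callbacks is assumed.

```python
class ReplicationInterface:
    """Wraps any scheduler A(T, M, C) so that, once every task has been
    assigned at least once, idle machines receive replicas of the tasks
    still running (kept in the list R)."""

    def __init__(self, A, tasks, machines, config):
        self.A = A
        self.M, self.C = machines, config
        self.T = list(tasks)    # tasks never assigned so far
        self.R = []             # tasks assigned but not yet finished

    def on_machine_idle(self):
        if self.T:
            task = self.A(self.T, self.M, self.C)  # A removes the task from T
            self.R.append(task)
        elif self.R:
            self.A(self.R, self.M, self.C)         # replicate a running task
        # otherwise all tasks have finished and the machine stays idle

    def on_task_finished(self, task, kill):
        """Called when any replica of `task` completes; the remaining
        replicas of it are killed by the simulator via `kill`."""
        self.R.remove(task)
        kill(task)
```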

5 Experimental Comparison of Algorithms

In this section we present an experimental comparison among some scheduling algorithms for the unpredictable version of the problem Q; sit |Lj |Cmax. Our main interest is in investigating the quality of the solutions generated by the algorithms according to the makespan. We do not present results for the TPCC since it is proportional to the makespan (see Property 1). We extend some well known scheduling algorithms so that they use replication with the interface proposed in Section 4. In previous works [5, 13] the scheduling algorithms WorkQueue [5], Sufferage [4], Min-min [16] and DFPLTF [17, 7] were compared with WQR and WQRxx; those studies showed that the WQRxx algorithm generally obtained the best results. In this work we implemented this set of algorithms, but now, for each algorithm, we also considered its version that allows replication using the approach of Section 4. Are these algorithms, extended with replication, going to produce schedules of better quality than the ones produced by WQRxx? In this section we compare these algorithms via simulation and discuss the results.


For each non-replication algorithm we append the suffix -Rxx to its name to denote its replication version, in which an unlimited number of copies may be scheduled. We assume that every time a machine becomes available, it calls the interface reporting its available processing capacity at that specific moment, so that dynamic algorithms such as DFPLTF (and its replication version DFPLTF-Rxx) can use this information when computing the next task to be scheduled.

The first step taken by both Sufferage and Min-min is to compute the Minimum Completion Time (MCT) for each task to be scheduled. The current processor speed of each machine is used to predict the completion time of a job: each job's completion time is computed on each machine, and the minimum of these times is its MCT. Min-min proceeds as follows: every time a set of machines is available, the algorithm computes the MCTs for the tasks remaining to be scheduled, and then assigns the task with minimum MCT to the corresponding machine that provides this MCT. Sufferage works as follows: every time a set of machines is available, the algorithm computes, for each remaining task, its best MCT and its second best MCT. The difference between the second best and the best MCT is the sufferage of the task. The algorithm assigns the task with the largest sufferage to the corresponding machine that provides the best MCT. DFPLTF first sorts the tasks by non-increasing length; then, every time a set of machines becomes available, the largest task remaining to be scheduled is assigned to the fastest machine considering its current processor speed.

The authors of [13] performed simulations using uniform distribution values U[a, b] for both tasks' sizes and machines' capacities. In our simulations we used the approach of [13] to create instances, and we also extended it. In our simulations each machine Mi is a vector where each position sit corresponds to a time interval [t, t + 1) and gives the number of instructions that machine Mi can execute in that interval. Each task Tj has Lj instructions to be executed. We denote by U[a, b] the uniform distribution between the integers a < b, and by Zipf[a, b] the Zipf distribution where a is the element with the largest frequency and b the element with the lowest frequency. For instance, in Zipf[5, 30] the element 5 has frequency 25.94% while the element 30 has frequency 0.997%, i.e., if we generate 100 elements using this distribution, about 25 of them will be the number 5 and about 1 will be the number 30.

We generated 15 different test environments that combine 5 ways of creating tasks (Task Variability (TV)) with 3 ways of creating machines (Machine Variability (MV)). Given a number of tasks to be created, they are created in one of the following ways (a generation sketch follows the list): (1) Zipf[5, 30]: uses the Zipf distribution where the majority of tasks have a small size (beginning with 5) and few tasks have large sizes. (2) Zipf[30, 5]: the opposite of the previous one, where the majority of tasks are large and a few are small. (3) U[1, 30]: all tasks' sizes are chosen from this uniform distribution. (4) 20%U[1, 15]: 20% of the tasks have their size chosen from U[1, 15] while 80% are chosen from U[15, 30]. (5) 80%U[1, 15]: 80% of the tasks have their size chosen from U[1, 15] while 20% are chosen from U[15, 30].
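As an illustration of the task-generation schemes above, here is a sketch in Python (ours; names illustrative). The Zipf weights reproduce the frequencies quoted in the text: in Zipf[5, 30] the size 5 comes out with probability about 25.94% and the size 30 with about 0.997%.

```python
import random

def zipf_sizes(k, a=5, b=30):
    """k task sizes from Zipf[a, b]: `a` is the most frequent value,
    `b` the least frequent (the value of rank r gets weight 1/r)."""
    step = 1 if a <= b else -1
    support = list(range(a, b + step, step))
    weights = [1.0 / r for r in range(1, len(support) + 1)]
    return random.choices(support, weights=weights, k=k)

def mixed_sizes(k, p_small=0.8):
    """TV scheme 80%U[1,15]: a p_small fraction from U[1,15], the rest
    from U[15,30] (use p_small=0.2 for the 20%U[1,15] scheme)."""
    return [random.randint(1, 15) if random.random() < p_small
            else random.randint(15, 30) for _ in range(k)]
```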
Given a number of machines to be created, they are created in one of the following ways: (1) U[1, 1]: all machines have the same processor power; for each time interval [t, t + 1) the processing capacity is sit = U[0, 5]. (2) U[1, 4]: in this case one machine can be up to four times faster than the slowest machine; for each machine Mi first select a speed multiplier Xi = U[1, 4] and then for each time interval compute sit = Xi × U[0, 5]. (3) U[1, 8]: similar to the previous one, but now one machine can be up to eight times faster than the slowest machine; for each machine Mi first select a speed multiplier Xi = U[1, 8] and then for each time interval compute sit = Xi × U[0, 5].

All environments were constructed in such a way that we can assess the impact of machines' variability and tasks' variability on the performance of the generated schedules. For each one of the possible environments we performed tests with 16, 64 and 256 machines. Given a number of machines, the tests were performed with


the number of tasks varying from 1 to 6 times the number of machines, i.e., for 16 machines we performed tests with 16, 32, ..., 96 tasks. For each given environment, number of machines and number of tasks, we performed 20 simulations, totaling 43200 simulations. However, only the main results are discussed.

When presenting the results we provide graphs showing the performance profiles [9] of the algorithms. This method of comparing optimization software and algorithms was proposed by Dolan and Moré in [9]. See, for example, the graph on the right in Figure 7. The intuitive idea is to compare the algorithms among themselves, computing for each instance of the problem the ratio between the makespan of one algorithm and the best achieved makespan for the instance. We compute all such ratios for all instances. If, for example, some algorithm has a ratio equal to 1 for some instance, this means that this algorithm obtained the best result for this instance among all algorithms. With the ratios computed for all instances, we can generate a graph such as the one in Figure 7 (right). On the x axis we have all possible values of ratios. For any given ratio, the y axis gives the percentage of instances that an algorithm could solve such that the ratio of its makespan to the best one is at most the given ratio. For more information about performance profiles see [9]; a code sketch of this computation is given after Table 1.

In Table 1, for each type of running environment, we show the algorithm that achieved the best makespan for the corresponding environment (the results are the average of all results for that environment). The waste in this table is defined as the percentage of processing capacity used by replica tasks that are killed when one version of the task finishes. The Improv. field shows the percentage gain in the makespan obtained when using the replication algorithm compared to its no-replication version. We use MV and TV to denote Machine and Task Variability respectively. We can see that the best overall algorithm is the WQRxx, and that the replication version of this algorithm makes an improvement of up to 56% in the makespan when compared with WQ. From Table 1 (also from Figure 7) we can also see that DFPLTF-Rxx was the best algorithm when the machine variability is U[1, 1]. We notice that in the cases of high MV all algorithms with replication have a large gain compared to their no-replication versions, but the only algorithm that comes close to WQRxx is DFPLTF-Rxx, as one can see in Figure 9 where the machine variability is U[1, 8].

Table 1: The best algorithm for each specific running environment. Makespan improvement and cycle consumption waste compared to the no-replication version.

Environment (MV, TV)  | Best Algorithm | Cmax  | Cmax no-replic. | Improv. in % | Waste in %
----------------------|----------------|-------|-----------------|--------------|-----------
U[1,1], 20%U[1,15]    | DFPLTF-Rxx     | 33.23 | 33.4            | 0.51         | 17.9
U[1,1], 80%U[1,15]    | DFPLTF-Rxx     | 20.13 | 20.48           | 1.72         | 24.26
U[1,1], U[1,30]       | DFPLTF-Rxx     | 25.99 | 26.29           | 1.16         | 19.38
U[1,1], Zipf[30,5]    | DFPLTF-Rxx     | 39.39 | 39.5            | 0.27         | 16.27
U[1,1], Zipf[5,30]    | DFPLTF-Rxx     | 19.12 | 19.53           | 2.1          | 26.17
U[1,4], 20%U[1,15]    | WQRxx          | 14.89 | 20.97           | 29           | 26.34
U[1,4], 80%U[1,15]    | WQRxx          | 9.47  | 15.29           | 38.08        | 35.09
U[1,4], U[1,30]       | WQRxx          | 12.15 | 18.2            | 33.24        | 30.54
U[1,4], Zipf[30,5]    | WQRxx          | 17.62 | 23.78           | 25.91        | 24.41
U[1,4], Zipf[5,30]    | WQRxx          | 8.88  | 14.49           | 38.69        | 35.32
U[1,8], 20%U[1,15]    | WQRxx          | 8.15  | 15.16           | 46.21        | 25.7
U[1,8], 80%U[1,15]    | WQRxx          | 5.11  | 11.48           | 55.47        | 32.29
U[1,8], U[1,30]       | WQRxx          | 6.6   | 13.43           | 50.81        | 29.09
U[1,8], Zipf[30,5]    | WQRxx          | 9.63  | 16.94           | 43.17        | 23.8
U[1,8], Zipf[5,30]    | WQRxx          | 4.86  | 11.07           | 56.08        | 32.68
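As promised above, a minimal sketch of the performance-profile computation (ours; plotting omitted): for each algorithm it returns the sorted ratios of its makespan to the best makespan per instance, whose empirical cumulative distribution is what the performance-profile graphs plot.

```python
def performance_profiles(makespans):
    """makespans[alg][i] is the makespan of algorithm `alg` on instance i.
    Returns alg -> sorted ratios to the per-instance best makespan."""
    algs = list(makespans)
    n = len(makespans[algs[0]])
    best = [min(makespans[a][i] for a in algs) for i in range(n)]
    return {a: sorted(makespans[a][i] / best[i] for i in range(n))
            for a in algs}
```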

In Figures 7, 8 and 9 we compare all algorithms with and without replication. In each figure, the graph on the left shows, for each machine and task variability, the average makespan over all instances with that configuration; the graph on the right presents the performance profiles. We can see from these figures that replication significantly reduces the makespan when compared to the non-replication algorithms. For example, in the performance profile of Figure 7 we can see that DFPLTF-Rxx found the best solution in about 90% of the instances, while DFPLTF found the best solution in only about 60% of them. In general, the largest gains


were obtained by the WQRxx algorithm, with an average gain in makespan of 28.86% when compared to its non-replication version. The other algorithms had approximately the same gains: the Min-min-Rxx achieved 12.21% on average, DFPLTF-Rxx had an average gain of 12.19%, and Sufferage-Rxx had 11.96%. The Coefficient of Variation (the ratio of the standard deviation to the mean) over all scenarios was similar for all algorithms: 0.378 for the WQRxx, 0.352 for the Min-min-Rxx, 0.348 for the Sufferage-Rxx and 0.337 for the DFPLTF-Rxx.

Figure 7: Makespan of each algorithm with and without replication. Machine Variability U [1, 1]. The graph to the left contains the average makespan for each given TV. The graph to the right presents the performance profiles of the algorithms considering all TV together.

Figure 8: Makespan of each algorithm with and without replication. Machine Variability U[1, 4]. The graph to the left contains the average makespan for each given TV. The graph to the right presents the performance profiles of the algorithms considering all TV together.

From the results we can see that MV and TV are the factors that most impact the makespan gain when using replication. Considering all algorithms, the average gain over all environments with MV = U[1, 8] was 31.85%, with MV = U[1, 4] it was 15.85%, and with MV = U[1, 1] it was 1.26% (see Figures 7, 8 and 9). The gains for the different task variabilities were: Zipf[5, 30], a gain of 19.57%; 80%U[1, 15], a gain of 19.09%; U[1, 30], a gain of 16.14%; 20%U[1, 15], a gain of 13.96%; and Zipf[30, 5], a gain of 12.77%.


Figure 9: Makespan of each algorithm with and without replication. Machine Variability U[1, 8]. The graph to the left contains the average makespan for each given TV. The graph to the right presents the performance profiles of the algorithms considering all TV together.

Considering all algorithms, a homogeneous environment U[1, 1] brought a gain from 0.02% to 4.9%, while wasting from 16.27% to 42.80% of the processing capacity. This shows that replication may not be a good approach when scheduling in homogeneous machine environments. From our simulation results we could verify that the best overall algorithm is WQRxx. DFPLTF-Rxx obtained good results due to replication, although not as good as those of WQRxx. On the other hand, the Min-min-Rxx algorithm always had the worst results. As we can see, the extended versions of the algorithms Min-min and Sufferage, without a limit on the number of replicas, are not competitive with the WQRxx algorithm. The DFPLTF-Rxx is competitive with the WQRxx algorithm only in environments where processors' speeds are similar, and it has low gains when compared to DFPLTF. However, as we are going to see in the next section, when limiting the number of replicas, different results are obtained.

5.1 Limiting the number of Copies

We ran the same experiments but limited the number of extra replicas of each task, aiming to compare the gains with the -Rxx versions. The maximum number of replicas used was l ∈ {2, 4, 8}; in each algorithm's name, -Rlx indicates the maximum number of replicas, so, for example, Min-min-R2x uses at most two replicas. We compare all possible configurations of each algorithm: without replication, with a limited number of replicas, and with an unbounded number of extra replicas. In Figures 10 and 11 we present results for each algorithm considering its versions with bounded and unbounded numbers of replicas. We can see that for the WQ algorithm, the best version is the one that considers an unlimited number of replicas. For the other algorithms, the best approach is to use a bounded number of replicas, and the best results are generally achieved with 8 replicas. Considering the best version of each algorithm in comparison to the no-replication version, the average gain in the makespan was 44.75% for MV = U[1, 8], 26.24% for MV = U[1, 4] and 2.77% for MV = U[1, 1]. From these results we can see that the algorithms with a limited number of replicas achieved the best results, except for the WQR algorithm. This happens because the algorithms DFPLTF-Rxx, Min-min-Rxx and Sufferage-Rxx do not have any control over the number of active replicas during the replication phase, and so they basically create a lot of replicas of the same task, which are executed in parallel. The WQRxx algorithm imposes the ring structure to organize the remaining tasks, which controls the number of replicas of each task.
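Under the interface sketch of Section 4, such a bound l can be enforced by filtering the list R before calling A; a minimal illustration (ours), where replica_count is a hypothetical bookkeeping map maintained by the interface:

```python
def replicable(R, replica_count, limit):
    """Tasks of R that may still receive a replica under the bound `limit`."""
    return [t for t in R if replica_count[t] < limit]
```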


Figure 10: Performance profiles of the WQ and Min-min algorithms with a limited number of replicas.

Figure 11: Performance profiles of the DFPLTF and Sufferage algorithms with a limited number of replicas.


In Figure 12 we present the performance profiles of all algorithms with the number of replicas limited to 8, except for WQ, for which we used its best version, WQRxx. From this figure we can see that, in general, the best algorithm is DFPLTF-R8x, which solves approximately 80% of the instances with the minimum makespan among the algorithms. Table 2 is similar to Table 1, with the results separated by Task Variability (TV) and Machine Variability (MV). We can see that the DFPLTF-R4x algorithm was the best in 60% of the scenarios, and that DFPLTF-R4x and DFPLTF-R8x together were the best in 80% of the scenarios. It is worth noting that in some cases the version of an algorithm with a bounded number of replicas improves by up to 12% over the version with an unbounded number of replicas.

Figure 12: Performance profiles of all algorithms with the number of replicas limited to 8, together with WQRxx.

Table 2: The best algorithm for each running environment, considering versions with a bounded number of replicas. For each best algorithm, the makespan improvement and the cycle-consumption waste relative to the no-replication and unbounded-replicas (Rxx) versions are presented.

Environment (MV, TV)    Best algorithm   Cmax    Cmax no-replic.   Cmax Rxx   Improv. %   Improv. Rxx %   Waste %
U[1,1], 20%U[1,15]      DFPLTF-R4x       32.66   33.4              33.23      2.23        1.73            7.79
U[1,1], 80%U[1,15]      DFPLTF-R4x       19.45   20.48             20.13      5.05        3.38            9.10
U[1,1], U[1,30]         DFPLTF-R4x       25.27   26.29             25.99      3.90        2.78            7.86
U[1,1], Zipf[30,5]      DFPLTF-R4x       39.39   39.5              39.39      1.31        1.04            7.13
U[1,1], Zipf[5,30]      DFPLTF-R4x       18.37   19.53             19.12      5.91        3.9             10.44
U[1,4], 20%U[1,15]      DFPLTF-R4x       14.42   17.56             16.31      18.21       11.5            12.36
U[1,4], 80%U[1,15]      DFPLTF-R4x       8.42    10.62             9.56       20.69       11.87           15.27
U[1,4], U[1,30]         DFPLTF-R4x       11.49   14.89             12.99      18.21       11.5            12.36
U[1,4], Zipf[30,5]      DFPLTF-R4x       17.4    21.03             19.68      17.28       11.6            13
U[1,4], Zipf[5,30]      DFPLTF-R8x       7.93    9.99              9.04       20.58       12.22           12.21
U[1,8], 20%U[1,15]      WQRxx            8.15    15.16             8.15       46.21       0               25.7
U[1,8], 80%U[1,15]      DFPLTF-R8x       4.93    8.47              5.45       41.76       9.56            15.48
U[1,8], U[1,30]         WQRxx            6.6     13.43             6.6        50.81       0               29.09
U[1,8], Zipf[30,5]      WQRxx            9.63    16.94             9.63       43.17       0               23.8
U[1,8], Zipf[5,30]      DFPLTF-R8x       4.63    8.22              5.05       43.64       8.23            16.92

We can see that the best overall algorithms are DFPLTF-R4x and DFPLTF-R8x, even though WQRxx also achieved good results. It is worth noting that in non-uniform scenarios WQRxx obtained better results than DFPLTF-Rxx, but once the number of replicas is limited, DFPLTF-R8x and DFPLTF-R4x are even better than WQRxx (see Tables 1 and 2). DFPLTF is an algorithm that assigns the largest tasks to the fastest machines, so it has some knowledge of the processors' speeds, acquired whenever a task finishes: the system can inform the scheduler of the current speed of a processor when it completes a task, although of course this speed can vary over time (for instance, a machine can become very slow just after finishing a task). Since WQRxx has no knowledge of the machines' speeds and makes no differentiation among jobs, the idea of scheduling the largest jobs to the fastest machines brings better results, but only if we limit the number of replicas; otherwise several processors end up busy running replicas of the same large task.
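As an illustration, the core DFPLTF assignment rule can be sketched as follows (a simplified, hypothetical rendering: the speed estimates would be refreshed by the system whenever a machine finishes a task, and the -Rlx variants reapply the same rule to unfinished tasks while respecting the replica cap):

def dfpltf_assign(pending_lengths, idle_machines, speed_estimate):
    # Largest remaining task goes to the fastest idle machine; a machine
    # with no observed speed yet is treated as slowest (estimate 0).
    tasks = sorted(pending_lengths, key=pending_lengths.get, reverse=True)
    machines = sorted(idle_machines,
                      key=lambda mch: speed_estimate.get(mch, 0.0),
                      reverse=True)
    return list(zip(machines, tasks))

# Hypothetical usage:
# dfpltf_assign({"a": 90, "b": 10}, ["m1", "m2"], {"m1": 2.0, "m2": 8.0})
# returns [("m2", "a"), ("m1", "b")]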

6 Conclusions

In this work we presented approximation algorithms for desktop grid scheduling problems. For the problem Q; s_it | L_j = L | TPCC, where machines' speeds are not predictable, we show that the WQRxx algorithm generates schedules in which at most ⌈m/2⌉ tasks remain to be scheduled after the optimal makespan time. With this result, we can show that the WQRxx algorithm is a (1 + 3m/(2n) + m log(m/2)/n)-approximation algorithm for the problem. We also show that this approximation ratio is tight. This is the first work to propose approximation algorithms for the problem Q; s_it | L_j | TPCC, where tasks have different lengths and machines' speeds vary over time. For this problem, we prove that the WQRxx algorithm is an O(m)-approximation algorithm, and that the approximation ratio is tight. Considering the problem Q; s_it | L_j | TPCC, we also proposed an interface that can be used to extend non-replication algorithms to generate schedules with replication. We proved that any non-replication algorithm using this interface generates schedules with cost at most min{1 + (m−1)²/n, m} times the cost of an optimal solution. We performed an extensive simulation to verify the possible benefits of adding replication to several well-known non-replication algorithms. In previous studies, WQRxx performed better than algorithms without replication, so we compared the WQRxx algorithm with algorithms extended to generate schedules with replication, to check whether WQRxx was still the best algorithm. We concluded that the best overall algorithm is DFPLTF-R8x, but WQRxx also performed very well in our simulations and is a good alternative when information about processors' speeds is unavailable, since DFPLTF-R8x requires information about machines' speeds. For the algorithms using replication, we could see that the largest gains occur when scheduling in non-homogeneous environments (high machine and task variability).
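To give a feeling for the interface guarantee, the bound min{1 + (m−1)²/n, m} can be evaluated for sample values of m and n (the numbers below are arbitrary illustrations, not simulation results):

def interface_bound(m, n):
    # Worst-case cost ratio proved for any non-replication algorithm
    # extended through the replication interface.
    return min(1 + (m - 1) ** 2 / n, m)

print(interface_bound(10, 20))    # min{1 + 81/20, 10} = 5.05: few tasks per machine
print(interface_bound(10, 1000))  # min{1 + 81/1000, 10} = 1.081: many tasks, near-optimal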
