A hierarchical approach for energy-efficient scheduling of large workloads in multicore distributed systems

Bernabé Dorronsoro (a), Sergio Nesmachnow (b), Javid Taheri (c), Albert Y. Zomaya (c), El-Ghazali Talbi (a), Pascal Bouvry (d)

(a) University of Lille 1, France
(b) Universidad de la República, Uruguay
(c) The University of Sydney, Australia
(d) University of Luxembourg, Luxembourg
Abstract

This article presents a two-level strategy for scheduling large workloads of parallel applications in multicore distributed systems, taking into account the minimization of both the total computation time and the energy consumption of solutions. Nowadays, energy efficiency is of major concern when using large computing systems such as cluster, grid, or cloud computing facilities. In the approach proposed in this article, a combination of higher-level (i.e., between distributed systems) and lower-level (i.e., within each data-center) schedulers is studied for finding efficient mappings of workflows onto resources, in order to maximize the quality of service while reducing the energy required to compute them. The experimental evaluation demonstrates that accurate schedules are computed by using combined list scheduling heuristics (accounting for both problem objectives) in the higher level, and ad-hoc scheduling techniques that take advantage of multicore infrastructures in the lower level. Solutions are also evaluated with two user- and administrator-oriented metrics. Significant improvements are reported on the two problem objectives when compared with traditional round-robin and load-balancing techniques.

Keywords: energy efficiency; workflows; multicore; scheduling heuristics

Preprint submitted to Sustainable Computing: Informatics and Systems, June 18, 2014

1. Introduction

Nowadays, data-center facilities typically host a large number of computational resources. They usually include high performance clusters, large
storage systems, and/or components of large grid or cloud systems. In any case, they are composed of highly powerful computational resources (racks of computers, storage disks, routers, etc.) with increasing energy demands. Therefore, energy consumption has become a major concern in data-centers [1]. Different techniques aim at reducing the energy consumption in data-centers [2, 3], from low-level hardware solutions to high-level software methods, depending on the kind of infrastructure considered. All existing sustainable (i.e., energy-aware) techniques are in conflict with the performance of the system, which can be quantified in terms of traditional scheduling metrics such as the makespan (i.e., the finishing time of the last scheduled task) or other Quality of Service (QoS) metrics. This means that, for an optimal solution, increasing performance leads to an increase in energy consumption, while lowering the latter involves worsening the former. This is the reason why multi-objective formulations are needed to accurately capture all the features of the data-center planning problem. In this work, we focus on finding appropriate mappings of workflows onto a set of available computing resources, in order to reduce the energy required to compute the tasks while also maximizing the quality of service. The platform we target is a distributed data-center, composed of a number of clusters that might be geographically distributed. Specifically, we target the reduction of the execution time and energy consumption demanded to run very large sets of workflows in large-scale computing centers, typically composed of several dedicated clusters. This is the architecture of modern high performance and distributed computing systems, including big supercomputers, high performance computing centers, and cloud infrastructures, among others.
In such systems, the energy consumption of the processors is a major concern, and reducing it also allows lowering the cooling system's operational cost [4]. The proposed strategy consists in using a hierarchical two-level approach for scheduling static batches with a large number of workloads composed of tasks with dependencies (characterized as DAGs) on an infrastructure that gathers large distributed data-centers, composed of a heterogeneous set of clusters of multi-core processors. The higher-level scheduler decides the mapping between jobs and data-centers, while the lower-level scheduling methods are applied to schedule each job within each data-center. We target the minimization of the makespan, the global energy consumption, and the
penalizations due to overdue deadlines associated with the submitted jobs. The main contributions of this work are: (1) defining and tackling a novel multi-objective problem for the energy-efficient scheduling of large sets of workflows in distributed data-centers; (2) designing a hierarchical two-level scheduler that divides the problem into a number of simpler and smaller sub-problems; and (3) evaluating and comparing 16 different variants of the scheduler, using ad-hoc heuristics that combine both the makespan and the energy consumption of solutions. The experimental evaluation demonstrates that accurate solutions are computed by the best performing schedulers, allowing the planner to achieve improvements of up to 46.8% in makespan and 29.0% in energy consumption over a typical round-robin-based strategy. We propose a static solution to find the appropriate mapping, but it is suitable for real scenarios thanks to the extreme speed of our optimizers (they take around tenths of a second). Additionally, the proposed schedulers can easily be extended to the dynamic case by implementing some dynamic heuristic in the higher level [5].

The paper is structured as follows. The next section reviews the main related work on energy-aware scheduling in data-centers. Section 3 presents the definition and formulation of the proposed multi-objective scheduling problem. The software methods designed to tackle the problem are described in Section 4. The experimental analysis is reported in Section 5, including an exhaustive analysis of the numerical results for the proposed schedulers when solving a large benchmark set of problem instances composed of 1000 jobs with up to 132 tasks. Finally, Section 6 summarizes the conclusions of the research and proposes some lines for future work.

2. Related work

There are many works in the literature dealing with multi-objective scheduling problems for computational grids, targeting different performance objectives such as makespan and flowtime [6, 7], considering the robustness of the solutions to optimize [8, 9], or targeting the minimization of the cost of the schedules (i.e., data transferring and processing services) [10, 11, 12, 13], to name a few. Energy efficiency is another important focus of multi-objective scheduling in recent approaches. Two main optimisation strategies are established
for energy-aware scheduling: independent and simultaneous. In independent approaches, a best-effort scheduling algorithm is either combined with or followed by a slack reclamation technique. In these techniques, because energy and performance constraints are assumed independent, new or existing scheduling algorithms (usually aimed at optimising performance) are adapted to become energy-efficient too [14, 15]. The Maximum-Minimum-Frequency DVFS (MMF-DVFS) is among the most efficient algorithms for slack-reclamation-based execution of tasks. Its authors proposed an approach to reclaim slack times of tasks through a linear combination of the processor's highest and lowest frequencies, and they mathematically proved which combination of frequencies should be used to minimize the energy consumption of executing a task through slack reclamation [16]. In simultaneous approaches, on the other hand, both QoS metrics and energy saving considerations are targeted at the same time. In this case, the problem is modelled as a multi-constrained, bi-objective optimisation problem where the goal is to find Pareto optimal schedules; i.e., schedules that no other scheduling decision strictly dominates with both lower makespan and lower energy consumption. Khan and Ahmad [17] deployed the concept of the Nash Bargaining Solution from cooperative game theory to schedule independent jobs on a grid to simultaneously minimise makespan and energy; they assumed machines were DVS-enabled. Lee and Zomaya [18] studied a set of DVS-based heuristics to minimise the weighted sum of makespan and energy; the heuristics were modified to optimise both objectives. Because each scheduling decision could be trapped in local minima, their algorithm included a makespan-conservative local search technique that slightly modifies scheduling decisions only when they do not increase the energy consumption of executing jobs. Later, Mezmaz et al.
[19] improved the previous work by proposing a parallel bi-objective hybrid GA for the same objectives, significantly reducing the running time needed to obtain a solution. The parallel model was based on the cooperative approach of the island model for parallel EAs, combined with a multi-start parallel model using the farmer-worker paradigm. Pecero et al. [20] proposed a bi-objective algorithm with two phases based on the Greedy Randomized Adaptive Search Procedure (GRASP). During the first phase, a greedy evaluation function builds a feasible solution. This solution is then processed in the second phase (by a local search DVS-aware bi-objective algorithm) to not only improve its quality, but also to generate a set of Pareto solutions.
Kim et al. [21] addressed the task scheduling problem considering priority and deadline constraints in ad-hoc grids. They assumed devices with limited battery capacity that are equipped with DVS-enabled power management systems. A resource manager was designed to exploit the heterogeneity of tasks while managing the energy. By studying several online and batch-mode dynamic heuristics based on MinMin, Luo et al. [22] showed that batch-mode dynamic scheduling outperforms online approaches, though it requires significantly more computation time. Li et al. [23] also introduced a MinMin-based online dynamic power management strategy with multiple power-saving states to reduce the energy consumption of scheduling algorithms. Pinel et al. [24] also proposed a two-phase heuristic to schedule independent tasks on grids with energy considerations. First, a MinMin approach is applied to optimise the makespan, and then a local search is performed to minimise energy consumption. Through extensive simulation-based comparisons against a parallel asynchronous cellular GA, they showed that their proposed algorithm produces solutions fairly comparable to those of the GA, but in much less time. Lindberg et al. [25] studied the task scheduling problem with the aim of minimising makespan and energy subject to a deadline constraint and the tasks' memory requirements; eight heuristics (six greedy algorithms based on list scheduling and two GAs) were introduced to solve the problem using the dynamic voltage scaling technique. The two GAs were found to be too slow compared to the heuristics, and they reported worse quality solutions. In a recent work, Iturriaga et al. [26] proposed a parallel multi-objective local search algorithm, based on Pareto dominance, to minimize energy consumption and makespan in the independent tasks scheduling problem. It was shown to outperform a set of fast and accurate two-phase deterministic heuristics based on the traditional MinMin.
In our previous work [27], we introduced an energy consumption model for multicore computing systems. Our approach did not apply DVS or other specific techniques for power/energy management. Instead, we proposed an energy consumption model based on the energy required to execute tasks at full capacity, the energy when not all the available cores of the machine are used, and the energy that each machine in the system consumes in idle state. We proposed twenty fast list scheduling methods adapted to solve a bi-objective problem, simultaneously optimizing both makespan and energy consumption when executing tasks on a single computing node. In that work we tackled the problem of scheduling independent Bag-of-Tasks (BoT) applications. In this article, we extend the previous approach to solve
a more complex multi-objective optimization problem, considering large jobs whose tasks have precedences, modeled as DAGs. In addition, we propose a fully hierarchical scheduler that operates on two levels to efficiently plan large jobs in distributed data-centers. The approach in this article is closer to the realistic situations arising in today's distributed computing infrastructures.

3. Problem Definition

The problem we consider in this work is the static scheduling of batches with a large number of workloads (characterized as DAGs) on large distributed data-centers, composed of a heterogeneous set of clusters of multi-core processors. We target the minimization of the makespan, the global energy consumption, and the penalizations due to overdue deadlines associated with applications. We assume that there is a front-end server to which all workloads are submitted. It collects all workloads arriving during a few seconds, and then proceeds to schedule them on the available servers. Therefore, the problem requires very fast solutions, to keep the system quickly responsive and minimize user waiting times.

We deal with the scheduling of large workloads of n independent heterogeneous jobs J = {j0, j1, ..., jn} on a set of k heterogeneous computing nodes CN = {CN0, CN1, ..., CNk}. Table 1 describes all the symbols used in this work. In the problem model we define:

• Each computing node CNr is a collection of NPr multicore processors (NPr may be different for every CNr). It is represented as a tuple (opsr, cr, E^r_IDLE, E^r_MAX, NPr) defining the performance of its composing processors in terms of the floating-point operations per second (FLOPS) they can process, their number of cores, and their energy consumption at idle and peak usage, as well as the number of processors composing the node, respectively.

• Each job jq is a parallel application that is decomposed into a set of tasks Tq = {tq0, tq1, ...,
tqm} with dependencies among them; each task typically has different computational requirements.

• Every job jq has an associated deadline Dq before which it must be accomplished.
• Each task tqα is a tuple tqα = (oqα, ncqα) containing its length oqα (in terms of its number of operations) and the number of cores required to execute it in parallel, ncqα.
Table 1: Summary of the symbols used

Symbol       Description
J            Set of jobs to schedule
n            Number of jobs to schedule
jq           A given job
Dq           Deadline of jq
CN           Set of available computing nodes
k            Number of available computing nodes
CNr          A given computing node
NPr          Number of processors in CNr
opsr         FLOPS that each processor in CNr can perform
cr           Number of cores of the processors in CNr
E^r_IDLE     Energy consumption of processors in CNr in idle state
E^r_MAX      Energy consumption of processors in CNr at peak usage
Tq           Set of tasks composing jq
tqα          A given task of jq
oqα          Number of operations of task α in jq
ncqα         Number of cores required by task α in jq
CTr          Completion time of CNr
Fq           Finishing time of job jq in its assigned CN
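The problem model above maps naturally onto a few small records; a minimal sketch in Python (the class and field names are ours, not from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Task:                # t_qa = (o_qa, nc_qa)
    ops: float             # length o_qa, in number of operations
    cores: int             # cores required to run it in parallel, nc_qa

@dataclass
class Job:                 # j_q: a DAG of tasks with a deadline D_q
    tasks: List[Task]
    edges: List[Tuple[int, int]]   # (a, b): task b cannot start before task a completes
    deadline: float        # D_q

@dataclass
class ComputingNode:       # CN_r = (ops_r, c_r, E^r_IDLE, E^r_MAX, NP_r)
    ops: float             # FLOPS of each processor
    cores: int             # cores per processor, c_r
    e_idle: float          # idle power, E^r_IDLE
    e_max: float           # peak power, E^r_MAX
    processors: int        # NP_r
```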
We represent every job as a Directed Acyclic Graph (DAG). It is a precedence task graph jq = (V, E), where V is a set of m nodes, each one corresponding to a task tqα (0 ≤ α ≤ m) of the parallel program jq, and E is the set of directed edges between the tasks that maintains a partial order among them. Let ≺ be the partial order of the tasks: tqα ≺ tqβ models the precedence constraint, i.e., if there is an edge eαβ ∈ E, then task tqβ cannot start its execution before task tqα completes. We consider communication costs negligible, as communications only happen between servers within the same CN.

We model the aforementioned scenario with the following multi-objective problem:

• Minimize the makespan, defined as:

    fM(x) = max_{0≤r≤k} CTr,

where x represents an allocation, k is the number of available computing nodes, and CTr is the completion time of computing node r (CNr).
• Minimize the energy consumption: we apply the energy consumption model for multicore architectures developed by Nesmachnow et al. [27]. The total energy consumption for a set of jobs executed on the computing nodes is defined in Equation 1, where f1 is the upper-level scheduling function and f2 is the lower-level scheduling function:

    fE(x) = Σ_{r ∈ CN} Σ_{jq ∈ J: f1(jq) = CNr} Σ_{tqi ∈ Tq: f2(tqi) = pj} EC(tqi, pj) + Σ_{pj ∈ CN} EC_IDLE(pj)    (1)

The total energy consumption accounts for both the energy required to execute the tasks assigned to each resource within a CN, and the energy that each processor in the CN consumes in idle state. The proposed model for the energy consumption is based on a linear function of the number of cores used in the machine [28] (Equation 2). Within each multicore computer, the energy consumption varies linearly from E_IDLE when the processor is idle to E_MAX when the processor is fully used; UC is the number of used cores, TC is the total number of cores in the machine, and E(UC) is the energy consumed when using UC cores:

    E(UC) = E_IDLE + (E_MAX − E_IDLE) × UC/TC    (2)

A simple method is applied to compute the energy consumption of every single machine in each CN. Sorting the cores by their processing time, the energy consumption for a given machine in a CN is computed by Equation 3, where t_i denotes the sum of the execution times of the tasks assigned to core i:

    E = t_0 × E_MAX + Σ_{h=1..TC−1} E(TC − h) × [t_h − t_{h−1}]                        if UC = TC    (3)
    E = t_{TC−UC} × E(UC) + Σ_{h=1..UC−1} E(UC − h) × [t_{TC−UC+h} − t_{TC−UC+h−1}]    if UC < TC
Figure 1 exemplifies the energy consumption calculation for a given schedule in a machine mj, considering the cases of a fully loaded machine (UC = TC, in (a)) and a partially loaded machine (UC < TC, in (b)), respectively.

Figure 1: Computing the energy consumption for machine mj ((a) UC = TC; (b) UC < TC)
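Equations 2 and 3 can be folded into a single pass over the sorted per-core busy times: each interval between consecutive core finishing times is charged at the power level of the cores still running. A minimal sketch (function names are ours); note it covers only a machine's own active window, while the idle term of Equation 1 is accounted for separately:

```python
def core_power(used_cores, total_cores, e_idle, e_max):
    # Eq. (2): power grows linearly from E_IDLE (no core used)
    # to E_MAX (all cores used)
    return e_idle + (e_max - e_idle) * used_cores / total_cores

def machine_energy(busy_times, total_cores, e_idle, e_max):
    # busy_times: total busy time t_i of each used core; unused cores
    # are treated as finishing at time 0, which makes both branches of
    # Eq. (3) (UC = TC and UC < TC) fall out of the same loop.
    t = [0.0] * (total_cores - len(busy_times)) + sorted(busy_times)
    energy, prev = 0.0, 0.0
    for finished, ti in enumerate(t):
        # (total_cores - finished) cores are still running during (prev, ti]
        energy += core_power(total_cores - finished, total_cores,
                             e_idle, e_max) * (ti - prev)
        prev = ti
    return energy
```

With the sample CN of Section 4.2 (E_IDLE = 40 W, E_MAX = 100 W, quad-core machines), a machine with a single core busy for 10 time units consumes E(1) × 10 = 55 × 10 energy units.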
By using the previous method, a simple energy consumption calculation is included in the proposed scheduling heuristics, extending the methodology proposed in [27]. A given schedule is evaluated by simply iterating over the CNs, computing the energy consumption of every task assigned to every machine, and taking into account the "holes" that appear when a computing resource is not fully used (i.e., because a task only requires a part of the available cores in the machine).

In addition to the two defined objectives, we also study two metrics that measure the quality of solutions from the point of view of both the customer/user and the infrastructure administrator. They are not considered as additional objectives to optimise in this paper, but we designed the ad-hoc lower-level scheduling heuristics trying to provide accurate values for these metrics, as we find they are good indicators of the quality of the solutions found:

• Minimise the number of violated deadlines of the scheduled applications. Every application jq defines a penalty function that specifies the cost of not meeting its deadline. This penalty function is defined in terms of the amount of time exceeded after the deadline, Penaltyq(Fq). Therefore, the cost is 0 when the application finishes before its deadline, and Penaltyq(Fq) when Fq > Dq. The cost of a given schedule is defined as the sum of the costs of every application:

    fC(x) = Σ_{0≤q≤n} Cost(x, q),    (4)
Figure 2: Scenario considered in this work (customers submit workflow applications to a front-end server, whose higher-level scheduler dispatches sets of jobs to the clusters; each cluster runs its own lower-level scheduler)
where Cost(x, q) is the cost of job jq in schedule x, defined as:

    Cost(x, q) = 0 if Fq < Dq; Penaltyq(Fq) otherwise.    (5)

Three different penalty functions are considered in this work. The assignment of a function to a job is given by the problem instance, and it is done according to the job's priority (harsher penalizations are assigned to higher priority jobs):

    Penaltyq(Fq) = √(Fq − Dq)    (SQRT)
    Penaltyq(Fq) = Fq − Dq       (LIN)
    Penaltyq(Fq) = (Fq − Dq)²    (SQR)    (6)
• Resources utilisation. We measure, averaged over all computing nodes, the percentage of time during which the resources were not idle.

4. Proposed scheduling methods

This section presents the schedulers proposed to solve the considered problem. Because of the large-scale nature of the problem, we design a two-level scheduler that decomposes the large problem into several smaller ones that are easier to tackle. As shown in Figure 2, all workflow application requests are collected by the front-end server, which implements a higher-level scheduler that dispatches all requests among the available clusters, that
we call CNs. Then, every computing node implements a local or lower-level scheduler that finds an accurate planning of the assigned jobs onto the multicore servers composing it. We study in this work four different heuristics for each of the two scheduling levels. They are presented in the following subsections.

4.1. Higher level schedulers

We propose four different strategies for the higher-level scheduler. All of them only assign jobs to those computing nodes where they can be executed, meaning that their servers have enough cores to execute any task in the job. Two of these strategies do not make use of any information related to the jobs (i.e., round robin and load balance), while the other two make use of heuristic functions estimating the time or energy required to execute the jobs. They are MaxMin and MaxMIN, the two best heuristics for energy-aware independent task allocation studied in [27]. They are described next:

1. Round Robin (RR): This heuristic iteratively assigns every job to the next computing node. If the job cannot be executed in the selected computing node (because some task in it requires more cores than the number of cores of the servers in the computing node), then the heuristic continues iterating over the next nodes until a suitable computing node is found.

2. Load Balance (LB): It adjusts the number of jobs assigned to every computing node. Jobs are ordered according to the number of cores they require, defined as the maximum number of cores demanded by any of their tasks. Then, the jobs with higher core requirements are assigned first. Each is allocated to the computing node with the lowest number of jobs assigned, among those that can execute it.

3.
MaxMin: It first assigns every job to the computing node that can finish it earliest (taking into account all previous assignments); then, among all these (job, computing node) pairs, the heuristic chooses the one with the overall maximum completion time, among the feasible assignments (i.e., the servers of the computing node have enough cores to execute the job). Therefore, larger tasks are allocated first to the most suitable computing nodes and shorter tasks are mapped afterwards, trying to balance the load of all computing nodes.
4. MaxMIN: This heuristic is similar to MaxMin, with the exception that in the first step the computing node associated with every job is the one that can complete it with the least energy use (also taking into account the already assigned jobs). The second step is the same as in MaxMin, choosing the (job, computing node) pair with maximum completion time, subject to the constraint that the number of cores required by the job is not larger than the number of cores in the servers of the chosen computing node.

In order to guide the search of the MaxMin and MaxMIN algorithms, we need heuristic functions that give an idea of the completion time and resource use of jobs in the different CNs, as well as of the energy required to execute them. We propose to use rough estimations that help guide the search at a very low computational cost. We approximate the completion time of a job in the assigned CN as the sum of the expected times to compute all tasks in the job, if they were executed sequentially, divided by the total number of cores available in the CN, as defined in Equation 7:

    hCT(jq, CNr) = Σ_{tqα ∈ Tq} (oqα / opsr) × ncqα / (cr × NPr)    (7)

This heuristic function obviously yields unrealistic values for the completion time of jobs, but we consider it a good indicator to guide our algorithms when comparing the resources required by the different jobs in the available CNs.

To estimate the energy required to compute job jq in CNr, we multiply hCT by the number of processors in CNr and the energy consumption of such processors at peak power, and add the time CNr remains idle after finishing its assigned jobs until the last CN executes all its jobs (i.e., the makespan value):

    hE(jq, CNr) = hCT(jq, CNr) × NPr × E^r_MAX + max_{0≤x≤k} ECTx − ECTr,    (8)

where ECTr is the estimated completion time of computing node r, according to the values computed by heuristic hCT.
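The two estimators can be sketched in a few lines; a minimal sketch of Equations 7 and 8 (the tuple layouts for tasks and computing nodes are our assumption):

```python
def h_ct(tasks, ops_r, c_r, np_r):
    # Eq. (7): total sequential core-work of the job's tasks, spread
    # evenly over the c_r * NP_r cores of computing node r.
    # tasks: list of (o_qa, nc_qa) pairs.
    work = sum(o / ops_r * nc for o, nc in tasks)  # Σ (o_qa / ops_r) × nc_qa
    return work / (c_r * np_r)

def h_e(tasks, r, cns, ect):
    # Eq. (8): run the job at peak power on CN_r, plus the idle gap
    # between CN_r's estimated completion time and the latest CN's.
    # cns[r] = (ops_r, c_r, NP_r, E^r_MAX); ect[r] = ECT_r from h_ct.
    ops_r, c_r, np_r, e_max = cns[r]
    return h_ct(tasks, ops_r, c_r, np_r) * np_r * e_max + max(ect) - ect[r]
```

For instance, a job with tasks (100 ops, 2 cores) and (50 ops, 4 cores) on a node with ops_r = 10, one quad-core processor, gives hCT = (20 + 20) / 4 = 10 time units.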
4.2. Lower level schedulers

The proposed low-level scheduling heuristics are based on the Heterogeneous Earliest Finish Time (HEFT) strategy [29]. HEFT is a successful scheduler for DAG-modeled applications that works by assigning priorities to tasks according to the upward rank metric, which evaluates the expected distance of each task to the last node in the DAG (the end of the computation). The upward rank is recursively defined by URi = ti + max_{j ∈ succ(i)} (cij + URj), where ti is the (average) execution time of task i on the computing resources, succ(i) is the list of successors of task i, and cij is the communication cost between tasks i and j. After sorting all tasks in the job by their upward rank, HEFT assigns the task with the highest upward rank to the computing element that can execute it the earliest.

The proposed heuristics for low-level scheduling in data-centers follow the schema of HEFT, but use a backfilling technique and adapt the logic to work with multicore computing resources, taking into account the "holes" that appear when a specific computing resource is not fully used by a single task (i.e., because the task only requires a part of the available cores in the machine). Four variants of the proposed scheduler were designed and implemented:

1. Best Fit Hole (BFH): sorts the tasks according to the upward rank values, then gives priority to assigning tasks to existing holes rather than using empty machines in the CN. When a given task fits in more than one hole, the heuristic selects the hole that "best fits" the task (i.e., the hole that minimizes the difference between the hole duration and the time to compute the task), disregarding the finishing time of the task. When no hole is available to execute the task, the heuristic selects the machine with the minimum finishing time for the task.
The machines (and also machine holes) within the CN are processed sequentially and in order (from machine #0 to machine #M). The rationale behind this strategy is to use available holes and leave large holes and empty machines unoccupied for upcoming tasks. Ties between holes, as well as between machines, are decided lexicographically, as the method searches both holes and machines sequentially (in order).

2. NOUR Best Fit Hole (NOUR): applies a "best fit hole" strategy, but without sorting the tasks by the upward rank metric. Instead, the heuristic simply sorts the list of tasks lexicographically (from task #0 to task #N), while obviously still respecting the precedence graph. This heuristic is intended to produce compact schedules by not sticking to the importance given by the upward rank metric.

3. Earliest Finish Time Hole (EFTH): gives precedence to holes in machines, but instead of searching for the hole that best fits the task, the heuristic selects the hole that can complete the task at the earliest time, disregarding the hole length. As a consequence, this variant will (hopefully) produce fewer deadline violations and lower penalization values. When no hole is available to execute the task, the heuristic assigns it to the machine with the minimum finish time for that task.

4. Optimistic Load Balancing (OLB): applies an optimistic strategy for spreading the tasks across machines, trying to reduce the makespan of the whole planning. Both holes and machines are searched, and the choice that produces the lowest increase in the local makespan is selected.

In order to exemplify how the proposed heuristics assign DAG jobs to multicore CNs, we present a simple case study using a single CN composed of four quad-core machines and a workload containing six small jobs. Figure 3 presents the jobs (for each job, id is the task identifier, T is the length of the task, and C is the number of cores required). The real time to compute a task is defined by the task length divided by the processor speed. The deadlines for the jobs are D(j0) = 20.0, D(j1) = 15.0, D(j2) = 20.0, D(j3) = 20.0, D(j4) = 30.0, and D(j5) = 30.0. A linear penalization function is assumed for all jobs. The sample CN used in the study has the following features: speed = 1.6 GHz, #machines = 4, #cores in each machine = 4, E_IDLE = 40.0 W, and E_MAX = 100.0 W. The numerical results for the proposed case study are summarized in Table 2.
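The difference between BFH and EFTH comes down to the rule used to pick among the holes that fit a task; a minimal sketch (representing a hole as a hypothetical (start, length) pair is our assumption, not the paper's data structure):

```python
def pick_hole(holes, task_time, policy):
    # holes: idle gaps on the CN's machines, as (start, length) pairs.
    # BFH: tightest fitting hole; EFTH: fitting hole with earliest finish.
    fitting = [h for h in holes if h[1] >= task_time]
    if not fitting:
        return None  # caller falls back to the machine with minimum finish time
    if policy == "BFH":
        # minimize the leftover gap, disregarding when the task would finish
        return min(fitting, key=lambda h: h[1] - task_time)
    # EFTH: minimize the task's finishing time, disregarding the hole length
    return min(fitting, key=lambda h: h[0] + task_time)
```

Ties are broken by scan order, matching the lexicographic tie-breaking described above.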
Besides the problem objectives (makespan and energy consumption), we also report the penalization cost and the effective utilization of the computing infrastructure. Figure 4 presents the schedules obtained when applying the proposed heuristics.

Figure 3: Sample instance for the small case study (one CN, six jobs)

Table 2: Results for the small case study (one CN, six jobs)

heuristic   makespan   energy (W/h)   penalization   utilization
BFH         28.75      2.91           5.83           73.9%
EFTH        32.31      3.25           6.95           65.8%
NOUR        31.18      2.86           3.34           68.5%
OLB         36.93      3.19           4.47           56.3%

Analyzing the results reported in Table 2 and the schedules in Figure 4, we can see that BFH is the heuristic that provides the best packing, hole utilization, and makespan results for the considered problem. On the other hand, NOUR provides the solutions with the lowest energy consumption and penalization values, showing that not sticking to the upward rank values can be useful to properly spread the tasks among machines, thus reducing deadline violations. Finally, EFTH and OLB compute balanced trade-off schedules for the simple case study tackled.
Figure 4: Sample solutions for the small case study (one CN, six jobs): (a) BFH, (b) NOUR, (c) EFTH, (d) OLB
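The upward-rank ordering that drives BFH and EFTH can be sketched as a memoized recursion over the DAG; a minimal sketch (function and argument names are ours), with communication costs defaulting to 0, consistent with the negligible intra-CN communication assumed in Section 3:

```python
def upward_ranks(exec_time, succ, comm=None):
    # exec_time: {task: average execution time t_i}
    # succ: {task: list of successor tasks}
    # comm: {(i, j): communication cost c_ij}, taken as 0 when absent
    comm = comm or {}
    memo = {}
    def ur(i):
        # UR_i = t_i + max over successors j of (c_ij + UR_j);
        # exit tasks (no successors) get UR_i = t_i
        if i not in memo:
            memo[i] = exec_time[i] + max(
                (comm.get((i, j), 0) + ur(j) for j in succ.get(i, [])),
                default=0)
        return memo[i]
    return {i: ur(i) for i in exec_time}
```

Tasks are then dispatched in decreasing upward-rank order, which guarantees every task is placed after all of its predecessors.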
5. Experimental Analysis

This section describes the experimental setup for the comparison of the 16 combinations of the scheduling heuristics proposed in this article. After that, the main numerical results are presented and discussed.

5.1. Problem Instances

To gauge the performance of our algorithms, we generate 125 different batches of workflows, each containing 1000 workflows. Our tailor-made SchMng tool [30] was used to produce these workflows, similar to those used in other approaches [31, 32, 33]. To cover all possible scenarios, we use four different workflow models to compose each batch: (1) Heterogeneous-Parallel, (2) Homogenous-Parallel, (3) Series-Parallel, and (4) Single-Task. Fig. 5 shows the overall shape of the different workflow types, aimed to reflect real applications. Here, each block represents a computational task with specific characteristics, i.e., execution time (height of a task) and number of processors (width of a task). Heterogeneous-Parallel represents a generic job composed of non-identical computational blocks with arbitrary precedences; Homogenous-Parallel represents jobs composed of identical computational blocks; Series-Parallel represents jobs that can be split into concurrent threads running in parallel; and Single-Task represents jobs of only one task. The number of tasks in the workflows ranges from 3 to 132, in addition to the Single-Task ones, composed of only one task.
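The four workflow shapes can be pictured as precedence DAGs. The generators below are illustrative stand-ins (the real instances were produced with SchMng); all task counts, probabilities, and structural choices are assumptions for illustration:

```python
# Hedged sketches of the four workflow shapes as precedence DAGs.
# These are illustrative stand-ins, not the SchMng generators.
import random

def series_parallel(stages, width):
    """Fork-join chain: an entry task, then `stages` stages of `width`
    parallel tasks, each followed by a join task."""
    edges, prev, nid = [], 0, 1            # task 0 is the entry task
    for _ in range(stages):
        stage = list(range(nid, nid + width))
        nid += width
        edges += [(prev, s) for s in stage]    # fork
        join, nid = nid, nid + 1
        edges += [(s, join) for s in stage]    # join
        prev = join
    return nid, edges                      # (number of tasks, edges)

def heterogeneous_parallel(n, p=0.2, seed=0):
    """n non-identical tasks with random forward precedences."""
    rng = random.Random(seed)
    return n, [(i, j) for i in range(n) for j in range(i + 1, n)
               if rng.random() < p]

def homogeneous_parallel(n):
    """n identical computational blocks (assumed independent here)."""
    return n, []

def single_task():
    return 1, []
```

Edges always point from lower to higher task ids, so the generated precedence relations are acyclic by construction.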
Figure 5: Workflow types: (a) Heterogeneous-Parallel, (b) Homogenous-Parallel, (c) Series-Parallel, and (d) Single-Task
From these 125 batches of workflows:

• 25 were composed of 1000 Heterogeneous-Parallel workflows (25000 workflows altogether)

• 25 were composed of 1000 Homogenous-Parallel workflows (25000 workflows altogether)

• 25 were composed of 1000 Series-Parallel workflows (25000 workflows altogether)

• 25 were composed of 1000 Single-Task workflows (25000 workflows altogether)

• 25 were composed of a combination of the above workflow types, which we call Mix; i.e., 300 Heterogeneous-Parallel workflows, 300 Homogenous-Parallel, 300 Series-Parallel, and 100 Single-Task (25000 workflows altogether)
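The batch composition above is easy to sanity-check programmatically. The sketch below only mirrors the listed counts; the batch contents themselves are placeholders:

```python
# Hedged sketch mirroring the composition of the 125 workflow batches.
# Counts come from the list above; actual workflow contents are omitted.
def compose_batches():
    plans = {
        "Heterogeneous-Parallel": {"Heterogeneous-Parallel": 1000},
        "Homogenous-Parallel":    {"Homogenous-Parallel": 1000},
        "Series-Parallel":        {"Series-Parallel": 1000},
        "Single-Task":            {"Single-Task": 1000},
        "Mix": {"Heterogeneous-Parallel": 300, "Homogenous-Parallel": 300,
                "Series-Parallel": 300, "Single-Task": 100},
    }
    # 25 batches per plan, 1000 workflows per batch
    return [(kind, dict(mix))
            for kind, mix in plans.items() for _ in range(25)]

batches = compose_batches()
print(len(batches), sum(sum(m.values()) for _, m in batches))  # 125 125000
```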
Table 3: Characteristics of the considered processors

processor              frequency   cores   GFLOPS   EIDLE     EMAX      GFLOPS/core
Intel Celeron 430      1.80 GHz    1       7.20     75.0 W    94.0 W    7.20
Intel Pentium E5300    2.60 GHz    2       20.80    68.0 W    109.0 W   10.40
Intel Core i7 870      2.93 GHz    4       46.88    76.0 W    214.0 W   11.72
Intel Core i5 661      3.33 GHz    2       26.64    74.0 W    131.0 W   13.32
Intel Core i7 980 XE   3.33 GHz    6       107.60   102.0 W   210.0 W   17.93
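Given the EIDLE and EMAX values in Table 3, a simple per-machine energy estimate can be sketched. The linear interpolation between idle and maximum power as a function of busy cores is an assumption for illustration, not necessarily the exact model used in the article:

```python
# Hedged sketch of an energy estimate from the Table 3 data. The linear
# idle-to-max power interpolation in the number of busy cores is an
# illustrative assumption.
PROCS = {  # name: (cores, E_IDLE in W, E_MAX in W)
    "Intel Celeron 430":    (1,  75.0,  94.0),
    "Intel Pentium E5300":  (2,  68.0, 109.0),
    "Intel Core i7 870":    (4,  76.0, 214.0),
    "Intel Core i5 661":    (2,  74.0, 131.0),
    "Intel Core i7 980 XE": (6, 102.0, 210.0),
}

def power(proc, busy_cores):
    """Instantaneous power draw (W) with `busy_cores` cores in use."""
    cores, e_idle, e_max = PROCS[proc]
    assert 0 <= busy_cores <= cores
    return e_idle + (e_max - e_idle) * busy_cores / cores

def energy_wh(proc, busy_cores, hours):
    """Energy (watt-hours) of running `busy_cores` cores for `hours`."""
    return power(proc, busy_cores) * hours

print(energy_wh("Intel Core i7 870", busy_cores=2, hours=1.0))  # 145.0 Wh
```

Under this model an idle machine still draws EIDLE, which is why packing tasks into holes (rather than spreading them) can reduce total energy.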
These workflows can be accessed/downloaded by contacting the authors. Our computing scenario is composed of five different clusters, also called computing nodes (CNs), in which the workflows must be executed. Every cluster is made of a homogeneous set of processors. In this work, we considered the processors listed in Table 3, together with their characteristics.

5.2. Numerical results

Table 4 reports the numerical results computed for each scheduler and each type of problem instance solved in the experimental analysis. We report the average performance difference (in %) of every algorithm with respect to the best result for every instance for the makespan (fM) and energy consumption (fE) objectives. We also study the average performance difference in the penalization cost for violated deadlines (fC) and the average utilization of the resources (U, in % of the total computing time). The results in Table 4 demonstrate that the list scheduling heuristics from the previous work [27] are also the best strategies for the two-level scheduling approach of DAG jobs applied in this work, clearly outperforming the schedulers based on RR and LB. MaxMin was consistently the best scheduler for Single-Task applications regarding both makespan and energy, and the best utilization values were obtained when applying the OLB and NOUR strategies in the low level. For Series-Parallel DAGs, MaxMin and MaxMIN computed the best makespan and energy results, in both cases using OLB in the lower level, but the penalization cost is significantly larger than when using a simple RR strategy. MaxMin-EFTH was the best scheduler for Homogeneous-Parallel instances, and MaxMIN-EFTH for Heterogeneous-Parallel instances, outperforming the other combinations of higher-level/lower-level schedulers. Finally, in Mix workflow instances, which could represent the most generic case of multipurpose high performance computing applications, MaxMIN (using both
Table 4: Average performance difference (in %) for makespan (fM), energy (fE), and penalization cost (fC) of every algorithm with respect to the best result for every instance, and average utilization of the resources (U, in %)

Single-Task
algorithm      fM     fE     fC     U
RR-BFH         30.3   18.2   1.5    84.2
RR-EFTH        30.4   18.4   0.3    83.9
RR-NOUR        30.3   18.2   1.5    84.2
RR-OLB         30.5   18.5   1.0    83.7
LB-BFH         32.9   22.7   24.3   86.4
LB-EFTH        32.9   22.8   23.6   85.9
LB-NOUR        32.9   22.7   24.3   86.4
LB-OLB         32.9   22.8   23.9   85.8
MaxMin-BFH     0.1    0.1    37.2   93.5
MaxMin-EFTH    0.1    0.1    37.2   93.4
MaxMin-NOUR    0.1    0.1    37.2   93.5
MaxMin-OLB     0.1    0.1    37.2   93.0
MaxMIN-BFH     9.1    5.6    47.1   93.2
MaxMIN-EFTH    9.1    5.7    47.1   93.2
MaxMIN-NOUR    9.1    5.6    47.1   93.2
MaxMIN-OLB     8.5    5.2    46.9   92.9

Series-Parallel
algorithm      fM     fE     fC     U
RR-BFH         15.7   15.4   27.7   30.9
RR-EFTH        1.8    1.6    1.0    38.0
RR-NOUR        14.1   13.8   21.6   32.5
RR-OLB         1.3    1.1    0.1    36.6
LB-BFH         15.2   14.9   39.2   33.2
LB-EFTH        1.3    1.2    18.3   41.2
LB-NOUR        14.0   13.7   34.9   35.8
LB-OLB         0.7    0.6    17.7   39.9
MaxMin-BFH     14.5   14.3   56.1   33.5
MaxMin-EFTH    0.8    0.7    41.0   41.3
MaxMin-NOUR    13.5   13.3   53.4   36.2
MaxMin-OLB     0.1    0.1    40.5   39.6
MaxMIN-BFH     14.5   14.3   56.1   34.6
MaxMIN-EFTH    0.8    0.8    41.0   42.6
MaxMIN-NOUR    13.5   13.3   53.4   37.3
MaxMIN-OLB     0.1    0.1    40.5   40.8

Homogeneous-Parallel
algorithm      fM     fE     fC     U
RR-BFH         41.8   23.7   4.4    85.2
RR-EFTH        41.3   23.4   2.1    87.3
RR-NOUR        42.1   24.2   4.0    85.5
RR-OLB         41.2   23.3   1.2    86.8
LB-BFH         42.7   23.3   38.6   80.3
LB-EFTH        41.6   25.6   34.5   87.7
LB-NOUR        42.3   23.9   37.0   81.7
LB-OLB         41.0   25.0   34.1   87.2
MaxMin-BFH     34.0   16.1   37.4   81.3
MaxMin-EFTH    0.1    0.5    16.8   91.2
MaxMin-NOUR    17.9   6.8    28.0   84.8
MaxMin-OLB     0.1    0.6    17.2   91.0
MaxMIN-BFH     51.0   30.2   73.7   81.0
MaxMIN-EFTH    50.2   32.9   72.0   89.9
MaxMIN-NOUR    50.6   31.1   73.1   83.1
MaxMIN-OLB     50.0   32.8   71.7   89.5

Heterogeneous-Parallel
algorithm      fM     fE     fC     U
RR-BFH         12.4   11.9   20.5   33.9
RR-EFTH        2.2    1.9    0.6    38.6
RR-NOUR        10.2   9.7    15.1   35.2
RR-OLB         2.2    1.9    0.7    37.2
LB-BFH         11.5   11.1   27.4   34.4
LB-EFTH        0.6    0.6    11.0   39.6
LB-NOUR        9.8    9.5    24.5   35.3
LB-OLB         0.7    0.7    11.3   38.1
MaxMin-BFH     10.8   10.6   48.3   35.3
MaxMin-EFTH    0.2    0.2    35.3   40.1
MaxMin-NOUR    9.0    8.9    45.5   36.9
MaxMin-OLB     0.3    0.3    35.7   38.6
MaxMIN-BFH     10.8   10.6   48.3   35.5
MaxMIN-EFTH    0.2    0.2    35.3   40.3
MaxMIN-NOUR    9.0    8.9    45.5   37.1
MaxMIN-OLB     0.3    0.3    35.7   38.8

Mix
algorithm      fM     fE     fC     U
RR-BFH         18.6   17.4   8.6    81.1
RR-EFTH        15.9   15.1   0.9    85.4
RR-NOUR        18.4   17.4   6.5    82.4
RR-OLB         15.7   15.0   0.7    84.7
LB-BFH         27.6   24.0   21.5   74.9
LB-EFTH        25.1   24.0   13.2   85.7
LB-NOUR        27.2   24.4   18.7   76.9
LB-OLB         25.0   23.9   13.6   84.9
MaxMin-BFH     4.6    2.0    44.6   74.1
MaxMin-EFTH    0.7    5.0    35.4   89.3
MaxMin-NOUR    4.4    4.5    42.0   78.6
MaxMin-OLB     0.6    5.1    35.6   88.6
MaxMIN-BFH     4.8    2.7    43.8   75.0
MaxMIN-EFTH    0.4    5.0    36.0   89.7
MaxMIN-NOUR    4.6    5.0    41.3   79.4
MaxMIN-OLB     0.4    5.0    36.2   89.0
EFTH and OLB lower-level strategies) computed the best results regarding the makespan objective, and MaxMin-BFH was the heuristic that best optimized the energy consumption. Accurate resource utilization values (up to 90%) were achieved in the best case. In both Series-Parallel and Heterogeneous-Parallel instances, the average utilization of the computing infrastructure is significantly lower (between 40% and 50%) than in the other classes of applications. This can be explained by observing that for large heterogeneous applications it is more difficult to take full advantage of the holes that appear when a given task does not use all the cores available in a machine. Thus, despite making the best effort to fill the existing holes, hole-oriented strategies such as the one applied in the lower level of the hierarchical scheduler might not be enough to fully reduce the makespan and energy consumption. In any case, the energy-aware schedulers compute solutions with about 5% to 10% better utilization values than those computed using simpler strategies such as RR and LB.

We show in Fig. 6 some representative examples of the influence of the higher- and lower-level heuristics on the quality of the results, in terms of makespan and energy consumption. It can be seen in Fig. 6a how the EFTH and OLB lower-level schedulers provide, in general, better makespan and energy consumption values than BFH and NOUR. This behavior was also observed in some cases for the Mix instances; however, results like those shown in Fig. 6b happen more often. Here we show the results for the MaxMIN higher-level scheduler, but we obtain similar results for the others. We can see how OLB and EFTH help find lower-makespan solutions, but at the cost of higher energy consumption in this case. We now study the influence of the different higher-level heuristics, with the same lower-level one, in Figures 6c and 6d. Fig. 6c shows the results found by the different heuristics for a Heterogeneous-Parallel instance, when fixing the lower-level scheduler to EFTH. It can be seen how the MaxMin and MaxMIN heuristics clearly provide better results than RR and LB for both objectives, and RR is the worst performing one. In the last plot, Fig. 6d, we show the performance of the different higher-level schedulers when fixing the lower-level scheduler to OLB, for a Mix instance. We see that, again, MaxMin and MaxMIN are the best performing heuristics, while LB is the worst one in this case.

Table 5 summarizes the ranking of all the algorithms according to the makespan and energy objectives. These rank values were obtained after applying the Friedman statistical test on the average values obtained by the algorithms
Figure 6: Influence of the different heuristics on the solutions found, plotting energy used (kWh) against makespan (hours): (a) Heterogeneous-Parallel (Instance 1), MaxMin with each lower-level scheduler; (b) Mix (Instance 6), MaxMIN with each lower-level scheduler; (c) Heterogeneous-Parallel (Instance 3), each higher-level scheduler with EFTH; (d) Mix (Instance 11), each higher-level scheduler with OLB
in all instances of every problem class. For both objectives, the Friedman test gave p-values lower than 0.05; therefore, there are statistically significant differences between the algorithms with 95% confidence. The ranks in Table 5 confirm that MaxMin and MaxMIN are the best performing strategies for scheduling in the higher level regarding the considered objectives for the studied problems. Within each CN, the OLB and EFTH schedulers allow computing the best makespan and energy consumption values, respectively. The results also suggest that the greedy strategy used by BFH is not the best choice for simultaneously optimizing both objectives. While useful in some cases (see the toy instance in Section 4), other heuristics that are non-greedy with respect to the hole/machine utilization, such as OLB and EFTH, are able to increase the improvements over RR for both
Table 5: Friedman ranking of the algorithms according to the average makespan and energy consumption found

makespan                        energy
algorithm       ranking         algorithm       ranking
MaxMin-OLB      1.6             MaxMin-EFTH     1.8
MaxMin-EFTH     2.0             MaxMin-OLB      2.2
MaxMIN-OLB      3.8             MaxMin-NOUR     4.9
MaxMin-NOUR     4.7             MaxMIN-OLB      5.0
MaxMIN-EFTH     5.0             MaxMin-BFH      5.5
MaxMin-BFH      6.9             MaxMIN-EFTH     5.8
RR-OLB          7.0             RR-OLB          7.0
RR-EFTH         7.4             RR-EFTH         7.4
LB-OLB          7.7             MaxMIN-NOUR     8.7
MaxMIN-NOUR     8.3             LB-OLB          8.8
LB-EFTH         8.9             MaxMIN-BFH      8.9
RR-NOUR         9.9             RR-NOUR         9.7
MaxMIN-BFH      10.5            LB-EFTH         10.2
RR-BFH          11.3            LB-NOUR         11.1
LB-NOUR         11.5            RR-BFH          11.1
LB-BFH          13.5            LB-BFH          13.9
makespan and energy.

6. Conclusions and Future Work

This article presented a two-level strategy for scheduling large workloads in multicore distributed systems, taking into account the total execution time and the energy consumption objectives. In the higher level, two of the most accurate combined list scheduling heuristics from our previous work [27] were applied to schedule jobs between data-centers. They were adapted to work with DAG workflows by including a rough estimation of the computing time, used as a heuristic to guide the search, and a simple energy consumption estimation. Traditional round-robin and load-balancing techniques were studied too. In the lower level, ad-hoc methods for scheduling DAG jobs onto multicore systems using a backfilling technique were designed and implemented. The rationale behind these methods is to improve the resource utilization and minimize the penalization cost for deadline violations. Four strategies were studied, greedily taking into account the hole utilization, the earliest finish
time for tasks, and the load balancing.

An exhaustive experimental analysis was performed, taking into account 25 instances for each of the 5 studied problem classes, where every instance is composed of 1000 workflow applications. This makes a total of 125000 workflows studied by all the proposed algorithms. The experimental results indicated that accurate schedules are computed by using the MaxMin and MaxMIN combined list scheduling heuristics, which account for both problem objectives in the higher level, along with the EFTH and OLB ad-hoc scheduling techniques for scheduling in multicore infrastructures in the lower level. The best performing schedulers on each problem scenario allow the planner to achieve improvements of up to 46.8% in terms of makespan and 29.0% in terms of energy consumption over a typical round-robin-based strategy. These results suggest that the proposed approach and the scheduling heuristics studied are promising techniques for scheduling large workloads in multicore distributed systems.

As future work, we plan to use multi-objective evolutionary algorithms to learn more about the problem and identify other interesting areas of trade-off solutions, different from the ones found by our heuristics. This study will also allow us to assess the accuracy of the solutions of our two-level schedulers. Additionally, we plan to add the penalization function as a third objective to optimize in the problem, together with makespan and energy use. Finally, another interesting line of future work is the design of schedulers that can efficiently cope with unexpected resource failures.

Acknowledgements

B. Dorronsoro acknowledges that the present project is supported by the National Research Fund, Luxembourg, and cofunded under the Marie Curie Actions of the European Commission (FP7-COFUND), under AFR contract no. 4017742. The work of S. Nesmachnow has been partially supported by ANII and PEDECIBA, Uruguay.

References

[1] A. Y. Zomaya, Y. C. Lee, Energy Efficient Distributed Computing Systems, Wiley-IEEE Computer Society Press, 2012.
[2] I. Ahmad, S. Ranka, Handbook of Energy-Aware and Green Computing, Chapman & Hall/CRC, 2012.

[3] G. Valentini, W. Lassonde, S. Khan, N. Min-Allah, S. Madani, J. Li, L. Zhang, L. Wang, N. Ghani, J. Kolodziej, H. Li, A. Zomaya, C.-Z. Xu, P. Balaji, A. Vishnu, F. Pinel, J. Pecero, D. Kliazovich, P. Bouvry, An overview of energy efficiency techniques in cluster computing systems, Cluster Computing 16 (1) (2013) 3–15.

[4] A. Kumar, L. Shang, L.-S. Peh, N. K. Jha, HybDTM: a coordinated hardware-software approach for dynamic thermal management, in: 43rd ACM/IEEE Design Automation Conference, 2006, pp. 548–553.

[5] A. Tchernykh, L. Lozano, U. Schwiegelshohn, P. Bouvry, J. Pecero, S. Nesmachnow, Energy-aware online scheduling: Ensuring quality of service for IaaS clouds, in: The 2014 International Conference on High Performance Computing & Simulation, 2014, to appear.

[6] A. Abraham, H. Liu, C. Grosan, F. Xhafa, Metaheuristics for Scheduling in Distributed Computing Environments, Vol. 146, Springer, 2008, Ch. Nature Inspired Meta-heuristics for Grid Scheduling: Single and Multiobjective Optimization Approaches, pp. 247–272.

[7] S. Nesmachnow, Parallel multiobjective evolutionary algorithms for batch scheduling in heterogeneous computing and grid systems, Computational Optimization and Applications 55 (2013) 515–544.

[8] B. Dorronsoro, P. Bouvry, J. Cañero, A. Maciejewski, H. Siegel, Multiobjective robust static mapping of independent tasks on grids, in: IEEE Congress on Evolutionary Computation (CEC), part of the World Congress on Computational Intelligence (WCCI), 2010, pp. 3389–3396.

[9] I. De Falco, A. Della Cioppa, D. Maisto, U. Scafuri, E. Tarantino, Applications of Soft Computing, Vol. 58 of Advances in Soft Computing, Springer-Verlag, 2009, Ch. A Multiobjective Extremal Optimization Algorithm for Efficient Mapping in Grids, pp. 367–377.

[10] K. Kurowski, A. Oleksiak, M. Witkowski, J. Nabrzyski, Distributed power management and control system for sustainable computing environments, in: International Green Computing Conference, IEEE, 2010, pp. 365–372.

[11] J. Yu, R. Buyya, A budget constrained scheduling of workflow applications on utility grids using genetic algorithms, in: Proc. of the 15th IEEE Int. Symposium on High Performance Distributed Computing, 2006, pp. 1–10.

[12] J. Yu, M. Kirley, R. Buyya, Multi-objective planning for workflow execution on grids, in: IEEE/ACM Int. Conf. on Grid Computing, 2007, pp. 10–17.

[13] G. Ye, R. Rao, M. Li, A multiobjective resources scheduling approach based on genetic algorithms in grid environment, in: Proc. of the 5th Int. Conf. on Grid and Cooperative Computing Workshops, IEEE Press, 2006, pp. 504–509.

[14] S. Baskiyar, R. Abdel-Kader, Energy aware DAG scheduling on heterogeneous systems, Cluster Computing 13 (2010) 373–383.

[15] N. Rizvandi, J. Taheri, A. Zomaya, Some observations on optimal frequency selection in DVFS-based energy consumption minimization, J. Parallel Distrib. Comput. 71 (2011) 1154–1164.

[16] N. B. Rizvandi, J. Taheri, A. Zomaya, Some observations on optimal frequency selection in DVFS-based energy consumption minimization, Journal of Parallel and Distributed Computing 71 (8) (2011) 1154–1164.

[17] S. Khan, I. Ahmad, A cooperative game theoretical technique for joint optimization of energy consumption and response time in computational grids, IEEE Trans. Parallel Distrib. Syst. 20 (2009) 346–360.

[18] Y. Lee, A. Zomaya, Energy conscious scheduling for distributed computing systems under different operating conditions, IEEE Trans. Parallel Distrib. Syst. 22 (2011) 1374–1381.

[19] M. Mezmaz, N. Melab, Y. Kessaci, Y. Lee, E.-G. Talbi, A. Zomaya, D. Tuyttens, A parallel bi-objective hybrid metaheuristic for energy-aware scheduling for cloud computing systems, J. Parallel Distrib. Comput. 71 (2011) 1497–1508.

[20] J. Pecero, P. Bouvry, H. J. Fraire Huacuja, S. Khan, A multi-objective GRASP algorithm for joint optimization of energy consumption and schedule length of precedence-constrained applications, in: Int. Conf. on Cloud and Green Computing, IEEE CS Press, Sydney, Australia, 2011, pp. 1–8.

[21] J.-K. Kim, H. Siegel, A. Maciejewski, R. Eigenmann, Dynamic resource management in energy constrained heterogeneous computing systems using voltage scaling, IEEE Trans. Parallel Distrib. Syst. 19 (2008) 1445–1457.

[22] P. Luo, K. Lü, Z. Shi, A revisit of fast greedy heuristics for mapping a class of independent tasks onto heterogeneous computing systems, J. Parallel Distrib. Comput. 67 (6) (2007) 695–714.

[23] Y. Li, Y. Liu, D. Qian, A heuristic energy-aware scheduling algorithm for heterogeneous clusters, in: Proc. of the 15th International Conference on Parallel and Distributed Systems, ICPADS '09, IEEE Computer Society, Washington DC, USA, 2009, pp. 407–413.

[24] F. Pinel, B. Dorronsoro, J. Pecero, P. Bouvry, S. Khan, A two-phase heuristic for the energy-efficient scheduling of independent tasks on computational grids, Journal of Cluster Computing. Submitted.

[25] P. Lindberg, J. Leingang, D. Lysaker, S. Khan, J. Li, Comparison and analysis of eight scheduling heuristics for the optimization of energy consumption and makespan in large-scale distributed systems, The Journal of Supercomputing 59 (1) (2012) 323–360.

[26] S. Iturriaga, S. Nesmachnow, B. Dorronsoro, P. Bouvry, Energy efficient scheduling in heterogeneous systems with a parallel multiobjective local search, Computing and Informatics Journal 32 (2) (2013) 273–294.

[27] S. Nesmachnow, B. Dorronsoro, J. E. Pecero, P. Bouvry, Energy-aware scheduling on multicore heterogeneous grid computing systems, Journal of Grid Computing 11 (4) (2013) 653–680.

[28] L. Minas, B. Ellison, Energy Efficiency for Information Technology: How to Reduce Power Consumption in Servers and Data Centers, Intel Press, 2009.

[29] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems 13 (3) (2002) 260–274.

[30] J. Taheri, A. Zomaya, S. Khan, Grid simulation tools for job scheduling and datafile replication, in: Scalable Computing and Communications: Theory and Practice, John Wiley & Sons, Inc., Hoboken, New Jersey, 2013, Ch. 35, pp. 777–797.

[31] J. Taheri, Y. Choon Lee, A. Y. Zomaya, H. J. Siegel, A bee colony based optimization approach for simultaneous job scheduling and data replication in grid environments, Comput. Oper. Res. 40 (6) (2013) 1564–1578.

[32] J. Taheri, A. Y. Zomaya, P. Bouvry, S. Khan, Hopfield neural network for simultaneous job scheduling and data replication in grids, Future Gener. Comput. Syst. 29 (8).

[33] J. Taheri, A. Y. Zomaya, S. Khan, Genetic algorithm in finding Pareto frontier of optimizing data transfer versus job execution in grids, Concurrency and Computation: Practice and Experience, online first, DOI: 10.1002/cpe.2960.