J Supercomput DOI 10.1007/s11227-017-2060-4
Real-time workflows oriented online scheduling in uncertain cloud environment

Huangke Chen1 · Jianghan Zhu1 · Zhenshi Zhang1 · Manhao Ma1 · Xin Shen2
© Springer Science+Business Media New York 2017
Abstract Workflow scheduling has become one of the hottest topics in cloud environments, and efficient scheduling approaches show promising ways to maximize the profit of cloud providers by minimizing their cost while guaranteeing the QoS of users' applications. However, existing scheduling approaches are inadequate for dynamic workflows with uncertain task execution times running in cloud environments, because they assume that cloud computing environments are deterministic and that pre-computed schedule decisions will be followed statically during schedule execution. To address this issue, we introduce an uncertainty-aware scheduling architecture to mitigate the impact of uncertain factors on workflow scheduling quality. Based on this architecture, we present a scheduling algorithm, incorporating both event-driven and periodic rolling strategies (EDPRS), for scheduling dynamic workflows. Lastly, we conduct extensive experiments to compare EDPRS with two typical baseline algorithms using real-world workflow traces. The experimental results show that EDPRS outperforms those algorithms.

Keywords Cloud computing · Event-driven · Periodic rolling · Uncertain scheduling · Online scheduling

Huangke Chen: [email protected]
Jianghan Zhu: [email protected]
Zhenshi Zhang: [email protected]
Manhao Ma: [email protected]
Xin Shen: [email protected]

1 Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, People's Republic of China
2 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, People's Republic of China
1 Introduction

Cloud computing has become a new paradigm in distributed computing. In this paradigm, cloud providers deliver on-demand services (e.g., applications, platforms, and computing resources) to customers in a "pay-as-you-go" model [1]. The cloud paradigm can be classified into three service models: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) [2]. In cloud environments, service providers manage large-scale heterogeneous virtual machines (VMs) to serve users' requests. The ultimate goal of these service providers is to increase their revenues while satisfying the users' QoS requirements defined in SLAs, such as deadline [3], makespan, data security, and reliability [4]. From the customers' perspective, the cloud model is scalable and cost-effective, because customers can access resources on demand and pay only for their actual usage, without upfront costs.

Due to its various benefits, cloud computing has been increasingly adopted in many areas, such as banking, e-commerce, the retail industry, and academia [5–7]. Notably, applications in these fields usually comprise many interrelated computing and data-transfer tasks [8]. Due to the precedence constraints between tasks in these applications, a large number of idle time slots will be left on VMs, which often leads to low resource utilization for cloud applications [9]. Moreover, low resource usage in cloud applications wastes tremendous cost: a 3% improvement in resource usage for large companies can translate into over a million dollars in cost savings [10].

One key component of a cloud platform is the scheduler, which bridges users' applications and the computing resources. The scheduling algorithm in the scheduler is responsible for mapping workflow tasks to VMs and plays an essential role in satisfying applications' requirements while efficiently utilizing system resources.
Take a workflow application such as CyberShake [11] as an example: if all its tasks are scheduled to the cheapest VM, the cost is minimal, but the makespan is seriously delayed. On the other hand, if each of its tasks is scheduled to a VM with the highest configuration, the cost will be very high, yet the makespan may still not be minimized. Efficient scheduling approaches show promising ways to make trade-offs among multiple optimization objectives. Up to now, considerable work has been devoted to scheduling workflows in clouds. However, the majority of these existing scheduling approaches rely on accurate information about task execution times. In reality, task execution times usually cannot be reliably estimated, and the actual values are available only after tasks have been completed. This may be attributed to two reasons: (1) workflow tasks usually contain conditional instructions that behave differently under different inputs [12,13]; (2) the performance of virtual machines in clouds varies over time [14–17].
Motivation Due to the dynamic and uncertain nature of cloud environments, numerous schedule disruptions (e.g., variations in task execution times, arrivals of new workflows) may occur, and the pre-computed baseline schedule may not be executed strictly or remain as effective as expected during real execution. For instance, if the worst-case execution time of workflow tasks is used in the schedule, many idle time slots will be left on the VMs, leading to wasted resources. On the other hand, if the time reserved for workflow tasks is too short, their real finish times will exceed their expected finish times, which may recursively delay other tasks. Unfortunately, the majority of existing studies ignore these dynamic and uncertain factors, which may leave a large gap between the real execution behavior and the behavior initially expected. To address this issue, we study how to control the impact of uncertainties on scheduling results, and how to improve the resource utilization of VMs and reduce the cost for cloud providers via resource sharing among multiple workflows, while guaranteeing their timing requirements.

Contributions The key contributions of this work are:

• An uncertainty-aware architecture for scheduling dynamic workflows in cloud environments. This architecture not only prohibits the propagation of uncertainties, but also enables the overlapping of communications and computations.
• A novel algorithm named EDPRS that incorporates both event-driven and periodic rolling strategies for scheduling dynamic workflows and computing resources under various uncertain factors.
• An experimental evaluation of the proposed EDPRS algorithm based on real-world workflow traces.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on workflow scheduling in distributed computing environments. Section 3 gives an overview of the scheduling architecture and the problem formulation, followed by the scheduling algorithm for dynamic workflows in Sect. 4. In Sect. 5, we conduct extensive experiments to evaluate the performance of the proposed algorithm. Section 6 concludes the paper.
2 Related work

Since workflow scheduling is a well-known NP-complete problem, a great number of heuristics have been proposed to obtain near-optimal solutions. Among them, there are three typical approaches: meta-heuristic-based, list-based, and PCP-based. For instance, Xu et al. incorporated a genetic algorithm (GA) approach to assign priorities to workflow tasks and utilized a heuristic strategy to map tasks to processors [18]. Jing et al. proposed an energy-efficient scheduling algorithm, based on ant colony optimization (ACO), for reconfigurable systems [19]. There also exist many list-based heuristics. For example, Durillo et al. proposed a list-based workflow scheduling heuristic to make trade-offs between makespan and energy consumption [20]. Mei et al. presented a new energy-aware scheduling algorithm for real-time tasks [21]. Further, there is a large body of work on workflow scheduling approaches based on Partial Critical Path (PCP) methods [22]. Zhang et al. used a vectorized ordinal optimization approach to extend the ordinal optimization method from a single objective to multiple objectives [23]. However, the above existing approaches neglected the
uncertainties of task execution times and the dynamic nature of workflow applications in cloud environments.

There also exists some work investigating workflow scheduling strategies in uncertain computing environments. For instance, Tang et al. developed a stochastic heterogeneous-earliest-finish-time scheduling algorithm to minimize the makespan of a workflow [12]. Li et al. proposed a heuristic energy-aware stochastic task scheduling algorithm to reduce energy consumption for heterogeneous computing systems [24]. However, these schemes were designed for a single application and do not fit dynamic cloud environments, where large numbers of workflows are submitted continuously. Unlike existing scheduling schemes, in which the complete schedules for all tasks are generated once a workflow arrives, the approach in this paper continually generates new schedules for workflow tasks during their actual execution, to mitigate the impact of uncertainties on scheduling quality.
3 Modeling and problem formulation

In this section, we first present the models of virtual machines (VMs) and workflows, and then propose an uncertainty-aware scheduling architecture for cloud platforms. Based on this architecture, we formulate the scheduling problem.

3.1 Virtual machine modeling

We define a set $S = \{s_1, s_2, \ldots, s_m\}$ to represent all VM types in the cloud platform, where $m$ is the number of VM types and $s_u \in S$ denotes the $u$-th VM type. The parameter $vm_k^{s_u}$ denotes the $k$-th VM of type $s_u$. Further, VMs are charged per integer number of time periods, and partial utilization of a time period incurs the charge for the whole period. For instance, if the time period is one hour, using a VM for 61 minutes incurs the payment for two hours. The price of VM $vm_k^{s_u}$ is defined as the cost per time period, denoted as $Price(vm_k^{s_u})$. In addition, VMs can be leased and released at any time.

3.2 Workflow modeling

In a cloud platform, workflows are continuously submitted by customers, and they can be characterized by an infinite set $W = \{w_1, w_2, \ldots\}$. A workflow $w_i \in W$ is modeled as $w_i = \{a_i, d_i, G_i\}$, where $a_i$, $d_i$, and $G_i$ represent its arrival time, deadline, and structure, respectively. The structure $G_i$ of workflow $w_i$ can be formally expressed as a directed acyclic graph (DAG), i.e., $G_i = (T_i, E_i)$, where $T_i = \{t_{i1}, t_{i2}, \ldots, t_{iN}\}$ is the set of tasks, $N$ is the task count, and $t_{ij} \in T_i$ represents the $j$-th task of workflow $w_i$. Additionally, $E_i \subseteq T_i \times T_i$ is the set of directed arcs between tasks. An edge $e_{pj}^{i} \in E_i$ of the form $(t_{ip}, t_{ij})$ exists if there is a precedence constraint between task $t_{ip}$ and task $t_{ij}$, where $t_{ip}$ is an immediate predecessor of $t_{ij}$ and $t_{ij}$ is an immediate successor of $t_{ip}$. In addition, $pred(t_{ij})$ and $succ(t_{ij})$ denote the sets of immediate predecessors and successors of task $t_{ij}$, respectively.
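To make the models above concrete, the following Python sketch (illustrative only; the class and function names are ours, not part of the formal model) shows the per-period billing rule and a minimal DAG workflow holder from which the pred/succ sets are derived:

```python
import math
from collections import defaultdict

def billing_cost(minutes_used, period_minutes, price_per_period):
    """VMs are charged per integer number of time periods: partial
    utilization of a period is billed as a whole period."""
    if minutes_used <= 0:
        return 0
    return math.ceil(minutes_used / period_minutes) * price_per_period

class Workflow:
    """A workflow w_i = (arrival a_i, deadline d_i, DAG G_i); the DAG is
    given as an edge list, from which pred()/succ() sets are derived."""
    def __init__(self, arrival, deadline, tasks, edges):
        self.arrival = arrival
        self.deadline = deadline
        self.tasks = list(tasks)
        self.pred = defaultdict(set)
        self.succ = defaultdict(set)
        for p, j in edges:          # edge (t_p, t_j): t_p precedes t_j
            self.pred[j].add(p)
            self.succ[p].add(j)
```

With a one-hour period, `billing_cost(61, 60, price)` charges two full periods, matching the 61-minute example above.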
It is worth noting that the main difference between uncertain scheduling and deterministic scheduling is whether the task execution times and inter-task communication times are random or deterministic. In this paper, task execution times are assumed to follow normal distributions, which is reasonable in most real-world scenarios [12,25,26].

3.3 Scheduling architecture

In this paper, we design an uncertainty-aware scheduling architecture for a cloud platform, as shown in Fig. 1.

Fig. 1 Uncertainty-aware scheduling architecture for a cloud platform

The platform consists of three layers: the user layer, the scheduling layer, and the resource layer. In cloud environments, users dynamically submit their applications to the cloud provider. The scheduling layer is responsible for generating task-to-VM mappings according to certain objectives and the expected resource performance. The resource layer consists of large-scale VMs of many types, and the number of VMs of each type can be scaled up or down dynamically.

As the scheduling algorithm is our primary concern here, we focus on the scheduling layer, which consists of a task pool (TP), a schedulability analyzer, a resource controller, and a task controller. The TP accommodates most of the waiting tasks, and the schedulability analyzer is responsible for producing the mapping of waiting tasks in the TP to VMs and the plan for scaling the computing resources up or down; the plan specifies when and which VMs to add or delete. Based on this plan, the resource controller dynamically adjusts the computing resources in the system. In addition, the task controller dynamically dispatches waiting tasks from the TP to the corresponding VMs according to the task-to-VM mappings.

The novel features of this scheduling architecture are that most waiting tasks wait in the TP instead of waiting on the VMs directly, and at most one task is allowed to wait on each VM. The benefits of this architecture are summarized as follows.

• It can prohibit the propagation of uncertainties. Task execution times are uncertain before completion and are assumed to follow normal distributions, so the variance of a task's execution time reflects its uncertainty: the greater the variance, the greater the uncertainty. This variance is also reflected in the task's finish time. When calculating a task's finish time, the variances of all its unfinished predecessors and of the tasks queued before it accumulate into its finish time; this accumulation is the embodiment of uncertainty propagation. Since only scheduled tasks are allowed to wait on VMs in this architecture, the uncertainty of an executing task can only transfer to the single waiting task on the same VM. Once the executing task completes, its uncertainty no longer exists, and the subsequent waiting tasks on that VM are not affected by the finished task. Thus, this architecture prohibits the propagation of uncertainties.
• It enables the overlapping of communications and computations. When a VM is executing a task and its waiting slot is empty, the VM can simultaneously receive a new task as its waiting task. In this way, communications and computations are efficiently overlapped to save time.

3.4 Problem formulation

The assignment variable $x_{ij,k}$ reflects the mapping between tasks and VMs: $x_{ij,k}$ is 1 if task $t_{ij}$ is mapped to VM $vm_k^{s_u}$, and 0 otherwise, i.e.,

$$x_{ij,k} = \begin{cases} 1, & \text{if } t_{ij} \text{ is assigned to } vm_k^{s_u}, \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$
In the schedule of multiple workflows, the execution time of a task is assumed to follow a normal distribution, and we use its $\alpha$ quantile to approximate it. The symbol $et_{ij,k}^{\alpha}$ denotes the $\alpha$ quantile of the execution time of task $t_{ij}$ on VM $vm_k^{s_u}$. In addition, $pst_{ij,k}$ and $pft_{ij,k}$ denote the predicted start time and the predicted finish time of task $t_{ij}$ on VM $vm_k^{s_u}$, respectively. Since the execution of task $t_{ij}$ can begin once it has received all the data from its predecessors, the predicted start time $pst_{ij,k}$ is calculated as follows:

$$pst_{ij,k} = \max\Big\{ pft_{lh,k},\ \max_{t_{ip} \in pred(t_{ij})} \big\{ pft_{ip,r(t_{ip})} + tt_{pj}^{i} \big\} \Big\}, \quad (2)$$

where $pft_{lh,k}$ represents the predicted finish time of task $t_{lh}$, the task currently last on VM $vm_k^{s_u}$; $r(t_{ip})$ represents the index of the VM to which task $t_{ip}$ is assigned; and $tt_{pj}^{i}$ represents the transfer time of edge $e_{pj}^{i}$. The predicted finish time $pft_{ij,k}$ then follows as

$$pft_{ij,k} = pst_{ij,k} + et_{ij,k}^{\alpha}. \quad (3)$$
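Equations (2) and (3) can be sketched directly in Python using the standard library's `statistics.NormalDist` for the $\alpha$ quantile (a minimal illustration under our own naming; the inputs are assumed to be already-computed predicted finish times):

```python
from statistics import NormalDist

def quantile_exec_time(mu, sigma, alpha=0.9):
    """alpha-quantile of a normally distributed execution time
    N(mu, sigma^2), used as a conservative estimate of the duration."""
    return NormalDist(mu, sigma).inv_cdf(alpha)

def predicted_times(vm_last_pft, pred_pft_plus_tt, mu, sigma, alpha=0.9):
    """Predicted start/finish times following Eqs. (2)-(3): the task
    starts after both the VM's currently last task and the latest
    arriving predecessor data; it finishes after the alpha-quantile
    duration. pred_pft_plus_tt is the list of pft + transfer time over
    the task's predecessors."""
    pst = max([vm_last_pft] + pred_pft_plus_tt)
    pft = pst + quantile_exec_time(mu, sigma, alpha)
    return pst, pft
```

At $\alpha = 0.5$ the quantile reduces to the mean; larger $\alpha$ reserves more slack against the execution-time uncertainty.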
Since the execution times of all workflow tasks are uncertain, their start times, execution times, and finish times can be determined only after the tasks have completed. The symbols $rst_{ij,k}$ and $rft_{ij,k}$ denote the real start time and the real finish time of task $t_{ij}$ on VM $vm_k^{s_u}$, respectively. Due to the precedence constraints within a workflow, we have the following constraint:

$$rft_{ip,k} + tt_{pj}^{i} \le rst_{ij,k}, \quad \forall e_{pj}^{i} \in E_i. \quad (4)$$

Under uncertain scheduling environments, it is the actual finish time $rft_{ij,k}$ that determines whether a workflow's timing requirement has been guaranteed. So we have the following constraint:

$$\max_{t_{ij} \in T_i} \{ rft_{ij,k} \} \le d_i, \quad \forall w_i \in W, \quad (5)$$

where $\max_{t_{ij} \in T_i} \{ rft_{ij} \}$ represents the actual finish time of workflow $w_i$. Subject to the constraints specified in formulas (4) and (5), the primary optimization objective is to minimize the total cost of executing the workflow set $W$, which is represented as follows.
$$\text{Minimize} \quad \sum_{k=1}^{|VM|} Price(vm_k^{s_u}) \cdot tp_k, \quad (6)$$
where $|VM|$ denotes the total number of VMs used to execute the workflow set, and $tp_k$ is the number of working time periods of VM $vm_k^{s_u}$. Further, resource utilization is an important metric for evaluating the performance of a cloud platform, so in this paper we also aim to maximize the average resource utilization of the VMs, represented as follows.
$$\text{Maximize} \quad \sum_{k=1}^{|VM|} wt_k \Big/ \sum_{k=1}^{|VM|} tt_k, \quad (7)$$
where $wt_k$ and $tt_k$ represent the working time and the total active time (including working and idle time) of VM $vm_k$ with template $s_i$ during an experiment. Another objective to be optimized under uncertain computing environments is to minimize the following deviation function [27]:

$$\text{Minimize} \quad \frac{1}{m} \sum_{i=1}^{m} w_i \Big| \max_{t_{ij} \in T_i} \{ pft_{ij,k} \} - \max_{t_{ij} \in T_i} \{ rft_{ij} \} \Big|, \quad (8)$$

where $w_i$ represents the marginal cost of the time deviation between the predicted and actual finish times, and $\max_{t_{ij} \in T_i} \{ pft_{ij} \}$ and $\max_{t_{ij} \in T_i} \{ rft_{ij} \}$ represent the predicted and real finish times of workflow $w_i$, respectively.
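The three objectives (6)–(8) reduce to simple aggregations once the schedule outcome is known. The following sketch (our illustration; dictionary-based inputs keyed by VM or workflow identifiers are an assumption, not the paper's data structures) computes each metric:

```python
def total_cost(vm_periods, prices):
    """Objective (6): sum over VMs of per-period price times the number
    of billed working periods."""
    return sum(prices[k] * vm_periods[k] for k in vm_periods)

def avg_utilization(working, active):
    """Objective (7): total working time over total active time
    (working + idle) across all VMs."""
    return sum(working.values()) / sum(active.values())

def mean_deviation(predicted, actual, weights):
    """Objective (8): weighted mean absolute deviation between the
    predicted and real finish times of the m workflows."""
    m = len(predicted)
    return sum(weights[i] * abs(predicted[i] - actual[i])
               for i in predicted) / m
```

Minimizing (8) alongside (6) penalizes schedules whose predictions drift far from the realized finish times, which is exactly the effect of the uncertain execution times.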
4 Algorithm design

It is widely known that workflow scheduling is an NP-complete problem, and finding the optimal schedule is infeasible within acceptable time. In this paper, we propose a heuristic that incorporates both event-driven and periodic rolling strategies to obtain sub-optimal solutions. The events refer to the arrival of new workflows.

4.1 Ranking tasks

An important issue in workflow scheduling is how to rank the tasks. In this paper, all tasks in the TP are ranked by their predicted latest finish time $plft_{ij}$. The $plft_{ij}$ of a task $t_{ij}$ is the latest time before which the task must complete its execution such that the predicted finish time of workflow $w_i$ remains no later than its deadline $d_i$.

Definition 1 The $plft_{ij}$ of task $t_{ij}$ is recursively defined as follows:

$$plft_{ij} = \begin{cases} d_i, & \text{if } succ(t_{ij}) = \emptyset, \\ \min_{t_{is} \in succ(t_{ij})} \{ plft_{is} - met_{is}^{\alpha} - mtt_{js}^{i} \}, & \text{otherwise,} \end{cases} \quad (9)$$

where $met_{is}^{\alpha}$ represents the minimum $\alpha$-quantile of $t_{is}$'s execution time over all VMs, and $mtt_{js}^{i}$ denotes the minimal data transfer time between $t_{ij}$ and $t_{is}$, calculated using the maximum bandwidth among VMs in the cloud platform.

Definition 2 The predicted earliest start time $pest_{ij}$ of task $t_{ij}$ is recursively defined as follows:

$$pest_{ij} = \begin{cases} a_i, & \text{if } pred(t_{ij}) = \emptyset, \\ \max_{t_{ip} \in pred(t_{ij})} \{ pest_{ip} + met_{ip}^{\alpha} + mtt_{pj}^{i} \}, & \text{otherwise,} \end{cases} \quad (10)$$

where $pest_{ip}$ represents the predicted earliest start time of task $t_{ip}$.

4.2 Scheduling algorithm

To facilitate the presentation of the scheduling strategies, we first give several rules.

Rule 1 Each virtual machine executes only one task at any time instant. This rule leverages virtualization technology to isolate the operating environments of different tasks and avoid resource contention between tasks. However, it may waste computing resources if a single workflow task cannot utilize the full capacity of a VM.

Rule 2 The waiting task on a VM is allowed to start as soon as the executing task on the same VM and all its predecessors have finished. This rule simplifies scheduling: with it, a schedule only needs to specify the mapping between tasks and VMs and the ordering of tasks on each VM.
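Definitions 1 and 2 are straightforward to compute by memoized recursion over the DAG: Eq. (9) from the exit tasks backward, Eq. (10) from the entry tasks forward. A sketch (our illustration; the dictionary-based inputs are assumptions, not the paper's data structures):

```python
def compute_plft(tasks, succ, deadline, met_alpha, mtt):
    """Predicted latest finish time plft (Eq. 9) for every task.
    succ: task -> set of successors; met_alpha: task -> minimal
    alpha-quantile execution time; mtt: (j, s) -> minimal transfer time."""
    memo = {}
    def plft(j):
        if j not in memo:
            ss = succ.get(j, ())
            memo[j] = deadline if not ss else min(
                plft(s) - met_alpha[s] - mtt[(j, s)] for s in ss)
        return memo[j]
    return {j: plft(j) for j in tasks}

def compute_pest(tasks, pred, arrival, met_alpha, mtt):
    """Predicted earliest start time pest (Eq. 10): the mirror
    recursion from entry tasks toward exit tasks."""
    memo = {}
    def pest(j):
        if j not in memo:
            ps = pred.get(j, ())
            memo[j] = arrival if not ps else max(
                pest(p) + met_alpha[p] + mtt[(p, j)] for p in ps)
        return memo[j]
    return {j: pest(j) for j in tasks}
```

For a two-task chain a → b with deadline 100, execution estimates 10 and 20, and transfer time 5, this yields plft(a) = 75 and pest(b) = 15, consistent with Eqs. (9) and (10).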
Rule 3 The parameter $T$ is the scheduling period; that is, every time period $T$, a portion of the workflow tasks in the TP is scheduled onto VMs. This rule reduces the frequency of triggering reactive strategies.

Definition 3 Ready task: a task is ready if it has no predecessors, i.e., $pred(t_{ij}) = \emptyset$, or all of its predecessors have been scheduled.

In traditional scheduling approaches, once a new workflow arrives, all its tasks are mapped and dispatched immediately to the resources' local queues. Unlike these approaches [4,12,23,28], ours continually generates new mappings for the waiting tasks in the TP over the actual execution of the cloud platform. EDPRS performs the following operations when a new workflow arrives, as shown in Algorithm 1.

Algorithm 1 EDPRS—on the arrival of new workflows
1: taskPool ← NULL;
2: for each new workflow $w_i$ that arrives do
3:   Calculate $plft_{ij}$ and $pest_{ij}$ for each task in $w_i$ as in formulas (9) and (10).
4:   taskList ← NULL;
5:   for each task $t_{ij} \in w_i$ do
6:     if $pest_{ij} - ct$
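Algorithm 1 is truncated in the source, but two of its ingredients are fully specified by the text: filtering the task pool down to ready tasks (Definition 3) and ranking them by $plft$ (Sect. 4.1). A minimal sketch of these two steps (the function name and dictionary inputs are ours, not the authors'):

```python
def rank_ready_tasks(task_pool, pred, scheduled, plft):
    """Filter the task pool to ready tasks (Definition 3: no
    predecessors, or all predecessors already scheduled) and rank
    them by predicted latest finish time (smaller plft = more urgent)."""
    ready = [t for t in task_pool
             if all(p in scheduled for p in pred.get(t, ()))]
    return sorted(ready, key=lambda t: plft[t])
```

Such a routine would be invoked both on workflow-arrival events and at every periodic rolling tick of length $T$ (Rule 3).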