An iterative expanding and shrinking process for processor allocation ...

6 downloads 4537 Views 3MB Size Report
Kuo-Chan HuangEmail author; Wei-Ya Wu; Feng-Jian Wang; Hsiao-Ching Liu ... Workflow scheduling Mixed parallelism Moldable task Processor allocation.
Huang et al. SpringerPlus (2016) 5:1138 DOI 10.1186/s40064-016-2808-y

Open Access

RESEARCH

An iterative expanding and shrinking process for processor allocation in mixed‑parallel workflow scheduling Kuo‑Chan Huang2*, Wei‑Ya Wu1, Feng‑Jian Wang1, Hsiao‑Ching Liu2 and Chun‑Hao Hung2 *Correspondence: [email protected] 2 Department of Computer Science, National Taichung University of Education, No. 140, Min‑Shen Road, Taichung, Taiwan Full list of author information is available at the end of the article

Abstract  Parallel computation has been widely applied in a variety of large-scale scientific and engineering applications. Many studies indicate that exploiting both task and data parallelisms, i.e. mixed-parallel workflows, to solve large computational problems can get better efficacy compared with either pure task parallelism or pure data parallelism. Scheduling traditional workflows of pure task parallelism on parallel systems has long been known to be an NP-complete problem. Mixed-parallel workflow scheduling has to deal with an additional challenging issue of processor allocation. In this paper, we explore the processor allocation issue in scheduling mixed-parallel workflows of mold‑ able tasks, called M-task, and propose an Iterative Allocation Expanding and Shrinking (IAES) approach. Compared to previous approaches, our IAES has two distinguishing features. The first is allocating more processors to the tasks on allocated critical paths for effectively reducing the makespan of workflow execution. The second is allowing the processor allocation of an M-task to shrink during the iterative procedure, resulting in a more flexible and effective process for finding better allocation. The proposed IAES approach has been evaluated with a series of simulation experiments and compared to several well-known previous methods, including CPR, CPA, MCPA, and MCPA2. The experimental results indicate that our IAES approach outperforms those previous methods significantly in most situations, especially when nodes of the same layer in a workflow might have unequal workloads. Keywords:  Workflow scheduling, Mixed parallelism, Moldable task, Processor allocation

Background Parallel processing (Konstantopoulos 2015) has been applied to many computation demanding applications, especially a variety of large-scale scientific and engineering applications (Feitelson et  al. 1997). In general, parallelism inherent in applications can be broadly divided into two types: data parallelism and task parallelism. For applications with data parallelism, usually a single program is executed on several processors simultaneously and each processor is responsible for processing a specific portion of data. Many tools and programming libraries have been developed to aid writing parallel programs with data parallelism, such as MPI (Quinn 2008), OpenMP (Chapman and Jost 2007), and OpenCL (Munshi et al. 2011). The computational structure of an application © 2016 The Author(s). This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Huang et al. SpringerPlus (2016) 5:1138

with task parallelism usually can be represented by a Directed-Acyclic-Graph (DAG) (Topcuoglu et  al. 2002; Ramaswamy et  al. 1997) based task dependency graph, commonly called a workflow, and looks like Fig. 1. Each node represents a task which usually executes a specific program. The number next to each node indicates the computation workload of the task. Based on the computation workload and processor speed, the required execution time of a task on a processor can be derived. The edges represent the dependence between tasks and the number next to an edge means the amount of data to transfer between two tasks. The required data transmission time depends on the amount of data and the communication bandwidth between the processors running the two tasks. A scheduler has to schedule and allocate each task according to the dependence specified in the workflow. Scheduling is an important and challenging research field (Severino et  al. 2014; Amirghasemi and Zamani 2014), and scheduling such kind of workflows on parallel systems has long been known to be a NP-complete problem (Pinedo 2008). Therefore, many heuristic methods have been proposed to produce efficient schedules within a reasonable time period (Topcuoglu et  al. 2002; Ramaswamy et al. 1997; Radulescu et al. 2001; Radulescu and van Gemund 2001; Bansal et al. 2006; N’Takṕe et al. 2007; Yu and Shi 2009). As applications become even more complex and computation demanding, recently many studies indicate that exploiting both task and data parallelism can be a promising approach to getting better efficacy compared with either pure task parallelism or pure data parallelism models (Hsu et al. 2011). The computational structure exploiting both task and data parallelism is sometimes called a mixed-parallel model (N’Takṕe et  al. 2007), which means that each node in Fig.  1 can itself be a parallel program exploiting data parallelism (Feitelson et al. 1997). Scheduling mixed-parallel workflows is more complicated than dealing with simple task-parallel workflows since each task might require more than one processor for execution, and therefore the resource fragmentation issue in scheduling data-parallel jobs also has to be considered for producing efficient schedules (Hsu et al. 2011). There is a particular class of mixed-parallel workflows where each data-parallel task in a workflow is moldable (Feitelson et al. 1997). A moldable job is a kind of data-parallel jobs which can be executed with an arbitrary number of processors depending on resource availability (Feitelson et  al. 1997). Such moldable jobs in the mixed-parallel

Fig. 1  Task parallelism represented by a workflow

Page 2 of 24

Huang et al. SpringerPlus (2016) 5:1138

workflows are called M-task in the literature (Radulescu et al. 2001; Radulescu and van Gemund 2001). Scheduling mixed-parallel workflows of M-tasks is even more challenging because it usually involves two different kinds of activities, allocation and mapping, where the allocation activities are not needed in scheduling other types of workflows. The allocation activities are for determining an appropriate amount of processors to be allocated for each M-task. The mapping activities regards mapping each M-task onto the processors in a parallel system to form a temporal and spatial schedule of the entire mixed-parallel workflow. This paper aims at developing an effective processor allocation approach for M-tasks in order to improve the overall execution performance of mixed-parallel workflows. In general, the goal of processor allocation for M-tasks is concerned about critical path reduction and allocation fragmentation avoidance. Most of previous approaches adjust the allocation of each M-task in a monotonically increasing manner until a predefined scheduling criterion is reached in the iterative process. In this paper, we propose an Iterative Allocation Expanding and Shrinking (IAES) approach to dealing with the above two concerns. IAES has two distinct features compared to existing methods. The first one is that IAES allows the allocation of an M-task to shrink during the iterative procedure, leading to a more flexible and effective processor allocation process. Secondly, IAES adopts a more accurate mechanism based on the temporarily scheduled Earliest-Start-Time (EST) and Earliest-Finish-Time (EFT) of each M-task to avoid possible processor allocation fragmentation. Based on these two features, IAES has potential to outperform existing methods. The proposed IAES approach has been evaluated with a series of simulation experiments using both workflow structures of real world applications and synthetic workflows generated by the widely used approach in (Topcuoglu et al. 2002). The performance results demonstrate that IAES outperforms existing methods in most situations in terms of average makespan and average SLR. The remainder of this paper is organized as follows. Section two discusses “Related work” on workflow scheduling, including task-parallel and mixed-parallel workflows. Section “Processor allocation for M-tasks in mixed-parallel workflows” presents our IAES approach and illustrates how it could outperform existing methods. Section “Performance evaluation and discussion” presents the experimental results and discussions. Section “Conclusions and future work” concludes the paper.

Related work Most previous research works on workflow scheduling deal with task-parallel workflows, where each task in a workflow is a serial job requiring only one processor for execution. The taxonomy proposed in (Yu et al. 2010) classifies such workflow scheduling algorithms into two groups: heuristics-based and meta-heuristics-based, and further, heuristics-based scheduling algorithms fall into several categories, including (1) immediate task scheduling, (2) list-based scheduling, (3) cluster-based scheduling, and (4) duplication-based scheduling. Immediate task scheduling is the simplest heuristic for workflow applications. It makes schedule decisions based on the availability of tasks only. One typical example is the Myopic algorithm (Sakellariou et  al. 2005), which has been implemented in some Grid systems such as Condor DAGMan (Tannenbaum et  al. 2002). A list-based

Page 3 of 24

Huang et al. SpringerPlus (2016) 5:1138

scheduling algorithm comprises two phases: the task prioritizing phase and the resource selection phase. The task prioritizing phase sets the priority of each task and generates a scheduling list by sorting the tasks according to their priorities. Then, the resource selection phase picks tasks from the list in order and maps each task to a most appropriate resource for it. List-based heuristics (Topcuoglu et al. 2002; Kwok and Ahmad 1996; Wu and Gajski 1990) received the most attention because of their simplicity and flexibility. For example, HEFT (Topcuoglu et al. 2002) is a well-known list-based workflow scheduling algorithm for heterogeneous environments. It first traverses a workflow from the exit node to the entry node in order to calculate an upward rank value for each task. The tasks are then sorted in non-ascending order of their ranks. According to the order, each task is assigned to the resource that minimizes its Earliest Finish Time (EFT). Many heuristics have been developed based on HEFT (Yu and Shi 2009; Bittencourt et al. 2010; Ghanem et al. 2010). Both cluster-based heuristics and duplication-based heuristics are designed to reduce the communication costs between interdependent tasks (Yang and Gerasoulis 1994; Darbha and Agrawal 1998; Park et  al. 1997; Bajaj and Agrawal 2004). In cluster-based heuristics, several tasks with data dependency are put into the same group (cluster) first, and then are assigned onto the same resource for communication cost reduction. On the other hand, duplicated-based heuristics try to reduce the communication cost for a task to transmit data to the resource of its succeeding task(s) through duplicating the task on the destination processors. Duplication-based heuristics were shown potential to achieve good performance when scheduling a single workflow (Park et al. 1997). However, they might not be appropriate when scheduling multiple concurrent workflows since task duplication in a workflow would consume extra computation resources and thus degrade the performance of other workflows. The meta-heuristics-based approaches provide both a general structure and strategy guidelines for developing a heuristic to fit a particular kind of problem. Meta-heuristics-based algorithms, generally applied to large and complicated problems, provide an efficient way of moving quickly toward a very good solution, although not optimal. There are in general three kinds of meta-heuristics-based approaches based on Greedy Randomized Adaptive Search Procedure (GRASP) (Resende and Ribeiro 2002), Genetic Algorithm (Singh and Youssef 1996), and Simulated Annealing (YarKhan and Dongarra 2002). There are comparisons (Tannenbaum et al. 2002; Blythe et al. 2005) between the heuristics-based approaches and meta-heuristics-based approaches. The comparison shows that meta-heuristics-based approaches usually perform better than heuristicsbased approaches, since a meta-heuristics-based method has more chance to approach the globally optimal solution than heuristics-based methods. However, the scheduling time in meta-heuristics-based algorithms is significantly higher than heuristicsbased algorithms, and the time complexity of the meta-heuristics based algorithms grows more rapidly than that of the heuristics-based algorithms if the size of workflows become larger. As workflow applications become more complex and computation-demanding, mixed-parallel workflow computing (N’Takṕe et  al. 2007) becomes a promising and important computing model where each task in a workflow might be a data-parallel program requiring multiple processors for execution. Many studies have shown that

Page 4 of 24

Huang et al. SpringerPlus (2016) 5:1138

mixed-parallel computation achieves better performance compared to either pure data parallelism or pure task parallelism (Ramaswamy et al. 1997; Radulescu et al. 2001; Radulescu and van Gemund 2001; Hunold 2010). According to (Feitelson et al. 1997) dataparallel jobs usually can be classified into four categories: rigid, moldable, malleable, and evolving. The work on mixed-parallel workflow scheduling in (Hsu et al. 2011) deals with the case that each data-parallel task within the workflow is rigid which means that each data-parallel task comes with a pre-specified number of processors to use and the scheduler has to allocate exactly that amount of processors to the task. On the other hand, in Radulescu et  al. (2001), Radulescu and van Gemund (2001), N’Takṕe et  al. (2007) and some other research works, the data-parallel tasks are assumed to be moldable and the focus is on how to determine a most appropriate number of processors to use for each moldable task, M-task, within a mixed-parallel workflow. This is also the research issue to be dealt with in this paper. For mixed-parallel workflows of M-tasks, according to how allocation and mapping activities are arranged during the scheduling process, existing scheduling approaches in the literature can be broadly divided into two categories: one step and two steps. Onestep approaches produce the schedule in an iterative manner. Each scheduling iteration consists of two steps where the first step adjusts the allocation of each M-task and the second step maps all M-tasks onto processors to check whether an improved schedule is achieved or not. The feedback of the second step will then guide the next iteration’s first step. A typical example of one-step approaches is CPR (Radulescu et al. 2001), which is a greedy iterative algorithm. At first, the algorithm assigns one processor for each M-task and computes the resultant makespan based on the list-scheduling approach. Then, an iterative procedure is applied to increase the number of assigned processors for each M-task until the entire workflow’s makespan cannot be improved further. To reduce scheduling overhead, many two-step approaches have been proposed in the literature, such as TSAS (Ramaswamy et  al. 1997), CPA (Radulescu and van Gemund 2001), MCPA (Bansal et al. 2006), and MCPA2 (Hunold 2010). In two-step approaches, the iterative process is only applied to the allocation step, which determines the most appropriate allocation of each M-task simply based on the static structural property of the workflow to be scheduled. Then, the mapping step decides the spatial and temporal assignment of each M-task onto the parallel computing platforms to produce the workflow execution schedule according to the allocation result in the first step. CPA (Radulescu and van Gemund 2001) is one of the most famous two-step algorithms. Many later two-step algorithms were developed based on its critical path strategy, such as MCPA (Bansal et al. 2006), MCPA2 (Hunold 2010). They differ in how to decide the allocation limit of each M-task.

Processor allocation for M‑tasks in mixed‑parallel workflows In this section, we explore the issues of processor allocation for M-tasks when scheduling mixed-parallel workflows, discuss the pros and cons of previous methods, and then propose a new Iterative Allocation Expanding and Shrinking (IAES) approach to the processor allocation problem. We use an example mixed-parallel workflow, shown in Fig. 2, to illustrate the characteristics of each method and demonstrate the superiority of our IAES approach. Each node in Fig. 2 represents an M-task with its ID shown in

Page 5 of 24

Huang et al. SpringerPlus (2016) 5:1138

Fig. 2  An example mixed-parallel workflow

the circle, and the number next to each node is the computation workload of the corresponding M-task. Workflow model

As in most of the literatures (Topcuoglu et  al. 2002; Ramaswamy et  al. 1997; Radulescu et  al. 2001; Radulescu and van Gemund 2001; Bansal et  al. 2006), we assume that a mixed-parallel workflow application of moldable jobs can be modeled as a Directed Acyclic Graph (DAG), e.g. Figure  2, to represent the constituent tasks and their execution order. The DAG is defined as a pair (V, E), where V and E are finite sets. V = {ti |i = 1, . . . , n} denotes the set of n nodes representing the constituent data-parallel tasks, each of which is a moldable job (Feitelson et al. 1997) and can be executed with an arbitrary number of processors depending on resource availability. E denotes the set of edges {ei,j |1 ≤ i, j ≤ n} where ei,j, is an arc from ti to tj, representing that tj can only be executed after ti finishes its computation due to the control or data dependency between them. ti is thus usually called the parent of tj. A task without ancestor is called an entry task and a task without any descendant is an exit task. It is assumed that there is only one entry task and one exit task in a workflow application. Each node in the task graph is called an M-task (Radulescu and van Gemund 2001) since it is moldable and can run with an arbitrary number of processors. Each node is annotated with the computation workload of the corresponding M-task. The required computation time of an M-task with a specific number of processors can be obtained either by user estimation or by applications’ speedup models (Ramaswamy et al. 1997; Rauber and Rünger 1998). In our study, the execution time of an M-task with different number of processors is calculated by Amdahl’s law (Kleinrock and Huang 1992), and the fraction of workload that must be executed serially within an M-task is assumed to be 0.2. A task can be executed only when it receives all the required data from its parents. The data transfer between two tasks incurs a communication cost that depends on network capabilities. In traditional research works on task-parallel workflows (Prasanna et  al. 1994; Kwok and Ahmad 1996; Wu and Gajski 1990), the communication cost between two tasks is assumed to be negligible if these two tasks are allocated on the same processor. Therefore, reducing inter-task communication costs becomes an

Page 6 of 24

Huang et al. SpringerPlus (2016) 5:1138

important part when scheduling task-parallel workflows. However, for mixed-parallel workflows, since each M-task might use a different number of processors for execution, there is always data communication or redistribution costs between two connected tasks. Therefore, in this paper we focus on the processor allocation issues of M-tasks and ignore the data communication costs. Common notations and terms used in M‑task allocation algorithms

Before elaborating on the M-task allocation methods, we first introduce several key notations and terms (Sinnen 2007) as follows, which will be used in describing the M-task allocation algorithms. ••  P The number of processors in a parallel computing system. ••  Schedule A schedule determines the spatial and temporal assignment of tasks in a DAG to processors. A schedule is usually generated by a specific scheduling algorithm on a specific number of processors. ••  np(t) The number of processors allocated to task t. ••  Tw (t, np(t)) The computation cost of a node t, representing the required computation time of the corresponding M-task with np(t) processors. ••  Path length The length of a path is the summation of the computation cost of each node on the path. Since we don’t consider data communication costs in the study as explained in the previous section, the path length defined here excludes the communication costs between nodes on the path. ••  Allocated path length Based on a schedule, the allocated path length is defined to be the finish time of the last node on the path subtracted by the start time of the first node on the path. ••  tl(n) The top level of a node n in a DAG, which is the length of the longest path ending in n, but excludes the computation cost of n. ••  bl(n) The bottom level of a node n in a DAG, which is the length of the longest path starting with n. ••  Schedule length The length of a schedule is the finish time of the exit task on it, assuming the entry task starts at time zero. ••  Critical path It is a longest path in a DAG. The critical path gains its importance for workflow scheduling from the fact that its length is a lower bound for the schedule length. ••  Allocated critical path The path of the longest allocated path length in a schedule. ••  Critical tasks The nodes on critical paths or allocated critical paths, which are of particular importance in the following M-task allocation methods. ••  MLS M-task list scheduling, which is a procedure applying simple list scheduling to produce the execution schedule of a workflow on a parallel system of a specific number of processors after the number of allocated processors for each M-task is known (Radulescu et al. 2001). This procedure can provide the estimated execution time, i.e. makespan, of a workflow. ••  Makespan The total execution time for a workflow application. It is used to measure the performance of a scheduling algorithm from the perspective of workflow applications.

Page 7 of 24

Huang et al. SpringerPlus (2016) 5:1138

Page 8 of 24

However, makespan usually varies widely among workflows with different sizes and other properties. ••  Schedule Length Ratio (SLR) The ratio of a workflow’s makespan over the length of its critical path. SLR tries to measure the performance of scheduling algorithms regardless of the variation in workflow’s size. In the experiments, the length of the critical path is calculated by assuming each M-task runs with only one processor. Previous methods

This section presents several most well-known processor allocation methods for M-tasks in mixed-parallel workflows and discusses their pros and cons. CPA

One of the most famous methods for scheduling mixed-parallel workflows of M-tasks is the Critical Path and Allocation (CPA) algorithm (Radulescu and van Gemund 2001). It continues to increase the number of processors, starting from one, for each task on the critical path while the condition, TCP > TA, holds, where

  TCP = maxt∈V bl(t) and

TA =

1 (Tw (t, np(t)) × np(t)). P t∈V

Both TCP and TA represent theoretical lower bounds for a workflow’s makespan, but characterize two different aspects. TCP is a measure of the dependence paths, that can be shortened by increasing the number of processors for the tasks on the critical paths. On the other hand, TA is a measure of processor utilization, which would become larger when allocating more processors to tasks. The detailed algorithm of CPA is shown in Algorithm 1

CPA is in general efficient. However, since CPA allocates processors to tasks at a per task basis, in many cases, it might lead to unnecessary resource fragmentation and wasting because the total allocated processors of concurrent tasks exceed the system’s capacity. Figure 3 is the schedule generated by CPA for the mixed-parallel workflow in Fig. 2 and shows an example for such situation. As shown in Fig. 2, t1, t2, and t3 are three concurrent tasks at the same level and can be run in parallel to exploit task parallelism. However, in the schedule generated by CPA, as shown in Fig. 3, t1, t2, and t3 do not run in parallel since the total number of processors allocated to these three tasks is more

Huang et al. SpringerPlus (2016) 5:1138

Fig. 3  Schedule generated by CPA

than the available number of processors in the system. Therefore, the potential task parallelism among tasks t1, t2, and t3 is deteriorated which leads to increased makespan of the workflow and reduced resource utilization rate of the system. MCPA

The Modified Critical Path and Area-based (MPCA) algorithm (Bansal et al. 2006) was developed based on improving the processor allocation phase of CPA, which aims to make better processor allocation for data-parallel tasks without sacrificing the essential task parallelism available in the workflow applications. MCPA divides the tasks of a workflow into different layers according to their dependency relationship. Thus, potential task parallelism within a workflow comes from the tasks at the same layer, which can run concurrently. MCPA bounds the number of processors that can be allocated to each layer’s tasks by the system’s capacity. The detailed algorithm of MCPA is shown in Algorithm 2.

Page 9 of 24

Huang et al. SpringerPlus (2016) 5:1138

Figure  4 shows the schedule generated by MCPA for the same mixed-parallel workflow. In contrast to the schedule in Fig. 3, tasks t1, t2, and t3 are now running in parallel, demonstrating MCPA’s advantage of retaining task parallelism among tasks at the same layer (Bansal et al. 2006). However, the makespan in Fig. 4 is worse than that in Fig. 3, indicating the drawback of MCPA (Hunold 2010) that it fails to deliver efficient schedules for irregular workflows where concurrent tasks differ significantly in the computation costs or there are more concurrent tasks than processors in the system. MCPA2

MCPA2 (Hunold 2010) was proposed to overcome the drawbacks of CPA and MCPA. The detailed approach of MPCA2 is shown in the following Algorithm  3. We first define a set of specific notations as follows, which are used in the algorithm description (Hunold 2010). ••  ••  ••  ••  ••  ••  •• 

pl(v): precedence level of node v. DFS_DEPTH(v): depth of node v determined by a depth-first search procedure. prec_alloc(l): number of processors allocated to tasks at precedence layer l. PL: set of precedence levels. prec_p(l): bound for number of processors at layer l. lp(v): nodes in the same precedence layer as v. wr: a scaling factor of P defined by users in order to loosen the restrictions when allocating processors; 0 

Suggest Documents