Scheduling DAGs with Parallel Tasks in Multi-Clusters Based on Parallel Efficiency

Silvio Luiz Stanzani and Líria Matsumoto Sato
Abstract The multi-cluster Grid environment is an effective infrastructure for executing DAGs composed of parallel tasks, but scheduling DAGs in such environments is challenging. This paper details two scheduling strategies: one that may map a parallel task across different clusters and another that confines each task to a single cluster. Both approaches were evaluated with five workloads in a multi-cluster environment.

Keywords DAG scheduling · Parallel task scheduling · Parallel efficiency
1 Introduction

Multi-cluster Grid environments have emerged as an effective infrastructure for supporting large scientific collaborations by means of sharing resources such as computational resources, high-performance networks, disk space and software components [1]. Such environments can provide massively parallel and computationally intensive applications with an amount of resources that would rarely be available in a single cluster. Applications in multi-cluster environments are executed with the support of Grid middleware, which provides a low-level interface for the development of grid
S. L. Stanzani (✉) · L. M. Sato
Escola Politécnica da Universidade de São Paulo (USP), Avenida Prof. Luciano Gualberto, travessa 3, nº 380, CEP 05508-970, São Paulo-SP, Brazil
e-mail: [email protected]
L. M. Sato
e-mail: [email protected]
James J. (Jong Hyuk) Park et al. (eds.), Computer Science and Convergence, Lecture Notes in Electrical Engineering 114, DOI: 10.1007/978-94-007-2792-2_74, © Springer Science+Business Media B.V. 2012
applications. Complex applications consisting of a set of interdependent tasks are executed by Scientific Workflow Management Systems (ScWFMS) [2], which provide an abstraction layer above the grid middleware for the composition of complex applications.

The scheduling of scientific workflows involves mapping tasks to resources [3] under two conditions. The first is the precedence constraint between dependent tasks; the second is that only one task can execute on a cluster node at a time. Finding a schedule that minimizes the execution time of a workflow is an NP-complete problem [4]. The approach to scheduling scientific workflows presented in this paper is based on a two-phase algorithm, comprising prioritization and task scheduling. Parallel efficiency is used to estimate execution time in the prioritization phase.

The remainder of this paper is organized as follows: Sect. 2 presents DAG scheduling and related work. Section 3 presents the computational model. Section 4 presents the simulation setup. Section 5 presents the simulation results. Finally, Sect. 6 presents the conclusion and future work.
2 DAG Scheduling and Related Work

Scientific workflows can be defined as a DAG (Directed Acyclic Graph), in which the vertices represent tasks and the directed edges represent the dependencies between them. DAGs are executed in Grid environments by submitting each DAG task to one grid resource. In a multi-cluster grid, a resource is a cluster comprising a set of nodes managed by an LRM (Local Resource Manager).

A number of DAG scheduling strategies have been developed and implemented in ScWFMS that use file dependencies between tasks as the scheduling criterion. Pegasus [5] executes a scheduling algorithm that clusters a number of tasks on the same resources in order to minimize file transfers between tasks. Gridbus [6] implements an algorithm for scheduling DAG tasks to resources that are close to the data source.

Task parallelism is also used as a criterion for scheduling DAG tasks on the grid. In [7], a static list scheduling algorithm is presented for scheduling parallel tasks on heterogeneous clusters. Mixed parallel tasks represent applications that exhibit parallelism at both the task level and the data level; in [8], a number of algorithms for scheduling DAGs composed of mixed parallel tasks are evaluated. In [9], a bi-criteria algorithm is proposed for scheduling DAGs composed of mixed parallel tasks, aiming to find a schedule that minimizes both the execution time and the quantity of resources allocated to each task.

The scheduling strategy described in this paper aims at demonstrating the viability of executing parallel tasks in multi-cluster grids. In this sense, the strategy considers tasks for which both inter-task file transfers and communication have low costs. The approach used is the immediate mode [10], which performs task scheduling during DAG execution for tasks without dependencies, or whose predecessor tasks have already completed. The objective is to
minimize the execution time of the parallel tasks being executed simultaneously, using parallel efficiency to estimate execution time. The next section details the support for parallel task execution and scheduling in multi-cluster grids.
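As an illustration of the immediate mode described above, the following Python sketch (data structures and names are ours, not from the paper) tracks which DAG tasks become ready as their predecessors complete:

```python
# Minimal sketch of immediate-mode DAG scheduling: a task is "ready"
# when all of its predecessors have completed, and is scheduled as
# soon as it becomes ready.

def ready_tasks(deps, completed):
    """Return tasks whose predecessors are all completed and that
    have not run yet."""
    return [t for t, preds in deps.items()
            if t not in completed and preds <= completed]

# Diamond-shaped DAG: A -> {B, C} -> D
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

completed = set()
while len(completed) < len(deps):
    batch = ready_tasks(deps, completed)   # tasks schedulable right now
    # ... map `batch` to cluster resources here ...
    completed |= set(batch)                # assume the batch finishes
```

In this example the scheduler sees three batches: {A}, then {B, C} (which may run simultaneously), then {D}.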
2.1 Executing and Scheduling Parallel Tasks in Multi-Clusters

MPI is the de facto standard for the development of parallel applications on parallel machines such as clusters and supercomputers. Such machines provide fast and reliable communication links among nodes, optimized for MPI communication libraries, which enables a single application to use a number of resources efficiently. A multi-cluster grid can support the execution of MPI tasks using resources from diverse clusters. Accordingly, grid MPI frameworks such as PACX-MPI [11], MPICH-G2 [12], GridMPI [13] and the MPI Gateway [14] implement mechanisms for transparent inter-cluster communication, enabling a parallel application to be ported to the grid without software code modifications [15].

The heterogeneity of clusters can lead to poor performance without an adequate scheduling strategy, so the scheduling of parallel tasks is essential. Parallel task scheduling is the process of mapping a set of parallel tasks to a set of resources. A parallel task can be rigid, moldable or malleable [16]. Rigid tasks require a fixed quantity of resources; moldable tasks can be executed with any quantity of resources, which is reserved for the task until its completion; and malleable tasks can also be executed with any quantity of resources, but the quantity can vary throughout the execution of the task. A parallel task scheduling algorithm therefore has to assign a quantity of resources to each task and also map the task to resources. In the context of multi-cluster grids, such resources can be deployed in a single cluster or spread across diverse clusters.
3 Computational Model

The computational environment considered in this study is a multi-cluster grid: a set of clusters whose head nodes are interconnected over Internet infrastructure. Each cluster consists of P processors of a given processing capacity, connected by a switched Ethernet. The connection speed among nodes in the same cluster is determined by the cluster switch bandwidth and latency, while the connection speed among nodes from different clusters is the sum of the connections from each node to its local cluster switch, from the local switch to the router, and from the router to the other cluster switch.
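The hop-by-hop communication model above can be sketched as follows (function and parameter names are illustrative, not from the paper); latencies are in milliseconds:

```python
# Sketch of the communication model described above: nodes in the same
# cluster communicate through the local switch only, while nodes in
# different clusters accumulate the cost of every hop on the path
# (local switch -> router -> remote switch).

def link_latency(cluster_a, cluster_b, switch_ms, router_ms):
    """One-way latency in milliseconds between two nodes."""
    if cluster_a == cluster_b:
        return switch_ms                      # single local-switch hop
    return switch_ms + router_ms + switch_ms  # sum of inter-cluster hops
```

With a 100 ms switch and a 50 ms router, intra-cluster messages cost 100 ms while inter-cluster messages cost 250 ms, which is why mapping a tightly coupled task across clusters only pays off when its communication costs are low.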
The application is represented by its precedence constraints in a DAG format. The tasks are defined with the following characteristics: computation size, communication size (the amount of data transferred along with the execution), minimum number of cores (mincores), maximum number of cores (maxcores), and parallel efficiency, which is defined as a function of speedup [17]. The tasks can be sequential, rigid or malleable. Sequential tasks have mincores = 1, rigid tasks have mincores equal to maxcores, and malleable tasks have maxcores greater than mincores. The efficiency Teff represents the fraction of the code that can be parallelized. The task computation size Tcomp is defined in flops and represents the sequential task size, and the task communication size is defined as Tcomm. The computation remaining for a task is a function of its efficiency and the quantity of resources used:

Tcomp(cores) = (Tcomp · Teff) / cores + Tcomp · (1 − Teff)
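The execution-time model above can be written as a small function (a sketch with our own variable names; the expression mirrors the equation, giving the remaining work in flops for a task run on a given number of cores):

```python
# Remaining work of a task under the model above: the parallelizable
# fraction t_eff of the sequential size t_comp is divided among the
# cores, while the rest stays sequential (Amdahl-style behaviour).

def remaining_work(t_comp, t_eff, cores):
    return (t_comp * t_eff) / cores + t_comp * (1.0 - t_eff)

# A perfectly efficient task (t_eff = 1.0) scales linearly with cores;
# with t_eff = 0.0 extra cores do not reduce the work at all.
```

For example, a task of 100 flops with t_eff = 0.5 on 2 cores leaves 75 flops of work: 25 flops of parallel work per core plus 50 flops of sequential work.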
3.1 Scheduling Strategy

The objective of the developed scheduling strategy is to minimize the execution time of DAGs consisting of sequential and parallel tasks in a multi-cluster environment. The execution time of a parallel task is a function of the resources used, according to its parallel efficiency. In this context, the objective of the strategy is to find a schedule that minimizes the execution time of all the tasks to be executed simultaneously, since the execution time of a set of ready tasks is limited by the execution time of the slowest task.

The algorithm has two phases, prioritization and scheduling. In the prioritization phase, the available resources in each cluster are shared among the tasks in order to minimize the execution time of the tasks with the greatest computational sizes. In the scheduling phase, the tasks are mapped to resources according to the quantity of resources allotted to each task, following the max–max heuristic, in which the tasks with the highest computational size are mapped to the highest-capacity cluster. For sequential and rigid tasks the quantity of resources allotted is fixed according to the task requirements, while for malleable tasks the quantity of resources is defined in the prioritization phase.

The scheduling phase maps the tasks to resources according to the quantity of resources defined in the prioritization phase. Such a quantity of resources can be available on a single cluster, or as a subset of resources from a number of clusters. The scheduling phase was therefore developed following two approaches: in the first, parallel tasks can be executed in one cluster or across diverse clusters; in the second, each parallel task is always executed in a single cluster.
The prioritization phase of the algorithm works in the following way:

Priorization (task_list)
1)  for each ready task
2)    sort ready tasks by Tcomp
3)    sort resources by power capacity
4)  for each malleable task i
5)    i.cores = i.mincores
6)  for each available resource
7)    for each malleable task i
8)      if execution_time(i) > task_execution_time
9)        slowest_task_ind = i
10)   slowest_task_ind.cores++

The first approach for the scheduling phase schedules a parallel task to the first cluster that has enough available resources to execute it; if no cluster has enough resources, the task is mapped to more than one cluster.

Scheduling_multi_cluster (task_list)
1)  for each ready task i
2)    for each resource r
3)      if available_resources(r) > i.cores
4)        taskmap[i][0] = r
5)    if taskmap[i] == null
6)      for each resource r
7)        if available_resources(r) > 0
8)          i.cores = i.cores - available_resources(r)
9)          taskmap[i][c] = r
10)         c++

The following algorithm performs the scheduling phase following the second approach. It schedules a parallel task to the first cluster that has enough available resources to execute it; if no cluster has enough resources, the task is mapped to the cluster with the highest amount of available resources.

Scheduling_one_cluster (task_list)
1)  for each ready task i
2)    for each resource r
3)      if available_resources(r) > i.cores
4)        taskmap[i][0] = r
5)    if taskmap[i] == null
6)      for each resource r
7)        if available_resources(r) > max_avail_resources
Table 1 Resources

            Cluster 1         Cluster 2         Cluster 3
Nodes       10                4                 2
Flops       63840160000       25534816000       11200384000
CPU         Intel 2.00 GHz    Intel 1.60 GHz    Intel 2.80 GHz
Bandwidth   100 Mb/s          100 Mb/s          100 Mb/s
Latency     100 ms            100 ms            100 ms
Table 2 Workloads

Tasks        Workload 1   Workload 2   Workload 3   Workload 4   Workload 5
Malleable    2            2            4            3            3
Rigid        0            0            0            1            1
Sequential   0            0            2            2            2

8)          max_avail_resources = r
9)          i.cores = available_resources(r)
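The two-phase strategy can be sketched in runnable form as follows (a minimal illustration with our own data structures; `t_comp` is the sequential computation size and `t_eff` the parallel efficiency of a task):

```python
# Prioritization phase: every malleable task starts at its minimum core
# count, then each spare core is handed to whichever task is currently
# the slowest, shrinking the makespan of the batch of ready tasks.

def exec_time(task):
    # Remaining work under the efficiency model of Sect. 3
    w, e = task["t_comp"], task["t_eff"]
    return (w * e) / task["cores"] + w * (1.0 - e)

def prioritize(tasks, free_cores):
    for t in tasks:
        t["cores"] = t["mincores"]
    spare = free_cores - sum(t["cores"] for t in tasks)
    for _ in range(spare):
        slowest = max(tasks, key=exec_time)
        if slowest["cores"] < slowest["maxcores"]:
            slowest["cores"] += 1
    return tasks

# Scheduling phase, single-cluster variant: pick the first cluster with
# enough free cores; otherwise shrink the task to fit the cluster that
# has the most free cores (mirroring Scheduling_one_cluster above).

def schedule_one_cluster(tasks, free):          # free: {cluster: cores}
    mapping = {}
    for t in sorted(tasks, key=lambda t: -t["t_comp"]):
        fit = next((c for c in free if free[c] >= t["cores"]), None)
        if fit is None:
            fit = max(free, key=free.get)
            t["cores"] = free[fit]
        free[fit] -= t["cores"]
        mapping[t["name"]] = fit
    return mapping
```

For example, two perfectly efficient malleable tasks of sizes 100 and 50 sharing 6 free cores end up with 4 and 2 cores respectively, equalizing their execution times, and both fit on the largest cluster.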
The next section presents an evaluation comparing the scheduling strategy that can map a task to more than one cluster with the one that maps each task to a single cluster.
4 Simulation Setup

The evaluation of the proposed strategy was carried out with the support of SimGrid [18], using the MSG framework. The multi-cluster environment was deployed with three clusters, each internally homogeneous, and each cluster node was a single-core machine. Intra-cluster communication was modeled as a 100 Mb/s switched Ethernet network, and inter-cluster communication was modeled as a 1 Gb/s switched Ethernet network connected by a router. Table 1 presents the cluster configurations.

The workloads were created with varying numbers of sequential, rigid and malleable tasks, and different communication and computation costs. Table 2 presents the workload details.
5 Simulation Results

The results show the execution time of the scheduling strategy that maps each parallel task to one cluster (scheduling_one_cluster), and of the scheduling strategy that can map tasks to diverse clusters (scheduling_multi_cluster). The
Table 3 Workload execution time

Workload     Execution time (s) multi-cluster   Execution time (s) one-cluster
Workload 1   5.00588                            6.49161
Workload 2   25.0059                            31.2292
Workload 3   21.0533                            21.0533
Workload 4   52.6358                            52.6358
Workload 5   40                                 40
utilization of parallel efficiency as a criterion for estimating execution time in the prioritization phase was also evaluated. Table 3 shows the simulation results.

The scheduling_multi_cluster strategy performed better than scheduling_one_cluster for workload 1 and workload 2; for workloads 3, 4 and 5, both algorithms took the same time. Workloads 1 and 2 are composed of two very efficient malleable tasks, so scheduling_multi_cluster performed better because the tasks could use resources from diverse clusters without significant network overhead. Workloads 3, 4 and 5 are composed of tasks that require more resources than the multi-cluster environment provides. In this case, the prioritization phase assigned each task an amount of resources lower than the task's maxcores, with the consequence that scheduling_multi_cluster and scheduling_one_cluster produced the same task mapping.
6 Conclusion and Future Work

A two-phase scheduling algorithm was designed to schedule DAGs consisting of sequential and parallel tasks in a multi-cluster grid made up of homogeneous clusters. The algorithm was developed by means of two approaches: the first schedules parallel tasks to more than one cluster (scheduling_multi_cluster), while the second schedules parallel tasks to only one cluster (scheduling_one_cluster). Both approaches were tested using five workloads. The scheduling_multi_cluster approach performed better than scheduling_one_cluster for workloads with few tasks, and presented the same execution time for workloads with more tasks than available resources.

In future work, two further aspects could be investigated. The first is an analysis of the network overhead in a scenario with heterogeneous bandwidth and latency. The second is an evaluation in a production environment.
References

1. Foster IT (2001) The anatomy of the grid: enabling scalable virtual organizations. In: First IEEE international symposium on cluster computing and the grid (CCGrid'01), pp 1–4
2. Yu J, Buyya R (2005) A taxonomy of scientific workflow systems for grid computing. SIGMOD Rec 34(3):44–49
3. Yu J, Buyya R, Ramamohanarao K (2008) Workflow scheduling algorithms for grid computing. In: Metaheuristics for scheduling in distributed computing environments
4. Jansen K, Zhang H (2006) An approximation algorithm for scheduling malleable tasks under general precedence constraints. ACM Trans Algorithms 2(3):416–434
5. Deelman E et al (2005) Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci Program 13(3):219–237
6. Venugopal S, Buyya R, Winton L (2006) A grid service broker for scheduling eScience applications on global data grids. Concurr Comput: Pract Experience 18(6):685–699
7. Barbosa J, Morais C, Nobrega R, Monteiro AP (2005) Static scheduling of dependent parallel tasks on heterogeneous clusters. In: 2005 IEEE international conference on cluster computing, pp 1–8
8. Casanova H, Desprez F, Suter F (2010) On cluster resource allocation for multiple parallel task graphs. J Parallel Distrib Comput 70(12):1193–1203
9. Desprez F, Suter F (2010) A bi-criteria algorithm for scheduling parallel task graphs on clusters. In: 2010 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid), pp 243–252
10. Couvares P, Kosar T, Roy A, Weber J, Wenger K (2007) Workflow management in Condor. In: Taylor IJ, Deelman E, Gannon DB, Shields M (eds) Workflows for e-Science. Springer, London, pp 357–375
11. Graham RL, Woodall TS, Squyres JM (2005) Open MPI: a flexible high performance MPI. In: 6th annual international conference on parallel processing and applied mathematics
12. Karonis NT, Toonen B, Foster I (2003) MPICH-G2: a grid-enabled implementation of the message passing interface. J Parallel Distrib Comput 63(5):551–563
13. Takano R et al (2008) High performance relay mechanism for MPI communication libraries run on multiple private IP address clusters. In: 8th IEEE international symposium on cluster computing and the grid (CCGRID'08), pp 401–408
14. Massetto F et al (2011) A message forward tool for integration of clusters of clusters based on MPI architecture. In: Hsu C-H, Malyshkin V (eds) Methods and tools of parallel programming multicomputers, vol 6083. Springer, Berlin/Heidelberg, pp 105–114
15. Coti C, Herault T, Cappello F (2009) MPI applications on grids: a topology aware approach. In: Sips H, Epema D, Lin H-X (eds) Euro-Par 2009 parallel processing, vol 5704. Springer, Berlin/Heidelberg, pp 466–477
16. Feitelson DG, Rudolph L, Schwiegelshohn U, Sevcik KC, Wong P (1997) Theory and practice in parallel job scheduling. In: Proceedings of the job scheduling strategies for parallel processing, pp 1–34
17. Trystram D (2001) Scheduling parallel applications using malleable tasks on clusters. In: Proceedings of the 15th international parallel and distributed processing symposium, pp 2128–2135
18. Casanova H, Legrand A, Quinson M (2008) SimGrid: a generic framework for large-scale distributed experiments. In: 10th international conference on computer modeling and simulation (UKSIM 2008), pp 126–131