Scheduling of Dependent Grid Jobs in Absence of Exact Job Length Information

Maria Chtepen¹, Filip H.A. Claeys², Bart Dhoedt¹, Filip De Turck¹, Peter A. Vanrolleghem², and Piet Demeester¹

¹ Department of Information Technology (INTEC), Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium
{maria.chtepen, bart.dhoedt, filip.deturck}@intec.ugent.be
² Department of Applied Mathematics, Biometrics and Process Control (BIOMATH), Ghent University, Coupure Links 653, Ghent, Belgium
{filip.claeys, peter.vanrolleghem}@biomath.ugent.be

Abstract. In highly complex virtualized grid environments, a significant performance gain can be achieved when appropriate scheduling techniques are applied that are adjusted to the system at hand and to the expected workload. Most of the existing scheduling algorithms for grids are based on the assumption that the exact application (job) execution time is known a priori. This assumption significantly simplifies job-resource matchmaking, although it has proven to be inapplicable for a large group of real-world applications, for which only a rough estimate of the total job length can be provided in advance. In this paper, an intelligent scheduling algorithm for grid jobs with dependent sub-tasks and initially unknown execution times is proposed. The algorithm dynamically reassigns tasks based on runtime estimates of their total execution time and on system status information. The algorithm is validated in a simulation environment, using a realistic job execution model. The model employs workload and system parameters derived from logs that were collected from several large-scale parallel production systems. The results of the simulation experiments make it possible to estimate the performance gain as a function of the number of dependent tasks and the pattern of their execution time estimates.

1 Introduction

Taking into account the complexity of grid distributed environments and the constantly increasing dimensions of modern applications, there is clearly a need for elaborate scheduling strategies that provide for efficient utilization of the available resources. In spite of the fact that significant efforts have been spent on the design of scheduling algorithms for grids, the main disadvantage of the proposed methods is that they often rely on a priori knowledge of task lengths when making task-resource matchmaking decisions. However, determining a task length is far from trivial for a large group of real-world applications, and often only a rough estimate can be provided before execution starts. Fortunately, in many real-world applications there exists a correlation between the task execution progress and the total task execution time. If the execution progress can be monitored at runtime, this data, together with the length of the elapsed time interval, can serve as a good basis for predicting the remaining execution time. In most cases, the accuracy of the provided task length estimate improves as execution proceeds.

Another important issue in the domain of distributed computing is the scheduling of tasks with inter-dependencies. Most of the dependencies that occur in practice can be subdivided into the following categories:

MPI dependencies: Tasks communicate during the execution process by means of an MPI-like (Message Passing Interface) protocol.

Input dependencies: Tasks execute independently but require inputs generated by other tasks.

In the first case, tasks periodically exchange information during execution, at so-called synchronization points. Since extensive message passing within globally distributed environments implies significant communication overhead, it is desirable to schedule MPI tasks on collocated resources (either within a single cluster or within a single machine). On the other hand, the overhead resulting from the communication of tasks belonging to the second category is much more limited. Therefore, knowledge of the input dependencies creates a potential for execution optimization by means of intelligent scheduling.

This paper considers scheduling of applications composed of tasks with input inter-dependencies, for which no exact information on total execution time is available. The dependencies are assumed to be statically determined and are provided to the distributed system in the form of DAGs (Directed Acyclic Graphs). A dynamic scheduling algorithm is proposed, which reacts at runtime to changes in task length prediction by rescheduling parallel tasks in the DAG, thereby optimizing the total execution time of the application. As a use case for the validation of the proposed scheduling method, a data set is used that was derived from real-world logs collected from different large-scale parallel production systems [1, 2]. Simulation-based experiments, utilizing this data set within a discrete event grid simulation environment, have shown a significant performance improvement of the proposed dynamic strategy compared to a static scheduling approach.

The remainder of this paper is organized as follows: Section 2 summarizes related work; Section 3 describes the assumed grid model; the proposed dynamic scheduling method is introduced in Section 4, while its performance is evaluated in Section 5; and, finally, Section 6 concludes the paper.

2 Related Work

As was mentioned earlier, numerous research efforts have already been devoted to the design of static and, to a lesser extent, dynamic scheduling strategies for grids [3, 4]. Unfortunately, their usability is often limited by the fact that they rely on the assumption of a priori knowledge of job length or execution time, which is not available for many existing applications. Therefore, research has been performed on the accurate prediction of task runtimes.

For instance, Nassif et al. [5] use historical information on previous similar task executions to recommend a solution for a new case. Each task is supposed to possess a set of descriptive attributes, such as application name, arguments, type, requirements, and data, which are used to determine the similarities between different tasks. The key idea is to verify whether a specific host has already run a similar task; the time spent executing a past task helps to predict the runtime of a new one.

Caniou et al. [6] introduce a Historical Trace Manager (HTM) task length estimation module for GridRPC-based middleware. The HTM takes into account that servers run under the time-shared model, and is able to predict the duration of a given task on a particular resource as well as its impact on the other running tasks. The module is non-intrusive and provides good estimates when the resource load is not too high. The disadvantage of HTM is that it requires a significant amount of information on system performance and the executed application, such as server and network peak performance, size of input and output data, the number of operations in each task, etc. Providing such information is often challenging, since system performance can vary strongly over time and applications can exhibit complicated control flows, which makes it difficult to determine the number of executed instructions in advance. In contrast to both research efforts mentioned, our method does not require any historical information or knowledge of task properties, since its decisions are based solely on information on task execution progress and system status, collected at runtime.

Yet another performance prediction-based mechanism is described by Wu et al. [7]. They introduce a Grid Harvest Service (GHS) that combines probabilistic modeling for task execution time estimation with runtime resource monitoring to guarantee good grid performance in the presence of abnormal resource behavior. The operation of GHS is based on the assumption that local job processing follows an M/G/1 queuing system, with exponentially distributed interarrival times and arbitrary service times. The disadvantage of this model is that it is sufficiently accurate only for execution-time prediction of long-running tasks.

Considering scheduling of interdependent tasks within distributed environments, most of the related research solves this issue by representing dependencies as a DAG. Malewicz et al. [8] present a DAG scheduling tool that prioritizes tasks to maximize the number of eligible tasks at every step of the computation. An eligible task is a task whose parents have already been processed and which can thus be executed in the next scheduling round. Our research, on the other hand, concentrates on optimizing the number of processed jobs in a certain time interval: the algorithm can delay or accelerate the execution of an individual eligible task if this reduces the execution time of a job as a whole. Aggarwal et al. [9] propose a Genetic Algorithm based scheduler for tasks organized into a DAG structure according to constraints imposed by the end-user.

Figure 1. Example grid architecture: UI (User Interface), GSched (Grid Scheduler), IS (Information Service), CS (Checkpoint Server), WAN (Wide Area Network), LAN (Local Area Network)

Contrary to our approach, their algorithm optimizes the available resource utilization while satisfying the conflicting goals of the users and the resource providers.

3 Grid and Workload Models

For development and validation of the proposed dynamic solution, a general grid model was assumed, consisting of geographically dispersed computational sites (S), each aggregating a number of computational resources (CR) and general services (Figure 1). The latter include a user interface (UI), through which tasks are submitted into the system; a scheduler (GSched), responsible for task-resource matchmaking; an information service (IS), which collects the task and resource status information required by the GSched; and a checkpoint server (CS), where checkpointing data is made persistent. The GSched invokes the matchmaking procedure within the predefined scheduling interval $I_{GSched}$, while the IS collects changes in resource status with a delay $I_{IS}$, to reflect the modification propagation time occurring in actual deployments. The grid sites reside within a Wide Area Network (WAN), while resources belonging to a single site are interconnected by Local Area Networks (LAN). Finally, it is assumed that all grid management services are protected against failure and only CRs can be unstable, with a resource failure affecting all CPUs within a given CR.

In this paper, job dependencies are represented as a DAG with a single initial and a single final task, as shown in Figure 2.

Figure 2. Example input dependencies organized into a DAG structure

In the figure, circles indicate tasks belonging to the same job, while edges represent input dependencies. The considered structure is a frequently occurring dependency pattern, whereby an initial task generates an arbitrary number of dependents (for example, simulations with varying parameter values), the results of which can either be combined as input for a new task or can be further split into an arbitrary number of sub-dependent tasks. At the final level of the graph structure, the outputs of several parallel tasks are combined to calculate the final result.

Given the form of the above-described task dependencies, we utilize the generally accepted tree data structure terminology when referring to the DAG components. In what follows, a root refers to a top-level task, a task with at least one outgoing edge is called a parent, and a task with at least one incoming edge is a child. In the remainder of the paper, children are also called dependent tasks. Since root tasks do not have any execution dependencies, their processing can be started as soon as there are idle resources in the grid, while the start time of children strongly depends on the execution time of their parents.

Furthermore, the considered workload model assumes that the exact task execution time is unknown a priori but can be estimated at runtime by monitoring the task progress and the system parameters. To find suitable models describing the evolution of the execution time estimates, the behavior of several existing applications was observed. Figure 3 gives examples of the variation of execution time estimates over time for a number of simulation experiments, referred to as "Orbal", "Lux", "Bamberg" and "Galindo OL" [10]. As can be seen, the four curves follow different courses that can nevertheless be fitted to several approximation models: the Orbal curve follows an overestimation model, whereby the job execution time prediction gradually decreases as execution progresses, until the stable state (i.e., the actual execution time) is reached; the Lux experiment follows an underestimation model, which means that the estimate gradually increases during the execution process; the Bamberg execution time estimate displays a rather erratic pattern that can be represented by a fluctuating model; and, finally, Galindo OL again follows an overestimation model, in this case with a damped oscillation superposed on the decreasing trendline. For all of these cases, the estimate was computed as $E_{est} = (100\% / P) \times E$, where $P$ is the percentage of the workload already completed and $E$ is the execution time elapsed so far.

Figure 3. Examples of the evolution of job execution time estimates for the "Orbal", "Lux", "Bamberg" and "Galindo OL" simulation experiments

To simplify and generalize the representation of the above-described experimental approximation models, the following mathematical models for execution time estimation were considered:

$E_{est} = E_{ref} + r_1 A e^{-bt}$ describes an exponentially decreasing overestimation model;

$E_{est} = r_1 A e^{-1/(bt)}$ stands for an exponentially increasing underestimation model; and

$E_{est} = E_{ref} + r_1 A e^{-bt} \sin(2\pi t / F + r_2 \varphi)$ represents the fluctuating evolution of execution time estimates.

In the above equations, $E_{ref}$ stands for the reference total task execution time; $r_1$ and $r_2$ are random weight factors that can take values between 0 and 1; $A$ is the amplitude of the estimate oscillations; $t$ is the current simulated time; $b$ is a weight factor that determines how quickly the amplitude decreases / increases over time; $F$ stands for the oscillation period; and, finally, $\varphi$ represents the oscillation phase.
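For concreteness, the sketch below implements the three estimation models and the progress-based estimate in Python. Function and parameter names mirror the symbols above; the guard for $t = 0$ in the underestimation model is our assumption for a runnable sketch, not part of the original formulation.

```python
import math

def overestimate(t, E_ref, A, b, r1):
    """Exponentially decreasing overestimation: converges to E_ref from above."""
    return E_ref + r1 * A * math.exp(-b * t)

def underestimate(t, A, b, r1):
    """Exponentially increasing underestimation: grows toward r1 * A."""
    if t == 0:
        return 0.0  # limit of exp(-1/(b*t)) as t -> 0+
    return r1 * A * math.exp(-1.0 / (b * t))

def fluctuating(t, E_ref, A, b, F, phi, r1, r2):
    """Damped oscillation superposed on the reference execution time."""
    return E_ref + r1 * A * math.exp(-b * t) * math.sin(2 * math.pi * t / F + r2 * phi)

def progress_estimate(elapsed, percent_done):
    """E_est = (100% / P) * E: extrapolate the total time from monitored progress."""
    return 100.0 / percent_done * elapsed

# Example: a task that is 40% done after 2 hours extrapolates to 5 hours in total.
assert progress_estimate(2.0, 40.0) == 5.0
```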

4 Dynamic Scheduling Method for Jobs with Input Inter-Dependencies and Unknown Execution Time

In this section, a dynamic scheduling method for applications composed of tasks with input dependencies is introduced. The algorithm deals with the situation where the exact task execution time is unknown a priori, but gradually improving execution time estimates can be collected at runtime. The objective of the proposed algorithm is to maximize the number of processed jobs within a particular time interval, which is achieved at the cost of slower execution of some individual tasks. The algorithm assigns tasks to resources in the order of their arrival into the grid, and in such a way that parallel tasks finish more or less simultaneously. The idea behind this approach is that tasks having the same dependents should, as far as possible, execute equally fast, since the dependent tasks can only

Figure 4. Flow of the operation phases of the proposed dynamic method

be processed after all their parents have finished execution. For instance, if a parallel set contains one long and several short tasks, the long task should be scheduled on the fastest available resource, while the short tasks can be assigned to slower resources, as long as this does not increase the maximum execution time of the set. This frees faster machines for other tasks requiring fast processing. Initially, tasks are assigned using a rough estimate of the task length; later on, groups of parallel tasks can be rescheduled based on dynamic updates of the execution time prediction and grid status information. Concretely, the operation of the proposed method can be subdivided into three phases, graphically represented in Figure 4 (a simplified sketch of the rescheduling phase follows this description):

Dynamic collection of information: Runtime information on the current task status, as well as on the availability of grid resources, is collected from the IS. In this phase, the execution location of all root tasks, or information on their failure, is updated. The latter is of particular use in heavily loaded grids, where CRs are scarce. In this case, root tasks, which have no execution constraints due to input dependencies, are started first, occupying most of the available resources. To avoid significant job execution delay, it should be possible to locate and reschedule or interrupt running root tasks when a dependent task is ready to execute. Note that the correctness of the information provided by the IS strongly depends on the size of the information collection interval $I_{IS}$.

Rescheduling of parent sets: The purpose of this part of the algorithm is to accelerate and optimize the execution of dependent tasks by reducing the execution time of their parent(s). The set of parents of a particular task is further referred to as a parent set or PS. Examples of PSs can be found in Figure 2, where the sets {0}, {1,2}, {3} and {5,6,7} are the parent sets of tasks 1-4, 5-6, 7 and 8, respectively. The total execution time of a PS is reduced based on runtime estimates of the included tasks ($E_{est}$) as well as on dynamic information on grid resource availability and load. The algorithm processes parent sets in order of increasing set size (shortest set first), and only PSs consisting entirely of running or finished tasks are considered for rescheduling. This ensures that almost completed PSs are scheduled first, on the fastest CRs, so that the execution time of the set is reduced. As candidate resources for rescheduling, not only free CRs are considered, but also all resources running root tasks, since the latter can be canceled or displaced in favor of dependent tasks. The algorithm iterates over all tasks within the selected PS, trying to assign the task $j$ with the longest estimated execution time ($E_{est}^{max}$) to the fastest available resource $r$. The latter is selected based on dynamic information on resource speed ($MIPS_r$, Million Instructions Per Second) and the number of tasks it is currently executing ($n_r$). If root tasks are executing on $r$, they are checkpointed and reallocated to the slowest free resources, if any exist and if the reallocation does not affect the execution of other PSs. When $r$ is found, $j$ is rescheduled, under the condition that it will not delay the execution of other PSs running on $r$; otherwise, the algorithm tries to reschedule $j$ on the second, third, etc. fastest CR. Afterwards, the algorithm proceeds by assigning the tasks from the considered PS with $E_{est} < E_{est}^{max}$ to CRs in such a way that the difference $|E_{est} - E_{est}^{max}|$ is minimized. This implies that the remaining tasks from the considered PS are assigned to the slowest possible resources that keep the total job execution time below the predefined boundary ($E_{est}^{max}$), preserving better CRs for tasks requiring fast execution. Taking into account the cost of checkpointing and migration for a particular application, it can be desirable to limit rescheduling to cases where the achieved performance improvement exceeds a certain threshold; the exact value of this parameter is, however, strongly dependent on the application properties. Furthermore, instead of using the current execution time estimate $E_{est}$, the average of the values obtained so far, $E_{est}^{avg}$, can be utilized. The latter is particularly advantageous for filtering out oscillations in fluctuating execution time estimate models, such as in the case of Galindo OL (see Figure 3), and thus for avoiding overzealous checkpointing and migration.

Scheduling of new / failed tasks: In this part of the algorithm, tasks that are ready to be executed, i.e., tasks with either no parents or with all parents finished, are (re)scheduled. To achieve an even task assignment, the algorithm attempts to process tasks belonging to the same parent set in a single iteration. The next task or PS is selected in order of its arrival timestamp (longest waiting task or PS first). An exception is the situation where the PS processed in the previous algorithm iteration could not be assigned in its entirety due to a shortage of free CRs; in this case, the remaining tasks of the previous sets get higher priority. This part of the algorithm proceeds largely as in the previous phase, except that tasks within PSs are sorted according to their predicted length instead of the predicted execution time. For failed tasks, the predicted length is reduced by the amount of computation already performed and checkpointed. Here again, tasks belonging to a parent set can be scheduled either on free resources or in place of root tasks, while the root tasks themselves can, logically, only be distributed to free CRs.
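The following Python sketch illustrates the core of the parent-set rescheduling phase under strong simplifications: a single fixed speed per CR, runtime estimates assumed to scale linearly with $MIPS_r$, and no checkpointing or migration cost. All names (`Resource`, `Task`, `reschedule_parent_set`, `BASELINE_MIPS`) are ours and hypothetical; they do not correspond to the DSiDE implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    name: str
    mips: float                          # resource speed (MIPS_r)
    tasks: List["Task"] = field(default_factory=list)

@dataclass
class Task:
    name: str
    e_est: float                         # current runtime estimate at baseline speed
    is_root: bool = False

BASELINE_MIPS = 1000.0                   # illustrative reference speed (assumption)

def predict(task: Task, resource: Resource) -> float:
    # Predicted runtime on `resource`, assuming the estimate scales
    # linearly with relative resource speed.
    return task.e_est * BASELINE_MIPS / resource.mips

def assign(task: Task, resource: Resource) -> float:
    resource.tasks.append(task)
    return predict(task, resource)

def reschedule_parent_set(ps: List[Task], resources: List[Resource]) -> None:
    """Balance one parent set (PS): the longest task goes to the fastest
    candidate CR; the remaining tasks go to the slowest CRs that still
    finish within the resulting bound E_est^max."""
    tasks = sorted(ps, key=lambda t: t.e_est, reverse=True)
    # Candidate CRs: free ones plus CRs running only (preemptible) root tasks,
    # fastest first. We assume at least one candidate exists.
    candidates = sorted(
        (r for r in resources if all(t.is_root for t in r.tasks)),
        key=lambda r: r.mips, reverse=True)
    e_max = assign(tasks[0], candidates[0])       # E_est^max on the fastest CR
    for task in tasks[1:]:
        # Among CRs keeping the task below E_est^max, pick the slowest,
        # i.e. minimize |E_est - E_est^max|, preserving fast CRs for others.
        fitting = [r for r in candidates[1:] if predict(task, r) <= e_max]
        target = (max(fitting, key=lambda r: predict(task, r))
                  if fitting else candidates[0])
        assign(task, target)

# Usage: a PS of three tasks balanced over four free CRs. The long task
# lands on the 4000-MIPS CR; the others go to the slowest CRs that fit.
crs = [Resource(f"cr{i}", m) for i, m in enumerate([4000, 2000, 500, 100])]
reschedule_parent_set([Task("t1", 3600.0), Task("t2", 900.0), Task("t3", 300.0)], crs)
```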

5 Performance Evaluation

The performance of the algorithm proposed in the previous section was evaluated in the DSiDE (Dynamic Scheduling in Distributed Environments) grid simulation environment [11]. DSiDE provides an easy and flexible task dependency specification mechanism, and allows for straightforward modelling and simulation of dynamic events, such as varying job execution time estimates, varying resource availability and load, etc.

The grid environment modelled in DSiDE corresponds to the model presented in Section 3 and consists of 128 CRs, equally spread over 4 sites, as shown in Figure 1. Sites are connected by WAN links with a bandwidth of 1 Gbit/s and a latency varying from 3 to 10 ms, while CRs inside the sites communicate through LAN networks arranged in a star topology, with link bandwidths of 1 Gbit/s and a latency of 1 ms. In line with real-world grid deployments, the CRs in this model possess different computational power, varying from 100 to 4000 MIPS. Furthermore, each CR is limited to processing at most 2 tasks simultaneously. GSched in the assumed model runs every 5 min, while the worst-case information update delay of the IS is set to 10 min. The performance of the algorithm is observed in an entirely stable environment, where neither grid services nor CRs are subject to failure.

As input to the simulator, a workload derived from production grid logs was utilized. More specifically, the submitted workload follows the Lublin job generation model [12]. In the considered simulation scenario, the model parameters are initialized to represent a heavily loaded grid system with a job arrival pattern that has a daily cycle, implying that most of the jobs (almost 80%) arrive during day-time. The runtimes of the submitted jobs vary from less than 1 hour to more than 10 hours, where more than 80% of all jobs have medium execution times (1 to 6 hours). To simplify the comparison between algorithms, it is assumed that all jobs use inputs and outputs of 10 GB, and that the job checkpointing delay varies from 100 ms to 5 s, depending on the execution time. Furthermore, the exact task execution time is unknown before runtime, and the parameters of the earlier introduced estimation models are initialized as follows (a sampling sketch is given below): values for $E_{ref}$ are extracted from the Lublin model with the above-described properties; $A$ is uniformly distributed between $E_{ref} \times 150\%$ and $E_{ref} \times 200\%$; $b$ is uniformly distributed between 0 and 1; $F$ is uniformly distributed between $E_{ref} \times 5\%$ and $E_{ref} \times 10\%$; and, finally, $\varphi$ is uniformly distributed between $E_{ref} \times 1\%$ and $E_{ref} \times 2\%$. Additionally, the maximum value of $E_{est}$ for the exponentially increasing underestimation model was set to $A$.

Using the above-described grid and workload models, the performance of the proposed dynamic algorithm was compared against that of a simple static approach. The latter schedules ready-to-execute tasks in the order of their arrival into the system, each time assigning the fastest and least loaded free resource to the next task. The static approach, as the name suggests, takes into account neither runtime variations in task execution time estimates nor changes in grid status. The performance of both algorithms was observed for three simulation scenarios: all tasks follow the overestimated execution time model; all tasks follow the underestimated execution time model; or all tasks follow the fluctuating execution time model. For each scenario, the number of final tasks or jobs processed and their average turn-around times (Y-axes) are observed as a function of the average number of dependent tasks generated by each parent node (see Figure 5). The variation of the latter parameter is determined as a uniform distribution between the boundaries depicted on the X-axes of Figure 5.
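The parameter initialization described above might be sampled as follows; `sample_lublin_runtime` is a hypothetical stand-in for a draw from the Lublin model [12], which is not reproduced here.

```python
import random

def sample_lublin_runtime() -> float:
    """Hypothetical stand-in for a runtime drawn from the Lublin model [12];
    approximated here by a uniform draw over the medium-job range (1 to 6 h)."""
    return random.uniform(1 * 3600.0, 6 * 3600.0)    # seconds

def sample_estimation_params() -> dict:
    """Sample per-job parameters of the estimation models, following the
    distributions stated in the text."""
    e_ref = sample_lublin_runtime()
    a = random.uniform(1.50 * e_ref, 2.00 * e_ref)   # amplitude A
    return {
        "E_ref": e_ref,
        "A": a,
        "b": random.uniform(0.0, 1.0),                     # damping weight factor
        "F": random.uniform(0.05 * e_ref, 0.10 * e_ref),   # oscillation period
        "phi": random.uniform(0.01 * e_ref, 0.02 * e_ref), # oscillation phase
        "E_est_max": a,   # cap on E_est for the underestimation model
    }
```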

The behavior of the static and the dynamic approaches was observed during 7 days of simulated time. As Figure 5 shows, when task execution time is overestimated and job dependency trees have few branches, the static approach performs best. This is the consequence of the increased turn-around time of individual tasks resulting from additional checkpointing and migration. This overhead, required by the parent set execution time balancing mechanism, is obviously not profitable when there are few relatively short dependent tasks. The situation changes as the number of dependents increases: the balancing mechanism of the dynamic approach then becomes more efficient than the static approach, with respect to both the number of final tasks processed and their turn-around times. The higher the branching factor of the dependency trees, the fewer final tasks get processed, but the better, relatively speaking, the performance of the dynamic method compared to the static method. However, as can be observed in Figure 6, as the number of dependents increases, the average turn-around time of an individual task increases significantly, since its execution time has to be balanced against the execution time of a large number of parallel tasks.

For underestimated execution times, the advantage of the dynamic approach over the static one becomes even more pronounced. In this case, the dynamic algorithm performs about 30% better even for a small number of dependent tasks. This is the result of the increasing difference between the turn-around times of final tasks achieved by the two algorithms. As can be seen in the figure, for a large number of dependent tasks the static approach does not manage to finish any final task within the observed time interval; therefore, for the last three cases there is no average job turn-around data available for the static algorithm. The more or less random task assignment of the static approach thus appears to penalize the turn-around times of long tasks more than those of relatively short tasks.

Finally, the fluctuating model shows results that lie somewhere between those achieved by the overestimation and underestimation models. It is important to note that in the latter case the average execution time estimate is used, instead of the last estimated value, to avoid unnecessary migrations resulting from frequent oscillations.

Figure 5. Number of processed jobs and their average turn-around times for the dynamic and static algorithms under the decreasing, increasing and combined task execution time estimation models

Figure 6. Average turn-around time of individual tasks for the dynamic and static algorithms under the decreasing task execution time estimation model

6 Conclusions

In this work, a dynamic method is presented that schedules dependent tasks, represented in the form of DAGs, taking into account the monitored grid status and intermediate estimates of the total task execution time. The proposed algorithm was evaluated in a grid simulator under increasing, decreasing and fluctuating task execution time estimation models. The simulation results have shown that the dynamic algorithm significantly outperforms the static approach for jobs consisting of relatively large tasks with initially underestimated or fluctuating execution times. In the case of overestimated and gradually decreasing execution times, the dynamic algorithm is only advantageous for scheduling jobs with a large number of dependent tasks executed in parallel.



References

1. Feitelson, D.: Parallel Workloads Archive. http://www.cs.huji.ac.il/labs/parallel/workload/ (2003)
2. Schroeder, B., Gibson, G.: A Large-Scale Study of Failures in High-Performance Computing Systems. In: Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), Philadelphia, PA, USA (2006)
3. Chang, R.S., Chang, J., Lin, P.S.: Balanced Job Assignment Based on Ant Algorithm for Computing Grids. In: Proceedings of the 2nd IEEE Asia-Pacific Services Computing Conference, Tsukuba Science City (2007)


4. Rahman, M., Venugopal, S., Buyya, R.: A Dynamic Critical Path Algorithm for Scheduling Scientific Workflow Applications on Global Grids. In: Proceedings of the 3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India (2007) 35–42
5. Nassif, L., Nogueira, J., Ahmed, M., Karmouch, A., Impey, R., Vinicius de Andrade, F.: Job Completion Prediction in Grid Using Distributed Case-based Reasoning. In: Proceedings of the 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise, Linköping, Sweden (2005) 249–254
6. Caniou, Y., Jeannot, E.: Experimental Study of Multi-Criteria Scheduling Heuristics for GridRPC Systems. In: Proceedings of ACM/IFIP/IEEE Euro-Par 2004: International Conference on Parallel Processing, Pisa, Italy (2004) 1048–1055
7. Wu, M., Sun, X.-H.: A General Self-adaptive Task Scheduling System for Non-dedicated Heterogeneous Computing. In: Proceedings of the IEEE International Conference on Cluster Computing, Hong Kong (2003)
8. Malewicz, G., Foster, I., Rosenberg, A., Wilde, M.: A Tool for Prioritizing DAGMan Jobs and Its Evaluation. In: Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing, Paris, France (2006) 156–168
9. Aggarwal, M., Kent, R., Ngom, A.: Genetic Algorithm Based Scheduler for Computational Grids. In: Proceedings of the 19th International Symposium on High Performance Computing Systems and Applications (2005) 209–215
10. Claeys, F.: A Generic Software Framework for Modelling and Virtual Experimentation with Complex Environmental Systems. PhD thesis, Dept. of Applied Mathematics, Biometrics and Process Control, Ghent University, Belgium (2008)
11. Chtepen, M., Claeys, F., Dhoedt, B., De Turck, F., Vanrolleghem, P., Demeester, P.: Dynamic Scheduling of Computationally Intensive Applications on Unreliable Infrastructures. In: Proceedings of the 2nd European Modeling and Simulation Symposium (EMSS '06), Barcelona, Spain (2006)
12. Lublin, U., Feitelson, D.: The Workload on Parallel Supercomputers: Modeling the Characteristics of Rigid Jobs. Journal of Parallel and Distributed Computing 63(11) (2003) 1105–1122
