IERI Procedia 10 (2014) 216–223. doi:10.1016/j.ieri.2014.09.080
2014 International Conference on Future Information Engineering
Towards Better Workflow Execution Time Estimation

Artem M. Chirkin a,b, Sergey V. Kovalchuk a,*
a ITMO University, Birzhevaya line, 4, 199034, Saint-Petersburg, Russian Federation
b University of Amsterdam, Spui 21, 1012 WX, Amsterdam, Netherlands
Abstract

The authors highlight the importance of estimating workflow execution time in the scheduling problem and in pricing models in Clouds, and claim that better approaches need to be developed to obtain precise estimates of execution time. They describe a straightforward regression model for stochastic execution time estimation, and propose a combination of Chebyshev-like distribution-free inequalities and distribution-based approaches to calculate better confidence bounds on runtime based on the results of statistical tests. They explain the need to develop a better workflow runtime estimation algorithm. Finally, the authors present an approach to estimating workflow execution time based on stochastic estimates of tasks' execution time.

© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection and peer review under responsibility of Information Engineering Research Institute.
Keywords: runtime, execution time, workflow, scheduling
1. Introduction

Nowadays, research and production computational problems require ever more computing resources. Therefore, heterogeneous systems (HSs) based on Grid and Cloud technologies are becoming more popular, as they
* Corresponding author. Tel.: +7-962-687-0460; fax: +7-812-232-23-07. E-mail address: [email protected].
provide high flexibility in scaling resources, support large amounts of data, and allow a single service to run on different computers as a part of a complex engine [1]. This work addresses the problem of estimating the execution time of complex tasks in an HS.

Complex tasks executed in an HS usually consist of several atomic tasks (stages) and form a workflow. We refer to a workflow as a collection of tasks and the informational dependencies between them; these dependencies impose restrictions on the execution order of the tasks. A number of papers describe different implementations of workflow management systems (e.g. [2, 3]). The main question researchers address when developing such systems is how to schedule a workflow in the most efficient way: minimizing execution time or cost, meeting a deadline, or satisfying other constraints [4, 5]. We say that an HS contains a number of packages (applications) which provide services; each task of the workflow is an execution of a single service. Additionally, the HS contains a number of nodes (computing units). The scheduling problem consists of mapping tasks to nodes optimally. A frequently used approach to represent a workflow is a directed acyclic graph (DAG) [4, 6, 7]. It is well known that the general DAG scheduling problem is NP-complete [8]; thus many heuristics exist that provide near-optimal solutions.

A significant part of the scheduling problem is the estimation of workflow execution time. We suppose the runtime estimation problem is not as well developed as the scheduling problem, because several straightforward techniques exist that give acceptable performance for many scheduling cases. However, we claim that these techniques may fail in some cases, especially in a deadline-constrained setting. Moreover, when estimating workflow execution time, one should take into account the dependence between executed tasks, the heterogeneity of tasks and computing resources, and the variety of workflow forms. The most important feature of an HS is that some execution time components are stochastic in their nature (data transfer time, overheads, etc.). Modern studies do not consider these facets together; therefore the problem needs further research.

Besides scheduling, runtime estimation is used by the clients of an HS. Cloud computing providers create pricing models to give their clients billing information [9]; they have to calculate execution time estimates precisely to present the expected price of a workflow execution to a user. Given the estimated time bounds, one can use the model described in [10] to provide the user with pre-billing information.

The aim of this work is to highlight common problems of estimating workflow execution time and to propose a solution that takes into account the complexity and stochastic nature of workflow processes.

2. Execution time in workflow scheduling

Workflow scheduling is well known to be a complex problem; it requires considering many different facets (e.g. the resource model, workflow model, scheduling criteria, etc.) [4]. Therefore many authors describing the infrastructure of their scheduling systems do not focus on runtime estimation. We separate these two questions so that results obtained in research dedicated to the general workflow execution time problem can be used in complex scheduling systems that already exist or are being developed. In such a setting, an estimating system provides values of the target function (execution time) to the scheduling system, just as if the scheduling system had calculated them internally. A minimal sketch of such an interface is given below.
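To make the decoupling concrete, here is a minimal sketch of what such an estimator interface could look like; the class and method names (`RuntimeEstimator`, `estimate_task`, `estimate_workflow`) are our own illustration and not part of any of the cited systems.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class TimeEstimate:
    """A stochastic runtime estimate: mean, variance, and an upper confidence bound."""
    mean: float         # expected execution time, seconds
    variance: float     # variance of the execution time
    upper_bound: float  # e.g. a 95% confidence bound, for deadline checks


class RuntimeEstimator(Protocol):
    """Interface a scheduler can query instead of computing runtime internally."""

    def estimate_task(self, service: str, args: dict, node: str) -> TimeEstimate:
        """Estimate the runtime of a single task (service call) on a given node."""
        ...

    def estimate_workflow(self, dag: dict) -> TimeEstimate:
        """Aggregate task estimates over a workflow DAG (tasks -> dependencies)."""
        ...
```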
Additionally, we can use the estimating system to obtain the remaining execution time at any stage of workflow execution, in order to monitor the process or re-schedule the remaining tasks. Some examples of scheduling systems that can naturally use any implementation of a runtime estimation module are CLAVIRE (ITMO University, St. Petersburg, Russia) [2, 11], the framework proposed by Ling Yuan et al. [3], and CLOUDRB by Somasundaram and Govindarajan [12].

Today a large variety of scheduling heuristics have been studied. Some of them do not require estimating the workflow runtime [13, 14], but this decreases the flexibility and the maximum possible effectiveness of a scheduler [15]. Most schedulers can be classified by the type of assumption they make regarding time.
The first class of assumptions requires only information on whether one task needs a longer time to execute than another (i.e. an ordered list of tasks), and is used in task-level scheduling heuristics (which do not take into account the execution time of the whole workflow). A number of scheduling heuristics of this class are described in [16]. The second class of assumptions considers task runtime as a constant value. This is a simple approach to modeling workflow time, but it does not allow measuring the possible estimation error. Some examples can be found in [6, 7, 15]. The last type of assumptions exploits the random nature of execution time and has the potential to give the most accurate runtime predictions. Knowledge of the expected value and variance allows one to measure the uncertainty of the workflow runtime and to make rough bounds on time using Chebyshev's inequality [17]; or, provided quantiles or a distribution approximation are available, one can construct rather good confidence intervals on execution time that may be used in a deadline-constrained setting [18].

3. Stochastic approach to modeling runtime

Besides the computation time of the programs, workflow execution time includes additional costs for data transfer, resource allocation, and others. Therefore single task processing can be considered as a computational service call, and its execution time can be represented in the following way:
$T = T_R + T_Q + T_D + T_E + T_O$   (1)
Here, $T_R$ is the resource preparation time, $T_Q$ the queuing time, $T_D$ the data transfer time, $T_E$ the program execution time, and $T_O$ the system overhead time (analyzing the task structure, selecting the resource, etc.). For a detailed description of the component model (1) see [11]. If we consider the components to be stochastic variables, we should analyze their distributions and dependencies. If all components are independent, we can easily find the mean and variance of $T$:
$E[T] = E[T_R] + E[T_Q] + E[T_D] + E[T_E] + E[T_O]$
$Var(T) = Var(T_R) + Var(T_Q) + Var(T_D) + Var(T_E) + Var(T_O)$   (2)
However, in the case of dependent variables the variance equality is wrong: we should add covariance terms to compute $Var(T)$. This especially concerns the dependence between the program execution time $T_E$ and the data transfer time $T_D$: execution time usually depends significantly on the data size. From here on, for simplicity, we assume all components are included in the overall task execution time $T$. When modeling the execution time of a workflow, two major problems should be addressed: the statistical sufficiency of the data and the assumptions on it, and the aggregation of tasks' execution times into a workflow estimate.

3.1. Task execution time

The first problem is related to the way one represents execution time when fitting it to observed data. The most general way is to represent execution time $T$ as a regression model:
$T \stackrel{d}{=} f(X, \beta)$   (3)
Here $X$ denotes the task execution arguments, represented by a $k\,(\ge 0)$-dimensional vector in an arbitrary space; $X$ is controlled by an application executor, but can be treated as a random variable from the regression analysis point of view. $T \in \mathbb{R}$ is the dependent variable (i.e. execution time); $f(X, \beta)$ is a regression (random) function of
specified form, so that $T \mid X$ is a random variable; $\beta$ are the unknown parameters of the function $f$, whose form depends on the representation of the regression function. This formula allows representing any kind of runtime dependence on the task arguments, but it is difficult to work with. It can be rewritten in the following form:
$T \stackrel{d}{=} \mu(X, \beta) + \sigma(X, \beta)\,\xi(X, \beta)$   (4)

Here $\mu(X, \beta) = E[T \mid X] = E[f(X, \beta) \mid X]$ is the conditional expectation; $\sigma(X, \beta) = \sqrt{Var(T \mid X)} = \sqrt{Var(f(X, \beta) \mid X)}$ is the standard deviation; $\xi(X, \beta)$ is a random variable that depends on $X$ and $T$. The regression problem consists of estimating the parameters $\beta$, and thus estimating the mean, variance, and conditional distribution of the runtime. This model can be used in three ways: calculate only the mean $\mu(X)$ and assume $\sigma(X)\xi(X)$ to be an insignificant error (e.g. [15, 19]); calculate the mean and variance, then use Chebyshev-like inequalities (e.g. [17]); or estimate $\xi(X)$ and then calculate quantiles [18]. The second way is illustrated in the sketch below.
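As an illustration of the second option, a distribution-free deadline bound can be obtained from the one-sided Chebyshev (Cantelli) inequality; the helper below is a sketch with our own naming, not part of the authors' system.

```python
import math

def cantelli_upper_bound(mean: float, variance: float, alpha: float = 0.05) -> float:
    """Distribution-free runtime bound T_u such that P(T > T_u) <= alpha.

    One-sided Chebyshev (Cantelli): P(T - mu >= k * sigma) <= 1 / (1 + k^2),
    so choosing k = sqrt(1 / alpha - 1) gives the desired tail probability.
    """
    k = math.sqrt(1.0 / alpha - 1.0)
    return mean + k * math.sqrt(variance)

# E.g. with mean 100 s and variance 225 s^2, a 95% bound:
print(cantelli_upper_bound(100.0, 225.0))  # ~ 165.4 s
```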
Let us assume now that we have estimated $\xi(X)$ and $\sigma(X)$ perfectly (or that we know them theoretically). Then we can easily prove that $E[\xi \mid X] = E[\xi] = 0$, $Var(\xi \mid X) = Var(\xi) = 1$, and $Cov(X, \xi) = 0$. It is a usual approach to assume that the error $\xi$ does not depend on the input $X$. However, in the general case $\xi$ does depend on $X$: this means that the distribution of $T$ not only shifts and scales with $X$, but also changes its shape. Consequently, for some values of $X$, the time $T$ may have a significantly larger right "tail". Given a small training sample (observed executions), this results in an unexpectedly high probability of missing a deadline. These moment properties can be verified numerically, as in the sketch below.
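A quick numerical check (our own toy model, not from the paper): simulating runtimes with a known conditional mean and standard deviation, then standardizing them as in (4), recovers exactly these moment properties of $\xi$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: runtime T | X is Gamma-distributed with X-dependent mean and sd.
n = 100_000
X = rng.uniform(10.0, 100.0, size=n)
mu = 2.0 * X + 5.0           # true conditional mean mu(X)
sigma = 0.3 * X              # true conditional sd sigma(X)
shape = (mu / sigma) ** 2    # Gamma parameters matching mu and sigma
scale = sigma ** 2 / mu
T = rng.gamma(shape, scale)

# Standardized residual xi = (T - mu(X)) / sigma(X): Eq. (4) inverted.
xi = (T - mu) / sigma
print(np.mean(xi), np.var(xi))   # ~ 0 and ~ 1
print(np.cov(X, xi)[0, 1])       # ~ 0
```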
Fig. 1. Q-Q plots for the Weibull distribution; sample size N = 60; k = 3 for the independence case; k ranging from 1.1 to 3 for the dependence case
Figure 1 illustrates the described phenomenon on a synthetic test. We generated two small samples from the Weibull distribution to simulate running a program with different parameters $X$: one "good" sample ($\xi$ does not depend on $X$) generated using a constant value of the shape parameter $k$, and one "bad" sample ($\xi$ does depend on $X$) for which $k = 3 - X/55$. The scale parameter $\lambda$ was always adjusted so that $Var(\xi) = 1$. In this setting, for high values of $X$ the "bad" distribution of $\xi$ has a large right tail.
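The following sketch reproduces this synthetic setup under stated assumptions (the range of X and the random seed are ours):

```python
import numpy as np
from scipy.special import gamma as G

rng = np.random.default_rng(42)

def weibull_sample(k, rng):
    """Weibull variate with shape k, scale chosen so the variance equals 1."""
    var_unit = G(1 + 2 / k) - G(1 + 1 / k) ** 2  # variance at scale lambda = 1
    lam = 1.0 / np.sqrt(var_unit)
    return lam * rng.weibull(k)

N = 60
X = rng.uniform(10, 100, size=N)  # assumed range of the task arguments
good = np.array([weibull_sample(3.0, rng) for _ in X])        # k constant
bad = np.array([weibull_sample(3 - x / 55, rng) for x in X])  # k = 3 - X/55

# For large X the "bad" shape parameter drops toward 1, fattening the right tail:
print(bad[X > 80].max(), good[X > 80].max())
```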
This effect is not noticeable if we observe the whole execution sample (Fig. 1a), because only a few runs were executed with parameters $X > 80$. At the same time, if we restrict attention to a narrow region of $X$, we see that the "bad" distribution does not fit the theoretical quantiles (Fig. 1b).

In order to check whether the sample (i.e. the execution logs of a program) satisfies the independence condition, one should use a statistical hypothesis test. In the case of continuous data, smoothed bootstrap tests have been shown to perform best [20]. After categorizing the data, tests such as Fisher's exact test [21] or the asymptotic Pearson's chi-squared test (only if there is enough data) may be used, as in the sketch below.
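For categorized data, such a check might look as follows; the quantile binning scheme and the number of bins are our assumptions:

```python
import numpy as np
from scipy.stats import chi2_contingency

def independence_pvalue(X, xi, bins=3):
    """Pearson chi-squared test of independence between binned X and binned xi."""
    # Categorize both variables by their empirical quantiles.
    x_cat = np.digitize(X, np.quantile(X, np.linspace(0, 1, bins + 1)[1:-1]))
    e_cat = np.digitize(xi, np.quantile(xi, np.linspace(0, 1, bins + 1)[1:-1]))
    table = np.zeros((bins, bins))
    for i, j in zip(x_cat, e_cat):
        table[i, j] += 1
    _, p_value, _, _ = chi2_contingency(table)
    return p_value
```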
Above we explained why the straightforward approach of fitting the mean and variance to estimate the distribution of a program's execution time should be used with caution and might give wrong results in some cases. If one cannot statistically accept that the distribution of $\xi$ does not depend on $X$, quantiles cannot be calculated directly; in this case distribution-free inequalities (e.g. Chebyshev's inequality) may be used to estimate runtime bounds. In our research we propose to combine distribution-dependent and distribution-free quantiles using p-values from independence tests. More formally, we propose to estimate the confidence level of the runtime bounds (4) in the following form:

$P(T \le T_u) = P(T \le T_u \mid H_0, \{X_n, \xi_n\})\, f_n(p) + P(T \le T_u \mid \bar{H}_0, \{X_n, \xi_n\})\,(1 - f_n(p))$   (5)

Here $p$ stands for the p-value of the hypothesis test; $H_0$ is the hypothesis of independence (the null hypothesis); $f_n(p)$ is some non-decreasing function of $p$, which also depends on the sample and represents the probability of the sample being drawn from independent distributions of $X$ and $\xi$. In other words, we compute the probability $P(T \le T_u)$ of meeting the deadline $T_u$ using the law of total probability.

3.2. Workflow execution time

The second problem consists of estimating bounds on the workflow execution time. From the stochastic runtime perspective, a workflow is simply a combination of sums and maximums of random variables. If the only information available about execution time is the mean and variance, then the root of the problem is that $\max(E[T_1], E[T_2]) \le E[\max(T_1, T_2)]$: the execution time of parallel branches cannot be calculated directly, as the simulation below illustrates. One workaround is to compute confidence intervals separately for each task. However, it cannot be used because it leads to applying Chebyshev's inequality many times and inflating the confidence bounds. Another approach is to calculate the mean and variance of the execution time for all paths in the workflow graph (as in [17]). We believe that for large workflows this approach may be too computationally expensive.
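A tiny Monte Carlo illustration of this inequality (branch distributions chosen arbitrarily): even when two parallel branches have equal mean runtimes, the fork-join's expected finish time exceeds either mean.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two parallel branches with equal means: the fork-join finishes at max(T1, T2).
T1 = rng.normal(100.0, 15.0, size=1_000_000)
T2 = rng.normal(100.0, 15.0, size=1_000_000)

print(max(T1.mean(), T2.mean()))  # ~ 100.0
print(np.maximum(T1, T2).mean())  # ~ 108.5 (strictly larger)
```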
Fig. 2. (a-c) three primitive workflow types; (d) increase of estimation precision during execution of a workflow; (e) standard deviation of the workflow execution time estimate at different stages of workflow execution
We propose an algorithm that lies between the two described methods. It consists of five stages (a sketch is given at the end of this section):

1) Select all fork-join parts of the workflow for whose branches the mean and variance are calculated; for example, in Figure 2 workflows (a) and (b) consist of one part, and workflow (c) consists of three parts (T1; T2 and T3; T4); this procedure might be performed recursively by replacing any task with another sub-workflow.
2) Evaluate the mean and standard deviation for all (k ≥ 1) parallel threads of the fork-join structure.
3) Estimate upper and lower bounds on the moments (mean and variance) of the fork-join part's execution time.
4) Use the estimates of inner fork-joins in outer fork-joins (and finally to estimate the moments of the workflow execution time).
5) Use the mean and variance to estimate bounds on the workflow runtime.

We are currently working on creating better bounds on the moments of the maximum of several variables, because the precision of these bounds has a great effect on the total estimates. A similar approach can be used to calculate quantiles directly instead of using moments.

One important feature of a workflow execution time estimating system is the ability to improve estimates during workflow execution, when some stages have been executed and others have not. We developed an estimating system which uses different methods to calculate the regression model (4), always using the one that performs best on the given input. To implement the ability to use real-time execution data, we included a dummy method in the system that returns the observed execution time if a package has already been executed.

Figures 2(d) and 2(e) present the estimates produced by the system during workflow execution. We simulated a simple workflow consisting of four sequentially connected tasks, and defined stage k as the time when k − 1 of the packages have already been executed. All tasks had roughly the same execution time of around 100 s (the mean and variance varied). We performed remaining execution time estimations during workflow execution. Figure 2(d) shows a single workflow execution; green vertical lines separate the stages (i.e. the times when different packages are executing). It is easy to see that the expectation comes closer to reality as time passes, though it does not necessarily always improve; the confidence intervals, however, always improve. Note that there is a chance that the upper (lower) confidence bound moves up (down) instead of decreasing (increasing) if the previous task takes much more (less) time than expected. Figure 2(e) presents the relative deviation of the estimation error at different stages. The average is taken over different trials: we performed 1000 experiments of 40 sequential workflow estimations each. Usually the estimation system is learning during these 40 runs; however, in the plotted case the task models are too simple (i.e. the estimate is a simple average, and the estimator learns almost immediately) to observe the learning process. It is clear that the deviation decreases with increasing stage number, because the uncertainty (the number of remaining packages) decreases. Note that at the last stage the variance always equals zero (all packages have been executed).
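As promised above, a minimal sketch of the five-stage procedure, assuming independent, non-negative task runtimes. The bounds used for the moments of a maximum (stage 3) are deliberately crude placeholders, not the sharper bounds the text says are work in progress; all type and function names are ours.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Interval = Tuple[float, float]  # (lower, upper) bounds

@dataclass
class Task:
    mean: float
    var: float

@dataclass
class Seq:
    parts: List["Node"]   # sequentially connected sub-workflows

@dataclass
class Par:
    branches: List["Node"]  # parallel branches of a fork-join

Node = Union[Task, Seq, Par]

def moments(node: Node) -> Tuple[Interval, Interval]:
    """Interval bounds on (mean, variance), recursing through fork-joins (stages 1-4)."""
    if isinstance(node, Task):
        return (node.mean, node.mean), (node.var, node.var)
    if isinstance(node, Seq):
        # Sequential parts: moments add up under independence, as in Eq. (2).
        m_lo = m_hi = v_lo = v_hi = 0.0
        for part in node.parts:
            (a, b), (c, d) = moments(part)
            m_lo += a; m_hi += b; v_lo += c; v_hi += d
        return (m_lo, m_hi), (v_lo, v_hi)
    # Fork-join: crude bounds on the moments of max(T_1, ..., T_k), k >= 1.
    ms, vs = zip(*[moments(b) for b in node.branches])
    mean_lo = max(m[0] for m in ms)   # E[max] >= max of means
    mean_hi = sum(m[1] for m in ms)   # E[max] <= sum of means (non-negative T)
    ex2_lo = max(v[0] + m[0] ** 2 for m, v in zip(ms, vs))  # E[max^2] >= any E[T_i^2]
    ex2_hi = sum(v[1] for v in vs) + sum(m[1] for m in ms) ** 2  # max^2 <= (sum T_i)^2
    var_lo = max(0.0, ex2_lo - mean_hi ** 2)
    var_hi = ex2_hi - mean_lo ** 2
    return (mean_lo, mean_hi), (var_lo, var_hi)

# Roughly workflow (c) of Fig. 2: (T1 in parallel with T2 -> T3), then T4.
wf = Seq([Par([Task(100, 225), Seq([Task(50, 100), Task(60, 120)])]), Task(100, 225)])
print(moments(wf))
```

Stage 5 would then feed the resulting mean/variance intervals into a bound such as the Cantelli one sketched earlier.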
4. Conclusion

We showed that execution time estimation plays an important role in the scheduling problem and in the pricing models of HSs. Researchers differ in the way they represent execution time; many scheduling heuristics do not require estimating workflow runtime, but this estimation plays a key role in scheduling quality [15]. Additionally, a custom runtime estimator can be used in many scheduling algorithms without modifying them.

We claimed that better approaches need to be developed to obtain precise estimates of execution time. We proposed a combination of Chebyshev-like distribution-free inequalities and distribution-based approaches to calculate better confidence bounds on runtime based on the results of statistical tests. We highlighted the need to develop a better workflow runtime estimation algorithm. Finally, we presented an approach to estimating workflow execution time based on stochastic estimates of tasks' execution time.
Acknowledgements

This work was financially supported by the Government of the Russian Federation, Grant 074-U01.
References

[1] Juve G., Deelman E. Scientific Workflows in the Cloud. Grids, Clouds and Virtualization, 2011: 71-91.
[2] Knyazkov K.V., et al. CLAVIRE: e-Science infrastructure for data-driven computing. Journal of Computational Science, 2012; 3 (6): 504-510.
[3] Yuan L., He K., Fan P. A framework for grid workflow scheduling with resource constraints. International Conference on Consumer Electronics, Communications and Networks (CECNet), 2011: 962-965.
[4] Wieczorek M., Hoheisel A., Prodan R. Towards a general model of the multi-criteria workflow scheduling on the grid. Future Generation Computer Systems, 2009; 25 (3): 237-256.
[5] Kyriazis D., et al. An innovative workflow mapping mechanism for Grids in the frame of Quality of Service. Future Generation Computer Systems, 2008; 24 (6): 498-511.
[6] Chen W.-N., Zhang J. An Ant Colony Optimization Approach to a Grid Workflow Scheduling Problem With Various QoS Requirements. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2009; 39 (1): 29-43.
[7] Amalarethinam D.I.G., Selvi F.K.M. A Minimum Makespan Grid Workflow Scheduling algorithm. IEEE International Conference on Computer Communication and Informatics, 2012: 1-6.
[8] El-Rewini H., Lewis T.G., Ali H.H. Task scheduling in parallel and distributed systems. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.
[9] Weinhardt C., et al. Cloud Computing: A Classification, Business Models, and Research Directions. Business & Information Systems Engineering, 2009; 1 (5): 391-399.
[10] Sharma B., et al. Pricing Cloud Compute Commodities: A Novel Financial Economic Model. 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2012: 451-457.
[11] Kovalchuk S.V., et al. Deadline-driven Resource Management within Urgent Computing Cyberinfrastructure. Procedia Computer Science, 2013; 18: 2203-2212.
[12] Somasundaram T.S., Govindarajan K. CLOUDRB: A framework for scheduling and managing High-Performance Computing (HPC) applications in science cloud. Future Generation Computer Systems, 2014; 34: 47-65.
[13] Cushing R., et al. Prediction-based auto-scaling of scientific workflows. Proceedings of the 9th International Workshop on Middleware for Grids, Clouds and e-Science (MGC'11), 2011; Article No. 1.
[14] Oprescu A.-M. Stochastic Approaches to Self-Adaptive Application Execution on Clouds. Ph.D. thesis, Vrije Universiteit, 2013.
[15] Korkhov V. Hierarchical Resource Management in Grid Computing. Ph.D. thesis, Universiteit van Amsterdam, 2009.
[16] Gutierrez-Garcia J.O., Sim K.M. A Family of Heuristics for Agent-Based Cloud Bag-of-Tasks Scheduling. IEEE International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2011: 416-423.
[17] Afzal A., Darlington J., McGough A. Stochastic Workflow Scheduling with QoS Guarantees in Grid Computing Environments. IEEE Fifth International Conference on Grid and Cooperative Computing (GCC'06), 2006: 185-194.
[18] Trebon N., Foster I. Enabling urgent computing within the existing distributed computing infrastructure. Ph.D. thesis, 2011.
[19] Ichikawa S., Takahashi S., Kawai Y. Optimizing process allocation of parallel programs for heterogeneous clusters. Concurrency and Computation: Practice and Experience, 2009; 21 (4): 475-507.
[20] Wu E.H., Yu P.L., Li W. A smoothed bootstrap test for independence based on mutual information. Computational Statistics & Data Analysis, 2009; 53 (7): 2524-2536.
[21] Mehta C.R., Patel N.R. A Network Algorithm for Performing Fisher's Exact Test in r × c Contingency Tables. Journal of the American Statistical Association, 1983; 78 (382): 427-434.