Real World Workflow Applications in the Askalon Grid Environment*

Rubing Duan, Thomas Fahringer, Radu Prodan, Jun Qin, Alex Villazón, Marek Wieczorek
Institute for Computer Science, University of Innsbruck, Technikerstraße 21a, A-6020 Innsbruck, Austria
{rubing,tf,radu,jerry,alex,marek}@dps.uibk.ac.at

Abstract. The workflow paradigm is widely regarded as an important class of truly distributed Grid applications that poses many challenges for a Grid computing environment. To date, rather few real-world applications have been successfully ported as Grid-enabled workflows. We present the Askalon programming and computing environment for the Grid, which comprises a high-level abstract workflow language and a sophisticated service-oriented runtime environment including meta-scheduling, performance monitoring and analysis, and resource brokerage services. We demonstrate the development of a real-world distributed river modelling workflow system in the Askalon environment that harnesses the computational power of multiple Grid sites to optimise the overall execution time.

1

Introduction

Grid computing simplifies the sharing and aggregation of distributed heterogeneous hardware and software resources through seamless, dependable, and pervasive access. It is well known that highly dynamic Grid infrastructures severely hamper the composition and execution of distributed applications that form complex workflows. We have developed the Askalon programming and computing environment [5] whose goal is to simplify the development of Grid applications. Askalon currently supports the performance-oriented development of single-site, parameter study, and workflow Grid applications. The focus of this paper is on distributed workflow applications. Workflow applications are first specified using a novel high-level abstract workflow language that shields the user from any Grid middleware implementation or technology details. The abstract representation is then mapped to a concrete workflow that can be scheduled, deployed, and executed on multiple Grid sites. The Askalon computing environment is based on a service-oriented

* This research is partially supported by the Austrian Grid project funded by the Austrian Federal Ministry for Education, Science and Culture under the contract GZ 4003/2-VI/4c/2004.

architecture comprising a variety of services including information service and resource brokerage, monitoring, performance prediction and analysis, reliable execution, and meta-scheduling. Although a variety of Grid programming systems exist, few concentrate on workflow applications, and even fewer are capable of supporting the development and execution of real-world workflow applications. In this paper we describe the development of a workflow application in the Askalon Grid environment which extends an existing river modelling system that was previously developed to run on a sequential computer only. This work resulted in a real-world distributed workflow river modelling system that harnesses the computational power of multiple national Austrian Grid sites. The next section describes the Askalon programming and computing environment for the Grid. Section 3 introduces several overhead metrics that we use in the workflow performance analysis process. Section 4 shows the representation of a river modelling workflow application in the Askalon environment. Section 5 presents a performance analysis study, accompanied by a small overhead analysis, that illustrates the benefit of executing the application in a distributed Grid environment. Section 6 concludes the paper.

2

Askalon Programming and Computing Environment

This section describes the Askalon programming and computing environment for the Grid, currently under development at the University of Innsbruck [5].

2.1

Workflow Specification Languages: AGWL and CGWL

In contrast to other approaches [1, 3, 6, 8, 10, 12–14], Askalon enables the description of workflow applications at a high level of abstraction that shields the user from the middleware complexity and the dynamic nature of the Grid. Although workflow applications have been extensively studied in areas like business process modelling and web services, they are relatively new to the Grid computing area. Existing work on Grid workflow programming commonly suffers from one or several of the following drawbacks: control-flow limitations (e.g., no loops), unscalable mechanisms for expressing large parallelism (e.g., no parallel sections or loops), restricted data-flow mechanisms (e.g., limited to files), implementation-specific constructs (e.g., a focus on Web services, Java classes, or software components), and low-level constructs (e.g., start/stop tasks, transfer data, queue task for execution) that should be part of the workflow execution engine. Using the XML-based Abstract Grid Workflow Language (AGWL) [4], the user constructs a workflow application through the composition of atomic units of work called activities, interconnected through control-flow and data-flow dependencies. In contrast to much existing work, AGWL is not bound to any implementation technology such as Web services. The control-flow dependencies include sequences, Directed Acyclic Graphs, for, foreach, and while loops, if-then-else and switch constructs, as well as more advanced constructs such as

parallel activities (or master-slave patterns), parallel loops, and collection iterators. In order to modularise and reuse workflows, so-called sub-workflows (or activity types) can be defined and invoked. Basic data-flow is specified by connecting input and output ports between activities. AGWL is free of low-level constructs as mentioned above. Optionally, the user can attach constraints and properties to activities and data-flow dependencies that provide additional functional and non-functional information to the runtime system for optimisation and steering of the workflow execution on the Grid. Properties define additional information about activities or data links, such as computational and communication complexity, or semantic descriptions of workflow activities. Constraints define additional requirements or contracts to be fulfilled by the runtime system that executes the workflow application, like the minimum memory necessary for an activity execution, or the minimum bandwidth required on a data-flow link. A transformation system parses and transforms the AGWL representation into a concrete workflow specified by the Concrete Grid Workflow Language (CGWL). In contrast to AGWL, which is designed for the end-user, CGWL is oriented towards the runtime system by enriching the workflow representation with additional information to support effective scheduling and execution of the workflow. At this level, the activities are mapped to concrete implementation technologies such as Web services or legacy parallel applications. Moreover, a CGWL representation commonly assumes that every activity can be executed on a different Grid site. Thus, additional activities are inserted to pre-process and transfer I/O data and to invoke remote job submissions. Data transfer protocols are included as well.
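The port-based data-flow model can be made concrete with a small sketch. The following Python fragment models activities with typed input/output ports and validates the data links between them, in the spirit of the checks performed when a workflow is compiled for execution. All class names, port types (e.g., "agwl:file"), and error messages are illustrative inventions, not actual AGWL or Askalon APIs.

```python
from dataclasses import dataclass, field

@dataclass
class Activity:
    """An atomic unit of work with named, typed input and output ports."""
    name: str
    inputs: dict = field(default_factory=dict)    # port name -> data type
    outputs: dict = field(default_factory=dict)

def check_links(activities, links):
    """Validate a workflow graph: activity names are unique, every data link
    connects existing ports, and the types on both ends of a link are
    compatible (here simplified to: equal). Returns a list of error strings."""
    errors = []
    names = [a.name for a in activities]
    for n in set(names):
        if names.count(n) > 1:
            errors.append(f"duplicate activity name: {n}")
    by_name = {a.name: a for a in activities}
    for src, sport, dst, dport in links:
        if src not in by_name or sport not in by_name[src].outputs:
            errors.append(f"unknown output port {src}/{sport}")
        elif dst not in by_name or dport not in by_name[dst].inputs:
            errors.append(f"unknown input port {dst}/{dport}")
        elif by_name[src].outputs[sport] != by_name[dst].inputs[dport]:
            errors.append(f"type mismatch on {src}/{sport} -> {dst}/{dport}")
    return errors

# A two-activity workflow: 'prepare' produces a file consumed by 'simulate'.
prepare = Activity("prepare", outputs={"params": "agwl:file"})
simulate = Activity("simulate", inputs={"params": "agwl:file"},
                    outputs={"runoff": "agwl:file"})
errors = check_links([prepare, simulate],
                     [("prepare", "params", "simulate", "params")])
```

A workflow with a well-formed data link passes with no errors; a link to a non-existent or incompatible port is reported rather than silently accepted, mirroring the idea that such problems should surface at transformation time rather than at run time on the Grid.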
CGWL is also enriched with additional constraints and properties that provide execution requirements and hints, e.g., on which platform a specific activity implementation may run, an estimated number of floating point operations, or the approximate execution time of a given activity implementation. During the compilation from AGWL to CGWL, several correctness checks are performed, like the uniqueness of names, the syntax of conditionals, or the existence of links. The data-flow loosely defined in AGWL is verified and completed, data types are added to all ports, and the compatibility of the data links is validated. Where possible, automatic data type conversions are added. The CGWL representation of an AGWL specification serves as input to the Askalon middleware services, in particular to the Workflow Executor and the Meta-scheduler (see Figure 1).

2.2

Askalon Grid Services

Askalon supports the performance-oriented execution of workflows specified in CGWL through the provision of a broad set of services briefly outlined in this section. All the services are developed based on a low-level Grid infrastructure implemented by the Globus toolkit, which provides a uniform platform for secure job submission, file transfer, discovery, and resource monitoring.

Fig. 1. The Askalon service-oriented architecture.

The Resource Broker service targets negotiation and reservation of resources required to execute a Grid application [11]. Resource Monitoring integrates and extends our present effort on developing the SCALEA-G performance analysis tool for the Grid [5]. The Information Service is a general-purpose service for scalable discovery, organisation, and maintenance of resource and application-specific online and post-mortem data. The Workflow Executor service targets dynamic deployment, coordinated activation, and fault-tolerant completion of activities on the remote Grid sites. Performance Prediction is a service through which we are currently investigating new techniques for accurate estimation of the execution time of atomic activities and data transfers, as well as of Grid resource availability. Performance Analysis is a service that targets automatic instrumentation and bottleneck detection (e.g., excessive synchronisation and communication, load imbalance, inefficiency) within Grid workflows, based on the online data provided by the Monitoring service, or the offline data organised and managed by the Information Service. The (Meta-)Scheduler performs appropriate mapping of single or multiple workflow applications onto the Grid. We have taken a hybrid approach to scheduling single workflow applications based on the following two algorithms [9]:

1. The static scheduling algorithm approaches workflow scheduling as an NP-complete optimisation problem. We have designed the algorithm as an instantiation of a generic optimisation framework developed within the ZENTURIO experiment management tool. The framework is completely generic

and customisable in two aspects: the definition of the objective function and the heuristic-based search engine. ZENTURIO first gives the user the opportunity to specify arbitrary parameter spaces through a generic directive-based language. In the particular case of the scheduling problem, the application parameters are the Grid machines where the workflows are to be scheduled. A heuristic-based search engine attempts to maximise a plug-and-play objective function defined over the set of generic annotated application parameters. For the scheduling problem, we have chosen the converse of the workflow makespan as the objective function to be maximised. For the implementation of the search engine we target problem-independent heuristics like gradient descent or evolutionary algorithms that can instantiate the framework for other optimisation problems too (e.g., parameter optimisation, performance tuning). Our first search engine implementation is based on genetic algorithms, encoding the (arbitrary) application parameters (e.g., the Grid machines for the scheduling problem) as genes and the parameter space as the complete set of chromosomes. We have conducted several experiments on real-world applications where a correctly tuned algorithm delivered on average 700% generational improvement and 25% precision by visiting a fraction of 10^5 search space points in 7 minutes on a 3GHz Pentium 4 processor.

2. The dynamic scheduling algorithm is based on the repeated invocation of the static scheduling algorithm at well-defined scheduling events whose frequency depends on the Grid resource load variation. The repeated static scheduling invocation attempts to adapt the highly-optimised workflow schedule to the dynamically changing Grid resources. Workflow activities have associated well-defined performance contracts that determine whether an activity should be migrated and rescheduled upon underlying resource perturbation or failure.
We have conducted several experiments in which our dynamic scheduling algorithm produced on average 30% faster execution times than the Condor DAGMan matchmaking mechanism [13].
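As a rough illustration of the genetic approach described above — genes encode the machine assignment, and the fitness maximised is the converse of the makespan — here is a minimal, self-contained sketch. It is not the ZENTURIO implementation: the makespan model ignores activity dependencies and data transfers, and all parameter values are invented for the example.

```python
import random

def makespan(assign, work, speed):
    """Makespan of independent activities under a machine assignment:
    each machine's completion time is its total work divided by its speed."""
    load = [0.0] * len(speed)
    for task, m in enumerate(assign):
        load[m] += work[task] / speed[m]
    return max(load)

def ga_schedule(work, speed, pop=30, gens=60, pmut=0.2, seed=0):
    """Genetic-algorithm search: a chromosome is a list of machine indices,
    one gene per activity. Sorting by ascending makespan is equivalent to
    maximising the converse of the makespan."""
    rng = random.Random(seed)
    n, m = len(work), len(speed)
    P = [[rng.randrange(m) for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=lambda c: makespan(c, work, speed))
        P = P[:pop // 2]                      # elitist selection: keep fitter half
        while len(P) < pop:
            a, b = rng.sample(P[:10], 2)      # parents from the best candidates
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]         # one-point crossover
            if rng.random() < pmut:           # point mutation
                child[rng.randrange(n)] = rng.randrange(m)
            P.append(child)
    best = min(P, key=lambda c: makespan(c, work, speed))
    return best, makespan(best, work, speed)
```

For four activities with work 4, 2, 6, 3 and two machines of speed 1.0 and 2.0, the optimal makespan is 5.0 (activities with work 2 and 3 on the slow machine); the elitist GA cannot return worse than the naive everything-on-the-fast-machine schedule (7.5) once it sees such a candidate.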

3

Grid Workflow Overhead Analysis

In Askalon, a Grid workflow application is executed by the Workflow Executor service, based on the CGWL workflow representation. The workflow activities are mapped onto the processors available through the Grid using the (Meta-)Scheduler [9]. For each workflow activity A we currently compute three metrics:

1. the computation time t_A of the corresponding remote Unix process, which we measure by submitting the job on the Grid as an argument to the POSIX-compliant "/bin/time --portability" program; the timing results are retrieved from the job's standard error stream;

2. the Grid execution time T_A, measured between the events STAGEIN (i.e., when the input data is transferred to the execution site) and COMPLETED, generated by the Globus Resource Allocation Manager used to submit the job on the Grid;

3. the Grid middleware overhead associated with the activity A, which we define as O_A = T_A − t_A.

With each workflow execution we associate a Directed Acyclic Trace Graph by unrolling the loops and cloning each activity upon every loop iteration. Since the number of parallel activities in our rather large workflow applications commonly exceeds the number of available Grid machines, we introduce additional edges at execution time, called run-time schedule dependencies, that prohibit two parallel activities from executing on the same machine simultaneously because of the lack of additional Grid sites. For instance, if the parallel activities A_1 and A_2 are scheduled on the same Grid machine, an artificial run-time schedule dependency (A_1, A_2) is added to the set of workflow edges. Let (A_1, ..., A_n) represent a path in the trace graph of a workflow W which maximises the sum Σ_{i=1}^{n} T_{A_i}, also called the critical workflow path. We define the Grid middleware overhead of W as O_W = Σ_{i=1}^{n} O_{A_i}. Additionally, we measure the communication overhead generated by the (GridFTP-based) file transfers required by activities which are executed on different Grid sites having different NFS file systems.
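The overhead definitions above reduce to a few lines of code: given the trace graph (including run-time schedule dependencies) and the measured times T_A and t_A, compute O_A = T_A − t_A per activity, find the path maximising the summed T_A, and sum the overheads along it. The following is an illustrative sketch, not Askalon code; the function and variable names are invented.

```python
from functools import lru_cache

def critical_path_overhead(edges, T, t):
    """Return (per-activity overheads O_A, the critical path, and O_W).

    edges: dict mapping each activity to its list of successors in the
           trace DAG (loop iterations unrolled, run-time schedule
           dependencies included)
    T, t:  dicts of Grid execution time and computation time per activity
    """
    O = {a: T[a] - t[a] for a in T}           # O_A = T_A - t_A

    @lru_cache(maxsize=None)
    def heaviest(a):
        # (total T along the heaviest path starting at a, the path itself)
        tails = [heaviest(s) for s in edges.get(a, [])]
        tail_T, tail = max(tails, default=(0, ()))
        return T[a] + tail_T, (a,) + tail

    # The critical path starts at some activity with no predecessor.
    succs = {s for vs in edges.values() for s in vs}
    roots = [a for a in T if a not in succs]
    _, path = max(heaviest(r) for r in roots)
    return O, path, sum(O[a] for a in path)
```

For a diamond-shaped trace graph A → {B, C} → D with T = {A: 5, B: 3, C: 4, D: 2} and t = {A: 4, B: 2, C: 3, D: 2}, the path A, C, D maximises the summed T (11), and the workflow overhead along it is O_W = 1 + 1 + 0 = 2.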

4

River Modelling: Invmod

Invmod is a hydrological application for river modelling which has been designed for inverse modelling calibration of the WaSiM-ETH program [7]. It uses the Levenberg-Marquardt algorithm to minimise the least squares of the differences between the measured and the simulated runoff for a given time period. Invmod has two levels of parallelism, which are reflected in the Grid-enabled workflow version of the application depicted in Figure 2:

1. the calibration of parameters is calculated separately for each starting value using multiple so-called parallel random runs;
2. for each optimisation step, represented by an inner loop iteration, all the parameters are changed in parallel and the goal function is separately calculated.

The number of inner loop iterations is variable and depends on the actual convergence of the optimisation process; however, it is usually equal to the input maximum iteration number.

Fig. 2. The Invmod workflow.

Figure 3 represents an AGWL excerpt of the Invmod workflow, which contains the declarations of the internal while loop and parallel-for structures.

Fig. 3. Excerpt from the Invmod AGWL representation.
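To make the two-level structure concrete, here is a toy Python sketch: an outer level of independent calibrations from random starting values (the parallel random runs, executed concurrently in the real workflow but sequentially here) and an inner iterative loop minimising the least-squares goal function. The inner loop uses a simple damped numerical-gradient step as a stand-in for the actual Levenberg-Marquardt algorithm, and all names and data are invented for illustration.

```python
import random

def calibrate(start, observed, simulate, steps=60, damping=0.01):
    """Inner loop: iteratively improve one model parameter by minimising the
    sum of squared differences between measured and simulated runoff.
    (A toy stand-in for Levenberg-Marquardt.)"""
    goal = lambda p: sum((o - s) ** 2 for o, s in zip(observed, simulate(p)))
    p, eps = start, 1e-6
    for _ in range(steps):
        grad = (goal(p + eps) - goal(p)) / eps   # one-sided numerical gradient
        p -= damping * grad                       # damped descent step
    return p, goal(p)

def invmod(observed, simulate, random_runs=8, seed=1):
    """Outer level: independent calibrations from random starting values
    (the 'parallel random runs'); the best goal-function value wins."""
    rng = random.Random(seed)
    runs = [calibrate(rng.uniform(0.0, 2.0), observed, simulate)
            for _ in range(random_runs)]
    return min(runs, key=lambda r: r[1])
```

With a linear toy model runoff = p * inflow and observations generated at p = 1.5, every random run converges to the true parameter, and the minimum over runs recovers it.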

5

Experiments

The Askalon service-oriented architecture is currently being developed and deployed within the Austrian Grid infrastructure that aggregates several national Grid sites [2]. The subset of computational resources used for the experiments presented in this paper is summarised in Table 1. The abstract AGWL representation of the Invmod workflow (see Figure 3) is translated into a concrete CGWL representation in which the

Site   | Number of CPUs | CPU Type  | Clock [GHz] | RAM [MBytes] | Location
Hydra  | 16             | AMD 2000  | 1.6         | 1000         | Linz
ZID392 | 16             | Pentium 4 | 1.8         | 512          | Innsbruck
ZID421 | 16             | Pentium 4 | 1.8         | 512          | Innsbruck
ZID108 | 6              | Pentium 3 | 1           | 256          | Innsbruck
ZID119 | 6              | Pentium 3 | 1           | 256          | Innsbruck
ZID139 | 6              | Pentium 3 | 1           | 256          | Innsbruck
ZID145 | 6              | Pentium 3 | 1           | 256          | Innsbruck

Table 1. The Austrian Grid testbed.

activities are instantiated by legacy Fortran applications executed using the resource and data management support offered by the Globus toolkit. We performed three series of experiments for the Invmod river modelling application, corresponding to three different problem sizes identified by 100, 150, and 200 parallel random runs, respectively. We first executed each problem size on the Hydra reference Grid site, since it is the fastest cluster in our Grid testbed for this application (faster than the Pentium 4 clusters). Then, we incrementally added new sites to the execution testbed to investigate whether we can improve the performance of the application by increasing the available computational Grid resources. For each individual execution, we measured the execution time as well as the overheads described in Section 3. Figure 4 shows that the Invmod execution time improves as the number of Grid sites increases. The best speedup is obtained when the two fastest clusters (i.e., Hydra and ZID392) are first joined to the resource pool. Less powerful clusters also improve the overall execution, but with a less steep increase in speedup (see Figure 4(d)). As expected, the Grid middleware overhead increases when new, slower clusters are added to the Grid testbed. This is most visible for the smallest problem size (i.e., 100 random runs), for which the large overhead-to-computation ratio produces rather low increases in speedup (see Figure 4(a)). We obtained similar speedup curves for all three problem sizes due to the limited number of Grid machines available. As the problem size gets larger, the ratio of the overhead to the overall execution time gets smaller and the speedups obtained are higher, since the Grid machines perform more computation (see Figure 4(c)).
Since the workflow schedules are computed by the meta-scheduler such that data-dependent activities are scheduled on the same site (sharing the same NFS file system), the time spent on communication is negligible in all experiments, even though the generated files are of the order of gigabytes in size. The most important result is that, by increasing the number of Grid sites, the overall performance of the distributed Invmod application improves compared to the fastest parallel computer available in the Grid infrastructure.

Fig. 4. The Invmod experimental results: (a) 100 random runs; (b) 150 random runs; (c) 200 random runs; (d) speedup.
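The two ratios discussed in this section — speedup relative to the fastest single site and the middleware share of the elapsed time — can be written out as follows. The timings in the example are hypothetical placeholders, not the measured values behind Figure 4.

```python
def speedup(reference_time, grid_time):
    """Speedup of the multi-site run relative to the fastest single site."""
    return reference_time / grid_time

def overhead_ratio(middleware_overhead, elapsed_time):
    """Fraction of the elapsed time attributable to Grid middleware."""
    return middleware_overhead / elapsed_time

# Hypothetical timings in seconds (NOT the measured values from the paper):
# one problem size on the reference site vs. on four Grid sites.
single_site, four_sites, middleware = 3600.0, 1500.0, 300.0
s = speedup(single_site, four_sites)        # 2.4x faster on four sites
r = overhead_ratio(middleware, four_sites)  # 20% of elapsed time is overhead
```

The overhead ratio explains the speedup curves: when it is large (small problem sizes), adding slow sites buys little; as the problem grows, the ratio shrinks and the speedup approaches the resource-driven ideal.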

6

Conclusions

In this paper we have shown the approach taken by the Askalon project for defining and executing Grid workflow applications. Askalon proposes a service-oriented architecture that comprises a variety of services for performance-oriented development of Grid applications, including resource brokerage, resource monitoring, information service, workflow execution, (meta-)scheduling, performance prediction, and performance analysis. Workflows are specified in a high-level abstract language that shields the application developer from the underlying Grid and its technologies. A transformation system instantiates the workflow into a concrete representation appropriate for Grid execution. We have demonstrated the effective use of Askalon for modelling, scheduling, executing, and analysing the performance of a real-world distributed river modelling application in the Austrian Grid environment. Our experiments based on workflow overhead analysis show that substantial performance improvements can be gained by increasing the number of sites available in the Grid environment (up to a reasonable number), compared to the fastest parallel computer available. Currently we are applying Askalon to other real-world applications from areas such as astrophysics and material science. Moreover, we are incrementally improving the Askalon middleware to better support the effective performance-oriented development of Grid applications.

References

1. Tony Andrews, Francisco Curbera, Hitesh Dholakia, Yaron Goland, Johannes Klein, Frank Leymann, Kevin Liu, Dieter Roller, Doug Smith, Siebel Systems, Satish Thatte, Ivana Trickovic, and Sanjiva Weerawarana. Business Process Execution Language for Web Services (BPEL4WS). Specification version 1.1, Microsoft, BEA, and IBM, May 2003.
2. The Austrian Grid Consortium. http://www.austriangrid.at.
3. Dietmar W. Erwin and David F. Snelling. UNICORE: A Grid computing environment. Lecture Notes in Computer Science, 2150, 2001.
4. T. Fahringer, S. Pllana, and A. Villazon. A-GWL: Abstract Grid Workflow Language. In International Conference on Computational Science, Programming Paradigms for Grids and Metacomputing Systems, Krakow, Poland, June 2004. Springer-Verlag.
5. Thomas Fahringer, Alexandru Jugravu, Sabri Pllana, Radu Prodan, Clovis Seragiotto Junior, and Hong-Linh Truong. ASKALON: A Tool Set for Cluster and Grid Computing. Concurrency and Computation: Practice and Experience, 17, 2005. http://dps.uibk.ac.at/askalon/.
6. IT Innovation. Workflow enactment engine, October 2002. http://www.it-innovation.soton.ac.uk/mygrid/workflow/.
7. K. Jasper. Hydrological Modelling of Alpine River Catchments Using Output Variables from Atmospheric Models. PhD thesis, ETH Zurich, 2001. Diss. ETH No. 14385.
8. Sriram Krishnan, Patrick Wagstrom, and Gregor von Laszewski. GSFL: A Workflow Framework for Grid Services. Technical report, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, U.S.A., July 2002.
9. Radu Prodan and Thomas Fahringer. Dynamic Scheduling of Scientific Workflow Applications on the Grid Using a Modular Optimisation Tool: A Case Study. In 20th Symposium on Applied Computing (SAC 2005), Santa Fe, New Mexico, USA, March 2005. ACM Press.
10. Ed Seidel, Gabrielle Allen, André Merzky, and Jarek Nabrzyski. GridLab: a Grid application toolkit and testbed. Future Generation Computer Systems, 18(8):1143–1153, 2002.
11. Mumtaz Siddiqui and Thomas Fahringer. GridARM: Askalon's Grid Resource Management System. In European Grid Conference (EGC 2005), Lecture Notes in Computer Science. Springer-Verlag, February 2005.
12. Ian Taylor, Matthew Shields, Ian Wang, and Omer Rana. Triana applications within Grid computing and peer to peer environments. Journal of Grid Computing, 1(2):199–217, 2003.
13. The Condor Team. DAGMan (Directed Acyclic Graph Manager). http://www.cs.wisc.edu/condor/dagman/.
14. Gregor von Laszewski, Beulah Alunkal, Kaizar Amin, Shawn Hampton, and Sandeep Nijsure. GridAnt: Client-side Workflow Management with Ant. Whitepaper, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, U.S.A., July 2002.
