DEE: A Distributed Fault Tolerant Workflow Enactment Engine for Grid Computing*

Rubing Duan, Radu Prodan, Thomas Fahringer
Institute for Computer Science, University of Innsbruck, Technikerstrasse 21a, A-6020 Innsbruck, Austria
{rubing,radu,tf}@dps.uibk.ac.at
Abstract. It is a large and complex task to design and implement a workflow management system that supports scalable executions of large-scale scientific workflows in distributed and unstable Grid environments. In this paper we describe the Distributed workflow Enactment Engine (DEE) of the ASKALON application development environment for Grid computing. DEE proposes a decentralized architecture that simplifies and reduces the overhead of managing large workflows through partitioning, improved data locality, and reduced workflow-level checkpointing overhead. We propose a systematic approach to understanding the nature of the fault tolerance and enactment engine overheads that impact the distributed Grid execution of workflow applications. We report experimental results for a real-world material science workflow application.

Key words: checkpointing, dependence analysis, distributed enactment engine, fault tolerance, Grid computing, overhead analysis
1 Introduction
Scientific workflows are slowly emerging as one of the most popular paradigms for programming Grid applications. A scientific workflow application can be seen as a (usually large) collection of activities processed in a well-defined order to achieve a specific goal. These activities may be executed on a broad and dynamic set of geographically distributed, heterogeneous resources with no central ownership or control authority, called a computational Grid. In this paper we propose a new Distributed Enactment Engine (DEE) to reliably drive the execution of large-scale scientific workflows in dynamic and unstable distributed Grid environments. A distributed infrastructure provides enhanced fault tolerance against crashes of the engine itself and decreases the overhead of controlling and managing the faults of the various workflow parts. We propose two checkpointing mechanisms that enable Grid workflow executions to recover and resume from serious faults. In addition, we integrate existing fault tolerance techniques
* This research is partially supported by the Austrian Science Fund as part of the Aurora project under contract SFBF1104 and the Austrian Federal Ministry for Education, Science and Culture as part of the Austrian Grid project under contract GZ 4003/2-VI/4c/2004.
like task replication, retry, migration, redundancy, or exception handling. We identify and classify a broad set of overheads to understand the impact of the distributed enactment engine architecture and of the underlying fault tolerance techniques on the distributed execution of Grid applications. We describe several scenarios showing how our design and development are driven by real-world scientific workflow applications from the hydrological and theoretical chemistry fields. Due to space limitations we restrict the scope of this paper to a set of key distinguishing features of ASKALON; descriptions of the other parts can be found in [9], the umbrella project of our work. The paper is organized as follows. Section 2 overviews the ASKALON run-time system architecture. Section 3 describes the architecture of DEE in detail. In Section 4 we present the two approaches taken by DEE to checkpoint and recover the execution of workflows in unstable Grid environments. Section 5 classifies the overheads introduced by the enactment engine and the fault tolerance techniques. In Section 6 we report experimental results that show scalable execution control and recovery through checkpointing for a real-world material science workflow application executed in the Austrian Grid environment [5]. We overview related work in Section 7 and conclude the paper in Section 8.
2 ASKALON Run-time System Architecture
In the ASKALON application development and execution environment for Grid computing [6], the user specifies workflow applications at a high level of abstraction using the Abstract Grid Workflow Language (AGWL) [15], enhanced with a graphical composition service. AGWL is a high-level XML-based language that, with the support of the enactment engine, entirely shields the user from the technology details of the underlying Grid infrastructure (e.g., Web services, Globus toolkit). AGWL has been carefully designed to include an essential set of constructs for specifying large-scale scientific workflow applications (hundreds to thousands of activities) in a compact and intuitive way (e.g., through parallel loops). In addition, the user can plug in properties and constraints that help the ASKALON middleware services, including the enactment engine, optimize the workflow execution. After receiving the full workflow description, the enactment engine takes the responsibility of executing the application on the Grid through the following tasks (see Figure 1):

1. Receive the AGWL description from the composition service and parse it into an internal workflow object representation (see step 1 in Figure 2);
2. Enhance the internal workflow representation with additional concrete information required for the execution. For example, the legacy (Fortran) applications that drive our research [3, 12] need a dedicated execution directory on every Grid site and require the mapping of activity data dependencies to physical files or to executable command line arguments;
3. Interact with the scheduling and resource brokerage services to obtain appropriate ("optimal") workflow mappings onto the available resources (see steps 2-5 in Figure 2);
Fig. 1. The ASKALON run-time system architecture.
4. Submit and control the workflow execution, or restart it from a checkpoint (see steps 9 and 15 in Figure 2);
5. Communicate with the Fault Manager and choose the fault tolerance strategy, for instance according to the prediction data provided by the prediction service (see Section 4). In case of Grid site failures, stop the execution and ask for rescheduling (see step 14 in Figure 2);
6. Send online execution metrics to the monitoring service.
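To make this task sequence concrete, the following is a minimal sketch of such an enactment loop. All interfaces (AgwlParser, Scheduler, FaultManager, Monitor, Workflow, Activity) are hypothetical placeholders standing in for the actual ASKALON services and the GRAM submission layer; the sketch illustrates the control flow above, not the real implementation.

```java
// Sketch of the enactment engine's top-level control flow (tasks 1-6 above).
// All types here are hypothetical placeholders, not the actual ASKALON interfaces.
import java.util.List;

interface AgwlParser   { Workflow parse(String agwlXml); }                // task 1
interface Scheduler    { Workflow map(Workflow workflow); }               // task 3
interface FaultManager { void handle(Activity failed); }                  // task 5
interface Monitor      { void report(Activity activity, String state); }  // task 6

interface Workflow {
    void prepareExecutionDirectories();   // task 2: add concrete info (directories, file mappings)
    List<Activity> readyActivities();
    boolean finished();
}
interface Activity { boolean submitAndWait(); }   // task 4: GRAM submission, simplified

final class EnactmentLoop {
    void execute(String agwl, AgwlParser parser, Scheduler scheduler,
                 FaultManager faults, Monitor monitor) {
        Workflow workflow = scheduler.map(parser.parse(agwl));
        workflow.prepareExecutionDirectories();
        while (!workflow.finished()) {
            for (Activity a : workflow.readyActivities()) {
                boolean ok = a.submitAndWait();
                monitor.report(a, ok ? "COMPLETED" : "FAILED");
                if (!ok) faults.handle(a);          // retry, migrate, or ask for rescheduling
            }
        }
    }
}
```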
3 Distributed Enactment Engine Architecture
DEE proposes the distributed master-slave enactment engine architecture depicted in Figure 3. The master enactment engine first parses the AGWL description into an internal (Java) workflow representation and sends it to the scheduler for an appropriate mapping onto the Grid. After the workflow has been mapped, the master enactment engine partitions it into several sub-workflows according to the sites where the activities have been scheduled and to the type of compound activities; we will present the partitioning mechanism in a future paper. Usually, the large-scale parallel activities (i.e., hundreds of sequential activities) are sent to slave enactment engines. The master enactment engine monitors the execution of the entire workflow as well as the state of the slave enactment engines. The slave enactment engines monitor the execution of their sub-workflows and report to the master whenever activities change their state or when the sub-workflows produce intermediate output data relevant to other sub-workflows. If one of the slave enactment engines crashes and cannot recover, the entire sub-workflow is marked as failed and the master enactment engine asks for rescheduling, re-partitions the workflow, and migrates it to other Grid sites. After the initial scheduling, the master enactment engine also chooses one of the slaves as a
Fig. 2. ASKALON run-time sequence diagram.
backup engine (see the enactment engine on site B in Figure 3). If the master crashes, the backup becomes the master and immediately chooses another slave as backup. We usually place the master and the backup enactment engines on machines with high CPU frequency, large memory, and good wide-area network interconnections. Every enactment engine consists of four main components: the Control Flow Controller, the Data Flow Controller, the Fault Manager, and the Job Submission Controller. Due to space limitations we omit their details.
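The partitioning mechanism itself is not detailed in this paper; the sketch below merely illustrates the kind of grouping the master engine could apply once the scheduler has mapped every activity to a site, so that each slave engine receives the sub-workflow assigned to its site. The ScheduledActivity record, the site and activity names, and the grouping criterion are illustrative assumptions only.

```java
// Sketch: group scheduled activities by the Grid site they were mapped to,
// as a simplified stand-in for the workflow partitioning step.
import java.util.*;

final class PartitionSketch {
    record ScheduledActivity(String id, String gridSite) {}

    /** Group scheduled activities by the site they were mapped to. */
    static Map<String, List<ScheduledActivity>> partitionBySite(List<ScheduledActivity> scheduled) {
        Map<String, List<ScheduledActivity>> partitions = new LinkedHashMap<>();
        for (ScheduledActivity a : scheduled) {
            partitions.computeIfAbsent(a.gridSite(), site -> new ArrayList<>()).add(a);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<ScheduledActivity> scheduled = List.of(
                new ScheduledActivity("lapw1_k1", "siteB"),
                new ScheduledActivity("lapw1_k2", "siteB"),
                new ScheduledActivity("lapw2_k1", "siteC"));
        // Each resulting group would be handed to the slave enactment engine on that site.
        partitionBySite(scheduled).forEach((site, acts) -> System.out.println(site + " -> " + acts));
    }
}
```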
3.1 Distributed Data Management
An important task of the enactment engine is to automatically track data dependencies between activities. A data dependency is specified in AGWL by connecting input and output ports of different activities.
Fig. 3. The Distributed Enactment Engine (DEE) architecture.
At run-time, data ports may map either to data files identified by (GASS-enabled) URLs, or to objects corresponding to abstract AGWL data types. Such data ports, representing input and output data of activities, are distributed across the Grid and are statically unknown. DEE tracks the data dependencies at run-time by dynamically locating the intermediate data, as presented in the remainder of this section.

Run-time Intermediate Data Location. The enactment engine receives from the application developer the AGWL specification of the workflow. Consider, for example, a workflow containing one parallel for loop, each loop iteration consisting of two serial activities A and B. After scheduling, the parallel for loop is translated into a composite parallel activity that unrolls the loop iterations. The activities A and B, the data input ports, and the data output ports (i.e., the file file) are identical in every unrolled iteration of the composite parallel activity. To solve this run-time name clashing problem, the enactment engine associates a run-time identification number (RID) with each activity, generated as follows: RID = RID_parent + "." + new local RID. For example, the parallel (RID = 1) activity in Figure 4 is the parent of the seq (RID = 1.1) activity, which is the parent of the activity A (RID = 1.1.1). We therefore use the RID as an attribute of each activity, including the data
input and data output ports, to retrieve the data from the correct predecessor rather than from a sibling iteration's predecessor. For instance, activity B (RID = 1.1.2) needs the input file file; we compare the RIDs of the two existing candidates, 1.1.2 and 1.2.2, and choose the first one since it has the correct parent identifier 1.1.
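The RID scheme can be summarized in a short sketch: a child RID is its parent's RID with a locally unique suffix appended, and an input port selects the candidate output whose RID shares the consumer's parent prefix. The class and method names below are ours, not DEE's, and serve only to illustrate the lookup.

```java
// Sketch of the run-time identification (RID) scheme described above.
import java.util.*;

final class RidSketch {
    static String childRid(String parentRid, int localIndex) {
        return parentRid + "." + localIndex;              // e.g. childRid("1.1", 1) -> "1.1.1"
    }

    /** Among candidate producer RIDs, prefer the one under the same parent as the consumer. */
    static String resolveProducer(String consumerRid, List<String> candidateRids) {
        String parent = consumerRid.substring(0, consumerRid.lastIndexOf('.'));  // "1.1.2" -> "1.1"
        return candidateRids.stream()
                .filter(rid -> rid.startsWith(parent + "."))
                .findFirst()
                .orElseThrow(() -> new NoSuchElementException("no producer under " + parent));
    }

    public static void main(String[] args) {
        // Activity B in the first loop iteration (RID 1.1.2) needs the file produced by A;
        // the copy from its own iteration (1.1.1) wins over the one from iteration two (1.2.1).
        System.out.println(resolveProducer("1.1.2", List.of("1.2.1", "1.1.1")));
    }
}
```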
Fig. 4. Run-time intermediate data location.
Transfer of Data Collections. AGWL introduces the notion of collection, which is a set of output data produced by one activity or (most likely) by a parallel activity. A collection contains zero or more elements, where each element has an arbitrary well-defined data type, an index, and an optional name. The name and the index can be used to directly access specific elements. A collection can be used like a regular data type and is produced by almost all the real-world workflow applications that we studied, including WIEN2k and Invmod. During our research on modelling and executing real-world applications, we identified six types of data collection transfer which we did not find in other existing workflow management systems [7, 11, 14, 16, 17]:

– the collection is produced by one activity and is consumed by another activity (see Figure 5(a));
– the collection is produced by one activity and is consumed by every activity of a parallel activity (see Figure 5(b));
– the collection is produced by one parallel activity and is entirely consumed by another activity (see Figure 5(c));
– the collection is produced by one activity and each i-th collection element is consumed by the i-th activity of a parallel activity (see Figure 5(d));
– the collection is produced by one parallel activity and the entire collection is consumed by every individual activity of another parallel activity (see Figure 5(e));
– the collection is produced by one parallel activity and every i-th collection element is consumed by the i-th activity of another parallel activity (see Figure 5(f)).

The management of data collection transfers is the hardest problem of the DEE data dependence analysis. The problem is further complicated when several types of collection transfer are mixed within the same workflow application;
Fig. 5. Collection transfer cases.
for example, cases 5(a), 5(c), 5(e), and 5(f) all occur in the real-world material science application presented in Section 6. DEE has a run-time data transfer merge mechanism that avoids transferring the same data twice, for instance when a large collection is required by multiple activities scheduled on the same Grid site (see Figure 5(d)). In the future, we will also implement a run-time activity merge mechanism to decrease the Grid job submission overhead for multiple activities scheduled on the same site (usually of the order of 20 seconds in our GRAM configuration).
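The following sketch illustrates one plausible form of such a transfer merge: requests for the same (source, destination site) pair are collapsed before any third-party transfer is initiated, so a large collection needed by several activities on one site is moved only once. The Transfer record and the GridFTP-style URLs are illustrative assumptions, not DEE's actual data structures.

```java
// Sketch of a run-time data transfer merge: duplicate (source, destination) requests
// are collapsed so the same data is not transferred twice to the same Grid site.
import java.util.*;

final class TransferMergeSketch {
    record Transfer(String sourceUrl, String destinationSite) {}

    static Collection<Transfer> merge(List<Transfer> requested) {
        // LinkedHashSet keeps the original order while dropping duplicate transfers.
        return new LinkedHashSet<>(requested);
    }

    public static void main(String[] args) {
        List<Transfer> requested = List.of(
                new Transfer("gsiftp://siteA/collection.tar.gz", "siteB"),
                new Transfer("gsiftp://siteA/collection.tar.gz", "siteB"),  // second consumer, same site
                new Transfer("gsiftp://siteA/collection.tar.gz", "siteC"));
        merge(requested).forEach(System.out::println);   // two transfers remain instead of three
    }
}
```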
4 Checkpointing
Checkpointing and recovery are fundamental techniques for saving the application state during normal execution and restoring the saved state after a failure, so as to reduce the amount of lost work. There are two traditional approaches to checkpointing:

System-level checkpointing saves to disk the image of an entire process, including registers, stack, code, and data segments. This is obviously not applicable to large-scale Grid applications;
Application-level checkpointing is usually implemented within the application source code by the programmer, or is automatically added to the application using compiler-based tools. A typical checkpoint file contains the data and stack segments of the process, as well as information about open files, pending signals, and the CPU state.

We concentrate our approach on application-level checkpointing. Since it is not always possible to checkpoint everything that can affect the program behaviour, it is essential to identify what must be included in a checkpoint to guarantee a successful recovery. For Grid workflow applications, a checkpoint consists of:
– the state of the workflow activities;
– the state of the data dependencies.

DEE checkpoints a workflow application upon precise checkpointing events defined, for instance, through AGWL property and constraint constructs. Typical checkpointing events occur when an activity fails, after the completion of a significant number of activities (e.g., workflow phases, parallel or sequential loops), or after a user-defined deadline (e.g., a percentage of the overall expected or predicted execution time). Other checkpointing events happen upon rescheduling certain workflow parts due to the dynamic availability of Grid resources or due to a variable or statically unknown number of activities in workflow parallel regions. Upon a checkpointing event, the Control Flow Controller invokes the Fault Manager, which stops the execution of the workflow and saves the state and the intermediate data into a checkpoint database. We classify the checkpointing mechanisms in DEE as follows:

Activity-level checkpointing saves the registers, stack, and memory of every individual activity running on a certain processor. The advantage of the activity-level checkpoint is that an individual activity can recover. At the moment we do not support activity-level checkpointing, but we plan to integrate traditional system-level checkpointers like Condor [14] or MOSIX [1];

Light-weight workflow checkpointing saves the workflow state and URLs to the intermediate data (together with additional semantic information that characterizes the physical GASS URLs). This control-flow checkpoint is very fast because it does not back up the intermediate data. The disadvantage is that the intermediate data remains stored on possibly unsecured and volatile file systems;

Workflow checkpointing saves the workflow state and the intermediate data at the point when the checkpoint is taken. The advantage of workflow checkpointing is that it completely backs up the intermediate data, such that the execution can be restored at any time from any Grid location. The disadvantage is that the checkpointing overhead grows significantly for large intermediate data.

From the checkpointing perspective, the execution of one activity traverses three distinct stages: data preparation, job execution, and data registration with the
Data Flow Controller. Therefore, when the enactment engine calls for a checkpoint, there are five possible job states: the job execution stage accounts for all the states returned by GRAM, our job submission interface to the Grid (SUBMITTED, QUEUED, ACTIVE, COMPLETED, and FAILED).

Definition 1. Let $W = (AS, CFD, DFD)$ denote a workflow application, where $AS$ is the set of activities, $CFD = \{(A_{from}, A_{to}) \mid A_{from}, A_{to} \in AS\}$ the set of control flow dependencies, and $DFD = \{(A_{from}, A_{to}, Data) \mid A_{from}, A_{to} \in AS\}$ the set of data flow dependencies, where $Data$ denotes the workflow intermediate data, usually instantiated by a set of files and parameters. Let $State : AS \rightarrow \{Executed, Unexecuted\}$ denote the execution state function of an activity $A \in AS$. The workflow checkpoint is defined by the following set of tuples: $CKPT_W = \{(A, State(A), Data) \mid \forall\, A, A_{from} \in AS \wedge State(A) = Unexecuted \wedge State(A_{from}) = Executed \wedge (A_{from}, A, Data) \in DFD\}$.

As we can notice, there are two possible options for the checkpointed state of an executing activity (i.e., one in the job execution stage). We propose three solutions to this problem:

1. We let the job run and regard the activity as Unexecuted;
2. We wait for the activity to terminate and set the state to Executed if the execution was successful; otherwise, we set the state to Unexecuted.

Neither solution is obviously perfect and, therefore, we propose a third option that uses the predicted execution time of the job:

3. Delay the checkpoint for a (significantly shorter) amount of time, based on the following parameters:
– Predicted execution time (PET) is the time the activity is expected to execute. DEE obtains the predicted execution time from the prediction service, which uses regression functions over historical execution data;
– Checkpoint deadline (CD) is a pre-defined maximum time the checkpoint can be delayed;
– Job elapsed time (JET) is the job execution time from the beginning up to the present moment.

We compute the state of an activity $A$ using the following formula:
$$State(A) = \begin{cases} Unexecuted, & PET - CD \ge JET; \\ Executed, & PET - CD < JET. \end{cases}$$

This solution reduces the checkpoint overhead and lets the checkpoint complete within a shorter time frame. Another factor that affects the overhead of the workflow checkpoint is the size of the intermediate data to be checkpointed. We propose two solutions to this problem:
Output data checkpointing stores all the output files of the executed activities that were not previously checkpointed;

Input data checkpointing stores all the input files of the not yet executed activities that will be used later in the execution.

For a centralized enactment engine, input data checkpointing is obviously the better choice because it ignores all the intermediate data that will not be used again, which saves file transfer (backup) overhead. In the case of DEE, the slave enactment engines do not know which intermediate data will be used later and, therefore, must use the output data checkpointing mechanism. This solution is nevertheless efficient, since each slave enactment engine performs its checkpoint locally, which saves significant network file transfer overhead.
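The delayed-checkpoint decision above reduces to a single comparison. The sketch below shows that comparison in isolation, assuming PET, CD, and JET are already available in seconds; in the real engine, PET comes from the prediction service and the resulting state is stored in the checkpoint database.

```java
// Sketch of the delayed-checkpoint decision from Section 4: an activity still running when
// a checkpoint is requested is recorded as Executed only if PET - CD < JET (it is expected
// to finish within the allowed delay); otherwise it is recorded as Unexecuted.
enum ActivityState { EXECUTED, UNEXECUTED }

final class CheckpointDecisionSketch {
    /**
     * @param predictedExecutionTime PET, expected activity run time (seconds)
     * @param checkpointDeadline     CD, maximum time the checkpoint may be delayed (seconds)
     * @param jobElapsedTime         JET, time the job has already been running (seconds)
     */
    static ActivityState stateAtCheckpoint(double predictedExecutionTime,
                                           double checkpointDeadline,
                                           double jobElapsedTime) {
        return (predictedExecutionTime - checkpointDeadline >= jobElapsedTime)
                ? ActivityState.UNEXECUTED   // too much work left: do not wait for the activity
                : ActivityState.EXECUTED;    // nearly done: delay briefly and record it as executed
    }

    public static void main(String[] args) {
        System.out.println(stateAtCheckpoint(600, 60, 580)); // EXECUTED   (600 - 60 = 540 < 580)
        System.out.println(stateAtCheckpoint(600, 60, 100)); // UNEXECUTED (540 >= 100)
    }
}
```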
5 Overhead Analysis
The ultimate goal of the ASKALON Grid computing environment is to support reliable, high-performance execution of scientific applications on the Grid. Fault tolerance techniques and distributed execution have important advantages that ensure fast and proper completion of the application; however, they are also the source of a set of additional overheads. The nature of these overheads and their contribution to the overall execution time is the scope of this section. Figure 6 presents a hierarchical classification of the overheads from the enactment engine perspective. We divide the enactment engine overheads into six main categories, as follows:

Middleware overhead is due to the communication with the middleware services, as follows:
– Schedule overhead represents the time taken by the remote scheduler to appropriately map the workflow activities onto the Grid;
– Reschedule overhead represents the time to re-map the failed workflow activities onto other Grid resources;
– Resource brokerage overhead accounts for the time needed by the resource broker to provide the resources requested by the enactment engine;
– Database overhead represents the time to access the remote databases (e.g., the checkpoint database);
– Slave enactment engine overhead represents the time for communicating with other remote enactment engines.

Execution control overhead consists of the following sub-overheads required to control the execution of the workflow:
– Data dependence overhead represents the time taken by the enactment engine to dynamically analyze and optimize (i.e., archive, compress) the data dependencies and to decrease the file transfer size and number;
– Control flow overhead represents the time taken to process the control flow dependencies, for example to fork a set of activities at the beginning of a parallel region or to join (synchronize) them at the end;
Fig. 6. Enactment engine overhead classification.
– Job queue overhead represents the time taken to control the maximum number of parallel jobs submitted to one Grid site. This avoids overloading the GRAM-PBS gatekeepers on slower front-end machines, which may otherwise crash (in our Globus installation).

Fault tolerance overhead comprises:
– Checkpoint overhead represents the time taken to stop the execution of the workflow and send its state to the checkpoint database;
– Restore overhead represents the time taken to restore the workflow from the checkpoint;
– Retry overhead represents the time taken to retry the failed activity on the same or on a different Grid site;
– Migration overhead represents the time taken to checkpoint, reschedule, and restart or resume the activity from the checkpoint.
Workflow preparation overhead comprises:
– Environment set-up overhead is the time needed to prepare the execution environment, for example to create the directory structure required by legacy applications;
– Partitioning overhead is the time required to partition the workflow into smaller parts to be executed by the slave enactment engines;
– Optimization overhead represents the time required to optimize the workflow at run-time based on the computed schedule. For example, the enactment engine can merge several activities to be executed on the same site (to reduce the GRAM job submission middleware overhead) or group together multiple data transfers between the same Grid sites (for subsequent archiving and compression).

Data transfer overhead covers any kind of data transfer caused by data dependencies. This includes:
– Input from user overhead (interactive);
– Input from scheduler overhead for run-time intermediate data location (see Section 3.1);
– Third party data transfer overhead between two remote Grid sites;
– Collection data compression overhead for archiving and compressing a collection of data before initiating a third party data transfer;
– Data stage-in overhead from the local user machine to the remote Grid site;
– Data stage-out overhead of the remote workflow output to the local user machine.

Job management overhead comprises the following sub-overheads:
– Preparation overhead corresponds, for instance, to the time required to uncompress data archives or create directory structures;
– Submission overhead represents the time required by GRAM to submit the job;
– Polling overhead is the time required to poll for job termination (usually ten seconds in our GRAM configurations);
– Queue overhead is related to jobs blocked in the queuing system (e.g., PBS) of the parallel machines available as Grid sites.
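One simple way to obtain numbers for this classification is to accumulate timed spans under hierarchical category labels and report them as a fraction of the total execution time, as in the sketch below. The category strings follow Figure 6, but the accounting API itself is our assumption and not part of DEE; the numeric values are purely illustrative.

```java
// Sketch of accumulating measured overheads against the hierarchical classification above.
import java.util.*;

final class OverheadAccountingSketch {
    private final Map<String, Long> overheadMillis = new TreeMap<>();

    /** Record a timed span under a hierarchical category such as "fault tolerance/checkpoint". */
    void record(String category, long startMillis, long endMillis) {
        overheadMillis.merge(category, endMillis - startMillis, Long::sum);
    }

    /** Print each category's total and its share of the overall execution time. */
    void print(long totalExecutionMillis) {
        overheadMillis.forEach((category, millis) ->
                System.out.printf("%-40s %8d ms (%.3f%%)%n",
                        category, millis, 100.0 * millis / totalExecutionMillis));
    }

    public static void main(String[] args) {
        OverheadAccountingSketch sheet = new OverheadAccountingSketch();
        sheet.record("middleware/schedule", 0, 12_000);
        sheet.record("data transfer/third party", 20_000, 95_000);
        sheet.record("fault tolerance/checkpoint", 100_000, 104_000);
        sheet.print(1_800_000);   // illustrative total execution time
    }
}
```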
6 Experiments
In this section we show experimental results that evaluate our approach on a real-world material science workflow application. WIEN2k [3] is a program package for performing electronic structure calculations of solids using density functional theory, based on the full-potential (linearized) augmented plane-wave ((L)APW) and local orbital (lo) method. We have ported the WIEN2k application to the Grid by splitting the monolithic code into several coarse-grain activities coordinated in a workflow. The lapw1 and lapw2 tasks can be solved in parallel by a fixed number of so-called k-points. A final activity, converged, applied to several output files tests whether the problem convergence criterion is fulfilled. The number of recursive loops is statically unknown.
Table 1 summarizes the Austrian Grid testbed that we used for these experiments. The Grid sites are ranked according to their individual speed in executing the WIEN2k application.

Rank  Site         Architecture            # CPU, GHz         Job Manager  Location
1     altix1.jku   NUMA, SGI Altix 3000    10 Itanium 2, 1.6  Fork         Linz
4     gescher      COW, Gigabit Ethernet   10 Pentium 4, 3    PBS          Vienna
2     altix1.uibk  NUMA, SGI Altix 350     10 Itanium 2, 1.6  Fork         Innsbruck
3     schafberg    NUMA, SGI Altix 350     10 Itanium 2, 1.6  Fork         Salzburg
5     agrid1       NOW, Ethernet           10 Pentium 4, 1.8  PBS          Innsbruck
6     arch19       NOW, Ethernet           10 Pentium 4, 1.8  PBS          Innsbruck

Table 1. The Austrian Grid testbed.
We use a WIEN2k problem size of 100 parallel k-points, which means a total of over 200 workflow activities. We started by executing the workflow on the fastest Grid site available (in Linz) and then incrementally added new sites to the execution environment, as presented in Table 1. Figure 7(a) shows that WIEN2k benefits considerably from a distributed Grid execution for up to three sites. The improvement comes from the parallel execution of WIEN2k on multiple Grid sites, which significantly decreases the computation time of the lapw1 and lapw2 parallel sections. Beyond four Grid sites we did not obtain further improvements, due to a temporarily slow interconnection network of 1 Mbit per second to the Grid site in Salzburg. As expected, the overheads increase with the number of aggregated Grid sites, as shown in Figures 7(c) (5.669%) and 7(d) (25.933%). We can rank the measured overheads by importance as follows: data transfer overhead, load imbalance overhead, workflow preparation overhead, middleware overhead, and job preparation overhead.

We configured DEE to perform a checkpoint after each main phase of the WIEN2k execution: lapw0, lapw1, and lapw2. Moreover, we configured the master enactment engine to perform input data checkpointing and the slave engines to perform output data checkpointing. Figure 8(a) compares the overheads of the light-weight workflow checkpointing and of the workflow-level checkpointing for a centralized and a distributed enactment engine. The overhead of the light-weight workflow checkpointing is very low and relatively constant, since it only stores the workflow state and URLs to the intermediate data. The overhead of workflow-level checkpointing for a centralized enactment engine increases with the number of Grid sites because more intermediate data must be transferred to the checkpoint database. For a distributed enactment engine, the workflow-level checkpointing overhead is much lower, since every slave enactment engine uses a local checkpoint database to store the intermediate data files, which eliminates the wide area network file transfers.

Figure 8(b) presents the gains obtained in the single-site workflow execution through checkpointing. We define the gain as the difference between the timestamp $T_{ckpt}$ at which the last checkpoint is performed and the timestamp $T_{ckpt}^{prev}$ of the previous checkpoint: $Gain = T_{ckpt} - T_{ckpt}^{prev}$.
Fig. 7. Enactment Engine overheads: (a) scalability; (b) distribution of parallel activities; (c) all overheads on one Grid site; (d) all overheads on two Grid sites.

The biggest gains are obtained after checkpointing the parallel regions lapw1 and lapw2. The gain for workflow-level checkpointing is lower, since it subtracts the time required to copy the intermediate data to the checkpoint database. On two sites using two enactment engines (one master and one slave), the gain obtained is only half, if we assume that only one of the two enactment engines crashes. Therefore, a distributed enactment engine incurs lower losses upon failures, which are isolated in separate workflow partitions. Figure 8(c) shows that the size of the data checkpointed at the workflow level is bigger than the overall size of the intermediate data transferred for a small number of Grid sites (up to three, when scalability is achieved). Beyond four sites, the size of the intermediate data exceeds the workflow-level checkpointing data size. The data size of the light-weight workflow checkpointing is, of course, negligible. The number of files transferred preserves, more or less, this behavior (see Figure 8(d)).
7 Related Work
Checkpointing is one of the most important tasks required for fault tolerance. The DAGMan [14] and Pegasus [8] workflow management systems support activity-level checkpointing and restart techniques. However, they do not support checkpointing at the workflow level.
Fig. 8. Enactment Engine checkpointing results: (a) checkpointing overhead comparison; (b) checkpoint gains; (c) size of data transferred; (d) number of files transferred.
Other Grid workflow management projects [4, 13, 17] also do not consider workflow-level checkpointing, because their intermediate data management is independent of the control flow management system. GrADS [2] only supports rescheduling at the activity level, but not at the workflow level. Rescheduling at the workflow level, as done in ASKALON with support from DEE, has the potential of producing better schedules, since it reconsiders the entire remaining sub-workflow for an optimized mapping onto the Grid resources. None of the existing works [4, 10, 11, 17] is based on a distributed architecture like the one proposed in this paper. A distributed architecture improves the overall fault tolerance, decreases the overhead of the enactment engine and of the checkpointing, reduces the losses upon enactment engine failures, improves data locality, and reduces the complexity of typically large-scale scientific workflows through partitioning. Finally, we introduced a systematic approach to understanding the overheads that the enactment engine and the fault tolerance techniques add to the distributed Grid execution. We have not seen a similar systematic approach in any of the related works.
8 Conclusion
In this paper we presented the motivation, features, design, and implementation of a distributed workflow enactment engine, which is part of the ASKALON application development environment for Grid computing. A distributed architecture has several advantages over a centralized approach: it improves the overall fault tolerance of the enactment engine itself, improves data locality through appropriate workflow partitioning, reduces the enactment engine overhead due to simplified control flow and data flow structures, and reduces the losses upon enactment engine crashes. We have defined and implemented two approaches to workflow-level checkpointing that improve the robustness of workflow executions in unstable Grid environments. We have demonstrated our techniques for two real-world workflow applications from the theoretical chemistry and hydrological fields. Future work aims at improving the enactment engine in various aspects, including scalability, workflow optimization, and workflow partitioning. We will also study new real-world applications from the meteorological and astrophysics domains.
References

1. Amnon Barak and Oren La'adan. The MOSIX multicomputer operating system for high performance cluster computing. Future Generation Computer Systems, 13(4-5):361-372, 1998.
2. Francine Berman, Andrew Chien, Keith Cooper, Jack Dongarra, Ian Foster, Dennis Gannon, Lennart Johnsson, Ken Kennedy, Carl Kesselman, John Mellor-Crummey, Dan Reed, Linda Torczon, and Rich Wolski. The GrADS Project: Software support for high-level Grid application development. The International Journal of High Performance Computing Applications, 15(4):327-344, 2001.
3. P. Blaha, K. Schwarz, G. Madsen, D. Kvasnicka, and J. Luitz. WIEN2k: An Augmented Plane Wave plus Local Orbitals Program for Calculating Crystal Properties. Institute of Physical and Theoretical Chemistry, Vienna University of Technology, 2001.
4. Junwei Cao, Stephen A. Jarvis, Subhash Saini, and Graham R. Nudd. GridFlow: Workflow Management for Grid Computing. In Proceedings of the 3rd International Symposium on Cluster Computing and the Grid (CCGrid 2003), Tokyo, Japan, May 2003. IEEE Computer Society Press.
5. The Austrian Grid Consortium. http://www.austriangrid.at.
6. Rubing Duan, Thomas Fahringer, Radu Prodan, Jun Qin, Alex Villazon, and Marek Wieczorek. Real World Workflow Applications in the ASKALON Grid Environment. In European Grid Conference (EGC 2005), Lecture Notes in Computer Science. Springer Verlag, February 2005.
7. Dietmar W. Erwin and David F. Snelling. UNICORE: A Grid computing environment. Lecture Notes in Computer Science, 2150, 2001.
8. Ewa Deelman, Jim Blythe, Yolanda Gil, Carl Kesselman, Gaurang Mehta, Karan Vahi, Albert Lazzarini, Adam Arbree, Richard Cavanaugh, and Scott Koranda. Mapping abstract complex workflows onto Grid environments. Journal of Grid Computing, 1(1):9-23, 2003.
9. T. Fahringer. ASKALON - A Programming Environment and Tool Set for Cluster and Grid Computing. http://dps.uibk.ac.at/askalon, Institute for Computer Science, University of Innsbruck.
10. Soonwook Hwang and Carl Kesselman. Grid workflow: A flexible failure handling framework for the Grid. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC-12), pages 126-137, Seattle, WA, USA, June 2003. IEEE Computer Society Press.
11. IT Innovation. Workflow enactment engine, October 2002. http://www.itinnovation.soton.ac.uk/mygrid/workflow/.
12. K. Jasper and J. Schulla. The hydrological model WaSiM-ETH, 1999. http://www.nccr-climate.unibe.ch/download/wp3/p32/p32 wasim.html.
13. T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M.R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045-3054, 2004.
14. The Condor Team. DAGMan (Directed Acyclic Graph Manager). http://www.cs.wisc.edu/condor/dagman/.
15. Thomas Fahringer, Jun Qin, and Stefan Hainzer. Specification of Grid Workflow Applications with AGWL: An Abstract Grid Workflow Language. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid 2005 (CCGrid 2005), Cardiff, UK, May 2005. IEEE Computer Society Press.
16. Gregor von Laszewski, Beulah Alunkal, Kaizar Amin, Jarek Gawor, Mihael Hategan, and Sandeep Nijsure. The Java CoG Kit User Manual. MCS Technical Memorandum ANL/MCS-TM-259, Argonne National Laboratory, March 14, 2003.
17. Gregor von Laszewski, Beulah Alunkal, Kaizar Amin, Shawn Hampton, and Sandeep Nijsure. GridAnt - Client-side Workflow Management with Ant. Whitepaper, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, U.S.A., July 2002.