Workflow Automation for Processing Plasma Fusion Simulation Data ∗

Norbert Podhorszki, Bertram Ludäscher
Department of Computer Science, University of California, Davis, USA
{pnorbert,ludaesch}@cs.ucdavis.edu

Scott A. Klasky
Oak Ridge National Laboratory, Oak Ridge, TN 37831-6008, USA
[email protected]

ABSTRACT
The Center for Plasma Edge Simulation project aims to automate simulation monitoring, data processing, and result archival using scientific workflows. We describe these tasks and requirements, and the newly developed Kepler workflow components that provide the required functionality for our automated workflow solution. Besides functionality, the focus is on robust execution of the workflow. To this end, a user-level checkpointing model has been developed that allows a workflow to restart and re-execute operations that failed during a previous run.
Categories and Subject Descriptors: D.0 [Software]: General
General Terms: Design, Reliability
Keywords: Kepler, SSH, checkpointing, plasma fusion, scientific workflow

1. INTRODUCTION

Two top priorities in the U.S. Department of Energy (DOE) are the International Thermonuclear Experimental Reactor (ITER) and high-performance computing. A series of recent SciDAC Fusion Simulation Projects has been addressing the many challenging issues of modeling a device such as ITER on the largest supercomputers built for scientific discovery, maintained at National Leadership Computing Facilities such as Oak Ridge National Laboratory (ORNL) and Argonne National Laboratory (ANL). The concept of a fusion reactor is based on toroidal plasma confinement provided by external coil arrays and an internal electrical current in the plasma. If the hot plasma (the hottest substance on earth) is allowed to touch the wall of the reactor, a very cool surface, it can sputter wall material into the plasma, extinguishing the fusion burn and shortening the wall lifetime to an unacceptable level. To alleviate this problem, ITER is being designed to divert this escaping edge plasma to a specific location.
∗ Work supported by DOE grants DE-FC02-01ER25486 (SciDAC/SDM) and DE-AC02-05CH11231 (CPES)

Copyright 2006 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. WORKS’07, June 25, 2007, Monterey, California, USA. Copyright 2007 ACM 978-1-59593-715-5/07/0006 ...$5.00.

The Center for Plasma Edge Simulation Project (CPES) is creating an integrated predictive plasma edge simulation code package. Its goal is to understand edge localized mode (ELM) instabilities, which dramatically reduce the confinement properties of the plasma and shorten the wall's lifetime. The project created two different particle simulation codes, XGC-0 and XGC-1 [13]. Both XGC simulations run on large numbers of processors, due to the nature of Particle-In-Cell (PIC) simulations [4]. These codes have been shown to scale efficiently to over 10,000 processors at ORNL. With ever-increasing processing power, fusion simulation scientists are starting to encounter problems in large-scale data management. One of the main challenges is maximizing the efficiency of their codes on new supercomputers. Typically, XGC and GTC [17] simulations can last for over 20 hours on over 8,000 processors. These simulations run on batch systems. To determine whether a simulation is running properly, the scientist generally logs into the machine where the program is running with Ssh and then performs various job management commands to see if the simulation has started or is currently running. Once the scientist sees that the simulation is running, s/he typically runs gnuplot on an ASCII data file produced by the simulation (containing volume-averaged quantities) to determine the fidelity of the simulation. Although this is the common technique scientists use, they realize that it is a very compact representation of their data, and they typically know that they need to move the "raw" data over to another computer for data analysis. Simulations typically produce about 40 GB of data to be analyzed every hour.
The typical GTC/XGC scientist uses bbcp [24], a point-to-point network file transfer tool capable of transferring files at nearly line speed over the WAN, to transfer files from the DOE leadership computing facilities to their home institutions, e.g., Princeton Plasma Physics Laboratory (PPPL) or New York University. Transferring 40 GB of data using bbcp over an OC3 link (the link from ORNL to PPPL) takes about one hour, assuming there are no failures in the data transfer. Typically, scientists see many failures when transferring data over the WAN and must regularly check this operation in order to guarantee continuous success. Once the data actually arrives on the scientist's desktop, it must be processed through an analysis program and then visualized. At this stage a scientist often has a better idea about the fidelity of the simulation. Since this checking process requires a large amount of manual work, scientists generally do it only a few times per day. Often they notice instabilities, problems with input

files, etc. by doing more detailed analysis long before those problems show up in the aggregated ASCII files. Once the simulation is over, the data must be backed up on the tape system, e.g., the High-Performance Storage System (HPSS) at ORNL. This process is usually not automated, and scientists must remember to back up their raw data, since large data files can be removed from the system seven days after the simulation. Since simulations can comprise many runs and go on for several weeks, data must be backed up in the meantime. The use of workflow automation relieves scientists from this data management burden, allowing them to focus instead on the science aspects of the simulation.

1.1 CPES Automation Tasks
The simulation codes used in the CPES project (XGC-0, XGC-1, GTC) run on thousands of processors. They are timestep-based programs, writing their state to disk regularly; usually not at every timestep, to save disk space, but rather every five or ten timesteps. One crucial issue is the data format of the output. It was shown in [2] that the GTC simulation can spend over 20% of its simulation time writing out data in HDF5 format; experience with XGC on a Cray XT3 machine yielded similar results. As these simulations use costly computing resources, precious allocation time is wasted. To speed up I/O, XGC writes data in a special binary format, and the file conversion is done separately on a secondary, less costly system. Thus the first step for automation is to watch the simulation; whenever a timestep is written completely, the data should be transferred to the secondary resource and converted there to HDF5 with the provided tool. After the transfer and conversion, all data files should be archived. When the target applications reach full maturity, they will produce tens of terabytes of data with each run. We do not expect to have space to store all data on disk for a complete simulation run. Therefore, files should be transferred to a mass storage system on the fly and then removed from the disk to make space for more data coming from the simulation. Moreover, the mass storage administrators require that archive files be neither "too small" nor "too large". Practically, this means that neither individual files nor the complete simulation output as a whole should be sent to the archiving system. Thus, the automated solution must pack files into larger chunks (akin to SRB "containers"), taking care that all data for a timestep goes into the same archival chunk. Besides archiving simulation data, the users also want to get regular status updates about the simulation.
This is particularly important for large, long-running simulations. Such simulations have a tendency to "go wrong", e.g., by not converging where they should, or by magnifying artifacts due to errors. With a full-day run on 10,000 processors, a "bad" simulation can waste over 200,000 CPU hours from a project's limited allocation. It is therefore important to get early insight into an ongoing simulation, in order to terminate a bad simulation as early as possible. The difficulty behind automation is that users usually do not know in advance exactly what information they want to see, and in what presentation. The automated solution thus should include further ad-hoc processing capabilities on the files and some ways of presentation (e.g., a simple copy of image files into a specific directory).

In our setting, data processing operations (including the transfers between hosts) take as much time as, or more than, the simulation needs to compute a timestep. Hence a completely sequential solution would not be able to keep up with the simulation itself. Instead it is important to create a concurrent solution that processes data on the fly with as much parallel processing as possible: e.g., when files are transferred to the processing host, the first file can be converted while the second file is being transferred (i.e., in a pipeline-parallel fashion), or images can be created in parallel for several data files on different nodes of the cluster. To summarize, the following tasks should be automated:
• Watch a running simulation for its data output.
• Transfer output data to another machine.
• Convert each data file to HDF5 format.
• Archive the HDF5 files to a mass storage system in large chunks.
• Execute analysis/visualization tools on the data files and continuously present current results to the user.
The following are requirements for any automated solution:
1. Security: ORNL uses SecurID one-time passwords to access its resources. The automated tool has to deal with this fact when regularly checking something on one resource, transferring data between resources, and executing programs on another resource.
2. Configurability: the resources, programs, and their settings are not fixed. The solution should be able to work with a simulation running, for instance, at NERSC1 while still processing the results at ORNL.
3. Robustness: if something goes wrong with the processing of a timestep, the software should be able to continue working on future timesteps.
4. Restartability: if something goes wrong, we just want to restart the software and have it continue where it previously failed. Of course, we should not start over from the very beginning, copying all files again!
5. On-the-fly runs: the approach should work during the simulation run.
Files should be moved as soon as possible (but not earlier): the concurrent situation where the simulation is still writing data into a file while the copy has already started should be avoided.
6. Parallelism: process data files in (pipeline- or otherwise) parallel fashion to keep up with the speed of data generation.
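The pipeline-parallel requirement above can be sketched with two worker threads connected by a queue. This is a toy Python model, not Kepler code; the transfer and convert callables stand in for the real external commands.

```python
import queue
import threading

def run_pipeline(files, transfer, convert):
    """Run two stages concurrently: while file i is being converted,
    file i+1 can already be transferred (pipeline parallelism)."""
    q = queue.Queue()
    results = []

    def transfer_stage():
        for f in files:
            q.put(transfer(f))   # e.g. copy a file to the processing host
        q.put(None)              # end-of-stream marker

    def convert_stage():
        while (item := q.get()) is not None:
            results.append(convert(item))  # e.g. binary -> HDF5

    t1 = threading.Thread(target=transfer_stage)
    t2 = threading.Thread(target=convert_stage)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results
```

With real transfer and conversion commands plugged in, the two stages overlap in time exactly as the requirement demands, and further stages can be chained with additional queues.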

1.2 Computational Resources for CPES
The largest supercomputer at ORNL, a Cray XT4 machine called Jaguar, currently consists of 6,296 dual-core processors and is scheduled to be gradually upgraded to a 250 TFlop machine by the end of 2007, and to a PetaFlop machine by the end of 2008. The XGC-based simulations run on these machines using a batch environment.
1 National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory (LBL)
The system is also

connected to a Lustre parallel file system, so jobs can efficiently write data in parallel. For the data processing purposes of the CPES project, a secondary AMD Opteron cluster called Ewok is installed close to Jaguar. It currently consists of 80 dual-processor machines, connected by an InfiniBand network, with a Lustre parallel file system providing 20 TB of shared disk space. Users on the cluster can connect to the HPSS system of ORNL with the hsi (HPSS interface) tools, without further authentication, to their own accounts. ORNL introduced one-time passwords to increase the security of its systems. The only way to log into a resource from the outside is by typing a new passcode generated by the RSA SecurID "token" given to the user. As an exception, Ssh authentication in the direction from Jaguar to Ewok is host-based, that is, users can, e.g., copy files from Jaguar to Ewok without typing a new passcode for each transfer command. The other supercomputer used for CPES simulations is located at NERSC: Seaborg is an IBM SP RS/6000 system with 6,080 processors. The project has a large allocation on this system as well.

1.3 Kepler Workflow Automation Tool
Kepler [18] is an open-source scientific workflow system developed in collaboration among several projects from different scientific and technology areas. The SciDAC/SDM (Scientific Data Management) Center project [28] targets application domains such as the CPES plasma fusion simulation. Kepler is based on Ptolemy II [6], a modeling tool for heterogeneous, concurrent systems. One advantage of Ptolemy II lies in a modeling and design paradigm called actor-oriented modeling [16], in which data-centric computational components (actors) and control-flow-centric scheduling and activation mechanisms (frameworks) are clearly separated. This has proven essential for dealing with the complex design issues of scientific workflows [3]. In Kepler, a workflow is viewed as a composition of independent actor components that have input/output ports for communication and parameters to customize actor behavior. The execution semantics of such (often nested) workflow graphs is specified by a separate director component, which defines the overall model of computation and how actors communicate with one another. For example, the Process Network (PN) model [12, 15] allows us to create processing pipelines on data streams with pipeline-parallel execution. Kepler extensions to Ptolemy II include numerous (currently over 350) actors and capabilities for scientific workflows, e.g., web service actors and harvester [1], Grid actors using Globus, the Java CoG Kit and Griddles, SRB and database actors, file operation and command-line execution actors, etc. Additional components are constantly added, for instance for statistics packages (such as R), GIS functionality (e.g., Grass and ArcIMS couplings), and other scientific data analysis and visualization capabilities like Matlab and SCIRun. Kepler also provides extension frameworks for handling semantic types, recording provenance, and accessing web-based actor repositories.
A Kepler workflow is a graph of actors, where an actor itself can be a (nested) workflow, possibly under the control of a different director than the top level. The main CPES actors, for example, are internally dynamic dataflow workflows executing sequentially in a single thread, while the outer CPES workflow is a process network in which actors execute concurrently, i.e., task- and pipeline-parallel. During execution, data is encapsulated in tokens that flow between actors according to a schedule determined by the director.
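As a rough analogy only (not Ptolemy's actual API), the actor/director separation can be sketched like this: an actor maps input tokens to output tokens through its ports, while a director decides how and when actors fire.

```python
class Actor:
    """Toy actor: consumes one token, produces one token per firing."""
    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, token):
        return self.func(token)

class SequentialDirector:
    """Toy director: fires a linear chain of actors on a token stream.
    A PN director would instead run each actor in its own thread,
    communicating through bounded queues."""
    def run(self, actors, tokens):
        out = []
        for tok in tokens:
            for a in actors:     # pass the token through the pipeline
                tok = a.fire(tok)
            out.append(tok)
        return out
```

The point of the separation is that the same actor graph can be executed under different directors (sequential, dataflow, process network) without changing the actors themselves.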

1.4 Related Work
Most current scientific workflow systems focus on executing jobs on remote resources using some kind of Grid middleware, that is, mapping an abstract workflow into individual jobs and submitting them to remote job queues. Askalon [7], the Pegasus workflow mapping tool [5], Swift [23], Triana [22], P-GRADE Portal [11], the Fraunhofer Grid Job Builder [10], and Condor-G [8] are tools based on Globus or on other middleware based on Globus. Authentication and authorization in the above Grid-oriented workflow systems are based on GSI, the certificate-based security infrastructure of Globus. This allows a single sign-on procedure for users into a whole set of resources and submission of jobs into different queues. At ORNL, GSI was not considered secure enough. The only authentication method is via RSA SecurID, typing a new passcode each time a connection is made between two resources. This strict policy renders most Grid-oriented workflow systems unusable for CPES, including Kepler's Grid actors and other Globus-based composition frameworks such as XCAT [9]. Web services and Grid services cannot be used as components in CPES workflows either, since they cannot be reached from any resource other than the one they are running on. Similarly, workflow systems based on service orchestration, e.g., Taverna [20], JOpera [21], Triana, and Kepler's web service actors, cannot be used. There are two systems besides Kepler/CPES that allow authentication with the Ssh-2 protocol. The Java CoG Kit [14], when used through the Karajan workflow engine and specification language, is able to authenticate to remote machines with Ssh-2 instead of Grid certificates. It allows password and public-key authentication methods. The commercial IP*Works! .NET Workflow Edition framework from /n software [26], extended with components such as IP*Works! SSL and IP*Works!
S/Shell allows opening persistent SSL connections to execute remote commands within Windows Workflow Foundation applications.

2. KEPLER COMPONENTS FOR CPES
Before the CPES project, Kepler collaborators had mainly focused on the support and use of Grid technology and web services, as those have been the underlying technologies of their projects. They did not address the challenges involved in the scenario presented here:
• the ability to perform several remote actions through one established Ssh connection,
• remote file operations through Ssh,
• specific actors for file-based workflows: regularly listing a directory looking for new files, a generic execution actor for a (remote) file, packing (archiving) a stream of files in chunks,
• some kind of checkpointing capability to make a workflow restartable,
• generic logging for workflow components.

Permanent Ssh Connections. Although Kepler already had an actor to execute a command with Ssh and to copy a file between a remote machine and the local machine (where Kepler is running), that actor could not execute commands repeatedly through keyboard-interactive logins, since the password/passcode was requested each time. Moreover, two such actors performing operations on the same host did not share the connection and therefore asked for the password independently. The very first capability we had to add to Kepler was a package of actors for remote actions that can share one established Ssh connection. There are ongoing developments for generic Ssh support in Java applications, such as JSch [27] and Ganymed Ssh-2 [25]. Kepler's original Ssh execution actor used the JSch package to establish a connection and execute a command. JSch supports all kinds of authentication (password, public key, keyboard-interactive) except host-based authentication, and it provides an interface to open several channels within one established Ssh session to execute different operations (both in parallel and consecutively). We first created a wrapper package org.kepler.ssh around JSch to be able to share a connection among different actors in a workflow. This Java package currently provides the following remote operations:
• open or close a connection,
• execute a command,
• copy files from a remote host to the local host,
• copy files to the remote host from the local host,
• create a directory,
• remove files.
All remote operations are implemented within this package, and Kepler actors only call a method for each operation. The individual actors do not have internal fields to store a session or channel, because then these could not be shared with other actors. An Ssh connection is identified by its user@host string, which is delivered either through links in the workflow graph from one actor to another (e.g., from opening a session to an execution actor) or through parameters given to all actors. This kind of support may at first seem misleading for streaming workflows, since it opens a hidden communication path between two actors that are not connected. In practice, however, it simplifies workflows, as we do not need additional control links between actors; the links in the workflow graph thus correspond more closely to the actual data flow. The SshSession actor opens an Ssh session. It is useful to let the workflow ask the user for the password at the beginning of the run instead of at the very first remote operation on a given host. An identity file can be provided for public-key authentication. The underlying org.kepler.ssh package, however, does not require this actor to explicitly open a session: before any remote operation is started, the package checks whether a connection to the requested host is open; if not, a new connection is established (and a password is asked for if needed) before performing the action. Additionally, the package implements all operations on the local host with the Java Runtime as well, so all Ssh-related actors can also perform operations locally. That is, if a host is named "local", no Ssh session is established; local operations are performed instead. We use this feature all the time: the workflows are developed on a laptop, with test executions connecting to ORNL as a remote site, while the production workflows are executed locally on the Ewok cluster. The deployment therefore consists of simply replacing the Ewok host name with local in the configuration file of the workflow.

Remote File-Oriented Operations. There are specific actors for file operations on a local or remote host. The SshFileCopier actor provides the functionality of the scp command: it can copy files and directories recursively to/from a host from/to the local disk, using wildcards. The SshFileRemover is the Kepler variant of the powerful rm command (although it works on a local Windows machine as well), except that the acceptance of wildcards must be explicitly requested, and dangerous attempts (/, *, ./*, etc.) are simply rejected. SshDirectoryCreator enables us to create remote directories, a regular step in the preparation/initialization parts of workflows. Finally, the SshDirectoryListing actor provides the content of a directory (allowing wildcards for the files). It can also be used to regularly list the same directory and get only the new files that were not previously present, or the list of modified files. Moreover, this actor produces not simply a list of file names but a record of name, size, and date for each file; the size information is needed for the archival parts of the CPES workflows. The SshDirectoryListing actor uses the 'ls -l' command for remote directories and the Java File class for local directories.

Watching a Simulation. Scientific workflows process data which often comes from external sources like databases, web services, instruments or data from files.
Kepler workflows that process the output of external programs need to get the stream of files and fill the pipeline with tokens referencing those files. The stream should not just be started but also stopped. We have defined an easy way to detect the termination of the simulation: it creates a specific (empty) file as its very last step (e.g., included as the last command in the simulation's job script). Another way of detecting termination for job-based programs would be polling the status of the job; however, for this the workflow would have to know about the job handler. Another important point is to avoid concurrent access to a file: the workflow should not hand a file to the first element of the processing pipeline while the simulation is still writing it. The criterion for 'completely finished files' in the simulations used in the CPES project is that once a simulation starts writing files for a given timestep, it never touches files of previous timesteps. Therefore, it suffices to stay one step behind the simulation. When listing available files, the ones found to be new in the last listing are considered 'dangerous' (possibly still being written), but the previously found list of files can be safely processed. We have realized the FileWatcher actor as a Kepler workflow based on SshDirectoryListing. FileWatcher is a data-dependent cyclic workflow, i.e., the number of iterations depends on unknown circumstances; see Fig. 1. It stops executing the SshDirectoryListing actor whenever the specific termination file is found in the list. Within this actor, the current list of new files is withheld for one step and emitted (to the next actor) only when a newer list of files is available.
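The "one step behind" rule can be modeled as a pure function over successive directory listings. This is a conceptual sketch, not the actual actor implementation; the termination-file name is an assumption.

```python
STOP_FILE = "simulation.done"  # assumed name of the termination marker

def watch(listings):
    """Given successive directory listings (lists of file names),
    emit each batch of new files only after a *later* listing has been
    taken, so files possibly still being written are held back one step.
    Stop when the termination file appears."""
    seen, pending, emitted = set(), [], []
    for listing in listings:
        new = [f for f in listing if f not in seen and f != STOP_FILE]
        seen.update(new)
        if pending:                  # previous batch is now safe to process
            emitted.append(pending)
        pending = new
        if STOP_FILE in listing:     # simulation finished: flush and stop
            if pending:
                emitted.append(pending)
            break
    return emitted
```

Each emitted batch was already present in an earlier listing, so by the project's "never touch previous timesteps" property it is guaranteed to be completely written.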

Figure 1: The data-dependent loop in Kepler to regularly check a remote directory for new files (with SshDirectoryListing) until a specific file is found (by SshDirectoryListing2).

Remote command execution. Another important capability to support is the execution of commands or programs on a remote site. Kepler had support for Grid jobs and Web Services, as well as a basic actor for remote command execution via Ssh. The latter actor has been modified to use the org.kepler.ssh package, so that it shares an established connection and enables logins with keyboard-interactive authentication. The SshExec actor handles a command as an atomic step in the workflow execution: it executes a command specified as a string and emits the stdout, stderr, and exit code after the command has terminated. Although there is an experimental version that streams the command's output line by line, this makes the workflow design more complex and should be used only when really necessary. For the CPES project, we have created a generic ProcessFile actor (implemented again as a workflow) with the following properties. Most operations take a single file as input, execute a program on it, and produce one output file. Developers initially find the basic actors easy to use for constructing processing pipelines, but they run into complexity once they start to handle failures. ProcessFile encapsulates checkpointing for successfully executed commands, according to the user-level checkpointing technique described in Section 3.1, and the discarding of tokens related to failed operations, to realize the workflow behavior described in Section 3. Logging is also included, so that reports on errors and successful actions can be collected (currently into files). Moreover, this actor is constructed to pass the specified stop file through without processing it (see Fig. 2), so that the information about the termination of the data source (the simulation) can reach further actors in the pipeline; for instance, the archiving actor needs to know the end of the stream. This actor is used the most in our workflows, as it fits the requirements of the file transfer, file conversion, and image creation operations, and it is also used as an atomic step in the archival workflow. It hides the control complexity coming from the extra work of dealing with failures and from the capability of workflow restart, allowing users to create simple top-level workflow pipelines corresponding to the steps of the scientists' requirement list; see the final workflow in Fig. 3.

Data Transfer. The SshFileCopier actor allows the workflow to copy files from a remote host to the workflow execution (local) machine, or vice versa. However, this is needed only for displaying remote images to the user. In all other steps, data is transferred among remote hosts. How to enable data transfer in general is one of the difficult topics that Grid/cyber-infrastructures focus on. In the environment of the CPES project, there are only a few large resources for computing and data processing, without any established Grid infrastructure; therefore we need an ad-hoc solution here. The data transfer operation is actually one instance of the ProcessFile actor in CPES workflows: the actual transfer is always performed by an external tool, usually bbcp or scp, executed as a command on one of the participating hosts. This requires that the transfer tool can log in from the executing host to the other host with automated authentication. Automated authentication forms include public-key authentication (used between ORNL and NERSC) and host-based authentication (inside ORNL, in the direction from Jaguar to Ewok and among Ewok nodes); a grid proxy certificate would be another form, but it is not supported on the resources used by the CPES project.

Archival on the fly. The archival of a stream of files in chunks of a given size is a specific task which has been implemented as a workflow that uses the ProcessFile actor as its last component, to execute the actual archiving command for one chunk on the remote host. The split-up of the stream is performed by the specific ArchiveCount actor. This Java actor runs on the local host and therefore creates the list of files to be archived on the local host (this is why SshDirectoryListing emits size information for the files besides their names). The Archiver then takes care of copying the list to the remote machine and executing the archive command on the list remotely. It also does checkpointing, so that at a workflow restart, already archived files will not be counted again.
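The split-up performed by ArchiveCount amounts to grouping a stream of (name, size) records into chunks of roughly a target size. A simplified sketch of this grouping (the real actor must additionally keep all files of one timestep in the same chunk):

```python
def make_chunks(files, chunk_size):
    """files: iterable of (name, size_in_bytes) records.
    Returns lists of names, each totalling at least chunk_size
    (except possibly the last, which holds the leftovers)."""
    chunks, current, total = [], [], 0
    for name, size in files:
        current.append(name)
        total += size
        if total >= chunk_size:   # chunk is full: hand it to the archiver
            chunks.append(current)
            current, total = [], 0
    if current:                   # leftover files at end of stream
        chunks.append(current)
    return chunks
```

This is why the directory-listing step must emit file sizes along with names: the chunking decision needs the size of each file before the archive command runs.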
Figure 2: The top-level of ProcessFile. The ExecuteCmd sub-workflow handles logging and checkpointing (see Fig. 4). MakeOutput handles failed operations and gets output file information.

Logging. The Logger actor takes a log string and has two parameters: a valid file name for the log and a header string identifying the particular Logger. This actor is special in Kepler in the sense that it shares its state, namely the log file with the given name, with all instances of this actor (class) within the workflow (although there can be several different log files handled by such actors). This way, the workflow can use Logger actors at an arbitrary number of places to log into one file, without the need to draw a lot of annoying connections. The very first call among all Logger actors with the same log file creates a new text file with the given name. For all other calls, the input string is written into the log file, prefixed with the date and the particular Logger's header string. The writes are synchronized, so the strings of separate loggers are not mixed.
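The Logger's shared-state semantics can be approximated with a class-level registry of open log files plus a lock for synchronized writes. This is a sketch of the behavior just described, not Kepler's implementation:

```python
import threading
from datetime import datetime

class Logger:
    """All Logger instances with the same file name share one log file;
    writes are serialized so lines from different loggers do not mix."""
    _files = {}               # shared across all instances (class state)
    _lock = threading.Lock()

    def __init__(self, filename, header):
        self.filename = filename
        self.header = header
        with Logger._lock:
            if filename not in Logger._files:
                # the first Logger for this name creates the file
                Logger._files[filename] = open(filename, "w")

    def log(self, msg):
        with Logger._lock:    # synchronized: lines never interleave
            f = Logger._files[self.filename]
            f.write(f"{datetime.now().isoformat()} {self.header}: {msg}\n")
            f.flush()
```

The class-level dictionary is what lets two unconnected actors write to the same file without a link in the workflow graph, mirroring the shared-state design described above.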

2.1 CPES Kepler Workflow for Processing Simulation Data and Archival
The new Kepler components described above abstract and encapsulate the complex operations, dealing with checkpointing, logging, and robust execution. They now enable workflow developers to focus on the steps they want to perform. The workflow that corresponds to the tasks enumerated in Section 1.1 is shown in Fig. 3. The PN director's computational model allows the pipeline-parallel execution of the actors on the streams of files. The actors themselves take care not to re-execute accomplished tasks in case of a workflow restart. They also discard tokens from the stream that belong to failed tasks, thus allowing the whole workflow to continue with consecutive files coming from the simulation. The Transfer, Convert, and CreateImage actors in the workflow are instances of ProcessFile; FileWatcher and Archiver are also used. The very first actor emits a constant that sets the whole workflow in motion; it is then the FileWatcher that produces the stream of files to be processed in the pipeline. The wide white boxes are value displays that show the last processed files at each step in case the scientist executes the workflow within the GUI; otherwise, the log files can be examined in case of command-line execution. In the future, the project plans to create a dashboard for easy deployment of such workflows and central control of workflow execution, monitoring, and steering.
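The restart behavior, skipping tasks that already succeeded, can be sketched as user-level checkpointing: record each completed operation identifier in a file and consult that file before executing. This is a conceptual model only; the one-identifier-per-line file format is an assumption, not the actual format used by the Kepler actors.

```python
import os

class Checkpoint:
    """Record completed operations so a restarted workflow skips them."""
    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):          # restart: load previous progress
            with open(path) as f:
                self.done = set(line.strip() for line in f)

    def run(self, op_id, action):
        if op_id in self.done:            # already succeeded in a past run
            return "skipped"
        result = action()                 # may raise: nothing is recorded
        with open(self.path, "a") as f:   # append only after success
            f.write(op_id + "\n")
        self.done.add(op_id)
        return result
```

Because an operation is recorded only after it completes, a crash mid-operation leaves it unrecorded, and the restarted workflow simply re-executes it.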

3. ROBUSTNESS OF SCIENTIFIC WORKFLOWS

Robustness of executing workflows is one of the key requirements for the long-term success of workflow systems. A new workflow system brings yet another component into the already large set of tools used by scientists. It must bring some new value to spark the scientists' interest, but it must also avoid making the whole solution less reliable; otherwise, scientists will simply ignore it after the first try. In the CPES project we face workflow executions that last for a day or more, with frequent executions of external programs to process data.

One side of robustness is whether the workflow system itself is robust enough not to fail. Fortunately, Ptolemy II is a thoroughly designed modeling system that has been developed for more than ten years. It is unlikely that the workflow execution fails by itself, not counting bugs in the newly developed actors. Machine failures, however, can still bring the workflow down. The other side of robustness is whether the workflow is designed to avoid going into panic mode when some external operation fails. This occurs far more frequently in the CPES scenarios. The automation tasks of the CPES workflows consist of many repeated operations on individual data items (files), where the simulations produce data in timesteps. If some operation on a timestep fails (e.g., a transfer to another host fails, mass storage is down during archival, or a statistic cannot be created), this should not prevent the workflow from executing the whole pipeline for the next timestep. Moreover, if a timestep consists of several files, most operations still process each of them independently (e.g., transfer, conversion, variable extraction). This independence is a blessing rather than a curse for the workflow designer, since most actors in Kepler are stateless. The streaming nature of the Kepler computational models combined with stateless actors naturally fits this requirement: the past actions of an actor do not influence its future actions. However, the next actor in the pipeline is affected by the result of the previous one, so the actors must be prepared for failures. There are two options at the level of actor design.

Discarding failed items.
One possibility is for an actor that encounters a failure to discard the corresponding token from the stream. The subsequent actors then simply never receive a token for the unsuccessfully processed item, so every actor can assume that if a token representing a given data item arrives, the external state of that item is still valid for processing. We originally chose this option in the design of the CPES workflows. Data items here are always files on disk, and the processing pipeline produces a set (chain) of files located

Figure 3: CPES workflow: transfer-convert-archive simulation data; generate and display images to the user.

on different machines and in different directories. Processing of a given simulation output file finishes at the first failure, and the workflow simply forgets the case (except for later retries, see below). The elimination of failed items happens in the ProcessFile actor, a complex workflow by itself, consisting of two parts: the execution part and the output file listing part (see Fig. 2). If the execution fails, the token inside the ProcessFile sub-workflow is discarded (driven to a dead branch in a conditional); otherwise, the output file is listed in a new token and passed on down the pipeline. However, there are two disadvantages. First, there may be other scenarios where a failure does not mean total failure and the impossibility of further processing in the pipeline. We do not have such a scenario, but one can be imagined in the monitoring/visualization part of the pipeline. Second, workflow design can sometimes become more difficult because of the discarding of tokens from the stream. It was not obvious at first that always keeping the same number of tokens in the pipeline leads to simpler workflow networks. Synchronous dataflow (SDF), the simplest model of computation in Kepler, assumes through its static scheduling a fixed token production rate for each actor. The behavior of producing "zero or one token" is not expressible within this model. At one point, we used the ProcessFile actor, already encapsulated in a container workflow, in a straight pipeline inside a higher-order actor that required its own director. Forgetting about the nature of ProcessFile underneath, we used SDF. Other workflow designers would presumably fall into this trap as well.
The bug stayed hidden until the first failure in that particular external processing eliminated a token, raising an exception from the SDF director and thus aborting the workflow. The fix is to use a more complex director at that place, which, from a design point of view, looks like a strange choice at that level of the workflow hierarchy unless the reason is well documented.

Introducing failure tokens. The other option is to keep the tokens that refer to a data item with a failed operation, thus keeping the number of tokens in the network (workflow) constant and making workflow design simpler. The disadvantage of this choice is that every actor must be prepared to receive such a token, which has to be handled differently from the others. This is feasible within a single project, where the workflow developers create a limited set of actors for their own purposes, but it is a nightmare for the Kepler project with many developers from different scientific areas,

and several hundred actors. In Kepler, the ports of the actors are typed; therefore, the failure token must have the same type as the "normal" tokens, otherwise the execution of the workflow would fail. Each actor has to define clearly what it does with a failure token (most of the time, just pass it on). When the output has a different type from the input (e.g., a job-id actor outputs the string id of the input job object token), the actor has to define what it emits for a non-existing job represented by a failure token. The difficulties of workflow design are therefore not avoided or simplified by this option but moved into other areas: actor design and data model design for the given application. One example of failure tokens in Ptolemy II/Kepler is the NIL token for the basic token types, including arrays. Originally, failed operations such as division by zero either caused an exception or emitted an empty token (an empty string, an array of zero length, etc.). However, an empty array is just an array of length zero, not a failure; sometimes it is meaningful, sometimes not. This led to confusion among actor developers, so the NIL token was created and all actors (e.g., all array operations) were extended to handle NIL tokens. The NIL token is essentially to workflows what the null pointer is to textual languages, and handling it correctly everywhere in the actor code is at least as error-prone as in textual code. The handling of failure tokens can, however, be solved at a higher level. If the workflow system's computational model incorporates failure tokens and avoids executing actors on them, the coding hassle can be avoided. No such director has been designed or implemented for Kepler yet, but the COMAD [19] framework provides such a high-level solution. In collection-oriented workflow design, actors work on collections (of data items or tokens).
A collection can represent and contain the whole processing history of an input data item, growing with intermediate results as it passes through more actors. An actor picks up the relevant data from a collection to work on; if there is no such data, the whole collection simply passes through the actor without any processing. Failures can thus be represented as entities distinct from real data items, incorporated into the collections.
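The failure-token option discussed above can be sketched as follows (an illustrative Python sketch, not Kepler actor code; the token layout and all names are ours):

```python
# A failure token is a sentinel with the same shape as a normal token,
# so typed ports still match; every actor must recognize it and pass it
# through unchanged, keeping the token count in the network constant.

FAILURE = {"file": None, "failed": True}   # plays the role of a NIL token

def make_token(name):
    return {"file": name, "failed": False}

def actor(op):
    """Wrap a per-file operation so it forwards failure tokens untouched
    and converts its own failures into failure tokens."""
    def fire(token):
        if token["failed"]:
            return token                   # just pass the failure on
        try:
            return make_token(op(token["file"]))
        except RuntimeError:
            return FAILURE
    return fire
```

Unlike the discarding option, every firing produces exactly one output token, so even a statically scheduled director sees the fixed token rate it expects.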

3.1 A User-Level Checkpoint/Recovery Model for File-Based Scientific Workflows

Above, we have avoided bringing the workflow to its knees on external processing failures. However, if the workflow terminates abnormally for any reason, it must be able to continue the work later from the point where it left off. It is not acceptable to start the whole processing again. Moreover, there

Figure 4: The checkpointing composite workflow with ExecCmd as a type A actor. Processing of the output and logging is not shown here.

is a need to do something later with the failed operations as well. For instance, if the workflow's task is to archive all timesteps of the simulation and some of them fail, those should be archived later too. Note that these are two separate challenges to be solved. The key to continuation is checkpointing, a well-known technique for any kind of software, combined with an appropriate recovery mechanism for the application. If there is a system-level checkpointing service providing a way to store the whole state of an application, developers are in an easy situation. If there is none, a user-level checkpoint must be developed for the particular application, saving all the information needed to restore its current state, together with a recovery routine that reads the checkpoint data, rebuilds the internal state of the application, and lets it run from that state. Kepler has no system-level checkpointing capability, and none is planned for the near future. Therefore we had to incorporate user-level (workflow-level) checkpointing into the CPES workflows. Instead of trying to save the actual state of the workflow and continue execution from that point, we have relaxed the notions of state and restart, allowing some repetition of internal work (but not on files). The basic idea is that the checkpoint is a process history containing all successful operations; at restart, CPES workflows start from the very beginning but skip all operations that succeeded earlier.
That is, instead of reading the checkpoint data at the beginning of the restart, rebuilding the state of the workflow, and then continuing from that point, the actors in the CPES workflows continuously check their current task against the checkpoint and skip the task if it is found to be already done. This solves the other problem as a side effect: failed operations are not stored in the checkpoint, so they are repeated when the workflow is restarted.

Checkpoint-recovery model. First, we have to define what the state is in the case of the CPES workflows, which basically just automate external processing over data stored in files. The workflow independently processes many simulation output files (say $f_{1,1}, \ldots, f_{1,n}, \ldots, f_{m,1}, \ldots, f_{m,n}$ if there are $m$ timesteps and $n$ files per timestep). We assume that the files have unique names, that is, $f_{i,g} \neq f_{j,g}$ for distinct timesteps $i$ and $j$ (usually timesteps are included in the file names). There are actors $A_1, \ldots, A_k$ that work on single files, that is, perform $m \cdot n$ independent operations whose outputs are new files. Let us denote the input to $A_l$ as $inp^{A_l}_{ij}$ and the output as $out^{A_l}_{ij} = A_l(inp^{A_l}_{ij})$ if it processes a file derived from the $j$th file of timestep $i$ by preceding operations. There can also be actors $B_1, \ldots, B_p$ that process an undefined number of files before producing one output (e.g., the Archiver actor, which creates one large archive from several timesteps together). Let us denote this operation, performed in many smaller steps, as one step: $out^{B_l}_{ij} = B_l(inp^{B_l}_{vw}, \ldots, inp^{B_l}_{ij})$, where $v \leq i$, $w \leq j$. Finally, there are other actors that work internally on the tokens themselves. Their resource usage is negligible compared to the file processing operations, so we do not deal with them in the checkpoints. They are usually stateless, but if some of them have an internal state based on the processing history of the tokens, this state will be restored by re-processing all those tokens in a restarted workflow.

We cannot model what goes on in the external processing of a file, but we can assume that the output is another file and that a successful operation means the output file is fully created; otherwise, the output and the partial state of the operation can be ignored by the workflow. If this is not the case, we require it to be hidden from the workflow. One example is the transfer of large files over the network. Tools like bbcp, bbftp, rsync, and gridftp can continue a partially failed transfer, using their own temporary files to do so. The workflow does not need to know about this, since it deals only with successfully transferred files, which the next actor can then process. If the operation of a failed transfer ($A_l(inp^{A_l}_{ij})$) is repeated, the workflow does not need to know that the underlying tool uses its own records to continue a previously failed operation. That is, we model the $A_l$ actors as atomic operations that either succeed or fail.
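The atomicity assumed above (the output file exists fully if and only if the operation succeeded, and partial state can be ignored) is commonly obtained with a temp-file-plus-rename pattern. The following is our illustration of that pattern, not a claim about how bbcp or rsync work internally:

```python
import os

def atomic_copy(src, dst):
    """Make a file-producing operation look atomic to the workflow:
    write to a temporary name and rename only on success, so dst
    exists if and only if the operation completed."""
    tmp = dst + ".part"
    try:
        with open(src, "rb") as fin, open(tmp, "wb") as fout:
            while chunk := fin.read(1 << 20):
                fout.write(chunk)
        os.replace(tmp, dst)        # atomic rename commits the result
        return True
    except OSError:
        if os.path.exists(tmp):
            os.remove(tmp)          # discard partial state; dst untouched
        return False
```

From the workflow's point of view, such a wrapped operation either succeeds or fails; no partially written output file is ever visible under the final name.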
For each actor $A_l$, we define the state as the set of successful operations performed by this actor. For an operation, its signature is recorded upon successful execution: $\sigma(A_l, inp^{A_l}_{ij}, out^{A_l}_{ij})$. That is, the state of $A_l$ is a set of signatures $\{\sigma(A_l, inp^{A_l}_{ij}, out^{A_l}_{ij}) \mid i \in I \subseteq \{1..m\},\ j \in J \subseteq \{1..n\}\}$. Note that we need to know the signature of the output before executing the actor, otherwise the signature could not be checked in advance. For actors where this is not possible, the signature should consist only of the actor and the input: $\sigma(A_l, inp^{A_l}_{ij})$. For an actor $B_l$, we similarly store the signature of the global operation $out^{B_l}_{ij} = B_l(inp^{B_l}_{vw}, \ldots, inp^{B_l}_{ij})$: $\sigma(B_l, inp^{B_l}_{vw}, \ldots, inp^{B_l}_{ij}, out^{B_l}_{ij})$. It depends on the actor $B_l$,

whether this signature should be recorded at once after completing the global operation, or whether the individual inputs should be recorded at each successful intermediate step. There are two checkpoint/recovery-related actions: recording a signature and checking a signature against the checkpoint. Recording is straightforward: the signature is added to the checkpoint. Checking depends on the type of the actor. For an actor $A_l$, the signature $\sigma(A_l, inp^{A_l}_{ij}, out^{A_l}_{ij})$ is looked up in the checkpoint, and the result is true if it is found. For an actor $B_l$, the checking is different: for each individual input file, it has to be decided whether $B_l$ should be called, that is, whether for an input $inp^{B_l}_{ij}$ there is a signature $\sigma(B_l, inp^{B_l}_{vw}, \ldots, inp^{B_l}_{ij}, out^{B_l}_{ij})$ that contains $inp^{B_l}_{ij}$. If the checking action returns true, then instead of calling $A_l$ (or $B_l$), the output token $out^{A_l}_{ij}$ ($out^{B_l}_{ij}$) should be passed on. The last question is where to take the output token of a successful operation from. In a formal model we can easily say that, besides the signature, the data stored in the output token should be saved in the checkpoint as well. However, what this extra operation means depends on the underlying data model of the particular workflow. In the CPES workflows, tokens contain information (name, size, and date) about the files. We do not need to store this information; rather, we can obtain it again from the simulation output files that are processed in the workflow.

Implementation of checkpointing in the CPES workflows. In the implementation, we first have to decide what the signature of an operation is, from which it can be unambiguously decided whether the operation should be executed or not. This could be the unique name of the actor in the workflow plus the input file name, but in the CPES workflows the signature of the ProcessFile (type A) actor is simply the command string that is executed externally.
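The recording and checking actions above reduce to set operations; for a type B actor, checking asks whether any recorded global signature already covers the given input. A simplified Python sketch of this state model (a simplification with hypothetical names, not Kepler code):

```python
# Checkpoint state as a set of signatures: a string per type A operation,
# and a (command, covered-inputs) pair per completed type B operation.

class Checkpoint:
    def __init__(self):
        self.a_sigs = set()   # type A: one signature per (actor, input, output)
        self.b_sigs = []      # type B: (signature, set of covered input names)

    def record_a(self, sig):
        self.a_sigs.add(sig)

    def check_a(self, sig):
        return sig in self.a_sigs

    def record_b(self, command, inputs):
        self.b_sigs.append((command, set(inputs)))

    def check_b(self, inp):
        # true if some completed global operation already covered this input
        return any(inp in covered for _, covered in self.b_sigs)
```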
The command string usually contains the input (and output) file names, so it is unique for each input. If this is not the case for a particular instance of the ProcessFile actor, extra care should be taken to include some unprocessed string carrying the unique ids in the command string (e.g., as a comment). For the Archiver actor (type B), it is enough to store the whole signature when an archive file is successfully stored, because when the actor is called with one input file, only the file's size is added to a running sum; the actual archiving operation is started for all files at once when the total size reaches a predefined limit. The signature of the Archiver actor for one successful operation actually consists of several lines of strings: one is the archival command, which refers to a file containing the list of data files to be archived together; the rest are the names of the input files counted for the given operation. When the Archiver actor is running, it checks the checkpoint for each incoming input file; those already present in the checkpoint are not counted any more, since they are recorded in the checkpoint if and only if they have already been successfully archived.

The checkpoint capability in Kepler is realized as a new actor, similar to the Logger actor. The MappedLogger actor takes a signature string and has two parameters: the checkpoint file name and a selection of one of two possible actions, recording or checking. The very first call to any of the MappedLogger actors with a given checkpoint file reads the text file into memory (into a java.util.HashSet), or creates a new file if none exists. For a recording action, the signature string is appended both to the text file and to the hash set. For a checking action, the signature string is looked up in the hash set, and the actor emits a true token if the string is found, otherwise a false token. Checkpointing for an actor is thus realized by embedding that actor in a specific checkpointing composite workflow. Fig. 4 shows the checkpointing workflow graph for a type A actor, where the command string is already known and can thus be used as the signature of the given actor invocation. The graph represents a nested if-then-else execution: if the checkpoint contains the command string, the workflow emits true (top branch); otherwise the command is executed. If the execution is successful (middle branch), the command string is added to the checkpoint and the workflow emits true; otherwise the workflow emits false, indicating an unsuccessful operation.

Summary of user-level checkpointing. The user-level checkpointing capability presented here rests on the following assumptions about scientific workflows:

• The overwhelming weight of resource usage is on external operations; therefore the state of internal actors need not be saved but can be rebuilt by repeating their work from the beginning (if they have any state).

• The checkpointed actors take care of the checking by themselves, or they are wrapped in an appropriate composite workflow that does it for them. In the latter case, the individual operations of the actors can be distinguished by a string that can be created within the composite workflow before calling the actual actor.

• File names for different timesteps are different. This is used to provide unique signatures for the invocations of the actors. If this is not the case, there is usually still a way to define unique signatures, e.g., by counting the invocations of a given actor.

• Modification or removal of files in the external environment between two runs of the workflow is not advisable.
For example, if the intermediate output of a successful operation is removed, the actor, by skipping the repetition of that operation, will by default not reproduce that output. This can, of course, be checked and handled within the actor or the checkpointing composite. Modified files are likewise not processed again, unless the checkpointing composite is further developed to check the modification times of input/output pairs.
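The MappedLogger actor and the checkpointing composite of Fig. 4 can be sketched together in Python (an illustrative rendering of the described behavior, not the actual Java actor):

```python
import os

class MappedLogger:
    """Sketch of the MappedLogger actor: a checkpoint file mirrored by an
    in-memory set, loaded on the first use of a given file name."""
    _cache = {}

    def __init__(self, ckpt_file):
        self.ckpt_file = ckpt_file
        if ckpt_file not in MappedLogger._cache:
            sigs = set()
            if os.path.exists(ckpt_file):
                sigs = set(open(ckpt_file).read().splitlines())
            MappedLogger._cache[ckpt_file] = sigs
        self.sigs = MappedLogger._cache[ckpt_file]

    def check(self, sig):
        return sig in self.sigs

    def record(self, sig):
        self.sigs.add(sig)
        with open(self.ckpt_file, "a") as f:   # append to file and hash set
            f.write(sig + "\n")

def checkpointed_exec(logger, command, execute):
    """The Fig. 4 composite as nested if-then-else: skip if already done,
    otherwise run the command and record it only on success."""
    if logger.check(command):
        return True                  # top branch: already in the checkpoint
    if execute(command):             # middle branch: run the external command
        logger.record(command)
        return True
    return False                     # failed; not recorded, retried on restart
```

On a restart with the same checkpoint file, already-recorded commands are skipped, while failed ones are re-executed, exactly the replay-with-skip behavior described above.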

4. CONCLUSION AND FUTURE WORK

This paper has described the tasks and requirements of the CPES plasma edge simulation project for automating simulation monitoring and data archival, the components developed in Kepler to provide the required functionality, and the resulting automated workflow solution. A user-level checkpoint model has been described that allows the workflow to restart after failures and also to retry tasks that failed earlier. The main future goal of the CPES project is to couple several simulations running on different resources in order to solve the multi-scale problem of controlled nuclear fusion. This requires workflow automation that works not just in

a pipeline but in a loop, moving data back and forth between hosts and managing jobs running on different systems, while still providing the monitoring and archival functionality. Another goal is to raise the abstraction level and hide the underlying workflows from the scientists: they will only need a dashboard to pick the tasks they want to perform, to run experiments involving several simulation executions, and to observe, in one central place, the status of all components involved in an experiment.

Acknowledgements

We would like to thank Julian Cummings at Caltech for testing the CPES workflows and identifying bugs; Stephane Ethier at PPPL for being the first, brave user of the workflow, monitoring a 32-hour GTC simulation at NERSC and moving over 800 GB of data to ORNL for the conversion/archival tasks; and Ilkay Altintas at the San Diego Supercomputer Center, Mladen Vouk, Jeff Ligon, and Pierre Mouallem at North Carolina State University, and Ayla Khan at the University of Utah for organizing a Kepler tutorial together with the authors at Supercomputing'06, where tutorial attendees built and executed the same workflow as presented here (on a smaller scale of data).

5. REFERENCES

[1] I. Altintas, E. Jaeger, K. Lin, B. Ludäscher, and A. Memon. A web service composition and deployment framework for scientific workflows. In 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.
[2] V. Bhat, S. Atchley, S. Klasky, M. Parashar, and M. Beck. High performance threaded data streaming for large scale simulations. In Proceedings of the 5th International Grid Computing Workshop (Grid 2004), pages 243–250, Pittsburgh, PA, Nov. 2004. IEEE Computer Society Press.
[3] S. Bowers and B. Ludäscher. Actor-oriented design of scientific workflows. In 24th Intl. Conf. on Conceptual Modeling (ER), Klagenfurt, Austria, LNCS, Springer, 2005.
[4] J. M. Dawson. Particle simulation of plasmas. Reviews of Modern Physics, 55:403, 1983.
[5] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Scientific Programming Journal, 13(3):219–237, 2005.
[6] J. Eker, J. W. Janneck, E. A. Lee, J. Liu, X. Liu, J. Ludvig, S. Neuendorffer, S. Sachs, and Y. Xiong. Taming heterogeneity – the Ptolemy approach. Proceedings of the IEEE, 91(1), January 2003.
[7] T. Fahringer, A. Jugravu, S. Pllana, R. Prodan, C. Seragiotto, and H.-L. Truong. ASKALON: a tool set for cluster and Grid computing. Concurrency and Computation: Practice and Experience, 17(2–4):143–169, 2005.
[8] J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke. Condor-G: a computation management agent for multi-institutional Grids. Journal of Cluster Computing, 5:237–246, 2002.
[9] D. Gannon, S. Krishnan, L. Fang, G. Kandaswamy, Y. Simmhan, and A. Slominski. On building parallel and Grid applications: component technology and distributed services. In Proceedings of CLADE 2004, Challenges of Large Applications in Distributed Environments, June 2004.
[10] A. Hoheisel. User tools and languages for graph-based Grid workflows. Concurrency and Computation: Practice and Experience, Special Issue: Workflow in Grid Systems, 18(10):1101–1113, 2005.
[11] P. Kacsuk, Z. Farkas, G. Sipos, A. Toth, and G. Hermann. Workflow-level parameter study management in multi-Grid environments by the P-GRADE Grid portal. In Grid Computing Environments Workshop, Supercomputing'06, Tampa, FL. To appear in 2007.
[12] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Proc. of the IFIP Congress 74, pages 471–475. North-Holland, 1974.
[13] S. Ku, C. Chang, M. Adams, et al. Gyrokinetic particle simulation of neoclassical transport in the pedestal/scrape-off region of a tokamak plasma. Journal of Physics: Conference Series, 46:87–91, 2006.
[14] G. von Laszewski, I. Foster, J. Gawor, and P. Lane. A Java commodity Grid kit. Concurrency and Computation: Practice and Experience, 13, 2001.
[15] E. A. Lee and T. M. Parks. Dataflow process networks. Proceedings of the IEEE, 83(5), 1995.
[16] E. A. Lee and S. Neuendorffer. Actor-oriented models for codesign: balancing re-use and performance. In Formal Methods and Models for System Design: A System Level Perspective, pages 33–56, 2004.
[17] Z. Lin, S. Ethier, T. S. Hahm, and W. M. Tang. Size scaling of turbulent transport in magnetically confined plasmas. Phys. Rev. Lett., 88(19):195004, Apr. 2002.
[18] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, August 2006.
[19] T. McPhillips and S. Bowers. An approach for pipelining nested collections in scientific workflows. SIGMOD Record, 34(3):12–17, 2005.
[20] T. Oinn, M. Greenwood, M. Addis, M. N. Alpdemir, J. Ferris, K. Glover, C. Goble, A. Goderis, D. Hull, D. Marvin, P. Li, P. Lord, M. R. Pocock, M. Senger, R. Stevens, A. Wipat, and C. Wroe. Taverna: lessons in creating a workflow environment for the life sciences. Concurrency and Computation: Practice and Experience, 18(10):1067–1100, 2005.
[21] C. Pautasso and G. Alonso. The JOpera visual composition language. Journal of Visual Languages & Computing, 16(1–2):119–152, 2005.
[22] I. Taylor, M. Shields, I. Wang, and A. Harrison. The Triana workflow environment: architecture and applications. In I. Taylor, E. Deelman, D. Gannon, and M. Shields, editors, Workflows for e-Science, pages 320–339. Springer, 2007.
[23] Y. Zhao, M. Wilde, and I. Foster. Virtual Data Language: a typed workflow notation for diversely structured scientific data. In I. Taylor, E. Deelman, D. Gannon, and M. Shields, editors, Workflows for e-Science, pages 258–278. Springer, 2007.
[24] BBCP peer-to-peer parallel file transfer software. http://www.slac.stanford.edu/~abh/bbcp/
[25] Ganymed SSH-2 for Java. http://www.ganymed.ethz.ch/ssh2
[26] IP*Works! .NET Workflow Edition for Windows Workflow Foundation. http://www.nsoftware.com/workflow
[27] JSch – Java Secure Channel. http://www.jcraft.com/jsch
[28] SDM – Scientific Data Management Center. http://sdm.lbl.gov/sdmcenter