Characterizing Workflow-based Activity on a Production e-Infrastructure Using Provenance Data

Souley Madougou (a), Shayan Shahand (a), Mark Santcroos (a), Barbera van Schaik (a), Ammar Benabdelkader (a), Antoine van Kampen (a,b), Sílvia Olabarriaga (a,*)

(a) Academic Medical Center, University of Amsterdam, The Netherlands
(b) Swammerdam Institute for Life Science, University of Amsterdam, The Netherlands

* Corresponding author. Email address: [email protected] (Sílvia Olabarriaga)
Abstract

Grid computing and workflow management systems emerged as solutions to the challenges arising from the processing and storage of the sheer volumes of data generated by modern simulations and data acquisition devices. Workflow management systems usually document the process of workflow execution either as structured provenance information or as log files. Provenance is recognized as an important feature of workflow management systems; however, there are still few reports on its usage in practical cases. In this paper we present the provenance system implemented in our platform, and then use the information captured by this system during 8 months of platform operation to analyze the platform usage and to perform a multilevel error pattern analysis. We exploit the explanatory potential of statistical approaches on this large amount of structured data to find properties of workflows, jobs and resources that are related to workflow failure. Such an analysis enables us to characterize workflow executions on the infrastructure and to understand workflow failures. The approach is generic and applicable to other e-infrastructures to gain insight into operational incidents.

Keywords: Grid Computing, Scientific Workflow Management Systems, Provenance, Workflow Failure Analysis
1. Introduction

The advances in simulation and acquisition devices, while opening new opportunities, have also brought new challenges. The overwhelming amount of generated data that needs to be processed requires computing infrastructures beyond traditional resources like desktops and individual clusters. Additionally, the challenging nature of the problems research teams are addressing requires them to collaborate with peer teams in other cities, countries or even continents, sharing not only knowledge but also other resources such as acquisition, computing and storage facilities [1]. New technologies have emerged to enable this new way of research, coined "e-science". Among them are grid computing [2] and workflow management systems (WfMS) [3, 4]. Both technologies have been extensively used in practice in many scientific domains. Because WfMSs abstract complex computations and data dependencies on complex infrastructures, they need to document their execution in order to, for instance, facilitate troubleshooting and analysis. Especially in a scientific environment, the ability of the WfMS to document the workflow execution with enough granularity to allow reproducibility and auditing of the scientific results is of paramount importance [5]. Most of today's WfMSs either internally record workflow provenance (e.g., VisTrails [6], Kepler/pPOD [7], provenance trails [8] in Pegasus [9], and Taverna [10]) or provide structured data that can be used to build it, such as MOTEUR [11] and WS-VLAM [12]. Provenance can also include semantic information, in which case it can be used to answer high-level queries involving domain concepts, as in Janus [13]. Generally, however, WfMSs implement system provenance by collecting workflow execution traces, which is very important for troubleshooting problems, especially given the high frequency of failures in distributed computing infrastructures (DCI) [14]. Despite the flexibility introduced by WfMSs, running production workflows on DCIs remains a difficult task because execution incidents are commonplace.
Optimization of the usage of grid resources is needed to increase the adoption of grid computing by a broader community. In practice, when running large data-intensive applications on production grid infrastructures, users experience various types of incidents, such as bugs in the application program for certain combinations or types of inputs, data transfer timeouts, insufficient resources (e.g., memory) to run a job, hardware and software incompatibilities, etc. Usual approaches include manual examination of job log files or re-running the failing jobs. However, such approaches become very complex when a WfMS is used, because these are exactly the details it is meant to hide in the first place. Although WfMSs usually implement some automatic management of errors, in practice we see that detecting the cause of an error and acting upon it requires (much) human intervention.

Failure analysis on DCIs has received attention in various works. Townend et al. [15] describe a novel approach to multi-version design (MVD) fault-tolerant systems in service-oriented applications. The approach has the particularity of using provenance of the services [16] to weight channels within the MVD. Using testbed applications, the authors show how their provenance-aware MVD-based fault-tolerant system outperforms a traditional MVD system in terms of system dependability and avoidance of the pathological common-mode failure. Data mining techniques have been used to help administrators of large production computing grids deal with failure and troubleshooting, as described in [14]. The authors apply classification techniques to historical data about resources, users and jobs on DCIs to infer what properties of machines, jobs and the execution environment are correlated with success or failure. Samak et al. [17] describe methods for guiding remediation of stochastic errors through predictions of the impact on application performance, along with techniques for isolating systematic sources of failures. The approach identifies workflows with high failure probability, then looks at detailed failure patterns to identify problematic resources that should be avoided in subsequent runs. After assessing the classifier performance, its predictions are used on-line to preemptively restart problematic workflow runs. A self-healing process to deal with the frequent operational incidents on DCIs [18] has been added to the MOTEUR workflow engine [19]. Workflow activity incidents leading to failure are modeled as states of a fuzzy finite state machine. The degrees of severity of the incidents are computed from metrics assuming that incidents have outlier performance. Incident degrees are later discretized into levels that are associated with actions like task or input replication, site blacklisting and activity stopping.

Previous approaches to workflow execution failure analysis were either simple error analyses or monitoring-like or fault-tolerant features directly implemented in the system. Our approach differs from these in several ways. We not only analyze the errors, but also provide a complete provenance of the workflow executions that is useful for other purposes, such as establishing a data product lineage. Moreover, because we use a third-party WfMS, we prefer non-invasive approaches that do not require instrumentation of the WfMS and that can be adapted to other systems as well.

For the sake of using a common vocabulary, a grid fault taxonomy is proposed in [20]. This multilevel taxonomy classifies an incident according to eight viewpoints, which are further subdivided into subtypes. These viewpoints are the following: the location is the component where the fault manifests itself; the originator is the component that causes the fault; the introduction time corresponds to the software life-cycle phase where the fault was introduced; the duration is the time span of the fault's existence; the severity expresses how serious the fault is; the consequences of the fault are its effects on the system; and the caused behavior describes how the system behaves in the faulty state. For instance, a fault in the location viewpoint can be at the resource level (e.g., a process, a node or the network), at the middleware level (e.g., resource broker, scheduler), at the application service layer (e.g., host or container) or at the workflow layer (e.g., workflow specification or workflow engine). We use this taxonomy throughout this paper to classify our platform-specific errors. For a complete description of the viewpoints and their subcategories, please refer to [20].

We have developed a system that collects provenance of workflow execution externally to the WfMS [11, 21]. This system facilitates the analysis of workflow activity and, in particular, the understanding of workflow execution problems without going into application-specific details. In this paper we present the provenance system implemented in our platform, and then use the data collected by this system during 8 months of platform operation to analyze its usage and error patterns. We analyze the large amount of structured data using statistical approaches to find properties of workflows, jobs and resources that are related to workflow failures. Such an analysis enables us to characterize workflow execution on the infrastructure and to understand workflow failures.

The paper is organized as follows. We first present the platform and the provenance system (Section 2),
and then we characterize the activity on this platform during a period of 8 months, in particular concerning failure (Section 3). Next we adopt a statistical approach to study failure in more detail for two applications in DNA sequence analysis and computational neuroscience (Section 4). Finally, we discuss the results and present concluding remarks and future prospects.

2. Provenance system of the e-BioInfra platform

In this section we present a brief overview of the e-BioInfra platform and describe the implemented provenance system.

2.1. Overview of the e-BioInfra platform

The context of our study is the e-BioInfra platform [22], which enables scientists of the Academic Medical Center (AMC) of the University of Amsterdam to run large-scale data analysis on a grid infrastructure. Figure 1 shows its layered architecture. The bottom layer, the grid fabric, consists of resources of the Dutch e-science grid provided by the BiG Grid project (http://www.biggrid.nl). The layer above it contains grid middleware services, also provided by BiG Grid. On top of these layers, generic services such as workflow management, monitoring and provenance services are provided. Finally, the e-BioInfra can be used from various user interfaces, among them the web-based e-BioInfra gateway [23].

Figure 1: Overview of e-BioInfra architecture with focus on the provenance system (gray components at the right).

On the e-BioInfra platform all grid computing activity takes place by executing workflows from the various user interfaces. The MOTEUR workflow management system [19] is used to enact workflows described in the GWENDIA language [24]. A MOTEUR workflow is a directed graph of processors representing computations and of dependencies constraining their order of invocation. A processor description encompasses both the processing (binary code and command-line parameters) and the input/output data. MOTEUR workflows accept input URIs and scalar values on their initial input ports and produce URIs for the generated output data. Throughout this paper, we also use the term components to refer to processors. Applications that perform data analysis are wrapped into MOTEUR workflow components using the generic application service wrapper (GASW) [25], which is also responsible for staging data in and out of the worker node. The DIANE pilot job framework [26] is used by MOTEUR for resource mapping and task scheduling. DIANE adopts a master-worker model in which the grid jobs (workers) contact the master to receive tasks that correspond to the actual processing units of the workflow. If a job fails, MOTEUR retries it up to a configurable number of times. Both DIANE and MOTEUR write execution information into custom databases and log files. Our approach relies on these information sources to build, structure, link and store workflow execution traces into a provenance store.

2.2. Provenance System

The provenance system in use today has evolved from our previous work [21, 11]. The main motivation for its initial development was to capture information about the provenance of grid workflow executions in a structured manner to facilitate troubleshooting, among other uses. In particular, the goal was to link information available from various log files and software components (MOTEUR, DIANE) and to provide context for errors observed when running workflows on the grid. We chose the Open Provenance Model (OPM) [27] to represent workflow provenance because it clearly defines what information to capture and how to represent it. Furthermore, OPM is a standard that facilitates interoperability with other provenance and workflow management systems, as well as compatibility with the new provenance specification of the W3C, PROV (http://www.w3.org/TR/prov-primer), which was released after we started development of our system.

As illustrated in Figure 1, the provenance system has three main components: the Provenator, the Provenance store and the Retrieval interface. The Provenator collects provenance information from various sources, including the database maintained by MOTEUR, log files, and the standard output and standard error files of grid jobs. The Provenator is automatically executed after completion of a workflow execution. The collected information includes both workflow-specific data (e.g., application and user names) and job-specific data (e.g., component name and execution site), with the proper links between workflow runs and jobs. The collected information is mapped to OPM using the strategy described in Section 2.3. The provenance store is implemented as a relational database, described in Section 2.4. The Retrieval interface implements selected queries based on use cases defined with users of the e-BioInfra platform [21], such as the lineage of a data product or the comparison of executions of the same workflow specification. The interface also manages how the retrieved information is presented to the user. These functions have been incorporated into the e-BioInfra gateway (see the documentation at http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/EBioInfraGatewaySCIBUSMOTEUR). SQL can also be used for more flexible queries, which was the approach adopted to perform the analysis presented in this paper (see Section 3).
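As an illustration of such a flexible query, the following is a minimal R sketch of pulling per-site job statistics out of the provenance store. The connection details, table and column names are hypothetical placeholders (the actual schema is presented in Section 2.4); only the general idea of querying the relational store with SQL is taken from the text.

    library(DBI)

    # Hypothetical connection; any DBI-compliant driver for the relational
    # backend of the provenance store could be used here.
    con <- dbConnect(RMySQL::MySQL(), dbname = "provenance", host = "db.example.org")

    # Illustrative query: completed and failed grid jobs per site for one
    # application. Table and column names are assumptions, not the real schema.
    per_site <- dbGetQuery(con, "
      SELECT site, status, COUNT(*) AS jobs
      FROM   job_view
      WHERE  application = 'bwaone'
      GROUP  BY site, status
      ORDER  BY jobs DESC")

    dbDisconnect(con)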
2.3. Workflow to OPM Concept Mapping

The developed system is based on version 1.1 of the OPM specification, a standard that originated from a community effort, the provenance challenge series [28]. OPM defines the provenance of an object as a causality graph enriched with annotations capturing further information pertaining to the execution. The graph has three types of nodes: agent, artifact and process. An edge materializes a causal dependency between its source and destination. The workflow concepts are mapped onto OPM as explained in [11] and summarized below.

Mapping a MOTEUR workflow to an OPM graph is straightforward, as both are defined as graphs. The inputs and outputs are defined as OPM artifacts. GASW processors (where the computation takes place) are defined as OPM processes. A processor is either a GASW task (grid job) or a beanshell (embedded Java code executed on the MOTEUR server). A data dependency between two processors is materialized by a causal dependency between their corresponding OPM processes. An OPM agent is a contextual entity causing and managing a process execution; therefore, the scientist running the workflow and all e-BioInfra components (hardware and software) involved in the workflow execution are mapped to agents. OPM requires every graph to have at least one account, so we map a workflow run to an account of its corresponding graph. We followed OPM as closely as possible; however, the original model does not specify how to link retried grid jobs to their original grid job. When a given job is a retry of another one that failed previously, an annotation is added to it (RetryOf) and a link is created between the two jobs. Therefore, the generated provenance graph is, in fact, an extended OPM graph.

Figure 2: Simplified abstract representation of a BWA workflow in MOTEUR with two inputs: a reference file (ReferenceTarGz) and a file with reads to be aligned in the FASTQ format (FastqGz). Two outputs are generated in the BAI and BAM formats (respectively Bai and Bam). Triangles are inputs, boxes are processes, diamonds are outputs.

As an illustration, consider a genomics application that executes the BWA (Burrows-Wheeler Alignment tool) program [29] for sequence alignment of next-generation sequencing data. For the sake of readability, let us take the simplified workflow in Figure 2. It contains one component (grid job) that requires a reference file and (a list of) files containing sequence data as inputs, and generates the results of the alignment in two output files. Figure 3 represents the extended OPM graph for the execution of this workflow using one reference file (Reference.fasta.tgz) and two files with sequence reads (Read1.fastq.gz, Read2.fastq.gz). For each read file, a BWA grid job is created to align the sequences against the reference. Successful jobs generate two output files each (Result1.bai and Result1.bam, Result2.bai and Result2.bam). The execution graph in Figure 3 shows that, actually, three BWA jobs were executed; one of them failed (BWA2) and was retried (BWA3). The representation of the OPM graph in Figure 3 reads similarly to the workflow representation: inputs are on top, outputs at the bottom (alternatively the outputs could be shown on top, as is more common for displaying dependency graphs, but we found the current drawing more intuitive for the users). The graphical representations of the inputs, outputs and processes are labeled with the values they are given during the execution. For files, those values correspond to URLs, and to scalar values otherwise. Edges linking inputs/outputs to processes are labeled with their OPM type (causal dependency), with the role played by the input/output in the process execution in parentheses. When a job is a retry of a failed one, an annotation is added to it (folded-corner rectangle in the figure).

Figure 3: Extended OPM graph of the execution of the BWA workflow with one failed job. Ellipses represent OPM artifacts used as input or produced as output, boxes are processes, arcs are dependencies, and labels represent files, program names and parameter values. U and WGB are abbreviations of the OPM causal dependencies Used and WasGeneratedBy, respectively. The edge labeled "RetryOf" from BWA3 to BWA2 is an extension to express that BWA3 is a retry of the failed job BWA2.
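To make this mapping concrete, the sketch below builds a small edge list in R that approximates the extended OPM graph of Figure 3. The node names, roles and dependency types come from the figure; the data-frame layout itself is only an illustration, not the representation actually used by the provenance store.

    # Approximate edge list of the extended OPM graph in Figure 3.
    # "Used" and "WasGeneratedBy" are OPM causal dependencies; "RetryOf" is the
    # extension linking a retried job (BWA3) to the failed job it replaces (BWA2).
    opm_edges <- data.frame(
      from = c("BWA1", "BWA1", "BWA2", "BWA2", "BWA3", "BWA3",
               "Result1.bai", "Result1.bam", "Result2.bai", "Result2.bam", "BWA3"),
      to   = c("Reference.fasta.tgz", "Read1.fastq.gz",
               "Reference.fasta.tgz", "Read2.fastq.gz",
               "Reference.fasta.tgz", "Read2.fastq.gz",
               "BWA1", "BWA1", "BWA3", "BWA3", "BWA2"),
      type = c(rep("Used", 6), rep("WasGeneratedBy", 4), "RetryOf"),
      role = c("ReferenceTarGz", "FastqGz", "ReferenceTarGz", "FastqGz",
               "ReferenceTarGz", "FastqGz", "Bai", "Bam", "Bai", "Bam", NA)
    )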
2.4. Provenance Schema

The schema of the provenance database is shown in Figure 4. It is composed of twelve tables, most of which correspond to OPM concepts (prefix OPM). Since the OPM specification does not give any recommendations for provenance storage, we chose to adopt a relational database due to the maturity, popularity and efficiency of this technology. The Provenator builds OPM entities as Java objects, and Hibernate (www.hibernate.org) is used as an object-to-relational mapping (ORM) tool to make the Java objects persistent in the relational schema. Tables AgentAccount, ArtifactAccount, DependencyAccount and ProcessAccount are association tables imposed by the mapping. The OPMDependency table contains all objects of type causal dependency (Used, WasGeneratedBy, WasControlledBy, WasTriggeredBy, WasDerivedFrom). Most ORM tools map objects of the same class to the same table. By harnessing polymorphism, some ORM tools are able to map objects of different classes to the same table, as long as they are of the same type; the objects can then be unambiguously extracted by using the actual class name as a discriminator. Our provenance system leverages this feature to optimize the schema. Furthermore, we accepted two performance-to-compatibility trade-offs in the OPM implementation. The first trade-off is related to Role, which we implement as a simple string instead of an object to avoid additional query overhead. The second trade-off concerns time information about events, which OPM defines as an interval (start and stop times) for generality, but which in our case is implemented as timestamps.

Figure 4: The relational schema of the provenance store.

3. Understanding workflow activity on the e-BioInfra

The e-BioInfra platform is used by different users, for different applications and in different settings. Collecting usage information and understanding this varied activity is challenging. In particular, it is difficult to identify problems that could be affecting the overall activity and to distinguish these from specific cases of applications or users. In this section we analyze the activity at the e-BioInfra based on workflow provenance information. Our goal is to characterize the platform usage in terms of applications, users, resources and failures. We first describe the dataset used in the experiments, and then we present an analysis of activity at the workflow level and at the task level (jobs).

3.1. Provenance dataset

The dataset used in this study comprises information collected during actual e-BioInfra operation (December 2011 to July 2012). It includes workflow runs performed with various goals: testing of the infrastructure, development of new workflows, and data analysis experiments, mostly in DNA sequencing and computational neuroscience applications. These workflows were started manually by different users from different user interfaces, including the e-BioInfra gateway [23]. From this set we removed incomplete runs, for example when the workflow could not start or when some serious error caused the workflow engine to abort its execution, because in such cases the reconstruction of provenance is not reliable due to missing data. The subset of the provenance store considered in this study contains nearly 394 MB of provenance information about 1848 workflow runs, 70 different applications, 77 components and 36033 jobs. A large amount of detail is available for these workflow runs; however, in this study we concentrate on the following subset. At the workflow level, the following information was used:

1. application name (string identifying the abstract workflow, unique);
2. name of the user who executed the workflow;
3. start time of the workflow run, which is the workflow submission time;
4. end time of the workflow run, which is the end time of the last job;
5. number of retried jobs: MOTEUR retries jobs up to N times, where N is configurable by the user for each workflow run;
6. final status (successful or failed): a workflow fails when some tasks cannot be successfully completed during its execution, even after the task has been retried a given number of times by MOTEUR.

For each workflow task executed as a grid job, the following information was used (no information is available about the tasks executed locally):

1. application name to which the job belongs;
2. component name (string identifying the program executed by the task, which can be reused in different workflows);
3. final status of the job: successful or failed, with the error codes explained below;
4. computing resource (coined "site") and hostname;
5. start time, which is when the job starts to execute;
6. end time of the job, which is when the job completes or gets aborted or killed;
7. error type derived from the job exit code.

A sketch of how the workflow-level attributes can be derived from the job-level records is given below.
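The following is a minimal R sketch of that derivation, assuming the job-level records have been loaded into a data frame jobs with hypothetical columns (workflow_id, application, user, task_id, status, start_time, end_time, is_retry); the column names and the simplifications are ours, not the platform's.

    # Derive workflow-level attributes from job-level records (hypothetical columns).
    workflow_summary <- do.call(rbind, lapply(split(jobs, jobs$workflow_id), function(w) {
      data.frame(
        workflow_id  = w$workflow_id[1],
        application  = w$application[1],
        user         = w$user[1],
        start_time   = min(w$start_time),   # earliest job start (submission time is stored separately)
        end_time     = max(w$end_time),     # end time of the last job
        retried_jobs = sum(w$is_retry),     # jobs annotated as RetryOf
        # a run fails when some task never completed successfully, even after retries:
        # every task must have at least one completed attempt for the run to succeed
        status = if (all(tapply(w$status == "COMPLETED", as.character(w$task_id), any)))
                   "COMPLETED" else "ERROR"
      )
    }))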
The errors detected at the job level fall into four categories, derived from different viewpoints of the fault taxonomy [20]:

- application error: this error code is returned whenever the component execution is not successful, for instance an exception in Java or C++ code, a non-zero return value from C code or a script, etc. This corresponds to a resource-level fault, with the resource being a process.
- input download error: the middleware cannot move a file from some external storage to a worker node. This corresponds to a FileTransferIn fault.
- result upload error: the middleware cannot move a file from a worker node to some external storage, corresponding to a FileTransferOut fault.
- unknown error: any error that cannot be attributed to the above types, corresponding to a default return value. This value is returned, for example, when a job is aborted, GASW crashes, or the DIANE agent is killed.

From the diverse applications and users of the platform, we focus on the most used applications and the most active users in the remainder of the paper. For the sake of readability of the figures, we use abbreviations for the names of those applications and users. The abbreviations are described in Table 1 for the users and Table 2 for the applications. We do not describe all active users, but only those with outlier behavior regarding workflow execution.

Table 1: Description of some users.

User     Description
proxy    Corresponds to all e-BioInfra gateway users
u1       develops, tests and runs production workflows
u2, u9   execute simple workflows to test the infrastructure
u3       expert who develops and tests workflows
u4       user who runs workflows developed by others
u11      scientist testing the platform

Table 2: Description and abbreviation of the used applications.

Label  Application           Description
a1     blast                 Alignment of DNA sequences using BLAST [30]
a2     bwaone                Alignment of DNA sequences using BWA with one component
a6     bwasplit              Alignment of DNA sequences using BWA with data split and processed in parallel and partial results merged
a10    bwasplit2             Same as a6 but with additional input sequence files
a11    bwasplit2ss           Same as a10 but with site selection for the split and merge steps
a3     diffgenomes           Counts and clusters homologous genes to find related strains
a4     dtiatlas              Wraps Diffusion Tensor Imaging software
a5     dtiprep               Pre-processing step to a4
a7     freesurfer            Wraps FreeSurfer for reconstruction of the brain's cortical surface from MRI
a8     pileupvarscanannovar  Determines variants from alignment files and annotates them
a9     samtoolsflagstat      Gives basic statistics about alignment files
3.2. Analysis at the Workflow Level

In this section we analyze the provenance data to understand success and failure of workflows in the e-BioInfra regarding different applications and different users.

Figure 5 shows the successful and failed workflow runs of the applications that have been executed at least 15 times (11 applications in total, after excluding test workflows). The success rate ranges from 12% for a1 up to 96% for a5. Note that some of the most failing applications are also the most executed ones. As a workflow currently cannot be resumed, in practice the users need to resubmit the complete workflow for the same inputs or a subset of them. Note also that applications a2, a6, a10 and a11 correspond to evolving generations of workflows implementing DNA sequence alignment using BWA. The success rate of a6, a10 and a11 is clearly higher than that of a2, confirming the success of the strategy used to improve these workflows over time, described in [31]. We observed that one of the most executed components of a1 often failed during the data staging phase. a3 mostly fails because of wrong input files and data staging issues. Besides data staging incidents, a5 usually runs smoothly on most sites. a7 involves both data- and compute-intensive stages. a8 and a9 are simple workflows with neither memory- nor compute-intensive steps, so they have a high success rate. Particularly because of the compute-intensive nature of some workflow components, we observed many sudden terminations of component executions due to memory faults, DIANE agent timeouts, or unknown reasons.

Figure 5: Successful and failed workflow runs for 11 applications, sorted by number of runs. Bars indicate the number of workflows (left axis). The line indicates the success rate (%, right axis).

Figure 6 shows the number of failed and successful runs started by users who executed at least 10 workflows in the analyzed period (12 users in total). User proxy is linked to a robot proxy certificate, a credential used to run workflows from the e-BioInfra gateway. The workflows executed with robot credentials normally correspond to stable applications that lead to more successful executions; however, they can still fail due to infrastructure problems or malformed inputs. Because users u2 and u9 execute simple workflows, their runs are usually successful. On the contrary, the large number of runs and their potentially complex workflows lead u3 to a large failure rate. The same holds for u1 who, in addition, runs production workflows. User u11 is a frustrated researcher who abandoned the platform after too many unsuccessful attempts. Finally, naive user u4 experiences a high failure rate due to incorrect workflow inputs.

Figure 6: Successful and failed workflow runs for 12 users, sorted by number of runs. Bars indicate the number of workflows (see left axis). The line indicates the success rate (%, see right axis).

We also analyzed the success and failure of workflow runs considering other information available in the provenance store, such as workflow duration (end minus start time), number of retried jobs, and day of the week and month of the workflow submission (data not shown here). Given the diversity of the applications, user profiles and conditions of platform usage, however, we could not identify any trend or relevant findings.
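Summaries such as those behind Figures 5 and 6 reduce to simple aggregations of the provenance data. A minimal R sketch, assuming a data frame runs with one row per workflow run and hypothetical columns application, user and status ("COMPLETED" or "ERROR"):

    # Runs and success rate per application (as in Figure 5); the same recipe
    # applied to runs$user reproduces the per-user view of Figure 6.
    per_app <- table(runs$application, runs$status)
    success_rate <- round(100 * per_app[, "COMPLETED"] / rowSums(per_app), 1)

    # Keep only applications executed at least 15 times and sort by number of runs.
    keep <- rowSums(per_app) >= 15
    summary_app <- data.frame(runs = rowSums(per_app), success_rate = success_rate)[keep, ]
    summary_app[order(-summary_app$runs), ]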
3.3. Analysis at Task Level

In this section we analyze the performance of workflow tasks executed as grid jobs, as well as the types of errors returned by them.

Figure 7 presents the job performance (success/failure) with respect to the computing clusters located on the distributed sites. Although there can be more than one cluster per site, we treat the clusters individually because their operation and administration policies can differ within the same site. Most sites (or clusters) have a fairly high success rate (above 70%), with the exception of two sites, s11 and s12, where roughly every other job fails. These two sites experienced problems (bad connectivity and misconfiguration) during the analyzed period, as will be shown in Figure 8.

Figure 7: Job performance per grid site (s1 to s20), sorted by number of jobs.

Next, we analyzed the errors, their types and how they were distributed across sites. In the pie chart in Figure 8 we see that data staging errors (input download and result upload) dominate the landscape (around 70%), followed by application errors. The bars in Figure 8 show the error type distribution per site. Whereas in most sites all error types occur in proportions similar to the pie chart, some sites show different patterns. Sites s11, s12 and s16, for instance, display almost exclusively data staging errors. We discovered a large number of connection timeouts on s11, which suggested that the site had bad connectivity. As for s12 and s16, it has been confirmed from the logs that there was a configuration problem on those sites during the analyzed period, which made data upload fail.

Figure 8: Distribution of error types for all runs (pie chart) and per grid site (bars).
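The distributions in Figure 8 are plain cross-tabulations of the failed jobs. A minimal R sketch, assuming a data frame failed_jobs with hypothetical columns site and error_type:

    # Overall error type distribution (the pie chart in Figure 8).
    prop.table(table(failed_jobs$error_type))

    # Error type distribution per site (the bars in Figure 8); proportions are
    # computed within each site, so every row sums to 1.
    round(prop.table(table(failed_jobs$site, failed_jobs$error_type), margin = 1), 2)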
4. Statistical approach to error analysis

From the analysis performed in Section 3, we conclude that workflow executions fail a lot and that we can roughly categorize the error occurrences by type. However, we still lack details about the causes of errors, for example whether a particular workflow component fails more often on a particular site, or whether a particular error type occurs more often in some applications. Such information would be useful to give hints about how to solve a problem, for example to find out that the problem is transient and that rerunning the workflow the next day is the best approach. Typically, however, thorough troubleshooting requires manual inspection of the log files of the individual jobs, which is a daunting task, or even impractical when too many jobs are run and retried. Moreover, it can only be performed by a developer with sufficient information about all details of the workflow implementation, the WfMS and the infrastructure. This is not the case for the e-BioInfra operators, who are the first to detect and support the investigation of errors occurring during workflow execution. Therefore it would be useful to use the provenance information as a first analysis step in troubleshooting to identify the major error causes. In this section we explore a statistical approach to identify the major error and success patterns based on the information in the provenance store. The motivation for such an approach is the variety of the data, which makes the analysis very complex when performed in a case-by-case fashion.

4.1. Classification method

Pattern classification [32] consists of learning a target function, also called a classification model, that maps each attribute set, or record, to one of a set of predefined categories or class labels. A training dataset is used to build the model, which can then be used as an explanatory tool to distinguish between objects of different classes, as well as a prediction tool for unlabeled objects. In our case we are interested in the explanatory potential of classification models to understand the error patterns of grid jobs executed via the e-BioInfra platform.

There are several approaches to classification. Our first attempts were based on clustering [32]; however, the results were not enlightening. First of all, we found it difficult to prepare the attribute values properly to compensate for their heterogeneity. Secondly, the interpretation of the clusters in such a large feature space was not straightforward. We then investigated tree-based classification methods, which are popular in data mining for tasks of similar complexity. These methods use a form of recursive partitioning in which the training dataset is split into subsets using the most discriminating attribute. The splitting continues until a stopping criterion is met. The generated trees are particularly suited to explaining the classification model learned from the training data in an intuitive way. The conditional inference tree is a relatively new tree-based method that estimates a regression relationship by binary recursive partitioning in a conditional inference framework [33], selecting variables in an unbiased way. A statistically motivated stopping criterion is used, namely permutation tests based on p-values, such that the partitions induced by this algorithm are not affected by overfitting. In this study we used the R party package [34], in particular the ctree function, which implements conditional inference tree derivation. As this model avoids overfitting, there is no need for explicit pruning or cross-validation of our trees. The essential arguments to ctree are (1) the definition of the target and input attributes, (2) a training dataset, which provides the values for those attributes, and (3) controls, such as the stopping criterion. The target attribute, also called the response variable, defines the class for a given record. The resulting tree is plotted using a routine of the same R package (see the example in Figure 10). The most discriminating attributes appear at the top levels of the tree. Improvements were implemented in the plotting function to enhance the readability of the trees, in particular the position of edge labels.
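To make the procedure concrete, the following is a minimal sketch of how such trees can be derived with ctree. The data frame of job records and its column names are illustrative; the mincriterion value of 0.999 corresponds to the p < 0.001 stopping criterion described in Section 4.2, and the two response variables (job status and error type) are the ones used in our analyses.

    library(party)

    # Hypothetical job-level data frame; attributes follow Section 4.3.
    jobs$status      <- factor(jobs$status)       # response 1: COMPLETED / ERROR
    jobs$site        <- factor(jobs$site)
    jobs$component   <- factor(jobs$component)
    jobs$application <- factor(jobs$application)

    # Conditional inference tree for the job final status; a node is split only
    # when the permutation test is significant at p < 0.001 (mincriterion = 0.999).
    ct_status <- ctree(status ~ application + component + site, data = jobs,
                       controls = ctree_control(mincriterion = 0.999))
    plot(ct_status)

    # Same approach for the error type, restricted to failed jobs only.
    failed <- droplevels(subset(jobs, status == "ERROR"))
    failed$error_type <- factor(failed$error_type)   # A / I / R / U
    ct_error <- ctree(error_type ~ component + site, data = failed,
                      controls = ctree_control(mincriterion = 0.999))
    plot(ct_error)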
4.2. Experiments

The classification method was applied to different sets of workflow runs using two response variables: the job final status (completed successfully or error) and the error type (application, input, output and unknown). All jobs are included in the training dataset in the first case, whereas only the failed jobs are included in the second case. The stopping criterion for node splitting was set to 0.999, such that only splits with high confidence (p-value < 0.001) are performed. This value was set after experimentation: smaller values generate larger trees with too many fine-grained nodes, hampering their interpretation.

We initially considered all available job attributes and all the workflows from Figure 5, but the generated trees were too big and too hard to interpret, and therefore of limited usefulness in practice. We also experimented with subsets of the attributes and workflow runs to bring the trees to an interpretable size. We studied two types of workflow runs with high incidence on the infrastructure operation: successful runs with only a few job retries and failed runs with a high number of retried jobs. Our goal was to understand why the workflow execution recovers in the first case and not in the second. Next, we considered workflow runs per application and per user, and finally we used sets of well-known workflow runs of two applications. The first set corresponds to an application for DNA sequence alignment with BWA that was implemented by three different workflows (a2, a10 and a11). These workflows were executed under controlled conditions during a period of one month, and a thorough analysis of errors was performed manually on this dataset by an advanced user (a bioinformatician); details of this study are presented in [31]. The second set is composed of production runs of a7 started by one end-user (a neuroscientist) from the e-BioInfra gateway. These runs also frequently failed and were briefly analyzed manually by the workflow developer and the e-BioInfra operators. A summary of the datasets is shown in Table 3. The manual findings about these two sets of workflow runs helped us to test and evaluate the value of the results obtained with the classification approach.

Table 3: Summary of the datasets used in the classification.

Dataset      Runs   Jobs   Failed Jobs
First Set    178    2372   968
Second Set   24     1574   472

4.3. Results

We carried out a large number of analyses involving different combinations of the above datasets and job attributes. It turned out that some attributes are not useful because they are correlated (failed jobs and retried jobs, resource and hostname), and others are a consequence of the error rather than its cause (e.g., job duration). After much experimentation, we found that the most useful attribute set is: application name, final status of the job, error type, component name and execution site. This generates small trees that are compact and summarize well the intuition built from extensive troubleshooting during e-BioInfra operation. We also noticed that trees generated for sets of runs containing too distinct workflows are not useful. In such cases site and application prevail as discriminating attributes, occluding other attributes that are specific to each of the workflows (e.g., component name). For example, the tree generated for all the runs in the first set is large and not informative, whereas the individual trees corresponding to the workflows a2, a10 and a11 are valuable, as illustrated below. A selection of the most relevant classification trees and findings is included as supplemental material (see the attached documents). Some examples of classification trees are presented below to illustrate the results and potential of this approach.
4.3.1. Successful vs Failed Runs

For jobs of successful workflow runs with very few (0 or 1) job retries, we found the application to be the most influential attribute. One interesting finding, depicted in Figure 9, was that runs of the most failing applications, such as a1, a2 and a4 (see also Figure 5), do not usually retry jobs when they successfully complete, which suggests that the job execution should be stopped and/or restarted after the very first retries. Conversely, a5, the most successful application, does retry some jobs, for instance after they failed on misconfigured sites.

Figure 9: Classification tree for 543 jobs from successful runs of the a1, a2, a3, a4, a5 and a6 applications with few job retries (output attribute = job status). Node 3 does not include a single failed job. This means that the runs of the applications on the adjacent edge did not have to retry jobs at all.

For jobs of failed workflow runs with a large number of job retries, the component name and the execution site appeared to be the deciding factors for a job's success or failure. This tree is rather big, so we include it as additional material.

4.3.2. First Set

Figure 10 shows the classification tree of all jobs executed by the BWA application implemented as a single-component workflow (a2). It shows that the most influential attribute in the success of a job is the execution site. Indeed, all jobs executed on the sites of the right branch have failed. Only on the two sites of the left branch do the jobs have some chance to successfully complete.

Figure 10: Classification tree for 159 jobs from a2 runs (output attribute = job status). Labels are centered around the edge they decorate.

Figure 11 presents more details about the same set of runs as in Figure 10, considering only failed jobs. It clearly shows that different groups of sites display different error patterns. On most sites, jobs fail for unknown reasons, as shown in tree node 2. The error pattern displayed by the jobs in node 4 is dominated by result upload errors, which were caused by a middleware misconfiguration of site s16 (based on manual analysis of logs). Application errors are the most frequent in the three sites (s1, s10, s2) of node 5. These sites have the largest clusters of the grid infrastructure and implement a strict memory usage policy by killing any non-complying jobs (e.g., using above 2 GB of memory). A manual analysis of the logs showed that most of these failing jobs display "out of memory" exceptions because the BWA component in the a2 workflow uses memory too aggressively. One last observation is that s4 does not appear in Figure 11, which means that none of the jobs running on it failed and that the failure rate in the left-hand branch (node 2) of Figure 10 is due to s7.

Figure 11: Classification tree for 150 failed jobs executed by application a2 (output attribute = error type). Error codes: A = application, I = input download, R = result upload, and U = unknown.
4.3.3. Second Set

Results for runs of the neuroimaging application are shown in Figure 12 and Figure 13. In both cases, site is the most influential attribute, followed by component type. Figure 12 shows that ImportInput, a data transfer workflow component, is 100% successful on most sites (node 5) except for s13 (node 4). Moreover, steps1to14 and steps15to36, which are compute-intensive components, are less successful on the same sites (node 6). In the rightmost branch we observe that, irrespective of the component, more than 95% of the jobs fail on two sites, s11 and s12, which also have the worst overall job performance, as seen in Figure 7. Zooming into the error type distribution in Figure 13, node 2 on the left reveals that a fairly large number of failures during input download occur on sites s12 and s13, which were identified as misconfigured by a manual analysis. The central node (4) shows that application and download errors in the components ImportInput and steps1to14 shape the error pattern. The manual analysis revealed that the application errors are linked to ImportInput and the download errors to steps1to14; however, due to the strict stopping criterion adopted for tree splitting, no further node splitting could take place.

Figure 12: Classification tree for 836 jobs from a7 runs started from the gateway by one user (output attribute = job status).

Figure 13: Classification tree for 162 failed jobs executed by application a7 (output attribute = error type). See the legend of Figure 11.

5. Discussion

In this paper, we presented our approach to collect detailed provenance of grid workflow executions based on logging information generated by a particular workflow management system. In this case, MOTEUR and DIANE were used because they currently form the backend of the e-BioInfra. However, the approach could be adapted to other systems as well. The architecture of the system is designed to confine changes to the Provenator, as long as the provenance specification is kept. When the WfMS or its log messages change, the Provenator needs to be adapted, and the mapping onto the OPM graph needs to be revised to comply with possible differences in the available information and concepts of the new system. Changing the provenance standard (e.g., to PROV), however, would require modifying the whole system, including the PLIER API, the Provenator and the relational schema. Nevertheless, we expect such a modification to be straightforward because standards, including OPM, usually only change incrementally. Although PROV, for instance, introduces new concepts and some changes in the data model, these actually rename or map to OPM concepts and can be easily incorporated into the system. Finally, although we chose a relational schema, we realize the advantages of introducing a semantic layer on top of the current system provenance. Semantic Web standards would enable us to publish our results as Linked Open Data after proper data anonymization.

The collected workflow provenance information was used to analyze the failure and success of workflows executed on the e-BioInfra during a relatively large period of time. The availability of structured information enabled a thorough analysis based on various combinations of attributes and subsets of the data, using straightforward queries to the provenance store. Although the analyses presented in this study were performed via a programmatic interface (SQL), it would be straightforward to implement a GUI for selected pre-defined queries, along the lines of the already existing provenance interface available on the e-BioInfra gateway (http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/EBioInfraGatewaySCIBUS). A user-friendly interface for freely browsing the provenance database is desirable, for instance using the approach described in [35]; however, such advanced interfaces are still a subject of research.

We took a top-down approach in the analysis of e-BioInfra activity, first looking at the patterns at the workflow level, and then at the job level. This first approach was to filter workflow runs, and their corresponding grid jobs, using various attributes, e.g., per application, per user, per site. General statistics about these groups of runs or jobs allowed us to better understand the general activity taking place on the infrastructure. In general we observed low efficiency due to a large number of errors and retries. The reasons for this low efficiency are found in an unfortunate combination of factors such as (the absence of) site selection to run the experiments and to retry jobs, malformed or unavailable inputs to run the workflow, and resource overload. Further actions are needed to address these factors at both the technical and the training level for workflow development and execution. With this first analysis, however, we could not identify by eye any significant "pattern" of activity, in particular of failure and success, because of the heterogeneity of the data and the complexity of the infrastructure.

We therefore applied a pattern recognition algorithm, namely classification trees. In general, we observed that classification trees are useful for providing explanations of job success and failure patterns by making explicit the job attributes that contribute most to the job final status. As such, they could become an invaluable tool for first-step workflow incident troubleshooting. However, care must be taken to grow a right-sized tree, because big trees are not only hard to interpret, they are also hard to read. For instance, in order to make the reasonably small trees shown in this paper readable, we had to implement some R callback routines to improve the original plotting function. Classification trees present a dilemma: small, they are easy to interpret but poor in information; large, they are hard to interpret but potentially rich in information. Additionally, they depend on the data and the feature space, and their interpretation is close to an art. In our case, we had to reduce the input data by selecting the workflow runs and/or the most relevant or intuitive features. After extensive analysis of a large number of trees generated with various combinations of attributes and sets of workflow runs, we concluded that the most valuable classification trees for troubleshooting are generated from a homogeneous set of workflow runs. This is in fact not limiting, because it reflects the typical scenario in which such an analysis would take place. For example, a workflow developer would like to optimize a certain workflow implementation, or a platform operator would like to investigate why a given application is failing. Although we confirmed our results for a limited number of cases, as reported here, the value of this approach needs to be further validated by others. Moreover, the performance of the classifier has not been objectively evaluated, which would require extensive confirmation of results via manual or custom inspection of the job logs. Considering that the trees are meant to roughly guide a more thorough and detailed analysis of some workflow activity, we consider the current results satisfactory. Finally, the visualization of the trees also calls for improvement in both interactivity and parameterization, which would be very important for its exploitation by the actual users.

Another aspect that could be enhanced is the error type classification, which is currently limited to four types only. A richer, hierarchical error taxonomy as described in [20] would greatly improve both analysis and troubleshooting. However, this would require instrumentation of the workflow system and of the application components, violating our approach of considering these as black boxes from the provenance system perspective. Finally, the analysis presented in this paper, which is based on structured provenance data, could not have been easily done based only on log files. The provenance data enables linking jobs to the applications, their components and the input files. It also enables the identification of retried jobs. For example, we could not have had component names in Figure 12 and Figure 13 without the proper linking provided by causal dependencies.
6. Conclusion

In this paper we showed how provenance data can be collected from an existing system and then used to characterize workflow-based activities on an e-infrastructure. We analyzed usage and failure patterns, first at the workflow run level for the most executed applications and the most active users of the platform, and then at the task level, showing the global performance per site in our virtual organization. Finally, we further investigated the causes of failure by analyzing job properties related to failure and error type distribution. In all cases, the execution site appears as the most influential attribute for the success or failure of a job, indicating that better site selection, or alternatively site blacklisting, could improve the efficiency of the infrastructure significantly. We discussed how this work can be extended and how classification trees can be used as a valuable tool for troubleshooting workflow execution incidents.

Our provenance system also annotates whether a job is a retry of another job during workflow execution trace collection, and this information can be used to study the job retry mechanism in an operational context. We observed that the mechanism is non-clairvoyant, as the workflow engine often re-submits a job to a site where it has consistently failed. Making the mechanism clairvoyant by using provenance would increase the efficiency of workflow executions on the platform. Finally, during the analysis, one recurrent question was the maintenance status of a given site in a given period of time. As this information is currently not available in the provenance, we had to obtain it from external sources when available. To address this issue, we plan to extend the system with tools designed to provide such information.
Acknowledgements

This work is financially supported by the AMC-ADICT innovation fund, the BiG Grid programme funded by the Netherlands Organisation for Scientific Research (NWO) and the COMMIT project "e-Biobanking with imaging for healthcare". We thank Perry Moerland for helping with the interpretation of the datasets used in the analysis, and Mahdi Jaghoori and Mostapha Al Mourabit for their suggestions for improving the text.

References

[1] T. Hey, S. Tansley, K. Tolle (Eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, ISBN 0982544200, 2009.
[2] C. Kesselman, I. Foster, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, 1998.
[3] J. Yu, R. Buyya, A taxonomy of scientific workflow systems for grid computing, SIGMOD Rec. 34 (3) (2005) 44-49, ISSN 0163-5808.
[4] E. Deelman, D. Gannon, M. Shields, I. Taylor, Workflows and e-Science: An overview of workflow system features and capabilities, Future Generation Computer Systems 25 (5) (2009) 528-540.
[5] S. B. Davidson, J. Freire, Provenance and scientific workflows: challenges and opportunities, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, ACM, ISBN 978-1-60558-102-6, 1345-1350, 2008.
[6] L. Bavoil, S. Callahan, P. Crossno, J. Freire, C. Scheidegger, C. Silva, H. Vo, VisTrails: enabling interactive multiple-view visualizations, in: Proceedings of IEEE Visualization, 135-142, 2005.
[7] S. Bowers, T. M. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher, Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life, in: Proceedings of IPAW, 70-77, 2008.
[8] J. Kim, E. Deelman, Y. Gil, G. Mehta, V. Ratnakar, Provenance trails in the Wings-Pegasus system, Concurr. Comput.: Pract. Exper. 20 (2008) 587-597, ISSN 1532-0626.
[9] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, D. S. Katz, Pegasus: a framework for mapping complex scientific workflows onto distributed systems, Scientific Programming Journal 13 (2005) 219-237.
[10] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M. R. Pocock, P. Li, T. Oinn, Taverna: A tool for building and running workflows of services, Nucleic Acids Research (2006) 729-732.
[11] A. Benabdelkader, M. Santcroos, S. Madougou, A. H. C. van Kampen, S. D. Olabarriaga, A Provenance Approach to Trace Scientific Experiments on a Grid Infrastructure, in: Proceedings of the 2011 IEEE Seventh International Conference on e-Science, e-SCIENCE '11, IEEE Computer Society, ISBN 978-0-7695-4597-4, 134-141, 2011.
[12] M. Gerhards, V. Sander, T. Matzerath, A. Belloum, D. Vasunin, A. Benabdelkader, Provenance opportunities for WS-VLAM: an exploration of an e-science and an e-business approach, in: Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, WORKS '11, ACM, ISBN 978-1-4503-1100-7, 57-66, 2011.
[13] P. Missier, S. Sahoo, J. Zhao, C. Goble, A. Sheth, Janus: from workflows to semantic provenance and linked open data, Lecture Notes in Computer Science 6378/2010 (2010) 129-141.
[14] D. A. Cieslak, N. V. Chawla, D. L. Thain, Troubleshooting thousands of jobs on production grids using data mining techniques, in: Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing, GRID '08, IEEE Computer Society, ISBN 978-1-4244-2578-5, 217-224, 2008.
[15] P. Townend, P. Groth, J. Xu, A provenance-aware weighted fault tolerance scheme for service-based applications, in: Object-Oriented Real-Time Distributed Computing, 2005 (ISORC 2005), Eighth IEEE International Symposium on, 258-266, 2005.
[16] P. Groth, M. Luck, L. Moreau, A protocol for recording provenance in service-oriented grids, in: Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS 2004), 124-139, 2004.
[17] T. Samak, D. Gunter, M. Goode, E. Deelman, G. Mehta, F. Silva, K. Vahi, Failure prediction and localization in large scientific workflows, in: Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, WORKS '11, ACM, ISBN 978-1-4503-1100-7, 107-116, 2011.
[18] R. Ferreira da Silva, T. Glatard, F. Desprez, Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures, in: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), CCGRID '12, IEEE Computer Society, ISBN 978-0-7695-4691-9, 318-325, 2012.
[19] T. Glatard, J. Montagnat, D. Lingrand, X. Pennec, Flexible and Efficient Workflow Deployment of Data-Intensive Applications On Grids With MOTEUR, J. High Performance Computing Applications 22 (3) (2008) 347-360.
[20] J. Hofer, T. Fahringer, A multi-perspective taxonomy for systematic classification of grid faults, in: 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (2008) 126-130.
[21] S. Madougou, M. Santcroos, A. Benabdelkader, B. D. van Schaik, S. Shahand, V. Korkhov, A. H. van Kampen, S. D. Olabarriaga, Provenance for distributed biomedical workflow execution, in: HealthGrid Applications and Technologies Meet Science Gateways for Life Sciences, vol. 175, 91-100, 2012.
[22] S. D. Olabarriaga, T. Glatard, P. T. de Boer, A Virtual Laboratory for Medical Image Analysis, IEEE Transactions on Information Technology in Biomedicine 14 (4) (2010) 979-985.
[23] S. Shahand, M. Santcroos, A. van Kampen, S. Olabarriaga, A Grid-Enabled Gateway for Biomedical Data Analysis, Journal of Grid Computing (2012) 1-18, ISSN 1570-7873.
[24] J. Montagnat, B. Isnard, T. Glatard, K. Maheshwari, M. B. Fornarino, A data-driven workflow language for grids based on array programming principles, in: Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS '09, ACM, ISBN 978-1-60558-717-2, 10, 2009.
[25] T. Glatard, J. Montagnat, D. Emsellem, D. Lingrand, A Service-Oriented Architecture enabling dynamic service grouping for optimizing distributed workflow execution, Future Gener. Comput. Syst. 24 (7) (2008) 720-730, ISSN 0167-739X.
[26] J. T. Moscicki, M. Lamanna, M. Bubak, P. M. A. Sloot, Processing moldable tasks on the grid: Late job binding with lightweight user-level overlay, Future Generation Comp. Sys. 27 (6) (2011) 725-736.
[27] L. Moreau et al., The Open Provenance Model core specification (v1.1), Future Generation Computer Systems 27 (6) (2011) 743-756.
[28] Y. Simmhan, P. Groth, L. Moreau, Special Section: The third provenance challenge on using the open provenance model for interoperability, Future Generation Computer Systems 27 (6) (2011) 737-742, ISSN 0167-739X.
[29] H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinformatics 25 (14) (2009) 1754-1760, ISSN 1367-4803.
[30] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool, Journal of Molecular Biology 215 (3) (1990) 403-410, ISSN 0022-2836.
[31] B. D. C. van Schaik, M. Santcroos, V. Korkhov, A. Jongejan, M. Willemsen, A. H. C. van Kampen, S. D. Olabarriaga, Challenges in DNA sequence analysis on a production grid, in: PoS(EGICF12-EMITC2), 1-19, submitted, 2012.
[32] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (First Edition), Addison-Wesley Longman Publishing Co., Inc., ISBN 0321321367, 2005.
[33] T. Hothorn, K. Hornik, A. Zeileis, Unbiased Recursive Partitioning: A Conditional Inference Framework, Journal of Computational and Graphical Statistics 15 (3) (2006) 651-674.
[34] T. Hothorn, K. Hornik, A. Zeileis, party: A Laboratory for Recursive Partytioning, 2006.
[35] M. K. Anand, S. Bowers, B. Ludäscher, Database support for exploring scientific workflow provenance graphs, in: Proceedings of the 24th International Conference on Scientific and Statistical Database Management, SSDBM '12, Springer-Verlag, ISBN 978-3-642-31234-2, 343-360, 2012.