November 6, 2005 13:11 WSPC/INSTRUCTION FILE workflows_evaluation
Evaluating workflow management systems for bioinformatics

Zoé Lacroix and Hervé Ménager
Mechanical and Aerospace Engineering, Arizona State University, P.O. Box 876106, Tempe, Arizona 85287-6106, USA
{zoe.lacroix, herve.menager}@asu.edu

1. Introduction

A scientific protocol describes an experiment whose outcome is compared to the expected results in order to draw a conclusion 11. A scientific protocol is composed of a reproducible succession of tasks, organized with respect to their order of execution. Until recently, scientific protocols in the life sciences, as in many other fields, were mainly composed of successions of tasks executed by humans. Progress in computer science and genomic technologies has resulted in the production of a constantly increasing amount of digital data. As a result, a significant part of the experiments in modern sciences such as bioinformatics rely extensively on digital datasets that have to be analyzed and correlated, and do not require human intervention. Being data-intensive, these protocols generate new needs. Users need tools that let them manage both the authoring of protocols (i.e., express, store, revise, and retrieve protocol definitions) and their execution. These tools also need to allow the retrieval and analysis of the results collected for each protocol instance. Workflow systems, initially developed to automate business activities, offer a suitable solution for designing and executing scientific protocols, since workflows can likewise be defined as a succession of ordered tasks. There is currently a great number of software tools that can be qualified as scientific workflow management systems, with a great diversity among the projects and software in this category. These systems have different characteristics: more or less specialized in handling certain types of workflows, more or less scalable, academic or commercial, etc. Hence, when it comes to selecting a software package, a scientist needs to know which one best meets his or her criteria. To evaluate to which extent a software package meets the user's needs, the characteristics of similar software must be identified and the benefits offered by each compared.
Comparing these systems is not an easy task, because of the diversity and complexity of the available solutions, and because each user has his or her own specific needs, and thus accords a specific importance to each of the possible criteria. The aim of this paper is to establish a list of criteria to help evaluate different


bioinformatics workflow management systems. These criteria should be both easy to assess and relevant to the requirements of this software category. The first section presents the context of workflow systems development and some definitions. The next section describes a set of evaluation criteria, explains their importance in the context of bioinformatics, and presents an evaluation grid that can be used to analyze and compare different solutions. The third section, after briefly introducing some of the existing solutions, assesses them according to our evaluation grid. Finally, we discuss the similarities and differences that appear in the results, in order to better characterize these systems.

2. Context of workflow systems development and definitions

2.1. History and traditional context

Workflows were historically developed in a business-oriented environment, where they were aimed at defining and automating office work procedures, to help manage and reduce the volume of paper-based information. The re-engineering trends of the nineties helped this technology develop, and a push for standards led to the creation of organizations such as the Workflow Management Coalition a. This group defines a workflow as "the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules" 6.

2.2. Definitions

We define a workflow as the description of a reproducible process composed of a set of coordinated tasks. A workflow is authored and executed using software systems called workflow management systems. Each distinct execution of a workflow is a workflow instance. Workflows and tasks may require inputs for their execution, and return outputs. The outputs of the tasks of a workflow instance that are not part of the workflow results are the intermediate results.
In addition to these notions of inputs and outputs, the successful execution of a task may require the verification of conditions that are not specified in the provided information flow (i.e., its inputs) but are related to the state of the "outside world": these are the pre-conditions of a task. An execution may also have consequences that are not reflected in its outputs, or effects. A workflow can be defined inductively: the basic component of a workflow is a task, and a basic workflow is composed of a single task. Workflows can be connected into more complex ones, using connectors such as "successor" or "merge". We can define a task with the following (non-exhaustive) list of properties:

Task ::= (<Input>, <Output>, <Condition>, <Result>, Name, Description)

a More information about this organization is available at: http://www.wfmc.org


The coordination of the execution of the different components of a workflow can be expressed through different operators that define control constructs:

Workflow ::= Task | (Workflow + Workflow | Workflow − Workflow | Workflow* | Workflow / Workflow)

The above partial definition states that a workflow is either a single task, or is composed of a set of workflows structured by control constructs. These control constructs specify that the components are executed concurrently (operator +) or sequentially (operator −), that a single component is executed iteratively (operator *), or that different tasks are executed alternatively (operator /). The composition of a workflow using these constructs defines a control flow. Another type of coordination, essential to the definition of a workflow, is based on the availability of the inputs of its different components. We can express the flow of information between tasks as a list of links, or bindings, between two parameters: either the output and the input of two successive tasks, the input of a workflow and the input of one of its components, or the output of a workflow and the output of one of its components:

Binding ::= (TaskOutput, TaskInput) | (WorkflowInput, TaskInput) | (TaskOutput, WorkflowOutput)

To summarize, the coordination of a workflow may be based on two types of flows:

• Control flows specify this coordination based on the execution status of each task. The execution of a task is usually subordinated to the completion of the preceding tasks.
• Data flows represent the "path" of data between tasks. In such a perspective, the execution of a task is only conditioned by the availability of its input data.

These two views offer different, yet complementary, perspectives on the coordination of tasks.
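As an illustration, the inductive definition above can be sketched in a few lines of Python. This is a minimal sketch; the class and field names are ours, not taken from any particular workflow system.

```python
# Minimal sketch of the paper's inductive workflow definition: a basic
# workflow wraps a single task; larger workflows combine sub-workflows with
# control constructs ("+" concurrent, "-" sequential, "*" iterative,
# "/" alternative), while bindings link parameters along the data flow.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    description: str = ""

@dataclass
class Workflow:
    operator: str    # "task" for a basic workflow, else "+", "-", "*", "/"
    components: list  # a single Task, or a list of sub-workflows
    bindings: list = field(default_factory=list)  # (source, target) pairs

# A sequence (operator "-") of two tasks whose data flow binds the output
# of "fetch" to the input of "align".
fetch = Task("fetch", inputs=["accession"], outputs=["sequence"])
align = Task("align", inputs=["sequence"], outputs=["alignment"])
seq = Workflow("-",
               [Workflow("task", [fetch]), Workflow("task", [align])],
               bindings=[("fetch.sequence", "align.sequence")])
```

Note that the binding list is exactly the data-flow view of the same workflow whose control-flow view is the "−" operator.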
Business-oriented workflows typically emphasize the control flow, because they describe complex tasks that involve information that may not be completely represented in their inputs and outputs, and because they represent human tasks whose inputs cannot always be captured. These two types of flows specify constraints on each other, as they define the coordination of tasks with regard to independent rules. For instance, the execution of a sequence of tasks requires that the input of each task be computed before its execution. Therefore, one cannot specify a control construct that would execute a task T1 followed by a task T2, and feed the input of task T1 with one of the outputs of task T2. Workflow tasks may have different purposes, such as:

• control the workflow, their results determining choices in the coordination of the tasks;
• collect data from various data sources;


• analyze data, processing them to compute new results;
• transform data, for instance to increase the readability of the workflow outputs, or to enable the interoperability of different tasks.

2.3. Workflows and bioinformatics

An example of a bioinformatics workflow is the "gene clustering" workflow, used when analyzing the evolution of the expression rate of different genes during the successive states of a disease. The result of this experiment is a list of gene accession numbers, which is used to gather related information, such as the binding partners of their products, the signal transduction pathways they belong to, or their function. This information can subsequently be used to cluster the different genes. For instance, if the level of expression of many genes that belong to the same pathway is modified, then this pathway is important in the disease state. Such results can lead to further investigations to determine the exact role this pathway plays in the disease. This example can be expressed as a workflow of successive information collection and analysis steps, displayed in Figure 1.

Figure 1. Gene clustering workflow example
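The annotation-collection steps of the gene clustering workflow each depend only on the gene list, so a data-flow engine may execute them concurrently. A minimal Python sketch of this pattern follows; the annotation functions are illustrative stand-ins for calls to external databases, not real resources.

```python
# Data-flow view of the gene clustering example: the three collection tasks
# share the same input (the gene list), so their inputs are all available at
# once and they can run in parallel.
from concurrent.futures import ThreadPoolExecutor

def get_pathways(genes):
    # Stand-in for querying a pathway database.
    return {g: f"pathway({g})" for g in genes}

def get_functions(genes):
    # Stand-in for querying a function annotation resource.
    return {g: f"function({g})" for g in genes}

def get_binding_partners(genes):
    # Stand-in for querying a protein-interaction database.
    return {g: f"partners({g})" for g in genes}

def collect_annotations(genes):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, genes)
                   for f in (get_pathways, get_functions, get_binding_partners)]
        pathways, functions, partners = [f.result() for f in futures]
    return pathways, functions, partners

pathways, functions, partners = collect_annotations(["BRCA1", "TP53"])
```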

As in this example, scientific workflows are mainly defined by their data flow, because all the significant information must be explicitly defined. From this point of view, a scientific workflow is similar to a database query, and the coordination of its tasks is more flexible, allowing optimization mechanisms such as pipelining or parallelism. For instance, in the previously mentioned example, the collection of the signal transduction pathways, protein functions, and binding partners can be executed concurrently. An in-depth description of the different types of tasks involved in a bioinformatics workflow can be found in 21.

2.4. Workflow management systems requirements

Classic workflow management systems have to meet a number of requirements:

(1) express the definition of the workflow;


(2) execute the defined tasks with respect to the coordination properties defined in the workflow;
(3) ensure the availability of the results and effects;
(4) provide a secure multi-user environment, required in any enterprise-level software system.

However, the recent introduction of workflows in the field of scientific data management extends this list of requirements:

(5) provide data collection capabilities that let scientists exploit the wealth of available resources;
(6) offer powerful computing capabilities, corresponding to the data-intensive tasks that are executed;
(7) ensure the traceability of data, and therefore the reproducibility of experiments, in a highly distributed and dynamic environment;
(8) facilitate the use of the system by ensuring a high degree of transparency between the design of a workflow and its implementation, automatically solving translation and interoperability issues between the different databases and tools accessed.

These requirements can be translated into a set of system specifications against which a given scientific workflow management system can be evaluated:

• The system interface must be "scientist-friendly" enough to facilitate the design of workflows by users who do not necessarily have programming skills (see requirement 1). Ideally, this interface clearly separates the design of a workflow (its semantics) from its implementation, guiding the user from the first step to the second (see requirement 8).
• The execution of workflows (see requirement 2) has to be handled by an execution engine scalable enough to run these workflows on very large datasets (see requirements 5 and 6) and for large-scale organizations (see requirement 4).
• The data collection and computing capabilities (see requirements 5 and 6) demand access to a maximum of resources, including public databases and analysis tools.
• Finally, the availability of the results and the traceability of the workflow execution (see requirements 7 and 3) necessitate a workflow repository (such as a database) that records, in addition to the workflow definitions, all the data resulting from their execution (metadata, intermediate and final results).
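A workflow repository of the kind this last specification calls for can be sketched as follows. This is a hypothetical illustration: the class name, field names, and stored values are ours, chosen only to show what such a repository records.

```python
# Sketch of a workflow repository: it stores workflow definitions and, for
# every execution, a record holding metadata plus intermediate and final
# results, so that an execution can be traced afterwards.
import datetime

class WorkflowRepository:
    def __init__(self):
        self.definitions = {}  # workflow name -> workflow definition
        self.instances = []    # one record per workflow instance

    def record_instance(self, workflow, intermediate, final):
        self.instances.append({
            "workflow": workflow,
            "executed_at": datetime.datetime.now().isoformat(),
            "intermediate_results": intermediate,
            "final_results": final,
        })

repo = WorkflowRepository()
repo.definitions["gene_clustering"] = "<workflow definition>"
repo.record_instance("gene_clustering",
                     intermediate={"accessions": ["NM_007294"]},
                     final={"clusters": [["NM_007294"]]})
```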


Table 1. Software evaluation criteria

Category: Description
Extensibility: System customization capabilities: add new data types, tools, or queries. Reuse previous queries (modularity capabilities) or tools.
Functionality: Support the queries that need to be executed.
Usability: Supply an appropriate user interface. Return the results in an appropriate format.
Understandability: Explain the meaning of the scientific queries and results.
Scalability: Handle the amount of data and the intended number of users.
Efficiency: Perform a query in a satisfactory time frame.

3. Evaluation criteria

The criteria used to evaluate a software package can be classified into six characteristics, and each can be considered from two perspectives 10: the implementation perspective and the user perspective. While the implementation perspective is more concerned with the technical details of the implementation, the user perspective tries to characterize a software package from an end-user point of view. Because the details of the implementation of the evaluated software are not necessarily available, we focus on the user perspective. We introduce evaluation criteria, sorted according to six characteristics, in Table 1. Considering the nature and the stage of development of these systems, the most important characteristics to assess are extensibility and functionality, as well as usability and understandability. Because many of these systems are academic, their scalability features are often undeveloped, as scalability is not a common preoccupation in academic software development. As these systems are in early development stages, efficiency is not a priority in this evaluation either, because the development of an optimization strategy usually comes with maturity.

3.1. Extensibility-Functionalities

A scientific workflow is composed of steps built from components that either access databases, or call tools and applications to analyze the data (see requirements 5 and 6). These scientific resources can be available as components of the system, and thus be internal functionalities, or be accessed as external functionalities, and


be categorized as extensibility features. Components integrated with the tool could be seen as functionalities of the software, whereas the solutions provided to integrate external tools can be described as part of its extensibility characteristics. Still, this category of software does not really permit such a distinction. The components, even if packaged with the software, are often wrapped external tools such as BioPerl b or BioJava c; therefore, they are not really specific to the workflow management system. As we will see later, we cannot separate them on the basis of the location of their execution (local or remote) either, because these platforms increasingly rely on distributed architectures that allow resources to be used seamlessly, independently of their location. A more appropriate distinction is to consider, on one side, the resources available by default in the software, without regard to how specific they are, and, on the other side, the types of resources that can be integrated in any way to be used as components: the integration capabilities. The connection between different components also raises data transformation issues: the output of a workflow step is not necessarily in the right format to be a valid input for the following one. The ability to transform data from one format to another can thus be included in this category as well, since it directly affects the extensibility of a system.

3.2. Usability

Usability can be defined as "the ease with which a user can learn to operate, prepare inputs for, and interpret outputs from a system or a component" 7. We evaluate this criterion using four factors: the characteristics of the software interface, the existence and quality of user documentation and software support, its portability, and the level of technical knowledge required to operate it.

3.2.1.
Software interface

Interactions with a software system, whether from a user or from another system (any "actor" in the UML use-case sense), are possible through interfaces. There are different types of interfaces, each having its own advantages. Graphical user interfaces (GUI) use the graphical possibilities of windowing operating systems to offer users an intuitive interface that is usable without understanding complex text-mode syntaxes. The existence of such an interface is critical in the area of bioinformatics workflow management systems, because one of their requirements is to allow scientists to design complex workflows with as few programming skills as possible (as mentioned previously in requirement 1).

b For more information, see the BioPerl website: http://bio.perl.org/
c For more information, see the BioJava website: http://www.biojava.org/


Command line interfaces (CLI) rely only on textual input and output for interaction with users. Though less intuitive, they let users easily automate their interactions with the software, using scripts processed by command-line interpreters. In the field of bioinformatics, a single workflow may have to be run repeatedly, for example to test the results with different inputs, or to check resources for updates. An interface that allows easy automation is therefore an advantage. Application programming interfaces (API) allow interaction with other programs by providing programming languages with functions or objects to communicate with them. These interfaces are useful for automating processes, but also for any integration task with other software.
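The kind of automation a CLI enables can be sketched as follows. The workflow command here is simulated with a trivial `python -c` invocation; a real script would call the workflow system's own command-line tool, whose name and options vary by system.

```python
# Re-running the same workflow over several inputs from a script: the loop
# invokes a (simulated) workflow CLI once per accession number and collects
# the textual output of each run.
import subprocess
import sys

inputs = ["NM_007294", "NM_000546"]
outputs = []
for accession in inputs:
    # Stand-in for e.g. `run-workflow gene_clustering <accession>`.
    proc = subprocess.run(
        [sys.executable, "-c",
         "import sys; print('processed ' + sys.argv[1])", accession],
        capture_output=True, text=True, check=True)
    outputs.append(proc.stdout.strip())
```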

3.2.2. User documentation

The user documentation provides information about:

• How to use the software. For instance, how to design and execute a workflow, and how to use the results of this execution.
• How to maintain and calibrate it. For instance, if the system stores some information, the documentation should provide a description of the backup procedure.
• How to fix the system, in case a misuse or a system incident occurs. In the latter case, a detailed description of the recovery procedures is a key element of the reliability of a system, as it lowers the MTTR d.
• How to extend the system, describing procedures to add new functionalities.

This list is not intended to be detailed and exhaustive, as the need for documentation heavily depends on the system, its architecture, its use, and its functionalities. We would like to underline the fact that this feature is essential, even for the systems that achieve the most user-friendly interfaces and usage. This documentation can be printed and packaged with the software, or available electronically, i.e., on the internet. For our software category, such documentation should be as precise as possible, to allow users to learn quickly how to use the system.

d MTTR: Mean Time To Repair.

3.2.3. Software support

Software support includes every means of communication at the disposal of a user to help him or her solve the problems encountered while using the system. These means of communication are diverse, including e-mail, forums, chats, mailing lists,


telephone, etc. Through these tools, the user can communicate with the authors, a dedicated support team, or a user community.

3.2.4. Portability

Portability can be defined as "the ease with which a system or component can be transferred from one hardware or software environment to another" 7. Unlike closed environments such as classical corporate businesses, where the computing environment is controllable, bioinformatics workflow management systems are used in a highly distributed environment. These environments can be extremely heterogeneous, which greatly emphasizes the need for portability.

3.2.5. Level of technical knowledge required

The usability of a software system results from the influence of many different factors. With a factor such as "level of technical knowledge required", we want to underline the fact that this particular kind of software is to be used by scientists, who should be able to use it with minimal skills in computing and programming.

3.3. Understandability

Understandability is "the degree to which the purpose of a system or component is clear to the evaluator" 7. To assess it, we use three factors: the availability of data provenance information, the availability of process execution information, and the mechanisms by which the software handles faults in the execution of a workflow.

3.3.1. Data provenance information

Data provenance information includes all the intermediate results of workflow instances, combined with collection information including the name, version, and location of the resource that produced the data, the date and time of collection, and its mapping to the final result. Data sources in the life sciences have particular properties 4, one of them being that the data organizations (schemas) as well as the contents of the data sources are extremely dynamic.
Because of these characteristics, and the intrinsic instability of data (and programs) on the web, scientific resources evolve at a fast pace, affecting the reproducibility of results. If the results of an experiment cannot be reproduced, a scientist might at least want to be able to capture the reasons why. Gathering all the available information about workflow executions can help achieve this goal, explaining for instance that the final results of a workflow execution differ from those of its previous execution because the data collected from a remote database changed, a situation that may occur when the data source was updated or curated between the two executions of the workflow.
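The provenance record described above can be sketched as follows. The field names and the example resource are illustrative assumptions, not taken from any of the evaluated systems.

```python
# Sketch of a data provenance record: each piece of collected data carries
# the name, version, and location of the resource that produced it, plus
# the collection time, so that diverging results between two executions can
# later be explained.
import datetime

def collect_with_provenance(value, resource, version, location):
    return {
        "value": value,
        "provenance": {
            "resource": resource,
            "version": version,
            "location": location,
            "collected_at":
                datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
    }

record = collect_with_provenance(
    value={"gene": "TP53", "pathway": "p53 signaling"},
    resource="KEGG", version="35.0",
    location="http://www.genome.jp/kegg/")
```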


In summary, to compensate for data instability in the life sciences, data provenance information improves traceability. The integration of many different resources also raises the issue of intellectual property: increasingly seamless data integration tools cannot be used if users cannot access data provenance information.

3.3.2. Process execution information

Process execution information is the information a system gives users about the workflow instances currently being executed. It helps them monitor the execution process and alter it if necessary; for instance, if a workflow instance is executing too slowly, a user may cancel it in order to optimize it further. We define process execution information as the capacity of a system to inform users about the execution status of a workflow.

3.3.3. Fault handling

Fault handling expresses the way the system reacts when a workflow instance behaves in an unexpected way. In this case, the system should both inform the user (which is part of the previously defined process execution information) and recover by automatically deciding to cancel the workflow instance or to alter it. Such an alteration could be, for example, retrieving data from an alternate database when the planned resource is not available. This automated recovery behavior is specified by the users during the design of a workflow.

3.4. Scalability

Scalability is "the ease with which a system or a component can be modified to fit the problem area" 7. It is generally used to designate the ability of a system to handle an increase in the information it computes, for instance through a larger group of users or larger datasets. To assess it, we estimate the level of support for multiple users and for workflow decomposition.

3.4.1. Support for multiple users

As a discovery process often involves large teams of people working together, the need for software that supports the sharing of information and tasks is important. Hence, the ability to share workflow definitions or results between different users can be essential in such software.

3.4.2. Support for workflow decomposition

We mentioned in the introduction that a scientific workflow can, just like an elementary scientific task, be described by its inputs and outputs. Some pieces of workflows are regular patterns that recur in several of them. The ability to design such patterns once and use them as tasks in every scientific workflow that needs them enhances reliability and encourages the use of the system to design increasingly complex protocols.
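The automated recovery behavior described in Section 3.3.3 can be sketched as follows. The resource names and the fallback logic are illustrative assumptions, not the mechanism of any evaluated system.

```python
# Sketch of fault handling with an alternate database: if the planned
# resource fails, the engine tries a user-specified alternate source instead
# of cancelling the workflow instance, and keeps the error messages so the
# user can be informed (process execution information).
def fetch_from(source, accession):
    if source == "primary-db":
        raise ConnectionError("primary-db unavailable")  # simulated outage
    return {"source": source, "accession": accession}

def run_step(accession, sources=("primary-db", "mirror-db")):
    errors = []
    for source in sources:
        try:
            return fetch_from(source, accession)
        except ConnectionError as exc:
            errors.append(str(exc))  # reported back to the user
    raise RuntimeError("all sources failed: " + "; ".join(errors))

result = run_step("NM_007294")
```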

4. Results

4.1. Compared systems

Many scientific workflow systems are currently available, and they differ on many points. We chose to compare five of them, presented below. General information about these systems is also displayed in Table 2.

4.1.1. Taverna

Taverna 18 is a system that allows designing and executing workflows using web service components. It is an open-source software package that lets users integrate remote and local resources through an extensive collection of wrappers that access, for instance, web services or Java components. Although it can be used with any kind of resource, this software is particularly aimed at bioinformaticians, offering a large number of predefined biological resources. Taverna is a component of the EPSRC (Engineering and Physical Sciences Research Council)-funded myGrid project.

4.1.2. JOpera

JOpera is developed by the Information and Communication Systems Research Group at the Swiss Federal Institute of Technology in Zürich. This tool allows the design and execution of workflows from components that can be web services, but also other types of software. Although JOpera is not openly aimed at bioinformatics, it is the successor to the BioOpera project, whose goal was to improve and automate the large-scale analysis of genetic data sets.

4.1.3. Kepler

Kepler 12 is a project resulting from the collaboration of various institutes, including SEEK and SDM Center/SPA. This system aims at developing tools for scientific workflows, allowing their design and execution. It is based on Ptolemy II, a set of Java packages supporting heterogeneous, concurrent modeling and design.

4.1.4. Triana

Triana 5 is a project from Cardiff University. It allows users to build scientific workflows using a great variety of predefined tools, to integrate new ones, and to run the resulting workflows.


4.1.5. Pipeline Pilot

Pipeline Pilot is a commercial software product published by Accelrys Inc. It is aimed specifically at drug discovery activities, and provides for this purpose many analysis tools for cheminformatics and bioinformatics, as well as wrappers to integrate different external tools.

Table 2. General information about the evaluated systems

Taverna
- Author: Collaboration between: EBI
- Academic / Commercial: Academic
- License type: LGPL
- Sources available: Yes
- Related to projects: myGrid
- Main URL: http://taverna.sourceforge.net
- Current version and release date: V1.2 (06/25/2005)
- Programming language: Java
- Workflow language: XML sublanguage: XSCUFL
- Use: Scientific

JOpera
- Author: IKS, ETH Zürich
- Academic / Commercial: Academic
- License type: JOpera License, © C. Pautasso
- Sources available: No
- Related to projects: None
- Main URL: http://www.iks.inf.ethz.ch/jopera
- Current version and release date: V1.71 (12/10/2004)
- Programming language: Java
- Workflow language: XML sublanguage: OML (Opera Modeling Language)
- Use: Scientific

Kepler
- Author: Collaboration between: SEEK, SDM Center/SPA, Ptolemy II, GEON, ROADNet, EOL
- Academic / Commercial: Academic
- License type: BSD-style
- Sources available: Yes
- Related to projects: Ptolemy II
- Main URL: http://kepler-project.org/
- Current version and release date: V1.0.0 alpha4 (12/09/2004)
- Programming language: Java
- Workflow language: XML sublanguage: MoML (Ptolemy II language)
- Use: Scientific

Triana
- Author: Cardiff University
- Academic / Commercial: Academic
- License type: Apache software license
- Sources available: Yes
- Related to projects: GridLab
- Main URL: http://www.trianacode.org/
- Current version and release date: v3.1.1 (06/07/2005)
- Programming language: Java
- Workflow language: XML sublanguage; can also import from others such as BPEL4WS
- Use: Scientific

Pipeline Pilot
- Author: SciTegic
- Academic / Commercial: Commercial
- License type: Commercial
- Sources available: No
- Related to projects: None
- Main URL: http://www.scitegic.com/products_services/pipeline_pilot.htm
- Current version and release date: v4.0 (03/31/2004)
- Programming language: ?
- Workflow language: ?
- Use: Bioinformatics, Chemoinformatics

Table 3. Evaluation criteria values collected for the different workflow management systems

Taverna
- Resources integration (available protocols): based on SOAP, REST, or JDBC
- Available languages: Java based (Beanshell scripting, API consumer, Local Java)
- Data transformation capabilities: XSLT or XPath components
- Software interface: Desktop interface
- Command line interface: No
- Application programming interface: Yes
- User documentation: Online documentation
- Level of technical knowledge required: Computer science skills needed
- Electronic support: Mailing lists
- Phone line: No
- Data types supported / visualization: Any (visualization plug-ins can be added)
- Platforms supported by the client application: Windows, Mac OS X, Linux
- Intermediate results visible / saved: Yes / Yes
- Process execution information (execution status): Yes
- Faults handling (step, root cause, recoverability): Yes
- Workgroup possibilities: None, but MIR allows these capabilities
- Product architecture: Stand-alone application
- Workflow "sub-units" definition: Yes

JOpera
- Resources integration (available protocols): WS, JDBC, PostgreSQL
- Available languages: Java snippets; calls to Java programs
- Data transformation capabilities: Use of split/merge operators
- Software interface: Desktop interface
- Command line interface: No
- Application programming interface: Yes
- User documentation: Online user guide
- Level of technical knowledge required: Computer science skills needed
- Electronic support: E-mail
- Phone line: No
- Data types supported / visualization: Text
- Platforms supported by the client application: Windows 2000 or XP
- Intermediate results visible / saved: No / Yes
- Process execution information (execution status): Yes
- Faults handling (step, root cause, recoverability): Yes (e-mail notification)
- Workgroup possibilities: None
- Product architecture: Stand-alone application for design and runtime, and a server for the execution engine
- Workflow "sub-units" definition: ?

Kepler
- Resources integration (available protocols): WS, JDBC, Grid
- Available languages: Python, Matlab, command line, Grid, HTML, ROADNet sensors
- Data transformation capabilities: Data transformation actors (XSLT, XQuery, Perl, etc.)
- Software interface: Desktop interface
- Command line interface: No
- Application programming interface: No
- User documentation: Online and included documentation + tutorials
- Level of technical knowledge required: Computer science skills needed
- Electronic support: Mailing lists, IRC channel
- Phone line: No
- Data types supported / visualization: HTML
- Platforms supported by the client application: Windows, Linux, Mac OS X
- Intermediate results visible / saved: Yes / Yes
- Process execution information (execution status): Yes
- Faults handling (step, root cause, recoverability): Yes
- Workgroup possibilities: None
- Product architecture: ?
- Workflow "sub-units" definition: Yes (workflows can be abstracted as steps)

Triana
- Resources integration (available protocols): WS, Grid services, text files, image files, audio files
- Available languages: Java components
- Data transformation capabilities: ?
- Software interface: Desktop interface
- Command line interface: No
- Application programming interface: Yes
- User documentation: Online documentation
- Level of technical knowledge required: Computer science skills needed
- Electronic support: E-mail, mailing lists
- Phone line: No
- Data types supported / visualization: Text; graph editor
- Platforms supported by the client application: Windows, Mac OS X, Unix
- Intermediate results visible / saved: Yes / Yes
- Process execution information (execution status): Yes
- Faults handling (step, root cause, recoverability): Yes
- Workgroup possibilities: None
- Product architecture: ?
- Workflow "sub-units" definition: Yes

Pipeline Pilot
- Resources integration (available protocols): WS, ODBC, Excel, SD files, molecular-compliant databases
- Available languages: Perl or Java components
- Data transformation capabilities: Use of "Utilities" components
- Software interface: Desktop interface + web interface
- Command line interface: No
- Application programming interface: Workflows can be published as web services
- User documentation: User guide available in PDF and included in the software
- Level of technical knowledge required: Computer science skills needed
- Electronic support: E-mail
- Phone line: Yes
- Data types supported / visualization: Any (uses visualizers for specific types, such as sequences)
- Platforms supported by the client application: Windows, Linux
- Intermediate results visible / saved: Yes / Yes
- Process execution information (execution status): Yes
- Faults handling (step, root cause, recoverability): Yes
- Workgroup possibilities: Yes
- Product architecture: Client for design and runtime; server for the execution engine; web interface and web services interface for runtime
- Workflow "sub-units" definition: ?

November 6, workflows˙evaluation

2005

13:11

WSPC/INSTRUCTION

FILE

Evaluating workflow management systems for bioinformatics

17

5. Discussion

5.1. Extensibility / Characteristics

All the systems we described include mechanisms that aim to make the integration of new resources as easy as possible. This common characteristic can be interpreted as an acknowledgment of the dynamic nature of data in this domain: one way to compensate for frequent changes in the resources is to facilitate their integration. If we examine more closely the types of resources that can potentially be integrated, although many kinds of interfaces can be used, the most widely offered is web services technology. It is not the most efficient, because it adds a significant communication overhead by using a verbose XML-based protocol. However, its numerous advantages, such as its ease of integration, its wide adoption, and its firewall-friendly protocol, outweigh this inconvenience, especially since, when using resources such as publicly available databases over the Internet, the limiting factor is more likely to be the resource's performance than the communication itself. Data format transformation and extraction seem to remain a weak point in these systems, as they often rely on scripting mechanisms and only occasionally on XSLT or XQuery capabilities. The optimal transformation mechanism between different data formats remains an open problem.
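The kind of ad hoc "shim" scripting described above can be pictured with a minimal sketch. The function below — a hypothetical illustration, not taken from any of the evaluated systems — converts FASTA text into a simple XML document that a downstream web-service step could consume; the tag names are assumptions for the example:

```python
# Illustrative data-format shim: FASTA text -> minimal XML.
# All element names ("sequences", "sequence") are hypothetical.
from xml.etree import ElementTree as ET

def fasta_to_xml(fasta_text: str) -> str:
    """Parse FASTA records and emit them as a simple XML document."""
    root = ET.Element("sequences")
    header, lines = None, []
    # The trailing ">" acts as a sentinel that flushes the last record.
    for line in fasta_text.strip().splitlines() + [">"]:
        if line.startswith(">"):
            if header is not None:
                seq = ET.SubElement(root, "sequence", id=header)
                seq.text = "".join(lines)
            header, lines = line[1:].strip() or None, []
        else:
            lines.append(line.strip())
    return ET.tostring(root, encoding="unicode")

print(fasta_to_xml(">seq1\nACGT\nTTGA\n>seq2\nGGCC"))
```

Such glue code must be rewritten for every pair of formats, which is precisely why built-in XSLT or XQuery support would be preferable.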

5.2. Usability

The software interfaces provided by the evaluated systems generally emphasize human interaction, providing a graphical user interface that lets users design and run workflows easily. However, it is important that such software also provide an interface that allows the execution of workflows to be automated, whether a Command Line Interface (as in Triana or Taverna) or an Application Programming Interface (as in Taverna and Pipeline-Pilot). The Pipeline-Pilot approach is interesting: it allows workflows to be published as web services, and therefore lets workflows be defined from reusable building blocks. This approach is close to what is proposed in the Web Services Business Process Execution Language (WSBPEL) 3 or the Web Ontology Language for Web Services (OWL-S) 15 , which also extend reusability by defining composite processes as web services. Data visualization is in many cases handled by distinct software, through a plug-in mechanism such as those of Taverna and Pipeline-Pilot; this adds to the extensibility of the systems in this category. The portability of these systems is generally good, allowing them to run on most existing platforms. This feature is especially important in the scientific community, where many different platforms are in use; it is facilitated by the fact that most of these systems are developed in Java.
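The "workflow published as a web service" idea can be sketched in a few lines. The snippet below is a hypothetical illustration, not the Pipeline-Pilot mechanism: a stand-in workflow entry point (`annotate_sequence`, an invented name) is exposed behind a bare JSON-over-HTTP endpoint, where a real system would publish a SOAP/WSDL interface:

```python
# Hedged sketch of exposing a workflow as a callable service so that
# other workflows can reuse it as a building block. All names are
# hypothetical; no evaluated system works exactly this way.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def annotate_sequence(seq: str) -> dict:
    """Stand-in for a published workflow: returns a trivial 'annotation'."""
    return {"length": len(seq), "gc": seq.count("G") + seq.count("C")}

class WorkflowHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the "workflow", return JSON.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = annotate_sequence(json.loads(body)["sequence"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), WorkflowHandler).serve_forever()
```

Once wrapped this way, the workflow becomes a remote task that any of the systems supporting web services could invoke, which is the reusability argument made by WSBPEL and OWL-S.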


5.3. Understandability

Data provenance information is provided by at least three of these systems: Taverna, JOpera, and Pipeline-Pilot. In the others, this information can be collected by specifying additional tasks that save intermediate results (e.g., using "reporting actors" in Kepler 2 ). However, we believe this collection should not have to be an explicit part of a workflow: given the importance of these data for the accountability of results, they should be recorded automatically. Process execution information and fault handling are also implemented, sometimes even giving the user the ability to define fault-tolerance behaviors for some tasks. For instance, Taverna lets users define alternate implementations for potentially failing tasks. This is a valuable feature in an environment where the execution of a workflow can be long and relies on externally controlled resources.

5.4. Scalability

Support for multiple users is usually not implemented, except in Pipeline-Pilot, the only commercial system included in this evaluation. However, the myGrid architecture, of which Taverna is a part, offers similar capabilities, for instance through the myGrid Information Repository (MIR), which provides users with workgroup capabilities, allowing them to share experiments and data. The product architecture is sometimes client-server based, separating the execution engine from the user interface, as in JOpera and Pipeline-Pilot. Triana is based on a much more ambitious architecture that can distribute the execution of workflows across different servers. Both Triana and Kepler are also able to use Grid-oriented protocols to execute some tasks. Support for workflow decomposition exists in many of these systems, allowing previously defined workflows and workflow components to be reused.
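Workflow decomposition rests on one idea: a sub-workflow offers the same interface as an atomic step, so it can be nested inside a larger workflow unchanged. The sketch below is a hypothetical illustration of that pattern — the `Step` and `Workflow` classes are invented for the example, not taken from any evaluated system:

```python
# Hedged sketch of workflow decomposition: a Workflow is itself a Step,
# so previously defined pipelines can be reused as building blocks.
class Step:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def run(self, data):
        return self.fn(data)

class Workflow(Step):
    """A sequence of steps that, being a Step itself, can be nested
    inside another workflow without modification."""
    def __init__(self, name, steps):
        super().__init__(name, None)
        self.steps = steps
    def run(self, data):
        for step in self.steps:
            data = step.run(data)
        return data

# A cleaning sub-workflow reused as one step of a larger analysis.
clean = Workflow("clean", [Step("strip", str.strip), Step("upper", str.upper)])
analysis = Workflow("analysis",
                    [clean, Step("count_gc", lambda s: s.count("G") + s.count("C"))])
print(analysis.run("  acgtgg \n"))  # -> 4
```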
Not only does this support for decomposition increase productivity, it also allows optimally factored workflows to be defined and eases their evolution, in the same way that well-designed source code eases software maintenance.

6. Conclusion and Future Work

By setting up a list of evaluation criteria for bioinformatics workflow management systems, this paper emphasizes the need for methodologies to assess the characteristics of such systems. There is also a need to define a benchmarking method for this category of software: the vast quantities of data that must be manipulated in bioinformatics stress the need for efficient systems, and their performance characteristics should therefore be a valuable factor

[Footnote: more information about MIR is available at http://www.mygrid.org.uk/index.php?module=pagemaster&PAGE_user_op=view_page&PAGE_id=47&MMN_position=55:51:52]


to help in choosing among them. However, since the use of such systems in bioinformatics is recent, their characteristics and use should evolve, and so should the criteria used to assess them. Furthermore, the results of this evaluation emphasize some characteristics common to scientific workflow management systems:

• Scientific workflows are mostly data-oriented, the coordination of tasks depending mostly on the availability of their inputs. The evaluated systems reflect this property, as the design phase of a workflow is based on connecting the inputs and outputs of its tasks.
• Some protocols are widely implemented to overcome the issues raised by accessing remote resources. This is the case for web services, which can be used in all the systems.

Other characteristics, however, distinguish different approaches:

• The level of abstraction of the approach to workflow design divides the systems into two categories. Whereas Taverna can track data provenance in workflows without it being specified at the design stage, Kepler requires adding an actor that handles this function. The latter can be qualified as a "lower-level" approach, giving more flexibility to the system but increasing the effort required to build workflows.
• The level of complexity of the different systems varies considerably, depending mostly on the type of workflows they handle. For instance, Taverna is directed more specifically at bioinformatics data and manages mainly text-based data, whereas Triana and Kepler can manage more complex digital data to handle signal-processing workflows. Consequently, these latter systems are more complex.

The different qualities of these systems can result from difficult trade-offs between, for example, usability and functionality. Scientists should therefore choose the system to adopt with care, depending on their specific needs, as the right choice will lead to improved productivity and faster discoveries.

References

1.
Special section on scientific workflows. SIGMOD Record, 34(3), 2005.
2. I. Altintas, A. Birnbaum, K. Baldridge, W. Sudholt, M. Miller, C. Amoreira, Y. Potier, and B. Ludäscher. A Framework for the Design and Reuse of Grid Workflows, 2005. To be published.
3. A. Arkin, S. Askary, B. Bloch, F. Curbera, Y. Goland, N. Kartha, C. K. Liu, S. Thatte, P. Yendluri, and A. Yiu. Web Services Business Process Execution Language - Working Draft, Feb. 2005. http://www.oasis-open.org/committees/download.php/11601/wsbpel-specification-draft-022705.htm.
4. S. Y. Chung and J. C. Wooley. Challenges Faced in the Integration of Biological Information, chapter 2, pages 11–34. Volume 1 of Lacroix and Critchlow 9, 2003.

5. D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang. Programming Scientific and Distributed Workflow with Triana Services. Grid Workflow 2004 Special Issue of Concurrency and Computation: Practice and Experience, 2005. To be published.
6. Workflow Management Coalition. Workflow Management Coalition Terminology and Glossary, Feb. 1999. http://www.wfmc.org/standards/docs/TC-1011_term_glossary_v3.pdf.
7. A. Geraci, F. Katki, L. McMonegal, B. Meyer, J. Lane, P. Wilson, J. Radatz, M. Yee, H. Porteous, and F. Springsteel. IEEE Standard Computer Dictionary: Compilation of IEEE Standard Computer Glossaries. The Institute of Electrical and Electronics Engineers, Inc., 1991.
8. S. Hastings, M. Ribeiro, S. Langella, S. Oster, U. Catalyurek, T. Pan, K. Huang, R. Ferreira, J. Saltz, and T. Kurc. XML database support for distributed execution of data-intensive scientific workflows. In SIGMOD Rec. 1, pages 50–55.
9. Z. Lacroix and T. Critchlow, editors. Bioinformatics: Managing Scientific Data, volume 1. Morgan Kaufmann Publishing, 2003.
10. Z. Lacroix and T. Critchlow. Compared Evaluation of Scientific Data Management Systems, chapter 13, pages 371–391. Volume 1 of 9, 2003.
11. A. E. Lawson. The nature and development of hypothetico-predictive argumentation with implications for science teaching. International Journal of Science Education, 25:1387–1408, Nov. 2003.
12. B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the KEPLER system. Concurrency and Computation: Practice and Experience, Special Issue on Scientific Workflows, 2005.
13. B. Ludäscher and C. Goble. Guest editors' introduction to the special section on scientific workflows. In SIGMOD Rec. 1, pages 3–4.
14. P. Maechling, H. Chalupsky, M. Dougherty, E. Deelman, Y. Gil, S. Gullapalli, V. Gupta, C. Kesselman, J. Kim, G. Mehta, B. Mendenhall, T. Russ, G. Singh, M. Spraragen, G. Staples, and K. Vahi. Simplifying construction of complex workflows for non-expert users of the Southern California Earthquake Center Community Modeling Environment. In SIGMOD Rec. 1, pages 24–30.
15. D. Martin, M. Burstein, J. Hobbs, O. Lassila, D. McDermott, S. McIlraith, S. Narayanan, M. Paolucci, B. Parsia, T. Payne, E. Sirin, N. Srinivasan, and K. Sycara. OWL-S: Semantic Markup for Web Services. W3C Working Draft, Dec. 2004. http://www.daml.org/services/owl-s/1.1/overview/.
16. T. M. McPhillips and S. Bowers. An approach for pipelining nested collections in scientific workflows. In SIGMOD Rec. 1, pages 12–17.
17. C. B. Medeiros, J. Perez-Alcazar, L. Digiampietri, J. G. Z. Pastorello, A. Santanche, R. S. Torres, E. Madeira, and E. Bacarin. WOODSS and the web: annotating and reusing scientific workflows. In SIGMOD Rec. 1, pages 18–23.
18. T. M. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, R. M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.
19. S. Shankar, A. Kini, D. J. DeWitt, and J. Naughton. Integrating databases and workflow systems. In SIGMOD Rec. 1, pages 5–11.
20. Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. In SIGMOD Rec. 1, pages 31–36.
21. R. Stevens, C. Goble, P. Baker, and A. Brass. A Classification of Tasks in Bioinformatics. Bioinformatics, 17(2):180–188, 2001.
22. M. Wieczorek, R. Prodan, and T. Fahringer. Scheduling of scientific workflows in the

ASKALON grid environment. In SIGMOD Rec. 1, pages 56–62.
23. J. Yu and R. Buyya. A taxonomy of scientific workflow systems for grid computing. In SIGMOD Rec. 1, pages 44–49.
24. Y. Zhao, J. Dobson, I. Foster, L. Moreau, and M. Wilde. A notation and system for expressing and executing cleanly typed workflows on messy scientific data. In SIGMOD Rec. 1, pages 37–43.