November 6, 2005 13:11 WSPC/INSTRUCTION FILE workflows˙evaluation
Evaluating workflow management systems for bioinformatics
Zoé Lacroix and Hervé Ménager
Mechanical and Aerospace Engineering, Arizona State University, P.O. Box 876106, Tempe, Arizona 85287-6106, USA
{zoe.lacroix, herve.menager}@asu.edu
1. Introduction
A scientific protocol describes an experiment, and its outcome is compared to the expected results in order to draw a conclusion 11. A scientific protocol is composed of a reproducible succession of tasks, organized with respect to their order of execution. Until recently, scientific protocols in the life sciences, as in many other fields, were mainly composed of successions of tasks executed by humans. Progress in technologies related to computer science and the genomic sciences has resulted in the production of a constantly increasing amount of digital data. As a result, a significant part of the experiments in modern sciences such as bioinformatics rely extensively on digital datasets that have to be analyzed and correlated, and do not require human intervention. Being data-intensive, these protocols generate new needs. Users need tools that let them manage both their authoring (i.e., express, store, revise, and retrieve protocol definitions) and their execution. These tools also need to allow the retrieval and analysis of the results collected for the protocol instances. Workflow systems, initially developed to automate business activities, offer a solution suitable for designing and executing scientific protocols, for workflows can likewise be defined as a succession of ordered tasks. There is currently a great number of software tools that can be qualified as scientific workflow management systems, and a great diversity among the projects and software in this category. These systems have different characteristics: more or less specialized to handle certain types of workflows, more or less scalable, academic or commercial, etc. Hence, when it comes to selecting a software package, a scientist needs to know which one best meets their criteria. To evaluate the extent to which a software package meets the user's needs, the characteristics of similar software must be identified and the benefits offered by each compared.
Comparing these systems is not an easy task, because of the diversity and complexity of the available solutions, and because each user has their own specific needs, thus according specific importance to each of the possible criteria. The aim of this paper is to establish a list of criteria to help evaluate different
bioinformatics workflow management systems. These criteria should be both easy to assess and relevant to the requirements of this software category. The first section presents the context for the development of workflow systems and some definitions. The next section gives a description of some evaluation criteria, explains their importance in the context of bioinformatics, and presents an evaluation grid that can be used to analyze and compare different solutions. The third section of this paper, after briefly introducing some of the existing solutions, assesses them according to our evaluation grid. Finally, we discuss the similarities and differences that appear in the results for these different systems, in order to characterize them better.

2. Context of workflow systems development and definitions
2.1. History and traditional context
Workflows were historically developed in a business-oriented environment, where they were aimed at defining and automating office work procedures, to help manage and reduce the volume of paper-based information. The re-engineering trends of the nineties helped this technology develop, and an urge for standards led to the creation of organizations such as the Workflow Management Coalition a. This group defines a workflow as "the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules" 6.

2.2. Definitions
We define a workflow as the description of a reproducible process composed of a set of coordinated tasks. A workflow is authored and executed using software systems called workflow management systems. Each distinct execution of a workflow is a workflow instance. Workflows and tasks may require inputs in order to be executed, and return outputs. The outputs of all the tasks of a workflow instance that are not part of the workflow results are the intermediate results.
In addition to these notions of inputs and outputs, the successful execution of a task may require the verification of conditions that are not specified in the provided information flow, i.e., its inputs, but are related to the state of the "outside world": these are the pre-conditions of a task. An execution may also have consequences that are not reflected in its outputs, its effects. A workflow can be defined inductively: the basic component of a workflow is a task, and a basic workflow is composed of a single task. Workflows can be connected into more complex ones, using connectors such as "successor" or "merge". We can define a task with the following (non-exhaustive) list of properties:

Task ::= (<Input>, <Output>, <Condition>, <Result>, Name, Description)

a More information about this organization is available at: http://www.wfmc.org
The coordination of the execution of the different components of a workflow can be expressed through different operators that define control constructs:

Workflow ::= Task | (Workflow + Workflow) | (Workflow − Workflow) | Workflow* | (Workflow / Workflow)

The above partial definition states that a workflow is either a single task, or is composed of a set of workflows structured by control constructs. These control constructs specify that the components are executed concurrently (operator +) or sequentially (operator −), that a single component is executed iteratively (operator *), or that different tasks are executed alternatively (operator /). The composition of a workflow using these constructs defines a control flow. Another type of coordination, essential to the definition of a workflow, is based on the availability of the inputs of its different components. We can express the flow of information between tasks as a list of links, or bindings, between two parameters: either the output and the input of two successive tasks, the input of a workflow and the input of one of its components, or the output of a workflow and the output of one of its components:

Binding ::= (TaskOutput, TaskInput) | (WorkflowInput, TaskInput) | (TaskOutput, WorkflowOutput)

To summarize, the coordination of a workflow may be based on two types of flows:
• Control flows specify this coordination based on the execution status of each task. The execution of a task is usually subordinated to the completion of the preceding tasks.
• Data flows represent the "path" of data between tasks. In such a perspective, the execution of a task is conditioned only by the availability of its input data.
These two views offer different, yet complementary, perspectives on the coordination of tasks.
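These definitions can be rendered as simple data structures. The sketch below is purely illustrative (the class and field names are ours, not those of any system discussed later): a workflow is either a single task or a combination of component workflows under one of the control operators, with bindings carrying the data flow.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    description: str = ""
    inputs: list = field(default_factory=list)    # <Input>
    outputs: list = field(default_factory=list)   # <Output>

@dataclass
class Workflow:
    op: str      # "task", or a control construct: "+", "-", "*", "/"
    parts: list  # component workflows (a single Task when op == "task")
    bindings: list = field(default_factory=list)  # (producer output, consumer input)

# A two-step sequence (operator "-"): collect data, then analyze it.
collect = Workflow("task", [Task("collect", inputs=["gene_ids"], outputs=["records"])])
analyze = Workflow("task", [Task("analyze", inputs=["records"], outputs=["clusters"])])
pipeline = Workflow("-", [collect, analyze],
                    bindings=[("collect.records", "analyze.records")])
```

The single binding here is of the first kind, (TaskOutput, TaskInput): it connects the output of `collect` to the input of `analyze`.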
Business-oriented workflows typically place more emphasis on the control flow, because they describe complex tasks that involve information that may not be completely represented in their inputs and outputs, and because they represent human tasks and capture their inputs. These two types of flows place constraints on each other, as they define the coordination of tasks with regard to independent rules. For instance, the execution of a sequence of tasks requires that the input of each task be computed prior to its execution. Therefore, one cannot specify a control construct that executes a task T1 followed by a task T2 and feeds the input of task T1 with one of the outputs of task T2. Workflow tasks may have different purposes, such as:
• control the workflow, their results determining choices in the coordination of the tasks.
• collect data from various data sources.
• analyze data, processing them to compute new results.
• transform data, for instance to increase the readability of the workflow outputs, or to enable the interoperability of different tasks.

2.3. Workflows and bioinformatics
An example of a bioinformatics workflow is the "gene clustering" workflow, used when analyzing the evolution of the expression rate of different genes during the successive states of a disease. The result of this experiment is a list of gene accession numbers, which is used to gather related information, such as the binding partners of their products, the signal transduction pathways they belong to, or their function. This information can subsequently be used to cluster the different genes. For instance, if the level of expression of many genes belonging to the same pathway is modified, then this pathway is important in the disease state. Such results can lead to further investigations to determine the exact role this pathway plays in the disease. This example can be expressed as a workflow describing the successive information collection and analysis steps, displayed in Figure 1.
Figure 1. Gene clustering workflow example
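The three annotation lookups of this example share the same input and have no dependency on each other, so a data-flow scheduler is free to run them concurrently. A minimal sketch of that idea, with hypothetical stand-in fetch functions in place of the remote database queries a real engine would issue:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for the three lookups of the gene clustering
# example; a real workflow would query remote biological databases.
def fetch_pathways(gene):  return f"{gene}:pathways"
def fetch_function(gene):  return f"{gene}:function"
def fetch_partners(gene):  return f"{gene}:partners"

def annotate(gene):
    # No data dependency between the three tasks: submit them all at
    # once and gather the results as they complete.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(f, gene)
                   for f in (fetch_pathways, fetch_function, fetch_partners)]
        return [f.result() for f in futures]

print(annotate("BRCA1"))  # ['BRCA1:pathways', 'BRCA1:function', 'BRCA1:partners']
```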
As in this example, scientific workflows are mainly defined by the data flow that represents them, because all the significant information must be explicitly defined. From this point of view, a scientific workflow is similar to a database query, and the coordination of its tasks is more flexible, allowing optimization mechanisms such as pipelining or parallelism. For instance, in the previously mentioned example, the collection of the signal transduction pathways, protein functions, and binding partners can be executed concurrently. An in-depth description of the different types of tasks involved in a bioinformatics workflow can be found in 21.

2.4. Workflow management systems requirements
Classic workflow management systems have to meet a number of requirements:
(1) express the definition of the workflow.
(2) execute the defined tasks with respect to the coordination properties defined in the workflow.
(3) ensure the availability of the results and effects.
(4) provide a secure multi-user environment, required in any enterprise-level software system.
However, the recent introduction of workflows in the field of scientific data management extends the list of requirements:
(5) provide data collection capabilities that let scientists exploit the wealth of available resources.
(6) offer powerful computing capabilities, matching the data-intensive tasks that are executed.
(7) ensure the traceability of data, and therefore the reproducibility of experiments, in a highly distributed and dynamic environment.
(8) facilitate the use of the system by ensuring a high degree of transparency between the design of a workflow and its implementation, automatically solving translation and interoperability issues between the different databases and tools accessed.
These different requirements can be translated into a set of system specifications that are expected when evaluating a given scientific workflow management system:
• The system interface must be "scientist-friendly" enough to facilitate the design of workflows by users who do not necessarily have programming skills (see requirement 1). Ideally, this interface clearly separates the design of a workflow (its semantics) from its implementation, by guiding the user from the first step to the second (see requirement 8).
• The execution of workflows (see requirement 2) has to be handled by an execution engine that is very scalable, in order to run these workflows on very large datasets (see requirements 5 and 6) and for large-scale organizations (see requirement 4).
• The data collection and computing capabilities (see requirements 5 and 6) demand access to a maximum of resources, including public databases and analysis tools.
• Finally, the availability of the results and the traceability of workflow execution (see requirements 7 and 3) necessitate the existence of a workflow repository (such as a database) to record, in addition to the workflow definitions, all the data resulting from their execution (metadata, intermediate and final results).
Table 1. Software evaluation criteria

Category: Description
Extensibility: System customization capabilities: add new data types, tools, or queries. Reuse previous queries (modularity capabilities) or tools.
Functionality: Support the queries that need to be executed.
Usability: Supply an appropriate user interface. Return the results in an appropriate format.
Understandability: Explain the meaning of the scientific queries and results.
Scalability: Handle the amount of data and the intended number of users.
Efficiency: Perform a query in a satisfactory time frame.
3. Evaluation criteria
The criteria used to evaluate a software package can be classified into six characteristics, and each can be considered from two perspectives 10: the implementation perspective and the user perspective. While the implementation perspective is more concerned with the technical details of the implementation, the user perspective tries to characterize a software package from an end-user point of view. Because the details of the implementation of the evaluated software are not necessarily available, we focus on the user perspective. We introduce evaluation criteria, sorted according to six characteristics, in Table 1. Considering the nature and the stage of development of these systems, the most important characteristics to assess are extensibility and functionality, as well as usability and understandability. Because many of the systems are academic, their scalability features are often undeveloped, as scalability is not a common preoccupation in academic software development. As these systems are in early development stages, efficiency is not a priority in evaluation, because the development of an optimization strategy usually comes with maturity.
3.1. Extensibility-Functionalities
A scientific workflow is composed of steps built from components that can either access databases or call tools and applications to analyze the data (see requirements 5 and 6). These scientific resources can be available as components of the system, and thus be internal functionalities, or be accessed as external functionalities, and
be categorized as extensibility features. The components, when integrated into the tool, could be seen as functionalities of the software, whereas the solutions provided to integrate external tools can be described as part of the software's extensibility characteristics. Still, this category of software does not permit making such a distinction. The components, even if packaged with the software, are often wrapped external tools such as BioPerl b or BioJava c. Therefore, they are not really specific to the workflow management system. As we will see later, neither can we separate them on the basis of the location of their execution (local or remote), because these platforms increasingly rely on distributed architectures that allow resources to be used seamlessly, independently of their location. A more appropriate distinction is to consider, on one side, the resources available by default in the software, without considering how specific they are, and, on the other side, the types of resources that can be integrated in any way to be used as components, i.e., the integration capabilities. The connection between different components raises issues of data transformation: the output of a workflow step is not necessarily in the right format to be a valid input for the following one. The ability to transform data from one format to another can also be included in this category, since it directly affects the extensibility of a system.

3.2. Usability
Usability can be defined as "the ease with which a user can learn to operate, prepare inputs for, and interpret outputs from a system or a component" 7. We can evaluate this criterion using four factors: the characteristics of the software interface, the existence and quality of user documentation and software support, its portability, and the level of technical knowledge required to operate it.

3.2.1.
Software interface
Interactions with a software system, whether from a user or from another system (any "actor" in the UML use-case diagram sense), are possible through interfaces. There are different types of interfaces, each one having its own advantages. Graphical user interfaces (GUI) use the graphical possibilities of windowing operating systems to offer users an intuitive interface that is usable without understanding complex text-mode syntaxes. The existence of such an interface is critical in the area of bioinformatics workflow management systems, because one of their requirements is to allow scientists to design complex workflows with as few programming skills as possible (as mentioned previously in requirement 1).

b For more information, see the BioPerl website: http://bio.perl.org/
c For more information, see the BioJava website: http://www.biojava.org/
Command line interfaces (CLI) rely only on textual inputs and outputs for interaction with the users. Though less intuitive, they allow users to easily automate their interaction with the software, using scripts processed by command line interpreters. In the field of bioinformatics, a single workflow may have to be run repeatedly, for example to test the results with different inputs, or to check resources for updates. Because it allows easy automation, such an interface is therefore an advantage. Application programming interfaces (API) allow interaction with other programs, by providing programming languages with functions or objects to communicate with the system. These interfaces are useful for automating processes, but also for any task integrating the system with other software.
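The kind of scripted repetition that a CLI or an API enables can be sketched as follows; `run_workflow` is a hypothetical stand-in for a real engine's entry point, not an actual function of any system evaluated here:

```python
# Hypothetical illustration of automating repeated workflow runs through
# an API: execute the same workflow over many inputs, collect the
# results, and separate out any failures for later inspection.
def run_workflow(name, gene):
    # Stand-in for a workflow engine call; always succeeds here.
    return {"workflow": name, "input": gene, "status": "done"}

genes = ["BRCA1", "TP53", "EGFR"]
results = [run_workflow("gene_clustering", g) for g in genes]
failed = [r for r in results if r["status"] != "done"]
print(len(results), len(failed))  # 3 0
```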
3.2.2. User documentation
The user documentation provides information about:
• How to use the software. For instance, how to design and execute a workflow, and how to use the results of this execution.
• How to maintain and calibrate it. For instance, if the system stores some information, the documentation should provide a description of the backup procedure.
• How to fix the system, in case of misuse or a system incident. In the latter case, a detailed description of the recovery procedures is a key element of the reliability of a system, as it lowers the MTTR d.
• How to extend the system, describing procedures to add new functionalities.
This list is not intended to be detailed and exhaustive, as the needs for documentation depend heavily on the system, its architecture, its use, and its functionalities. We would like to underline the fact that this feature is essential, even for the systems that achieve the most user-friendly interfaces and usage. This documentation can be printed and packaged with the software, or available electronically, i.e., on the internet. For our software category, such documentation should be as precise as possible, to allow users to learn quickly how to use the system.
3.2.3. Software support
Software support includes every means of communication at the disposal of a user to help them solve the problems they might encounter using the system. These means of communication are diverse, including e-mail, forums, chats, mailing lists,

d MTTR: Mean Time To Repair.
telephone, etc. Through these tools, the user can communicate with the authors, a dedicated support team, or a community of users.

3.2.4. Portability
Portability can be defined as "the ease with which a system or component can be transferred from one hardware or software environment to another" 7. Unlike closed environments such as classical corporate settings, where the computing environment is controllable, bioinformatics workflow management systems are used in a highly distributed environment. These environments can be extremely heterogeneous, which greatly emphasizes the need for portability.

3.2.5. Level of technical knowledge required
The usability of a software system results from the influence of many different factors. With a factor such as "level of technical knowledge required", we want to underline the fact that this particular kind of software is meant to be used by scientists, who should be able to operate it with minimal skills in computing and programming.

3.3. Understandability
Understandability is "the degree to which the purpose of a system or component is clear to the evaluator" 7. To assess it, we use three factors: the availability of data provenance information, the availability of process execution information, and the mechanisms by which the software handles faults in the execution of a workflow.

3.3.1. Data provenance information
Data provenance information includes all the intermediate results of workflow instances, combined with collection information including the name, version, and location of the resource that produced the data, the date and time of collection, and its mapping to the final result. Data sources in the life sciences have particular properties 4, one of them being that the data organizations (schemas) as well as the contents of the data sources are extremely dynamic.
Because of these characteristics, and the intrinsic instability of data (and programs) on the web, scientific resources evolve at a fast pace, affecting the reproducibility of results. If the results of an experiment cannot be reproduced, a scientist might at least want to be able to capture the reasons why. Gathering all the available information about workflow executions can help achieve this goal, explaining for instance that the final results of a workflow execution differ from those of its previous execution because the data collected from a remote database have changed, a situation that may occur when the data source was updated or curated between the two executions of the workflow.
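The collection information listed above can be captured in a small record attached to each intermediate result. The sketch below is hypothetical (all field and function names are ours): comparing such records across two runs is what lets a scientist pinpoint, say, a changed resource version as the cause of diverging results.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative provenance record: resource name, version, location,
# collection time, and the final result this piece of data maps to.
@dataclass(frozen=True)
class ProvenanceRecord:
    resource_name: str
    resource_version: str
    resource_location: str
    collected_at: str
    maps_to: str  # identifier of the final result this data feeds

def record_collection(name, version, location, maps_to):
    return ProvenanceRecord(name, version, location,
                            datetime.now(timezone.utc).isoformat(), maps_to)

rec = record_collection("GenBank", "release 150",
                        "https://www.ncbi.nlm.nih.gov", maps_to="cluster_7")
# A differing resource_version between two runs points at an update or
# curation of the source between the two executions.
```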
In summary, to compensate for data instability in the life sciences, data provenance information improves traceability. The integration of many different resources also raises the issue of intellectual property: increasingly seamless data integration tools cannot be used if users cannot access the data provenance information.

3.3.2. Process execution information
Process execution information is the ability of a system to give users information about the workflow instances currently being executed. It helps them monitor the execution process and alter it if necessary: for instance, if a workflow instance is executing too slowly, they can cancel it in order to optimize it further. We define process execution information as the capacity of a system to inform users about the execution status of a workflow.

3.3.3. Fault handling
Fault handling expresses the way the system reacts when a workflow instance behaves in an unexpected way. In this case, the system should both inform the user (which is part of the previously defined process execution information) and recover, by automatically taking the decision to cancel the workflow instance or to alter it. This alteration could be, for example, retrieving data from an alternate database in case the planned resource is not available. This automated recovery behavior is specified by the users during the design of a workflow.

3.4. Scalability
Scalability is "the ease with which a system or a component can be modified to fit the problem area" 7. It is generally used to designate the ability of a system to handle an increase in the information it computes, for instance through a larger group of users or larger datasets. To assess it, we estimate the level of support for multiple users and for workflow decomposition.

3.4.1. Support for multiple users
As a discovery process often involves large teams of people working together, the need for software that supports the sharing of information and tasks is important.
Hence, the ability to share workflow definitions or results between different users can be essential in this kind of software.

3.4.2. Support for workflow decomposition
We mentioned in the introduction that a scientific workflow can, just like an elementary scientific task, be described by its inputs and outputs. Some pieces of workflows are regular patterns that recur in several of them. The ability to design
such patterns once and use them as tasks in every scientific workflow that needs them is a feature that enhances reliability and encourages the use of the system to design increasingly complex protocols.
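A minimal sketch of this idea, with purely illustrative names: a sub-workflow exposes the same interface as a single step, so a pattern designed and validated once can be embedded as one task inside larger workflows.

```python
# Workflow decomposition sketch: SubWorkflow and Step share a run()
# interface, so a sub-workflow is usable wherever a step is expected.
class Step:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
    def run(self, data):
        return self.fn(data)

class SubWorkflow:
    """A named sequence of steps, reusable as a single step."""
    def __init__(self, name, steps):
        self.name, self.steps = name, steps
    def run(self, data):
        for step in self.steps:
            data = step.run(data)
        return data

# A reusable "clean then parse" pattern embedded in a larger workflow:
clean_parse = SubWorkflow("clean+parse", [Step("clean", str.strip),
                                          Step("parse", str.split)])
outer = SubWorkflow("outer", [clean_parse, Step("count", len)])
print(outer.run("  a b c  "))  # 3
```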
4. Results
4.1. Compared systems
Many scientific workflow systems are currently available, and they differ on many points. We chose to compare five of them, which we present now. General information about these systems is displayed in Table 2.
4.1.1. Taverna
Taverna 18 is a system that allows designing and executing workflows using web service components. It is open-source software that lets users integrate remote and local resources through an extensive collection of wrappers that access, for instance, web services or Java components. Although it can be used with any kind of resource, this software is particularly aimed at bioinformaticians, offering a large number of predefined biological resources. Taverna is a component of the myGrid project, funded by the EPSRC (Engineering and Physical Sciences Research Council).
4.1.2. JOpera
JOpera is developed by the Information and Communication Systems Research Group at the Swiss Federal Institute of Technology in Zürich. This tool allows the design and execution of workflows from components that can be web services, but also other types of software. Although JOpera is not explicitly aimed at bioinformatics, it is the successor to the BioOpera project, whose goal was to improve and automate the large-scale analysis of genetic data sets.
4.1.3. Kepler
Kepler 12 is a project arising from the collaboration of various institutions, including SEEK, the SDM Center/SPA, and others. This system aims at developing tools for scientific workflows, allowing their design and execution. It is based on Ptolemy II, a set of Java packages supporting heterogeneous, concurrent modeling and design.
4.1.4. Triana
Triana 5 is a project from Cardiff University. It allows users to build scientific workflows using a great variety of predefined tools or integrating new ones, and to run them.
4.1.5. Pipeline Pilot
Pipeline Pilot is a commercial software package published by Accelrys Inc. It is aimed specifically at drug discovery activities, and for this purpose provides many analysis tools for cheminformatics and bioinformatics, as well as wrappers to integrate different external tools.
Table 2. General information about the evaluated systems

Taverna
• Author: collaboration between EBI and others
• Academic / Commercial: Academic
• License type: LGPL
• Sources available: Yes
• Related to projects: myGrid
• Main URL: http://taverna.sourceforge.net
• Current version and release date: V1.2 (06/25/2005)
• Programming language: Java
• Workflow language: XML sublanguage: XSCUFL
• Use: Scientific

JOpera
• Author: IKS, ETH Zürich
• Academic / Commercial: Academic
• License type: JOpera License, © C. Pautasso
• Sources available: No
• Related to projects: None
• Main URL: http://www.iks.inf.ethz.ch/jopera
• Current version and release date: V1.71 (12/10/2004)
• Programming language: Java
• Workflow language: XML sublanguage: OML (Opera Modeling Language)
• Use: Scientific

Kepler
• Author: collaboration between SEEK, SDM Center/SPA, Ptolemy II, GEON, ROADNet, EOL
• Academic / Commercial: Academic
• License type: BSD-style
• Sources available: Yes
• Related to projects: Ptolemy II
• Main URL: http://kepler-project.org/
• Current version and release date: V1.0.0 alpha4 (12/09/2004)
• Programming language: Java
• Workflow language: XML sublanguage: MoML (Ptolemy II language)
• Use: Scientific

Triana
• Author: Cardiff University
• Academic / Commercial: Academic
• License type: Apache software license
• Sources available: Yes
• Related to projects: GridLab
• Main URL: http://www.trianacode.org/
• Current version and release date: v3.1.1 (06/07/2005)
• Programming language: Java
• Workflow language: XML sublanguage; can also import from others such as BPEL4WS
• Use: Scientific

Pipeline Pilot
• Author: SciTegic
• Academic / Commercial: Commercial
• License type: Commercial
• Sources available: No
• Related to projects: None
• Main URL: http://www.scitegic.com/products services/pipeline pilot.htm
• Current version and release date: v4.0 (03/31/2004)
• Programming language: ?
• Workflow language: ?
• Use: Bioinformatics, Chemoinformatics
Software interface Command Line Interface Application Programming Interface Documentation existence and quality No
Yes
Online documentation
No
Yes
Online User Guide Online and included documentation + tutorials
Usability Desktop interface
Computer science skills needed
Use of split/merge operators
Java Snippets. Call of Java programs
Desktop interface
Computer science skills needed
Java based (Beanshell scripting, API consumer, Local Java) XSLT or XPath components
Online documentation
No
No
Desktop interface
Python, Matlab, Command Line, Grid, HTML, ROADNet sensors Data transformation actors (XSLT, XQuery, Perl, etc.) Computer science skills needed
User Guide Available in pdf and included in the software
No
Yes
Desktop interface
Computer science skills needed
WS, Grid Services, Text Files, Image Files, Audio Files Java components
Triana
Workflows can be published as web services ?
Desktop interface + web interface No
Computer science skills needed
Use of “Utilities” components
WS, ODBC, Excel, SD files, Molecular Compliant Databases Perl or Java components
Pipeline-Pilot
13:11 WSPC/INSTRUCTION FILE
User documentation
Software interface
Level of technical knowledge required
Data transformation capabilities
lan-
Available protocols
Resources integration characteristics
Jopera Kepler Extensibility-Functionalities based on SOAP, WS, JDBC, Post- WS, JDBC, Grid REST, or JDBC GreSQL
Taverna
14
Available guages
Criteria
Characteristic
November 6, 2005 workflows˙evaluation
Zo´ e Lacroix and Herv´ e M´ enager
Characteristic: Understandability

  Data provenance:
    JOpera: Yes
    Taverna: Yes
    Pipeline-Pilot: Yes
    Kepler, Triana: No (can be achieved by adding tasks that save intermediate results)

  Process execution information (intermediate results visible; intermediate results saved; information about the execution status, i.e., which steps are being executed):
    implemented in all evaluated systems (in one case, notification by e-mail)

  Faults handling (information about exceptions in the workflow: step and root cause; recoverability):
    implemented in all evaluated systems

Characteristic: Software support

  Electronic support:
    JOpera: e-mail
    Kepler: mailing lists, IRC channel
    Taverna: mailing lists
    Triana: e-mail, mailing lists
    Pipeline-Pilot: e-mail

  Phone line:
    Pipeline-Pilot: Yes; all others: No

Characteristic: Portability

  Platforms supported by the client application:
    JOpera: Windows 2000 or XP
    Kepler: Windows, Mac OS X, Linux
    Taverna: Windows, Linux, Mac OS X
    Triana: Windows, Mac OS X, Unix
    Pipeline-Pilot: Windows, Linux

Characteristic: Data visualization

  Data types supported:
    JOpera: text
    Kepler: HTML
    Taverna: any (visualization plug-ins can be added)
    Triana: text (Graph Editor)
    Pipeline-Pilot: any (uses visualizers for specific types, such as sequences)
Table 3: Evaluation criteria values collected for the different workflow management systems.

Characteristic: Support for multiple users

  Workgroup possibilities offered by the software:
    JOpera: none
    Kepler: none
    Taverna: none, but MIR allows these capabilities
    Triana: none
    Pipeline-Pilot: Yes

Characteristic: Product architecture

    JOpera: standalone application for design and runtime, plus a server for the execution engine
    Kepler, Taverna, Triana: stand-alone application
    Pipeline-Pilot: client for design and runtime; server for the execution engine; web interface for runtime; web services interface for runtime

Characteristic: Support for workflow decomposition

  Workflow "sub-units" definition possibilities:
    JOpera: Yes
    Kepler: Yes (workflows can be abstracted as steps)
    Pipeline-Pilot: Yes
5. Discussion

5.1. Extensibility / Characteristics

All the systems we described include mechanisms that aim to make the integration of new resources as easy as possible. This common characteristic can be read as an acknowledgment of the dynamic nature of data in this domain: one way to compensate for frequent changes in the resources is to facilitate their integration. Looking more closely at the types of resources that can be integrated, although many kinds of interfaces are supported, the most widely offered is web services technology. It is not the most efficient, because it adds a significant communication overhead by using a verbose XML-based protocol. However, its numerous advantages, such as ease of integration, wide adoption, and a firewall-friendly protocol, outweigh this inconvenience, especially since, when using resources such as publicly available databases over the Internet, the limiting factor is more likely to be the resource's performance than the communication itself. Data format transformation and extraction remain a weak point in these systems, as they often rely on scripting mechanisms and only occasionally on XSLT or XQuery capabilities. The optimal transformation mechanism between different data formats remains an open problem.
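To make the extraction side of this problem concrete, the following is a minimal Python sketch of the kind of field extraction that these systems delegate to XPath-style components or scripting; the <sequenceRecord> format and its values are invented for illustration only.

```python
# Minimal sketch of a data-extraction step, of the kind workflow systems
# delegate to XSLT/XPath components or ad hoc scripts. The record format
# below is hypothetical, standing in for whatever XML a service returns.
import xml.etree.ElementTree as ET

RECORD = """
<sequenceRecord>
  <accession>X01234</accession>
  <organism>Escherichia coli</organism>
  <sequence>ATGACCATGATTACG</sequence>
</sequenceRecord>
"""

def extract_fields(xml_text):
    """Return (accession, sequence length) from one XML record."""
    root = ET.fromstring(xml_text)
    accession = root.findtext("accession")
    sequence = root.findtext("sequence")
    return accession, len(sequence)

print(extract_fields(RECORD))  # ('X01234', 15)
```

A real transformation step would additionally re-serialize the extracted fields into the format expected by the next task, which is exactly where the scripting burden discussed above arises.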
5.2. Usability

The software interfaces provided by the evaluated systems generally emphasize human interaction, providing a graphical user interface that lets users design and run workflows easily. However, it is also important that such software offer an interface that allows the execution of workflows to be automated, whether a command line interface (as in Triana or Taverna) or an application programming interface (as in Taverna and Pipeline-Pilot). Pipeline-Pilot's approach is interesting: it allows workflows to be published as web services, and therefore lets workflows be defined from reusable building blocks. This approach is close to what is proposed in the Web Services Business Process Execution Language (WSBPEL) 3 or the Web Ontology Language for Web Services (OWL-S) 15, which also extend reusability by defining composite processes as web services. Data visualization is in many cases handled by distinct software, in a plug-in approach such as that of Taverna or Pipeline-Pilot; this adds to the extensibility in this category. The portability of these systems is generally good, allowing them to run on most existing platforms. This feature is especially important in the scientific community, where many different platforms are in use; portability is facilitated by the fact that most of these systems are developed in Java.
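The "reusable building block" idea behind publishing workflows as services can be sketched in a few lines of Python; this illustrates only the composition principle, not the actual API of any evaluated system, and the task functions are hypothetical stand-ins.

```python
# Sketch of workflows as reusable building blocks: a workflow is a
# composition of steps, and a composed workflow can itself serve as a
# step in a larger one. Illustration only; not any system's real API.

def fetch(acc):                # stand-in for a data-retrieval task
    return f">{acc}\nATGACC"

def count_adenine(fasta):      # stand-in for an analysis task
    return fasta.count("A")

def compose(*steps):
    """Chain steps into a single callable workflow."""
    def workflow(data):
        for step in steps:
            data = step(data)
        return data
    return workflow

count_a = compose(fetch, count_adenine)  # a workflow...
pipeline = compose(count_a, str)         # ...reused as a building block
print(pipeline("X01234"))                # prints "2"
```

Publishing `count_a` as a web service, as Pipeline-Pilot allows, makes the same composition possible across system boundaries rather than within one process.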
5.3. Understandability

Data provenance information is provided by at least three of these systems: Taverna, JOpera, and Pipeline Pilot. In the others, this information can be collected by specifying additional tasks that save these results (e.g., using "reporting actors" in Kepler 2). However, given the importance of these data for results accountability, we believe this collection should not have to be an explicit part of a workflow: results should be recorded automatically. Process execution information and faults handling are also implemented, sometimes even giving the user the ability to define fault-tolerance behaviors for some tasks. For instance, Taverna lets users define alternate implementations for potentially failing tasks. This is a valuable feature in an environment where the execution of a workflow can be long and relies on externally controlled resources.

5.4. Scalability

Support for multiple users is generally not implemented, except in Pipeline Pilot, the only commercial system included in this evaluation. However, the myGrid architecture, of which Taverna is part, offers similar capabilities, for instance with the myGrid Information Repository (MIR) e, which provides users with workgroup capabilities, allowing them to share experiments and data. The product architecture is sometimes client-server based, separating the execution engine from the user interface, as in JOpera and Pipeline Pilot. Triana is based on a much more ambitious architecture that allows the execution of workflows to be distributed among different servers. Both Triana and Kepler are also able to use Grid-oriented protocols to execute some tasks. Support for workflow decomposition exists in many of these systems, allowing previously defined workflows and workflow components to be reused.
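The two behaviors discussed in Section 5.3, automatic provenance capture and alternate implementations for failing tasks, can be sketched generically in Python; the task and service names below are hypothetical, and this is an illustration of the idea, not the actual mechanism of Taverna, Kepler, or JOpera.

```python
# Generic sketch: record provenance for every task attempt, and fall
# back to an alternate implementation when the primary one fails.
# Illustration only; not any evaluated system's actual mechanism.
import time

provenance_log = []

def run_task(name, primary, alternates=()):
    """Run `primary`; on failure try each alternate. Log every attempt."""
    for impl in (primary, *alternates):
        try:
            result = impl()
            provenance_log.append((time.time(), name, impl.__name__, "ok"))
            return result
        except Exception as exc:
            provenance_log.append((time.time(), name, impl.__name__, repr(exc)))
    raise RuntimeError(f"all implementations of {name} failed")

def flaky_service():           # hypothetical primary resource
    raise ConnectionError("remote resource unavailable")

def mirror_service():          # hypothetical alternate resource
    return "ATGACC"

seq = run_task("fetch_sequence", flaky_service, alternates=(mirror_service,))
print(seq)                     # the alternate's result: ATGACC
print(len(provenance_log))     # both attempts were recorded: 2
```

Because the wrapper, not the workflow definition, does the logging, provenance is captured without the designer adding explicit reporting tasks, which is the behavior argued for above.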
Not only does this characteristic increase productivity, it also allows workflows to be optimally factored and eases their evolution, in the same way that well-designed source code eases software maintenance.

6. Conclusion and Future Work

By setting up a list of evaluation criteria for bioinformatics workflow management systems, this paper emphasizes the need for methodologies to assess the characteristics of such systems. There is also a need to define a benchmarking method for this category of software: the vast quantities of data that must be manipulated in bioinformatics underscore the need for efficient systems, and performance characteristics should therefore be a valuable factor in choosing among them. However, because the use of such systems in bioinformatics is recent, their characteristics and use should evolve, and so should the criteria used to assess them. Furthermore, the results of this evaluation highlight some common characteristics of scientific workflow management systems:

• Scientific workflows are mostly data-oriented, the coordination of the tasks depending mostly on the availability of their inputs. The evaluated systems reflect this property, as the design phase of workflows is based on connecting the inputs and outputs of the tasks.
• Some protocols are widely implemented to overcome the issues raised in accessing remote resources. This is the case for web services, which can be used in all the systems.

Other characteristics, however, distinguish different approaches:

• The level of the approach to workflow design divides the systems into two categories. Whereas Taverna can track data provenance without it being specified at the workflow design stage, Kepler requires adding an actor that handles this function. The latter can be qualified as a "lower-level" approach, giving more flexibility to the system but increasing the effort required to build workflows.
• The level of complexity of the systems varies considerably, depending mostly on the type of workflows they handle. For instance, Taverna is directed more specifically at bioinformatics data and manages mainly text-based data, whereas Triana and Kepler can manage more complex digital data to handle signal-processing workflows; consequently, these latter systems are more complex.

The different qualities of these systems can result from difficult tradeoffs between, for example, usability and functionality. Scientists should therefore choose the system to adopt with care, depending on their specific needs, as the right choice will lead to improved productivity and faster discoveries.

e More information about MIR is available at http://www.mygrid.org.uk/index.php?module=pagemaster&PAGE user op=view page&PAGE id=47&MMN position=55:51:52

References

1.
Special section on scientific workflows.
2. I. Altintas, A. Birnbaum, K. Baldridge, W. Sudholt, M. Miller, C. Amoreira, Y. Potier, and B. Ludäscher. A Framework for the Design and Reuse of Grid Workflows, 2005. To be published.
3. A. Arkin, S. Askary, B. Bloch, F. Curbera, Y. Goland, N. Kartha, C. K. Liu, S. Thatte, P. Yendluri, and A. Yiu. Web Services Business Process Execution Language, Working Draft, Feb. 2005. http://www.oasis-open.org/committees/download.php/11601/wsbpel-specification-draft-022705.htm
4. S. Y. Chung and J. C. Wooley. Challenges Faced in the Integration of Biological Information, chapter 2, pages 11–34. Volume 1 of Lacroix and Critchlow 9, 2003.
5. D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang. Programming Scientific and Distributed Workflow with Triana Services. Grid Workflow 2004 Special Issue of Concurrency and Computation: Practice and Experience, 2005. To be published.
6. Workflow Management Coalition. Workflow Management Coalition Terminology and Glossary, Feb. 1999. http://www.wfmc.org/standards/docs/TC-1011 term glossary v3.pdf
7. A. Geraci, F. Katki, L. McMonegal, B. Meyer, J. Lane, P. Wilson, J. Radatz, M. Yee, H. Porteous, and F. Springsteel. IEEE Standard Computer Dictionary: Compilation of IEEE Standard Computer Glossaries. The Institute of Electrical and Electronics Engineers, Inc., 1991.
8. S. Hastings, M. Ribeiro, S. Langella, S. Oster, U. Catalyurek, T. Pan, K. Huang, R. Ferreira, J. Saltz, and T. Kurc. XML database support for distributed execution of data-intensive scientific workflows. In SIGMOD Rec. 1, pages 50–55.
9. Z. Lacroix and T. Critchlow, editors. Bioinformatics: Managing Scientific Data, volume 1. Morgan Kaufmann Publishing, 2003.
10. Z. Lacroix and T. Critchlow. Compared Evaluation of Scientific Data Management Systems, chapter 13, pages 371–391. Volume 1 of 9, 2003.
11. A. E. Lawson. The nature and development of hypothetico-predictive argumentation with implications for science teaching. International Journal of Science Education, 25:1387–1408, Nov. 2003.
12. B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the KEPLER system. Concurrency and Computation: Practice and Experience, Special Issue on Scientific Workflows, 2005.
13. B. Ludäscher and C. Goble. Guest editors' introduction to the special section on scientific workflows. In SIGMOD Rec. 1, pages 3–4.
14. P. Maechling, H. Chalupsky, M. Dougherty, E. Deelman, Y. Gil, S. Gullapalli, V. Gupta, C. Kesselman, J. Kim, G. Mehta, B. Mendenhall, T. Russ, G. Singh, M. Spraragen, G. Staples, and K. Vahi. Simplifying construction of complex workflows for non-expert users of the Southern California Earthquake Center community modeling environment. In SIGMOD Rec. 1, pages 24–30.
15. D. Martin, M. Burstein, J. Hobbs, O. Lassila, D. McDermott, S. McIlraith, S. Narayanan, M. Paolucci, B. Parsia, T. Payne, E. Sirin, N. Srinivasan, and K. Sycara. OWL-S: Semantic Markup for Web Services. W3C Working Draft, Dec. 2004. http://www.daml.org/services/owl-s/1.1/overview/
16. T. M. McPhillips and S. Bowers. An approach for pipelining nested collections in scientific workflows. In SIGMOD Rec. 1, pages 12–17.
17. C. B. Medeiros, J. Perez-Alcazar, L. Digiampietri, J. G. Z. Pastorello, A. Santanche, R. S. Torres, E. Madeira, and E. Bacarin. WOODSS and the web: annotating and reusing scientific workflows. In SIGMOD Rec. 1, pages 18–23.
18. T. M. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, R. M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.
19. S. Shankar, A. Kini, D. J. DeWitt, and J. Naughton. Integrating databases and workflow systems. In SIGMOD Rec. 1, pages 5–11.
20. Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. In SIGMOD Rec. 1, pages 31–36.
21. R. Stevens, C. Goble, P. Baker, and A. Brass. A Classification of Tasks in Bioinformatics. Bioinformatics, 17(2):180–188, 2001.
22. M. Wieczorek, R. Prodan, and T. Fahringer. Scheduling of scientific workflows in the ASKALON Grid environment. In SIGMOD Rec. 1, pages 56–62.
23. J. Yu and R. Buyya. A taxonomy of scientific workflow systems for grid computing. In SIGMOD Rec. 1, pages 44–49.
24. Y. Zhao, J. Dobson, I. Foster, L. Moreau, and M. Wilde. A notation and system for expressing and executing cleanly typed workflows on messy scientific data. In SIGMOD Rec. 1, pages 37–43.