OrthoSearch: A Scientific Workflow Approach to Detect Distant Homologies on Protozoans

Sergio Manuel Serra da Cruz1, Vanessa Batista1, Alberto M. R. Dávila2, Edno Silva1, Frederico Tosta1, Clarissa Vilela1, Maria Luiza M. Campos3, Rafael Cuadrat2, Diogo Tschoeke1, Marta Mattoso1

1 - PESC/COPPE – UFRJ
2 - Oswaldo Cruz Institute – FIOCRUZ
3 - PPGI/IM/NCE – UFRJ

{marta,serra}@cos.ufrj.br
[email protected]
[email protected]
ABSTRACT
Managing bioinformatics experiments is challenging due to the orchestration and interoperation of tools and the semantics of the experiment. An effective approach for managing those experiments is through workflow management systems (WfMS). We present several WfMS features for supporting genome homology workflows and discuss relevant issues for typical genomic experiments. In our evaluation we used OrthoSearch, a real genomic pipeline originally defined as a Perl script. We modeled it as a scientific workflow and implemented it on the Kepler WfMS. We show a case study detecting distant homologies on trypanosomatid metabolic pathways. Our results reinforce the benefits of WfMS over script languages and point out challenges for WfMS in distributed environments.

Categories and Subject Descriptors
H.4.1 [Information Systems]: information systems applications – experiment automation, workflow management.

General Terms
Management, Measurement, Experimentation.

Keywords
Scientific Workflows, Provenance, Bioinformatics.

1. INTRODUCTION
Bioinformatics experiments are typically composed of programs arranged in pipelines that manipulate large quantities of data. These experiments are usually built by manually composing third-party programs with their input and output data in an execution flow. Output data is then analyzed and, according to the results, parameters are tuned, programs are replaced, and the pipeline is fully or partially re-executed. Currently, several bioinformatics pipelines are modeled and executed through Perl and BioPerl scripts, which are simple and efficient but hard-coded and difficult to manage. Scientific workflows represent an attractive alternative to describe bioinformatics experiments. Ideally, scientists should be able to configure their own bioinformatics workflows by dynamically combining programs provided by different teams, finding alternative programs to choose from, tuning workflow programs, and running parts of the workflow.

Workflow Management Systems (WfMS) are automated coordination engines that manage workflow specification, instantiation, execution, auditing and evolution. The main advantage of a WfMS is to combine a workflow execution engine with semantic support to register and help analyze workflow executions and re-executions. Scientific WfMS aim at giving users the flexibility to define, execute and manage their experiments through workflow management tools. Particularly in bioinformatics, it is important to: (i) define and design the workflow through a user-friendly interface, taking advantage of component reuse; (ii) execute the workflow in an efficient and yet flexible way through monitoring, steering and user interference; (iii) handle failures, retaining data integrity and making a reasonable 'best effort' to proceed with the invocations; (iv) access, store and manage data using a DBMS and flexible data modeling with ontology support; and (v) track provenance of data and services to add semantics to workflow execution. Complementing those five items, issues such as compatibility with distributed and grid middleware, open source code, the prospect of a large community of users, and long-term software support are also important and desirable in typical bioinformatics projects.
In this paper, we present our evaluation of modeling OrthoSearch (Orthologous Gene Searcher), a genomic workflow, and managing it on the Kepler WfMS [1][2][4]. From this experience, a set of requirements was analyzed to be used as guidelines when choosing a WfMS to model and manage bioinformatics pipelines. Since these pipelines often involve time-consuming programs that run on clusters or grids, distributed workflow execution support was also analyzed. In addition, we propose and implement some improvements to the Kepler system.
The rest of the paper is organized as follows. In section 2 we present WfMS concepts and discuss distributed environment issues. In section 3, we describe the OrthoSearch pipeline. In section 4, we evaluate it represented as a scientific workflow, presenting the pros and cons of using Kepler. In section 5, we present some improvements added to the Kepler WfMS. Finally, in section 6, we conclude this work.
2. WfMS IN GENOMIC PIPELINES
Scientific workflows are usually related to the automation of scientific processes in which scientific programs are associated based on data and control dependencies. Workflows of scientific applications in distributed environments, such as grids, use multiple nodes to accomplish computations that would be time-consuming or impossible to achieve on a single node. They need to cope with a large number of jobs, monitor and control task execution, including support for ad-hoc changes, and execute in an environment where resources are not known a priori. According to Krauter et al. [6] and Zanikolas and Sakellariou [13], a scientific WfMS in distributed environments offers advantages such as: the ability to build dynamic applications which orchestrate distributed resources; utilization of resources located at a particular site to increase throughput or decrease execution costs; execution spanning multiple sites in order to obtain specific processing capabilities; and integration of working groups involved with the management of different parts of a given workflow, promoting cross-organizational collaborations. For these reasons, scientific WfMS in distributed environments constitute a good alternative to face bioinformatics challenges [1].

Yu and Buyya [12] proposed a taxonomy containing a set of requirements which helped us choose a scientific WfMS to run OrthoSearch on a distributed environment. This set is composed of the five requirements briefly described in the introduction. Yu and Buyya's taxonomy is based on the first four categories: (i) design and definition, (ii) execution control, (iii) fault tolerance, and (iv) intermediate data management. Simmhan, Plale and Gannon [10] complement this taxonomy, highlighting the importance of provenance in distributed environments and considering it a fifth requirement. The design and definition category comprises the workflow structure, the workflow model and the workflow composition language. The workflow structure refers to workflow patterns, indicating the temporal relationship between tasks. The workflow model is classified as abstract (the workflow is specified without mapping to execution resources) or concrete (workflow tasks are bound to specific execution resources).
The execution control category refers to the architecture and decision making. The way tasks are scheduled is very important for scalability, autonomy, quality and system performance. The intermediate data management category concerns WfMS support for large amounts of data. For instance, input files need to be staged at a remote site before tasks are processed, and output files may be required by child tasks processed on other resources. Centralized and mediated automatic data movement is important for bioinformatics workflows, because monitoring and browsing intermediate results is a common task for e-scientists.

Provenance issues still constitute an open problem and can be described according to the domain where they are applied [2][3]. Provenance in WfMS is related to data and process provenance, including what to describe, how to represent and store provenance, and the ways to disseminate it. Considering all issues discussed in this section, choosing the ideal system involves a complex analysis that includes technical and practical aspects.

3. ORTHOSEARCH PIPELINE
OrthoSearch was designed to detect distant homologies on protozoans. Originally implemented as a local script-based Perl pipeline, it has been used in the BiowebDB consortium (www.biowebdb.org).

3.1 OrthoSearch and the neglected diseases
Neglected diseases have recently attracted special attention from the scientific community, and a global-scale task force called DNDi was created to fight them. These are a group of diseases caused by different protozoan and worm parasites, affecting mainly low-income populations in developing regions of Asia, Africa, and the Americas. For instance, Trypanosoma cruzi, Trypanosoma brucei, and Leishmania major (the so-called Tritryps) collectively cause disease and death in millions of humans and countless infections in other mammals. Little research on vaccine and drug development for the diseases caused by those pathogens has been carried out compared to other human diseases such as cancer or AIDS. Our in-silico study encompasses two other protozoan parasites: Plasmodium falciparum and Entamoeba histolytica. The challenge for the post-genomic era is to use computational methods to exploit this wealth of information, gaining deeper insights into the biology of those five organisms, with the ultimate goal of future development of new therapeutic tools for the control of these devastating diseases.
3.2 OrthoSearch design considerations
The OrthoSearch pipeline was conceived to facilitate the tasks of searching, analyzing and presenting distant homologies on COGs (http://www.ncbi.nlm.nih.gov/COG/) of the five parasites mentioned in the previous subsection. The OrthoSearch pipeline, presented in Figure 1, is not limited to these five genomes. Ptn DB is a local repository of proteins and mRNA obtained from Genbank for each protozoan. COGs DB is a local repository of ortholog group files (CO-Genes) previously downloaded from NCBI. The repository stores orthologous genes from several metabolic pathways, including: energy production and conversion; amino acid transport and metabolism; and nucleotide transport and metabolism. These three categories were chosen because they have a central role in protozoan biochemistry. MAFFT optimizes protein alignments based on physical properties of the amino acids. We decided to use this approach because it uses progressive alignment followed by refinement, which reduces CPU time when compared to other existing methods. OrthoSearch uses some HMMER package programs (represented as dashed lines in Figure 1), namely hmmbuild, hmmcalibrate, hmmsearch and hmmpfam. The matching of best reciprocal hits was implemented as a Perl script that outputs the GI numbers of the proteins, and we only select hits with a maximum e-value of 0.1 in order to search for distant homologies.
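The matching step itself is a Perl script; purely as an illustration of its logic, the Java sketch below reads two hypothetical tab-separated hit reports (query GI, target GI, e-value), applies the 0.1 e-value cutoff and keeps only reciprocal best hits. The file names and report format are assumptions, not the pipeline's actual files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the real matching step is a Perl script in the original pipeline.
public class BestReciprocalHits {

    // Reads "query<TAB>target<TAB>evalue" lines and keeps, per query,
    // the target with the lowest e-value not exceeding the cutoff.
    static Map<String, String> bestHits(String file, double cutoff) throws IOException {
        Map<String, String> best = new HashMap<>();
        Map<String, Double> bestScore = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(file))) {
            String[] f = line.trim().split("\\t");
            if (f.length < 3) continue;
            double e = Double.parseDouble(f[2]);
            if (e > cutoff) continue;                       // e-value filter (0.1 in OrthoSearch)
            if (!bestScore.containsKey(f[0]) || e < bestScore.get(f[0])) {
                best.put(f[0], f[1]);
                bestScore.put(f[0], e);
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names: forward = COG model vs proteome, reverse = protein vs COG models.
        Map<String, String> forward = bestHits("cog_vs_proteome.tab", 0.1);
        Map<String, String> reverse = bestHits("proteome_vs_cog.tab", 0.1);
        for (Map.Entry<String, String> hit : forward.entrySet()) {
            // A pair is reciprocal when each partner is the other's best hit.
            if (hit.getKey().equals(reverse.get(hit.getValue()))) {
                System.out.println(hit.getKey() + "\t" + hit.getValue());
            }
        }
    }
}
```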
BLAST programs (also represented in Figure 1 as dashed lines) are widely used for searching DNA and protein databases for sequence similarity in order to identify homologs of a query sequence. We used formatdb to create a local BLAST-formatted database for each protozoan obtained from Ptn DB, and fastacmd to generate a fasta file from the recently created BLAST-formatted databases. After finding the best reciprocal hits we used InterPro to confirm the annotated sequences. Finally, OrthoSearch outputs a set of newly re-annotated genes from the five protozoans.
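The sketch below illustrates how such calls can be wrapped as external processes from Java; the fasta file, database name and GI number are placeholders, and the -i/-p/-o and -d/-s/-o options are the standard flags of these legacy NCBI tools as we understand them, not commands reproduced from the pipeline.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Sketch of wrapping the legacy NCBI tools used by OrthoSearch as external processes.
// Paths, database names and the GI number below are placeholders.
public class BlastDbHelper {

    static void run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .inheritIO()          // forward stdout/stderr so the e-scientist can monitor the call
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("Command failed: " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        // formatdb: build a protein (-p T) BLAST database from a fasta file, parsing SeqIds (-o T).
        run(Arrays.asList("formatdb", "-i", "t_cruzi_proteins.fasta", "-p", "T", "-o", "T"));

        // fastacmd: retrieve one sequence by GI from the freshly formatted database.
        run(Arrays.asList("fastacmd", "-d", "t_cruzi_proteins.fasta",
                          "-s", "71654321", "-o", "candidate.fasta"));
    }
}
```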
Although the OrthoSearch implementation through Perl scripts brings many benefits, the pipeline does not cope with distributed applications, nor does it meet some requirements of scientific workflows, as discussed in Section 2. Currently, it lacks flexibility in its design, with no abstract definition and a composition language that is hard-coded as typical Perl scripts. Its scheduling strategy is based on a centralized approach and the decision making is user-defined. The script-based version of OrthoSearch offers no execution monitoring and poor data provenance support.
Figure 1. OrthoSearch pipeline.

4. ORTHOSEARCH ON KEPLER WfMS
To evaluate OrthoSearch as a workflow managed by a WfMS, we implemented the same sequence of programs on a distributed environment based on Fedora operating system servers. Our evaluation with Kepler addresses the requirements presented in Section 2, i.e.: (i) workflow definition and design; (ii) workflow execution control; (iii) fault tolerance; (iv) intermediate data management; and (v) data provenance support.
4.1 Choosing Kepler
The Kepler WfMS is an active open source Java cross-project and cross-institutional collaboration that runs on top of Ptolemy II [12]. It supports applications ranging from small local programs to parallel programs running on clusters and grids [3]. Abstract workflows are not supported by Kepler; the editor maps actors directly to concrete workflows. Concrete workflows are modeled in MoML (an XML markup language), allowing the specification of processing units (tasks), data transfer/transformation and execution. Kepler can handle local applications, web and grid services. Its scheduling strategy is based on a centralized enactment environment and the decision making is user-defined. Kepler does not yet automatically support fault tolerance at the task and workflow levels, so e-scientists have to code user-defined exception handling triggers manually. Finally, it supports centralized automatic data movement. Such an approach, despite being easier to implement, has some drawbacks, like high transmission times when handling large amounts of data.

Our evaluation showed that Kepler is more appropriate to accommodate protozoan workflows than the script version of OrthoSearch. The Kepler execution engine is flexible, quite stable, and its grid-enabled features are not easily found in other WfMS. In addition, to run the OrthoSearch workflow on a distributed environment we had to address some complexities, such as management of credentials and permissions, interaction with schedulers and, more specifically, installation and deployment of bioinformatics packages on remote servers. Kepler also presents some limitations, discussed as follows. We present our evaluation of OrthoSearch following the five requirements previously stated. We close this section describing some improvements we have implemented on the Kepler WfMS.
4.2 Workflow definition and design
To define the OrthoSearch workflow, we used the same bioinformatics programs described in section 3, exposed to Kepler as distributed applications. Although the OrthoSearch workflow was already defined in Perl scripts and poorly documented as block diagrams, we noticed ambiguities in these definitions and a lack of explicit execution control. In fact, we had to interact with the previous workflow designers to clarify the control flow in order to model it successfully. We should mention that the biology designers of OrthoSearch were quite happy to see a clear, graphical, abstract workflow definition of their previous pipeline. The workflow shown in Figure 2 was designed as a hierarchy of sub-workflows, since Kepler's composite actors are an important mechanism for representation, privacy and reuse between workflows. We defined four composite actors (MAFFT/HMMER, FormatDB, FastaCmd and InterPRO) and implemented a single one (FindBestHits). Our experiment evidenced the advantage of this approach, as the users felt more confident with Kepler's workflow representation than with their previous tools (i.e., scripts and draft diagrams).

The MAFFT/HMMER composite actor encompasses the MAFFT and HMMER packages and performs an important role: due to some shortcomings of Kepler's directors, we had to use a Ramp actor to control an inner loop. It iterates over COGs DB and invokes two other internal composite actors, HmmSearch and HmmPfam. The first actor runs remote instances of MAFFT and of the HMMER applications hmmbuild, hmmcalibrate and hmmsearch. The second actor searches one or more sequences against an HMM database through the hmmpfam application. FindBestHits is a simple home-made Java actor. It is responsible for finding the best reciprocal hits produced by the searches resulting from the HmmSearch and HmmPfam composite actors. It outputs a list of candidate GIs that is consumed by the FastaCmd composite actor. FormatDB is in charge of creating the local parasite database (PtnDB) and then submitting it to the FastaCmd composite actor, which generates the subset of sequences to be validated by the InterPRO actor.
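To give an idea of what such a home-made actor involves, the following is a minimal Ptolemy II atomic-actor skeleton for a FindBestHits-like component, assuming string ports that carry report file names; it is an illustrative outline, not the actual OrthoSearch actor.

```java
import ptolemy.actor.TypedAtomicActor;
import ptolemy.actor.TypedIOPort;
import ptolemy.data.StringToken;
import ptolemy.data.type.BaseType;
import ptolemy.kernel.CompositeEntity;
import ptolemy.kernel.util.IllegalActionException;
import ptolemy.kernel.util.NameDuplicationException;

// Minimal sketch of a Kepler/Ptolemy II atomic actor; the real FindBestHits logic
// (parsing hmmsearch/hmmpfam reports and keeping reciprocal hits) is omitted here.
public class FindBestHits extends TypedAtomicActor {

    public TypedIOPort hmmsearchReport;  // file name of the hmmsearch result
    public TypedIOPort hmmpfamReport;    // file name of the hmmpfam result
    public TypedIOPort giList;           // output: GI numbers of reciprocal best hits

    public FindBestHits(CompositeEntity container, String name)
            throws IllegalActionException, NameDuplicationException {
        super(container, name);
        hmmsearchReport = new TypedIOPort(this, "hmmsearchReport", true, false);
        hmmsearchReport.setTypeEquals(BaseType.STRING);
        hmmpfamReport = new TypedIOPort(this, "hmmpfamReport", true, false);
        hmmpfamReport.setTypeEquals(BaseType.STRING);
        giList = new TypedIOPort(this, "giList", false, true);
        giList.setTypeEquals(BaseType.STRING);
    }

    @Override
    public void fire() throws IllegalActionException {
        super.fire();
        String forward = ((StringToken) hmmsearchReport.get(0)).stringValue();
        String reverse = ((StringToken) hmmpfamReport.get(0)).stringValue();
        // Placeholder: a real actor would parse both reports and intersect the hits.
        String reciprocalHits = match(forward, reverse);
        giList.send(0, new StringToken(reciprocalHits));
    }

    private String match(String forwardFile, String reverseFile) {
        return "";  // omitted for brevity
    }
}
```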
We highlight the advantages of designing OrthoSearch on Kepler. For instance, it acts as a graph-based tool for rapid prototyping. This approach helps reduce costly experimental re-implementation cycles, representing a high-level picture of the pipeline separated from low-level details. Depending on the nature of the experiment, data transformations must be implemented either with generic Kepler actors or with home-made actors. So far, Kepler provides only a few generic transformation actors. Writing actors in Java is not a trivial task, since it requires some knowledge of the Ptolemy II API and of the Java language itself. We also have to point out the difficulty of reusing an actor outside the Ptolemy context. It would be important to have a more abstract representation of actors, ideally not specific to the Ptolemy API.

MoML is also an advantage, as it represents e-scientists' workflow modeling definitions as typed and directed graphs, which present less ambiguity than error-prone Perl scripts. It is important to note that the concrete workflow is automatically generated by the editing tool. Our prototype was fully parameterized and modeled with composite and single actors. The use of composite actors (sub-workflows) was feasible; it not only hid design complexities but also encapsulated collections of single actors. Finally, Kepler helped us to automate the workflow by quickly allowing the definition and design of a prototype which, due to its visual representation, was easier to understand and explain to our colleagues.
Figure 2. OrthoSearch workflow represented in Kepler (due to space restrictions, details of the composite actors are not presented).

4.3 Workflow execution control
The execution of the OrthoSearch workflow was not as simple as we expected, but we must acknowledge that Kepler made the execution simpler for the user than running the script pipeline. Identifying the most costly programs was straightforward with Kepler's control panel. It was very useful for deciding where to put more emphasis on execution control and for taking advantage of the distributed environment. Programs such as hmmsearch and InterPro were identified as parallel processing candidates. Although the OrthoSearch workflow did not process all CO-Genes files, the overall performance was better than that of the script version, due to its distributed execution.
Kepler is not a bioinformatics-specific tool, but it presents a set of characteristics that suit this domain, such as a feature that enables the steering of workflow execution. On the other hand, considering workflow scheduling, there is no built-in support for performance prediction, annotations or logging strategies in Kepler. The execution control still deserves some attention, since workflow processes can be mapped onto multiple computing nodes. When running on a grid or cluster, Kepler no longer controls the execution. Monitoring of QoS parameters would be very useful to avoid processing bottlenecks. Finally, inner loops inside composite actors are still a barrier. The iterations depend heavily on the computation model defined by the top-level directors. To overcome this, we did not use a director in sub-workflows: our composite actors inherited the director of their containers as an executive director.

4.4 Fault tolerance
So far, Kepler does not support fault tolerance. Processes are invoked and, in case of failure, users have to reload the original data and restart the execution from scratch. The beta 1 Kepler version used during our tests did not comply with some fault tolerance requirements. For instance, if an e-scientist needs to investigate a running workflow, maintenance actors have to be implemented and inserted into the flow. Such an approach diverts attention to administrative issues and makes the workflow harder to understand and maintain.
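Because fault handling must currently be coded by the e-scientist, a hand-written retry wrapper is one typical workaround; the generic Java sketch below illustrates such a user-defined policy and is not part of Kepler or of OrthoSearch.

```java
import java.util.concurrent.Callable;

// Generic hand-coded retry policy, illustrating the kind of user-defined fault handling
// that e-scientists had to add themselves in the Kepler version used for OrthoSearch.
public class RetryPolicy {

    public static <T> T runWithRetry(Callable<T> task, int maxAttempts, long waitMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();                       // e.g. invoke a remote workflow step
            } catch (Exception e) {
                last = e;                                 // record failure and try again
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(waitMillis);
            }
        }
        throw last;                                       // give up after maxAttempts
    }

    public static void main(String[] args) throws Exception {
        // Placeholder task standing in for a workflow step invocation.
        String result = runWithRetry(() -> "hmmsearch finished", 3, 2000);
        System.out.println(result);
    }
}
```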
4.5 Data management
The original OrthoSearch script pipeline did not cope with data management, which was far beyond the user's perception. Kepler, in contrast, provides a degree of flexibility in data management. Such flexibility allowed us to define and store provenance and other relevant data in XML files or relational schemas. WfMS like Taverna [8] define their own provenance schema, which is valuable but may not match the users' expectations. In Kepler, e-scientists can easily access different database management systems through the Data Access, FTP and SSH actors. Kepler equally allows access to local and remote data, and local and remote service invocation, along with a built-in concurrency control and job scheduling mechanism. Kepler offers support and facilities to connect to several relational database management systems, store data in them and submit SQL queries. Therefore, annotation, intermediate data and provenance data can be stored in a standard domain schema or in one defined by the user. Since we had adopted GUS, we were able to keep it as our semantic data support. However, Kepler suffers from a limitation in handling large database schemas, since it loads the whole schema into main memory. We experienced severe limitations with GUS in this respect.
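As a minimal sketch of storing provenance in a user-defined relational schema, the JDBC code below assumes a hypothetical run_provenance table and a placeholder connection URL; it does not reflect the actual GUS schema or any particular Kepler actor configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

// Hypothetical user-defined provenance table:
//   CREATE TABLE run_provenance (run_id VARCHAR(40), actor VARCHAR(80),
//                                parameters VARCHAR(400), executed_at TIMESTAMP);
public class ProvenanceStore {

    public static void record(String jdbcUrl, String runId, String actor, String parameters)
            throws Exception {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO run_provenance (run_id, actor, parameters, executed_at) "
                             + "VALUES (?, ?, ?, ?)")) {
            ps.setString(1, runId);
            ps.setString(2, actor);
            ps.setString(3, parameters);
            ps.setTimestamp(4, new Timestamp(System.currentTimeMillis()));
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL; any RDBMS reachable from Kepler could be used.
        record("jdbc:postgresql://localhost/orthosearch",
               "run-042", "HmmSearch", "e-value<=0.1; db=COGs");
    }
}
```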
4.6 Data provenance support
The OrthoSearch pipeline was not conceived with provenance recording and semantic support in mind. For instance, an e-scientist cannot determine the differences between two runs of a given experiment, as the pipeline is unable to register the exact source of the input CO-Genes used, their metadata and the program versions specified in the scripts. The pipeline also does not allow an e-scientist to determine the best way to run the experiment in the near future. It can register neither the server nor the time at which it was executed. Such lack of provenance information can compromise the validity of the experiment and its results.

Kepler, on the other hand, can offer some support for provenance recording, but it is still incomplete when compared to the models of provenance described by Stevens et al. [9] and Simmhan, Plale and Gannon [10]. According to Altintas et al. [2] and Bowers et al. [4] it is possible to extend Kepler to register provenance; they mention some frameworks that are not yet publicly available. Kepler does not cope with intermediate data movement requirements, as it does not transfer data automatically. It offers limited possibilities to annotate information on data provenance (intermediate and end results, including files and database references), process provenance (data about the workflow definition with the data and parameters used in the run), execution provenance (error and execution logs) and design provenance (information and decisions taken during the workflow design phase). Unfortunately, Kepler has very few facilities to manage bio-ontologies.

Our provenance support implementation on top of Kepler presented some benefits. The e-scientist can gather and store data provenance; MoML registers which programs were used and which parameters were set for a given execution, enabling comparisons between different runs. We can also store execution provenance data through traditional event logs, recording, for each experiment performed, the order of program invocations and their time-stamps.
5. IMPROVEMENTS TO KEPLER
Kepler constitutes a great foundation in terms of functionality, helping e-scientists to be more efficient and effective by providing formalism and by supporting automation, as workflows have the potential to accelerate and transform the scientific analysis process. The Kepler-based version of the OrthoSearch workflow not only offered a number of advantages over the script-based version, but also coped with most of the requirements presented in section 2. It allowed us to combine data and processes into a configurable, structured set of steps that implements an automated distributed solution. It greatly improved data analysis, especially when data is obtained from multiple sources and generated by computations on distributed resources. These advantages also encompassed ease of configuration, improved reusability and ease of maintenance, by supporting the re-running of different versions of workflow instances, updateable parameters, monitoring of long-running tasks, and help in finding parallel processing candidate programs.
However, as with many WfMS, Kepler must improve its support for provenance recording, fault tolerance and recovery from failures. To contribute to the Kepler project we implemented a set of new features in the execution engine. For instance, concrete workflows are stored by Kepler locally as single personal experiments. To overcome this limitation and enhance collaborative work while avoiding versioning problems, we coded a set of CVS functionalities into the execution engine. Such an approach leverages the storage of the workflow provenance necessary for scientific reproducibility, result publication, and result sharing among collaborators. It allows e-scientists to keep track of changes made over time, consolidating all personal versions of a given workflow and helping them find the differences between each version and execution. It also allows e-scientists to upload and download production workflows from a server or workstation.
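The idea can be outlined roughly as follows, assuming the standard cvs command-line client and a MoML file already inside a checked-out working copy; paths and the commit message are placeholders, and the real functionality was coded inside the Kepler execution engine rather than as a standalone class.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: version a concrete workflow (MoML file) with the standard cvs client,
// assuming the file already lives inside a checked-out CVS working copy.
public class WorkflowVersioning {

    static void cvs(File workingCopy, String... args) throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add("cvs");
        cmd.addAll(Arrays.asList(args));
        Process p = new ProcessBuilder(cmd).directory(workingCopy).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("cvs " + args[0] + " failed");
        }
    }

    public static void main(String[] args) throws Exception {
        File workingCopy = new File("/home/escientist/workflows");  // placeholder path

        // Bring the working copy up to date, then commit the edited workflow definition.
        cvs(workingCopy, "update", "orthosearch.xml");
        cvs(workingCopy, "commit", "-m", "tuned hmmsearch e-value cutoff", "orthosearch.xml");
    }
}
```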
Although the importance of provenance support in scientific data and workflow management is largely acknowledged [5][8][3], Kepler still does not offer a reasonable method to handle all kinds of provenance. The ability to capture runtime data and to re-execute a given workflow with different parameters is still an open issue. To address this in Kepler, we implemented a provenance recording service that intercepts execution data from the distributed applications and records it as a set of XML files. A number of ready-to-use components are available in Kepler, but none of them was tailored to match genes. So, we used Kepler's extensibility to develop an actor which performs the matching of the best reciprocal hits produced by the hmmsearch and hmmpfam programs.
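The exact XML layout is not detailed here; a plausible shape for such an execution record, written with the standard javax.xml.stream API, is sketched below, with element names and values chosen purely for illustration.

```java
import java.io.FileOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Illustrative only: writes one execution-provenance record as XML.
// Element names, file name and values are assumptions, not the recorder's actual format.
public class ProvenanceRecorder {

    public static void main(String[] args) throws Exception {
        try (FileOutputStream out = new FileOutputStream("provenance-run-042.xml")) {
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("execution");
            xml.writeAttribute("workflow", "OrthoSearch");
            xml.writeAttribute("runId", "run-042");

            xml.writeStartElement("actor");
            xml.writeAttribute("name", "HmmSearch");
            xml.writeAttribute("host", "node03.biowebdb");      // placeholder host name
            xml.writeStartElement("parameter");
            xml.writeAttribute("name", "e-value");
            xml.writeCharacters("0.1");
            xml.writeEndElement();                              // </parameter>
            xml.writeEndElement();                              // </actor>

            xml.writeEndElement();                              // </execution>
            xml.writeEndDocument();
            xml.close();
        }
    }
}
```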
6. CONCLUSION
As the usability and computational capacity of distributed environments increase, there will be a growing demand to integrate scientific applications and databases through WfMS. There are a number of important aspects that facilitate the development of distributed workflows using a service-oriented architecture: 1) applications may be offered as services; 2) services may be registered in a common repository or discovered using search engines; 3) service-based workflow composition tools may orchestrate different kinds of services as long as the services can exchange messages effectively. Integrating scientific workflows based on distributed grid service environments promises to be a chief advantage over local-based alternatives.
Choosing an adequate WfMS can be a difficult decision. Our experiments concentrated on open source tools and open standard proposals. We addressed important WfMS issues to assist e-scientists in estimating and planning the use of workflow engines within their bioinformatics projects, focusing on WfMS for distributed environments. There were indications that the requirements impact the design and execution of the Kepler-based OrthoSearch workflow. Besides, we suggest that those requirements should be used as guidelines to choose a proper WfMS. We shared our experience and the insights gained from reviewing common bioinformatics workflow requirements, using and evaluating Kepler in distributed environments against a real Perl script-based pipeline. In addition, we contributed by adding some functionality lacking in Kepler.
7. REFERENCES
[1] Altintas, I., et al. "Kepler: An Extensible System for Design and Execution of Scientific Workflows". In SSDBM (2004), 423-424.
[2] Altintas, I., et al. "A Framework for the Design and Reuse of Grid Workflows". In SAG'04, LNCS 3458, Springer (2005), 120-133.
[3] Cruz, S. M. S., et al. "Monitoring Bioinformatics Web Services Requests and Responses through a Log Based Architecture". In SEMISH (2005), 1787-1801.
[4] Bowers, S., et al. "A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows". LNCS 4145, Springer (2006), 133-147.
[5] Cohen, S., Cohen-Boulakia, S., Davidson, S. B. "Towards a Model of Provenance and User Views in Scientific Workflows". LNCS 4076, Springer (2006), 264-279.
[6] Krauter, K., Buyya, R. and Maheswaran, M. "A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing". Software: Practice and Experience, v. 32(2) (2002), 135-164.
[7] Ludäscher, B., et al. "Scientific Workflow Management and the Kepler System". Concurrency and Computation: Practice and Experience, v. 18(10) (2006), 1039-1065.
[8] Oinn, T., et al. "Taverna: Lessons in Creating a Workflow Environment for the Life Sciences". Concurrency and Computation: Practice and Experience, v. 18(10) (2006), 1067-1100.
[9] Stevens, R., Zhao, J., Goble, C. "Using Provenance to Manage Knowledge of In Silico Experiments". Briefings in Bioinformatics, v. 8 (2007), 183-194.
[10] Simmhan, Y. L., Plale, B., Gannon, D. "A Survey of Data Provenance in e-Science". ACM SIGMOD Record, v. 34 (2005), 31-36.
[11] Venugopal, S., Buyya, R. and Ramamohanarao, K. "A Taxonomy of Data Grids for Distributed Data Sharing, Management, and Processing". ACM Computing Surveys, v. 38 (2006), 1-53.
[12] Yu, J., Buyya, R. "A Taxonomy of Scientific Workflow Systems for Grid Computing". ACM SIGMOD Record, v. 34 (2005), 44-49.
[13] Zanikolas, S., Sakellariou, R. "A Taxonomy of Grid Monitoring Systems". Future Generation Computer Systems, v. 21(1) (2005), 163-188.