2009 Congress on Services - I
Reasoning on Scientific Workflows

Z. Lacroix,1,2 C. R. L. Legendre,2 and S. Tuzmen1
1 Translational Genomics Research Institute, 13028 Shea Blvd, Suite 110, Scottsdale, Arizona, USA
2 Arizona State University, Tempe, PO Box 875706, Arizona 85287-5706, USA
phone: (1) 480-727-6935, fax: (1) 480-965-8325, e-mail: [email protected]

Abstract

Scientific workflows describe the scientific process from experimental design, data capture, integration, processing, and analysis that leads to scientific discovery. Laboratory Information Management Systems (LIMS) coordinate the management of wet-lab tasks, samples, and instruments and allow reasoning on business-like parameters such as ordering (e.g., invoicing) and organization (automation and optimization), whereas workflow systems support the design of in silico workflows for their execution. We present an approach that supports reasoning on scientific workflows that mix wet and digital tasks. Indeed, experiments are often first designed and simulated with digital resources in order to predict the quality of the result or to identify the parameters suitable for the expected outcome. ProtocolDB allows the design of scientific workflows that may combine wet and digital tasks and provides the framework for prediction and reasoning on performance, quality, and cost.

1. Introduction

Scientific workflows describe the scientific process from experiment design, data capture, integration, processing, and analysis that leads to scientific discovery. While wet workflows are typically supported by Laboratory Information Management Systems (LIMS) [20] that coordinate the management of tasks, samples, and instruments and allow reasoning on business-like parameters such as ordering (e.g., invoicing) and organization (automation and optimization), digital workflows are supported by workflow systems such as Kepler [1] or Taverna [2]. The latter allow various digital tasks to be combined into workflows executable on a platform such as a grid. The problem is that a large number of scientific workflows mix wet and digital tasks. Experiments are often first designed and simulated with digital resources in order to predict the quality of the result or to identify the parameters suitable for the expected outcome. In this paper we propose an approach that allows the design of scientific workflows that may combine wet and digital tasks and provides the framework for prediction and reasoning on the data flow. We illustrate our approach with a sub-cloning protocol designed to construct a fusion protein vector, presented in Section 2. In ProtocolDB, workflows are represented at two levels. First, the scientific aim is expressed in terms of a domain ontology, as described in Section 3; then one or more implementations are specified, as explained in Sections 4 and 5. Section 6 is devoted to workflow representation and reasoning in ProtocolDB. We discuss related work in Section 7 and conclude in Section 8.

978-0-7695-3708-5/09 $25.00 © 2009 IEEE  DOI 10.1109/SERVICES-I.2009.73

2. Workflow Motivation

The impact of cholesterol and its regulation on human health is well known. Its role in atherosclerosis, and drug therapy for this process, has been extensively studied [3]. Cholesterol-rich membrane microdomains are known to be important in cell signaling pathways relevant to cancer [4, 5]. More recently, dysregulation of enzymes involved in cholesterol biosynthesis has gained significant notice for its potential relevance to cancer. Farnesyl diphosphate farnesyl transferase 1 (FDFT1) is one of the key regulatory components of the cholesterol biosynthesis pathway, directing sterol biosynthesis to a branch resulting in the exclusive synthesis of cholesterol [6]. Over-expression of FDFT1 mRNA has been described in certain types of cancer [7]. The locus of this gene is also the site of loss of heterozygosity and homozygous deletions in other cancers [8]. Brusselmans et al. [9] have illustrated the importance of de novo cholesterol synthesis for cancer cell biology, suggesting that FDFT1 plays an important role as a putative drug target. The evaluation of cell cycle progression is essential in many scientific studies. The BrdU Proliferation Assay (Calbiochem, Cat# HTS01) is a nonisotopic immunoassay for the quantitation of bromodeoxyuridine incorporation into the newly synthesized DNA of actively proliferating cells. This is a simple way to check the rate of cell proliferation either in a tissue or
in cells. Following the partial denaturation of double-stranded DNA, BrdU is detected immunochemically, allowing the assessment of the population of cells that are actively synthesizing DNA. The main goal of the workflow is to prepare new constructs (by a sub-cloning technique) of the FDFT1 gene to be tested for FDFT1 expression using the BrdU proliferation assay. Two constructs (pAcGFP1-FDFT1-N1 and pAcFDFT1-GFP1-C1) were prepared to facilitate the modulation of FDFT1 in its fusion form with GFP upon transfection of these constructs into various cancer-specific cell lines. The GFP portion of the fusion protein will enable the visualization of the intracellular localization of the FDFT1 protein. The motivation for both an -N1 and a -C1 construct is to find the most functional construct of the FDFT1 fusion protein of interest and use it to enable over-expression of the FDFT1 gene in different cancer-specific cell lines. This is expected to allow the contextual modulation of this gene either as a tumor suppressor gene or as an oncogene. We chose to amplify the FDFT1 gene using sense and anti-sense primers tagged with EcoRI and BamHI restriction enzyme sites from a construct that already harbored this gene. This was performed in order to enable directional cloning into the pAcGFP1-FDFT1-N1 and pAcFDFT1-GFP1-C1 vectors. The purpose of this new cloning strategy was to confirm our findings from previous transfection experiments using a different construct of the FDFT1 gene.
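The directional-cloning step above relies on primers tagged with EcoRI and BamHI sites, so the amplicon should carry exactly one of each site. As an aside, this check can be automated with a simple scan; the recognition sequences (EcoRI = GAATTC, BamHI = GGATCC) are standard, but the amplicon below is a hypothetical toy sequence, not the FDFT1 construct.

```python
# Scan a sequence for restriction-enzyme recognition sites.
# EcoRI (GAATTC) and BamHI (GGATCC) recognition sequences are standard;
# the example amplicon is hypothetical.

SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}

def find_sites(seq: str, site: str):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    seq = seq.upper()
    hits, i = [], seq.find(site)
    while i != -1:
        hits.append(i)
        i = seq.find(site, i + 1)
    return hits

def check_directional_design(amplicon: str):
    """Directional cloning needs exactly one EcoRI and one BamHI site,
    so the insert can only enter the vector in one orientation."""
    positions = {name: find_sites(amplicon, s) for name, s in SITES.items()}
    ok = all(len(p) == 1 for p in positions.values())
    return ok, positions

# Hypothetical amplicon: EcoRI tag ... coding region ... BamHI tag
amplicon = "GAATTC" + "ATGGCTAGCAAGGGC" + "GGATCC"
ok, positions = check_directional_design(amplicon)
```

A design with a second internal EcoRI or BamHI site would fail this check, since digestion would then also cut inside the insert.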
3. Design Workflow

Workflows are designed top-down in ProtocolDB [10]. First the user enters the description of the workflow as a whole, specifying its input, output, and name. Then the user selects a splitting operator1 (i.e., the successor or parallel operator) to start describing the structure of the workflow and documents each of the sub-workflows. The sub-cloning workflow as entered in ProtocolDB is illustrated in Figure 1. The design of the sub-cloning workflow follows three successive steps: (1) the successor operator is selected and the second (bottom) workflow is documented as a ligation step, (2) on the first (top) task the parallel operator is selected and one of the tasks (left) is documented as a digestion step, and finally (3) on the remaining task (right) the successor operator is selected and the first (top) task is documented as a PCR step whereas the second (bottom) task is documented as a digestion step. The workflow structure is a binary tree, as illustrated in Figure 2. At each step of the design, the dataflow is captured: the input and output of each task is a complex datatype expressed in terms of a domain ontology (see Figure 3). The ontology contains two types of nodes (concepts and complex conceptual types) and labeled directed edges (conceptual relations).

The dataflow involved in the sub-cloning protocol is composed of an acceptor vector (AV), restriction enzymes (RE), a donor vector (DV), a target sequence (TS), and primers. AV, DV, and TS are instances of the concept Sequence, primers are a couple of two instances of the concept Sequence, and RE is an instance of the concept Enzyme. The type of the input of the workflow is represented in ProtocolDB as [Sequence, Enzyme, Sequence, Sequence, [Sequence, Sequence]].

Figure 1 - Design workflow: sub-cloning
Figure 2 - Tree representation of workflows (leaves: Digestion, PCR, Digestion, Ligation)
Figure 3 - Workflow domain ontology (concepts include Sequence, Enzyme, Solvent, and Ion; conceptual relations include PCR, Digestion, and Ligation)

1 Loops are not represented in the current model. The next version of ProtocolDB will be extended with a controlled workflow import operator to express loops.
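The top-down design procedure above (split a workflow with a successor or parallel operator, then document each sub-task) yields a binary tree. The following sketch models that tree; the class and field names are illustrative, not ProtocolDB's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model of a ProtocolDB design tree: a workflow is either a
# documented task (leaf) or a successor/parallel split of two sub-workflows.

@dataclass
class Workflow:
    name: str
    op: Optional[str] = None          # None for a leaf, else "successor" or "parallel"
    left: Optional["Workflow"] = None
    right: Optional["Workflow"] = None

    def leaves(self):
        """Documented tasks, collected by a left-to-right traversal."""
        if self.op is None:
            return [self.name]
        return self.left.leaves() + self.right.leaves()

# The sub-cloning design of Figure 2:
# (Digestion || (PCR ; Digestion)) ; Ligation
sub_cloning = Workflow(
    "sub-cloning", "successor",
    left=Workflow("preparation", "parallel",
                  left=Workflow("Digestion"),
                  right=Workflow("amplify", "successor",
                                 left=Workflow("PCR"),
                                 right=Workflow("Digestion"))),
    right=Workflow("Ligation"),
)
```

The three design steps of the text correspond to creating the root successor node, then the parallel node, then the inner successor node; the leaves read off in order as Digestion, PCR, Digestion, Ligation.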
4. Workflow Analysis

A straightforward implementation of the workflow designed in Section 3 consists of a selection of services to implement each wet-lab task. ProtocolDB provides the scientists with the ability to predict the outcome of the wet-lab experimental process with an in silico implementation. We illustrate the whole experimental workflow in Figure 4.

Figure 4 - Complete experimental workflow (the dry-lab branch chains Primer Design, RE Digestion, PCR, RE Digestion, Ligation, an in silico loop, Sequence Validation, ORF Searching, AA seq Extraction, Translation, Protein Alignment, and Alignment Report Analysis ending in an "In frame? Yes/No" test; the wet-lab branch chains RE Digestion, PCR, RE Digestion, and Ligation)

4.1 Vector Construction

The PCR task aims at producing large amounts of material and at verifying the quality of the donor vector. It amplifies a specific sequence using primers, nucleotides (dNTPs), buffer, an enzyme (Taq polymerase), and a template. The design of primers is critical to the success of this experimental step. Computer tools such as Primer3,2 PCR Primer Designer,3 Oligo Primer,4 and GeneFisher5 can assist in the design of primers suitable for the experiment. The output of the PCR step is a double-strand sequence in millions of copies, which can be forwarded directly to the next step because no product in the final reaction will interfere with the restriction enzyme digestion. It is worth noting that the input of in silico PCR tools such as In-Silico PCR6 and In silico PCR amplification7 is limited to the primers only. Templates are usually the whole genome of a selected organism. These tools are therefore not suitable for workflow validation because it is not possible to enter a given template. The PCR design task is simulated by a workflow composed of two successive tasks, Primer Creation and PCR.

The digestion steps use the same enzymes to cut both the acceptor vector and the PCR product and produce sticky ends which can ligate together in the presence of T4 Ligase, a specific enzyme. The digestion task may lead to different sequences that remain in the solution. A gel electrophoresis is performed to discriminate the digested fragments by separating nucleotide sequences according to their molecular weight. The task can be simulated by using nucleotide sequence lengths as long as the complete sequence information is known. Indeed, tools such as In silico Simulation of Double Digestion Fingerprinting Techniques8 can locate the restriction enzyme sites on the sequence and simulate the digestion by truncating the sequence accordingly.

The ligation of the acceptor vector and the sequence of interest (insert) may now take place. This process is achieved by a T4 Ligase enzyme. The sticky ends of the sequences have to be complementary to each other in order to ligate. Because the same enzymes were used for the PCR product and AV, the ends of the cut AV and the insert match and a new functional circular vector (vector+insert) can be produced. The ligation step may be performed manually in silico by combining the two sequences (backbone vector and insert) with their matching edges.

4.2 Sequence Validation

In the context of the wet experiment, the validation of the in-frame insertion of the inserts into the backbone vector is obtained by the extraction and verification of positive clones that have grown under antibiotic culture pressure. An aliquot of the ligation reaction solution is used as input for the bacterial transformation. The aim of the bacterial transformation is to introduce the new construct into bacteria in order to generate many copies of the same construct through bacterial proliferation. The task may involve bacteria with the expected construct, bacteria without any construct, as well as bacteria with acceptor vectors that have not integrated the insert. Only the bacteria with the construct grow. An in silico sequence validation workflow may be designed to predict the quality of the construct as follows. First, a tool such as pDRAW329 can extract the complete list of restriction enzyme sites and identify the Open Reading Frames (ORFs). Then an alignment tool such as bl2seq10 can be used to align the selected ORF sequence with the two expected protein sequences: the target sequence (TS) and the Enhanced Green Fluorescent Protein (EGFP)11 sequences.

4.3 Simulation

The selection of each resource to perform a simulation constitutes an in silico implementation of the workflow described in Section 3. The execution of a simulation workflow predicts the accuracy of the method as follows. If the alignment report shows 100% similarity with the target sequence as well as with the EGFP protein, the in-frame status is set to "correct" and the protocol is validated. Otherwise, the protocol will not succeed. In the latter case the scientists will analyze the reasons for failure and revise the in silico implementation workflow accordingly until it is validated. The services selected to implement scientific workflows may significantly affect the quality of the predicted output. For example, primer quality may have a critical impact on PCR quality, i.e., the productivity of the PCR. Primer quality is based on different parameters (melting temperature, self-complementarity, self-dimerization, hairpin formation, and self-annealing of the 3' and 5' ends). Software and online tools and services exist for testing primer quality. Unfortunately, as they use different algorithms, the results may vary significantly from one method to another even with similar parameters. At this point, only benchmarks for protocols might help to choose the right method, tool, or algorithm according to the user's needs [12].

5. Workflow Implementations

In ProtocolDB each workflow design is linked to one or more implementation workflows. As in the design phase, the input and output of each implementation task is a complex datatype expressed in terms of a domain ontology. The basic implementation tasks are services, including tasks performed by technicians, machines, robots, etc., and digital tasks performed by applications or scripts (e.g., Web services). ProtocolDB allows the specification of several different implementations of a workflow design. For example, we enter two implementations of the sub-cloning workflow. The first implementation is the in silico workflow illustrated in Figure 5. It is used to evaluate the method and the parameters. The input of the in silico simulation and validation workflow is represented in ProtocolDB as [Sequence, Enzyme, Sequence, Enzyme, Sequence].

Figure 5 - In silico implementation workflow

The second implementation, illustrated in Figure 6, is the in vivo implementation that is executed once the in silico implementation is validated. This implementation captures the real wet implementation of the workflow as it is going to be executed once the simulation achieved with the in silico implementation validates the method and parameters. The ability to link several implementations to a workflow design offers many advantages. First, it is very useful to record the alternative implementations and various versions often generated by scientists. This functionality provides the framework for reasoning about possible implementations, and for simulating and predicting the performance of implementations.

2 Primer3 is a very popular tool developed at MIT and available at http://frodo.wi.mit.edu/.
3 PCR Primer Designer (GenScript) is available at http://www.genscript.com/cgi-bin/tools/primer_gens.
4 See http://www.oligo.net/.
5 See http://bibiserv.techfak.uni-bielefeld.de/genefisher2/.
6 See http://genome-mirror.bscb.cornell.edu/cgi-bin/hgPcr.
7 See http://insilico.ehu.es/PCR/.
8 See http://insilico.ehu.es/DDF/.
9 See pDRAW32 revision 1.1.97 at http://www.acaclone.com/.
10 See http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi.
11 EGFP is a protein coding sequence useful to localize the expression of other proteins. This protein sequence emits green light when excited with a blue laser. Its fusion with other proteins has a limited effect on its fluorescent properties.
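The in silico digestion and ligation steps of Section 4 can be mimicked with plain string operations: cut at the enzyme's recognition site, record the overhang left at the cut, and only join fragments whose ends are compatible. The sketch below is deliberately simplified (top strand only, linear DNA); EcoRI's G^AATTC cut pattern is standard, but the vector sequence is hypothetical.

```python
# Simulate an EcoRI digestion followed by a sticky-end ligation.
# EcoRI recognizes GAATTC and cuts after the first G (G^AATTC), leaving
# an AATT overhang. Top strand only, for illustration; sequences are toy data.

ECORI_SITE, ECORI_CUT, ECORI_OVERHANG = "GAATTC", 1, "AATT"

def digest(seq: str):
    """Cut `seq` at every EcoRI site. Each fragment is paired with the
    overhang produced at its downstream cut (None for the last fragment)."""
    fragments, start = [], 0
    i = seq.find(ECORI_SITE)
    while i != -1:
        fragments.append((seq[start:i + ECORI_CUT], ECORI_OVERHANG))
        start = i + ECORI_CUT
        i = seq.find(ECORI_SITE, start)
    fragments.append((seq[start:], None))
    return fragments

def ligate(frag_a, frag_b):
    """Join two fragments only if frag_a's overhang matches the start of
    frag_b's sequence; return None when the sticky ends do not match."""
    seq_a, overhang = frag_a
    seq_b, _ = frag_b
    if overhang is None or not seq_b.startswith(overhang):
        return None
    return seq_a + seq_b

vector = "CCGG" + "GAATTC" + "TTAA"      # hypothetical vector with one EcoRI site
backbone, tail = digest(vector)
product = ligate(backbone, tail)         # re-ligating the original cut
```

Because the same enzyme produced both ends, the re-ligated product reconstructs the original sequence, which is exactly the matching-edges check the text describes for the backbone vector and insert.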
Figure 6 - In vivo implementation workflow

6. Reasoning on Workflows

In this section we discuss the benefits of the ProtocolDB approach to record workflow designs and implementations and to perform valuable reasoning on workflows. The first benefit of the approach is the ability to record workflow designs with respect to a domain ontology. This allows a conceptual design of a workflow. The second benefit is the ability to represent the dataflow in terms of collections of conceptual variables. The specification of the dataflow is valuable to annotate datasets produced at workflow execution. ProtocolDB is not designed to execute workflows. Instead, it supports workflow design, reasoning, and recording. Workflows recorded in ProtocolDB may be completely experimental, in which case their execution dataflow corresponds to the data produced and recorded by the scientists after each step. Other workflows may be completely digital, and they can be executed by systems such as Taverna or Kepler. Others may mix experimental and digital tasks. Moreover, the use of nested collections (record, set, and list) is suitable to represent scientific dataflows. With this workflow model we can define various notions of similarity that will be used to reason about workflows.

Definition 1 - Two services are strongly semantically equivalent if they are mapped to the same conceptual relationship in the ontology. Two services are semantically equivalent if their inputs (resp. outputs) are mapped to the same conceptual datatype.

Property 1 - Two services that are strongly semantically equivalent are semantically equivalent.

We also define a notion of similarity for conceptual relationships in a domain ontology by allowing navigation within the graph as follows. We first define the subtyping relationship (noted ⊑) on datatypes with respect to the isa hierarchy of the ontology (e.g., [DNA, {Sequence, Gene}] is a subtype of [Sequence, {Sequence, Gene}]).

Definition 2 - Let R1 and R2 be two conceptual relationships with input conceptual types c1 and c2, respectively, and output conceptual types c1' and c2', respectively. If c1 ⊑ c2 and c1' ⊑ c2', then R1 ⊑ R2. We extend the definition to conceptual paths, where a path P is a composition of conceptual relationships R1.R2.…Ri…Rn such that for each i from 1 to n-1 the output of Ri is the same conceptual type as the input of Ri+1. Finally, we define a partial ordering between conceptual paths. This partial ordering extends the relation of exact conceptual match between the inputs and outputs as well as the internal structure of workflows. We now define a loose semantic similarity of services. Note that semantic equivalence is a reflexive relationship.

Definition 3 - Two services S1 and S2 implementing respectively the conceptual relationships R1 and R2 are loosely semantically equivalent if there exists R such that R1 ⊑ R and R2 ⊑ R.

The definitions of semantic equivalence between services are extended to workflow implementations. We first extend the definitions to basic workflows.

Definition 4 - Two workflow implementations W1 and W2 implemented respectively by the services S1 and S2 implementing respectively the conceptual relationships R1 and R2 are semantically equivalent (resp. strongly) if S1 and S2 are semantically equivalent (resp. strongly).

The notion of equivalence of workflows is compatible with the successor and parallel operators. This provides a method to evaluate whether a workflow implementation can be used as an alternative implementation for a workflow design.

Definition 5 - Let W be the successor composition of W1 and W2, where W1 and W2 are semantically equivalent (resp. strongly) to W'1 and W'2; then W and the successor composition of W'1 and W'2 are semantically equivalent (resp. strongly).

Definition 6 - Let W be the parallel composition of W1 and W2, where W1 and W2 are semantically equivalent (resp. strongly) to W'1 and W'2; then W and the parallel composition of W'1 and W'2 are semantically equivalent (resp. strongly).

Property 2 - Two workflow implementations are semantically equivalent if they are two implementations of the same workflow design.
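The subtyping and equivalence notions defined in Section 6 can be prototyped directly: an isa hierarchy gives a subtype test, and Definitions 1 and 3 become small predicates over services mapped to conceptual relationships. The sketch below is illustrative; the toy ontology, service names, and the name-based approximation of a common ancestor relationship R are assumptions, not ProtocolDB internals.

```python
# Prototype of the subtyping (Definition 2) and service-equivalence
# (Definitions 1 and 3) checks; the ontology and services are toy data.

ISA = {"DNA": "Sequence", "Gene": "Sequence"}   # child concept -> parent concept

def subconcept(c1: str, c2: str) -> bool:
    """True if c1 equals c2 or is a (transitive) descendant of c2 in the isa hierarchy."""
    while c1 != c2:
        if c1 not in ISA:
            return False
        c1 = ISA[c1]
    return True

# Each service is mapped to a conceptual relationship:
# (input concept, relationship name, output concept).
SERVICES = {
    "blastn":  ("DNA", "align", "Sequence"),
    "bl2seq":  ("Sequence", "align", "Sequence"),
    "primer3": ("Sequence", "design_primers", "Sequence"),
}

def strongly_equivalent(s1: str, s2: str) -> bool:
    """Definition 1: both services map to the same conceptual relationship."""
    return SERVICES[s1] == SERVICES[s2]

def loosely_equivalent(s1: str, s2: str) -> bool:
    """Definition 3 (sketch): both relationships refine a common relationship R,
    approximated here by a shared name plus subtyped inputs and outputs."""
    i1, r1, o1 = SERVICES[s1]
    i2, r2, o2 = SERVICES[s2]
    return (r1 == r2
            and (subconcept(i1, i2) or subconcept(i2, i1))
            and (subconcept(o1, o2) or subconcept(o2, o1)))
```

Under this toy ontology, "blastn" and "bl2seq" are loosely but not strongly equivalent: both implement an alignment relationship, but blastn's input concept DNA is a strict subconcept of bl2seq's Sequence.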