Prov2ONE: An Algorithm for Automatically Constructing ProvONE ...

Jun 4, 2016 - Ajinkya PrabhuneEmail author; Aaron Zweig; Rainer Stotzka; Michael Gertz ... Provenance traces history within workflows and enables ...
Prov2ONE: An Algorithm for Automatically Constructing ProvONE Provenance Graphs

Ajinkya Prabhune¹(✉), Aaron Zweig¹, Rainer Stotzka¹, Michael Gertz², and Juergen Hesser³

¹ Institute for Data Processing and Electronics, Karlsruhe Institute of Technology, Karlsruhe, Germany {ajinkya.prabhune,aaron.zweig,rainer.stotzka}@kit.edu
² Institute of Computer Science, Heidelberg University, Heidelberg, Germany [email protected]
³ Department of Radiation Oncology, Heidelberg University, Heidelberg, Germany [email protected]

Abstract. Provenance traces history within workflows and enables researchers to validate and compare their results. Currently, modelling provenance in ProvONE is an arduous task that lacks an automated approach. This paper introduces a novel algorithm, called Prov2ONE, that automatically generates the prospective ProvONE provenance for scientific workflows defined in BPEL4WS. The same prospective ProvONE graph is then updated with the relevant retrospective provenance. This prevents provenance from being captured in various non-standard provenance models and thus enables research communities to share, compare and analyze workflows and their associated provenance. Finally, the Prov2ONE algorithm is used to generate a ProvONE provenance graph for the nanoscopy workflow.

1 Introduction

In the last decade, research communities have adopted workflow management systems (WfMS) for orchestrating their complex scientific workflows. Nanoscopy is a novel imaging technique in biological and medical research that aims to reduce the resolution gap between conventional light microscopy and electron microscopy [1]. In a nanoscopy workflow, the raw image datasets acquired by high-resolution microscopes are processed in multiple stages to produce the final results. The Nanoscopy Open Reference Data Repository (NORDR) [2] is provided to researchers to store, process and access their data. To execute the nanoscopy workflows, a WfMS¹ is integrated with NORDR. A critical aspect of NORDR is the management of provenance information. This paper addresses three main requirements for managing provenance in NORDR: (i) enable automated modelling of both prospective and retrospective provenance in a single provenance model; (ii) design an extensible provenance management component for NORDR; (iii) provision a dedicated provenance

¹ We use the Apache ODE workflow engine: http://ode.apache.org/.

© Springer International Publishing Switzerland 2016. M. Mattoso and B. Glavic (Eds.): IPAW 2016, LNCS 9672, pp. 204–208, 2016. DOI: 10.1007/978-3-319-40593-3_22


storage system with efficient query processing. To fulfill the first requirement, the paper presents the Prov2ONE algorithm, which generates a ProvONE [3] provenance graph for BPEL4WS² workflows. The algorithm is based on ProvONE because the Open Provenance Model (OPM) [5] and PROV [6] are limited to modelling only retrospective provenance [4]. The second requirement is met by presenting the provenance management architecture for NORDR. Finally, for efficient storage and retrieval of provenance information, the ProvONE graphs are stored in a graph database (ArangoDB³).

Fig. 1. Provenance management in NORDR

Fig. 2. Nanoscopy workflow defined in BPEL4WS

2 Provenance Management Architecture

Figure 1 outlines the components of the NORDR system that are essential for modelling, collecting or storing the provenance information of a scientific workflow.

Workflow (WF) Engine: The WF engine is responsible for interpreting the workflow definition and invoking the necessary data processing services.

NORDR: NORDR is a multi-layered architecture whose modules primarily offer data processing and data storage services.

Provenance Manager: The provenance manager is responsible for handling all the provenance information generated before, during and after the execution of each scientific workflow. The Provenance Manager comprises four modules: (i) The Prov2ONE module holds the implementation of the Prov2ONE algorithm. (ii) The NORDR Provenance Collector module collects the retrospective

² http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html.
³ https://www.arangodb.com/.


provenance information from NORDR. (iii) The WF Engine Provenance Collector module collects the retrospective provenance information from the WF engine. (iv) The OPM/PROV Provenance Exporter module enables interoperability between ProvONE and the OPM and PROV standards.

3 Prov2ONE Algorithm

The Prov2ONE algorithm comprises two components. In the first component, BPEL4WS activities are classified as either structure activities or operation activities. Each structure activity is pushed onto the stack, with its head and tail sets determined from the preceding structure activities. The algorithm then recurses on the children of the pushed structure, whose stack entry is popped upon completion. In the second component, nodes labeled from the set Σ = {Workflow, Process, InputPort, OutputPort, DataLink, SeqCtrlLink} are created, and the relevant associations, labeled from the set Ω = {sourcePToCL, CLtoDestP, hasInPort, hasOutPort, DLToInPort, outPortToDL}, are drawn. This step is carried out in the GenerateProvOne method of Algorithm 2. The ProvONE graph is defined as G = (V, E, λ, ψ), with a set of vertices V = {v1, v2, v3, ..., vn}, a set of edges E ⊆ V × V, a vertex labeling function λ: V → Σ, and an edge labeling function ψ: E → Ω. The Prov2ONE algorithm is tested on the nanoscopy workflow shown in Fig. 2.
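As a rough illustration of the graph definition G = (V, E, λ, ψ), the following Python sketch stores the two labeling functions as dictionaries and rejects labels outside Σ and Ω. The class name, method names and the process names `register` and `segment` are hypothetical and chosen only for this example; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass, field

# Vertex labels (the set Σ) and edge labels (the set Ω) as listed in the text.
SIGMA = {"Workflow", "Process", "InputPort", "OutputPort", "DataLink", "SeqCtrlLink"}
OMEGA = {"sourcePToCL", "CLtoDestP", "hasInPort", "hasOutPort", "DLToInPort", "outPortToDL"}


@dataclass
class ProvOneGraph:
    """Labeled directed graph G = (V, E, λ, ψ); hypothetical helper class."""
    vertex_label: dict = field(default_factory=dict)  # λ: V → Σ (keys are V)
    edge_label: dict = field(default_factory=dict)    # ψ: E → Ω (keys are E ⊆ V × V)

    def add_vertex(self, v, label):
        if label not in SIGMA:
            raise ValueError(f"unknown vertex label: {label}")
        self.vertex_label[v] = label

    def add_edge(self, src, dst, label):
        if label not in OMEGA:
            raise ValueError(f"unknown edge label: {label}")
        if src not in self.vertex_label or dst not in self.vertex_label:
            raise KeyError("both endpoints must already be vertices")
        self.edge_label[(src, dst)] = label


# Two processes joined by one sequence control link (names are made up).
g = ProvOneGraph()
g.add_vertex("register", "Process")
g.add_vertex("link1", "SeqCtrlLink")
g.add_vertex("segment", "Process")
g.add_edge("register", "link1", "sourcePToCL")
g.add_edge("link1", "segment", "CLtoDestP")
```

Keeping λ and ψ as plain dictionaries keyed by vertex and by vertex pair mirrors the formal definition directly, at the cost of allowing only one edge per ordered pair of vertices.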

Fig. 3. ProvONE graph of nanoscopy workflow (Color figure online)
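The stack handling of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it covers only the control-flow part (Process and SeqCtrlLink nodes, sourcePToCL and CLtoDestP edges) and leaves out ports and data links. The `Activity` class, the `link->name` naming scheme, the `Workflow` label for the root R and the example workflow are all hypothetical.

```python
from dataclasses import dataclass, field

A = {"sequence", "process", "while", "scope"}  # sequential structure activities
B = {"flow", "pick", "switch"}                 # branching structure activities
P = {"invoke", "receive", "assign", "reply"}   # operation activities


@dataclass
class Activity:
    name: str
    operation: str
    children: list = field(default_factory=list)


def build_control_flow(children, root="R"):
    """Return (vertices, edges) for the control-flow part of the graph."""
    vertices = {root: "Workflow"}          # assumption: the root R is the Workflow node
    edges = []                             # (source, target, label)
    stack = [["process", {root}, {root}]]  # entries are [operation, head, tail]

    def generate(act):
        vertices[act.name] = "Process"
        link = f"link->{act.name}"
        vertices[link] = "SeqCtrlLink"
        top = stack[-1]
        # Sequential parent: continue from its tail. Branching parent: fan out from its head.
        sources = top[2] if top[0] in A else top[1]
        for src in sorted(sources):
            edges.append((src, link, "sourcePToCL"))
        edges.append((link, act.name, "CLtoDestP"))
        top[2] = {act.name} if top[0] in A else top[2] | {act.name}

    def walk(acts):
        for act in acts:
            if act.operation in P:
                generate(act)
            elif act.operation in A | B:
                top = stack[-1]
                if top[0] in A:
                    head = set(top[2])
                    if top[1] == top[2]:
                        top[1] = set()
                    top[2] = set()
                else:
                    head = set(top[1])
                    top[1] = set()
                tail = set(head) if act.operation in A else set()
                stack.append([act.operation, head, tail])
                walk(act.children)
                end = stack.pop()
                top = stack[-1]
                if not top[1]:
                    top[1] = end[1]
                top[2] = end[2] if top[0] in A else top[2] | end[2]

    walk(children)
    return vertices, edges


# Hypothetical workflow: receive, then two parallel invokes, then reply.
workflow = [Activity("main", "sequence", [
    Activity("r", "receive"),
    Activity("par", "flow", [Activity("a", "invoke"), Activity("b", "invoke")]),
    Activity("p", "reply"),
])]
vertices, edges = build_control_flow(workflow)
```

In this example both branches of the flow start from `r`, and the control link in front of `p` collects both `a` and `b` as sources, which is the effect of keeping separate head and tail sets on the stack.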


Algorithm 1. Prov2ONE Algorithm

List A := {sequence, process, while, scope}
List B := {flow, pick, switch}
List P := {invoke, receive, assign, reply, ∅}
List D := ∅, Stack S := {[process, {R}, {R}]}
C = {c1, c2, c3, ..., cn} is vector of BPEL children
function Prov2ONE(C)
    for ci in C do
        if ci.operation ∈ P then
            GenerateProvOne(ci)
        end if
        if ci.operation ∈ A ∪ B then
            top = POP(S)
            if top[0] ∈ A then
                head = top[2]
                if top[1] == top[2] then
                    top[1] = ∅
                end if
                top[2] = ∅
            else
                head = top[1], top[1] = ∅
            end if
            if ci.operation ∈ A then
                tail = COPY(head)
            else
                tail = ∅
            end if
            PUSH(S, top)
            PUSH(S, [ci.operation, head, tail])
            Prov2ONE(ci.children)
            end = POP(S), top = POP(S)
            if top[1] = ∅ then
                top[1] = end[1]
            end if
            if top[0] ∈ A then
                top[2] = end[2]
            else
                top[2] = top[2] ∪ end[2]
            end if
            PUSH(S, top)
        end if
    end for
end function

Algorithm 2. GenerateProvOne Method

function GenerateProvOne(ci)
    in = ci.input, out = ci.output
    λ(ci) = Process
    ADD(V, ci)
    top = POP(S)
    if top[0] ∈ A then
        ADD(V, s)
        λ(s) = SeqCtrlLink
        ψ(E) = sourcePToCL
        CONNECT(top[2], s, E)
        ψ(E) = CLtoDestP
        CONNECT(s, ci, E)
        top[2] = {ci}
    else
        ADD(V, s)
        λ(s) = SeqCtrlLink
        ψ(E) = sourcePToCL
        CONNECT(top[1], s, E)
        ψ(E) = CLtoDestP
        CONNECT(s, ci, E)
        top[2] = top[2] ∪ {ci}
    end if
    PUSH(S, top)
    if in ≠ ∅ then
        λ(in) = InputPort
        ψ(E) = hasInPort
        ADD(E, ci, in)
        if in ∈ D then
            dl = GET(D, out)
            ψ(E) = DLToInPort
            ADD(E, dl, in)
        end if
    end if
    if out ≠ ∅ then
        ψ(E) = hasOutPort
        λ(out) = OutputPort
        ADD(E, ci, out)
        λ(dl) = DataLink
        ψ(E) = outPortToDL
        ADD(E, out, dl), ADD(D, dl)
    end if
end function
function CONNECT(N, ci, E)
    for n ∈ N do
        ADD(E, n, ci)
    end for
end function

4 Conclusion and Future Work

This paper presented a novel algorithm, called Prov2ONE, that generates the prospective ProvONE provenance graph for an arbitrary BPEL4WS workflow. Figure 3 shows the ProvONE graph generated by the Prov2ONE algorithm for the nanoscopy workflow. During the execution of the workflow, the retrospective ProvONE elements, i.e. ProcessExec, Data and User, are linked to the prospective ProvONE graph through the associations wasAssociatedWith, wasGeneratedBy, used, wasDerivedFrom and dataOnLink. The services for collecting and appending the


retrospective provenance to the prospective ProvONE graph are implemented in the Provenance Manager component. By modelling both the prospective and retrospective provenance of a scientific workflow in ProvONE, the redundant task of collecting, storing and maintaining provenance in various systems is avoided entirely. The architecture of the NORDR system is shown in Fig. 1; a graph database is used to enable efficient storage and querying of the provenance information. Currently, we are implementing the OPM/PROV exporter module based on a formal semantic mapping between ProvONE and OPM/PROV.
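To make the linking of retrospective provenance more concrete, the sketch below appends one execution record to an existing prospective graph held as a vertex-label dictionary and an edge list. It is a hedged illustration only: the function name, the identifiers in the usage example, and the choice of which node types each association connects are assumptions made here, not details given in the paper. The dataOnLink association is left out because it would additionally require the DataLink nodes.

```python
def record_execution(vertices, edges, process, run_id, user, used, generated):
    """Append one execution of `process` to an existing prospective graph.

    vertices: dict mapping node id -> label; edges: list of (source, target, label).
    """
    if vertices.get(process) != "Process":
        raise KeyError(f"{process!r} is not a Process in the prospective graph")
    vertices[run_id] = "ProcessExec"
    vertices[user] = "User"
    # Assumption: the execution is associated with its prospective Process and with the User.
    edges.append((run_id, process, "wasAssociatedWith"))
    edges.append((run_id, user, "wasAssociatedWith"))
    for item in used:
        vertices[item] = "Data"
        edges.append((run_id, item, "used"))
    for item in generated:
        vertices[item] = "Data"
        edges.append((item, run_id, "wasGeneratedBy"))
        # Assumption: every output is derived from every input of the same execution.
        for source in used:
            edges.append((item, source, "wasDerivedFrom"))


# Hypothetical prospective graph with a single process and one recorded run.
vertices = {"segment": "Process"}
edges = []
record_execution(vertices, edges, "segment", "segment#run1", "alice",
                 used=["raw.tif"], generated=["mask.tif"])
```

Because the retrospective nodes are added to the same vertex and edge collections, a single graph query can follow an output file back through its execution to the prospective process that produced it.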

References

1. Cremer, C.: Optics far beyond the diffraction limit. In: Träger, F. (ed.) Handbook of Lasers and Optics, pp. 1359–1397. Springer, Heidelberg (2012)
2. Prabhune, A., et al.: An optimized generic client service API for managing large datasets within a data repository. In: IEEE BigDataService, pp. 44–51. IEEE (2015)
3. Cuevas-Vicenttín, V., et al.: ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance (2015). http://purl.org/provone
4. Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for computational tasks: a survey. Comput. Sci. Eng. 10(3), 11–21 (2008)
5. Moreau, L.: The specification, open provenance model core (v1.1). Future Gener. Comput. Syst. 27(6), 743–756 (2011)
6. Moreau, L., Missier, P., et al. (eds.): PROV-DM: The PROV Data Model. W3C Recommendation (2013). http://www.w3.org/TR/prov-dm/
