Towards Automated Workflow Deployment in the Cloud using TOSCA
Rawaa Qasha, Jacek Cała, Paul Watson
School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
Email: {r.qasha, jacek.cala, paul.watson}@newcastle.ac.uk
Abstract—Scientific workflows play an increasingly important role in building scientific applications, while cloud computing provides on-demand access to large compute resources. Combining the two offers the potential to increase dramatically the ability to quickly extract new results from the vast amounts of scientific data now being collected. However, with the proliferation of cloud computing platforms and workflow management systems, it becomes more and more challenging to define workflows so they can reliably run in the cloud and be reused easily. This paper shows how TOSCA, a new standard for cloud service management, can be used to systematically specify the components and life cycle management of scientific workflows by mapping the basic elements of a real workflow onto entities specified by TOSCA. Ultimately, this will enable workflow definitions that are portable across clouds, resulting in the greater reusability and reproducibility of workflows.
I. INTRODUCTION

Scientific workflows have become an increasingly popular paradigm for enabling and accelerating scientific data analysis. As a result, they have now generated many significant discoveries [1]. One key reason for their adoption is that they offer the opportunity to share, exchange and reuse services and experimental methods [2], [3].

Another recent trend has been the rise of cloud computing, which has given scientists unprecedented access to computing resources. Cloud computing offers on-demand access with the ability to scale the infrastructure up and down depending on actual needs. This fits well with many scientific research needs; for example, vast resources can be acquired to analyse new data when it becomes available. In combination, workflows and cloud computing therefore have the potential to increase dramatically the ability to quickly extract new results from the vast amounts of scientific data now being collected [4].

However, realizing this potential raises some major challenges that we address in this paper. Scientific workflows are typically composed of many diverse executables, each with specific dependencies on the software platform and libraries. They require multiple components to be deployed and configured before and during runtime. For a scientific method to be effectively reused over time, and for experiments to be reproduced, the repeatability of these deployment and configuration steps is crucial. Otherwise, the value of building workflows is quickly lost [5]. Unfortunately, it is impractical to expect most scientists to perform these complex deployment steps manually.
To improve the reusability and reproducibility of workflow applications and to automate their deployment, we propose to use an emerging OASIS standard: Topology and Orchestration Specification for Cloud Applications (TOSCA). It aims to enable the automated deployment and management of cloud applications. TOSCA is generic enough to cover a variety of scenarios and is also portable between different cloud management environments [6].

In this paper we present our work on using TOSCA as a language to describe workflows, workflow components and templates. We want to offer them as reusable entities that include not only the scientific experiment but also all details needed to deploy and execute it. To demonstrate our approach in practice, we model an existing workflow using TOSCA. The example involves a typical scientific workflow, i.e. a set of tasks with data dependencies expressed as a directed acyclic graph. We use TOSCA to represent the workflow components and the workflow itself, and also to capture the configuration of the whole application. Overall, we show how to utilise the standard to generate a TOSCA-compliant topology for scientific workflows. To the best of our knowledge, this is the first attempt that explicitly addresses workflow deployment using TOSCA.

II. BACKGROUND

TOSCA is a specification for modeling a complete application stack and automating its deployment and management in the cloud [7]. Its intent is to improve the portability of cloud applications in the face of growing diversity in cloud environments. The specification defines a meta-model for describing both the structure and management of IT services. The structure of a service is represented by the Topology Template, which consists of Node and Relationship Templates. Together they define a service as a directed graph of deployable components. Each service component is represented by a Node Template, which is an instance of a Node Type.
Whereas the Node Type defines the properties and operations of a component, the Node Template provides exact values for the properties and implementations of the operations. Node Types and Templates are defined separately to support reusability; the same Type can be instantiated multiple times in the same topology and can also be referenced by others. Similarly, a Relationship Template is an instance of a Relationship Type. Together they describe the logical relationships and other dependencies between the application's node templates [7].

The deployment process, i.e. creation, configuration, activation and termination of a service, is defined by Plans. Plans encode a sequence of operations required to instantiate TOSCA services and thus follow an "imperative" approach. The use of plans is not mandatory, however. Often, a TOSCA runtime environment is able to infer a correct deployment plan and management procedure solely by interpreting the service topology. This is known as the "declarative" approach [6].

The main advantage of the declarative approach is that it hides low-level deployment activities from the user. Scientists can focus on defining the high-level architecture of their experiment, which the TOSCA runtime can translate into a detailed deployment procedure. In this work we therefore adopt the declarative approach and use the Topology Template to define workflows.

As mentioned earlier, TOSCA is still an emerging standard. At the time of writing this paper, the YAML-based version of the specification has not yet been released, so we use a vendor-specific flavour of TOSCA YAML provided by Cloudify (http://getcloudify.org). Cloudify is a free and open-source orchestrator platform that intends to use TOSCA to automate the deployment and scaling of applications over any cloud technology. Currently, it is actively developed and has a vibrant community, which makes it a promising research platform.

III. MODELING SCIENTIFIC WORKFLOWS USING TOSCA

In this section we show how TOSCA can be used to specify a scientific workflow, including a discussion of the stages followed to create a complete service template. In principle, to define the structure of any application with TOSCA one needs to model a set of Node and Relationship Types and the corresponding Node and Relationship Templates, and include them in the topology of the Service Template.

A. Defining Node Types

The first step in modeling a workflow using TOSCA is to identify all its constituent parts. These include workflow tasks and all their software dependencies, such as the specific packages and libraries required by the tasks to run. Workflow components and their dependencies may be described as Node Types. Node types are usually derived from the basic types provided by TOSCA and the Cloudify DSL, like ApplicationModule, and then customised with specific property and interface definitions. Central to this is the Cloudify-specific life cycle interface with operations to create, start and terminate a service. When defining workflow components, each of them needs an implementation of the life cycle interface.

B. Defining Relationship Types

To capture dependencies between nodes, TOSCA offers a number of generic Relationship Types such as depends_on and connected_to. These types define an interface with operations to configure the source and target nodes joined by the relationship. Among the basic relationship types one of the most common is contained_in. It is used to create a vertical software stack, such as a virtual machine that hosts an operating system which in turn hosts one or more workflow services.

When connecting new, non-standard node types, a new type of relationship may be required. The relationship definition specifies the semantics of a link between nodes and also the methods which realize such a link. For example, the connected_to relationship needs an implementation of methods which can bind two end nodes, as in a client-server connection.

C. Workflow Service Template

The TOSCA metamodel uses the concept of a Service Template to describe a cloud application. We use it to model the high-level structure of scientific workflows. A Service Template is a graph of Node Templates, which represent specific instances of application components, and Relationship Templates, which model links between these instances. This structure matches the notion of a scientific workflow very well.

In TOSCA, Node and Relationship Templates are instances of Node and Relationship Types. Whereas types define properties and declare interfaces, templates provide values for the properties and implement interface operations. Again, this corresponds very well to the workflow domain, where a workflow block (single type) can be included in a workflow definition multiple times (multiple templates). Importantly, the entire Service Template may be treated as another Node Type, which greatly improves reusability.

IV. USE CASE: TOSCA-BASED MAPPING OF AN E-SCIENCE CENTRAL SCIENTIFIC WORKFLOW

To demonstrate the feasibility of using TOSCA to model scientific workflow applications we selected an existing workflow that performs phylogenetic analysis of the Leishmania parasite. The workflow is used in the EUBrazil Cloud Connect project (http://www.eubrazilcloudconnect.eu) to identify Leishmania species using the neighbour joining method.
Originally, it was designed in the e-Science Central system (e-SC); we now present its specification as a TOSCA service template. The result is a self-contained and portable service model that can be used to deploy and manage workflow instances in the cloud.

A. e-Science Central workflows

e-SC is a cloud-based workflow management system that provides capabilities to store, analyse and share data among scientists [8]. It includes a cloud-based workflow enactment engine to which users can submit their workflows via a web browser or desktop application. The system implements a simple dataflow model in which workflows are built from blocks connected into a directed acyclic graph. An e-SC workflow comprises a comprehensive list of components, services, and assemblies required to achieve the specific functionality implemented by workflow blocks. Blocks may have multiple input and output ports, multiple input properties and an output status. Links between blocks denote data dependencies, while data between blocks is passed as files in the local file system.
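The dataflow model described above can be illustrated with a minimal Python sketch. It is not part of e-SC: the block names are borrowed from the paper's figures, but the particular links between them are assumed purely for illustration, and `execution_order` is a hypothetical helper. It uses the standard-library `graphlib` module (Python 3.9+) to derive one valid run order from the data dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical wiring: the block names appear in the paper's figures,
# but these particular links are assumed for illustration only.
links = [
    ("ImportFile", "FilterDupl"),
    ("FilterDupl", "ClustralW"),
    ("ClustralW", "MegaNj"),
    ("MegaNj", "CSVExport"),
    ("ClustralW", "ExportFile"),
]


def execution_order(links):
    """Return one valid run order for blocks joined by (producer, consumer) links."""
    sorter = TopologicalSorter()
    for producer, consumer in links:
        # The consumer can only run once the producer's output file exists.
        sorter.add(consumer, producer)
    # static_order() raises graphlib.CycleError if the links do not form a DAG.
    return list(sorter.static_order())


order = execution_order(links)
print(order)
```

Any order returned places every producer before its consumers, which is the property a workflow engine relies on when data is handed over as files.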
As blocks can be of different types (Java, R, Octave), each has some specific dependencies, e.g. an R environment. They may also need additional libraries to run. Figure 1 depicts the selected workflow as designed in e-SC. It consists of 11 blocks, of which 9 are Java-based and 2 others (ClustralW and MEGA-NJ) wrap executable tools to perform sequence alignment and the neighbour-joining analysis.

Fig. 1: An e-SC workflow modeling phylogenetic analysis.

B. TOSCA-based Mapping of an e-SC Workflow

Following the concepts of the TOSCA specification, the topology description and component definitions for an e-SC workflow can be modeled as follows.

1) Workflow components as Node Types: There are two types of components in e-SC workflows: blocks and the shared libraries needed by the blocks. To describe them we defined two node types. From these two types we derive node types corresponding to all e-SC workflow blocks and libraries in the example; this two-level hierarchy is depicted in Figure 2.

Fig. 2: Node types hierarchy for an e-SC workflow. [Figure levels: Base Node Types (Root, Compute, Container, ApplicationModule, Volume); Specific Node Types (Workflow_Service, Spec_library); Custom Node Types (ImportFile, FileJoin, FilterDupl, ClustralW, MegaNj, ..., CSVExport, Core_lib, MegaCC, ..., esc_Comtools).]

• Specific Node Types: Nodes at this level represent the most fundamental part of any e-SC workflow. They are derived from the Cloudify basic node types and define: (1) a generic workflow block that offers properties common to all types of blocks, and (2) a generic library that forms the basic type for all shared libraries in a workflow. We derived them from the ApplicationModule node type defined in Cloudify, which is a base type for any software module or artifact to be deployed. Listing 1 presents the complete node type definition for a workflow block.

Listing 1: Node Type definition of a workflow block.

    workflow_block:
      derived_from: cloudify.nodes.ApplicationModule
      properties:
        block_description:
          description: Description of block function
          type: string
        block_name:
          type: string
        block_category:
          type: string
        service_type:
          type: string

• Custom Node Types: These node types represent the types of workflow blocks and shared libraries, and include information about their specific properties and about block inputs and outputs. Custom node types are instantiated by node templates to represent the actual blocks and libraries that compose a specific workflow. Listing 2 presents an example of the node type for a selected e-SC block.

Listing 2: Node Type of custom workflow block ClustralW.

    clustralw:
      properties:
        align:
          description: Do full multiple alignment
          type: boolean
          default: true
        output-type:
          description: Choose output format
          type: string
          default: CLUSTAL
        # block inputs
        input-sequences:
          type: string
          default: file-wrapper
        # block outputs
        aligned-sequences:
          type: string
          default: file-wrapper
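To make the relation between node types and node templates concrete, the following Python sketch mirrors trimmed versions of Listings 1 and 2 as dictionaries and resolves the property values of a template. It is a simplified illustration, not Cloudify's actual parser. Two things are assumptions: the `derived_from: workflow_block` entry for `clustralw` (Listing 2 does not state its parent, although the text says custom types derive from the specific types), and the helper names `type_properties` and `instantiate`. The template values are hypothetical.

```python
# Trimmed copies of Listings 1 and 2 as Python dicts.
NODE_TYPES = {
    "workflow_block": {
        "derived_from": "cloudify.nodes.ApplicationModule",
        "properties": {
            "block_name": {"type": "string"},
            "service_type": {"type": "string"},
        },
    },
    "clustralw": {
        # Assumed parent; Listing 2 itself does not state derived_from.
        "derived_from": "workflow_block",
        "properties": {
            "align": {"type": "boolean", "default": True},
            "output-type": {"type": "string", "default": "CLUSTAL"},
        },
    },
}


def type_properties(type_name, types=NODE_TYPES):
    """Collect property definitions along the derived_from chain, parent first."""
    node_type = types[type_name]
    parent = node_type.get("derived_from")
    # Stop at parents that are not defined locally (e.g. Cloudify base types).
    merged = dict(type_properties(parent, types)) if parent in types else {}
    merged.update(node_type.get("properties", {}))
    return merged


def instantiate(type_name, values, types=NODE_TYPES):
    """Resolve a template's properties: explicit values override type defaults."""
    resolved = {}
    for name, definition in type_properties(type_name, types).items():
        if name in values:
            resolved[name] = values[name]
        elif "default" in definition:
            resolved[name] = definition["default"]
        else:
            raise ValueError(f"missing required property: {name}")
    unknown = set(values) - set(resolved)
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    return resolved


# Two templates of the same type with different settings (hypothetical values).
first = instantiate("clustralw", {"block_name": "align-1", "service_type": "java"})
second = instantiate(
    "clustralw",
    {"block_name": "align-2", "service_type": "java", "output-type": "FASTA"},
)
print(first)
print(second)
```

The same type yields two differently configured templates, which is the reuse pattern the type/template split is meant to support.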
2) Block dependencies as Relationship Types: Most of the relationships used in an e-SC workflow are common to any cloud application. For example, the contained_in relationship may denote that a block is hosted on a VM. The exceptions are dependency links that connect block input and output ports (e.g. FileLink, DataLink). We derived them from the generic depends_on relationship type to implement the postconfigure operation that will pass data between connected blocks. Listing 3 shows one of the new relationship types.

Listing 3: Definition of an e-SC workflow Relationship Type.

    DataLink:
      derived_from: cloudify.relationships.depends_on
      source_interfaces:
        cloudify.interfaces.relationship_lifecycle:
          postconfigure:
            implementation: scripts/data_postconfigure.sh
            inputs:
              target_port:
                default: ''
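The paper does not show the contents of scripts/data_postconfigure.sh, so the following Python sketch only illustrates the kind of file hand-off such a postconfigure operation might perform. It assumes that each block keeps its port data as files in its own directory, and that in a depends_on-derived relationship the source is the consuming block and the target is the producing block. The function `pass_data`, the directory layout and the input port name used for MegaNj are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def pass_data(target_dir: Path, target_port: str,
              source_dir: Path, source_port: str) -> Path:
    """Copy the file behind the target block's output port to the source block's input port."""
    produced = target_dir / target_port
    if not produced.is_file():
        raise FileNotFoundError(f"no data on port {target_port!r}")
    source_dir.mkdir(parents=True, exist_ok=True)
    # copy2 returns the destination path it wrote to.
    return Path(shutil.copy2(produced, source_dir / source_port))


# Simulate two connected blocks inside a temporary directory.
with tempfile.TemporaryDirectory() as root:
    producer = Path(root) / "ClustralW"   # target of the relationship
    consumer = Path(root) / "MegaNj"      # source of the relationship
    producer.mkdir()
    (producer / "aligned-sequences").write_text(">seq1\nACGT\n")

    delivered = pass_data(producer, "aligned-sequences",
                          consumer, "input-sequences")
    received = delivered.read_text()

print(received)
```

Keeping the port name as an operation input, as `target_port` is in Listing 3, lets one relationship type serve every pair of connected blocks.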
3) Constructing e-SC Workflow Service Template: Following TOSCA service template rules, we created a service template (Figure 3) for the e-SC workflow shown earlier. It represents not only the high-level structure of the phylogenetic workflow but also the actual deployment topology. The topology contains information on the full application stack, from the VM properties (e.g. VM type as offered by the cloud provider) and OS type up to the actual Java class that implements block functionality. Although due to space constraints we are not able to present the complete topology, it includes all the details needed by a TOSCA runtime to map the workflow components onto the resources in the cloud. This allows workflows to be defined in a reproducible way.

Fig. 3: TOSCA service template of the phylogenetic analysis workflow. [Figure legend: FileWrapperLink, DataWrapperLink, Contained-in, Depends-on. Nodes shown: ImportFile, FilterDupl, FileJoin, ClustralW, MegaNj, CSVExport, ExportFile, Core_lib, esc.Comtools, ClustralW-lib, MegaCC, JavaRunTime, Virtual Machine.]

V. RELATED WORK

With the popularity of the cloud and workflow management systems (WfMS), a number of solutions have been developed to support management of scientific workflows in the cloud. One of the most prominent is Galaxy, which has recently been integrated with the Globus Provision toolkit [4]. The toolkit enables Galaxy to automatically distribute data and computation on Amazon EC2. Another well-established WfMS is Pegasus [9]. In Pegasus, users define workflows as abstract and resource-independent, and these are then mapped into concrete, platform-specific execution plans. The plans are enacted by DAGMan, which tracks dependencies and releases tasks as they become ready, whilst Condor schedd runs them on available resources. Although both Galaxy and Pegasus can run in the cloud, they still use a very specific workflow definition language. Instead, we propose a portable way to define workflows so that they might be deployed and executed by any TOSCA-compliant runtime environment.

The research and application of TOSCA are still in their early stages. Currently, work is focused on exploring the possibilities of applying TOSCA to manage various types of distributed application on the cloud. Wettinger et al. [10] present several concepts that integrate both model-driven cloud management and configuration management to automate deployment of web applications. In [11] the authors propose to use TOSCA to specify the components and configuration of Internet of Things (IoT) applications. Kostoska et al.
[12] presented an implementation of TOSCA to enable a custom University Management System to be deployed in a flexible and portable manner. However, to the best of our knowledge we are presenting the first attempt to map scientific workflows using TOSCA.

VI. CONCLUSIONS AND FUTURE WORK

The ability to package cloud applications in a way that enables their reusability and portability is an important precondition to truly realizing the benefits of cloud computing for scientific and other applications. It does, however, require the existence of a well-defined standard that allows us to capture complex deployment and configuration requirements. This paper has shown that the TOSCA specification can fulfil this need for scientific workflows. We have presented the first attempt to use TOSCA to formally describe the internal topology of a scientific workflow, together with its deployment processes. The potential benefits of this work include the portability, automatic deployment and scalability of workflows.

Following this work, we plan to develop scripts that implement the life cycle interface for the presented node and relationship templates. Our goal is to achieve the automatic deployment of workflows using Cloudify.

VII. ACKNOWLEDGEMENT

This work was partially supported by EU-funded project EUBrazil Cloud Connect, grant no. 614048.

REFERENCES

[1] Y. Zhao, Y. Li, I. Raicu, S. Lu, W. Tian, and H. Liu, "Enabling scalable scientific workflow management in the Cloud," Future Generation Computer Systems, no. 1, Nov. 2014.
[2] C. Goble, J. Bhagat, S. Aleksejevs, D. Cruickshank, D. Michaelides, D. Newman, M. Borkum, S. Bechhofer, M. Roos, P. Li, and D. de Roure, "myExperiment: A repository and social network for the sharing of bioinformatics workflows," Nucleic Acids Research, vol. 38, pp. 677–682, 2010.
[3] J. Goecks, A. Nekrutenko, and J. Taylor, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biology, vol. 11, p. R86, 2010.
[4] B. Liu, B. Sotomayor, R. Madduri, K. Chard, and I. Foster, "Deploying Bioinformatics Workflows on Clouds with Galaxy and Globus Provision," 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 1087–1095, Nov. 2012.
[5] J. Zhao, J. M. Gomez-Perez, K. Belhajjame, G. Klyne, E. Garcia-Cuesta, A. Garrido, K. Hettne, M. Roos, D. De Roure, and C. Goble, "Why workflows break - Understanding and combating decay in Taverna workflows," in 2012 IEEE 8th International Conference on E-Science. IEEE, Oct. 2012, pp. 1–9.
[6] T. Binz, U. Breitenbücher, O. Kopp, and F. Leymann, TOSCA: Portable Automated Deployment and Management of Cloud Applications, A. Bouguettaya, Q. Z. Sheng, and F. Daniel, Eds. New York, NY: Springer New York, 2013.
[7] OASIS, "Topology and Orchestration Specification for Cloud Applications version 1.0," 2013.
[8] H. Hiden, S. Woodman, P. Watson, and J. Cala, "Developing cloud applications using the e-Science Central platform," Philosophical Transactions of the Royal Society, vol. 371, pp. 17–32, 2013.
[9] E. Deelman, G. Singh, H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, and G. B. Berriman, "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Scientific Programming, vol. 13, pp. 219–237, 2005.
[10] J. Wettinger, M. Behrendt, T. Binz, U. Breitenbücher, G. Breiter, F. Leymann, S. Moser, I. Schwertle, and T. Spatzier, "Integrating Configuration Management with Model-Driven Cloud Management Based on TOSCA," in CLOSER. SciTePress, 2013, pp. 437–446.
[11] F. Li, M. Vogler, M. Claessens, and S. Dustdar, "Towards Automated IoT Application Deployment by a Cloud-Based Approach," in 2013 IEEE 6th International Conference on Service-Oriented Computing and Applications. IEEE, Dec. 2013, pp. 61–68.
[12] M. Kostoska, I. Chorbev, and M. Gusev, "Creating portable TOSCA archive for iKnow University Management System," in Federated Conference on Computer Science and Information Systems (FedCSIS), Sep. 2014, pp. 761–768.