INTERACTIVE E-SCIENCE CYBERINFRASTRUCTURE FOR WORKFLOW MANAGEMENT COUPLED WITH BIG DATA TECHNOLOGY

Denis Nasonov 1, Alexander Visheratin 1, Konstantin Knyazkov 1, Sergey Kovalchuk 1

1 ITMO University, Russian Federation
ABSTRACT

The paper presents a technology for building e-Science cyberinfrastructure that enables integration of a regular cloud computing environment with big data facilities and stream data processing. The developed technology is aimed at supporting uniform dynamic interaction with the user during composite application building and execution, as well as during result analysis. The core concept of the proposed approach is based on a set of domain-specific knowledge including the description of a) the semantics of problem domain objects; b) the software and data services used; c) data formats and access protocols. Linking all these knowledge parts together facilitates automatic solution of technological integration issues. It enables providing the user with high-level domain-specific tools for describing complex tasks, which can be automatically translated into particular calls of cloud computing services or BigData analytics tasks. The developed technology uses the interactive workflow (IWF) technique to interconnect services of different kinds: computation, data analytics, external data sources, and interactive visualization systems.

Keywords: Big Data, workflow, cloud computing, e-Science, cyberinfrastructure

INTRODUCTION

The contemporary e-Science toolbox is often built around the concept of workflow and workflow management systems [1], which are focused on providing high-level access to computational resources usually organized within a grid or cloud computing infrastructure [2]. Geoinformatics is one of the scientific areas that needs such systems, especially meteorology and hydrometeorology as well as GIS (Geographic Information Systems). On the other hand, contemporary tasks are often related to the processing of large data sets (see, e.g., the idea of the Fourth Paradigm in science by Microsoft [3]). Today a set of technologies for processing large data arrays is being intensively developed within the BigData area [4], which raises a set of new issues related to scientific tasks and to solving them using existing infrastructure. One of the ideas behind the BigData principles is the implementation of the code-to-data approach (moving pieces of the application to the resources where the data is stored) instead of the more common data-to-code approach (transferring data and parameters to the computational resource) [5]. This difference in paradigms leads to the need to develop a joint architecture which provides the capability to build solutions exploiting both paradigms (see, e.g., [6], [7]).
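To make the contrast between the two paradigms concrete, the following minimal Python sketch (illustrative only; the class and function names are assumptions, not part of any real platform) expresses the same aggregation task in both styles: data-to-code transfers the whole data set to the caller, while code-to-data ships a small function to the node that holds the data and returns only the aggregated result.

from typing import Callable, Dict, List

# Hypothetical illustration: a "data node" keeps its partition locally.
class DataNode:
    def __init__(self, partition: List[float]):
        self.partition = partition

    def fetch(self) -> List[float]:
        # data-to-code: the whole partition is transferred to the caller
        return list(self.partition)

    def submit(self, func: Callable[[List[float]], Dict[str, float]]) -> Dict[str, float]:
        # code-to-data: the function is executed where the data resides,
        # only the (small) result is sent back
        return func(self.partition)


def summarize(values: List[float]) -> Dict[str, float]:
    return {"count": float(len(values)), "sum": sum(values)}


nodes = [DataNode([1.0, 2.0, 3.0]), DataNode([4.0, 5.0])]

# Data-to-code: move the data to the computation
gathered = [v for node in nodes for v in node.fetch()]
print(summarize(gathered))                       # {'count': 5.0, 'sum': 15.0}

# Code-to-data: move the computation to the data and merge partial results
partials = [node.submit(summarize) for node in nodes]
print({"count": sum(p["count"] for p in partials),
       "sum": sum(p["sum"] for p in partials)})  # {'count': 5.0, 'sum': 15.0}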
Still, most of the coupled solutions keep the parts of composite applications separated (e.g., by providing the BigData infrastructure as one of the services available to be incorporated into the workflow). The goal of the presented work is to develop and implement the architecture of a platform which enables seamless integration of different resources (computational resources within the infrastructure, data storages and data services) within a single composite application defined at a high, domain-specific level according to the basic requirements of the e-Science area.

COUPLING BIG DATA WITH WORKFLOWS

Requirements analysis. After consideration of e-Science tasks (mainly within the simulation-based approach) and current challenges within this area [4], [8] we have defined the following issues to be managed by the developed platform.
1. The proposed solution should provide the capability to develop a high-level task description without an explicit relationship to a particular architecture or data processing style.
2. The developed platform should incorporate different classes of resources: computational resources, data storages, services, etc. All resource management procedures should be performed implicitly, in an automatic way.
3. The resources should have high-level unified access at the workflow level, with further automatic separation and translation into particular data-to-code or code-to-data requests.
4. The processing of tasks should be performed dynamically, with real-time exchange of data sets and parameters enabled. This should involve exploiting data streams to support processing of data immediately after its appearance.
5. The interaction with the user should be performed in a unified (domain-neutral) way with involvement of the user in the simulation process. This makes a further step towards system-level exploration [9] as a next-generation way of solving e-Science tasks.
Additionally, considering the specificity of BigData technology within e-Science tasks, the following requirements can be defined:
6. The platform should support integration of various data sources with different formats, access protocols, usage rules, etc. Semantically identical data should be available to the user (within the scope of a workflow) in a unified way (see the sketch after this list).
7. Implicit integration of data analytics tools for processing large arrays of data in an automatic way (i.e. without direct coding of MapReduce procedures or similar development activities).
8. The platform should provide the capability to integrate specific BigData visualization tools that can act interactively as a part of a workflow executed using the platform.
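As an illustration of requirement 6, the sketch below (hypothetical code; the adapter classes and the unified record layout are assumptions for this example) shows how data sources with different formats could be exposed at the workflow level through a single unified record structure.

import csv
import io
import json
from typing import Dict, Iterable, List

# Unified record used at the workflow level (hypothetical structure).
Record = Dict[str, float]

class DataSourceAdapter:
    """Base class hiding a source-specific format behind a unified interface."""
    def read(self) -> Iterable[Record]:
        raise NotImplementedError

class CsvSensorAdapter(DataSourceAdapter):
    def __init__(self, raw_csv: str):
        self.raw_csv = raw_csv

    def read(self) -> Iterable[Record]:
        for row in csv.DictReader(io.StringIO(self.raw_csv)):
            yield {"time": float(row["t"]), "water_level": float(row["level"])}

class JsonServiceAdapter(DataSourceAdapter):
    def __init__(self, raw_json: str):
        self.raw_json = raw_json

    def read(self) -> Iterable[Record]:
        for item in json.loads(self.raw_json):
            yield {"time": float(item["timestamp"]), "water_level": float(item["h"])}

def collect(sources: List[DataSourceAdapter]) -> List[Record]:
    """The workflow sees only unified records, not formats or protocols."""
    return [record for source in sources for record in source.read()]

sources = [
    CsvSensorAdapter("t,level\n0,1.2\n1,1.4\n"),
    JsonServiceAdapter('[{"timestamp": 2, "h": 1.5}]'),
]
print(collect(sources))
# [{'time': 0.0, 'water_level': 1.2}, {'time': 1.0, 'water_level': 1.4},
#  {'time': 2.0, 'water_level': 1.5}]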
Technological Background. To develop the platform according to the proposed requirements, the following technologies and concepts can be used as a basis.

The iPSE (Intelligent Problem Solving Environment) concept [10] was developed to provide a knowledge-based conceptual framework for solving e-Science tasks using merged knowledge from three basic domains: the task-specific problem domain, IT knowledge and the simulation domain.

The conceptual hierarchy of expressive technologies [11] organizes and integrates a set of domain-specific languages with textual or graphical notations for expressing and using knowledge from different problem domains to enable automatic processing of e-Science tasks.

The CLAVIRE cloud computing platform [12] enables high-level abstract workflow definition and execution. The platform uses a set of knowledge-based technologies to describe available software and hardware resources as well as the domain-specific objects to be investigated within e-Science tasks.

The IWF (Interactive Workflow) technology [13] was created to ensure real-time data exchange within the workflow during the execution process. It extends the basic concept of workflow by introducing ports that enable data exchange using data streams. Additionally, it supports building interactive simulation environments involving human-computer interaction as a part of the workflow.

The VSO (Virtual Simulation Objects) concept and technology [14] was developed to organize a high-level domain-specific simulation environment where the user can describe the investigated system by its structural semantic model (as a set of interconnected objects), which in turn can be automatically translated into an executable workflow structure.

The dynamic DSL for BigData analytics [15] is being developed to support high-level description of BigData analytics requests using a set of domain-specific libraries which extend the basic structure of the language. It can be used as an intermediate language to translate parts of a coupled workflow into BigData analytics requests.

Workflow classification. Generally, workflows for e-Science tasks can be classified into the following major categories (examples of workflows from several categories are shown in Fig. 1); a small dispatching sketch is given after the classification.
1. Targeted (local) workflows. Workflows from this category run locally on dedicated resources. If any software which forms the workflow steps is missing on the resource, it is automatically deployed from the package repository. Targeted WFs, or locally operated WFs, are tightly coupled with predefined specific conditions of the local computational environment which force them to be executed on the local resource. The following cases clearly show such conditions:
a. Data-driven WFs, which can be processed efficiently only on the nodes storing the data, in order to save time on data copying overhead;
b. WFs that have to take into consideration a certain level of confidentiality; it might be important to run a WF locally when unique information should be processed only on the owner's resource;
c. Offline WFs that can be executed on resources which are only periodically available to the system, such as laptops, smartphones and so on.
2. Computational (traditional) workflows. These workflows are executed on the most profitable resources according to the system's multi-criteria scheduling algorithm, based on the available computational facilities. These WFs cover the most widespread cases, among them: (1) high-performance computing tasks that require involvement of significant computational resources; (2) WFs with many fork-join structures that should be processed immediately; (3) WFs that contain a combination of unique software packages deployed on different resources; (4) collaborative WFs with a group of people engaged in solving one global task.
Figure 1. Different types of workflows in use

3. System workflows. According to their purpose, system workflows serve as functional elements that support consistent, continuous platform operation. This is significant for such activities as external data source monitoring and infrastructure monitoring. System WFs should provide online support of the platform infrastructure or of systems built on top of the platform. The crucial aspect for them is uninterrupted and robust execution over time: even if the platform fails, they should continue running on dedicated resources and restore availability after the system recovery. Common cases are: (a) processing WFs, which process incoming data from external resources on a periodical or permanent (streaming) basis; (b) monitoring WFs, which are used for infrastructure monitoring based on the analysis of environment parameters.
4. Hybrid workflows combine the capabilities of all three previous classes. Hybrid WFs are used in practice for implementing complex system solutions and may combine the benefits of all the mentioned approaches. A prototype of a flood prevention system is presented as an example in the FLOOD PREVENTION SYSTEM EXAMPLE section below.
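To illustrate how the categories above could drive resource selection, the following minimal Python sketch (hypothetical code; the enumeration and the routing rules are assumptions for this example and are not the platform's actual scheduler) routes a workflow to an appropriate class of resources.

from dataclasses import dataclass
from enum import Enum, auto

class WorkflowKind(Enum):
    TARGETED = auto()       # data-driven, confidential or offline WFs
    COMPUTATIONAL = auto()  # traditional HPC-style WFs
    SYSTEM = auto()         # monitoring / processing WFs
    HYBRID = auto()         # combination of the above

@dataclass
class Workflow:
    name: str
    kind: WorkflowKind

def dispatch(wf: Workflow) -> str:
    """Illustrative routing of a workflow to a resource class."""
    if wf.kind is WorkflowKind.TARGETED:
        return f"{wf.name}: run locally on the nodes that hold the data"
    if wf.kind is WorkflowKind.COMPUTATIONAL:
        return f"{wf.name}: schedule on the most profitable compute resources"
    if wf.kind is WorkflowKind.SYSTEM:
        return f"{wf.name}: keep running continuously on dedicated resources"
    # HYBRID: split into parts and dispatch each part separately (not shown)
    return f"{wf.name}: decompose into targeted/computational/system parts"

for wf in [Workflow("monitoring", WorkflowKind.SYSTEM),
           Workflow("flood_forecast", WorkflowKind.HYBRID)]:
    print(dispatch(wf))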
IMPLEMENTATION DETAILS

Solution's Architecture. Considering the proposed idea of coupling BigData technology with a computational workflow management infrastructure, the architecture of the platform (see Fig. 2) was developed. It meets all the proposed criteria and requirements. The architecture is based on components of the CLAVIRE platform and extends its functional capabilities with high-level processing technologies using a set of metadata storages. The main idea of the proposed architecture is focused on several issues important to be managed.
Figure 2. Architecture of the platform (user interfaces: parameter management GUI, workflow management GUI, CLAVIRE/POI, CLAVIRE/Ginger, interactive visualization; metadata: domain semantics (CLAVIRE/VSO), data format description, software description (CLAVIRE/PackageBase), resource description (CLAVIRE/ResourceBase); interactive execution management of the composite application; data management with data collecting (CLAVIRE/Crawler) from external data sources; distributed data storage (CLAVIRE/DStorage) with data storing nodes; resource management (CLAVIRE/Executor) with execution services and available resources; workflows, code blocks, datasets and available software are distributed among existing CLAVIRE components and the developed components)
Firstly, the architecture is developed to automatically interconnect different classes of available resources: (a) regular cloud computing nodes (execution services), which usually execute available software using data provided by the user and transferred to those services (data-to-code); (b) distributed data storage nodes, which can be used either for general data storing or for distributed BigData processing using transferred code (code-to-data) and local software; (c) data sources, which can be processed either as streaming sources or as regular external data storages. The seamless integration of all these classes of resources within a single composite application is the main goal of the Data management subsystem within the architecture. The subsystem includes control of all the data streams within the platform as well as background management of the data stored in the storage (including crawling data from external data sources and management of replication within the distributed storage).

Secondly, to support unified work with a composite application regardless of the nature of its parts (data-to-code or code-to-data), the user interface of the CLAVIRE platform (including the workflow management system and the visualization toolbox) as well as the workflow language used (EasyFlow) should be extended with high-level blocks that enable automatic interpretation of the joint composite application. The composite application in that case should include (a) the workflow structure; (b) the corresponding data and parameters to be taken from the user; (c) implicit code (composed from the EasyFlow constructs) to be transferred to the distributed storage as BigData requests along with the required parameters. The decomposition and two-way interpretation of the composite application are performed by the Interactive execution management subsystem, as sketched below. This subsystem controls all the parts of the composite application and interconnects them with the help of the IWF technology.
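A minimal sketch of this decomposition is given below. It is illustrative only and does not reproduce the EasyFlow syntax or the actual CLAVIRE interfaces; the step structure and the "target" attribute are assumptions made for the example. A composite application is represented as a list of steps, and the execution manager splits them into computing requests (data-to-code) and BigData requests (code-to-data).

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Step:
    name: str
    target: str                       # "compute" -> data-to-code, "data" -> code-to-data
    params: Dict[str, str] = field(default_factory=dict)

@dataclass
class CompositeApplication:
    steps: List[Step]                 # (a) workflow structure
    user_params: Dict[str, str]       # (b) data and parameters taken from the user

def decompose(app: CompositeApplication) -> Tuple[List[Step], List[Step]]:
    """Split a composite application into computing requests and BigData requests."""
    computing = [s for s in app.steps if s.target == "compute"]
    bigdata = [s for s in app.steps if s.target == "data"]   # (c) code shipped to storage nodes
    return computing, bigdata

app = CompositeApplication(
    steps=[
        Step("swan_model", "compute", {"grid": "baltic"}),
        Step("pattern_search", "data", {"window": "72h"}),
    ],
    user_params={"forecast_date": "2015-06-01"},
)

computing, bigdata = decompose(app)
print([s.name for s in computing])  # ['swan_model'] -> sent to execution services
print([s.name for s in bigdata])    # ['pattern_search'] -> sent to data storing nodes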
Finally, all the processes are supported by the knowledge libraries which reside in the metadata storage. Among them, the following knowledge libraries are derived from the basic CLAVIRE platform: (a) the description of available computation resources (ResourceBase); (b) the description of software packages available for calls in batch or interactive mode (PackageBase); (c) high-level domain-specific objects which enable description of the investigated system (VSO technology). Additionally, a library which describes the processing of data of different formats is introduced within the developed platform to support automatic data management.

Solution development. To develop the platform described above on the architecture that is based on the mentioned technologies and the CLAVIRE core, several main functional blocks should be highlighted for the implementation phase.

Data unification module. This block includes services that provide all functionality connected with data processing, such as: (a) a template-based approach to organizing external data management in monitoring workflows, which includes data acquisition, interpretation and organization of storage; (b) a unified data management service to be used for metadata operations; (c) integration of the package base parameters' description with the created unified database types to provide seamless data usage throughout the whole platform.

EasyFlow extension. This block is crucial for workflow development and the use of the extended functionality. Besides the required "condition" and "procedure" operations, one of the most important features is the ability to produce BigData requests with embedded data WFs from a traditional computation WF. Another important feature is online workflow changing.

Data WF module. In order to provide local WF execution on distributed storage nodes, an assembly of supporting services is needed. A distributed software deployment service, a resource service extension and a CLAVIRE storage plug-in are some of them.

IWF. There is a need to provide steering capability and online data monitoring to involve users in the execution process when the solution can be improved by the user's interactive activity. Support of IWF reconfiguration is also necessary to balance and dynamically change the environment.

FLOOD PREVENTION SYSTEM EXAMPLE

Fig. 3a shows the flood prevention application, which represents a prototype of an Early Warning System (EWS) core. It has three different sides of data operation within workflows during the execution of the system's stages. The first block, "Monitoring WF", represents the activity which helps to detect upcoming hazards. It collects data from external sources such as sensors, web services or remote directories, then performs preprocessing such as filtering or recovery, and finally saves the data to the distributed storage. The data is saved according to BigData principles. In parallel, in order to check for the occurrence of hazards, the "Monitoring WF" launches the computation workflow. The first part of the "Computation WF" blocks is required for the discovery of potential hazards. Using the provided aggregated data, the "Computation WF" runs the SWAN and BSM models in order to get a water level forecast which is used to detect flood hazards. If a flood is not detected, the "Computation WF" ceases its execution. Otherwise two functional branches are executed. The first branch performs uncertainty analysis for the provided atmospheric forecast data and its impact on the water level prediction model. The main part of this branch is a fork-join structure which implements the Monte Carlo method.
The results of the estimated uncertainty are used in the "Plan maker" step.
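A minimal sketch of such a fork-join Monte Carlo estimation is given below. It is purely illustrative: the perturbation model, the toy water_level function and all numeric values are assumptions for this example and do not represent the SWAN/BSM models used in the real workflow.

import random
import statistics
from typing import List

def water_level(wind_speed: float) -> float:
    """Toy surrogate for the water level model (the real WF runs SWAN/BSM)."""
    return 0.05 * wind_speed ** 1.5

def monte_carlo_uncertainty(wind_forecast: float, sigma: float, n: int = 1000) -> dict:
    """Fork: run the model for perturbed forecasts; join: aggregate statistics."""
    rng = random.Random(42)
    samples: List[float] = [
        water_level(max(0.0, rng.gauss(wind_forecast, sigma))) for _ in range(n)
    ]
    return {
        "mean_level": statistics.mean(samples),
        "std_level": statistics.stdev(samples),
    }

# Example: 20 m/s wind forecast with an assumed 3 m/s uncertainty in the atmospheric input
print(monte_carlo_uncertainty(wind_forecast=20.0, sigma=3.0))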
Figure 3. Demo applications: (a, b) flood prevention application built within a hybrid workflow

The second branch forms a data WF to find out another type of uncertainty within the BigData nodes. Firstly, a search by atmospheric forecast pattern is performed through the retrospective data. The detected cases are used to calculate the water level forecast and to find further cases produced by other atmospheric forecasts. Finally, the cases are compared with measurements. The uncertainty results are transferred to the "Plan maker" step. An important feature is the possibility to submit highly computational tasks like SWAN and BSM back to the platform in order to get results faster and to prevent overloading of the data nodes. Apart from that, the user can be provided with a steering option to make decisions during searches with inexact matches. In the last part of the WF, the "Plan maker" step produces several plans that are ordered by a multi-criteria algorithm. Then a group of experts chooses the best one.
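The retrospective pattern search described above can be pictured with the following sketch (hypothetical code that would run on the data storing nodes in code-to-data style; the similarity measure and the archive layout are assumptions for this example).

from typing import Dict, List, Tuple

# Hypothetical retrospective archive: past atmospheric forecasts (pressure series)
# together with the water levels that were actually measured afterwards.
ARCHIVE: List[Dict[str, List[float]]] = [
    {"pressure": [1010.0, 1005.0, 998.0], "measured_level": [1.1, 1.6, 2.3]},
    {"pressure": [1020.0, 1018.0, 1017.0], "measured_level": [0.4, 0.5, 0.5]},
    {"pressure": [1012.0, 1004.0, 996.0], "measured_level": [1.3, 1.9, 2.6]},
]

def distance(a: List[float], b: List[float]) -> float:
    """Simple pattern similarity: mean absolute difference of the series."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def find_similar(pattern: List[float], k: int = 2) -> List[Tuple[float, List[float]]]:
    """Return the k most similar historical cases and their measured water levels."""
    scored = [(distance(pattern, case["pressure"]), case["measured_level"])
              for case in ARCHIVE]
    return sorted(scored, key=lambda item: item[0])[:k]

# Current atmospheric forecast pattern (assumed values)
for score, levels in find_similar([1011.0, 1003.0, 997.0]):
    print(f"distance={score:.2f}, historical water levels={levels}")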
Fig. 3b shows the CLAVIRE implementation. The first screen demonstrates an abstract composition of the described workflow. The second and third screens present the completed workflow and the selected safe plan.

CONCLUSION

The presented work is aimed at the development of a platform which combines the benefits of traditional workflow-based systems with BigData solutions while keeping the e-Science user (a domain specialist) away from the technical details of the two execution paradigms and the multiplicity of various technologies. This goal could be reached by a combination of several knowledge-based technologies which enable a high-level definition of the task and its automatic interpretation. The developed platform is capable of seamlessly integrating various resources (computational resources, data storages, services) within a single workflow, which creates a basis for system-level scientific exploration and for the implementation of a wide range of application classes.

ACKNOWLEDGEMENTS
This paper is financially supported by the Ministry of Education and Science of the Russian Federation, agreement #14.578.21.0077 (24.11.2014). This work was also financially supported by the Government of the Russian Federation, Grant 074-U01, and by the project "Big data management for computationally intensive applications" (project #14613).

REFERENCES
[1] Yu J., Buyya R. A taxonomy of workflow management systems for grid computing // Journal of Grid Computing. 2005. Vol. 3, No. 3-4. pp. 171-200.
[2] Foster I. et al. Cloud computing and grid computing 360-degree compared // Grid Computing Environments Workshop, 2008. GCE'08. IEEE, 2008. pp. 1-10.
[3] Tansley S. et al. (eds.). The fourth paradigm: data-intensive scientific discovery. Redmond, WA: Microsoft Research, 2009. Vol. 1.
[4] Assunção M. D. et al. Big Data computing and clouds: Trends and future directions // Journal of Parallel and Distributed Computing. 2014.
[5] Manjunatha A. et al. Getting Code Near the Data: A Study of Generating Customized Data Intensive Scientific Workflows with DSL. 2010.
[6] Baranowski M., Belloum A., Bubak M. MapReduce Operations with WS-VLAM WMS // Procedia Computer Science. 2013. Vol. 18. pp. 2599-2602.
[7] Gil Y. et al. Time-bound analytic tasks on large datasets through dynamic configuration of workflows // Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science. ACM, 2013. pp. 88-97.
[8] Gil Y. et al. Examining the challenges of scientific workflows // IEEE Computer. 2007. Vol. 40, No. 12. pp. 26-34.
[9] Foster I., Kesselman C. Scaling system-level science: Scientific exploration and IT implications // Computer. 2006. No. 11. pp. 31-39.
[10] Boukhanovsky A. V., Kovalchuk S. V., Maryin S. V. Intelligent software platform for complex system computer simulation: conception, architecture and implementation // Izvestiya VUZov. Priborostroenie. 2009. Vol. 10. pp. 5-24.
[11] Knyazkov K. V. et al. CLAVIRE: e-Science infrastructure for data-driven computing // Journal of Computational Science. 2012. Vol. 3, No. 6. pp. 504-510.
[12] Kovalchuk S. V. et al. Knowledge-based Expressive Technologies within Cloud Computing Environments // PAoIS. Springer Berlin Heidelberg, 2014. pp. 1-11.
[13] Knyazkov K. V. et al. Interactive workflow-based infrastructure for urgent computing // Procedia Computer Science. 2013. Vol. 18. pp. 2223-2232.
[14] Kovalchuk S. V. et al. Virtual Simulation Objects concept as a framework for system-level simulation // arXiv preprint arXiv:1211.7080. 2012.
[15] Kovalchuk S. V. et al. A Technology for BigData Analysis Task Description Using Domain-specific Languages // Procedia Computer Science. 2014. Vol. 29. pp. 488-498.