2009 IEEE International Symposium on Parallel and Distributed Processing with Applications
A Web Based Workflow System for Distributed Atmospheric Data Processing

Jie Cheng 1,2, Xiaoguang Lin 1, Yuanchun Zhou 1, Jianhui Li 1
1 Computer Network Information Center, Chinese Academy of Sciences
2 Graduate University of Chinese Academy of Sciences
[email protected], {lxg, yczhou, lijh}@sdb.cnic.cn
Abstract—Distributed atmospheric data are often stored and managed in diverse regional and global repositories. The scientific process requires composing these resources to solve a specific atmospheric science problem. A key challenge is how to harvest data and models so that experiments can be designed and executed in a seamless manner. Moreover, data-intensive scientific experiments involving large-scale scientific data processing are iterative, dynamic, and human-centered. In particular, atmospheric scientific experiments usually include a number of fixed steps, such as data discovery, access, preprocessing, transformation, and visualization. A web service-enabled workflow system is a flexible tool for accessing distributed scientific data and executing complex analyses on them. In this paper, we describe a four-layered architecture for such a workflow system, consisting of a web interaction layer, a workflow engine layer, a workflow components layer, and a resource layer. We also develop an intuitive and easy-to-use web based toolkit and apply it to atmospheric data processing.

Keywords: scientific workflow; web service; distributed atmospheric data processing.
I. INTRODUCTION

Many of the data systems and services used today in domain science are distributed systems. How to access large, distributed, and heterogeneous data is a major challenge for scientists. In earlier work, we integrated a number of scientific data collections into the SDG [1] data grid infrastructure as shared resources. The core service in the SDG middleware is the Data Access Service (DAS), which is designed to provide users with uniform access to geographically distributed, heterogeneous, and autonomous databases. DAS is based on the Service-Oriented Architecture (SOA) [2] and enables data cooperation among different regional repositories. The state of the art in the atmospheric research community is the Distributed Oceanographic Data System (DODS, also known as OPeNDAP [3]), which maps a file or an aggregation of files onto a URL. A major problem with DODS is that any metadata query or data request must be cast, with very limited semantics, onto the syntax of URLs, and is thus unable to exploit the generality available through the use of XML and SOAP messaging [4]. We proposed a web service model for atmospheric data access and built a data access service repository through which large distributed datasets can be accessed easily.

While access to massive scientific data is important, the demand for adaptable interfaces and tools for accessing scientific data and executing complex analyses on the retrieved data has arisen in many disciplines (e.g., astronomy, atmospheric sciences, and ecology). Such analyses can be modeled as scientific workflows in which data flows from one analytical step to another. The experiments are iterative data processing procedures that include a number of fixed steps at different granularities; the only difference among these steps is the computational model being integrated. How to reuse components that carry computational models and analysis information, and thereby raise research efficiency, is a major problem for domain researchers. Kepler [5] is a system for the design, discovery, execution, and deployment of scientific workflows from different scientific domains.

In this paper, we describe a four-layered architecture for a workflow system that supports automated scientific data processing workflows with minimal user interaction. We have developed a web service-enabled workflow system built on top of a data management architecture. It provides an integrated problem-solving environment in which researchers can build workflows using local or Grid data and computational models, all wrapped as Web Services. The workflow system leverages the Kepler engine, a popular open source workflow engine. A further challenging issue it addresses is providing user-friendly, web-based access for viewing, executing, and sharing scientific workflows.

The rest of the paper is organized as follows. In Section 2, we give a brief overview of atmospheric data processing and introduce the Kepler scientific workflow system. Section 3 describes the four-layered architecture of the workflow system for processing scientific data and executing computational models in a distributed environment. Section 4 describes, as a real-world example, how to compose a workflow for atmospheric data processing using our system. Finally, in Section 5, we give a discussion and directions for future work.
II. ATMOSPHERIC DATA PROCESSING AND THE KEPLER SYSTEM

A. Atmospheric data processing

For a better understanding of the data processing involved, we first give a brief introduction to atmospheric data formats and some useful tools in the atmospheric domain. In atmospheric science, besides basic ASCII and raw binary data, the common formats are listed in Table 1 [6] [7] [8].

TABLE 1. ATMOSPHERIC DATA FORMATS
NetCDF: Network Common Data Form, a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

GRIB: GRIdded Binary, the standard format for the storage and interchange of meteorological data, maintained by the World Meteorological Organization (WMO).

HDF: Hierarchical Data Format, a library and multi-object file format for the transfer of graphical and numerical data between machines.
The common tools for atmospheric data processing include NCL, CDO, and MATLAB, among others. In our experiments, the data format is NetCDF, and we use NCL to process the atmospheric scientific data and visualize the results. Figure 1 shows the atmospheric data processing diagram. Atmospheric data processing includes four steps: querying the data, inputting the data, selecting a computational model, and publishing the result.
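The paper does not prescribe a particular client library for NetCDF; purely as an illustration, the following sketch reads a variable using the Unidata netCDF-Java library. The file name "air.2005.nc" and the variable name "air" are hypothetical, not taken from this paper.

    import ucar.ma2.Array;
    import ucar.nc2.NetcdfFile;
    import ucar.nc2.Variable;

    // Illustrative only: open a NetCDF file and read one variable.
    public class NetcdfReadExample {
        public static void main(String[] args) throws Exception {
            NetcdfFile nc = NetcdfFile.open("air.2005.nc");
            try {
                Variable air = nc.findVariable("air"); // e.g. 4xDaily air temperature
                if (air != null) {
                    Array data = air.read();           // load the whole array into memory
                    System.out.println("rank=" + air.getRank()
                            + ", size=" + data.getSize());
                }
            } finally {
                nc.close();
            }
        }
    }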
Figure 1. Atmospheric data processing diagram

B. Kepler workflow system

The scientific workflow, a flexible tool for accessing scientific data and executing complex analyses on the retrieved data, is becoming increasingly important as a unifying mechanism for combining scientific data management, analysis, simulation, and visualization tasks. Scientific workflow systems are problem-solving environments that support domain scientists and researchers in the creation and execution of scientific workflows [9]. There are several mature academic scientific workflow systems in the international community, such as Kepler, Taverna, and Triana [10]. With respect to their modeling paradigms and workflow execution models, these systems are close to visual dataflow programming languages for scientific data and services. The Kepler system in particular is the most popular, because it provides a convenient programming method, built around a special component called an actor, for extending a domain-specific scientific workflow system [11].

Kepler is a cross-discipline project that aims to simplify access to scientific data and the analysis of the retrieved data. The Kepler environment is built upon the Ptolemy II [12] platform, which was developed for modeling heterogeneous, concurrent systems and engineering applications; the Kepler project has extended Ptolemy II towards scientific workflows by adding support for web service invocation and access to Grid resources. The components that deal directly with the business logic in Kepler are objects called actors, and the communication between actors, as well as the execution of a workflow, is controlled by an object called a director. In Kepler, users develop workflows by selecting appropriate actors and joining them together to form the desired workflow. Actors have input ports and output ports that provide the communication interface to other actors. Each customized workflow model comprises a director and at least one actor. During workflow execution, the director controls the data flowing between actors and schedules the iteration of each actor. Taken together, workflows, actors, ports, connections, and directors represent the basic building blocks of actor-oriented modeling. For a given scientific domain, the only thing programmers have to do is extend these interfaces to encapsulate appropriate domain-specific components (actors) [5].
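As an illustration of this actor-oriented model, here is a minimal sketch of a custom actor built on the Ptolemy II TypedAtomicActor base class that Kepler actors extend. The actor itself (upper-casing a string token) is a toy example, not one of the components described in this paper.

    import ptolemy.actor.TypedAtomicActor;
    import ptolemy.actor.TypedIOPort;
    import ptolemy.data.StringToken;
    import ptolemy.data.type.BaseType;
    import ptolemy.kernel.CompositeEntity;
    import ptolemy.kernel.util.IllegalActionException;
    import ptolemy.kernel.util.NameDuplicationException;

    // Toy actor: reads a string token, upper-cases it, emits the result.
    public class UpperCaseActor extends TypedAtomicActor {
        public TypedIOPort input;
        public TypedIOPort output;

        public UpperCaseActor(CompositeEntity container, String name)
                throws IllegalActionException, NameDuplicationException {
            super(container, name);
            input = new TypedIOPort(this, "input", true, false);   // input port
            output = new TypedIOPort(this, "output", false, true); // output port
            input.setTypeEquals(BaseType.STRING);
            output.setTypeEquals(BaseType.STRING);
        }

        @Override
        public void fire() throws IllegalActionException {
            super.fire();
            if (input.hasToken(0)) {
                String s = ((StringToken) input.get(0)).stringValue();
                output.send(0, new StringToken(s.toUpperCase()));
            }
        }
    }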
III. WEB SERVICE-ENABLED WORKFLOW ENVIRONMENT

Figure 2. Framework of the workflow system

The framework of our workflow system is shown in Figure 2. It consists of four layers. The uppermost "User Interface" layer not only provides an easy-to-use web interface to design, execute, manage, and view workflow instances, but also gives users a one-stop web environment for source data, result data, user information management, and other services. The "Workflow Engine" layer provides a runtime environment for workflow execution based on the Kepler engine. Kepler receives a customized workflow instance (an XML file in the MoML schema), resolves the definition of the flow, instantiates the flow, controls the instance execution, and schedules tasks. The "Workflow Components" layer consists of web service-enabled modules: data, model, and computation components that are all wrapped as web services. The data services provide uniform access to distributed data and result visualization; the model services encapsulate useful algorithm models; the computation services connect to the computation resources. The "Resource" layer contains large-scale datasets managed by the SDG Grid, the algorithm models of each atmospheric experiment, and computation resources (e.g., CNGrid [13]).

A. Resources layer

There are many physical resources in the atmospheric scientific data analysis and computation environment: large-scale local scientific data and data grid resources, algorithm models for the steps of each experiment, software tools for data processing, and computational resources. Most atmospheric scientific datasets (such as IPCC and NCAR data) are large-scale; in particular, each NCAR data file is at least a few hundred megabytes. Researchers most often have to fetch data temporarily from different repositories, which wastes considerable time and resources and keeps research efficiency low. To address this, we have opened a high-performance computational environment and a massive data storage platform to the public, so researchers can upload their own scientific data to our storage environment and share it with others.

The algorithm model is one of the most important elements of our atmospheric scientific data analysis and computation environment. In atmospheric science there are a great many algorithm models, ranging from simple addition, subtraction, multiplication, and division operations to complicated statistical functions such as EOF, Anomaly, and SVD. They can be stored in two forms: a persistent format in a database (stored in MySQL) and the format of actors in the Kepler container. The workflow components layer of our system can convert between the two formats when required, and through the web interface users can add new algorithm models dynamically.

We use NCL to process the atmospheric scientific data and to visualize the result data. A full NCL script includes three parts: file input and output, data analysis, and visualization. NCL can be invoked easily with a single command line on a Linux system. Other important data, such as the categories and details of actors and user information, are stored in the database.

CNGrid [13] is a test bed for the new generation of information infrastructure, integrating high-performance computing and transaction processing capacity. By sharing resources and providing collaboration and service mechanisms, it supports a variety of applications in scientific research, resource and environment research, advanced manufacturing, and information services, and it propels national informatization and related industries through technological innovation. CNGrid is equipped with independently developed, grid-oriented high-performance computers (Lenovo DeepComp 6800, Dawning 4000A).
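As a sketch of how such a single-command NCL invocation might be wrapped programmatically (the paper does not show its own wrapper code), the following runs an NCL script as an external process. The script name "seasonal_mean.ncl" is hypothetical, and the ncl binary is assumed to be on the PATH.

    import java.io.IOException;

    // Sketch: drive NCL as an external process from Java.
    public class NclRunner {
        public static int runScript(String scriptPath)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("ncl", scriptPath);
            pb.inheritIO();          // surface NCL's console output
            Process p = pb.start();
            return p.waitFor();      // 0 on success
        }

        public static void main(String[] args) throws Exception {
            int exit = runScript("seasonal_mean.ncl"); // hypothetical script
            System.out.println("NCL exited with " + exit);
        }
    }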
B. Workflow Components layer

To allow third-party applications to access our data and computation resources, we have developed a set of data and computation services as components, using web service technology. The services have been implemented with Apache Axis2 [14], and their interfaces are published as WSDL documents generated by the Java2WSDL command. We have also used the Globus Toolkit 4.0 as a web service development resource, which lets us leverage the grid computing capabilities provided by the CNGrid infrastructure [15].

The Model Registration service encapsulates algorithm models written as NCL scripts as services, allowing scientists and researchers to add new algorithm models by creating new actors dynamically. The component defines a standard for a new algorithm model with an NCL script, specifying the formats of input variables, output variables, parameters, and so on. After a user uploads a script through the web, the model registration service automatically wraps it in a Java source file satisfying the actor format. The service then invokes the Java compiler to compile the source file dynamically into a .class file, and detailed information about the new actor is stored in the database. The implementation of the web service interfaces not only allows the reuse of services and components, it also provides interoperability between applications running on different platforms.
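The paper states that the registration service compiles the generated actor source at run time by invoking the Java compiler from the command line. One possible realization, sketched here with the standard javax.tools API rather than the paper's exact mechanism, follows; "GeneratedActor.java" stands in for the file produced from an uploaded NCL model description.

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;

    // Sketch: compile a generated actor source file at run time.
    public class ActorCompiler {
        public static boolean compile(String sourcePath) {
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                return false;  // running on a JRE without a compiler
            }
            // run() returns 0 on success; the .class file appears
            // next to the source file.
            return compiler.run(null, null, null, sourcePath) == 0;
        }

        public static void main(String[] args) {
            System.out.println(
                compile("GeneratedActor.java") ? "compiled" : "failed");
        }
    }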
C. Workflow Engine layer

The Kepler system is used as the workflow engine in our environment. Instead of Kepler's own intuitive GUI, we use the web front-end module to design and execute workflow instances, so Kepler runs as a background program. Kepler receives customized workflow instances (XML files in the MoML schema), resolves the definition of each flow, instantiates the flow, controls the instance execution, schedules tasks, and performs other related operations. Once workflow execution starts, actors are "fired" and processed iteratively in Kepler. Table 2 shows the main states of the workflow engine. Each executing workflow instance has an execution listener that monitors the state of the engine throughout the execution phase.
TABLE 2. MAIN STATES OF THE KEPLER ENGINE

IDLE            - There is no active execution.
INITIALIZING    - The execution is in the initialize phase.
RESOLVING_TYPES - Type resolution is being performed.
ITERATING       - The execution is in the iteration process.
PAUSED          - The execution is paused.
EXITING         - The execution is in the wrap-up phase and about to exit.
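The paper does not show its engine-invocation code; the following is a sketch of how a headless MoML execution with a state-monitoring listener can be set up on the underlying Ptolemy II classes (MoMLParser, Manager, ExecutionListener). The workflow file name is hypothetical.

    import ptolemy.actor.CompositeActor;
    import ptolemy.actor.ExecutionListener;
    import ptolemy.actor.Manager;
    import ptolemy.moml.MoMLParser;

    // Sketch: parse a MoML workflow definition and execute it headlessly,
    // monitoring engine state through an ExecutionListener.
    public class HeadlessWorkflowRunner {
        public static void main(String[] args) throws Exception {
            MoMLParser parser = new MoMLParser();
            CompositeActor workflow =
                    (CompositeActor) parser.parseFile("workflow.xml"); // hypothetical

            Manager manager = new Manager(workflow.workspace(), "manager");
            workflow.setManager(manager);

            manager.addExecutionListener(new ExecutionListener() {
                public void executionError(Manager m, Throwable t) {
                    System.err.println("error: " + t);
                }
                public void executionFinished(Manager m) {
                    System.out.println("finished");
                }
                public void managerStateChanged(Manager m) {
                    System.out.println("state: " + m.getState()); // e.g. ITERATING
                }
            });

            manager.execute(); // blocking; manager.startRun() runs asynchronously
        }
    }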
D. User Interface layer

The Kepler system provides an intuitive GUI for designing and executing workflow instances. However, considering users' convenience and our substantial computational and storage resources, we developed a web based toolkit to enhance the user interaction experience, which is one of the highlights of our environment.

The web interaction layer is a traditional MVC (Model-View-Controller) application. It not only provides an easy-to-use web interface to design, execute, manage, and view workflow instances, but also gives users a one-stop web environment for source data, result data, user information management, and other services. The design and view module is built on VML (Vector Markup Language), a markup language supported by the IE browser, together with AJAX (Asynchronous JavaScript and XML) technology. Users can cut, copy, paste, drag, and perform other visual editing operations on the graphical canvas, whose elements represent actors from the workflow components layer. A customized workflow instance is stored in this module as an XML file in the MoML schema; the instance is then submitted to the workflow engine layer, where the Kepler engine is invoked to resolve and execute it. After Kepler finishes the execution, the result (a dataset or an image) is presented to the user. This layer also provides interfaces for adding and managing new algorithm models, user management, and other useful functionality.

IV. ATMOSPHERIC DATA PROCESSING EXAMPLE

A snapshot of a simple atmospheric scientific workflow instance is shown in Figure 4. During workflow composition, the user first selects the desired data and computational models; these components appear on the left panel of our system and can be dragged and dropped onto the workflow canvas. Figure 3 shows the interface through which users set the variable parameters of the distributed data. In this simple experiment, we first calculate the mean of the 4xDaily Air temperature for the years 2005 and 2006, then invoke the seasonal mean algorithm service to calculate the spring mean, and finally display the result image on the web. Figure 5 shows the final result image.
Figure 3. Variable parameters’ setting
Figure 4. A snapshot of the atmospheric data processing workflow
Figure 5. Spring mean result of 4xDaily Air temperature

V. DISCUSSION AND FUTURE WORK

In this paper, we describe a four-layered architecture for a workflow system and illustrate its application to distributed atmospheric data processing. By publishing atmospheric data and models as Web Services, we achieve a unified framework for accessing distributed scientific data and executing complex analyses on them. One issue with Kepler is that, since it runs on the researcher's desktop, the program has to keep running for the entire duration of a long-running workflow; our system instead supports asynchronous execution, so that the interface can be closed and the workflow checked again at a later stage. The contributions of our work include: (1) a generic framework for scientific workflows that supports distributed data processing; (2) a set of web services that enables distributed, heterogeneous data access and model invocation in the atmospheric domain; and (3) a web interface to design, execute, manage, and view workflow instances. Although this paper focuses on atmospheric data processing, the solutions are generic and apply to other domains as well.

As future work, we would like to develop a workflow monitoring service that exposes the status of execution, so that users can log into the web portal to view the progress of a run. We are also interested in extending this workflow framework to ChinaFLUX [16] data processing in the ecology domain.
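One simple way to realize the asynchronous execution described above is to hand each run to a background executor and let the portal poll a status map by id; all class, method, and status names in this sketch are hypothetical, not the system's actual interfaces.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch: run workflows asynchronously so the web interface can be
    // closed and the status checked again later.
    public class AsyncWorkflowService {
        private final ExecutorService pool = Executors.newFixedThreadPool(4);
        private final Map<String, String> status = new ConcurrentHashMap<>();

        /** Submit a MoML instance; returns an id the portal can poll. */
        public String submit(String momlPath) {
            String id = UUID.randomUUID().toString();
            status.put(id, "RUNNING");
            pool.submit(() -> {
                try {
                    executeWorkflow(momlPath);      // Kepler engine invocation
                    status.put(id, "FINISHED");
                } catch (Exception e) {
                    status.put(id, "ERROR: " + e.getMessage());
                }
            });
            return id;
        }

        /** Polled by the web portal when the user logs back in. */
        public String getStatus(String id) {
            return status.getOrDefault(id, "UNKNOWN");
        }

        private void executeWorkflow(String momlPath) throws Exception {
            // placeholder for the headless Kepler invocation sketched earlier
        }
    }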
ACKNOWLEDGEMENTS

We would like to thank Professor Gang Hang, Doctor Pengfei Wang, and Xia Qu from the Institute of Atmospheric Physics, Chinese Academy of Sciences, for their open ideas, discussion, cooperation, and contributions. This work was supported by the Knowledge Innovation Program of the Chinese Academy of Sciences (No. O815021108) and the Youth Foundation of the Computer Network Information Center, Chinese Academy of Sciences (No. O714041701).
REFERENCES

[1] Scientific Data Grid. http://en.sdg.ac.cn
[2] Hao He, "Service-Oriented Architecture Definition," http://www.xml.com/pub/a/ws/2003/09/30/soa.html
[3] Open-source Project for a Network Data Access Protocol (OPeNDAP). http://opendap.org/
[4] A. Woolf, K. Haines, and C. Liu, "A Web Service Model for Climate Data Access on the Grid," International Journal of High Performance Computing Applications, vol. 17, no. 3, pp. 281-295, 2003.
[5] Kepler Project. http://kepler-project.org/
[6] NetCDF Data Form. http://www.unidata.ucar.edu/software/netcdf/
[7] GRIB Data Form. http://www.wmo.int/pages/index_en.html
[8] HDF Data Form. http://www.hdfgroup.org
[9] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao, "Scientific Workflow Management and the Kepler System," Concurrency and Computation: Practice & Experience, vol. 18, no. 10, pp. 1039-1065, 2006.
[10] S. Majithia, M. S. Shields, I. J. Taylor, and I. Wang, "Triana: A Graphical Web Service Composition and Execution Toolkit," Proc. IEEE International Conference on Web Services (ICWS'04), July 6-9, 2004, San Diego, USA.
[11] A. Rygg, P. Roe, and O. Wong, "GPFlow: An Intuitive Environment for Web Based Scientific Workflow," Proc. Fifth International Conference on Grid and Cooperative Computing Workshops (GCCW'06), 2006, Changsha, China.
[12] E. A. Lee, "Overview of the Ptolemy Project," Technical Memorandum UCB/ERL M03/25, July 2, 2003, University of California, Berkeley, CA, USA.
[13] China National Grid. http://www.cngrid.org/cngrid-oldsite/en_index.htm
[14] Apache Axis2 Toolkit. http://ws.apache.org/axis2/
[15] I. Foster, "Globus Toolkit Version 4: Software for Service-Oriented Systems," IFIP International Conference on Network and Parallel Computing, Springer-Verlag LNCS 3779, pp. 2-13, 2005.
[16] G. R. Yu, X. F. Wen, X. M. Sun, B. D. Tanner, X. Lee, and J. Y. Chen, "Overview of ChinaFLUX and evaluation of its eddy covariance measurement," Agricultural and Forest Meteorology, 2006.