EScience06
Integrating Data Grid and Web Services for E-Science Applications: A Case Study of Exploring Species Distributions

Jianting Zhang 1, Ilkay Altintas 2, Jing Tao 3, Xianhua Liu 4, Deana D. Pennington 1, William K. Michener 1

1 LTER Network Office, University of New Mexico, Albuquerque, NM 87131, USA
2 San Diego Supercomputer Center (SDSC), UC San Diego, La Jolla, CA 92093, USA
3 NCEAS, UC Santa Barbara, Santa Barbara, CA 93106, USA
4 Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA
Contact Email: [email protected], Phone: 1-505-277-0666

Abstract: Data Grid and Web Services are among the advanced computing technologies available to support scientists and scientific applications. We use the Kepler scientific workflow system to integrate these two popular technologies for e-science applications. A prototype system for exploring species distribution patterns has been developed for demonstration purposes, using data grid resources and a similarity-based clustering Web service in conjunction with Geographical Information System (GIS) based spatial visualization resources in Kepler.
1. Introduction

"E-science" refers to the use of advanced computing technologies to support scientists (De Roure and Hendler, 2004). It represents the increasing global collaboration of people and of shared resources (such as large-scale computing systems and data archives) that are needed to solve new problems of science and engineering (Hey and Trefethen, 2003). The concepts of Service-Oriented Architecture (SOA) and Grid Computing have gained considerable popularity in e-science communities in recent years (Foster et al, 2002). The term "data grid" can be defined as a network of distributed storage resources that are linked using a logical name space to create global and persistent identifiers (Rajasekar et al, 2003). Web Services is a de facto industry standard for loosely coupled distributed computing. Both Data Grid and Web Services technologies are widely used to support e-science applications in distributed and heterogeneous computation environments. In this paper we
report our design and implementation of a prototype system for exploring and visualizing species distribution data using Data Grid and Web Services technologies. Interoperability is the key to all aspects that characterize e-science (De Roure and Hendler, 2004), and metadata plays a crucial role in ecological data management (Michener, 2006). We use the Ecological Metadata Language (EML, [HREF 1]) to describe species distribution datasets so that they can be discovered and used in a standardized manner. After the metadata documents are parsed and validated, the actual datasets can be retrieved from data grids using the logical identifiers given in the metadata documents. Analytical functions are implemented as Web services and published using the Web Service Description Language (WSDL). Furthermore, we use the Kepler scientific workflow system (Ludäscher et al, 2006, [HREF 2]) to integrate the data grid resources and Web services into scientific application tasks. Kepler has built-in workflow components to consume such resources and services, and it allows developing custom analytical and visualization workflow components. By connecting workflow components we can compose workflows for e-science applications. The rest of the paper is arranged as follows. Section 2 introduces the Kepler scientific workflow system, the EML data source actor that retrieves and caches data from data grids, and the Web Services actor that interacts with Web service endpoints. Section 3 describes the actor for generating the taxonomic similarity matrix and the actor for visualizing species spatial
This work is supported in part by DARPA grant # N00014-03-1-0900 and NSF grant ITR #0225665 SEEK
Proceedings of the Second IEEE International Conference on e-Science and Grid Computing (e-Science'06) 0-7695-2734-5/06 $20.00 © 2006
distribution patterns based on the clustering results. Section 4 presents the technical details of publishing the clustering algorithm as a Web service. Section 5 provides a running example using the North and South Carolina plant distribution data. Finally, Section 6 summarizes and concludes the paper.
2. Kepler and its Data Grid and Web Services Actors

The Kepler scientific workflow system (Ludäscher et al, 2006, [HREF 2]) builds upon the mature, dataflow-oriented Ptolemy II system ([HREF 3]). Ptolemy controls the execution of a workflow via so-called directors that represent models of computation. Individual workflow steps are implemented as reusable actors. An actor can have multiple input and output ports, through which streams of data tokens flow. Additionally, actors may have parameters to define specific behavior. An illustration is shown in Fig. 1. Note that a Parameter Port is an extension of a regular IO Port: its value can either be preset using an associated parameter or updated dynamically through a connecting port, as with a regular IO Port. Kepler inherits these advanced features from Ptolemy and adds several new features for scientific workflows.

Fig. 1 Illustration of Basic Components in Kepler Scientific Workflow System (Director, Producer/Consumer Actors, IO Ports, Parameter Ports, Relations and Links)

Kepler provides a library of workflow components (directors, actors, parameters, etc.). When users drag and drop them onto the workflow composition canvas, the data associated with the workflow components are added to the workflow being constructed. Actors and the ports associated with the actors are
rendered graphically. Users can then connect two ports, or a port and a relation (creating a link, in Ptolemy terminology), by dragging and dropping as well. The workflow composition canvas supports the typical kinds of zooming (in/out/fit) and automatic layout (see Fig. 8). Currently there are hundreds of actors representing data sources, sinks, transformers, analytical functions, and arbitrary computational modules. The EML data source actor and the Web Services actor are the two used in this study.
2.1 EML Data Source Actor

The EML data source actor (Fig. 2) allows a user to specify the location of an EML document and parses the document into an internal representation. The location of an EML document can be a local file, a URL, or a query result record from a metadata server (such as the Metacat server at NCEAS) (Jones et al, 2001). The entity names inside the EML document are presented to users for selection. The metadata associated with the selected entity is then used to retrieve data from data grids, such as the SDSC SRB (Storage Resource Broker; Rajasekar et al, 2003) based EcoGrid (Michener et al, 2005). Currently the EML Data Source actor supports several types of entities, namely dataTable, spatialRaster, spatialVector, storedProcedure and view. Among them, dataTable, storedProcedure and view are mostly for tabular data, while spatialRaster and spatialVector are designed for Geographical Information System (GIS) data. As an example, the Hydro1K ([HREF 4]) Digital Elevation Model (DEM) data for North America is replicated at an SDSC SRB based data grid with the unique logical identifier seek:/home/beam.seek/Hydro1k/NorthAmerica/na_dem.tar. It is documented as a spatialRaster in its EML document, which is stored at the NCEAS Metacat server. Using metadata servers and data grids together provides an interoperable way to discover and make use of scientific data. To further illustrate the structure of EML, part of the EML document for the Hydro1k North America DEM data is shown in Fig. 3. Kepler has an advanced caching system that caches both metadata and data downloaded from remote data grids to improve performance. This is especially useful for relatively static and large datasets. Furthermore, the Kepler cache system will uncompress downloaded files into its cache storage if they are compressed, and return their local paths. The
cache system not only reduces network traffic and improves system response time; it is also crucial for actors that take local file names as input parameters. The Kepler cache system essentially eliminates the need for individual actors to implement data downloads.
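The caching behavior described above can be sketched as follows. This is a minimal illustrative sketch, not Kepler's actual cache API; the class and method names are assumptions made for exposition.

```python
import os
import tarfile


class DataCache:
    """Sketch of a cache keyed by logical identifier (hypothetical names)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self._index = {}  # logical identifier -> local path

    def get(self, logical_id, fetch):
        """Return a local path for logical_id, calling fetch() only on a cache miss."""
        if logical_id in self._index:
            return self._index[logical_id]  # cache hit: no network traffic
        local = os.path.join(
            self.cache_dir, logical_id.replace("/", "_").replace(":", "_")
        )
        fetch(local)  # e.g. an SRB or HTTP download into the cache storage
        if tarfile.is_tarfile(local):
            # uncompress archives into cache storage and return the extracted path
            extracted = local + ".d"
            with tarfile.open(local) as tf:
                tf.extractall(extracted)
            local = extracted
        self._index[logical_id] = local
        return local
```

A sink actor that needs a local file name would simply call get() with the logical identifier from the EML metadata; repeated workflow runs then reuse the cached copy.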
Fig. 2 Interface for Customizing an EML Data Source Actor

[EML fragment omitted in this copy: it names the entity hydro1k_north_america_dem, describes the binary raster layout (bil format, 16-bit, big-endian, 18204 x 18204 cells), and gives the distribution URL srb://seek:/home/beam.seek/Hydro1k/NorthAmerica/na_dem.tar]

Fig. 3 Part of the EML Document for the Hydro1k DEM North America Dataset
2.2 The Web Services Actor The Web Services actor in Kepler serves as a proxy between the workflow system and the Web service endpoints. An example is shown in Fig. 4. There are two steps to use a
Web Services actor in Kepler. The first step is to specify the URL of the WSDL of a Web service. The Web Services actor will parse the WSDL document and retrieve available operations and their input/output types declared in the WSDL
document. In the second step, users select an operation from a dropdown list. After the selection, the parameters associated with the operation are obtained, the corresponding ports are added to the actor, and the actor is ready to connect its ports to ports of other actors. Kepler allows changing the default port names retrieved from the WSDL to make them more meaningful and intuitive. When a Web Services actor is executed (or "fired" in Ptolemy/Kepler terminology), the actor collects all the data tokens from the input
ports and transforms them into values of primitive XML data types. These values are further transformed into Java objects and collected in an object array. The object array is then passed to the Apache Axis Java client API to invoke the Web service endpoint. Finally, the invocation result is mapped back to the Kepler output ports by setting their output data tokens. Currently the Web Services actor supports only primitive XML data types and arrays of them. Support for complex data types is planned.
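The token-to-object mapping step can be illustrated with a small sketch. The function below is an assumption made for exposition (it is neither Kepler's nor Axis's actual code); only the standard XSD primitive type names are taken as given.

```python
# Sketch of the mapping a Web service proxy performs before invoking an
# endpoint: string data tokens are cast to typed values according to the
# parameter types declared in the WSDL. Complex types are rejected,
# mirroring the actor's primitive-types-only limitation described above.

def tokens_to_objects(tokens, xsd_types):
    """Convert string data tokens to typed values per the WSDL declarations."""
    casts = {
        "xsd:string": str,
        "xsd:int": int,
        "xsd:double": float,
        "xsd:boolean": lambda s: s.strip().lower() == "true",
    }
    objects = []
    for token, xsd_type in zip(tokens, xsd_types):
        if xsd_type not in casts:
            raise ValueError("unsupported XML type: " + xsd_type)
        objects.append(casts[xsd_type](token))
    return objects  # the "object array" handed to the SOAP client call
```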
Fig. 4 Illustration of the Kepler Web Services Actor

3. Developing Actors for Species Distribution Exploration
Species distribution data are often associated with regions. A species can be distributed over multiple regions, and a region can host multiple species. We explore the spatial distribution patterns of a set of species as follows. First, a similarity matrix is generated from the taxonomic data. Second, the matrix is fed to a clustering package called Cluto ([HREF 5]). Cluto can build a tree hierarchy from the clustering results based on a chosen similarity criterion. Finally, the clustering results are displayed in an open source GIS called JUMP ([HREF 6]). We could have developed the whole system on a single machine if the different computational components used in the study had been compatible with each other. However, the Cluto package only provides a C API in the form
of a binary static library. While libraries for both Windows and Linux are provided, we encountered problems using the library from Java code through JNI under Windows. Fortunately, the library worked well under Linux. We therefore chose to publish the needed clustering functions as Web services for remote invocation. We present the actor for generating the similarity matrix and the actor for visualization in this section, and leave the development and deployment of the clustering Web service to Section 4.
3.1 Actor for Generating the Taxonomic Similarity Matrix

The species in a region can often be organized into a taxonomic tree according to one or more classification systems. Fig. 5 shows the taxonomic tree of a US county (FIPS# 37025)
with three taxonomic ranks: Family (F), Genus (G) and Species (S). We measure the similarity of two taxonomic trees as the size of the intersection of the two trees over all tree levels (taxonomic ranks). The algorithm to compute the similarity is recursive and is presented in Fig. 6. The recursion stops when the leaf nodes of the two taxonomic trees under comparison are reached. Note that the taxonomic trees for all regions in a dataset have the same depth, which is determined by the number of taxonomic ranks in the dataset.
Algorithm TaxonomicTreeSimilarity (TaxonomicTree T1, TaxonomicTree T2)
Begin
1. Retrieve all the immediate child nodes of T1 and store them in Set s_t1
2. Retrieve all the immediate child nodes of T2 and store them in Set s_t2
3. Set the intersection of s_t1 and s_t2 to s_intersect
4. Return the result of GetIntersectSize (T1, T2, s_intersect)
End

Algorithm GetIntersectSize (TaxonomicTree T1, TaxonomicTree T2, Set s)
Begin
Initialize num to the size of s
For each of the elements in s
1. Find the corresponding tree node in T1 and store it in n1
2. Find the corresponding tree node in T2 and store it in n2
3. Increase num by the result of TaxonomicTreeSimilarity (n1, n2)
Return num
End

Fig. 6 Algorithm for Measuring the Similarity between Two Taxonomic Trees

Fig. 5 An Example of a Taxonomic Tree

The actor for generating the taxonomic similarity matrix has two input ports and five output ports. The two input ports specify the locations of the taxon data file and the index data file, respectively. The taxon data file contains records of species distributions in regions, and the index file maps nonstandard region names to key values that uniquely identify the regions and link the region names with the regions' geometric data for spatial visualization. The five output ports are designed in accordance with the clustering Web service interface (c.f. Fig. 4 and Section 4). The first four ports correspond to the sparse matrix format used in Cluto ([HREF 7], Cluto Manual, Section 5.2): the number of rows, the array of row pointers, the array of column indices and the array of row values. The last port outputs the array of key values in the order of the taxonomic trees used for generating the similarity matrix. All output ports are purposely set to the string data type to achieve maximum system compatibility and user interpretability, for two reasons. First, the string data type is primitive in both Kepler and WSDL. Second, results of string data type can be viewed easily in Kepler using generic text viewing actors, which is useful for debugging.
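Under the assumption that a taxonomic tree is represented as a nested dictionary (a representation chosen here purely for illustration), the recursive algorithm of Fig. 6 can be sketched as follows; every shared node at every rank contributes 1 to the similarity.

```python
def tree_similarity(t1, t2):
    """Size of the intersection of two taxonomic trees over all ranks.

    Trees are nested dicts mapping a taxon name to its subtree, e.g.
    {"F1": {"G1": {"S1": {}}}}; leaves are empty dicts. This mirrors
    TaxonomicTreeSimilarity/GetIntersectSize: intersect the immediate
    children, count each shared node, and recurse into the shared subtrees
    (the recursion stops at leaves, whose child sets are empty).
    """
    shared = set(t1) & set(t2)          # immediate child nodes present in both
    total = len(shared)                 # each shared node counts once
    for name in shared:
        total += tree_similarity(t1[name], t2[name])
    return total
```

For example, two county trees that share one family, one genus under it, and one species under that genus have similarity 3.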
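The four matrix-related output ports described above carry a compressed sparse row (CSR) style encoding of the similarity matrix. The sketch below illustrates the idea; the 0-based indexing and the function itself are assumptions for exposition and may differ from Cluto's actual file and API conventions.

```python
def to_csr(sim):
    """Convert a dense similarity matrix (list of lists) into CSR-style
    arrays: number of rows, row pointers, column indices, and nonzero
    values - the shape of the first four output ports of the
    similarity-matrix actor (illustrative; not Cluto's exact format).
    """
    rowptr, colind, values = [0], [], []
    for row in sim:
        for j, v in enumerate(row):
            if v != 0:
                colind.append(j)
                values.append(v)
        rowptr.append(len(colind))  # one past the last nonzero of this row
    return len(sim), rowptr, colind, values
```

In the prototype these arrays would then be serialized to strings, since all output ports use the string data type for compatibility with WSDL.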
3.2 Actor for Visualizing Species Spatial Distributions
The actor is developed by customizing the JUMP open source GIS package ([HREF 6]). The actor has three input ports and, as a sink actor, no output ports. The first port specifies the location of an ESRI shapefile (the associated DBF file and optional index file should be in the same location), the second port specifies the sequence of region key values (an output of the actor for generating the similarity matrix), and the third port takes the clustering result. We use the predefined color schemes in JUMP to generate an array of colors (color legends), and each cluster of regions is colored using one unique color from the array. To make the actor robust, if the number of clusters exceeds the number of available colors, all regions are colored with the default color. The user interface that the visualization actor brings up after
successful invocation is shown in Fig. 7. The interface is implemented as a split pane that sets the proportions of space given to the color legends and the map. The color legends are arranged in a tree structure to reflect the clustering hierarchy. Users can select multiple tree nodes and hit the "Map" button located at the bottom-left of the interface to visualize the clusters of interest as follows. First, all the clusters under the selected tree nodes are computed. Second, the regions classified into those clusters are colored using the regions' color legends. Users can perform standard GIS operations on the colored (or thematic) map, such as Zoom In, Pan, Full Extent and Zoom to Previous/Next views. The visualization actor visually shows the regions that are grouped into the same clusters, and helps users understand the distribution patterns, stimulate hypotheses and seek further insights. Users can also change the number of clusters (c.f. Section 2.2 and Fig. 4) to see how the distribution patterns change. This can be carried out iteratively when exploring species distributions.
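The fallback coloring rule described above can be sketched as follows. The function and its palette handling are illustrative assumptions, not JUMP's actual color-scheme API.

```python
def assign_colors(num_clusters, palette, default_color):
    """Map each cluster to a legend color. If there are more clusters than
    available colors, fall back to the default color for every cluster -
    the robustness rule used by the visualization actor.
    """
    if num_clusters > len(palette):
        return [default_color] * num_clusters
    return list(palette[:num_clusters])
```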
Fig. 7 User Interface of Species Spatial Distribution Visualization Actor
4. Publishing the Clustering Web Service

As discussed in the previous section, the clustering Web service is based on the Cluto package from the University of Minnesota. While the package provides routines to cluster data in both vector space and similarity space using different clustering criterion functions and clustering strategies ([HREF 7]), currently only the graph-partitioning-based clustering algorithm is published as a Web service. As part of our future work, we are publishing all the similarity-based clustering routines as Web services. We refer readers to the Cluto manual for more details regarding the clustering algorithm.
To simplify the Web service's interface, several parameters that are nonessential to users are preset, such as the seed for the random number generator, the debug control setting, and the number of trials. We plan to expose such parameters through the Web service in the future and use PortParameter (c.f. Section 2) in Kepler to allow users to change their values dynamically, preset their values at design time, or simply use the default values ([HREF 8]). Since JNI requires a shared library while the Cluto package only provides a static one, we wrote a bridging library in C and compiled it into a shared library to solve the problem. This was also an opportunity to reshape the Cluto C APIs into more Java-friendly APIs for use with Apache Axis in publishing the Web service.
5. Demonstration
In this study, we use the North and South Carolina plant distribution data as an example to demonstrate the functionality of the prototype system. We first illustrate how to compose a scientific workflow using system built-in actors and custom actors. The workflow becomes executable after the necessary parameters of the actors are filled in. We then demonstrate how the prototype system can help explore the spatial distribution patterns of plants. Composing workflows in Kepler is straightforward once the custom actors are developed and the clustering Web service is deployed. All users need to do is drag and drop workflow components (a director and actors in this application; left side of Fig. 8) onto the canvas (right side of Fig. 8). Once actors are dropped on the canvas, their default input and output ports are attached. For complex actors such as the EML Data Source actor and the Web Services actor, the ports change accordingly when the actors' parameters change, as described in the corresponding sections. Once the workflow is constructed, users can execute it in the Kepler scientific workflow environment by hitting the triangle icon located at the top of the Kepler window. Kepler allows users to execute a workflow in batch mode or in interactive step-by-step mode. At any time during the execution of a workflow, users can stop/resume the execution and watch the intermediate results. The lower part of Fig. 8 shows two snapshots of the outputs of the species distribution visualization actor, one mapping clusters 2 and 4 and one mapping cluster 3. The clustering Web service assigns cluster 2 and
cluster 4 to the same branch and cluster 3 to another branch of the resulting hierarchical clustering tree (c.f. Fig. 7). This means that the regions in clusters 2 and 4 are more similar to each other than to the regions in cluster 3 with regard to the taxonomic tree comparison criterion. The clustering results can be better understood visually from the two snapshots of the visualization actor. In the snapshots, the regions in cluster 2 and the regions in cluster 4 are geographically closer to each other than to the regions of cluster 3. The visualization supports the ecological theory that similar geographical environments often result in similar species distributions.
Fig. 8 The Composed Workflow for Species Distribution Exploration (the two lower snapshots show Clusters 2 and 4, and Cluster 3)
6. Summary and Conclusions

In this study, a prototype system was built using the existing data grid actor and Web Services actor as well as newly developed custom actors. Kepler chained the actors together seamlessly to achieve the scientific goal of exploring species spatial distribution patterns. The prototype demonstrated the capability of integrating distributed and heterogeneous computational resources to support scientific research effectively. While the custom actors and the workflow were developed specifically for exploring species distributions, we believe the framework and the technologies can be applied to other e-science domains as well.

References

1. Ian T. Foster, Carl Kesselman, Jeffrey M. Nick, Steven Tuecke: Grid Services for Distributed System Integration. IEEE Computer 35(6): 37-46 (2002)
2. Tony Hey, Anne Trefethen: The Data Deluge: An e-Science Perspective. In: F. Berman, G. C. Fox, A. J. G. Hey (eds.), Grid Computing - Making the Global Infrastructure a Reality, Wiley, January 2003, pp. 809-824
3. Matthew Jones, Chad Berkley, Jivka Bojilova, Mark Schildhauer: Managing Scientific Metadata. IEEE Internet Computing 5(5): 59-68 (2001)
4. Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, Yang Zhao: Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice and Experience 18(10): 1039-1065 (2006)
5. William Michener: Meta-information Concepts for Ecological Data Management. Ecological Informatics 1: 3-7 (2006)
6. William Michener, James Beach, Shawn Bowers, Laura Downey, Matthew Jones, Bertram Ludäscher, Deana Pennington, Arcot Rajasekar, Samantha Romanello, Mark Schildhauer, Dave Vieglais, Jianting Zhang: Data Integration and Workflow Solutions for Ecology. Lecture Notes in Computer Science 3615: 321-324 (2005)
7. David De Roure, James A. Hendler: E-Science: The Grid and the Semantic Web. IEEE Intelligent Systems 19(1): 65-71 (2004)
8. Arcot Rajasekar, Michael Wan, Reagan Moore, Wayne Schroeder, George Kremenek, Arun Jagatheesan, Charles Cowart, Bing Zhu, Sheau-Yen Chen, Roman Olschanowsky: Storage Resource Broker - Managing Distributed Data in a Grid. Computer Society of India Journal, Special Issue on SAN, 33(4): 42-54 (October 2003)
[HREF 1] Ecological Metadata Language, http://knb.ecoinformatics.org/software/eml/, (last accessed 09/18/06)
[HREF 2] Kepler scientific workflow system, http://www.kepler-project.org/ (last accessed 09/18/06)
[HREF 3] Ptolemy II, http://ptolemy.eecs.berkeley.edu/ptolemyII/ (last accessed 09/18/06)
[HREF 4] Hydro1K, http://edc.usgs.gov/products/elevation/gtopo30/hydro/index.html (last accessed 09/18/06)
[HREF 5] Cluto, http://www-users.cs.umn.edu/~karypis/cluto/ (last accessed 09/18/06)
[HREF 6] Unified Mapping Platform (JUMP), http://www.vividsolutions.com/jump/ (last accessed 09/18/06)
[HREF 7] Cluto Manual, http://www.cs.umn.edu/tech_reports_upload/tr2002/02-017.pdf (last accessed 09/18/06)
[HREF 8] Ptolemy II Design Document, http://ptolemy.eecs.berkeley.edu/papers/05/ptIIdesign1-intro (last accessed 09/18/06)
Acknowledgements

Kepler includes contributors from SEEK, the SDM Center, Ptolemy II and GEON, supported by NSF ITRs 0225665 (SEEK) and 0225673 (GEON), DOE DE-FC02-01ER25486 (SciDAC/SDM), and DARPA F33615-00-C-1703 (Ptolemy).