Workflow Infrastructure for Multi-scale Science Gateways 

Srinath Perera, Suresh Marru, Chathura Herath School of Informatics Indiana University, USA

ABSTRACT

Science gateway users want to construct, share, execute and monitor sequences of tasks that run on anything from their local workstations to high-end grid-enabled compute resources. Typically, these tasks are written in various programming languages as shell-command executables designed to run in a single-user environment. In collaborative research, owners of applications prefer to share them in a controlled manner; however, the lack of technology support for such scenarios has made sharing non-trivial. In this paper, we present a set of gateway development tools: a toolkit to wrap tasks as Web services, a service registry, and a workflow composition, enactment and monitoring GUI. This domain-independent workflow suite enables users from a wide variety of environments to selectively and securely share their applications as Web services, and to construct workflows with these services. The suite also includes on-demand service creation, workflow orchestration and monitoring. Furthermore, these components can be used individually or collectively to build small to large-scale science gateways.

Keywords: Science gateway, Workflow, Service Registry, Application Wrapper, Web Services

1 INTRODUCTION

As the concept of Grid computing matures, science collaborations are emerging to share research code, data and distributed computing resources. These collaborations are multi-disciplinary in nature and are building gateways that provide seamless access to the nation's high-end integrated infrastructures like TeraGrid [9]. The starting point for many scientific collaborations is sharing research systems designed as computer-based tools, and a major subset of these tools and simulation models are shell-command executable applications. As the computational requirements of these applications become demanding, infrastructure issues like security, remote data and compute task management consume significant amounts of scientists' time, deterring them from their core science research. Typically, an end-to-end scientific simulation involves preprocessing, analysis and post-processing. Since most science problems are inter-connected, scientists need to share applications, combine multiple applications' logic in a flexible but planned order, and orchestrate them as workflows. Such inter- and intra-disciplinary collaborative research is vital for the advancement of science. However, sharing command-line applications is not a trivial process, due to many technical and practical difficulties. Most legacy science applications are not easy to install or maintain, and they undergo constant change. Some of these applications depend on specialized software or non-conventional versions of operating systems, and some even depend on expensive hardware like a particle accelerator or an earthquake simulator. Furthermore, application authors may not be ready to share the application source as-is, but rather are willing to

share executions within a trusted community and in a controlled manner. Web services [12] provide standardized interfaces, defined, described, and discovered as XML artifacts, and they facilitate secured and controlled interactions with other Web services and software components using well-known message formats. Nevertheless, redesigning all scientific applications as Web services is neither advisable nor practical. Hence, the concept of wrapping applications as Web services is emerging. These wrapped application services abstract functional logic, and they can be composed together into scientific workflows, enabling scientists to construct, orchestrate, monitor, and share application executions. In this paper we present the architecture and usage of a workflow suite that enables users to wrap command-line applications as secured services and to construct, share, execute and monitor workflows. The workflow suite comprises the Generic Service Toolkit (GST) [25], XRegistry, and the XBaya workflow graphical user interface [42]. These software components are developed and widely used by a large NSF-funded Information Technology Research project, Linked Environments for Atmospheric Discovery (LEAD) [16]. LEAD is building an integrated, scalable cyberinfrastructure in which meteorological analysis tools, forecast models, and data repositories can operate as dynamically adaptive, on-demand, Grid-enabled systems. Real-time dynamically adaptive weather forecasting provides a major subset of gateway requirements and has been the driving motivation for developing the workflow infrastructure. However, these tools have been architected to cater to wider scientific domains, addressing generalized computational and data management issues. Owing to the wide range of requirements from domain-centric gateways, building a general-purpose science-gateway workflow infrastructure requires extensible and modular software components.
These extensible tools are packaged, tested and supported through the Open Grid Computing Environments (OGCE) [36] project, and small, medium or large-scale science gateways can use them as an integrated workflow infrastructure or incorporate individual components into existing gateway infrastructures. Section 2 of this paper discusses the high-level architecture of the workflow suite and briefly presents each individual component, with references for further reading. In Section 3, we discuss the security model and various security usage patterns. To illustrate domain independence and multi-scale gateway adoption, in Section 4 we present chemical informatics, atmospheric science, and genome analysis use cases. These gateways have successfully used the LEAD workflow infrastructure in modes ranging from a single user to a high-end complex gateway serving hundreds of users. In Sections 5 and 6, we discuss related work, conclusions and future directions.

2 ARCHITECTURE

The LEAD workflow suite addresses the gateway requirements of generating Web service interfaces for shell-command applications and using them to construct data-flow and/or control-flow graphs as workflows. To address the initial requirement of wrapping existing legacy applications, we present the Generic Service Toolkit (GST), which not only provides Web service interfaces to command-line applications but also aids in creating on-demand Web services. To catalog the generated application service descriptions, information about hosting environments, and constructed workflows, we present the XRegistry service. XRegistry is equipped with sophisticated user and group support, and acts as a basis for sharing and publishing services among users. Finally, we discuss a workflow graphical user interface, XBaya, which allows users to construct task graphs as workflows. Figure 1 shows a high-level architecture of the workflow system and the interactions between the service toolkit components, the registry service and the workflow GUI. Later in this section we discuss design details of the individual components. To avoid confusion with deployment and usage of the multi-layered architecture, the terminology is listed below:

• Compute Hosts - the computing environments where applications are executed; based on application needs, these range from a simple workstation to a supercomputer.
• Application Services - Web service wrappers for command-line science applications.
• Service Hosts - the servers used to run lightweight application services, the service factory and the registry service. For a small or medium gateway, the service hosts and compute hosts can be the same.
• Scientific Workflows - task graphs orchestrating the execution of a combination of application services.
• Application Provider - the application owner managing the sharing and execution of applications.
• Workflow Provider - the workflow user constructing application graphs and sharing them.
• End User - the workflow system end user who is interested in the outcome of the workflow. Depending on their knowledge level, an end user can be a workflow or application provider as well.

Science gateways need to support a large number of science applications, but only a subset of them is used at any given time. Deploying all application services and keeping them alive would need a large pool of gateway resources, which would have to be maintained and managed over a long period of time. To address this issue, we use the on-demand service creation [25, 26, 42] features of the GST. With this approach, application providers register the inputs, outputs and deployment locations of their applications with a registry, and they do not need to deploy and maintain persistent Web services. The workflow infrastructure uses the Generic Factory Service (GFac), a component of the service toolkit, to create Web services at workflow execution time. Even though they are created on demand by a workflow user, services are securely reused amongst different users and are kept alive for a short duration in anticipation of subsequent requests. These services automatically shut down after a pre-configured period of inactivity. Thus, the resource requirements to host and support multiple applications are minimal, and a large number of applications can be registered with the science gateway. The deployment and usage of the workflow infrastructure can be broadly divided into three categories:

1. Gateway setup and deployment of persistent services - a one-time installation step.
2. Registering applications and constructing and sharing workflows - iterated until all applications are registered and the necessary workflows are constructed.
3. End users executing and monitoring workflows.

We discuss these three steps in detail with reference to the sequences specified in Figure 1. The initial workflow system deployment includes setting up a factory service and a registry service. If monitoring of workflows is desired, the optional Web service messaging framework [21] is also needed. Registering applications with the workflow system involves defining three deployment descriptors that specify how a command-line application should map to a service. These deployment and service descriptions are stored in the registry. As represented by arrow A1, the application provider builds and tests the application on remote compute resources. As shown in steps A2 and A3, the deployment locations and the service inputs and outputs are registered with the registry service using a web form. The XRegistry service can be interfaced as a Web service, and additionally a Java API and command-line shell scripts are provided. This allows gateways to programmatically generate and register application service descriptions. Typically, steps A1 to A3 are performed once and repeated only as the application source changes. Workflow providers construct workflows as shown in steps B1 to B3. In step B1 the XBaya workflow GUI [42] is loaded as a Java Web Start application, and the user is authenticated and authorized. As shown in step B2, the XBaya GUI contacts a service registry and fetches the application service descriptions that are accessible to the user. The user can browse the services registered in steps A1 to A3, directly load an external service by its URL, or load existing workflows to construct hierarchical graphs.
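As an illustration, the three descriptor documents registered in steps A2 and A3 might look like the following sketch. All field names, paths and values here are hypothetical, not the actual descriptor schema.

```python
# Illustrative sketch: the three descriptor documents that map a
# command-line application to a service (field names are hypothetical,
# not the actual descriptor schema).

host_description = {
    "hostName": "bigred.example.edu",
    "tmpDir": "/scratch/gateway/tmp",
    "fileTransfer": "gridftp",   # how input/output files are staged
    "jobSubmission": "gram",     # how the job is launched
}

application_description = {
    "applicationName": "StormDetection",
    "executablePath": "/home/gateway/apps/sda/bin/sda",
    "hostName": "bigred.example.edu",   # ties this install to a compute host
}

service_description = {
    "serviceName": "StormDetectionService",
    "inputs":  [{"name": "radarData", "type": "URI"},
                {"name": "reflectivityThreshold", "type": "Float"}],
    "outputs": [{"name": "stormLocations", "type": "URI"}],
}


def register(registry, doc):
    """Stand-in for the XRegistry registration calls (steps A2-A3)."""
    key = (doc.get("serviceName")
           or doc.get("applicationName")
           or doc["hostName"])
    registry[key] = doc


registry = {}
for doc in (host_description, application_description, service_description):
    register(registry, doc)
```

Keeping the host, application and service descriptions separate is what lets one service description map onto several installations of the same application.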
The XBaya GUI presents intuitive information about each service and its inputs and outputs. The user may compose task graphs or workflows by connecting multiple services together, and the composer validates input and output type compatibility. However, knowledge of application dependencies is assumed of the user. Furthermore, users may use the composer to access pre-constructed workflows and execute them. User inputs are captured, and the workflow is passed to an enactment engine for orchestrated execution. For smaller gateway deployments, XBaya has an embedded workflow engine implemented in Jython [2]; alternatively, XBaya can serve as a client to standard workflow implementations like BPEL [4]. Arrows C1-C8 in Figure 1 illustrate the steps involved in the execution of a workflow. Once the workflow is invoked, the Jython workflow enactment searches XRegistry for any existing application service instances (C2) and, if found, invokes the application services with the user-specified data and configuration parameters. If no such service instance is found, XBaya contacts the service factory to create a service instance (C3), and the new service instance registers itself with XRegistry. The new service is used by the current workflow execution, and can be used by subsequent invocations within the lifetime of the created application service. Once invoked (C4), an application service stages any input files (C5) and submits the job to an application host (C6) using the remote file transfer and job submission protocols specified in the deployment descriptors. For large-scale gateways, Grid-based remote file transfer (GridFTP) and job management (GRAM) can be used. For simpler deployments, SSH-based remote transactions are supported. However, owing to its security and better monitoring features, a Grid-based deployment is recommended.
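The create-if-absent behavior of steps C2 and C3, together with idle shutdown, can be sketched as follows. This is a minimal illustration under our own naming; the class and method names are hypothetical, not the actual GFac API.

```python
import time


class TransientService:
    """Stand-in for an on-demand application service that is considered
    expired after a period of inactivity (hypothetical, not the GFac API)."""

    def __init__(self, name, idle_timeout=5.0):
        self.name = name
        self.idle_timeout = idle_timeout
        self.last_used = time.time()

    def invoke(self, payload):
        self.last_used = time.time()   # any request resets the idle clock
        return "%s processed %s" % (self.name, payload)

    def expired(self):
        return (time.time() - self.last_used) > self.idle_timeout


class GenericFactory:
    """Creates a service on first request and reuses the live instance
    until it expires (the create-if-absent behavior of steps C2-C3)."""

    def __init__(self):
        self._instances = {}

    def get_service(self, name):
        svc = self._instances.get(name)
        if svc is None or svc.expired():
            svc = TransientService(name)   # on-demand creation (C3)
            self._instances[name] = svc    # registered for reuse (C2)
        return svc


factory = GenericFactory()
a = factory.get_service("WRF")
b = factory.get_service("WRF")   # reused within its lifetime
assert a is b
```

In the real system the idle timer runs inside the created service itself, which deregisters from XRegistry when it shuts down; here expiry is checked lazily at lookup time for brevity.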
As the workflow executes, the application services send activity notifications to the event bus (C7), and the XBaya GUI listens to them and graphically depicts

Figure 1: LEAD Workflow System Architecture

the workflow progress by changing the colors of workflow components and displaying notification messages (C8).

2.1 SERVICE TOOLKIT

The goal of an application service is to wrap a command-line application with a Web service front end, while handling the details of file staging, job submission, and security. Furthermore, the service acts as an extensible runtime around which extensions like sharing, auditing, resource brokering and urgent computing are implemented. The service toolkit includes a generic factory service for on-demand creation of application services, and a service runtime that provides the logic for application services. A user defines applications, deployment information and the mapping to a service in three deployment descriptors: the application, host, and service description documents. The host description document includes the Java and toolkit installation locations and temporary working directories; if it describes a compute host, it also includes the remote access mechanisms for file transfers and job submissions. The application description defines the application's install location and execution information, and the service description defines input, output and other application configuration information. When a user requests a new application service, the factory service chooses a host from the registered service hosts and starts a new application service on that host. If multiple service hosts are registered, the factory service provides load balancing by choosing hosts in round-robin fashion. The newly created service fetches its deployment descriptors from the registry and configures itself according to the contract defined by the service and application descriptions. After a successful initialization, the service registers its WSDL in the registry service so it can be used by other workflow executions, and the service shuts itself down after a given period of inactivity. When an application service is invoked, the service parses the request and identifies the parameters that should be passed to the underlying application. As mentioned before, a typical application service invocation involves two hosts: the service host, where the service instance is running, and the application host, where the application is executed. Services and applications have a one-to-many mapping, where multiple application descriptions correspond to different installations of the same application. After deciding the best application host on which to execute the application, the input data files specified by the input parameters are staged to the application host and the underlying application is executed using a job submission mechanism (e.g., GRAM). The service monitors the status of the remote application and publishes frequent activity information to the event bus. Once the invocation is complete, the application service determines the results of the invocation by searching the standard output for user-defined patterns or by listing a pre-specified location for generated data products. Apart from wrapping a command-line application as a service, the application service provides a number of add-on facilities that are essential for a scientific workflow environment. The application service runtime is implemented as a processing pipeline based on the Chain of Responsibility pattern [50], where the pipeline can be altered by inserting interceptors. The resulting architecture is highly flexible and extensible, and provides an ideal architectural basis for a system that supports a wide range of requirements.
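A request-processing pipeline of this shape can be sketched with the Chain of Responsibility pattern as follows. This is a minimal illustration of the pattern only; the interceptor names and the `Pipeline` API are hypothetical, not the toolkit's actual classes.

```python
class Interceptor:
    """Base class: each interceptor may inspect or modify the request
    before it is passed along the chain (Chain of Responsibility)."""
    def process(self, request):
        return request


class AuthorizationCheck(Interceptor):
    # reject the request early if the caller holds no capability
    def process(self, request):
        if not request.get("authorized"):
            raise PermissionError("capability check failed")
        return request


class NamelistPreprocessor(Interceptor):
    # e.g., rewrite FORTRAN namelist parameters before execution
    def process(self, request):
        request["namelist_rewritten"] = True
        return request


class Pipeline:
    def __init__(self, interceptors):
        self.interceptors = list(interceptors)

    def insert(self, index, interceptor):
        # the pipeline is altered simply by inserting interceptors
        self.interceptors.insert(index, interceptor)

    def handle(self, request):
        for interceptor in self.interceptors:
            request = interceptor.process(request)
        return request


pipeline = Pipeline([AuthorizationCheck()])
pipeline.insert(1, NamelistPreprocessor())    # add a domain-specific step
result = pipeline.handle({"authorized": True, "app": "WRF"})
```

The point of the pattern is that domain-specific behavior (here, namelist rewriting) is added without touching the core invocation logic.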
For example, in the LEAD cyberinfrastructure project, we have used the interceptor pipeline to handle scenarios like FORTRAN namelist preprocessing, adding security and authorization support, and integration with the LEAD data sub-system. Furthermore, the design abstracts out common services like file transfer, registry support, notification support and job submission, allowing different implementations to be switched dynamically or via configuration. Apart from the core application wrapping and on-demand service creation functionality, the following is a list of additional features:


1. An extensible request-processing pipeline that enables users to dynamically change request processing by introducing interceptors. A major application of these extensions is implementing domain-specific functions.
2. The service toolkit supports a wide variety of file transfer mechanisms (e.g., GridFTP, SFTP, HTTP, shared-file-system copying, and local file transfers).
3. The service toolkit supports a wide variety of job submission services (e.g., GRAM [14], WS-GRAM [17], SSH, local execution, and the Virtual Grid Execution System [49]).
4. An application service supports several installations of the application, and an installation can be selected using random selection or more sophisticated algorithms utilizing bandwidth estimates and the best-effort batch queue wait time predictions made by the Network Weather Service and QBETS [52].
5. Application services also support load balancing and fault tolerance while processing requests. Every host description document may provide a list of file transfer and job submission endpoints, and application services use those endpoints in round-robin fashion. Furthermore, in case of a failure the application service can retry against a different endpoint if one is available.
6. Selective resource sharing based on the authorization model provided by XRegistry.
7. Support for job submission based on urgent computing facilities [6].
8. Auditing of resource utilization, providing a basis for accounting and billing [30].

2.2 REGISTRY SERVICE

XRegistry acts as the information repository for the workflow suite: users may store documents (e.g., service toolkit deployment files, workflow files) and search and retrieve them, and services may register themselves with the registry, enabling other users and services to search for and access registered services. The registry service provides extensive user and group support exposed via a Web service interface, which provides the basis for sharing and access control in the gateway environment. Every document in the system has an owner, who has complete access to the resource; users of the system may belong to groups, and groups may recursively belong to other groups. Furthermore, the registry allows users to define capabilities that authorize other users to perform actions on their documents. We define the recursive group-based authorization model formally as a tuple of users, resources, groups, group memberships, capabilities and ownerships, XREG = (U, R, G, M, C, O), where the entities are defined as follows.

• U is the set of users of the system.
• G is the set of groups of the system.
• R is the set of resources in the system.
• M ⊂ (U ∪ G) × G defines the group memberships in the system.
• C ⊂ (U ∪ G) × R × {Read, Write} defines the set of capabilities.
• O ⊂ U × R defines the resource ownership set.

For groups gx, gy ∈ G, we define the transitive, reflexive containment relationship ∈*, where gx ∈* gy represents that group gy directly or indirectly contains group gx:

gx ∈* gy ⟺ gx = gy, or (gx, gy) ∈ M, or ∃n ∈ N, g1, g2, ..., gn ∈ G such that (gx, g1) ∈ M, (gi, gi+1) ∈ M for i = 1 ... n−1, and (gn, gy) ∈ M.

For a user u ∈ U, we define G(u) ⊂ G, the set of all groups the user belongs to:

G(u) = {g ∈ G | ∃g1 ∈ G, (u, g1) ∈ M, g1 ∈* g}

We define the authorization function auth : U × R → {True, False} as follows:

auth(u, r) = (∃g ∈ G, g ∈ G(u) ∧ (g, r) ∈ C) ∨ ((u, r) ∈ C) ∨ ((u, r) ∈ O)

Finally, we define R(u) ⊂ R, the set of all resources accessible to user u, as:

R(u) = {r ∈ R | auth(u, r) = True}

Figure 2: Sample Group Hierarchy

According to the formal model, we define the functionality of the system as follows.

1. User u has access to resource r iff auth(u, r) = True.
2. User u can edit the capability entries for resource r iff (u, r) ∈ O, the ownership set; by doing so, the user shares the resource.
3. When a user lists resources based on a criterion, a resource is listed iff r ∈ R(u) and it matches the criterion.
4. Only administrators are allowed to edit the group set G and the membership set M.
5. Any user can add a resource to the resource set R.

For example, as shown in Figure 2, groups and users form a graph whose edges represent membership relationships. Every user in a child group is considered to be included in the parent groups as well; for instance, every user in the Students group is also a member of the parent group "lead". Students group users can access any resource assigned to the Students, Educators or LEAD groups. However, users in the LEAD group do not have access to resources assigned to the Students group.

To enforce authorization while handling documents, the registry needs to evaluate the functions R(u) and auth(u, r). To implement the authorization model efficiently, the system calculates G(u) for each user and stores it in the user object; using G(u), both functions can be implemented easily. XRegistry is implemented as a Web service with a MySQL back-end, and the group hierarchy is cached in memory for better performance. Furthermore, it provides a portlet-based web interface to create, share, manage and search documents.
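The functions G(u) and auth(u, r) defined above can be implemented directly. The following is a minimal sketch, not the actual XRegistry code; the Read/Write dimension of capabilities is omitted for brevity, and the sample sets mirror the Figure 2 hierarchy.

```python
def groups_of(user, memberships):
    """G(u): all groups reachable from the user's direct memberships
    through the membership relation M (computed by graph traversal)."""
    closure = {g for (member, g) in memberships if member == user}
    frontier = list(closure)
    while frontier:
        g = frontier.pop()
        for (member, parent) in memberships:
            if member == g and parent not in closure:
                closure.add(parent)
                frontier.append(parent)
    return closure


def auth(user, resource, memberships, capabilities, ownerships):
    """auth(u, r): true via ownership, a direct capability, or a
    capability granted to any group in G(u)."""
    if (user, resource) in ownerships or (user, resource) in capabilities:
        return True
    return any((g, resource) in capabilities
               for g in groups_of(user, memberships))


# Mirrors the Figure 2 idea: Students is a child group of lead, so a
# student inherits capabilities granted to lead, but not the reverse.
M = {("alice", "Students"), ("Students", "lead"), ("carol", "lead")}
C = {("lead", "wrf-service"), ("Students", "course-data")}
O = set()

assert auth("alice", "wrf-service", M, C, O)       # inherited via lead
assert not auth("carol", "course-data", M, C, O)   # parent gains nothing from child
```

Precomputing and caching the closure per user, as the registry does with G(u), turns each auth(u, r) check into a few set lookups.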

2.3 WORKFLOW COMPOSITION AND EXECUTION ENGINE

The XBaya GUI provides an interface for users to browse registries, connect registered services into directed-graph workflows, execute them, and monitor their progress; it can be used within a portal framework or standalone. Users can access publicly accessible Web services, or provide security credentials to fetch X509 credentials from a MyProxy server [35] and authenticate with XRegistry or a secured workflow engine. XBaya allows users to browse multiple service registries, and workflows can be composed with concrete Web services (services up and running at the time of composition) or with abstract descriptions registered with XRegistry. A user composes a workflow by dragging and dropping services onto a worksheet and visually defining the relationships among them. Services can be connected using data dependencies, and additional control dependencies can be enforced. The GUI can be loaded by clicking a JNLP [43] URL, and the interface starts on the user's local workstation. The only prerequisite to run XBaya is Java Web Start, which is installed by default with all recent versions of Java. XBaya provides a high-level workflow description from which workflow-engine-specific execution scripts can be generated and deployed. The GUI currently supports GPEL [45] and embedded Jython script execution, and it can be used to monitor workflow executions by subscribing to the event bus; based on event notifications, the colors of service boxes are changed to visually present the execution status. The workflow event life cycle is defined in the LEAD metadata schema [39], and the services created using the Generic Service Toolkit produce notifications about file transfers, job submissions, batch queue status, application execution, and data transfer times, among other status messages. When the events contain compute host information, XBaya dynamically displays the compute host where the application behind the service box is being scheduled and executed.

During workflow composition, it is important to make sure that the correct services are wired together in a meaningful manner. One obvious technique to improve correctness during the composition stage is to use type systems. The current implementation of XBaya makes use of the WSDL-based type system to perform some correctness tests, but this method lacks semantic correctness checking and thus could lead to compositions that appear properly type-matched but are semantically incorrect. As part of future work, we plan to provide semantic guidelines during workflow composition using ontology-based descriptions. Apart from the Jython-based embedded workflow execution support, XBaya as a workflow composer supports the separation of workflow composition from actual invocation, where a BPEL engine performs the actual invocation of the workflow. There has been effort towards extending XBaya to interact with multiple workflow engines like Taverna [22], GPEL [45] and ODE [1], and this work can be considered a prologue to workflow interoperability. It is important to note that, even though we present the system as a workflow suite performing service wrapping, registration, workflow composition, execution and monitoring, it is possible to use these components individually and integrate them into different compatible frameworks. The service toolkit has a built-in web interface that can be used not only to register and share services but also to create and invoke service instances. The toolkit can also automatically generate a web interface based on service descriptions for users to input data, invoke services, and view results using a portlet-based interface. The generated Web services can be used in any workflow infrastructure supporting Web service composition and execution; early experiments have been made with the Kepler and Taverna workflow systems [37].
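The WSDL-based type check applied when wiring two services together can be sketched as a simple port-compatibility test. The `Port` class and the permissive treatment of `anyType` here are illustrative only, not XBaya's actual implementation.

```python
class Port:
    """A service input or output with its declared XSD type
    (illustrative, not the actual XBaya port model)."""
    def __init__(self, name, xsd_type):
        self.name = name
        self.xsd_type = xsd_type


def can_connect(out_port, in_port):
    """A link is considered valid when the producer's declared type
    matches the consumer's, or when either side is untyped (anyType)."""
    if "anyType" in (out_port.xsd_type, in_port.xsd_type):
        return True
    return out_port.xsd_type == in_port.xsd_type


sda_out = Port("stormLocations", "xsd:string")
filter_in = Port("locations", "xsd:string")
cluster_in = Port("count", "xsd:int")
```

As the text notes, a check of this kind catches type mismatches but says nothing about semantics: two string ports connect even if one carries radar file paths and the other expects station names.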

3 SECURITY MODEL

Science gateways are intended to relieve end users of the need to understand data access, data movement and compute job management. These user-friendly features come at the cost of adding middleware layers that abstract end users from data, compute resources and scientific applications. Hence, a well-planned security model is essential to gateway infrastructure. Moreover, since application providers share their research codes, a carefully monitored, controlled access mechanism is a necessity. In this section, we briefly discuss the security model for the workflow suite, which is enforced at multiple levels. The security of the workflow suite can be broken down into four major segments: securing the invocations between services; authenticating users with X509-based grid credentials; ensuring the secure exchange of these credentials for secured file transfers and job submissions; and the authorization model for controlled access to application and service descriptions. Since all Web services are web-accessible, it is important to secure these services and deny any requests made by anonymous clients. The XRegistry service, the Generic Factory Service, and all transiently created application services enforce SSL-only authentication; thus all services have HTTPS URLs, and clients have to perform mutual authentication before initiating any requests. The SSL certificates are issued by the deploying gateway, and only services holding gateway-issued host certificates are allowed to interact. Users must authenticate with their grid credentials to the XBaya interface, and these credentials are used to make secure connections to the XRegistry service, the factory service and secured application services. As shown in Figure 1, the application provider (Alice) creates a service and shares it with other users (step A2) by issuing a capability via the service toolkit's web interface, and the capability is registered with the XRegistry service.
Workflow provider Bob searches for all service instances (step B1), and XRegistry lists Alice's service since Bob has been granted access. In step C1, Bob, as an end user, invokes the service instance (composed into a workflow or via the service toolkit's web interface) using his security credentials for authentication. The invoked service contacts XRegistry, verifies Bob's permissions, and the invocation proceeds to completion. Any invocation initiated by a user without access is rejected. Based on the desired security model, the workflow infrastructure can be used in three modalities.

• In the first model, a single user has control over the application deployments and the workflow infrastructure, and the same user executes workflows with the credentials used for application and infrastructure deployment.
• The second model is best suited for medium to large gateways with a manageable number of gateway users. In this usage scenario, all applications are deployed on multiple resources using a shared gateway community account (managed by gateway administrators), and the workflow infrastructure is deployed using the gateway community account and configured to support multiple gateway end users. In this model, gateway user management can be done at a higher level (e.g., in a web portal using PURSE [18]), and XRegistry can leverage the portal authentication system and manage the mappings between gateway users.

Figure 3: Data Mining Workflow

• In some cases, sharing all applications via a single gateway account is unfeasible, either because such an account is not available or because the necessary trust has not been established. Such cases can be handled using the following model. Applications are deployed in the respective owners' accounts, and each owner deploys a service factory and embeds the factory location within the description of each application service. A single registry is used to catalog all application services, and owners can selectively share their services with other users. A user may compose workflows using his own services and services that are shared with him. At workflow execution time, the XBaya GUI contacts the factory service specified in the application service description and uses it to create a new service. With this model, users may run application services in their own accounts and selectively share them with other users while maintaining fine-grained access control. This model enables a set of mutually untrusting users to share their applications in a controlled manner.

Another important requirement of gateway infrastructure is the ability to audit end users' usage of shared community compute allocations. To this end, the service toolkit has been instrumented and integrated with the GRAM auditing service; details are given in the LEAD auditing paper [30].

4 EXAMPLE USAGE SCENARIOS

In this section we discuss a few usage scenarios in which the XBaya workflow GUI and the Generic Service Toolkit are used individually within other workflow frameworks, and we also discuss how a large science gateway uses this integrated workflow framework for scientific research, educational, and outreach activities.

4.1 LEAD WORKFLOW INFRASTRUCTURE FOR DATA MINING

In this use case we present how the workflow suite has been used to wrap data mining applications and construct mining workflows for the study of storms. At a workshop conducted as part of the AMS annual conference [47], meteorology users ran workflows built from data mining algorithms in the Algorithm Development and Mining (ADaM) [40] toolkit to understand the clustering of storms. The ADaM toolkit provides classification, clustering, and association rule mining methods for mining scientific data. ADaM application providers from the University of Alabama in Huntsville used the Generic Service Toolkit to create wrapper services for the storm detection algorithm (SDA), the k-means partitioning-based storm clustering algorithm (SCA), and a density-based clustering algorithm (DBSCAN). Using these wrapped application services, a data mining storm analysis workflow was composed as shown in Figure 3 and shared with users for execution and analysis of outputs. The SDA service reads NEXRAD radar data [34] from the workflow inputs, applies a user-specified reflectivity threshold and filters, and detects the locations of storms. The next service reads the detected storm locations and filters for the attributes needed for clustering. The final clustering service groups the detected storms based on spatial orientation or area coverage.

Figure 4: Drug discovery use case

The following are a few examples of how the composed data mining workflows can be used by scientists studying the behavior of storms. First, users may execute these workflows to launch weather forecasts on the regions where storms are detected. Second, data mining experts can use them to evaluate the effectiveness of clustering algorithms; a study by UAH scientists found that the DBSCAN algorithm fares better than k-means. The workflows can also support more comprehensive analysis. A central data mining challenge is automatically detecting the number of clusters; although the choice is subjective, objective criteria such as the Hartigan statistical index and the Silhouette index can be applied. For 14 out of 20 data sets (time periods), the two indices indicate the same number of clusters. Of the remaining six data sets, three differ by one, one differs by two and one differs by three. Clustering results using the Hartigan index were compared against results using the Silhouette index for storm events observed by NEXRAD radar over a 25-minute period. In the right portion of both panels the results are essentially identical; on the left, however, the storms are grouped into three clusters by the Hartigan index and 18 clusters by the Silhouette index. Based on these experiments it is concluded that the optimal numbers of clusters determined using the Hartigan index and the Silhouette index are generally the same, but clustering performance using the Silhouette index is more sensitive to storm event distributions. As a result, the combination of the DBSCAN algorithm with the Hartigan index can be used for automatic storm event clustering.
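The idea of index-based selection of the cluster count can be sketched with a Silhouette-style score: each candidate partition is scored, and the best-scoring one determines the number of clusters. This is an illustrative toy, assuming 1-D storm positions and hand-written candidate partitions; it is not the UAH implementation and does not include the Hartigan index.

```python
# Toy sketch: pick the number of storm clusters by scoring candidate
# partitions with the mean silhouette coefficient and keeping the best.

def silhouette(points, labels):
    """Mean silhouette coefficient for a labeled 1-D point set."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to points in the same cluster
        a = sum(abs(points[i] - points[j]) for j in own) / len(own)
        # b: smallest mean distance to any other cluster
        b = min(
            sum(abs(points[i] - points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy 1-D storm positions with three well-separated groups, and candidate
# partitions for k = 2, 3, 4 (in practice these come from the clustering step).
storm_positions = [1.0, 1.2, 1.1, 8.0, 8.3, 8.1, 15.0, 15.2]
candidates = {
    2: [0, 0, 0, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 1, 2, 3, 3],
}
best_k = max(candidates, key=lambda k: silhouette(storm_positions, candidates[k]))
```

For this data the three-cluster partition matches the three natural groups, so it scores highest; the real workflows apply the same selection logic to storm events detected by SDA.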
In a typical scenario, performing this kind of analysis requires significant computing power, and only the owner of the algorithm can perform the study. By using the workflow suite to wrap these algorithms and compose workflows, scientists are relieved from dealing with computational job management, data transfer, and job scheduling issues. Multiple users can now use these workflows in a secure, controlled manner, with the information needed to audit their usage.

4.2 DRUG DISCOVERY USE CASE

The field of chemical informatics is highly multi-disciplinary, intersecting both biology and chemistry. Researchers in this field have to deal with tools and applications from multiple domains, developed in varied environments. The paper [15] elaborates on various research efforts in chemical informatics; in a nutshell, the field is developing an infrastructure to integrate different tools and methods and to personalize them for individual problems. In this use case we discuss the usage of the LEAD workflow infrastructure by a chemical informatics research group at Indiana University, which developed workflows for drug discovery to better understand the effect of drugs in inhibiting tumors.

Figure 5: LEAD Use Case

As shown in Figure 4, the workflow composed in XBaya reads in protein information (PDB), ligands (SMILES), and a similarity coefficient, and generates docking scores and a three-dimensional display using Jmol, a chemical informatics visualization tool. A conventional chemical informatics scientist searching for newly published compounds and studying the effects of drugs needs to understand the internals of the workflow. Using the workflow infrastructure, one person can compose the workflow with their knowledge of how an initial ligand with good binding properties should be selected, then extract similar compounds from a database and run them against the protein to observe the docked effect. The composed workflow can be shared with other users, who need not understand the internal logic and can focus on the main research problem: iterating through multiple ligands with varying Tanimoto coefficients and observing the effects in 3D visualizations of protein complexes and potential drugs. This model of constructing workflows and sharing them in a secure, controlled manner serves a major science requirement: a research professor delegating analysis tasks to graduate students, who need not focus on the internals but on the problem at hand. Similarly, collaborating scientists can make rapid progress on the science problem, where they would otherwise be diverted toward computational issues and learning new systems, both irrelevant to their research goals.
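The similarity measure the workflow iterates over is the Tanimoto coefficient, which for binary fingerprints is the ratio of shared to combined on-bits, T(a, b) = |a ∩ b| / |a ∪ b|. The sketch below uses toy bit-position sets rather than real chemical fingerprints, and the ligand names and threshold are illustrative assumptions.

```python
# Tanimoto similarity on binary fingerprints represented as sets of on-bits.

def tanimoto(fp_a, fp_b):
    """T(a, b) = |a & b| / |a | b| for fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy query fingerprint and a toy compound database (illustrative values).
query = {1, 4, 9, 16, 25}
database = {
    "ligand_A": {1, 4, 9, 16, 25, 36},   # near-identical to the query
    "ligand_B": {1, 4, 9},               # shares a substructure
    "ligand_C": {2, 7, 11},              # unrelated
}

# Keep candidates above a user-chosen Tanimoto threshold, as the workflow
# does when selecting similar compounds to dock against the protein.
threshold = 0.5
hits = {name for name, fp in database.items()
        if tanimoto(query, fp) >= threshold}
```

Sweeping `threshold` corresponds to the "changing Tanimoto coefficients" iteration described above: a lower threshold admits more distant compounds into the docking step.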

4.3 GENOME-WIDE DOMAIN ANALYSIS USING THE GENERIC SERVICE TOOLKIT

Evolutionary and functional relationships among biological molecules are typically expressed as relationships among genes, and motifs/domains can be swapped between genes. MotifNetwork is a collaborative effort between the Renaissance Computing Institute, the Center for Biophysics and Computational Biology, and the National Center for Supercomputing Applications, aimed at the comprehensive transformation of sequenced genomes into domain space. MotifNetwork is building a suite of biologically oriented, grid-enabled workflows for high-throughput domain analysis of protein sequences. MotifNetwork workflows are orchestrated using the Taverna workflow system, and the Generic Service Toolkit is used to wrap computational applications as grid-enabled services. More information on GST usage can be found in [48].

4.4 LEAD: A LARGE-SCALE SCIENCE GATEWAY USING THE WORKFLOW INFRASTRUCTURE

The LEAD project goals can be summarized into two broad categories. First, democratizing the availability of advanced weather technologies for research and education by building a user-friendly interface that abstracts science applications, real-time data streaming, data management, and computational technologies. Second, the LEAD cyberinfrastructure has to be flexible enough to build a dynamically adaptive system with the ability to detect, analyze and predict atmospheric phenomena. Lowering the entry barrier enables rapid experiment design and execution of complex science applications. To this end, the LEAD workflow infrastructure has been applied to weather simulation codes, and forecasting workflows have been constructed and shared across hundreds of users, from research scientists to undergraduate students. In this use case, we discuss how a community of application developers can build, deploy, and manage access to application services and workflows. The LEAD gateway developers have deployed the weather forecasting components on TeraGrid computing resources under a gateway community account. These applications are written in FORTRAN and are widely used in the atmospheric science community. The initial preprocessing and data assimilation is done using components from the ARPS Data Assimilation System [54], and the forecasting is performed by the Weather Research and Forecasting (WRF) [33] model. Once these applications are deployed on multiple TeraGrid clusters (Step A1 in Figure 1), service and application descriptions are registered with the XRegistry service (Steps A2 and A3). Typically these components are orchestrated by a complex Perl script, which can only be understood and maintained by a few experts. Using the workflow infrastructure instead, users who understand the data flow paths can compose workflows and share them with the community of users. As shown in Figure 5, the services are wired together according to their data dependencies to perform weather analysis using radar data.
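Wiring services by their data dependencies amounts to computing a topological ordering of the workflow graph: a service becomes runnable only once every service whose output it consumes has run. The sketch below is a simplified stand-in for the ADAS/WRF workflow, not the actual XBaya engine; the node names and the `Visualize` step are illustrative assumptions.

```python
# Sketch: order workflow services by their data dependencies using a
# topological sort (Python 3.9+ standard library).
from graphlib import TopologicalSorter

# Each service maps to the set of services whose outputs it consumes.
dependencies = {
    "ADAS":      {"RadarData", "ModelBackground"},  # data assimilation inputs
    "WRF":       {"ADAS"},                          # forecast consumes analysis
    "Visualize": {"WRF"},                           # render forecast output
}

# static_order() yields every node with all of its dependencies first,
# giving a valid execution order for the workflow.
order = list(TopologicalSorter(dependencies).static_order())
```

A workflow engine can also run independent branches concurrently by using the sorter's `prepare()`/`get_ready()` protocol instead of the flat `static_order()`, which is how data-parallel stages of a forecast pipeline would be scheduled.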
Advanced users can substitute these services with their own versions of individual service implementations without needing to understand the implementation details of the other components in the workflow. LEAD has successfully used these workflow tools in various community engagement workshops, with great appreciation from end-user communities [5]. Graduate and undergraduate students have used LEAD forecasting workflow results to participate in a national collegiate forecasting competition [11]. Scientists have used LEAD workflows to improve weather forecasting during the National Oceanic and Atmospheric Administration (NOAA) Hazardous Weather Testbed (HWT) [51] Spring 2007 experiments [7].

5 RELATED WORK

As discussed in the surveys by Yu et al. [55] and Slominski et al. [46], there have been many efforts to build workflow tools for e-science, prominent among them Kepler [29], Taverna [22], and Triana [10]. The fundamental requirement for any e-science workflow system is to provide workflow construction and execution tools while minimizing what users must understand about the underlying applications and technologies. The requirements of these systems range from the ability to run long-running applications and orchestrate them as workflows to supporting "launch now, visualize later" interactions. The workflows should be easy to use and, most importantly, developed against standard specifications to facilitate interoperability with other workflow systems. Most existing workflow systems support Web services; however, they define their own workflow descriptions, ranging from proprietary languages such as the Kepler actor language and the Scufl language [10] to industry standards such as WS-BPEL. We address this problem by decoupling workflow composition from execution: the higher-level workflow composition can generate multiple enactment scripts, such as BPEL and Jython. Another significant difference the LEAD workflow system offers is the ability to create services on demand, so that a huge number of applications can be supported and a service may scale to many requests: the workflow tools perform late service binding by automatically creating services as needed and dynamically invoking them at execution time. Furthermore, the workflow system not only supports services created using toolkits outside the infrastructure, but services created in this infrastructure can also be used in other infrastructures. Apart from Jython and GBPEL, we are working towards supporting Taverna and Apache ODE based workflow execution systems. A detailed comparison of XBaya with other tools is available in [42].

In the context of exposing legacy scientific applications as Web services, there are many service toolkits that provide varying features. Among them, Soaplab [44], Opal [28], VAP [31], and Matsunaga et al. [32] provide mechanisms to configure an application description in a text file and use it to create a wrapping Web service. These Web services are developed using a Web service toolkit and then deployed to a Tomcat container; remote job submissions may be done through the Globus job management system, GRAM. Among other systems, the GridDeploy [19] and GEMLCA [24] toolkits provide a Grid service interface to an application executing on the same host: a factory receives a request, spawns a new Grid service in the same environment, and forwards the request. GAP [41] provides a Web service interface that enables application execution: when a user sends a request with details about the application, GAP invokes the application on the given host on behalf of the user. It is important to note, however, that GAP does not expose the application as a Web service, but rather provides an interface to the shell, and is therefore not a natural fit for workflows. Pierce et al. [38] expose legacy code as services via a service manager, which creates each new service instance as a resource. The service instance is a lightweight front end that directs requests to the worker backend via a resource manager.
These services are stateful and support migration and checkpointing-based recovery. The Otho Toolkit [20] is another system that wraps legacy code as services; its services support multiple service platforms, parameter sweeps, iterative and parallel programs, progress reporting, and file staging. The Generic Service Toolkit stands out because of its flexible architecture, which implements an interceptor pipeline. Consequently, the toolkit itself is generic, and all domain-specific details are implemented as plug-ins that can be dynamically coupled per request. These features have been tested extensively against meteorology-specific requirements. The current, second-generation version of the toolkit is a redesign of the earlier version [26] and provides many additional features such as auditing, batch queue scheduling, and urgent computing. Furthermore, through the OGCE [36] initiative we have packaged it for generic Grid deployment, making it usable for small and medium scale applications along with the already supported large-scale complex use cases. The concept of a service registry was proposed by Web services architectures in the form of UDDI, and there have been many implementations of service registries from industry (e.g., WebSphere Service Registry [23]) and the research community (e.g., [27]). Among Grid-specific service registries, GIS [13] allows users to register and discover resources in the Grid environment via a directory service interface. The Taverna workflow suite includes the Grimoires [53] service registry, which allows users to search for services using semantics and also supports an access control mechanism, although the referenced publications do not provide further details. GAT [3] provides a service toolkit, a registry that allows users to search for services by functionality, and the Triana workflow composer to compose and run workflows.
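The interceptor-pipeline design described above for the Generic Service Toolkit can be illustrated with a small chain-of-responsibility sketch: the core dispatch is generic, and domain-specific steps (auditing, batch scheduling, etc.) are composed around it per request. The interceptor names and `Request` shape are illustrative assumptions, not the toolkit's actual API.

```python
# Minimal chain-of-responsibility sketch of an interceptor pipeline:
# each interceptor does its work, then delegates to the next handler.

class Request:
    def __init__(self, app, user):
        self.app, self.user = app, user
        self.log = []  # records what each pipeline stage did

def auditing(request, next_step):
    request.log.append(f"audit: {request.user} -> {request.app}")
    return next_step(request)

def scheduling(request, next_step):
    request.log.append("scheduled on batch queue")
    return next_step(request)

def run_application(request):
    # Terminal handler: in the real toolkit this would launch the wrapped
    # command-line application; here it just records the invocation.
    request.log.append(f"ran {request.app}")
    return request

def build_pipeline(interceptors, terminal):
    """Compose interceptors around the terminal handler, outermost first."""
    handler = terminal
    for interceptor in reversed(interceptors):
        def make(icpt, nxt):
            return lambda req: icpt(req, nxt)
        handler = make(interceptor, handler)
    return handler

# Interceptors are chosen per request, so domain plug-ins stay decoupled
# from the generic core.
pipeline = build_pipeline([auditing, scheduling], run_application)
result = pipeline(Request("WRF", "bob"))
```

Because the pipeline is assembled from a plain list, adding or removing a concern (say, urgent-computing priority handling) is a configuration change rather than a change to the toolkit core.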
Bubak et al. [8] present a P2P-network-based service registry in which domains and tasks are distributed among the nodes; the system also includes a workflow composer that uses the registry for semantics-based search for services and data. The XRegistry service supports sharing resources via recursive, hierarchy-based groups, functionality that is not yet common in registry services for scientific workflow environments. However, we believe the registry's main contribution is how well it adapts to collaborations and science gateway environments. The LEAD workflow architecture is an integrated framework of three software components that can be used individually or as a cohesive system. Moreover, the workflow components are designed as high-level tools flexible enough to interoperate with multiple low-level systems, and the implementations have been embedded and well tested. We believe the LEAD workflow infrastructure works out of the box for many use cases and is designed to be applicable to multiple science domains with minimal or no effort.

6 CONCLUSION

Collaboration among scientists has been a major focus in today's research world, and an initial step towards such collaboration is sharing scientific models and applications across a wider community. However, without tools to facilitate such collaboration, the controlled sharing and collective execution of applications is hard and time consuming, if not impossible. In the related work we discussed various tools that address scientific workflow requirements at different levels, some of them catering to domain-centric issues. Among such tools are those developed as part of the Linked Environments for Atmospheric Discovery (LEAD) project, which is building an integrated, scalable, dynamically adaptive cyberinfrastructure. The software tools have been designed as an extensible framework with pluggable domain-specific modules. In this paper we presented the LEAD workflow infrastructure and its usage by multiple science gateways. We discussed the architecture of the workflow suite, comprising a service toolkit, a registry, and a workflow GUI, which together enable users to wrap command line applications as Web services, compose them into workflows, and share, execute, and monitor those services and workflows. Furthermore, we demonstrated how the software components are flexible and extensible. The presented use cases show that the components can be used by small and large science gateways alike. Various usage scenarios derived from real-world collaborations have been presented, demonstrating the generality of the components. In conclusion, we argue that this generic and extensible architecture can be adopted by a wide range of communities with minimal or no work, benefiting them in the same way as the broad range of users and requirements in the LEAD project.
Therefore, we believe the presented workflow infrastructure will be of immense interest to the TeraGrid user community. The software is released and supported through the OGCE project, and further details and contact information can be found at http://www.collab-ogce.org.
ACKNOWLEDGEMENTS
The development of the infrastructure is funded through the LEAD project, supported by National Science Foundation (NSF) Cooperative Agreements ATM-0331594, ATM-0331591, ATM-0331574, ATM-0331480, ATM-0331579, ATM-0331586, ATM-0331587, and ATM-0331578. The workflow infrastructure is being packaged, tested and supported through NSF Award number 0721656, SDCI NMI Improvement: Open Grid Computing Environments Software for Science Gateways. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation. The Generic Service Toolkit and the XBaya GUI are research outcomes from the PhD theses of Gopi Kandaswamy and Satoshi Shirasuna, and the authors deeply acknowledge their contributions. Furthermore, the authors would like to thank LEAD users and members of the LEAD team from

Indiana University, University of Oklahoma, Millersville University, Howard University, the UCAR Unidata Program, the University of Alabama in Huntsville, and the National Center for Supercomputing Applications, who have been an integral part of the presented infrastructure.
REFERENCES
[1] Apache ODE project. http://ode.apache.org/.
[2] Jython project. http://www.jython.org/Project/index.html.

[3] G. Allen, T. Goodale, T. Radke, M. Russell, E. Seidel, K. Davis, K.N. Dolkas, N.D. Doulamis, T. Kielmann, A. Merzky, et al. Enabling Applications on the Grid: A GridLab Overview. International Journal of High Performance Computing Applications, 17(4):449, 2003.
[4] T. Andrews, F. Curbera, H. Dholakia, Y. Goland, J. Klein, F. Leymann, K. Liu, D. Roller, D. Smith, S. Thatte, et al. Business Process Execution Language for Web Services, Version 1.1. Specification, BEA Systems, IBM Corp., Microsoft Corp., SAP AG, Siebel Systems, 2003.
[5] Tom Baltzer, Anne Wilson, Mohan Ramamurthy, Suresh Marru, Marcus Christie, Dennis Gannon, Al Rossi, Shawn Hampton, Jay Alameda, and Kelvin Droegemeier. LEAD at the Unidata workshop: demonstrating democratization of NWP capabilities. 23rd Conference on IIPS, 2007.
[6] Pete Beckman, Ivan Beschastnikh, Suman Nadella, and Nick Trebon. Building an infrastructure for urgent computing. In High Performance Computing and Grids in Action, IOS Press, Amsterdam, 2007.
[7] Keith A. Brewster, Daniel B. Weber, Kevin W. Thomas, Kelvin K. Droegemeier, Yunheng Wang, Ming Xue, Suresh Marru, Dennis Gannon, Jay Alameda, Brian F. Jewett, Jack S. Kain, Steven J. Weiss, and Marcus Christie. Use of the LEAD Portal for On-Demand Severe Weather Prediction. 24th Conference on IIPS, 2008.

[15] X. Dong, K.E. Gilbert, R. Guha, R. Heiland, J. Kim, M.E. Pierce, G.C. Fox, and D.J. Wild. Web Service Infrastructure for Chemoinformatics. J. Chem. Inf. Model, 47(4):1303–1307, 2007.
[16] K.K. Droegemeier, V. Chandrasekar, R. Clark, D. Gannon, S. Graves, E. Joseph, M. Ramamurthy, R. Wilhelmson, K. Brewster, B. Domenico, et al. Linked environments for atmospheric discovery (LEAD): A cyberinfrastructure for mesoscale meteorology research and education. 20th Conf. on Interactive Information Processing Systems for Meteorology, Oceanography, and Hydrology, 2004.
[17] I. Foster. Globus Toolkit Version 4: Software for Service-Oriented Systems. Network and Parallel Computing: IFIP International Conference, NPC 2005, Beijing, China, November 30-December 3, 2005: Proceedings, 2005.
[18] GRIDS Center. A Portal-based User Registration Service for Grids. http://www.grids-center.org/solutions/purse/, 2005.
[19] Z. Guan, V. Velusamy, and P. Bangalore. GridDeploy: A Toolkit for Deploying Applications as Grid Services. Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'05), Volume II, pages 764–765, 2005.
[20] J. Hofer and T. Fahringer. Presenting Scientific Legacy Programs as Grid Services via Program Synthesis. Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, 2006.
[21] Y. Huang, A. Slominski, C. Herath, and D. Gannon. WS-Messenger: A Web Services based Messaging System for Service-Oriented Grid Computing. 6th IEEE International Symposium on Cluster Computing and the Grid (CCGrid06), 2006.
[22] D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M.R. Pocock, P. Li, and T. Oinn. Taverna: a tool for building and running workflows of services. Nucleic Acids Research, 34(Web Server issue):W729, 2006.

[8] M. Bubak, T. Gubała, M. Kapałka, M. Malawski, and K. Rycerz. Workflow composer and service registry for grid applications. Future Generation Computer Systems, 21(1):79–86, 2005.

[23] IBM. WebSphere Service Registry. http://www-306.ibm.com/software/integration/wsrr/, 2008.

[9] C. Catlett. The philosophy of TeraGrid: building an open, extensible, distributed TeraScale facility. Cluster Computing and the Grid 2nd IEEE/ACM International Symposium CCGRID2002, pages 5–5, 2002.

[24] P. Kacsuk, A. Goyeneche, T. Delaitre, T. Kiss, Z. Farkas, and T. Boczko. High-level Grid Application Environment to Use Legacy Codes as OGSA Grid Services. Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, 2004.

[10] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang. Programming Scientific and Distributed Workflow with Triana Services. Grid Workflow, 2004.
[11] R.D. Clark, Suresh Marru, Marcus Christie, Dennis Gannon, Brad Illston, Thomas Baltzer, and Kelvin K. Droegemeier. The LEAD-WxChallenge pilot project: enabling the community. 24th Conference on IIPS, 2008.
[12] F. Curbera, M. Duftler, R. Khalaf, W. Nagy, N. Mukhi, and S. Weerawarana. Unraveling the Web services web: an introduction to SOAP, WSDL, and UDDI. Internet Computing, IEEE, 6(2):86–93, 2002.
[13] K. Czajkowski, S. Fitzgerald, I. Foster, and C. Kesselman. Grid Information Services for Distributed Resource Sharing. 10th IEEE International Symposium on High Performance Distributed Computing, 184, 2001.
[14] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke. A Resource Management Architecture for Metacomputing Systems. Job Scheduling Strategies for Parallel Processing: IPPS/SPDP 98 Workshop, Orlando, Florida, USA, March 30, 1998: Proceedings, 1998.

[25] G. Kandaswamy, L. Fang, Y. Huang, S. Shirasuna, S. Marru, and D. Gannon. Building web services for scientific grid applications. IBM Journal of Research and Development, 50(2/3):249–260, 2006.
[26] G. Kandaswamy and D. Gannon. A Mechanism for Creating Scientific Application Services On-demand from Workflows. Proceedings of the 2006 International Conference Workshops on Parallel Processing, pages 25–32, 2006.
[27] M. Kasztelnik, M. Bubak, C. Gorka, M. Malawski, and T. Gubala. Fault Tolerant Grid Registry.
[28] S. Krishnan, B. Stearn, K. Bhatia, K.K. Baldridge, W.W. Li, and P. Arzberger. Opal: Simple Web Services Wrappers for Scientific Applications. International Conference for Web Services, 2006.
[29] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J. Tao, and Y. Zhao. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, 2005.

[30] S. Martin, P. Lane, I. Foster, and M. Christie. TeraGrid's GRAM Auditing and Accounting, and its Integration with the LEAD Science Gateway. TeraGrid 2007 Conference, Madison.
[31] A. Matsunaga, M. Tsugawa, S. Adabala, R. Figueiredo, H. Lam, and J. Fortes. Science gateways made easy: the In-VIGO approach. Concurrency and Computation: Practice and Experience.
[32] A. Matsunaga, M. Tsugawa, and J.A.B. Fortes. Integration of text-based applications into service-oriented architectures for transnational digital government. Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains, pages 112–121, 2007.
[33] J. Michalakes, S. Chen, J. Dudhia, L. Hart, J. Klemp, J. Middlecoff, and W. Skamarock. Development of a next-generation regional weather research and forecast model. 2001.
[34] NOAA. Next-Generation Radar. http://radar.weather.gov/.
[35] J. Novotny, S. Tuecke, and V. Welch. An Online Credential Repository for the Grid: MyProxy. Proceedings of the Tenth International Symposium on High Performance Distributed Computing (HPDC-10), pages 104–111, 2001.
[36] OGCE. Open Grid Computing Environments. http://www.collab-ogce.org/ogce/index.php/Main_Page.
[37] S. Perera and D. Gannon. Enabling Web Service Extensions for Scientific Workflows. Workshop on Workflows in Support of Large-Scale Science (WORKS06), in conjunction with HPDC06, Paris, France, 2006.
[38] M. Pierce and G. Fox. Making scientific applications as Web services. Computing in Science and Engineering, 6(1):93–96, 2004.
[39] R. Ramachandran et al. LEAD Metadata Schema for Geospatial Data Sets Based on FGDC Standard. 2005.
[40] J. Rushing, R. Ramachandran, U. Nair, S. Graves, R. Welch, and H. Lin. ADaM: a data mining toolkit for scientists and engineers. Computers and Geosciences, 31(5):607–618, 2005.
[41] V. Sanjeepan, A. Matsunaga, L. Zhu, H. Lam, and J.A.B. Fortes. A Service-Oriented, Scalable Approach to Grid-Enabling of Legacy Scientific Applications.
Proc. of the International Conference on Web Services 2005, 2005.
[42] S. Shirasuna. Ph.D. Thesis, Department of Computer Science, Indiana University, 2007.
[43] R. Schmidt. Java Network Launching Protocol (JNLP) Specification 1.0.1. 2001.
[44] M. Senger, P. Rice, and T. Oinn. Soaplab - a unified Sesame door to analysis tools. Proc. UK e-Science programme All Hands Conference, pages 2–4, 2003.
[45] A. Slominski. Adapting BPEL to Scientific Workflows. Workflows for e-Science, pages 212–230.
[46] A. Slominski and G. von Laszewski. Scientific Workflows Survey.
[47] American Meteorological Society. Annual meeting. http://ams.confex.com/ams/.
[48] J.L. Tilson, G. Rendon, M.F. Ger, and E. Jakobsson. MotifNetwork: A Grid-enabled Workflow for High-throughput Domain Analysis of Biological Sequences: Implications for annotation and study of phylogeny, protein interactions, and intraspecies variation. Bioinformatics and Bioengineering, 2007 (BIBE 2007), Proceedings of the 7th IEEE International Conference on, pages 620–627, 2007.

[49] VGrADS. Virtual Grid Execution System. http://vgrads.rice.edu/research/execution_system/vges-overview.
[50] S. Vinoski. Chain of responsibility. Internet Computing, IEEE, 6(6):80–83, 2002.
[51] S.J. Weiss, J.S. Kain, D.R. Bright, J.J. Levit, G.W. Carbin, M.E. Pyle, Z.I. Janjic, B.S. Ferrier, J. Du, M.L. Weisman, et al. The NOAA Hazardous Weather Testbed: Collaborative testing of ensemble and convection-allowing WRF models and subsequent transfer to operations at the Storm Prediction Center. 22nd Conf. Wea. Anal. Forecasting/18th Conf. Num. Wea. Pred., Salt Lake City, Utah, Amer. Meteor. Soc., CDROM 6B.4, 2007.
[52] R. Wolski, N. Spring, and J. Hayes. The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing. Journal of Future Generation Computing Systems, 15(5-6):757–768, 1999.
[53] S.C. Wong, V. Tan, W. Fang, S. Miles, and L. Moreau. Grimoires: Grid Registry with Metadata Oriented Interface: Robustness, Efficiency, Security - Work-in-Progress. 2005.
[54] M. Xue, D. Wang, J. Gao, K. Brewster, and K.K. Droegemeier. The Advanced Regional Prediction System (ARPS), storm-scale numerical weather prediction and data assimilation. Meteorology and Atmospheric Physics, 82(1):139–170, 2003.
[55] J. Yu and R. Buyya. A taxonomy of scientific workflow systems for grid computing. SIGMOD Record, 34(3):44–49, 2005.
