
Integrating Command Line Programs as Web Services in a Grid environment for Biomedical Tools

M. García1, J. Karlsson1,* and O. Trelles1

1 Department of Computer Architecture, Malaga University, Spain


ABSTRACT

Motivation: The preferred way to provide universal remote access to bioinformatics applications is in the form of Web Services under the paradigm of Service Oriented Architectures (SOA). However, the sensitivity of biomedical data suggests the use of Grids to create a secure environment. There is a need to integrate both approaches to better exploit their particular advantages.

Results: This paper reports the design, implementation and exploitation of a catalogue for knowledge-based discovery of services, which stores metadata describing biomedical services and enables discovery and service composition (as workflows), and of a software module that wraps existing command-line based applications as secure grid services able to handle sensitive information. Tool providers (service providers, workflow authors etc.) can publish metadata for their tools, making them instantly available to clients. The architecture and metadata are exemplified through a set of Grid services that show the entire development process of a service, including secure execution with authentication. Several successful clients have been implemented in the framework of the EU project Advancing Clinico-Genomic Trials on Cancer (ACGT; http://www.eu-acgt.org) to cover different aspects of data processing, ranging from service discovery to service invocation with advanced features such as service composition (semi-automatic workflow generation).

Availability: The software is freely available (see supplementary information).

Keywords: Metadata repository; service discovery; service integration; service execution; Web Services and Grid.

Contact: [email protected]

Supplementary information: http://bitlab-es.com/repository/

1 INTRODUCTION

The availability of the human genome and other breakthroughs in post-genomic research will constitute the basis for personalized healthcare, with individually adapted therapies and diagnoses complemented by genetic profiling of patients. New technological advances such as Next Generation Sequencing (Shendure, J. et al. 2008), ultra-high density microarrays, and others pose additional challenges for computational platforms, as they must manage massive sets of data. Distributed and advanced parallel computing has been suggested as an effective solution, where grid computing is particularly suitable for these tasks because of its inherent ability to use distributed computational resources effectively in secure environments.

* To whom correspondence should be addressed.

© Oxford University Press 2010

However, most of the many existing bioinformatics services that could be useful in biomedicine use Web Services technology (SOA, Service Oriented Architecture). This approach supports interoperable machine-to-machine interaction over a network. An important side effect of interoperability is the ability to build workflows: pre-defined, organized lists of Web services that solve complex problems.

To enable the construction of complex service compositions in a Grid environment, data types and formats must be standardised and service metadata must be managed efficiently. Tool interfaces must therefore be described in a machine-friendly way. Clients that use these distributed tools need to be able to dynamically discover and use new tools and algorithms. Since the discovery process is supported by tool metadata, such metadata is shared in a public repository.

The approach of sharing tool metadata in public repositories is not new, and several approaches have been published in domains other than biomedicine. BioMOBY (Wilkinson, M. et al. 2003), myGrid (Stevens, R.D. et al. 2003) and other systems use this strategy to implement integration architectures. On the Grid side, the Globus framework (Foster, I. 2006) includes a Monitoring and Discovery System (MDS), a suite of Web Services for integrating new resources and tools in distributed systems. MDS focuses on disseminating and gathering information for grids and Virtual Organizations rather than on elaborating a complete model schema for these tools. In a similar context, caGrid (Oster, S. et al. 2007) provides components for biomedical research, including support for publishing, discovery, access and management of data source and tool metadata.
Secure access is also implemented by restricting access to services according to Grid credentials and trust levels. However, the caGrid infrastructure is specifically focused on cancer research, and most of its programmatic support covers the structure and semantic aspects of these types of data.

Two main concepts must be kept in mind. Command Line Programs (CLP) are a very common style for deploying computational services inside a grid infrastructure. Web-service technology, on the other hand, enables communication with external resources. To better exploit both strategies and benefit from the high number of CLPs available, we have designed and developed two software components. One enables sharing and discovery of high-level service descriptions and the other enables the execution of these registered tools.

Grid services are described as abstractions of software components that provide data analysis capabilities for clinical trials. These service descriptions include metadata on the service communication protocol in the form of WSDL (WSDL; http://www.w3.org/TR/wsdl) descriptions, free-text service documentation, and metadata on the semantic datatypes of parameters and the functional descriptions of services, which in turn can be exploited for service interoperability. To the best of our knowledge, this is the first repository of tools specifically used for biomedical research.

The objective of this paper is to describe and motivate the structure of our semantic descriptions and to outline how metadata is collected, stored and used in a Grid-based architecture for biomedical research. To this end, we report the following results:
• A common metadata schema used for the integration of the different types of services (see section 3).
• A software component to develop secure Grid services based on command-line applications (see section 2).

2 METHODS

ACGT (ACGT; http://www.eu-acgt.org/) is an Integrated Project funded by the European Commission in the domain of cancer research. One of the main objectives of ACGT is to improve medical knowledge discovery through the design and development of a Grid platform for the discovery and integration of biomedical data. The ACGT team has accomplished these objectives by developing an integrated, Grid-compatible software platform to support secure, multi-centre clinical trials. Following a top-down layered design pattern, the ACGT architecture is organised in the following layers (see supplementary material for more information):
• User access.
• Business process services.
• Advanced Grid middleware.
• Common Grid infrastructure (using Globus toolkit).
• Hardware layer and computational resources.

Biomedical environments require a high level of security. In ACGT, this is handled by the Gridge Authorisation Service (GAS). User credentials from GAS are delegated in several parts of the platform, such as user data management (Data Management System, DMS) and service invocations for data analysis (GRMS) (Pukacki J. et al. 2006). Please see also section 2.2. In this architecture, there is a clear need for a specific component that standardizes the diversity of data types and data formats, facilitates the integration of services and allows discovery and invocation of tools.

2.1 ACGT Metadata Repository

To standardize tool metadata definitions in the ACGT architecture, we have developed a repository for metadata related to service-oriented architectures in Grids. The repository database is accessible through the RepoServices API. RepoServices is organised in three main database tables or modules: tools, functional categories and data types (see section 3). Programmatic access to each module is implemented in three layers (see Figure 1): Database, Persistence and Web Services. RepoServices can be accessed locally (persistence layer) or remotely (Web service layer). In ACGT, the repository is deployed in two different instances: production and development. The first is stable and is used as the front end of the system; the second, more up-to-date, allows services to be tested before public deployment (see also section 4.1). Details about the instances and how to install and connect to this software are available as supplementary material. The ACGT Repository is used to perform the following tasks:
• Publish (register) tool metadata by service providers.
• Find (discover) tool metadata by clients.
• Provide necessary details for clients to invoke tools.
• Modify existing tool metadata.
• Retrieve all tool metadata (for metadata browsing tools).
The repository is essential to enable correct integration of CLPs.

[Figure 1 labels: Clients; Web Services Layer; Persistence Layer; Database Layer; Java Objects; C++ Objects; XML Docs; Other Objects]

Fig. 1. RepoServices Architecture Layers. A) Clients of RepoServices. B) Web Service layer, using Axis (http://ws.apache.org/axis/) as Web Service engine and Tomcat (http://tomcat.apache.org/) as servlet container. C) Persistence layer, in which Hibernate (http://www.hibernate.org/) mappings are used to create persistent Java objects. D) Data storage layer: metadata schema implemented with MySQL database tables.

2.2 Integration of CLP in the ACGT environment

In order to use CLPs inside the secure ACGT Grid environment, several requirements need to be considered:

Grid security infrastructure. The Globus toolkit provides a Grid Security Infrastructure (GSI) to ensure that many different users and many kinds of organizations can securely share data and access computational resources in the Grid. GSI includes authentication with credential delegation and transport/message-level security at the low level of the Grid environment. At a higher level, ACGT uses the Gridge Authorisation Service (GAS) to support authorisation operations. GAS is used by the components in ACGT to confirm the credentials of users who want to use them. These credentials are used to control the delegation of rights in the entire architecture.

File system support. In high-performance environments, a mechanism is needed to provide fast access to, and management of, large amounts of data. Additionally, data needs to be annotated with metadata. ACGT uses the Data Management Suite (DMS), where secure access is ensured using user credentials controlled in GAS. Metadata can be attached to data directly by tools during execution (execution times, author, MIME types, dates, etc.).

Execution. Job submission is usually one of the most complex processes in a Grid environment. To manage the whole process of remote job submission, ACGT uses the Gridge Resource Management System (GRMS), which has an API to launch, resume and monitor jobs. GRMS is quite flexible in the treatment of data input sources, allowing local files, remote files (Aloisio G. et al. 2002) or DMS files.

2.3 CLP life cycle

Including new CLPs in the Grid is carried out in two main steps: first, install the CLP on available servers in the Grid; second, register the service metadata in the ACGT Metadata Repository. Once registered, the service metadata is used to discover and invoke the CLP in the ACGT environment.

Registering. Registration is performed in the ACGT portal at http://rd.siveco.ro/acgt. Metadata details are given in section 3. Providers first need to check that the necessary data types and functional categories are available; otherwise these must be registered first. Metadata related to invocation includes parameter and tool-location metadata. The parameter information is later used by clients to automatically generate service interfaces. Details such as input/output files, optional parameters and tool location (including endpoint information) are also provided in this registration step.

Discovering. Client applications with correct user credentials may access the metadata repository to discover tools based on descriptions, input/output datatypes, functional categories etc. Service discovery is typically performed using the Magallanes application (see section 4).

Generic Web service for CLP. This Web service is the default Web service for executing CLPs. The CLPs themselves are registered as abstract tools, each with the endpoint of the actual Web service (the generic Web service). This service has been developed using Axis as SOAP (SOAP; http://www.w3.org/TR/soap/) engine and uses Tomcat as servlet container. The servlet container must comply with the secure access restrictions of ACGT. Currently, this Web service is installed on two servers in the ACGT Grid. The client can launch and control execution using three operations available in the generic Web service:
• RunAsync. Launches the execution in GRMS with the following parameters: host, relative path, DMS identifiers for inputs, options, and metadata for generated output files (MIME type, data types, author, etc.). Sets of data files can also be used as input and output.
• GetJobStatus. Returns the current status of the execution.
• GetResult. Retrieves the DMS identifiers of the generated outputs.

Figure 2 shows a simplified UML sequence diagram of the events between the software components. Assuming a scenario where the user has already been correctly identified with their credentials, the client program gets tool metadata from the ACGT Metadata Repository. The client now has enough information to build a user interface in the application. The client program requests the user's data through the interface and creates a RunAsync call to the generic Web service. In this operation, a jobId is generated to be used in subsequent operations. The generic Web service then retrieves the information needed to store the results later and creates a job description to be submitted to the Grid through GRMS. The client application can monitor the current status of the job using the GetJobStatus operation. When the status is Finished, the client can retrieve the results using the GetResult operation.

Fig. 2. Simplified UML sequence diagram of events. See the main text for a detailed description; a full version of this diagram is available as supplementary material.
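The client-side control flow around the three operations of the generic Web service can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `StubGenericService` is a hypothetical in-memory stand-in (it makes no SOAP, GRMS or DMS calls), the method and argument names are ours, and the intermediate status value "Running" is assumed, since the text only names the status Finished.

```python
import itertools


class StubGenericService:
    """Hypothetical in-memory stand-in for the generic Web service.

    No SOAP, GRMS or DMS calls are made; job state is faked so that the
    client-side control flow can be run locally.
    """

    def __init__(self, polls_until_finished=2):
        self._polls_until_finished = polls_until_finished
        self._counter = itertools.count(1)
        self._jobs = {}

    def run_async(self, host, relative_path, input_ids, options, output_metadata):
        # RunAsync: register the job and hand back a jobId for later calls.
        job_id = f"job-{next(self._counter)}"
        self._jobs[job_id] = {"polls": 0, "outputs": [f"dms-out-{job_id}"]}
        return job_id

    def get_job_status(self, job_id):
        # GetJobStatus: "Finished" comes from the text; "Running" is assumed.
        job = self._jobs[job_id]
        job["polls"] += 1
        if job["polls"] > self._polls_until_finished:
            return "Finished"
        return "Running"

    def get_result(self, job_id):
        # GetResult: DMS identifiers of the generated outputs.
        return list(self._jobs[job_id]["outputs"])


def run_and_wait(service, max_polls=10, **call_parameters):
    """Submit a job, poll its status, and return the output identifiers."""
    job_id = service.run_async(**call_parameters)
    for _ in range(max_polls):
        if service.get_job_status(job_id) == "Finished":
            return service.get_result(job_id)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


# All argument values below are made-up examples.
outputs = run_and_wait(
    StubGenericService(),
    host="grid-node.example.org",
    relative_path="bin/cluster_tool",
    input_ids=["dms-in-1"],
    options="-k 3",
    output_metadata={"mime_type": "text/plain", "author": "demo"},
)
print(outputs)
```

A real client would wait between polls and pass the user's credentials with every call; both are left out here to keep the sketch short.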

3 RESULTS

A well-defined metadata schema is essential to enable tool providers to publish metadata descriptions for their tools and clients to find (discover) tools based on metadata. The scope of the metadata schema includes descriptions of tools and their operations (specific functions of a tool; see section 3.1), data types (see section 3.2) and functional categories (see section 3.3). In the following subsections, only the main concepts of the schema are described. The complete metadata schema can be found in supplementary information.

3.1 Tools

A Tool represents a group of software components that can be used to solve a specific type of problem and acts as a container of operations with closely related functionality. An operation is a software sub-component that solves a specific problem and has several parameters, each either an input or an output. Each parameter is associated with a specific data type (see section 3.2). The metadata for tools includes the author and authority (the author’s affiliation). Each tool and operation can be associated with human-readable descriptions (long and short). The short description is intended for quick browsing and the longer version is intended for users who wish to study the tool or operation documentation more carefully.

The tool metadata also specifies the type of tool. Several types are currently supported: traditional SOAP services, BioMOBY services, BPEL (BPEL; http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.pdf) workflows, and secure ACGT services based on command-line programs (see section 2.2) or on R packages (R Development Core Team 2005). For traditional SOAP-based Web services, the WSDL description must be included at registration. The WSDL contains all the information needed for the client to bind to the service (such as the protocol specifics and endpoint). For BioMOBY Web services, WSDL is not used to specify the format of the data. In this case, the data type metadata is used to infer the exact format according to the BioMOBY specification. Workflows are viewed as abstract tools (“black boxes”) which require inputs and produce outputs. Workflow metadata includes additional items such as an image representing the workflow and its definition (in BPEL format). For ACGT services based on command-line execution, the schema includes the command to execute (path). R-based ACGT services include the R script, which is retrieved by the service and executed on the Grid.

Table 1. Summary of registered services.

Type of service     # available   Scope
BioMOBY services    31            General biomedical and bioinformatics Web services.
CLP                 30            CLPs for gene expression data, accessible through the generic Web service.
Mediator queries    15            Predefined queries on ontology search.
R-Scripts           50            Statistical treatment of data and data mining methods.
Workflows           10            Complex data analysis.

Summary of current tools registered in the repository. The number of services can vary between the development and stable versions of the repository.

As a proof of concept, we have deployed and registered in the repository more than 60 services in the fields of gene expression data processing (CLP) and general bioinformatics (BioMOBY). These tools share the repository with R scripts and mediator queries developed by other members of the ACGT consortium, so that tools of different kinds can all be used through the information stored in the metadata repository. Table 1 summarizes the number of registered services per type. An updated list of available CLPs with complete descriptions can be found in supplementary material.
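The main concepts of this section (a tool as a container of operations, operations with typed input and output parameters, and descriptive fields such as author and authority) can be pictured with a small data model. The sketch below is not the ACGT schema itself, which is given in the supplementary information; all class names, field names and example values are hypothetical and chosen only to mirror the prose.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    datatype: str   # name of a data type in the shared taxonomy
    direction: str  # "input" or "output"


@dataclass
class Operation:
    name: str
    short_description: str
    long_description: str = ""
    parameters: List[Parameter] = field(default_factory=list)

    def inputs(self) -> List[Parameter]:
        return [p for p in self.parameters if p.direction == "input"]

    def outputs(self) -> List[Parameter]:
        return [p for p in self.parameters if p.direction == "output"]


@dataclass
class Tool:
    name: str
    author: str
    authority: str                      # the author's affiliation
    tool_type: str                      # e.g. "SOAP", "BioMOBY", "BPEL", "CLP", "R"
    short_description: str
    long_description: str = ""
    command_path: Optional[str] = None  # only meaningful for CLP tools
    operations: List[Operation] = field(default_factory=list)


# A made-up CLP registration, for illustration only.
kmeans = Tool(
    name="kmeans",
    author="A. Author",
    authority="Example University",
    tool_type="CLP",
    short_description="k-means clustering of expression profiles",
    command_path="bin/kmeans",
    operations=[
        Operation(
            name="run",
            short_description="Cluster the rows of a matrix",
            parameters=[
                Parameter("matrix", "ExpressionMatrix", "input"),
                Parameter("clusters", "ClusterList", "output"),
            ],
        )
    ],
)
print([p.name for p in kmeans.operations[0].inputs()])
```

Keeping the parameter data type as a name rather than an embedded structure reflects the modular split between tools and data types described in the text.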

3.2 Datatypes

The data type metadata defines a shared taxonomy of data types. This enables tool composition (combination). The taxonomy follows the object-oriented paradigm where data types are related to other data types. Data types can inherit parts from another data type and add additional structure. They may include (contain) or be arrays of other data types. The interpretation of such relations between data types is domain-specific for the service type. For example, BioMOBY Web services would interpret these relations as directly specifying the data format. For generic SOAP services, these tables would only be used to specify a hierarchy of data types without any assumptions of the data formats (which are specified in the WSDL descriptions).
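The three kinds of relation mentioned above (inheritance, inclusion and arrays) can be illustrated with a minimal taxonomy. This is a sketch under our own assumptions: the class and field names are hypothetical, the example type names are invented, and cycles in the inheritance chain are not checked.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataType:
    name: str
    parent: Optional[str] = None    # inheritance ("is a")
    contains: Tuple[str, ...] = ()  # inclusion ("has a")
    array_of: Optional[str] = None  # array ("list of")


class Taxonomy:
    def __init__(self) -> None:
        self._types: Dict[str, DataType] = {}

    def add(self, datatype: DataType) -> None:
        self._types[datatype.name] = datatype

    def lineage(self, name: str) -> List[str]:
        """Return the type name followed by all of its ancestors."""
        chain: List[str] = []
        current: Optional[str] = name
        while current is not None:
            chain.append(current)
            current = self._types[current].parent
        return chain

    def is_a(self, name: str, ancestor: str) -> bool:
        """True if `name` is `ancestor` or inherits from it."""
        return ancestor in self.lineage(name)


# Invented example types.
taxonomy = Taxonomy()
taxonomy.add(DataType("Object"))
taxonomy.add(DataType("String", parent="Object"))
taxonomy.add(DataType("Sequence", parent="Object", contains=("String",)))
taxonomy.add(DataType("DNASequence", parent="Sequence"))
taxonomy.add(DataType("SequenceList", parent="Object", array_of="Sequence"))

print(taxonomy.lineage("DNASequence"))
print(taxonomy.is_a("DNASequence", "Sequence"))
```

How the `contains` and `array_of` relations are interpreted is left to the service type, as the text explains: a BioMOBY client would read them as a data format, whereas a generic SOAP client would use only the hierarchy.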

3.3 Functional descriptions

Tools can be associated with one or more functional categories. A functional category is a keyword that describes the function of the tool. Functional categories can be related to other functional categories to create a taxonomy of keywords. If the keywords are arranged in a hierarchical structure, clients can discover tools that are annotated with a generic functional keyword or with any keyword inheriting from it. For example, if the functional category taxonomy consists of the keyword “clustering” and two sub-keywords inheriting from it, “hierarchical clustering” and “k-means clustering”, searching for tools annotated with “clustering” would also return tools that are annotated with “hierarchical clustering” or “k-means clustering”.
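The clustering example can be written out as a short search over a keyword hierarchy. The keyword names come from the text; the tool names, the extra "normalisation" keyword and the function names are hypothetical and serve only to make the sketch runnable.

```python
from typing import Dict, List, Optional, Set

# Child keyword -> parent keyword (keywords without an entry are roots).
CATEGORY_PARENT: Dict[str, str] = {
    "hierarchical clustering": "clustering",
    "k-means clustering": "clustering",
}

# Made-up tool annotations.
TOOL_CATEGORIES: Dict[str, Set[str]] = {
    "hclust-tool": {"hierarchical clustering"},
    "kmeans-tool": {"k-means clustering"},
    "generic-cluster-tool": {"clustering"},
    "normalise-tool": {"normalisation"},
}


def matches(category: str, keyword: str, parent_of: Dict[str, str]) -> bool:
    """True if `category` is `keyword` or inherits from it."""
    current: Optional[str] = category
    while current is not None:
        if current == keyword:
            return True
        current = parent_of.get(current)
    return False


def find_tools(keyword: str,
               tool_categories: Dict[str, Set[str]] = TOOL_CATEGORIES,
               parent_of: Dict[str, str] = CATEGORY_PARENT) -> List[str]:
    """Return tools annotated with `keyword` or any inheriting keyword."""
    return sorted(
        tool
        for tool, categories in tool_categories.items()
        if any(matches(c, keyword, parent_of) for c in categories)
    )


print(find_tools("clustering"))
print(find_tools("k-means clustering"))
```

Searching for the generic keyword returns all three clustering tools, while searching for a sub-keyword returns only the tools annotated with it.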

4 DISCUSSION AND CONCLUSIONS

We have designed a software component that allows service providers to share service descriptions, including semantics, and that facilitates CLP integration in the ACGT platform. Compatibility between services is facilitated by enforcing the use of a shared data type ontology.

4.1 Scalability

An important issue for tool catalogues is the ability to curate the registry. The main catalogue for BioMOBY services, MobyCentral at the University of Calgary (http://moby.ucalgary.ca/moby/MOBY-Central.pl), has a large number of registrations: 1600 bioinformatics Web services and almost 800 data types (January 2009). Anyone can register services in that catalogue, without any centralized control. Learning from this experience, one alternative that we are considering is that new services and data types are suggested by service providers but are not added to the official metadata repository without prior approval from an ontology committee. A development catalogue is available for testing purposes, where new data types and services can be registered without restriction. This approach is less scalable than the open approach of BioMOBY but attempts a trade-off between openness (scalability) and control.

4.2 Compatibility with standard web-service descriptions

The metadata schema we developed is strongly influenced by the BioMOBY metadata. However, one requirement during development was the ability to modularize the metadata: to store the metadata on different servers and to combine it according to the needs of a specific project. For example, some projects might need only data types, while others might need data types and services. Therefore, we extended the BioMOBY metadata with additional documentation details and split the metadata schema into several modules with various degrees of independence.

The default way to describe SOAP services is by WSDL descriptions. However, WSDL files are normally used to describe a single service, while our repository is intended to store a larger catalogue of services (and other types of tools). Concepts in our metadata schema correspond to WSDL artifacts:
• Tools are similar to the WSDL document itself (both are wrappers of specific functionalities).
• An Operation is an abstract concept (a specific functionality) and has a role similar to WSDL port-types.
• Parameters are connected to an operation and a data type, so they match WSDL messages.
• ToolLocation metadata serves a purpose similar to WSDL bindings.
• Data types in the ACGT repository correspond to WSDL types (XML Schema).

However, our metadata descriptions also extend the information from WSDL with:
• A shared datatype taxonomy, which allows clients to dynamically discover compatible tools, where the output of one tool can be used as input for another tool that uses the same or a derived datatype (via inheritance).
• Functional categories, which describe the functionality of a tool. This is useful if the client wishes to discover only services that perform a certain task.

The situation regarding data types is more complex. WSDL traditionally uses XML Schema to describe the structure of the XML data for the inputs and outputs of a service. XML Schema is a more expressive data description format than the approach in the metadata schema. However, we believe it is very useful to maintain a shared hierarchy of data types to answer discovery (find) queries such as “show all tools that have operations with data type X as part of the input”. In this sense, the datatype taxonomy can be said to describe the semantics of input and output parameters instead of specifying exact data formats (except for BioMOBY services, see section 3.2).
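A discovery query of the kind quoted above ("show all tools that have operations with data type X as part of the input") can be sketched over a toy registry. The registry layout, the tool names and the type names are hypothetical, and the optional `include_ancestors` switch is our own illustration of how the shared hierarchy could widen the query to tools that accept a more generic type.

```python
from typing import Dict, List, Optional

# Data type -> parent type (None marks the root); invented example types.
TYPE_PARENT: Dict[str, Optional[str]] = {
    "Object": None,
    "Matrix": "Object",
    "ExpressionMatrix": "Matrix",
}

# Tool -> operation -> input and output data types; made-up entries.
REGISTRY: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "normalise": {"run": {"inputs": ["ExpressionMatrix"], "outputs": ["ExpressionMatrix"]}},
    "transpose": {"run": {"inputs": ["Matrix"], "outputs": ["Matrix"]}},
    "report": {"render": {"inputs": ["Object", "Matrix"], "outputs": ["Object"]}},
}


def lineage(datatype: str) -> List[str]:
    """Return the data type followed by all of its ancestors."""
    chain: List[str] = []
    current: Optional[str] = datatype
    while current is not None:
        chain.append(current)
        current = TYPE_PARENT[current]
    return chain


def tools_with_input(datatype: str, include_ancestors: bool = False) -> List[str]:
    """Tools having an operation with `datatype` among its inputs.

    With include_ancestors=True, inputs declared with a more generic
    (ancestor) type also count as matches.
    """
    accepted = set(lineage(datatype)) if include_ancestors else {datatype}
    hits = []
    for tool, operations in REGISTRY.items():
        if any(accepted.intersection(op["inputs"]) for op in operations.values()):
            hits.append(tool)
    return sorted(hits)


print(tools_with_input("Matrix"))
print(tools_with_input("ExpressionMatrix", include_ancestors=True))
```

Answering this query needs only data type names and their hierarchy, not the exact data formats, which is the point made in the paragraph above.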

4.3 Coverage

Note that the ACGT architecture (see section 2) includes many other components that also deal with metadata and thereby complement the data stored in the tool catalogue. For example, Virtual Organizations (VO) contain metadata regarding a set of members (corresponding to users). Rights to use data or services are decided on the basis of the VO membership of users. The Data Management System (DMS) in the ACGT architecture stores the data files and is able to annotate the files with metadata (for example, simple provenance metadata).

4.4 Demonstration of suitability (usage)

As a demonstration of the suitability of the catalogue within the ACGT Grid architecture, we describe several components and clients that rely on the catalogue metadata.

The ACGT portal presents the ACGT system to end-users. Several portlets (a specific type of web page) have been developed to allow registry administrators to manage tool metadata. These pages provide a graphical interface to the metadata in the tool catalogue. Other portlets allow users to explore the tools, organized according to their functional categories and the datatype taxonomies.

Additionally, a novel service discovery component, Magallanes (Ríos J. et al. 2009), has been integrated as a portlet. This component allows users to supply keywords, which are matched (exactly or by approximation) against service and datatype descriptions. The component is also able to learn from user selections, which are used to continuously improve the quality of the search results. Magallanes is also available as a stand-alone desktop application. Besides the functionality included within the ACGT portal, the application is able to use metadata (in particular parameter datatypes) to generate possible service compositions (workflows) based on user specifications of the desired input and output datatypes.

The workflow editor and enactor, AWE (Sfakianakis S. et al. 2009), connects to the repository to retrieve tool metadata. AWE visualizes tool metadata in a browsable tree. Users can select tools from the tree and combine them in a workflow. The editor uses the metadata for operations and parameters to verify the consistency of the workflow. When the workflow is saved, AWE collects the required information (such as the WSDL definitions) and includes it in a BPEL workflow. This workflow contains all the information needed to enact it.

jOrca (Martín-Requena, V. et al. 2010; http://www.bitlab-es.com/jorca) is a powerful and portable desktop client, highly customizable to cover a broad range of user skills. The client provides several interesting features: access to several repositories with different protocols; searching for compatible tools based on user data or keywords; an embedded file system for handling local user files; and user-defined lists of favourite tools. Workflows with CLP services can be generated and, as can be observed in Figure 3, visualized during enactment.

Fig. 3. Workflow enactment of command line programs in the jOrca standalone client. In this panel, the user can specify intermediate parameters and inputs. The enactment is visualized to the left.
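Generating possible service compositions from a desired input and output datatype, as described for Magallanes in this section, can be pictured as a path search over tools linked by their parameter datatypes. The sketch below uses a plain breadth-first search with exact type matching; it is not the Magallanes algorithm, and the tool names, type names and the one-input, one-output simplification are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Tool name -> (input datatype, output datatype); made-up entries.
TOOLS: Dict[str, Tuple[str, str]] = {
    "parse_raw": ("RawArrayFile", "ExpressionMatrix"),
    "normalise": ("ExpressionMatrix", "NormalisedMatrix"),
    "cluster": ("NormalisedMatrix", "ClusterTree"),
    "plot_tree": ("ClusterTree", "Image"),
    "summarise": ("ExpressionMatrix", "Report"),
}


def compose(source: str, target: str,
            tools: Dict[str, Tuple[str, str]] = TOOLS) -> Optional[List[str]]:
    """Return a shortest chain of tools turning `source` into `target`.

    Returns an empty list if no tool is needed and None if no chain exists.
    """
    if source == target:
        return []
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        datatype, path = queue.popleft()
        for name, (needs, gives) in tools.items():
            if needs != datatype or gives in seen:
                continue
            new_path = path + [name]
            if gives == target:
                return new_path
            seen.add(gives)
            queue.append((gives, new_path))
    return None


print(compose("RawArrayFile", "ClusterTree"))
print(compose("Image", "Report"))
```

A fuller version would accept derived datatypes through the shared taxonomy and handle tools with several inputs; both are omitted to keep the example small.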

4.5 Exporting results to other areas

The ACGT architecture has strong security requirements due to the sensitivity of the data. Therefore, proper credentials are needed in order to add new command line programs or to execute existing ones. These credentials can be obtained through the ACGT portal (http://rd.siveco.ro/acgt). Currently, we only support CLPs for secure ACGT services. However, the design principles of the CLP services can be applied to other types of service (for example, BioMOBY services). Such a generic BioMOBY service would be equally simple to use from the point of view of the CLP developer. The developer would adapt the steps in section 2.2 to BioMOBY and use any BioMOBY client software to send the parameters (specifying the path to the CLP etc.). There would be no requirement to understand complex aspects of the BioMOBY protocol.

ACKNOWLEDGEMENTS

Funding: This work has been partially financed by the National Institute for Bioinformatics (www.inab.org), a platform of Genoma España, and the EC project "Advancing Clinico-Genomic Trials on Cancer" (Contract No. 026996).

REFERENCES

Shendure, J. and Ji, H. (2008) Next-generation DNA sequencing. Nature Biotechnology, 26(10), 1135-1145.
Wilkinson, M.D., Gessler, D., Farmer, A. and Stein, L. (2003) The Bio-MOBY Project Explores Open-Source, Simple, Extensible Protocols for Enabling Biological Database Inter-operability. Proceedings Virtual Conference Genomic and Bioinformatics, (3), 16-26. ISSN 1547-383X.
Stevens, R.D. et al. (2003) myGrid: personalised bioinformatics on the information grid. Bioinformatics, 19 (Suppl. 1), i302-i304.
Foster, I. (2006) Globus Toolkit Version 4: Software for Service-Oriented Systems. IFIP International Conference on Network and Parallel Computing, Springer-Verlag LNCS 3779, pp. 2-13.
Oster, S., Langella, S., Hastings, S., Ervin, D., Madduri, R., Phillips, J., Kurc, T., Siebenlist, F., Covitz, P., Shanbhag, K. et al. (2007) caGrid 1.0: An Enterprise Grid Infrastructure for Biomedical Research. J Am Med Inform Assoc.
Pukacki, J. et al. (2006) Programming Grid Applications with Gridge. Computational Methods in Science and Technology, 12(1), 47-68.
Web Services Business Process Execution Language Version 2.0 (http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.pdf).
Aloisio, G., Cafaro, M. and Epicoco, I. (2002) Early experiences with the GridFTP protocol using the GRB-GSIFTP library. Future Generation Computer Systems, 18, 1053-1059. ISSN 0167-739X.
Ríos, J. et al. (2009) Magallanes: a web services discovery and automatic workflow composition tool. BMC Bioinformatics, 10:334.
Sfakianakis, S., Koumakis, L., Zacharioudakis, G. and Tsiknakis, M. (2009) Web-Based Authoring and Secure Enactment of Bioinformatics Workflows. Workshops at the Grid and Pervasive Computing Conference (GPC), pp. 88-95.
R Development Core Team (2005) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0 (http://www.r-project.org/).
Martín-Requena, V., Ríos, J., García, M., Ramírez, S. and Trelles, O. (2010) jORCA: easily integrating bioinformatics Web Services. Bioinformatics, 26(4), 553-559. doi:10.1093/bioinformatics/btp709.
