J Grid Computing (2006) 4: 209–222 DOI 10.1007/s10723-006-9045-5

GeneGrid: Architecture, Implementation and Application P. V. Jithesh · P. Donachy · T. Harmer · N. Kelly · R. Perrott · S. Wasnik · J. Johnston · M. McCurley · M. Townsley · S. McKee

Received: 1 September 2005 / Accepted: 24 April 2006 © Springer Science + Business Media B.V. 2006

Abstract The emergence of Grid computing technology has opened up an unprecedented opportunity for biologists to share and access data, resources and tools in an integrated environment, leading to a greater chance of knowledge discovery. GeneGrid is a Grid computing framework that seamlessly integrates a myriad of heterogeneous resources spanning multiple administrative domains and locations. It provides scientists with an integrated environment for streamlined access to a number of bioinformatics programs and databases through a simple and intuitive interface. It acts as a virtual bioinformatics laboratory by allowing scientists to create, execute and manage workflows that represent bioinformatics experiments. A number of cooperating Grid services interact in an orchestrated manner to provide this functionality. This paper gives insight into the details of the architecture, components and implementation of GeneGrid.

P. V. Jithesh (B) · P. Donachy · T. Harmer · N. Kelly · R. Perrott · S. Wasnik
Belfast e-Science Centre, Computer Science, The Queen's University of Belfast, SARC Building, Belfast, BT7 1NN, UK
e-mail: [email protected]

J. Johnston · M. McCurley · M. Townsley
Fusion Antibodies Ltd., Belfast, UK

S. McKee
Amtec Medical Ltd., Antrim, UK

Key words Bioinformatics · GeneGrid · Globus · Grid computing · Virtual Bioinformatics Laboratory

Abbreviations
OGSA: Open Grid Services Architecture
SOA: Service Oriented Architecture
OGSI: Open Grid Services Infrastructure
SOAP: Simple Object Access Protocol
GAMSF: GeneGrid Application Manager Service Factory
GAMS: GeneGrid Application Manager Service
OGSA-DAI: Open Grid Services Architecture-Database Access and Integration
GDMSF: GeneGrid Data Manager Service Factory
GDMS: GeneGrid Data Manager Service
GWDD: GeneGrid Workflow Definition Database
GSTRIP: GeneGrid Status Tracking, Results & Input Parameters Database
GWMSF: GeneGrid Workflow Manager Service Factory
GWMS: GeneGrid Workflow Manager Service
GNM: GeneGrid Node Monitor
GARR: GeneGrid Application and Resources Registry

1. Introduction

Improvements in technologies such as genome sequencing and expression analysis over the past decade have resulted in an unprecedented generation of biological data, pushing bioinformatics to the forefront of disciplines that demand massive storage, huge computing power and highly collaborative environments. Genome sequencing projects [1] such as the Human Genome Project, as well as post-genomic technologies such as microarrays, which enable genome-wide analysis of the expression of thousands of genes in a single assay, have necessitated the development of robust new tools for mining and analysing the data. However, the variety and complexity of data formats and the heterogeneity in the requirements of individual programs have restricted their use in bioinformatics experiments, especially those involving multiple steps and data integration from public and private resources. Moreover, the real execution time of such an experiment far surpasses the cumulative execution time of its component tasks, since such procedures often involve accessing resources and data over remote web servers with considerable manual intervention. Often the calculations are so compute-intensive that they require high-performance systems to accomplish the task within reasonable time limits, leaving biologists with only moderate compute facilities at a disadvantage. The emergence of Grid computing technology has opened up an unprecedented opportunity for biologists to share and access data, resources and tools in an integrated environment, leading to a greater chance of knowledge discovery. Grid computing, which is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [2], has become the state of the art in distributed computing methodologies, evolving into an important discipline within the computer industry.
The Open Grid Services Architecture (OGSA) [3] constitutes a conceptual framework for Grid computing that leverages the functionality of a Service Oriented Architecture (SOA) [4], and the Globus Toolkit [5] provides an implementation of the Open Grid Services Infrastructure (OGSI) [6], which supplies the detailed specifications for OGSA. This paper presents the architecture and implementation details of GeneGrid, a Grid computing framework for the creation of a virtual bioinformatics laboratory, based on Globus [5] and on standards that the international Grid computing community has agreed upon. GeneGrid accomplishes the seamless integration of a myriad of heterogeneous resources that span multiple administrative domains and locations, and provides scientists with an integrated environment for streamlined access to a number of bioinformatics and accessory programs through a simple and intuitive interface. It allows scientists to create, execute and manage workflows that represent bioinformatics experiments.

2. Bioinformatics and Grid Computing

The impact of Grid computing on bioinformatics has been tremendous. A number of projects worldwide focus on employing Grid technologies for enhanced problem solving in biomedical informatics. Examples include the Biomedical Informatics Research Network [7], the Cancer Biomedical Informatics Grid [8], myGrid [9], the North Carolina BioGrid [10] and Bio-GRID [11]. Some of these projects concentrate on building a computational Grid infrastructure for large-scale distribution and analysis of biological data, some focus on semantic interoperability and task complexity, while still others integrate data from biological, medical and related fields. GeneGrid differs from many of the above projects in motivation, architecture and implementation. For example, myGrid [9] strives to address issues related to the semantic complexity of biological data and of the applications that process that data. Using Taverna as the workflow engine, it helps biologists execute workflows that utilise a number of Web Services. Taverna is a powerful e-Science workbench that allows complex workflows to be created, though it is not simple enough for biologists to work with. GeneGrid, in contrast, was conceived to provide scientists from the commercial entities involved in the project, which have an interest in the development of antibodies and drugs, with a platform to access their collective skills, experience and results in a secure, scalable and reliable manner through the creation of a virtual bioinformatics laboratory [12]. Its functionality is based on several interacting Grid services, and its custom workflow manager provides the basic capabilities required to create and execute biological workflows. Furthermore, the GeneGrid Portal provides a simple and user-friendly interface giving access to all the capabilities of the underlying Grid system. The architectural and implementation details of GeneGrid are discussed at length in the following sections.

3. GeneGrid: Architecture and Implementation

A service-oriented architecture (SOA) is a loosely coupled architecture with a set of abstractions relating to components granular enough for consumption by clients and accessible over the network, with well-defined policies as dictated by the components [13]. GeneGrid exploits this coarse-grained nature of SOA by employing a highly modular architecture for the distributed components that seamlessly interact to provide its functionality. Further, it follows OGSA [3], which provides a uniform way to describe Grid services and defines a common pattern of behaviour for these services. There are five distinct components, which integrate the application programs, manage the databases, provide resource monitoring and service discovery, manage the workflows and interface with the users. Except for the Portal, which forms the user interface, all the components are Grid services developed with version 3 of the Globus Toolkit [5]. Communication between the services is mediated by messages in the eXtensible Markup Language (XML) through the use of the Simple Object Access Protocol (SOAP).

3.1. Integration of Application Programs

Providing programmatic access to bioinformatics application programs located on heterogeneous and disparate resources forms one of the major requirements of GeneGrid. Such application programs are themselves highly heterogeneous in their requirements and formats, which makes it difficult to provide a generic interface. However, the application management component of GeneGrid employs a Grid service wrapper around all the applications present on a computational resource and exposes these applications through an interface for programmatic access by clients [14]. The client which utilises this interface in GeneGrid is the Grid service component responsible for the management of tasks and workflows, which communicates the appropriate task information in the form of an XML message that follows a predefined format. This approach of localising services with respect to resources has the advantage of a lower failure rate than methods involving the staging of executable files and databases to remote resources. However, it is also possible to configure the services to execute a task remotely. To implement the functionality of the application manager, two types of Grid service which extend the standard interfaces in the OGSI [6] were defined. The GeneGrid Application Manager Service Factory (GAMSF) was conceived as a persistent Grid service with the capability to create new instances upon request by clients. This generic service integrates all the application programs on the resource which hosts the service. Upon invocation by a client, this service creates a transient service instance which has a finite lifetime and is destroyed by the client once the specific task at hand is completed. Such a GeneGrid Application Manager Service (GAMS) extends the standard interfaces provided by the OGSI to expose an operation to deal with a specific task. This service is configured to receive the task details in the form of an XML message which adheres to a predefined format.
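The paper does not publish the schema of this task message. Purely as an illustration, the following sketch shows how such a message might be shaped and parsed on the service side; every element and attribute name here (`task`, `application`, `param`, `input`) is an assumption, not the actual GeneGrid format:

```python
import xml.etree.ElementTree as ET

# Hypothetical task message; the real GeneGrid schema is not given in the paper.
TASK_XML = """
<task id="t1">
  <application>blastp</application>
  <param name="database">swissprot</param>
  <param name="evalue">0.001</param>
  <input ref="gstrip://workflow42/t1/query.fasta"/>
</task>
"""

def parse_task(xml_text):
    """Extract the application name, parameters and input references
    from a task message of the assumed shape above."""
    root = ET.fromstring(xml_text)
    app = root.findtext("application")
    params = {p.get("name"): p.text for p in root.findall("param")}
    inputs = [i.get("ref") for i in root.findall("input")]
    return {"id": root.get("id"), "application": app,
            "params": params, "inputs": inputs}

task = parse_task(TASK_XML)
```

A wrapper of this kind keeps the interface generic: adding a new program only requires that its parameters be expressible in the same message shape.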
It has the capability to create and execute the appropriate command on the local resource with the task parameters received (Figure 1). More details of how this is accomplished are discussed in a later section of the paper. The current implementation of GeneGrid integrates a number of popular bioinformatics tools selected on the basis of use cases presented by biologists from the companies involved in the project. The list includes the widely used sequence analysis program BLAST [15] and many of its variants, the multiple sequence alignment program ClustalW [16], the profile building and database search program HMMER [17], many programs from the EMBOSS suite [18], as well as protein characterisation programs such as TMHMM [19] and SignalP [20]. Parallel versions of some of the programs, such as BLAST [21], are also integrated to provide faster execution of tasks on clusters and supercomputers. It has been a prime focus to keep the application management services as generic as possible. This allows easy integration of novel application programs into the system without making any modifications to the code. Bioinformatics experiments involving a number of application programs often require the output from one step to act as input for another in the pipeline. However, due to incompatibility between input and output formats, in many cases such linking between tasks is difficult. To alleviate this problem, a number of utility programs were developed using Perl and BioPerl [22] that aid in linking consecutive tasks in an experiment. The services mentioned above also integrate such utility programs available on a resource with GeneGrid.

Figure 1 A client from a resource in Domain 1 executing a BLAST job on a resource in another domain. The client initially connects to the factory service (GAMSF), which integrates all the applications on that resource, creating an instance (GAMS) for accessing BLAST.

3.2. Integration of Databases

Bioinformatics experiments often involve access to a number of databases storing a variety of data in order to retrieve the data of interest. Such databases are created and managed using different technologies and are generally present on disparate and heterogeneous resources. For example, many biological databases are flat files, that is, long text files containing concatenated records, while others are created using relational database management systems (RDBMS) or are stored as XML collections. In addition to the publicly available biological datasets, the companies involved in the project have proprietary data stored in different formats, which need to be queried alongside the former. The functionality of GeneGrid also depends on some system-specific datasets, usually in the form of XML collections. Other datasets are generated during experiment creation and execution, such as input files and result files, which need to be stored in databases and retrieved when required. Providing access to these databases and facilitating a variety of operations such as query, update and deletion in a Grid environment is another central task in GeneGrid. This is accomplished by specific Grid services capable of interacting with the databases and exposing them to clients through appropriate interfaces. In implementing the database integration functionality of GeneGrid, the Open Grid Services Architecture-Database Access and Integration (OGSA-DAI) [23] middleware is used to generate the distributed Grid services. OGSA-DAI assists with the access and integration of data from separate data sources via the Grid. Two types of Grid service are involved in database integration in GeneGrid, both extending the standard interfaces defined in OGSI [6] and in OGSA-DAI [23].
The GeneGrid Data Manager Service Factory (GDMSF) extends the Grid Data Service Factory present in OGSA-DAI and is persistent in nature. Each such service is configured to support a single data set; hence each data set that is to be integrated with GeneGrid requires a dedicated GDMSF configured to support it. The primary role of the factory service is to create transient services that facilitate the interaction between the client and the data sets which they integrate with the system. In addition, upon client invocation the factory service also returns the metadata regarding the data set it supports. The transient service created by a factory service acquires its configuration from its parent service. Such a GeneGrid Data Manager Service (GDMS) exposes to the client the operations that facilitate database operations, such as queries or updates, as an extension to the standard operations provided by OGSI. A client can communicate a set of execution parameters in an XML document to the service, which in turn returns an XML document containing the results after executing the requested operation against the given database (Figure 2).

Figure 2 A client from a resource in Domain 1 executing a SwissProt query on a resource in another domain. The client initially connects to the factory service (GDMSF), which integrates the SwissProt database on that resource, creating an instance (GDMS) for executing the query.

GeneGrid currently integrates a suite of publicly available biological datasets, including the nucleotide sequence dataset EMBL Bank [24], the curated protein sequence dataset SwissProt [25] and the Gene Ontology database [26]. There are two GeneGrid-specific databases which are also integrated using the Grid services mentioned above. The GeneGrid Workflow Definition Database (GWDD) is an XML database developed using eXist [27] and contains the data required to construct a valid workflow, such as the Master Workflow Definition Document. Another database required for the proper functioning of the system is the GeneGrid Status Tracking, Results & Input Parameters Database (GSTRIP), a relational database in MySQL [28] that stores all the data generated by GeneGrid as well as data uploaded by the user. More details on these databases are presented in later sections.
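The XML-in/XML-out exchange between a client and a GDMS can be sketched as follows. This is a deliberately loose illustration of the request/response pattern only; it does not use the actual OGSA-DAI perform-document syntax, and the service is simulated by a local function rather than a SOAP call:

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a GDMS-backed data set; a real deployment would
# send the request document to the remote service over SOAP.
FAKE_SWISSPROT = {"P69905": "Hemoglobin subunit alpha"}

def gdms_perform(request_xml):
    """Execute an assumed-format query document and return an XML result."""
    root = ET.fromstring(request_xml)
    accession = root.findtext("query")
    result = ET.Element("results")
    hit = FAKE_SWISSPROT.get(accession)
    if hit is not None:
        entry = ET.SubElement(result, "entry", accession=accession)
        entry.text = hit
    return ET.tostring(result, encoding="unicode")

request = "<perform><query>P69905</query></perform>"
response = gdms_perform(request)
```

Because both request and response are plain XML documents, the same pattern serves flat-file, relational and XML-collection back ends alike.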

3.3. Management of Workflows

A workflow acts as a process model that aids the description and execution of work in a systematic manner. Bioinformatics experiments conducted in silico are rarely monolithic; they are usually workflows that involve a number of tasks, often with considerable heterogeneity. Workflow execution on a Grid is more complex still, owing to the heterogeneous and disparate nature of the resources. Experiments defined by scientists for execution in GeneGrid are regarded as workflows, with the individual bioinformatics analysis or database access jobs mapped as atomic tasks amenable to distributed processing. As with the other services described earlier, two different types of OGSI [6] compliant Grid service facilitate workflow management in GeneGrid. The persistent GeneGrid Workflow Manager Service Factory (GWMSF), which starts along with the container that hosts it, enables clients to interact and create transient instances. Each such GeneGrid Workflow Manager Service (GWMS) facilitates the enactment and management of a single workflow, and lives for as long as that workflow is active. A client interacts with the service through an XML document containing the workflow, which is then partially parsed by the service to identify the individual tasks present in the workflow and to find any interdependencies between them. After resolving any such dependencies, appropriate resources are identified and the tasks are distributed. More details about workflow enactment are given in the section on information flow in GeneGrid.
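The partial parsing step described above can be sketched as follows. The workflow document shape (`workflow`, `task`, `depends`) is an assumption for illustration; the paper does not define the actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow document; element and attribute names are
# illustrative only, not the real GeneGrid workflow schema.
WORKFLOW_XML = """
<workflow name="annotate-gene">
  <task id="t1" application="blastp"/>
  <task id="t2" application="clustalw" depends="t1"/>
  <task id="t3" application="tmhmm"/>
</workflow>
"""

def identify_tasks(xml_text):
    """Partially parse a workflow: list the tasks and map each task
    to the tasks it depends on."""
    root = ET.fromstring(xml_text)
    tasks, deps = [], {}
    for t in root.findall("task"):
        tasks.append(t.get("id"))
        deps[t.get("id")] = (t.get("depends") or "").split()
    return tasks, deps

tasks, deps = identify_tasks(WORKFLOW_XML)
```

Here t1 and t3 have no dependencies and could be dispatched immediately, while t2 must wait for t1.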


3.4. Resource Monitoring and Service Discovery

In a service-oriented approach it is essential to have a mechanism to publish the existence of services to clients and to other services, as well as a means to discover services that provide a specific functionality. Service registries such as Universal Description, Discovery and Integration (UDDI) can provide this functionality. However, GeneGrid follows an alternative approach, in which service publication and discovery are accomplished through a Grid service that also provides the system with comprehensive information about the status of the resources. The actual resource and service monitors are lightweight applications present on each of the resources integrated with GeneGrid. These GeneGrid Node Monitors (GNM) can be configured to send static and dynamic resource data to the registry service at regular intervals. In addition, data about the Grid services described earlier, such as the GAMSF, GDMSF and GWMSF, can be transmitted to the service. Such data may include the service name, type and location, as well as the names of the applications or databases the service interfaces on the resource.
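The kind of registration record a node monitor might assemble is sketched below. The field names, the JSON encoding and the example host are all assumptions for illustration; the actual GNM wire format is not specified in the paper:

```python
import json
import time

def build_registration(host, services, load_avg, free_mem_mb):
    """Assemble the data a node monitor might push to the registry:
    static identity, dynamic load figures, and the services hosted."""
    return {
        "host": host,
        "timestamp": int(time.time()),
        "load": load_avg,
        "free_mem_mb": free_mem_mb,
        "services": services,  # name, type and backing program of each service
    }

payload = build_registration(
    "node1.example.org",  # hypothetical resource name
    [{"name": "GAMSF", "type": "application", "programs": ["blastall"]}],
    load_avg=0.42,
    free_mem_mb=2048,
)
message = json.dumps(payload)
```

Sending such a record at a fixed interval gives the registry both a liveness signal and the dynamic data needed for resource selection.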

The GeneGrid Application and Resources Registry (GARR) service is a persistent Grid service that extends the OGSI standard interfaces to receive the resource and service information from the node monitor applications on various resources (Figure 3). All the data collected by the service are stored in a relational database for further use. This service also provides the interface for querying the database for service and resource information. For example, a client wishing to discover the location of any specific service within GeneGrid, say a GDMSF integrating the SwissProt [25] database, may query the GARR service and retrieve the required information. Moreover, when the requested type of service exists on more than one resource, it is possible to query for the optimal resource with the specific service.
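Selecting the optimal resource from the registry can be sketched as follows. The registry rows, service types and the least-load criterion are illustrative assumptions; the paper states only that the GARR stores its data in a relational database and can be queried for the optimal resource:

```python
# Toy registry rows: (service type, resource, CPU load). In GeneGrid the
# GARR holds this in a relational database; the query below is simulated.
REGISTRY = [
    ("BLAST", "clusterA", 0.80),
    ("BLAST", "clusterB", 0.15),
    ("ClustalW", "clusterA", 0.80),
]

def best_resource(service_type, registry):
    """Pick the least-loaded resource offering the requested service,
    or None when no resource offers it."""
    candidates = [row for row in registry if row[0] == service_type]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row[2])[1]

choice = best_resource("BLAST", REGISTRY)
```

With two BLAST services registered, the lightly loaded clusterB is chosen; richer policies could also weigh memory or queue length from the same registry data.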

Figure 3 Node monitor applications (GNM) on multiple resources across administrative domains registering resource information with the registry service (GARR). The resource information includes the locations of Grid services integrating application programs and databases, in addition to the status of the resource itself.

3.5. Portal: The User Interface

Scientists, who are mostly from a biology background, often find computing technologies, especially those still in development such as Grid computing, highly complex and hard to comprehend. To flatten the steep learning curve that often hinders biologists from taking up Grid technologies and the benefits they offer, GeneGrid employs a simple and intuitive user interface for interaction with the system. The GeneGrid Portal, which acts as the presentation layer for the Grid services, provides a secure and central access point for the users. In addition to concealing the underlying complexity, it enables the user to create, execute and manage in silico biological experiments as workflows. The GeneGrid Portal is developed using GridSphere [29], a portal framework that provides an open-source Java Specification Request (JSR) 168 compliant portlet-based web portal, enabling developers to quickly develop and package third-party portlet web applications that can be run and administered within the portlet container.

Data stored in two GeneGrid-specific databases are required for the functioning of the Portal. Firstly, the Master Workflow Definition XML document present in the GeneGrid Workflow Definition Database (GWDD) contains the complete list of tasks the user can execute, along with the required parameters for each task, including descriptions and default values. This XML document, which follows a predefined schema, also contains the rules required to submit and successfully execute each available task. Based on the Master Workflow Definition, the Portal generates web-based forms that allow the user to select tasks and enter runtime parameter values to build a workflow. The GeneGrid Status Tracking, Results & Input Parameters (GSTRIP) database is the second database storing data necessary for the functioning of the Portal. All the data from user input, as well as data generated within GeneGrid during the execution of workflows, are stored in this database. Hence, whenever a workflow is submitted, a record is made in this database along with other data such as the current status, input data and execution parameters, and the record is updated regularly as the workflow execution progresses. Result data from the execution of the workflow and some metadata about the generated results are also finally written to the record. By reading this record, the Portal can display static and dynamic information about the workflow, keeping users abreast of the progress of the workflow execution. The Portal also allows the user to view the inputs, execution parameters and results of each task in the workflow (Figure 4). Finally, the Portal provides capabilities for administering GeneGrid from a central point, which will be discussed in a later section.

Figure 4 Clients can access the GeneGrid system through the Portal. The Portal in turn contacts the registry service (GARR) to retrieve the locations of services required for its functionality, such as data manager services (GDMSF) interfacing GeneGrid-specific databases and the services managing workflows (GWMSF), all of which may be present in different administrative domains.

4. Information Flow in GeneGrid

As mentioned in the previous section, GeneGrid consists of a number of Grid service components that cater to the individual requirements of the framework. However, it is the coordinated action of these individual components that provides the complete functionality expected of a virtual bioinformatics laboratory.

4.1. Initialisation

A basic installation of GeneGrid should have at least one Portal, a deployment of both the registry service (GARR) and the workflow manager service factory (GWMSF), an implementation of each of the workflow definition database (GWDD) and the GSTRIP database, as well as the data manager service factories configured to interface these databases. These core distributed elements and their service instances form what is termed the GeneGrid Environment. The application manager service factories (GAMSF) and data manager service factories (GDMSF) that interface the bioinformatics and accessory programs as well as the databases make up the GeneGrid Shared Resources, which are capable of participating in one or more GeneGrid Environments. As mentioned in the previous section, the registry service GARR receives all resource and service data through the node monitor applications. The Portal and the workflow manager service factory are configured with the location of the registry service, in order to discover the location of all GeneGrid services. During start-up, the Portal contacts the registry service to discover the location of the data manager service factories that interface the GWDD and GSTRIP databases and creates service instances. After communicating with the appropriate service, the Portal obtains the Master Workflow Definition document, which allows it to display information about the available tasks to the users as web forms.

4.2. Experiment Creation

Users can access the Portal through a web browser and, once authenticated, are presented with the tasks available in the GeneGrid installation (Figure 5). Going through the web forms, which accept the parameter values and input files for specific tasks, a user can select the tasks required for the experiment. The Portal in turn will create the appropriate workflow XML document
based on the user inputs and also store the input files and parameter details in the GSTRIP database. Once all the tasks required for the experiment are selected, the user is given two options. The first is to submit the complete workflow for processing after providing a suitable name for the experiment. The second is to store the complete workflow as a template in the workflow definition database. This allows the reuse of workflows with different initial inputs, saving time when experiments are to be repeated with various inputs. Such a template is available only to the user who created it; however, it is also possible to create common templates that can be shared by all users. Templates aid the sharing of experimental protocols among users and provide an easy way to train scientists to use GeneGrid.
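Template reuse amounts to copying a stored workflow definition and substituting fresh inputs. The sketch below illustrates the idea with plain dictionaries; the real GWDD stores templates as XML documents, and all names here are assumptions:

```python
import copy

# A stored template: a task list with a placeholder for the initial input.
TEMPLATE = {
    "name": "blast-then-align",
    "tasks": [
        {"id": "t1", "application": "blastp", "input": None},
        {"id": "t2", "application": "clustalw", "depends": "t1"},
    ],
}

def instantiate(template, experiment_name, initial_input):
    """Create a runnable workflow from a template with new inputs,
    leaving the stored template untouched."""
    wf = copy.deepcopy(template)
    wf["name"] = experiment_name
    wf["tasks"][0]["input"] = initial_input
    return wf

run1 = instantiate(TEMPLATE, "run-a", "queryA.fasta")
run2 = instantiate(TEMPLATE, "run-b", "queryB.fasta")
```

The deep copy is the important detail: each instantiation must not mutate the shared template, or one user's run would corrupt the protocol for everyone.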

4.3. Workflow Enactment

When the user submits an experiment involving a number of tasks, the Portal contacts the registry service to obtain the location of a workflow manager service factory (GWMSF). It then creates a service instance of the workflow manager service factory to deal with the newly created workflow XML. There are three major steps in the further processing of the workflow by the workflow manager service, as described below (Figure 6).

4.3.1. Task Identification

The workflow XML received by the workflow manager service contains details of one or more tasks. The first step in processing the workflow is to partially parse the XML document to identify the tasks present in the workflow and break it into component tasks. The interdependencies of the tasks also need to be analysed to determine the order in which tasks can be executed. Tasks that do not depend on a previous task can be processed in parallel, while an interdependent task must wait for its predecessor in the workflow to finish.
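The scheduling rule just described - independent tasks run in parallel, dependent tasks wait - can be sketched as grouping tasks into successive "waves". This is an illustrative algorithm of my own choosing (Kahn-style levelling), not a description of the actual GWMS internals:

```python
def execution_waves(deps):
    """Group tasks into waves: every task in a wave has all of its
    dependencies satisfied by earlier waves, so a wave can run in parallel."""
    remaining = dict(deps)  # task id -> list of prerequisite task ids
    done, waves = set(), []
    while remaining:
        ready = [t for t, d in remaining.items() if set(d) <= done]
        if not ready:  # nothing runnable left: the workflow has a cycle
            raise ValueError("circular dependency in workflow")
        waves.append(sorted(ready))
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves

waves = execution_waves({"t1": [], "t2": ["t1"], "t3": [], "t4": ["t1", "t3"]})
```

For the example dependencies, t1 and t3 form the first wave and t2 and t4 the second; the cycle check also catches malformed workflow documents early.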


Figure 5 Screenshot of the Portal showing the task selection step in creating an experiment.

4.3.2. Service Discovery

Once the tasks are identified, the workflow manager service contacts the registry service and gathers the locations of the appropriate services integrating the application programs or databases, depending on the task at hand. For example, in order to process a BLAST [15] task, the workflow manager service will query the registry service for the location of an application manager service factory that can interface and process a BLAST job. Since the registry service is also capable of providing resource data such as CPU load and available memory, it is possible to query for the optimal resource hosting a BLAST service when more than one BLAST service is available on different resources.

4.3.3. Task Staging

After deciding which service to use, the workflow manager service contacts the appropriate application manager service factory and creates a service instance. This service is presented with an XML document containing the details of the task to be executed. The task XML contains the name of the application, default or user-defined parameters and some other information, such as the location of the service interfacing the GSTRIP database where the input files are stored.

Figure 6 A simplified overview of the submission and execution of a workflow containing a BLAST task. The Portal first connects to the workflow manager service factory (GWMSF) to create an instance (GWMS) and assign it the workflow XML. This service, after identifying the tasks, contacts the registry service (GARR) for the location of a suitable application manager service factory (GAMSF), in this case one integrating BLAST, for submission of the task XML. It also updates the GSTRIP database with the status of the workflow.

4.4. Task Processing

Once the application manager service receives a task for processing, it parses the XML document to get the details of the task, such as the name and parameters. It then contacts the data manager service dealing with the GSTRIP database to retrieve the input files required for execution of the task. The appropriate execution command is then generated using the information from the task XML and a local configuration XML document, and is submitted for execution on the resource. If the job completes successfully, the GSTRIP database is updated with the result files and metadata about the task, which help provide provenance information about the data. The status of the task execution, whether success or failure, is notified to the workflow manager service along with the location of the result files, if any. This service then updates the record about the task in the GSTRIP database appropriately.
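How the task XML and the local configuration document are merged into a command is not spelled out in the paper. One plausible sketch, with the binary path, flag names and configuration shape all assumed for illustration, is:

```python
import shlex

# Stand-in for the local configuration XML: where each wrapped program
# lives on this particular resource (paths are hypothetical).
LOCAL_CONFIG = {"blastp": "/usr/local/bin/blastall -p blastp"}

def build_command(application, params, input_path):
    """Merge the resource-local binary invocation with the parameters
    carried in the task XML to produce the execution command."""
    base = LOCAL_CONFIG[application]
    args = " ".join(
        f"-{k} {shlex.quote(str(v))}" for k, v in sorted(params.items())
    )
    return f"{base} {args} -i {shlex.quote(input_path)}"

cmd = build_command("blastp", {"d": "swissprot", "e": 0.001}, "query.fasta")
```

Keeping the binary locations in a per-resource configuration file, rather than in the task message, is what lets the same task XML execute on any resource whose wrapper advertises the application.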

4.5. Tracking Status and Viewing Results

Through the Portal, users may check the status of the experiment as a whole, or of individual tasks, at any time after its submission (Figure 7). For all tasks, the inputs and parameters provided by the user during the creation of the experiment may be viewed at any time. When a task is successfully completed, it is possible to view the result files through the Portal. The user may decide to keep the experiment details in the system for later use or delete them from the system forever.

5. Administering GeneGrid

GeneGrid consists of a number of components with diverse configuration requirements, distributed across disparate domains. It is therefore essential, and at the same time difficult, to manage these components, since changes need to be propagated throughout the framework. GeneGrid employs a central interface through the Portal for configuration management, in addition to routine administration procedures such as user addition. The GridSphere [29] framework, which forms the basis of the GeneGrid Portal, provides some administrative capabilities such as the management of users and the editing of layouts. Newly developed portlets extend this further to GeneGrid-specific activities. For example, it is possible to select a different



Figure 7 Screenshot of the Portal showing the status of tasks in an experiment. Inputs and results of each task can be viewed by clicking on the appropriate links.

registry service through the Portal, so that a different set of resources and services can be used for processing the workflows. It is also possible to add information about new services manually using the resource administration portlet. As mentioned earlier, the Master Workflow Definition XML document, containing the complete task information template, is central to the functioning of GeneGrid. The Portal provides the capability to edit an existing workflow definition document, for instance to add a new application and its details or to modify the parameters of an existing application. There is also provision to upload a new document modified outside GeneGrid. The Portal also enables the management of the configuration of the services, which helps propagate changes at a system level. Thus, for example, through the Portal it is possible to retrieve the configuration used by the workflow manager service factory, as well as to modify it and transmit the changes back to the service.
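The kind of edit the administration portlet performs on the Master Workflow Definition XML can be sketched as below. The element and attribute names here are invented for illustration, since the paper does not publish the definition schema.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the Master Workflow Definition XML;
# element and attribute names are assumptions.
DEFINITION_XML = """
<workflowDefinitions>
  <application name="blastall">
    <parameter name="-e" default="10"/>
  </application>
</workflowDefinitions>
"""

def add_application(definition_xml: str, name: str, params: dict) -> str:
    """Add a new application entry with its default parameters."""
    root = ET.fromstring(definition_xml)
    app = ET.SubElement(root, "application", name=name)
    for pname, default in params.items():
        ET.SubElement(app, "parameter", name=pname, default=default)
    return ET.tostring(root, encoding="unicode")

updated = add_application(DEFINITION_XML, "clustalw", {"-gapopen": "10.0"})
```

Uploading a document modified outside GeneGrid amounts to replacing the whole definition rather than patching individual entries as above.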

6. Security in GeneGrid

The GeneGrid Portal acts as a central access point for all users. The user completes a one-time registration process by providing the essential details, and an account is created by the GeneGrid administrator granting permission to the resources with appropriate access control. The administrator also creates a User Certificate, which the user installs in the web browser. When a user opens the GeneGrid Portal, the Portal checks the User Certificate from the web browser. The GeneGrid Portal is configured to accept HTTPS connections, which use the Secure Sockets Layer (SSL) protocol. SSL enables web browsers


and the GeneGrid Portal to communicate over a secured connection. GeneGrid uses the Grid Security Infrastructure (GSI) provided by the Globus Toolkit for client authentication and server-side authorisation. The GSI is based on public-key cryptography, and is configured to guarantee privacy, integrity, and authentication.
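Authorisation on the server side is driven by the Gridmap file (Section 6.2), which pairs a certificate's distinguished name with a local account name. A minimal entry, with an invented distinguished name, looks like:

```text
"/C=UK/O=eScience/OU=Belfast/CN=jane researcher" jresearcher
```

An incoming invocation whose certificate DN does not appear in the file is declined.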


6.1. Client Side Authentication

GeneGrid provides a Single Sign-On (SSO) facility that relieves the user from entering the password more than once during a session. Once the user is logged in to the Portal with the certificates installed in the browser, the certificates are used to authenticate access to any resource within the GeneGrid system.

6.2. Server Side Authorisation

The server decides whether to accept or decline an incoming invocation depending upon the entries in a file called the Gridmap file. A Gridmap file, like an access control list (ACL), contains a list of authorised users who have the authority to invoke the service.

6.3. Message Level Security

GeneGrid uses the GSI Secure Conversation method. Here, a security context is first established between the client and the service, which is then used to sign/verify/encrypt/decrypt the messages.

7. GeneGrid Prototype

Several prototypes were developed during the course of the project. The latest prototype implements the architecture and functionality described in the earlier sections. In order to assess the practicality of distributing services that work in a coordinated fashion, the different component services, such as the workflow manager and application manager, are deployed on diverse resources. Further, heterogeneous resources across the globe are integrated with GeneGrid, ranging from Linux clusters to Sun SMP machines to desktop machines located in geographically disparate and diverse administrative domains, including the San Diego Supercomputer Center in the US and the University of Melbourne in Australia. While the developers are working on further improvements, scientists from the companies involved in the project are using the prototype for testing and routine bioinformatics procedures.

8. Using GeneGrid: A Case Study

A routine bioinformatics analysis carried out by scientists from one of the companies involved in the project is finding the regions with antigenic properties in proteins, starting from their gene sequences. The procedure involves a number of bioinformatics programs that are accessed from different sources, such as publicly available web servers and resources hosted internally. Manual execution of such experiments is tedious, time-consuming and error-prone, especially considering the volume of analyses that need to be carried out on a daily basis. GeneGrid has provided a solution to automate the experiment through reusable workflows and has substantially reduced the execution time by distributing the tasks across optimal resources. The workflow for automated antigenic region detection involves a number of steps (Figure 8), as described below:

1. Gene sequences are translated into the corresponding protein sequences.
2. Various characteristics of the protein, such as the presence of transmembrane regions, signal regions etc., are examined.
3. Regions of the proteins with the desired characteristics are extracted, and those regions which do not contribute to antigenicity are removed.
4. The sequence fragments are then searched against a database such as SwissProt [25] to eliminate the chance of having highly homologous sequences.
5. The unique fragments selected are further analysed to see whether primers for polymerase chain reaction can be developed.
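Step 1 above, gene-to-protein translation, can be illustrated with a minimal sketch using the standard codon table. GeneGrid itself wraps existing packages (e.g. EMBOSS [18]) for this step rather than hand-rolled code; this sketch only shows what the transformation does.

```python
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3].upper()]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

print(translate("ATGAAAGGTTGCTAA"))  # ATG AAA GGT TGC TAA -> "MKGC"
```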







Figure 8 Outline of the use case. Please see the steps described in the text.

When this experiment was conducted with manual intervention, the time to complete the analysis of a single gene was about 30–60 min. In contrast, the automated antigenic region detection using GeneGrid took about 90 min for the analysis of 100 genes. This acceleration is largely due to the automation and parallelisation of task execution, as well as the optimal use of available resources.
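The implied acceleration from the figures quoted above works out as follows (a back-of-the-envelope calculation, assuming the per-gene manual times scale linearly over 100 genes):

```python
# Manual: 30-60 min per gene, 100 genes; GeneGrid: 90 min for all 100.
manual_low, manual_high = 100 * 30, 100 * 60   # 3000 and 6000 minutes
genegrid_total = 90                            # minutes

speedup_low = manual_low / genegrid_total      # roughly 33x
speedup_high = manual_high / genegrid_total    # roughly 67x
```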

9. Conclusions

GeneGrid consists of a number of Grid service components developed based on Globus [5] that interact seamlessly to provide the functionality of a virtual bioinformatics laboratory. It allows scientists to share and access their collective skills, experiences and results in a secure, reliable and scalable manner through a simple and intuitive interface. The advantages it offers over conventional methods are manifold, some of which are discussed below.

• GeneGrid integrates numerous bioinformatics application programs and databases available on disparate and heterogeneous resources, allowing scientists to easily access these applications and data sources without visiting many web servers manually. This reduces the overall time for the execution of an experiment, and the reduction is roughly proportional to the number of tasks present in the experiment.
• As the system takes care of the monitoring and selection of appropriate and optimal resources for the tasks requested, it relieves the user from such selections and, more importantly, utilises the resources efficiently. This not only reduces the overall time required for experiment execution, but individual tasks are also sometimes accelerated due to the use of clusters and supercomputers, or of systems that are relatively idle.
• GeneGrid makes experiments more error-free through the automation of workflows, as manual intervention, which is a major cause of the initiation and propagation of errors in experiments, is reduced to a minimum.
• Novel technologies, albeit with distinctive advantages over conventional ones, often come with a steep learning curve and hence deter result-oriented scientists from embracing them. The effort in GeneGrid development is to provide a user-friendly interface to harness the power of Grid computing. This is reflected in the Portal, which, while masking the underlying complexity, presents users a familiar interface for the creation and management of experiments through the web browser.
• Scientists often have to repeat experiments with slightly different parameters or input sequences. If such experiments involve a number of tasks, repeatedly accessing different web services manually becomes laborious and time-consuming. In GeneGrid, the Portal provides the option to save experiment templates, which help in reusing the workflows with slight modifications.

Development of a functional prototype of GeneGrid, and its use in a number of in silico biological experiments, have clearly illustrated the viability of using Grid services for the integration of heterogeneous bioinformatics applications and databases, as well as for utilising them in an orchestrated manner through the creation of workflows. However, it is essential to note that adopting a technology which is in its infancy and growing rapidly is often painful, and this was the experience with


GeneGrid development using the Grid standards and tools. In future, GeneGrid intends to utilise the more robust and sophisticated tools and standards being developed by the international Grid community through the convergence of web and Grid services.

Acknowledgments We acknowledge the funding support for the project by the Department of Trade and Industry (DTI) under the UK e-Science Programme. Our thanks to Dr Mark Miller, SDSC, USA and Dr Rajkumar Buyya, University of Melbourne, Australia for providing access to resources.

References

1. Genomes OnLine Database, see website http://www.genomesonline.org
2. Foster, I., Kesselman, C., Tuecke, S.: The anatomy of the Grid: Enabling scalable virtual organisations. Int. J. Supercomput. Appl. 15(3) (2001)
3. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The physiology of the Grid: An open Grid services architecture for distributed systems integration. Open Grid Service Infrastructure WG, Global Grid Forum, June 22 (2002)
4. Foster, I.: Service-oriented science. Science 308, 814–817 (2005)
5. Foster, I., Kesselman, C.: Globus: A metacomputing infrastructure toolkit. Int. J. Supercomput. Appl. 11, 115–128 (1997)
6. Tuecke, S., Czajkowski, K., Foster, I., Frey, J., Graham, S., Kesselman, C., Maguire, T., Sandholm, T., Vanderbilt, P., Snelling, D.: Open Grid services infrastructure (OGSI) version 1.0. Global Grid Forum Draft Recommendation, 27 June 2003
7. Biomedical Informatics Research Network, see website http://www.nbirn.net
8. Cancer Biomedical Informatics Grid, see website https://cabig.nci.nih.gov
9. myGrid, see website http://www.mygrid.org.uk
10. North Carolina BioGrid, see website http://www.ncbiogrid.org
11. Bio-GRID, see website http://biogrid.icm.edu.pl
12. Donachy, P., Harmer, T.J., Perrott, R.H., et al.: Grid based virtual bioinformatics laboratory. In: Proceedings of the UK e-Science All Hands Meeting, Nottingham, pp. 111–116 (2003)
13. Joseph, J., Ernest, M., Fellenstein, C.: Evolution of Grid computing architecture and Grid adoption models. IBM Syst. J. 43, 624–645 (2004)
14. Jithesh, P.V., Kelly, N., Simpson, D.R., et al.: Bioinformatics application integration and management in GeneGrid: Experiments and experiences. In: Proceedings of the UK e-Science All Hands Meeting, Nottingham, pp. 563–570 (2004)
15. Altschul, S.F., et al.: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
16. Thompson, J.D., Higgins, D.G., Gibson, T.J.: CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680 (1994)
17. Eddy, S.R.: Profile hidden Markov models. Bioinformatics 14, 755–763 (1998)
18. Rice, P., Longden, I., Bleasby, A.: EMBOSS: The European molecular biology open software suite. Trends Genet. 16, 276–277 (2000)
19. Krogh, A., et al.: Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305(3), 567–580 (2001)
20. Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S.: Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340, 783–795 (2004)
21. Darling, A., Carey, L., Feng, W.: The design, implementation, and evaluation of mpiBLAST. In: ClusterWorld Conference & Expo, in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution, San Jose, CA (2003)
22. Stajich, J.E., et al.: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12, 1611–1618 (2002)
23. OGSA-DAI Project, see website http://www.ogsadai.org.uk
24. Kanz, C., Aldebert, P., Althorpe, N., et al.: The EMBL nucleotide sequence database. Nucleic Acids Res. 33, Database Issue, D29–D33 (2005)
25. Apweiler, R., Bairoch, A., Wu, C.H., et al.: UniProt: The universal protein knowledgebase. Nucleic Acids Res. 32, D115–D119 (2004)
26. The Gene Ontology Consortium: Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29 (2000)
27. Meier, W.: eXist: An open source native XML database. In: Chaudri, A.B., Jeckle, M., Rahm, E., Unland, R. (eds.) Web, Web-Services, and Database Systems. NODe 2002 Web- and Database-Related Workshops, Erfurt, Germany (2002)
28. MySQL, see website http://www.mysql.com
29. Novotny, J., Russell, M., Wehrens, O.: GridSphere: An advanced portal framework. In: Proceedings of the EuroMicro Conference, pp. 412–419 (2004)
