Grid computing is emerging as a key infrastructure for a wide range of disciplines for ... (Stevens et al, 2003) has been developed as an open source data-intensive .... attach a bioinformatics package containing the programs with two files: ...
IADIS International Conference on Applied Computing 2005
A GRID-BASED SYSTEM FOR INTEGRATING BIOINFORMATICS APPLICATIONS Xiujun Gong Graduate School of Information Science, Nara Institute of Science and Technology 8916-5 Takayama, Ikoma, Nara, 630-0192 Japan
Kei Yura Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute 8-1 Kizu, Souraku, Kyoto, 619-0215 Japan
Nobuhiro Go Graduate School of Information Science, Nara Institute of Science and Technology 8916-5 Takayama, Ikoma, Nara, 630-0192 Japan Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute 8-1 Kizu, Souraku, Kyoto, 619-0215 Japan Neutron Science Research Center, Japan Atomic Energy Research Institute 8-1 Kizu, Souraku, Kyoto, 619-0215 Japan
ABSTRACT To access up-to-date biological data sources across multiple heterogeneous databases and analysis tools distributed all over the networks, and to combine them into workflows, we propose a grid-based system for integrating bioinformatics applications, which we named “Bioinformatics: Ask Any Questions (BAAQ)”. BAAQ allows scientists to store and manage biological data and bioinformatics tools as remote services, and to build analysis workflows that integrate these services seamlessly. Via a grid based computation and graphically-enabled editor environment, we address how to compose a robust workflow to perform a complex bioinformatics analysis and to visualize results using Active Service Provider and Wrapped Program Toolkits. We propose a metadata model to organize biological resources and to share and discover knowledge in grid community. Finally, we present a case study applying BAAQ to a bioinformatics problem. KEYWORDS Grid, Active Service Provider, Wrapped Program Toolkits, Metadata
1. INTRODUCTION Biological data are being produced at a rapid pace throughout the community. These huge data are heterogeneous, autonomous and dynamic (Lacroix, 2002). Along with the increase in biological data, many tools are being designed to analyze them by different research groups (Liu et al, 2003). Integration of these biological data and tools is becoming one of the major topics in the bioinformatics community (Achard et al, 2001). One of the goals of the integration is to enable users to access bioinformatics tools and up-to-date biological data across multiple heterogeneous databases. Integration effort should meet the following requirements: a. Uniform access to biological data sources: A complex bioinformatics analysis usually involves many data sources. However, these data sources are heterogeneous in their data formats, data management systems, and data schemata. Access to heterogeneous biological data sources is mandatory to biologists. b. Overview of bioinformatics tools and their interface standardization: There are many bioinformatics tools designed to handle the data, but no one can tell which set of tools is the best for performing an
551
ISBN: 972-99353-6-X © 2005 IADIS
analysis in any given situation. Often, most of the bioinformatics tools are used in combination for solving a complex biological problem, but because of the heterogeneity in their interfaces, the procedure of integrating them is cumbersome. The overview of the tools and their interface standardization will help users integrate the appropriate tools for solving problems in a short time. c. Access to high performance and distributed computing resources: Biological problems such as multiple sequence alignments and microarray data clustering are huge and complex, as require high performance computation platforms (Siepel et al, 2001). The methodology and mechanisms utilizing all the computing power available on heterogeneous mixtures of multi-processor servers, clusters, networked desktops, and computation grids are necessary for accelerating the execution of bioinformatics applications. d. Share and discover knowledge: In the bioinformatics research community, the methodology which a researcher used to solve problems might be an important clue for another researcher for solving similar problems. Researchers with similar interests might have common preferences. Capturing such knowledge in an integrated environment has vital needs to speed up the development process of, for example, drug design. Grid computing is emerging as a key infrastructure for a wide range of disciplines for resource sharing and collaboration over wide and open networks (Gil et al, 2004). Provided with fundamental mechanisms of access to distributed databases and tools, the grid is an ideal candidate for the integration of bioinformatics applications. Many organizations have participated in such grid. Asia Pacific BioGrid (http://www.apbionet.org/grid/) is trying to build a customized, self-installing version of the Globus Toolkit, a distributed environment for designing and managing grid. It comprises well-tested installation scripts and avoids dealing with Globus details. Bio-Grid working group (Pytlinski et al, 2002) is developing an access portal for biomolecular modeling resources. The project develops various interfaces for biomolecular applications and databases that will allow biologists and chemists to submit works to high performance computing facilities, hiding grid programming details. In a United Kingdom e-Service project, myGrid (Stevens et al, 2003) has been developed as an open source data-intensive bioinformatics application on the grid. Data integration is achieved both by dynamic distributed query processing and by creating virtual databases through federations of local databases. In this paper, we propose an integration system for bioinformatics applications, which we named “Bioinformatics: Asking Any Questions (BAAQ)”. Via a grid based computation and graphically-enabled editor environment, we address how to compose a robust workflow to perform a complex bioinformatics analysis and to visualize results using Active Service Provider and Wrapped Program Toolkits. We also propose a metadata model to organize biological resources and to share and discover knowledge in the grid community. This paper is organized as follows: Section 2 describes the system architecture of BAAQ, Section 3 discusses the implementations of related modules, Section 4 reports a simple case study using BAAQ; and section 5 concludes this paper and discusses future works.
2. SYSTEM OVERVIEW As a bioinformatics collaboration platform, BAAQ allows scientists to store and manage biological data and bioinformatics tools as remote services in the grid, and to build analysis workflows that integrate these services seamlessly. The system is also capable to manage and share these workflows, and to discover knowledge in the grid community. The implementation of BAAQ is based on Seamless Thinking Aid (STA) (Koide et al, 1998), a gridenabled environment which includes a number of tools for assisting parallel programming. Task Mapping Editor (TME) (Imaura et al, 2003), one of the components of STA, is adopted as grid middleware for managing these services and workflows. TME has been developed for handling distributed resources and supporting the integration of distributed applications. All services are represented as icons on TME naming space and data dependency is defined by a line linking the icons. Using TME, a user can design a workflow diagram of the distributed applications, just like using a drawing tool. From the conceptual view of the grid, the BAAQ components fall into four-layer infrastructure, as shown in Fig 1.
552
IADIS International Conference on Applied Computing 2005
a. b.
c.
d.
Computation resource layer provides a grid-based computing environment that is responsible for authentication, authorization, communication and computing resource location and allocation. Information service layer consists of grid middleware such as TME, TextBrowser, and PluginTool. These middleware modules interact with the user through a GUI interface and communicate with STA by the corresponding adaptors. TME is the core component for managing icons and building workflows. TextBrowser is used to browse the text-based content of data resources. PluginTool is designed to visualize the content of data resources using client side software. Application resource layer consists of Wrapped Program Toolkits (WPT) and Active Service Provider (ASP). WPT is a series of wrapped program tools to perform bioinformatics analyses. ASP has the following three functions: 1) to provide guidelines for choosing a service(data service or program service) and a set of parameters for program services; 2) to aid connecting two icons in the process of workflow generation of TME; and 3) to recommend a visualization tool (such as TextBrowser or PluginTool) for viewing the analysis result. Meta crawler layer collects and updates the data sources and maintains the consistency of the whole resources. The metadata management module is also responsible for the workflows transfer between different grid communities. Biological Data
Metadata
Active Service Provider
Bio-Tools
Wrapped Program Toolkits
TextBrowser
TME
PluginTool
Seamless Thinking Aid
Meta Crawler
Application Resource
Information Service
Computation Resource
Figure 1. architecture overview
In this architecture, the biological data and bioinformatics tools gathered by the meta crawler layer are published as remote services. These services are encoded with knowledge rich metadata and are stored under the tree-like structure of TME workspaces. Before publishing the bioinformatics tools, they are wrapped so that they have uniform interfaces. To compose analysis workflows using these published services, a user only needs to select the corresponding services and link them by drawing a line between them. ASP is helpful for this procedure as described above. The analysis results are usually stored as remote files and managed as icons in the TME workspace and a user views them using customized visualization tools such as TextBrowser and PluginTool. The key modules to implement the functions above are discussed at length.
3. IMPLEMENTATION 3.1 Metadata management module Grid technology is increasingly used to provide an integration infrastructure to access distributed resources within collaborated research groups. Metadata collection and management play critical roles in developing and maintaining the integration (Blythe et al 2004). Metadata in the BAAQ system is the core coordinator of the integration. It is not only responsible for the integration among different function modules (STA, TME, Meta applications and so on), but also remembers and controls the behavior of the modules’ status. For example, data services and program services, which are encoded with knowledge rich metadata, can be easily identified by ASP and maintained by the meta crawler. By the metadata transfer with corresponding biological data and tools, workflows can be imported or exported among different grid communities.
553
ISBN: 972-99353-6-X © 2005 IADIS
The metadata in BAAQ falls into three levels: global metadata, general-purpose metadata and component-oriented metadata. Global metadata is data about the integration information of different components. The general-purpose metadata describes the basic information of the global metadata and the component-oriented metadata. We store the first two kinds of metadata in XML format. The componentoriented metadata is designed for different functional modules, for instance, TME metadata. Fig 2 lists the schema of the global metadata. In this schema, there are four kinds of databases (double line boxes in Fig 2). META_APP_DB contains a set of descriptions of wrapped program tools, each of which is collected from PACKAGE_PROFILE. META_DB_DIR_DB describes the information of all available biological data sources, each of which comes from META_DB_DIR. The environmental variables (grid configuration, available resource location), public data sources (TME_DATA) and bioinformatics tools (TME_PROGRAM) are specified in SYS_PROFILE_DB. USER_PROFILE_DB records all the information about the users on the grid. Metadata management module collects and updates these four kinds of databases from DB_DIR_PROFILE, PACKAGE_PROFILE and USER_WORKSPACE and maintains the consistency of the whole integration environment. It also provides some interfaces so that other modules can access these metadata easily. By transferring the metadata with the corresponding data and tools, the workflows can be imported or exported among different grid communities. META APP DB META_PACKAGE
PACKAGE_PROFILE
META PROGRAM USER_DATA
SYS ENV DATA TME PROGRAM
USER_SCENARIO
SYS_PROFILE_DB TME_DATA
USER META DB USER PROFILE DB DB_DIR_PROFILE
META_DB_DIR USER_ENV_DATA META DB DIR DB
USER_WORKSAPCE
Figure 2. structure of the global metadata model
3.2 Wrapped Program Toolkit module Program integration is a very complex activity due to, for example, the heterogeneity of manipulated data and incompatible interfaces. In most cases, programs are black-box systems that do not allow any change in the program itself (Corradini et al, 2003). TME organizes a program as an icon with configurable interfaces. Its function and semantics of its interfaces are still hidden from users. Users have to shift more efforts for studying the function and understanding the meaning of each interface. Most bioinformatics tools are developed by different research or commercial units. The data formats that programs can accept are quite different from one another. We attach a bioinformatics package containing the programs with two files: profile and configuration. The profile encodes three kinds of information: 1) source information about the package, for example the url, from which the package is downloaded, version, release number, and code types(c/c++, perl and fortran and so on); 2) function summary information about analysis types (sequence alignment, structure prediction), methods and performance evaluation, most of which is extracted from the “README” file of the package; and 3) function and interface description of each program. The configuration file ensures programs in the package are executable on different platforms (Linux, SunOS and IRIX and so on). Program interfaces play a particular role in composing analysis workflows. Heterogeneity and diversity in program interfaces exhausts most users. To solve this problem, we classify the interfaces into three types: switch, real-value and file parameters. We can list all possible values for the switch parameters and limit the maximum and minimum for the real-value parameters. For the file parameters, the situation becomes a little
554
IADIS International Conference on Applied Computing 2005
more complex, because of the diversity in biological data formats. Our solution is to unify data formats of the file parameters as XML formats as much as possible. Since XML is more flexible to develop machine understandable code, we encode related information into XML files so that the system can identify them automatically. To this end, we design many filters that extract related information from the original outputs of the analysis tools into XML files (section 3.3). Using these filters, the programs are wrapped easily.
3.3 Active Service Provider module TME has already provided abundant utilities for composing and managing a workflow with distributed services. Each service is represented as an icon. Building a connection between different services is just to draw a line between icons. For specific application areas, however there are still some problems: 1) What data and programs should the user choose to perform a specific bioinformatics analysis? 2) How does the user keep semantic consistency when building a connection between two icons? For example what would happen if the user connects two icons with heterogeneous formats? For the first problem, we provide a tree-based browser for biological databases and tools. In TME window, each data or program service is represented as a node in the corresponding taxonomy tree. A user can search for the target object by traveling through the taxonomy tree. We are going to develop a search engine which allows user to search the services available by a natural language interface. For the second problem, although TME can check some syntactic errors for the whole workflow, little emphasis is put on semantics. We provide a method, called Active Service Provider, for solving the problem. Our Active Service Provider includes two procedures: semantic check and service provider. When a user connects two icons, Active Service Provider will be triggered to check whether this connection is admissible by examining the corresponding profile. If allowed, the system will detect where extra services are needed. For example, if a user wants to connect two icons with different data formats, the data filter service will be provided to perform data format transformation. Data format transformation is tedious work for biologists; hence it is becoming an active research area in bioinformatics. Two typical methods are data warehousing and federation. Related works are reported in detail in (Lacroix, 2002). Our method, data filter approach, is quite similar to the federation method. Filter manager & API
filter shared object library (XML format)
Filter
Filter
Filter
Filter
Filter
Filter
Filter
FASTA
PDB
SwissProt
SSEARCH
CLUSTALW
PostScript
RasMol
Figure 3. filter repository structure
We define a uniform XML format with enough capability of description for the biological data sources and typical analysis result of bioinformatics tools. To exchange different data formats with the defined XML format, we design a set of filters (Fig 3). Upon a user’s requirement, the filter manager accesses the corresponding filter to perform the intended transformation. Using this filter library, the data filter service can transform formats between any two different data sources. One of the advantages of this architecture is that developers only need to change the individual filter when a data source is changed. This filter library also contains two other classes of filters. One class is filters that extract some results from program outputs into XML format and is used to wrap programs. Another class of filter converts some XML analysis results into postscript format or RasMol script format and is used in the visualization module.
3.4 Analysis procedure/result visualization module Visualization plays a vital role in understanding the analysis procedures and results. Using the graphicallyenabled task editor environment in TME, a user can visually design and view the workflow, monitor its execution status and trigger corresponding tools to browse the content of an individual icon in the workflow.
555
ISBN: 972-99353-6-X © 2005 IADIS
In the visualization module, we focus on the visualization of analysis results through triggering third party visualization tools. There is a large amount of software for presenting biological data graphically, for example, Protein Explore, RasMol, CHIME for protein structure, gsview and GNU gv for many kinds of printable pictures and GnuPlot and SciLab for statistical data. To integrate those kinds of programs, we faced two challenging problems: 1) Although most of them provide different versions working on different platforms, for example RasMol can run on Windows, Mac and Unix-based platform, few of them are platform-independent; and 2) There are many programs for visualizing the biological data in a similar way. However, users usually prefer the one they are most accustomed to. In BAAQ, users can use the programs hosted in client side by the way of plugin, which most web browsers provide. What the system does is to provide the targeted data and to recommend content type, by which the browser decides which program on the client side will be triggered. Once content type is identified, Active Service Provider will call the corresponding filter (Fig 3) to transform the XML file into the corresponding format (i.e. PostScript, RasMol), then trigger the PluginTool to view the data. In this way, users can customize their preferred visualization tools.
4. CASE STUDY One of the most important tasks in the current field of bioinformatics is to predict the functions of proteins encoded in genome sequences. About 40 percent of proteins predicted from genome sequences are unknown function. Deinococcus radiodurans is well-known for its resistance to high dosage gamma ray irradiation (Daly et al, 1995), but the molecular mechanisms for this resistance remained to be disclosed. There must be mechanisms which respond to irradiation, because any DNA damage which results from irradiation is soon fixed in a while.
A
B
Figure 4. (a) Screen shot showing the configuration of grid environment, (b) workflow of the structure prediction for D. radiodurans proteome.
To find clues for these molecular mechanisms involved in the resistance out of proteome data, three programs: garnier, helixturnhelix and pepcoil in EMBOSS package (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/) are wrapped as program services, each of which has an input file and an output file in XML formats. They are distributed over three different machines: itbltasv (SunOS 5.8), charon (Red Hat 7.3) and gong (Red Hat 9.0). A user can select the available machines and manage remote files on them, as shown in Fig 4A. The final workflow to perform this analysis is shown in Fig 4B. In this workflow, the input sequence with SwissProt format is first transformed into XML format using the bioDatafilter service, which is triggered by Active Service Provider. Then the output is connected
556
IADIS International Conference on Applied Computing 2005
to three branches, each of which is processed by one of the three services above. Finally, their outputs are transformed into three postscript files by the ps_maker service, which is a filter service designed for transforming XML format into postscript format. A user can browse the results graphically by just double clicking the results icons, which will trigger the PluginTool. The results are shown in Fig 5. One of the proteins (the 1791st protein from the origin of the bacterium genome), which consists of 139 amino acid residues and is annotated as a hypothetical protein, is shown to have a helix-turn-helix motif, one of the well-known motifs for transcription factors (Fig 5B), and a coiledcoil region, one of the well-known structures for protein-protein interactions (Fig 5C). The helix-turn-helix motif and coiled-coil structure both consist of helix structures, and the regions predicted to have the motif and the structure are predicted as helix structures (Fig 5A). The simultaneous application of the genome annotation tools shows the consistency in the predictions. The regions predicted to have coiled-coil structures contains helix-turn-helix motifs (Fig 5B and Fig 5C). This overlap is, to our knowledge, quite a rare event. The result suggests that the region functions as a helix to interact with DNA and other proteins.
A.
Deinococcus radiodurans R1 complete chromosome 1
helix strand
[AE000513_1791] ..........
100
turn VT RW SH DVP RW PS GL RDDT
PL P YAAW RVL DL VD GKR TLA QI AAH LD LA QE DVE RA LEQ AQ SW LG RAL RRE QP V TE AVL DNVT
....
139
LD QV GE GA TLS
B.
QA LV S VVG PI GE FL I DDA
A LLG AVAP
ELD EA HL HQF VRQ LR T RRL A
Deinococcus radiodurans R1 complete chromosome 1
Helix-Turn-Helix
[AE000513_1791]
..........
100
VT RW SH DVP RW PS GL RDDT
PL P YAAW RVL DL VD GKR TLA QI AAH LD LA QE DVE RA LEQ AQ SW LG RAL RRE QP V TE AVL DNVT
....
139
LD QV GE GA TLS
C.
QA LV S VVG PI GE FL I DDA
A LLG AVAP
ELD EA HL HQF VRQ LR T RRL A
Deinococcus radiodurans R1 complete chromosome 1 [AE000513_1791]
Coiled-Coil ..........
100
VT RW SH DVP RW PS GL RDDT
PL P YAAW RVL DL VD GKR TLA QI AAH LD LA QE DVE RA LEQ AQ SW LG RAL RRE QP V TE AVL DNVT
.... LD QV GE GA TLS
QA LV S VVG PI GE FL I DDA
139 A LLG AVAP
ELD EA HL HQF VRQ LR T RRL A
Figure 5. results of structure prediction: (A) secondary structure; (B) helix-turn-helix structure; and (C) coiled-coil structure
5. DISCUSSION In this paper, we have proposed an integration framework for bioinformatics applications. In a grid-based environment, we address the integration of biological data and bioinformatics analysis tools for composing a robust workflow smoothly. Using knowledge rich metadata, a user can store and manage the biological data and bioinformatics tools as remote services, and build analysis workflows that integrate these services seamlessly. One of our targets for future research is to explore and discover knowledge from these services and workflows. Based on certain protocols, users can publish their private tools, analysis workflows and results. Also they can obtain some active services for solving related problems based on their preferences and the workflows created by other users. For composing a workflow smoothly in TME window, we have provided two strategies. In one strategy, we wrapped the programs and encoded them with knowledge rich metadata. In the other strategy, Active Service Provider is used to check related semantic and to provide corresponding services, for example, the
557
ISBN: 972-99353-6-X © 2005 IADIS
data filter service, if applicable. The data filter implementation and program wrapping are made easier by utilizing the filter library. Improving the model of our XML schema is one of the main tasks of the future research. We also propose a method for the visualization of analysis results using PluginTool. One of the advantages is that users can customize their preferred visualization tools on the client side. Developers can focus on the design of individual filter.
ACKOWLEDGEMENT This work was supported by the Special Coordination Funds Promoting Science and Technology from Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan and IT Based Laboratory (ITBL) Project. Authors are grateful to Dr. Norihiro Nakajima, Dr. Yoshio Suzuki, Mr. Yukihiro Hasegawa and Mr. Nobuhiro Yamagishi.
REFERENCE Achard F. et al, 2001. XML, Bioinformatics and Data Integration. Bioinformatics, Vol.17, No. 2, pp115-125. Asia Pacific BioGRID Initiative, http://www.apbionet.org/grid/. Blythe J. et al, 2004. Automatically Composed Workflows for Grid Environments. IEEE Intelligent Systems, Vol. 19, No. 4, pp16-23. Cannataro M. et al, 2004. Proteus, a Grid based Problem Solving Environment for Bioinformatics: Architecture and Experiments. The IEEE Computational Intelligence Bulletin, Vol. 3 No.1, pp7-18. Cannataro M. et al, 2001. KNOWLEDGE GRID: High Performance Knowledge Discovery on the Grid. GRID2001, Denver, Colorado, pp38-50. Corradini F. et al, 2003. An Agent-based Layered Middleware as Tool Integration. ESEC-Tool Integration Workshop, Helsinki, Finland, pp17-22. Critchlow T. et al, 2001. Experience Applying Meta-data to Bioinformatics. Information Sciences, Vol. 139, Issue 1-2, pp3-17. Daly M.J. et al, 1995. Resistance to Radiation. Science, Vol. 270, No. 1318, pp276-277. European Molecular Biology Open Software Suite, http://www.hgmp.mrc.ac.uk/Software/EMBOSS/. Gil Y. et al, 2004. Artificial Intelligence and Grids: Workflow Planning and Beyond. IEEE Intelligent Systems, Vol. 19, No.1, pp26-33. Koide H. et al, 1998. Development of a Message Passing Library for Heterogeneous Parallel Computer Cluster on STA. Trans. of JSCES, Vol. 3, pp85-88. Lacroix Z., 2002. Biological Data Integration: Wrapping Data and Tools. IEEE Trans Inf Technol Biomed, Vol. 6, No. 2, pp123-128. Liu L. et al, 2003. BioSeek: Exploiting Source-Capability Information for Integrated Access to Multiple Bioinformatics Data Sources. Third IEEE Symposium on Bioinformatics and Bioengineering, Bethesda, Maryland, pp263-271. Moreau L. et al, 2003. On the Use of Agents in Bioinformatics Grid. Third IEEE/ACM CCGRID'2003 Workshop on Agent Based Cluster and Grid Computing, Tokyo, Japan, pp653-661. Passi K. et al, 2002. A Model for XML Schema Integration. EC-Web2002, Aix-en-Provence, France, pp193-202. Pytlinski J. et al, 2002. BioGRID -Uniform Platform for Biomolecular Applications. Euro-Par2002, Paderborn, Germany, pp881-884. Siepel C. et al, 2001. An Integration Platform for Heterogeneous Bioinformatics Software Components. IBM Systems Journal, Issue 40, No.2, pp570-591. Stevens R.D. et al, 2000. TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. Bioinformatics, Vol. 16, No. 2, pp184-185. Stevens R.D. et al, 2003. myGrid: Personalised Bioinformatics on the Information Grid. Bioinformatics, Vol. 19, Suppl. 1, pp302-304. Takemiya H. et al, 1999. Development of a Software System (STA: Seamless Thinking Aid) for Distributed Parallel Scientific Computing. IPSJ MAGAZINE, Vol. 40, No.11, pp1104-1109. Imamura T. et al, 2003. A Visual Resource Integration Environment for Distributed Applications on the ITBL System. ISHPC 2003, Tokyo, Japan, pp258-268.
558