ToolBus – An Interoperable Environment for Biological Researchers

Boyu Yang, J Dana Eckart, Eric K. Nordberg, Bruno W. S. Sobral
Virginia Bioinformatics Institute (0477), Washington Street Phase 1, Blacksburg, VA 24061, U.S.A.

Abstract - ToolBus is an integrated environment in which data and tools can be interoperable in an open and flexible manner. Using this environment, biological researchers can access many kinds of Bioinformatics data sources and analysis tools. Its utilization of web services and its open API encourage and support the development of tools and visualization plugins by other development groups. As the number and diversity of tools and visualization plugins expand, the tool interoperability and data integration capabilities of ToolBus will enable its utility to expand at an exponential rate as the interconnections among the data can be fully realized. An example use case is discussed in detail.

Key words: Bioinformatics, ToolBus, Data Integration, Web Service

1.0 Introduction

As a result of the development of many high-throughput biological technologies, biological data are increasing exponentially, and the tools for analyzing these data are also expanding quickly. Unfortunately, most of these tools and data are not only scattered over the web but also often use different I/O formats, making their combined use to solve biological research problems both difficult and error-prone. There is an immediate need for interoperable access across all such tools and data sets. This capability is provided by the ToolBus/PathPort system, in which tools and visualization plugins can be dynamically added and configured, and the appropriate visualization plugins for displaying received data are automatically discovered at runtime. ToolBus has a unique web service user interface that is universal to all DOC and RPC based web services. To enable the integration and comparison of data obtained from disparate sources, ToolBus utilizes a data association mechanism that enables data selected from different visualizations to be grouped together for comparison. In addition to highly interactive visualization interfaces for examining data and analysis results, ToolBus allows work sessions to be saved and, because its Java implementation makes the system platform independent, shared between colleagues.

2.0 An Example Use Case

The Variola major virus is the causative agent of smallpox, a highly contagious and deadly human disease. Vaccinia is genomically very similar to Variola major but does not generally cause disease in humans. The two viruses are so closely related that Vaccinia is used as a vaccine strain to immunize against Variola. We compared the genomes of these two viruses, looking for a gene that may help explain the drastic difference in their effects on the human host.

The first step in the analysis is a comparison of the two annotated genomes using MUMmer [1]. The MUMmer web service also connects to a database of annotated genomes. When the results were received from the web service, ToolBus discovered that the 'Sequence Comparison' plugin could be used to view the MUMmer results. By drawing a box around the interesting region, as shown in Figure 1, we zoomed in for greater detail. From the parallel view, we identified a large feature that is present in Variola and does not appear to be present in Vaccinia. To further investigate this gene, a BLAST search was done to not only determine similar sequences, but also to confirm that this Variola gene is absent in Vaccinia.

Fig. 1 The sequence comparison and dot plot view

We used the BLAST web service deployed on a Linux cluster. Several versions of BLAST are available, including Translating BLAST, which was used in this work. The BLAST web service connects to a set of target sequence databases. Because we were dealing with viral genomes, we chose the 'virus' category of databases. The initial view for the BLAST results is a graphic display showing the query sequence at the top, with hit sequences aligned below. The hits are color-coded to indicate their scores.

The next step in our analysis of this Variola gene was to align the high-scoring, full-length hits using the CLUSTALW [2] multiple sequence alignment tool. Grabbing any one of the selected sequences and dragging it to the CLUSTAL input form transfers all selected sequences. We viewed the alignment results with the 'Sequence Alignment' plugin (Fig. 3). The final step in our analysis was to build a phylogenetic tree for these sequences based on our multiple sequence alignment.

During the course of this analysis, we used: MUMmer to do a complete genome comparison of two genomes and identify an interesting gene; BLAST to search a database for similar genes; CLUSTALW to align our top BLAST hits; and PHYLIP [3] to build a phylogenetic tree based on the aligned sequences. Throughout the process, we required no detailed knowledge of the individual tools or of the data formats they use. We were able to extract important information visually, and transfer information using an intuitive drag-and-drop mechanism.

Fig. 2 Invoking CLUSTAL web service and its result

Fig. 3. The alignment result from CLUSTAL

3.0 The System Architecture & Design

ToolBus is a Java application in which tools and visualization plugins can be developed independently and loaded into ToolBus. Data are received from tools, which can be either local programs or web services, and are graphically presented to the user via the visualization plugins. ToolBus queries its installed plugins to discover all data models capable of presenting the information received from a tool. Users can choose one or more of the data models to visualize and explore the data. ToolBus enables users, via the graphical viewers, to easily exchange data between tools and other visualization plugins. Data from one or more models can also be associated with one another to form new groups of information that can be further compared within ToolBus.

This architecture, whose structure is depicted in Figure 4 with solid arrows depicting module inclusion, allows new tools and visualization plugins to be added dynamically and incrementally. ToolBus provides an open API to encourage development by outside groups. Developing new tools or visualization plugins for use with ToolBus is accomplished by extending a few Java base classes provided by the ToolBus system. This not only makes ToolBus relatively easy to develop for, but also promotes a uniform presentation, since many of the capabilities are inherited from these base classes.
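The discovery step described above, in which ToolBus asks its installed plugins which data models can present a tool's output, might look roughly like the following. This is a minimal sketch under stated assumptions and is not the ToolBus API: the names DataModel, canHandle, ModelRegistry, and findCompatible are hypothetical, the XML root elements are invented for illustration, and a real plugin would extend the base classes supplied by ToolBus.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a ToolBus data model base class (not the real API).
abstract class DataModel {
    private final String name;

    DataModel(String name) { this.name = name; }

    String getName() { return name; }

    // Returns true if this model can present the given tool output.
    abstract boolean canHandle(String toolOutput);
}

// Illustrative model: accepts documents whose root element is <alignment>.
class AlignmentModel extends DataModel {
    AlignmentModel() { super("Sequence Alignment"); }

    @Override
    boolean canHandle(String toolOutput) {
        return toolOutput.trim().startsWith("<alignment");
    }
}

// Illustrative model: accepts documents whose root element is <tree>.
class TreeModel extends DataModel {
    TreeModel() { super("Phylogenetic Tree"); }

    @Override
    boolean canHandle(String toolOutput) {
        return toolOutput.trim().startsWith("<tree");
    }
}

// Hypothetical registry that queries every installed model.
class ModelRegistry {
    private final List<DataModel> installed = new ArrayList<>();

    void register(DataModel model) { installed.add(model); }

    // Collects every model that reports it can present the output;
    // the user would then pick from this list in a model chooser.
    List<DataModel> findCompatible(String toolOutput) {
        List<DataModel> matches = new ArrayList<>();
        for (DataModel model : installed) {
            if (model.canHandle(toolOutput)) {
                matches.add(model);
            }
        }
        return matches;
    }
}
```

Because each model decides for itself whether it can handle a result, a newly loaded plugin takes part in discovery as soon as it is registered, without changes to the registry.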

3.1 Data Communication Between Tools and Models

When the user uses a tool within ToolBus, the applyTool method defined in ToolBus is called, which begins a cascade of actions. The tool's init method is called first to provide either automatic initialization or a user interface in which users enter parameter values. The tool itself informs the tool manager when its invoke method should actually be called. The invoke method implements the selected function of the tool. For a web service, the invoke method sends the user-selected operation to the web service server, where the operation is actually implemented. The results returned by the tool, usually in the form of an XML document, are sent to ToolBus. ToolBus finds all compatible data models and launches the model chooser, from which users may select the desired model(s) and viewer(s) to create. The dashed lines of Figure 4 illustrate the lines of communication just described.

There are three primary ways in which tools can be added to ToolBus:

1. Using the VBI Web Service Finder - This is a UDDI based web service finder.
2. Giving a tool location - Supplying a name and the URL of a web service's WSDL document.
3. Adding customized tools - Customized tools usually provide a specially crafted interface to the user and are generally bundled with a corresponding visualization plugin.

Tools are also present as part of some visualization plugins. In such cases, the visualization plugins may utilize private tools known only to the plugin. The statistical and clustering tools for microarray analysis are an excellent example of this type of tool. To facilitate their efficient and easy use, these tools have been embedded in the microarray visualization plugin. While this does not provide an avenue for the addition of new tools, the end user can provide new and alternate locations of such tools.
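The applyTool cascade described in this section (init first, invoke once the tool reports that it is ready, then a hand-off of the result for model discovery) can be sketched as follows. The method names applyTool, init, and invoke come from the text, but their signatures, the Tool interface, EchoTool, and ToolManagerSketch are assumptions made for illustration and do not reproduce the actual ToolBus classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tool contract; the signatures are assumptions.
interface Tool {
    // Prepares parameters, then calls ready.run() once invoke may be called.
    void init(Runnable ready);

    // Performs the selected operation and returns its result (usually XML).
    String invoke();
}

// A local tool that needs no user input, so it signals readiness immediately.
class EchoTool implements Tool {
    private final String payload;

    EchoTool(String payload) { this.payload = payload; }

    @Override
    public void init(Runnable ready) { ready.run(); }

    @Override
    public String invoke() { return "<result>" + payload + "</result>"; }
}

// Hypothetical manager that records the order of the cascade.
class ToolManagerSketch {
    final List<String> log = new ArrayList<>();
    String lastResult;

    // Mirrors the cascade in the text: init, then invoke when the tool
    // says it is ready, then pass the result on for model discovery.
    void applyTool(Tool tool) {
        log.add("init");
        tool.init(() -> {
            log.add("invoke");
            lastResult = tool.invoke();
            log.add("find-models");
        });
    }
}
```

Passing a callback to init is one way to let a tool with a parameter form delay invoke until the user has finished entering values; a tool with automatic initialization simply runs the callback at once, as EchoTool does.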

Fig. 4 The ToolBus system architecture (modules shown: Model/View Manager with Model A, Model B, and more models, and View A, View B, and more views; Tool Manager with Tool A, Tool B, and more tools; Session Manager; Plugin Loader; Data Association)

Table 1. VBI developed or wrapped tools

Data: Genome search; Pathogen Background Information; Phylogenetic Tree Construction [3]; EBI (gene expression) [5]; GEO (gene expression) [6]
Gene Prediction: Genscan [7]; Glimmer [8]; GlimmerM [9]; TigrScan [7]; GrailEXP [10]; GeneMark [11]; Orpheus [12]
Probe design: PCR/Hybridization [13]; YODA [14]
DNA Assembly digestion and prediction: Contigs from trace files [15]; Restriction enzymes [16]; tRNA [17]; Transcription factors [18]; Ribosomal RNA [19]
Sequence Alignment or Search: BLAST [20]; FASTA [21]; ClustalW [22]; Smith-Waterman [23]; Sean (find potential SNP) [24]; Ssaha and SsahaSNP [25]; MUMmer [1]; InterProScan [26]
Microarray analysis: Agnes (Cluster program in R [27]); Hclust (Cluster program in R); Kmean (Cluster program in R); Anova (Anova analysis in R); Manova (Manova analysis in R); Diana (Hierarchical clustering in R); Multtest (f/t tests, bioconductor [28]); Rfda (Discriminant analysis in R); Rknn (KNN classification in R); Rlda (Supervised classification in R); Rpca (PCA classification in R); Rsom (Gene SOM in R); Rsvm (SVM classification in R)

Table 1 lists many of the tools provided by VBI for use with ToolBus. The vast majority of these tools (e.g. MUMmer) consist of one or more third party software systems that have been wrapped as web services to facilitate their use by ToolBus.

3.2 Plugin Data Models

Each plugin consists of a single data model and one or more viewers. Each data model can handle a specific type of data, which might be represented using one or more data formats. The data model, its associated viewers, and any customized tools are bundled together within a single jar file and can be loaded into ToolBus at startup or at runtime. Table 2 lists the existing plugins as well as those currently under active development.

Table 2. Models developed at VBI

DNA and Annotations
Sequence alignment visualizations
Microarray
The BLAST/FASTA Similarities
Probedesign
Protein interactions network
Phylogenetic Tree
Bio objects pathways
Gel fingerprint
Sequence comparison
Pathogen background information
GO Group Suggestion
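The plugin structure described in this section (exactly one data model, one or more viewers, and one or more data formats that the model understands) could be captured by a small descriptor such as the one below. This is only an illustrative sketch: PluginBundle and its methods are hypothetical names, the format identifier used in the usage note is invented, and the actual jar layout and loading mechanism of ToolBus are not given in the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical description of one plugin bundle (not the ToolBus API).
final class PluginBundle {
    private final String modelName;
    private final Set<String> formats = new LinkedHashSet<>();
    private final List<String> viewers;

    // A bundle has one data model, at least one format, and at least one viewer.
    PluginBundle(String modelName, List<String> formats, List<String> viewers) {
        if (formats.isEmpty() || viewers.isEmpty()) {
            throw new IllegalArgumentException(
                "a plugin needs at least one data format and one viewer");
        }
        this.modelName = modelName;
        for (String format : formats) {
            // Format names are stored lower-cased so lookups ignore case.
            this.formats.add(format.toLowerCase(Locale.ROOT));
        }
        this.viewers = Collections.unmodifiableList(new ArrayList<>(viewers));
    }

    String getModelName() { return modelName; }

    List<String> getViewers() { return viewers; }

    // True if the plugin's data model understands the named format.
    boolean supportsFormat(String format) {
        return formats.contains(format.toLowerCase(Locale.ROOT));
    }
}
```

For example, the sequence comparison plugin from the use case could be described as new PluginBundle("Sequence comparison", Arrays.asList("comparison-xml"), Arrays.asList("parallel view", "dot plot")), where "comparison-xml" is a made-up format name.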

3.3 Universal Web Service User Interface

Because of ToolBus' reliance on web services, a graphical user interface for DOC and RPC based web services is automatically generated from the information provided by their WSDL documents. Due to the limitations of WSDL, we have also developed WSOPSL (Web Service Operations Language) to provide additional information concerning the nature of the parameters for web services. While ToolBus web services aren't required to provide a WSOPSL document, doing so greatly enhances their usability by end users. WSOPSL supplements WSDL with parameter types, default values, and parameter descriptions. This information enables tools to determine which dragged values to accept and which to reject. It also enables choice lists and other capabilities that provide a friendlier user interface to the web service. The BLAST web service tool shown in Figure 2 is typical of this type of user interface.
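To illustrate how parameter metadata of the kind WSOPSL supplies could drive drag-and-drop filtering and choice lists, the sketch below models one parameter entry in memory. The WSOPSL schema itself is not shown in the text, so ParameterSpec, its Kind values, and the validation rules are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory form of one WSOPSL parameter entry.
// Field names and parameter kinds are assumptions, not the WSOPSL schema.
final class ParameterSpec {
    enum Kind { INTEGER, DNA_SEQUENCE, CHOICE }

    private final String name;
    private final Kind kind;
    private final String defaultValue;
    private final String description;
    private final List<String> choices;

    ParameterSpec(String name, Kind kind, String defaultValue,
                  String description, List<String> choices) {
        this.name = name;
        this.kind = kind;
        this.defaultValue = defaultValue;
        this.description = description;
        this.choices = Collections.unmodifiableList(new ArrayList<>(choices));
    }

    String getName() { return name; }

    String getDefaultValue() { return defaultValue; }

    String getDescription() { return description; }

    // Values offered in a choice list; empty for non-CHOICE parameters.
    List<String> getChoices() { return choices; }

    // Decides whether a value dragged onto this parameter should be accepted.
    boolean accepts(String dragged) {
        if (dragged == null || dragged.isEmpty()) {
            return false;
        }
        switch (kind) {
            case INTEGER:
                return dragged.matches("-?\\d+");
            case DNA_SEQUENCE:
                return dragged.matches("[ACGTNacgtn]+");
            case CHOICE:
                return choices.contains(dragged);
            default:
                return false;
        }
    }
}
```

With such a description, a generated input form can pre-fill the default, show the description as a tooltip, and reject a dragged value (for instance, free text dropped on a sequence field) before the web service is ever called.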

3.4 Data Association

Perhaps one of the most distinctive features provided by ToolBus is the ability to associate data from different data models into groups, and to enable users to manipulate and compare these groups by creating their own generalized Venn diagrams. Grouping is the mechanism by which independent data in separate models may be connected/associated. Data selected from different models can be grouped together for further comparison. Comparisons are made via user-created Venn diagrams. Each circle, representing a group of associated data, can be dragged and overlapped within the panel, creating simple or complex intersecting regions. Regions can be temporarily highlighted to enhance their visibility, or removed from view.
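The regions of such a Venn diagram correspond to ordinary set operations on the groups. As a minimal sketch, assuming each group is a set of item identifiers (the text does not say how ToolBus represents groups internally, and VennRegions is a hypothetical helper), one region can be computed as the items that lie in every overlapped circle and in none of the excluded ones:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for Venn-style comparison of associated data groups.
final class VennRegions {
    private VennRegions() {}

    // Returns the items that lie in every "inside" group and in no
    // "outside" group, i.e. one region of the diagram.
    static Set<String> region(List<Set<String>> inside, List<Set<String>> outside) {
        if (inside.isEmpty()) {
            throw new IllegalArgumentException("at least one inside group is required");
        }
        // Start from a copy so the caller's groups are never modified.
        Set<String> result = new LinkedHashSet<>(inside.get(0));
        for (Set<String> group : inside) {
            result.retainAll(group);   // intersection with each overlapped circle
        }
        for (Set<String> group : outside) {
            result.removeAll(group);   // exclude items in the other circles
        }
        return result;
    }
}
```

With two groups, passing both as inside groups yields the overlap, while passing one as inside and the other as outside yields the part of the first circle that the second does not cover.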

4.0 Future Directions & Conclusions

Continued development of additional web service tools and visualization plugins is planned, with particular emphasis on microarray, proteomics, and biological pathway analysis. A special class of visualization plugin designed to assist researchers in identifying associations between data is also planned. In particular, Gene Ontology and Ortholog based group suggestors would enable data residing in different data models to be linked together. Work on improving data storage management will help to reduce the memory utilization of ToolBus on the client.

ToolBus is an integrated environment in which data and tools can be interoperable in an open and flexible manner. Using this environment, biological researchers can access many kinds of Bioinformatics data sources and analysis tools. Its utilization of web services and its open API encourage and support the development of tools and visualization plugins by other development groups. As the number and diversity of tools and visualization plugins expand, the tool interoperability and data integration capabilities of ToolBus will enable its utility to expand at an exponential rate as the interconnections among the data can be fully realized. An open source version of ToolBus for non-commercial use is available from http://pathport.vbi.vt.edu/download.

5.0 Acknowledgments

This work is supported by the U.S. Department of Defense, Grant number: W911SR-04-0045, awarded to principal investigator Prof. Bruno W. S. Sobral. The development of ToolBus, and associated web service tools, has benefited considerably from the use of the following open source systems and tools: Tomcat and Axis from Apache, WSDL4J and UDDI4J from IBM, and Castor from Exolab. We would also like to thank Ms. Christine Lee and Mr. Abhishek Agrawal for their efforts in developing aspects of ToolBus, and Dr. Tian Xue for her tireless efforts in developing many of the currently available web services on which ToolBus relies.

6.0 References

[1] Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol 5, R12 (2004).
[2] Thompson, J.D., Higgins, D.G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-80 (1994).
[3] Felsenstein, J. PHYLIP (Phylogeny Inference Package). 3.6 edn (2004).
[4] Huang, Q., Liu, D., Majewski, P., Schulte, L. C., Korn, J. M., Young, R. A., Lander, E. S. & Hacohen, N. The plasticity of dendritic cell responses to pathogens and their components. Science 294: 870-875 (2001).
[5] Rocca-Serra, P. et al. ArrayExpress: a public database of gene expression data at EBI. C R Biol 326, 1075-8 (2003).
[6] Edgar, R., Domrachev, M. & Lash, A.E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207-10 (2002).
[7] Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.
[8] Delcher, A.L., Harmon, D., Kasif, S., White, O. & Salzberg, S.L. Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27, 4636-41 (1999).
[9] Pertea, M. and Salzberg, S.L. Using GlimmerM to find genes in eukaryotic genomes. Current Protocols in Bioinformatics, 2002.
[10] D. Hyatt, J. Snoddy, D. Schmoyer, G. Chen, K. Fischer, M. Parang, I. Vokler, S. Petrov, P. Locascio, V. Olman, Miriam Land, M. Shah, and E. Uberbacher, GRAIL-EXP and the Genome Analysis Toolkit, The 13th Annual Cold Spring Harbor Meeting on Genome Sequencing & Biology, May 2000.
[11] Borodovsky, M. & J, M. GeneMark: Parallel Gene Recognition for both DNA Strands. Computers & Chemistry 17, 123-133 (1993).
[12] Frishman, D., Mironov, A., Mewes, H.-W., and Gelfand, M. (1998). Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucl. Acids Res., 26, 2941-2947.
[13] Rozen, S. & Skaletsky, H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 132, 365-86 (2000).
[14] Nordberg, E.K. YODA: Selecting Signature Oligonucleotides. Bioinformatics (2004).
[15] Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8, 186-94 (1998).
[16] Roberts, R.J., Vincze, T., Posfai, J., Macelis, D. Nucleic Acids Research 33: D230-D232 (2005).
[17] Lowe, T.M. & Eddy, S.R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955-64 (1997).
[18] Matys, V. et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 31, 374-8 (2003).
[19] Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16, 276-7 (2000).
[20] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J Mol Biol 215, 403-10 (1990).
[21] W. R. Pearson and D. J. Lipman (1988), Improved Tools for Biological Sequence Analysis, PNAS 85:2444-2448.
[22] Thompson, J.D., Higgins, D.G. & Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-80 (1994).
[23] Waterman, M.S. Efficient sequence alignment algorithms. J Theor Biol 108, 333-7 (1984).
[24] A.M. Baldo, J. Labate, and L.D. Robertson. 2004. A search for molecular diversity in tomato. p. 147 In Final Abstracts Guide, Plant and Animal Genome XII, San Diego, CA.
[25] Ning, Z., Cox, A.J. & Mullikin, J.C. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-9 (2001).
[26] Zdobnov, E.M. & Apweiler, R. InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-8 (2001).
[27] Dalgaard, P. Introductory Statistics with R, 288 (Springer Verlag, 2002).
[28] Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80 (2004).
