A Multiagent Framework to Integrate and Visualize Gene Expression Information

Li Jin 1,2,*, Karl V. Steiner 2,3,*, Carl J. Schmidt 4, Gang Situ 1, Sachin Kamboj 1, Kay T. Hlaing 1, Morgan Conner 1, Heebal Kim 4, Marlene Emara 4, and Keith S. Decker 1,*

1 Department of Computer Information Sciences, 2 Delaware Biotechnology Institute, 3 Department of Electrical and Computer Engineering, 4 Department of Animal and Food Sciences, University of Delaware, Newark, DE, 19711
*Email: 1{jin, decker}@cis.udel.edu, [email protected]

2005 IEEE ICDM Workshop on MADW & MADM

Abstract

With rapidly growing amounts of genomic and expression data publicly available, efficient and automated analysis tools are increasingly important for biologists to derive knowledge for a large variety of organisms. Multiagent information gathering methods can be used to retrieve and integrate genomic and expression information from available databases or web sources to generate knowledge databases for organisms of interest. We present a novel, flexible and generalizable bioinformatics multiagent system, called BioMAS, which in this paper is used to gather and annotate genomic data from various databases and web sources and to analyze the expression of gene products for a given organism. We also present a new approach to visualize complex datasets representing gene expression and pathway models in a hierarchical view space using the Starlight information visualization system (Starlight). This approach is an innovative application of Starlight in the field of comparative genomics.

1. Introduction

1.1. Overview

Since the successful completion of the human genome project in 2003 [1], not only genomic data but also proteomic and expression data of numerous species have been published. Homologs, which are homologous genes that have common origins and share an arbitrary threshold level of similarity determined by alignment of matching bases [2], play an important role in predicting gene products by using the annotation in public databases. Therefore, sequences of different organisms are compared for clues about gene function.

Public databases, such as KEGG [3], GenBank [4], SwissProt [5], and Ensembl [6], provide a huge data resource for genomic information, metabolic pathways and gene expression that can be utilized to study sequences, gene expression, or pathways of closely related species by sequence and functional annotations [7]. However, these public databases are heterogeneous and constantly updated, and new data sources are constantly published on-line. Therefore, a multiagent information gathering system that satisfies the following basic requirements should help to advance biological research. (1) It can retrieve and integrate sequence and function information from distributed, heterogeneous and dynamically updated databases. (2) New agents for new data sources or new analysis services can be added easily in the future. (3) It should be easily adjustable for a variety of different organisms. In this paper, we present a general bioinformatics multiagent system, called BioMAS, which meets these three basic requirements.

Biologists are familiar with several visual representations of metabolic pathways. Visualizations may help to understand the complex relationships between the pathway components, to extract important information, and to compare pathways between different organisms. Our goal is not only to retrieve and integrate gene expression information from public data sources into our knowledge databases but also to load all the gene expression information into an information visualization system such that relationships between the datasets can be visualized easily. Our approach is to utilize the Starlight information visualization system [8] to visually analyze organism-specific data within a three-dimensional hierarchical view with existing pathway diagrams and chromosome diagrams. Starlight was initially created as a visualization system for the military intelligence community, with features such as data query and data mining tools and references to tagged images and diagrams [9]. This paper describes an innovative application of this program to the field of bioinformatics.

In the rest of this paper, we first review existing approaches for retrieving, integrating, and visualizing biological data. We then discuss the processing of the gene expression component in BioMAS in detail, because most of the other components of BioMAS have previously been described in [10]. Next, we present the visualization of the gene expression data utilizing Starlight. The knowledge about the organism generated from different sources is combined into one visual space to make it convenient for biologists to compare genomic data across different organisms. As a demonstration, we present the visualization of the gene expression comparison of two different organisms: chicken (Gallus gallus) and human (Homo sapiens). Finally, we discuss the advantages and limitations of our methods and future work.

1.2. Related work

BioMAS is a bioinformatics multiagent system with an increasing number of functions and dedicated agents. The work presented in this paper is a new gene expression processing organization in BioMAS. Previously described organizations include (1) the basic annotation and query agent organization, (2) the functional annotation agent organization, and (3) the EST (Expressed Sequence Tags) processing agent organization [10].

Several systems are available that aim at retrieving, integrating, annotating and visualizing biological data. Our system differs from systems such as TSIMMIS [11] or InfoSleuth [12] in that it can respond to dynamic changes in data sources, which those systems cannot. New sources and analysis methods can be easily integrated into our system, while GeneWeaver [13] is not based on a shared architecture that supports reasoning about secondary user utility [10]. Our approach is not only to retrieve and integrate related gene expression information into databases and publish the information on-line but also to provide a novel method to visualize the relationships among the information utilizing Starlight.

There have been several related studies to visualize gene expression data onto metabolic networks, such as KEGG, Expasy [14], and EcoCyc [15]. The type of pathway visualization these web sites provide is static and predefined, i.e. users can obtain the information only by following the hyperlinks embedded in search results. For dynamic visualization, considerable research effort has been focused on implementing new visualization tools that redraw pathway networks and interactions among proteins automatically. Typical applications are the BioMaze project [16] and the VisANT visualization tool [17]. Although virtual reality tools also provide a way to understand metabolic networks and gene expression [18], visually representing the relationships among the multidimensional information is a complex task. A critical advantage of Starlight is that it can combine data and images such as pathway diagrams into one visual space, enabling users to see the position of interesting data on the embedded images.

2. System and methods

The architecture of our system is illustrated in Figure 1, using the Gallus Knowledge Base (GallusKB) as an example. The Gallus information is retrieved from distributed sources and integrated into GallusKB using BioMAS. The gene expression and pathway data are then parsed into XML files as required by Starlight. Finally, user queries are processed interactively by Starlight and the results are presented to the user in a hierarchical structure. This enables users to explore the result space by following associated relationships. The gene product pathways and chromosome positions can be displayed in both the pathway diagram and the chromosome diagram in the Starlight view space.


2.1. Information retrieval and integration

BioMAS is composed of five groups of agent organizations:
(1) Sequence Annotation Agents, which integrate gene sequence annotations from various sources, such as NCBI databases, Protein Domains [19], PSort [20], and SwissProt [5];
(2) EST Processing Agents, which are responsible for building chicken contigs (contiguous sequences) and saving contigs into databases;
(3) Functional Annotation Agents, which annotate the function of a gene using Gene Ontology [21] and MeSH (Medical Subject Headings) [22] terms;
(4) Query Agents, which facilitate queries from users; and
(5) Pathway Agents, which download human pathway information from KEGG, predict pathways for a given organism, and predict the organs where gene products are expressed.
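The organization structure described above can be pictured with a small sketch. The paper does not show the BioMAS agent interfaces, so every name below (AgentOrganization, register, process, the two stand-in agents and the placeholder values they return) is a hypothetical illustration of how a new agent might be added to an organization without touching the existing ones.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types: a record is a flat mapping of field name to value, and
# an agent is any callable that returns additional fields for a record.
Record = Dict[str, str]
Agent = Callable[[Record], Record]


@dataclass
class AgentOrganization:
    """A named group of agents that each contribute annotations to a record."""
    name: str
    agents: List[Agent] = field(default_factory=list)

    def register(self, agent: Agent) -> None:
        # Adding a new data source only requires registering one more agent.
        self.agents.append(agent)

    def process(self, record: Record) -> Record:
        # Each agent sees the fields gathered so far and adds its own.
        merged = dict(record)
        for agent in self.agents:
            merged.update(agent(merged))
        return merged


def swissprot_agent(record: Record) -> Record:
    # Stand-in for a real lookup; the returned text is a placeholder.
    return {"swissprot": "placeholder annotation for " + record["gene"]}


def psort_agent(record: Record) -> Record:
    # Stand-in for a localization prediction; the value is a placeholder.
    return {"localization": "placeholder"}


sequence_annotation = AgentOrganization("Sequence Annotation Agents")
sequence_annotation.register(swissprot_agent)
sequence_annotation.register(psort_agent)

annotated = sequence_annotation.process({"gene": "hsa:124"})
```

In this sketch an organization is just an ordered list of callables, which keeps the example short; a real multiagent system would also need messaging, scheduling and failure handling, none of which is modeled here.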

[Figure 1 components: distributed databases and on-line sources (NCBI DB, PSort, Protein Domains, SWISSPROT, Gene Ontology, MeSH Terms, Ensembl DB, KEGG DB); BioMAS (Sequence Annotation Agents, EST Processing Agents, Functional Annotation Agents, Information Extraction Agent, Pathway Image Processing Agent, Pathway Agents, BlastX); Chicken Contigs DB; Pathway DB (Human Pathway Table, Chicken Pathway Table, Pathway Images Table); GallusKB; Java parser; Starlight Visualization System; User.]

Figure 1. Architecture of knowledge information gathering and visualization.

The organizations outlined in (1)-(4) were described previously in [10, 23]. Therefore, this paper only focuses on Pathway Agents. The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides a web-accessible database of human pathways, which can also be used to predict chicken pathways by blasting the human gene sequence against the chicken contig database. As shown in Figure 1, the Pathway Agents of BioMAS retrieve the KEGG pathway data for human genes, place them in the Pathway Database (Pathway DB), and save all human pathway information in a human pathway table. The Chicken Contigs Database (Chicken Contigs DB) holds 30,214 chicken contigs. BlastX [24] is used to blast the retrieved human gene sequence against Chicken Contigs DB with an E-value cutoff of 1 × 10⁻⁵ to identify the chicken gene products involved in the individual pathways as annotated by KEGG. All the chicken gene products identified are saved into a chicken pathway table under Pathway DB.

In the next step, the pathway maps with highlighted chicken homologs are generated by the Pathway Image Processing Agent. As an example, http://udgenome.ags.udel.edu/gallus/pathway/hsa00010.php shows the pathway map of Glycolysis / Gluconeogenesis of Homo sapiens, where the highlights indicate matches between chicken homologs and human genes. The human genes and chicken contigs are mapped to chromosomal locations using the genomic sequence as determined by Washington University [25] and annotated using Ensembl [6]. In this way, the chromosome positions of human and chicken genes involved in the same pathway can be compared visually. A Java parser is used to transform the data in Pathway DB into XML, which serves as the input for Starlight. Users can then use Starlight to conduct data mining and data analysis.
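The homolog selection step described above can be sketched as a simple filter over BLAST hits. This is a minimal illustration, not the BioMAS implementation: the BlastHit type, the function name and the dictionary layout are hypothetical, and keeping hits with an E-value less than or equal to the cutoff is an assumption, since the text only states the cutoff value. The first hit reuses the values of the Figure 2 record (228 and 7e-60, read here as bit score and E-value); the second hit is invented to show a rejection.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

E_VALUE_CUTOFF = 1e-5  # cutoff stated in the text


@dataclass(frozen=True)
class BlastHit:
    """One human-gene-versus-chicken-contig alignment (hypothetical layout)."""
    human_gene: str
    chicken_contig: str
    bit_score: float
    e_value: float


def chicken_pathway_rows(
    hits: Iterable[BlastHit],
    gene_to_pathways: Dict[str, List[str]],
    cutoff: float = E_VALUE_CUTOFF,
) -> List[Dict[str, str]]:
    """Keep hits at or below the cutoff and attach the human gene's pathways."""
    rows = []
    for hit in hits:
        if hit.e_value > cutoff:
            continue  # too weak to be treated as a homolog
        for pathway in gene_to_pathways.get(hit.human_gene, []):
            rows.append(
                {
                    "pathway": pathway,
                    "human_gene": hit.human_gene,
                    "chicken_contig": hit.chicken_contig,
                }
            )
    return rows


hits = [
    BlastHit("hsa:124", "UD.GG.Contig26702", 228.0, 7e-60),
    # Invented hit, for illustration only: fails the cutoff.
    BlastHit("hsa:124", "EXAMPLE.Contig0001", 31.0, 0.02),
]
gene_to_pathways = {"hsa:124": ["hsa00010"]}

rows = chicken_pathway_rows(hits, gene_to_pathways)
```

Each surviving row corresponds to one entry of a chicken pathway table: a chicken contig inherits the pathway membership of the human gene it matches.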

2.2. Information visualization

A significant effort has been focused on dynamically generating biochemical network diagrams. The manually generated static KEGG diagrams are a popular resource commonly used by biologists. However, to efficiently explore genomic databases such as Pathway DB, a sophisticated tool is needed to select, search, navigate and analyze the database visually, especially for the comparison of metabolic pathways in different organisms. With GallusKB, our intention is to analyze the similarities and differences in metabolic pathways and gene expression between human and chicken. In addition, the corresponding chromosome positions of human genes and their chicken homologs are of interest. For this paper, we use the hsa00010 pathway for Glycolysis / Gluconeogenesis in Homo sapiens provided by KEGG as an example. Our approach is described in detail below for the following aspects: the data model defining the input data, the view mode with the hierarchical view, navigation to query the database in 3D visualization space, and the data presentation.

2.2.1. Data model. The input data format is a flat XML file, as outlined in the example in Figure 2, extracted from GallusKB.

    Kidney
    hsa00010
    Glycolysis / Gluconeogenesis - Homo sapiens
    61/61
    hsa00010.gif
    ec:1.1.1.1
    hsa:124
    alcohol dehydrogenase 1A (class I), alpha polypeptide
    >hsa:124 ADH1A; alcohol dehydrogenase 1A (class I), alpha polypeptide [EC:1.1.1.1] (A)
    MSTAGKVIKCKAAVLWELKKPFSIEEVEVAPPKAHEVRIKMVAVGICGTDDHVVSGTMVT
    PLPVILGHEAAGIVESVGEGVTTVKPGDKVIPLAIPQCGKCRICKNPESNYCLKNDVSNP
    QGTLQDGTSRFTCRRKPIHHFLGISTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTG
    YGSAVNVAKVTPGSTCAVFGLGGVGLSAIMGCKAAGAARIIAVDINKDKFAKAKELGATE
    CINPQDYKKPIQEVLKEMTDGGVDFSFEVIGRLDTMMASLLCCHEACGTSVIVGVPPDSQ
    NLSMNPMLLLTGRTWKGAILGGFKSKECVPKLVADFMAKKFSLDALITHVLPFEKINEGF
    DLLHSGKSIRTILMF
    4q21q23
    CHK124
    ADH1.
    ADH1
    Chr4
    UD.GG.Contig26702
    228
    7e-60
    http://udgenome.ags.udel.edu/gallus/displayReport.php?type=rep&name=GP&id=UD.GG.Contig26702
    Organ,Heart,hsa00010,ec:1.1.1.1,hsa:124,UD.GG.Contid26702

Figure 2. Flat XML format data of one chicken contig record.

2.2.2. Hierarchical view mode. Among the various view modes available in Starlight, the hierarchical view mode is selected for visual representation to explore the following research interests:
• Gene expression in several organs for chicken contigs;
• Relationships between human genes, chicken homolog contigs and their respective chromosome positions;
• Comparison of human genes and chicken homolog contigs in KEGG pathways;
• Visually supported detection of chicken homologs for human gene products.

Figure 3. Hierarchical data structure.

Figure 3 shows the hierarchical data structure defined for the visualization of data sets. With this hierarchical structure, we can study the relationships between organs, pathways, human genes and chicken contigs. We can also visually compare human and chicken gene expression with pathway diagrams and chromosome maps.

2.2.3. Navigation. With the hierarchy view mode, the dataset is displayed in 3-D view space as shown in Figure 4. As an example, this figure presents a hierarchical view of pathway hsa00010 - Glycolysis / Gluconeogenesis - Homo sapiens.

Figure 4. Hierarchical view of pathway hsa00010 Glycolysis / Gluconeogenesis - Homo sapiens.
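The relationship between the flat record of Figure 2 and the hierarchy of organs, pathways, enzymes, human genes and chicken contigs can be sketched in a few lines. The element names below (record, organ, pathway, enzyme, human_gene, chicken_contig) are hypothetical, because the tag names of the actual Starlight input are not given in the text, and Python's standard xml.etree.ElementTree module stands in for the Java parser mentioned in Section 2.1. The two records reuse identifiers from Figure 2.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical field names, ordered from the top of the hierarchy to the bottom.
LEVELS = ["organ", "pathway", "enzyme", "human_gene", "chicken_contig"]


def record_to_xml(record: Dict[str, str]) -> str:
    """Serialize one flat record as a single-level XML element."""
    element = ET.Element("record")
    for name in LEVELS:
        ET.SubElement(element, name).text = record[name]
    return ET.tostring(element, encoding="unicode")


def build_hierarchy(records: List[Dict[str, str]]) -> dict:
    """Nest flat records as organ -> pathway -> enzyme -> gene -> contig."""
    root: dict = {}
    for record in records:
        node = root
        for name in LEVELS:
            # Descend one level, creating the branch on first use.
            node = node.setdefault(record[name], {})
    return root


shared = {
    "pathway": "hsa00010",
    "enzyme": "ec:1.1.1.1",
    "human_gene": "hsa:124",
    "chicken_contig": "UD.GG.Contig26702",
}
records = [dict(shared, organ="Kidney"), dict(shared, organ="Heart")]

xml_text = record_to_xml(records[0])
hierarchy = build_hierarchy(records)
```

Keeping the file flat and deriving the tree from an ordered list of fields means that the same records can be regrouped under a different ordering of levels without changing the stored data.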


Navigation through organs, pathways, enzymes, genes, and chicken contigs is conducted by following the hierarchical tree, as shown in Figure 5. Interactive navigation is possible along the path of any organ. Figure 5 shows the result of zooming into the dataset for the heart, hsa00010, ec:1.1.1.1, hsa:124 and CHK:124. Selecting any record displays its details. This hierarchical view shows how certain genes are expressed within a specific organ, and how the respective chicken homologs are expressed. The details for each specific record can be called up as shown in Figure 6, which is based on UD.GG.Contig26702 of hsa:124. Each of these detail views can be converted to an HTML page via XSLT and published on a designated web site.

Figure 5. Close-up view of heart data in hierarchical view.

Figure 6. Details for item hsa:124, UD.GG.Contig26702.

The hierarchy view mode makes it possible to visually query the gene data and compare gene expression in human and chicken. The field query generator can be used to create a query, and the query results can be presented in the link view of the hierarchical level. In this hierarchical view, any relationship query among data in different hierarchy levels can be conducted by selecting the data bar of interest in each level, which highlights the resulting links as yellow lines. The corresponding results of the query can be displayed in a 3D environment. Figure 7 presents the hierarchical query view for the kidney. All records expressed in the kidney and their related data are linked with yellow lines between levels.

Figure 7. Hierarchical view of gene expression in the kidney. Yellow lines connect all data points related to the kidney.
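The kidney query of Figure 7 amounts to selecting one value at one hierarchy level and collecting the links between adjacent levels for every matching record. The sketch below illustrates that idea only; it says nothing about how Starlight evaluates queries internally, and the level names, function name and record layout are hypothetical. The two records reuse identifiers from Figure 2.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical level names, ordered from the top of the hierarchy to the bottom.
LEVELS = ("organ", "pathway", "enzyme", "human_gene", "chicken_contig")


def linked_edges(
    records: List[Dict[str, str]], level: str, value: str
) -> Set[Tuple[str, str]]:
    """Return the links between adjacent levels for records matching the query."""
    edges: Set[Tuple[str, str]] = set()
    for record in records:
        if record[level] != value:
            continue
        path = [record[name] for name in LEVELS]
        # Pair each node with the node one level below it.
        edges.update(zip(path, path[1:]))
    return edges


shared = {
    "pathway": "hsa00010",
    "enzyme": "ec:1.1.1.1",
    "human_gene": "hsa:124",
    "chicken_contig": "UD.GG.Contig26702",
}
records = [dict(shared, organ="Kidney"), dict(shared, organ="Heart")]

kidney_links = linked_edges(records, "organ", "Kidney")
```

The returned pairs play the role of the highlighted lines in the view: each pair joins a selected item to the related item one level below it.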

One of the unique features of Starlight is its MetaImage tool, with which image files can be processed and graphically linked to correlated information within a dataset. Here the Starlight MetaImage tool is used to process pathway and chromosome diagrams so that these diagrams can be visually linked to genetic information, supporting genomic visualization. Figure 8 presents a 3-D view of the data with gene, pathway, and chromosome position mapping of genes expressed in the kidney. Pathway diagrams and chromosome images are embedded into the same view space, allowing for simultaneous exploration of these data. In Figure 8, image 1 (left) is the chicken chromosome map, image 2 (center) is the human chromosome map, and image 3 (right) is the pathway map of Glycolysis / Gluconeogenesis of Homo sapiens. The yellow lines highlight where the chicken homologs are matched with the human genes.

Figure 8. 3D view of pathway, human and chicken chromosome maps highlighting the position mapping of genes expressed in the kidney.

3. Results and discussion

BioMAS is a flexible and general multiagent system that can work for different organisms of interest. The Gallus Knowledge Base and the Fungi Knowledge Base (http://udgenome.ags.udel.edu) have been generated using BioMAS. GallusKB has a total of 30,214 chicken contigs in the database, and chicken gene products have been predicted to be involved in a total of 140 pathways based on annotations in the KEGG database. 5851 human proteins in the KEGG database were used to search GallusKB, and 5437 (92%) Gallus gene products were found to have associated pathways. Through a Java parser, data in the databases can be formatted into XML and entered into Starlight. Since images can be loaded into the data view space of Starlight, these images can be very useful for a better understanding of the data. As shown in Figure 8, traditional static pathway diagrams and chromosome maps can be linked to data sets as well. Data can be presented to the user in a hierarchical view that allows the user to explore the view space through the associated relationships while navigating the data sets. The examples used in Section 2.2 show preliminary results of applying Starlight to a genomics evaluation.

4. Conclusion and future work

In this paper, we have presented a multiagent approach to retrieve, integrate and analyze gene expression information and a new approach to visualize gene expression by using the Starlight information visualization system. With the ability to explore information in the publicly available genetic domain on the web, it is essential to develop increasingly powerful tools to retrieve the target information from different bioinformatics resources and to visualize this information efficiently. Our approach is to apply a multiagent system (BioMAS) to retrieve data from separate resources and subsequently utilize the Starlight system to visualize correlations between the datasets. The pathway agent of BioMAS can automatically update the human and chicken data in the pathway database whenever the data in the external resources change.

Currently, the view space in Starlight has to be updated manually. This could be improved in future work by using a trigger to update the view space automatically whenever the external source data change. The MetaImage tool provided within Starlight can coordinate data with relevant images. This feature is not automated yet either, which can lead to significant processing time for large numbers of similar images, so it is another focus of future work. The data are currently stored as flat XML, and future work should investigate using a database as the input format instead. The preliminary results obtained to date are encouraging and outline the potential of new visualization techniques for interactively exploring genomic and pathway databases.

5. Acknowledgements

This publication was partially supported by awards from the National Science Foundation (NSF 0092336), the US Department of Agriculture (9935205-8228), and the National Center for Research Resource at the National Institutes of Health (2 P20 RR016472-04) under the INBRE program.

6. References

[1] Human Genome Project, http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
[2] Jackson, J.H., Terminologies for Gene & Protein Similarity, Technical Reports & Reviews No. TR 99-01, Michigan State University, http://www.msu.edu/~jhjacksn/Reports/similarity.htm
[3] KEGG: Kyoto Encyclopedia of Genes and Genomes, Kanehisa Laboratory, http://www.genome.ad.jp/kegg/
[4] GenBank, NIH genetic sequence database, http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
[5] Swiss-Prot Protein knowledgebase, TrEMBL Computer-annotated supplement to Swiss-Prot, http://us.expasy.org/sprot
[6] Ensembl Genome Browser, Sanger Institute, The Wellcome Trust, http://www.ensembl.org/
[7] Benson D.A. et al. Genbank. Nucleic Acids Res., 28:15–18, 2000. http://www.ncbi.nlm.nih.gov.
[8] Starlight Information Visualization System, Pacific Northwest National Laboratory, http://starlight.pnl.gov/
[9] Kritzstein, B., Starlight, Military Geospatial Technology Online Archives, Volume: 1 Issue: 1, http://www.military-geospatialtechnology.com/article.cfm?DocID=339
[10] Decker, K., Khan, S., Schmidt, C., Situ, G., Makkena, R., Michaud, D., Biomas: A multi-agent system for genomic annotation. International Journal of Cooperative Information Systems, 11 (3-4): 265-292, 2002.
[11] Chawathe, S., H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom, The TSIMMIS project: integration of heterogeneous information sources. In Proceedings of the Tenth Anniversary Meeting of the Information Processing Society of Japan, December 1994.
[12] Nodine, M. and A. Unruh. Facilitating open communication in agent systems: the infosleuth infrastructure. In M. Singh, A. Rao, and M. Wooldridge, editors, Intelligent Agents IV, pages 281–295. Springer-Verlag, 1998.
[13] Bryson, K., M. Luck, M. Joy, and D.T. Jones. Applying agents to bioinformatics in geneweaver. In Proceedings of the Fourth International Workshop on Collaborative Information Agents, 2000.
[14] ExPASy Proteomics Server, Swiss Institute of Bioinformatics, http://us.expasy.org/
[15] Encyclopedia of Escherichia coli K12 Genes and Metabolism, http://ecocyc.org/
[16] Zimányi, E., S. Skhiri dit Gabouje, Semantic Visualization of Biochemical Databases. In Semantics for GRID Databases: Proc. of the Int. Conf. on Semantics for a Networked World, ICSNW2004, Paris, France, June 2004.
[17] Hu, Z., Joseph Mellor, Jie Wu, Charles DeLisi. VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics. 2004; 5 (1): 17.
[18] Dickerson, J.A., Y. Yang, K. Blom, Using Virtual Reality to Understand Complex Metabolic Networks, Atlantic Symposium Comp Biol Genomic Info Systems Technol September. 950-953.
[19] Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D., ProDom: automated clustering of homologous domains. Brief Bioinform 3(3), 246-51, 2002.
[20] PSort: subcellular localization prediction, Brinkman Laboratory, Simon Fraser University, http://www.psort.org/
[21] Ashburner, M., and Lewis, S., On ontologies for biologists: the Gene Ontology--untangling the web. Novartis Found Symp 247, 66-80; discussion 80-3, 84-90, 244-52, 2002.
[22] Medical Subject Headings, http://www.nlm.nih.gov/mesh/meshhome.html
[23] Decker, K., S. Khan, C. Schmidt, D. Michaud, Extending a Multi-Agent System for Genomic Annotation. Proceedings of the Fifth International Workshop on Cooperative Information Agents, Modena, September 2001. LNAI 2182, Springer-Verlag, 2001.
[24] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J., Basic local alignment search tool. J Mol Biol 215(3), 403-10, 1990.
[25] Contiguous Chromosomal Chicken Sequences Sequencing Center, Washington University, St. Louis, http://genome.wustl.edu/projects/chicken/index.php?unmasked=1
