Key Engineering Materials Vols. 277-279 (2005) pp 259-265, online at http://www.scientific.net. © (2005) Trans Tech Publications, Switzerland. Online available since 2005/Jan/15.
Information Visualization with Text Data Mining for Knowledge Discovery Tools in Bioinformatics∗

Jinah Park1, Changsu Lee2, Jong C. Park3

1 School of Engineering, Information and Communications University (ICU), 119 Munji-ro Yusong-ku Daejeon, 305-714, KOREA
2,3 Computer Science Division and AITrc, Korea Advanced Institute of Science and Technology (KAIST), 389-2 Kusong-dong Yusong-ku Daejeon, 305-701, KOREA

[email protected], [email protected], [email protected]

Keywords: molecular interactions, visualization, bioinformatics, knowledge discovery.
Abstract. An abundant amount of information is produced in the digital domain, and an effective information extraction (IE) system is required to navigate this sea of information. In this paper, we show that an interactive visualization system works effectively to complement an IE system. In particular, three-dimensional (3D) visualization can turn a data-centric system into a user-centric one by bringing the human visual system, a powerful pattern recognizer, into the IE cycle. Because information as data is multidimensional in nature, 2D visualization has been the preferred mode. However, we argue that the extra dimension available in a 3D mode provides a valuable space in which we can pack an orthogonal aspect of the available information. As candidates for this orthogonal information, we have considered the following two aspects: 1) abstraction of the unstructured source data, and 2) the history line of the discovery process. We have applied our proposal to text data mining in bioinformatics. Through case studies of data mining for molecular interactions in yeast and in mitogen-activated protein kinase pathways, we demonstrate the possibility of interpreting the extracted results with a 3D visualization system.

Introduction

Bioinformatics is the research discipline that deals with the vast amount of knowledge in biology through computational techniques. Many researchers in bioinformatics are interested in augmenting the available databases by automatically extracting information from natural language text data, such as journal articles in MEDLINE or natural language descriptions in comment fields of sequence databases such as GenBank. This technique is generally referred to as text data mining. Recently, many information extraction (IE) systems have been developed to help extract meaningful facts from such textual resources by reducing the search space for interesting pieces of undiscovered knowledge.
IE systems produce facts of specified patterns from unstructured resources, and these facts are then used to infer new pieces of knowledge. However, the inference process is often quite complicated and depends on human thinking: putting several pieces together in order to construct a novel picture. We believe that a visualization system plays an important role in complementing the functions of an IE system by guiding the user both in the inference process over the extracted facts and in the process of determining the kinds of facts to be extracted. In particular, an interactive 3D visualization system can put human perception to work in achieving such tasks.

In this paper, we propose to combine text data mining with a 3D visualization system for guided knowledge discovery. The particular application we focus on is in the domain of molecular interaction. We have earlier developed an automatic pathway identification system, BioIE, that extracts molecular interaction information from MEDLINE [1, 2]. This IE system utilizes bi-directional incremental parsing in the framework of combinatory categorial grammar

∗ Supported by the Korea Science and Engineering Foundation through AITrc.
On the Convergence of Bio-, Information-, Environmental-, Energy-, Space- and Nano-Technologies
(CCG), which is one of the full-fledged grammars for natural languages such as English. We have also explored the possibility of utilizing the extracted results for an automatic extension of existing ontologies such as Gene Ontology [3]. Nevertheless, the sheer size of the body of extracted results rules out any useful manual inspection of their nature, leaving us to rely on further means of automation for consistency checking and so on. We therefore suggest in this study the possibility of interpreting the extracted results by means of a 3D visualization system.

Unlike visualizations of structural or physical objects, which occur naturally in 3D, the canvas for information visualization is usually 2D because of the high dimensionality of the body of information itself. In our research, we suggest that the extra 3rd dimension can be put to good use by presenting other orthogonal dimensions of information. While one may find many candidates for such an extra space, we consider the following two aspects in this paper: 1) abstraction of the unstructured source data, and 2) the history line of the discovery process.

The rest of the paper is organized as follows: the ‘Method’ section describes our proposed 3D visualization systems and shows how each aspect of information is placed in the 3rd dimension; the ‘Results and Discussion’ section discusses case studies with molecular interactions in yeast and with the analysis of mitogen-activated protein kinase pathways, and reports the pieces of knowledge discovered through our visualization system.

Method

3D Visualization System. Our proposed paradigm is to place the 3D visualization system as an active part in the cycle of a knowledge discovery process, as shown in Fig. 1. Through the information extraction (IE) system, the structured data or facts are established for further examination.
In our application, we queried on-line abstracts of MEDLINE journal articles with keywords such as ‘MAPK’ (mitogen-activated protein kinase), ‘human’, etc. The IE system that we used for our application is BioIE [1, 2], where predefined relations (such as ‘activate’, ‘inhibit’) and terms (or molecular names) for molecular interactions are used in the search space of the MEDLINE textual database. The set of structured data generated by each query is visualized on a 2D plane. Each term is abstracted by a bullet that can be color-coded, and each relation is abstracted by an edge connecting the associated bullets. The edge can be color-coded as well. The 2D plane, on which the graph representing each query result is placed, is visualized in our 3D visualization window. We will later show how we utilize the extra dimension available in 3D.
Fig. 1. Schematic diagram showing our system in action
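The per-query graph described above can be sketched as a small data model. This is only an illustrative sketch, not the BioIE or visualization code: the names `Relation`, `QueryPlane` and `EDGE_COLORS`, the color choices, and the two sample relations are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One extracted fact: two molecular names joined by a relation keyword."""
    source: str
    kind: str          # e.g. 'activate' or 'inhibit'
    target: str
    sentence: str = ""  # source sentence, kept for later verification


@dataclass
class QueryPlane:
    """The 2D graph produced by a single query (hypothetical container)."""
    query: str
    relations: list = field(default_factory=list)

    def nodes(self):
        # Every distinct term becomes one bullet on the plane.
        names = set()
        for r in self.relations:
            names.update((r.source, r.target))
        return sorted(names)

    def edges(self):
        # Every relation becomes one edge between two bullets.
        return [(r.source, r.target, r.kind) for r in self.relations]


# Arbitrary color assignment for edges (an assumption, not the paper's palette).
EDGE_COLORS = {"activate": "green", "inhibit": "red"}

plane = QueryPlane("MAPK AND human")
plane.relations.append(Relation("MKKK", "activate", "MKK"))
plane.relations.append(Relation("MKK", "activate", "MAPK"))
```

Keeping the source sentence on each relation is a design choice that makes it possible to attach a sentence abstraction to every edge later on.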
Abstraction of the Unstructured Source Data. IE systems produce facts of specified patterns from unstructured resources. Since it is not possible to cover all the possible patterns that people use in their writings, we are required at times to go back to the unstructured source to verify that the extracted piece of relational data is indeed what the user intended to retrieve. Due to the sheer size of the extracted data, however, it is impossible to check every individual piece of data. The extra dimension available in 3D over 2D can be utilized to capture this aspect by placing an abstract representation of the source sentence alongside each piece of relational data. We represent the source sentence using a color-banded rod [4], as shown in Fig. 2. Each identified term is marked in its place as a band in the rod representing the whole sentence. As shown in Fig. 2, for example, a conjunction item (‘and’) may be marked in a different color for some specific verification purpose. Our proposition is that, in the presence of a massive amount of extracted data, the shape, color and length information can greatly improve the user’s ability to perceive patterns in the data despite its size.
Fig. 2. A color-banded rod. The terms are colored in red and the conjunction (‘and’) item is colored in blue

Fig. 3 shows the 3D visualization where the abstraction of unstructured source data is placed in the 3rd dimension. On the horizontal plane, we place the facts that are retrieved by an IE system (e.g., BioIE) using colored dots and color-coded lines connecting a pair of dots. As mentioned earlier, each dot and each connecting line in Fig. 3 represent a specific molecular name (e.g., Ste50P, MKKK) and a particular relation (e.g., inhibit, bind, interact), respectively. The color-banded rods are placed in the 3rd dimension, or in the vertical planes, at the mid-point of the edge. The rods represent the source sentence from the MEDLINE abstract from which the corresponding relation and the molecular names are extracted. The relative height of the vertical rod indicates the length of the sentence, and the location of each molecular name is colored (in red) within the rod.
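A minimal sketch of how such a rod could be derived from a sentence is given below. It assumes whitespace tokenization, measures sentence length in tokens, and uses hypothetical names (`Band`, `build_rod`, `rod_anchor`) and placeholder protein names; the actual system in [4] may compute bands differently.

```python
from dataclasses import dataclass


@dataclass
class Band:
    start: float  # fractional position along the rod, 0.0 = sentence start
    end: float
    color: str


def build_rod(sentence, terms, marks=None):
    """Return (length, bands) for one source sentence.

    length is the token count, to be scaled into a rod height by the renderer.
    terms are molecular names (red bands); marks maps other words of interest,
    such as 'and', to a band color.
    """
    marks = marks or {}
    tokens = sentence.split()
    n = len(tokens)
    bands = []
    for i, tok in enumerate(tokens):
        word = tok.strip(".,;:()")
        if word in terms:
            color = "red"
        elif word.lower() in marks:
            color = marks[word.lower()]
        else:
            continue  # unmarked words leave the rod uncolored
        bands.append(Band(i / n, (i + 1) / n, color))
    return n, bands


def rod_anchor(p, q):
    """Mid-point of the edge between two dots on the horizontal plane."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


# Placeholder sentence and names, for illustration only.
length, bands = build_rod(
    "ProtA inhibits ProtB and ProtC",
    terms={"ProtA", "ProtB"},
    marks={"and": "blue"},
)
```

Storing band positions as fractions of the sentence keeps the band layout independent of whatever height scale the renderer applies.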
Fig. 3. Abstraction of the source text is placed in the 3rd dimension

One of the potential uses of such a visualization system is to quickly verify the merit of each piece of data extracted by the IE system. For example, as in the case shown in Fig. 2 where the conjunction item is colored in blue, if the blue band is located after the two red bands, it is likely that there is unextracted molecular interaction information in the source sentence. Instead of reading the actual sentences from hundreds of abstracts, a user sees the pictorial representation of each sentence. The bird’s eye view offered by this visualization allows the user to quickly pinpoint an odd type, or a majority type, and guides his/her attention across the visualized screen.
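The visual cue described here (a blue band located after two red bands) can also be stated as a simple check over the ordered band colors of a rod. The sketch below is an assumption-laden illustration: in the paper the pattern is spotted by the user's eye, not by code, and the function names are hypothetical.

```python
def may_hide_interaction(colors, term="red", mark="blue"):
    """True if a marked band (e.g. 'and') appears after at least two term bands.

    colors is the sequence of band colors read from the start of the sentence.
    """
    seen_terms = 0
    for c in colors:
        if c == term:
            seen_terms += 1
        elif c == mark and seen_terms >= 2:
            return True
    return False


def flag_rods(rods):
    """Return the indices of rods worth a second look in the source text."""
    return [i for i, colors in enumerate(rods) if may_hide_interaction(colors)]


rods = [
    ["red", "red"],          # plain pair of terms
    ["red", "red", "blue"],  # conjunction after both terms
    ["red", "blue", "red"],  # conjunction between the terms
]
flagged = flag_rods(rods)
```

Such a check could only pre-sort candidates; the decision about whether information was actually missed still requires reading the source sentence.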
The History Line of the Discovery Process. Often it takes more than one search inquiry to arrive at the goal. Keeping a record of the discovery progress, or the thought process, is not only beneficial for future reference but also useful as a reference for crosschecking. Again, suppose that each IE output is visualized on a 2D plane. Each query output generates such a 2D visualization, and let us further assume that more than one query was used in a particular task. This time, the 3rd dimension represents the time stamp associated with each query. We place the 2D planes in layers in the 3D space, where the previous query result is placed just below. In this new visualization paradigm, we implemented methods to take the union of the data with the previous time-stamped data, to filter less-frequent data and to emphasize high-occurrence data.

Fig. 4 illustrates the 3D visualization with a union method. Let A = {p1, p2, p3, p4, p5} be the set of nodes that were identified from the initial query (Query No. 1). Suppose that p2, q2, p3, q1 and q3 are extracted in the next query (Query No. 2). Let B = {p2, q2, p3, q1, q3}. The result of Query No. 2 is placed on a different layer along the 3rd dimension (z-dimension in Fig. 4). At this time, however, we compute B – A so that only those terms that were not previously introduced, here q1, q2 and q3, are displayed in the new layer. In this way, we can interpret the result from the new query in relation to the preceding queries through the visualization system. The next section demonstrates the knowledge discovery processes guided by our proposed visualization system.
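The layering rule can be written as a running set difference. The following is a minimal sketch under the assumption that each query result is reduced to a set of term names; `layer_queries` is a hypothetical helper, not part of the described system.

```python
def layer_queries(results):
    """Assign terms to layers in query order.

    results is a list of term sets, one per query, oldest first. Each layer
    receives only the terms not introduced by any earlier query, so that the
    layers taken together form the union of all query results.
    """
    seen = set()
    layers = []
    for terms in results:
        terms = set(terms)
        layers.append(terms - seen)  # e.g. B - A for the second query
        seen |= terms
    return layers


A = {"p1", "p2", "p3", "p4", "p5"}  # Query No. 1
B = {"p2", "q2", "p3", "q1", "q3"}  # Query No. 2
layers = layer_queries([A, B])
```

With the sets from the text, the first layer holds all of A and the second layer holds only q1, q2 and q3, while p2 and p3 stay on the layer where they first appeared.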
Fig. 4. The union of query data is visualized in a 3D space

Results and Discussion

Case Study I: Molecular Interactions in Yeast. Some molecular interactions may occur physically, but not physiologically (that is, inside living cells) [5]. It is thus important to specify whether the biological experiment in question has been performed inside a living cell (in vivo), or not (in vitro). In the textual source, this conditional information is usually mentioned within the same sentence as the interaction under consideration. However, it is not always the case that the conditional information mentioned in the same sentence refers to the interaction under consideration (i.e., it may refer to some other interaction, which is not extracted and happens to be in the same sentence). The state of the art in IE is not yet up to identifying which is which in a reliable manner, which motivates the use of some other means of complementing the function of information extraction, such as visualization.

An in vivo or in vitro condition qualifies a particular interaction. Hence, when the color-banded rod is used to represent the source sentence, if a certain rod does not show any red bands but only a color-coded band for such a condition, we may predict that there is a high chance of unextracted pieces of information in the natural language sentence corresponding to that rod. We have collected approximately 2000 articles from MEDLINE through PubMed using the keywords “in vitro” and “S. cerevisiae” (the yeast Saccharomyces cerevisiae). From these articles, BioIE has extracted molecular interaction information, of which we have visualized approximately 300 interactions whose source sentences contain the expression in vitro. The conditions are represented as blue bands.
Fig. 5. The in vitro condition for the interaction between RPA and URS1H
Fig. 5, for example, shows the visualization of the in vitro condition for the interaction between RPA and URS1H. In this case, the rod is overall rather long, and the two red bands are followed by a blue band, which is again followed at some distance by another blue band. In general, the in vitro or in vivo condition is applicable to an interaction. Thus, we can predict that there are other pieces of unextracted information between the two blue bands. This prediction is borne out by the following source sentence.

“Although RPA binds to the double-stranded URS1H site in vitro, it has much higher affinity for single-stranded than for double-stranded URS1H, and one-hybrid assays suggest that RPA does not bind to this site at detectable levels in vivo.” (PMID: 9199289)
Going back to the source sentence, as suspected, we can see that RPA interacts with URS1H under the in vitro condition, but not under the in vivo condition. This example provides further evidence that visualization works as a powerful means for identifying information that may be non-trivial to identify otherwise.

Case Study II: Mitogen-Activated Protein Kinase Pathways. An MAPK pathway is an important signal transduction pathway for the regulation of cell responses. It is composed of three modules: MKKK (MAPK Kinase Kinases), MKK (MAPK Kinases) and MAPK. A variety of extracellular signals such as growth factors, stresses, and differentiation factors activate MKKK, MKK, and MAPK, in that sequence. MAPKs regulate growth, survival and apoptosis through various proteins including Chop and c-Jun [6,7].

In this section, we show how visualization, which utilizes the 3rd dimension as the history line of the discovery process, may help to discover MAPK-related knowledge. We have used the keyword combination “MAPK pathway and human” for MEDLINE to collect approximately 3000 journal articles. We have used the keyword “human” because the functionalities of MAPK depend on the species in question. Fig. 6 displays the query result. The thickness of each (cone-shaped) edge reflects how often the corresponding data occurs within the search domain, here the MEDLINE abstracts. An examination of the result via the interactive visualization tool led us to make another query on JNK, which is a type of MAPK. We collected approximately 2000 MEDLINE abstracts through the keywords “JNK and human.” The results are displayed in our 3D visualization system. Again, inspection of the query results led us to make a third query on Rap1, which is not a type of MAPK. Another 300 abstracts were retrieved from the database with the keywords “Rap1 and human”, and the results were displayed in our visualization system as shown in Fig. 7.
The goal of this process is to rediscover MAPK pathways.
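The mapping from occurrence counts to edge thickness, together with the filtering of less-frequent data mentioned in the Method section, can be sketched as follows. The linear scaling, the threshold parameter `min_count`, the width range and the sample triples are all assumptions for illustration; the paper does not specify how thickness is computed.

```python
from collections import Counter


def edge_weights(extracted, min_count=1):
    """Count how often each (source, relation, target) triple was extracted
    and drop triples seen fewer than min_count times."""
    counts = Counter(extracted)
    return {edge: c for edge, c in counts.items() if c >= min_count}


def thickness(count, max_count, min_w=1.0, max_w=8.0):
    """Linearly map an occurrence count onto an edge width (assumed scaling)."""
    if max_count <= 1:
        return min_w
    return min_w + (max_w - min_w) * (count - 1) / (max_count - 1)


# Illustrative triples only; not actual BioIE output.
extracted = [("MKK", "activate", "MAPK")] * 3 + [("MKKK", "activate", "MKK")]
weights = edge_weights(extracted, min_count=2)
widths = {e: thickness(c, max(weights.values())) for e, c in weights.items()}
```

Raising `min_count` thins out rarely reported relations, which is one way to keep a layer readable when thousands of abstracts are mapped onto it.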
Fig. 6. Visualization of the query result with the keywords MAPK pathway AND human
Fig. 7. Visualization of three different time-stamped data
Fig. 8 summarizes what we have found with respect to the JNK pathway [8], which is one of the MAPK pathways. We have also successfully rediscovered pathway maps of a reasonable quality for the ERK and p38 pathways, which are other types of MAPK pathways. The results were encouraging: a relative novice in the field was able to navigate the 3D visualized space, where approximately 5300 abstracts are mapped out, and quite easily draw out pathways of reasonable quality (as confirmed by a biologist), based mainly on the visualization.
Fig. 8. Summary of discovered knowledge of molecular interactions with respect to the JNK pathway

Summary and Conclusion

In this paper, we proposed novel 3D visualization techniques for guided knowledge discovery by combining text data mining with a 3D visualization system. Because information as data is multidimensional in nature, 2D visualization has been the preferred mode. However, we have argued that the extra dimension available in 3D mode is a valuable space where we can pack an orthogonal aspect of the available information. We utilized the 3rd dimension to display color-banded rods for abstracting unstructured source data, and to display different time-stamped data for visualizing the history line of the discovery process. We applied the proposed method to molecular interactions in yeast and to those in MAPK pathways.

3D representation provides an extra depth dimension that can be used as a natural way to pack more information onto the 2D monitor screen, considerably slowing down the process of overloading it [9]. In contrast, 2D representation gives rise to many line crossings and overlapping nodes, even when the number of involved nodes is relatively small, a problem that troubles most protein interaction maps that provide 2D graphical views of directed graphs. By using 3D representation, we can also rely on the human visual system as a powerful pattern recognizer and turn data-centric systems into user-centric ones [10]. We also believe that
the method of combining text data mining and visualization provides a good testbed and challenge for, and will accelerate advances in, both technologies.

References

[1] JC Park, “Using Combinatory Categorial Grammar to Extract Biomedical Information,” IEEE Intelligent Systems (2001), Vol. 16, No. 6, pp. 62-67.
[2] JC Park, HS Kim, JJ Kim, “Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar,” Proceedings of Pacific Symposium on Biocomputing, Vol. 6, 2001, pp. 396-407.
[3] JB Lee, JC Park, “Text Data Mining for Automatic Gene Ontology Extension,” Intelligent Systems for Molecular Biology (ISMB), Proceedings of the 2nd Meeting of the Special Interest Group on Text Data Mining, Edmonton, Alberta, CA, Aug. 2002, pp. 22-25.
[4] C Lee, J Park, JC Park, “Mediatory Visualization for Structured Data and Textual Information,” Proceedings of the 3rd IASTED International Conference on Visualization, Imaging, and Image Processing, Benalmadena, Spain, Sept. 2003, pp. 926-932.
[5] CM Deane, L Salwinski, I Xenarios and D Eisenberg, “Protein Interactions: Two methods for assessment of the reliability of high throughput observation,” Molecular and Cellular Proteomics, 1, 2002, pp. 349-356.
[6] TP Garrington and GL Johnson, “Organization and regulation of mitogen-activated protein kinase signaling pathways,” Current Opinion in Cell Biology (1999), Vol. 11, pp. 211-218.
[7] C Widmann, S Gibson, MB Jarpe and GL Johnson, “Mitogen-activated protein kinase: Conservation of a three-kinase module from yeast to human,” Physiological Reviews (1999), Vol. 79, No. 1, pp. 143-180.
[8] C Lee, J Park, JC Park, “Case Study: Visualization and Analysis of Mitogen-Activated Protein Kinase Pathways in the Literature,” IS&T/SPIE International Conference on Visualization and Data Analysis (VDA), San Jose, USA, Jan. 2004.
[9] SG Eick, “Engineering Perceptually Effective Visualizations for Abstract Data,” in Scientific Visualization (Chapter 8), eds: GM Nielson, H Hagen, H Muller, IEEE Computer Society, 1997, pp. 191-210.
[10] JP Lee, “Visualization for Bio- and Chem-Informatics: are you being served?” Proceedings of IEEE Visualization, San Diego, CA, Oct. 2001, pp. 515-518.