Semantic Search among Heterogeneous Biological ... - CiteSeerX

ISSN 0582-9879

Acta Biochimica et Biophysica Sinica 2004, 36(5): 365–370

CN 31-1300/Q

Semantic Search among Heterogeneous Biological Databases Based on Gene Ontology Shun-Liang CAO, Lei QIN1, Wei-Zhong HE1, Yang ZHONG2, Yang-Yong ZHU, and Yi-Xue LI1,3* Department of Computing and Information Technology, Fudan University, Shanghai 200433, China; 1Shanghai Center for Bioinformation Technology, Shanghai 200235, China; 2School of Life Sciences, Fudan University, Shanghai 200433, China; 3Bioinformation Center of Shanghai Institutes for Biological Sciences, the Chinese Academy of Sciences, Shanghai 200031, China

Abstract Semantic search is a key issue in integration of heterogeneous biological databases. In this paper, we present a methodology for implementing semantic search in BioDW, an integrated biological data warehouse. Two tables are presented: the DB2GO table to correlate Gene Ontology (GO) annotated entries from BioDW data sources with GO, and the semantic similarity table to record similarity scores derived from any pair of GO terms. Based on the two tables, multifarious ways for semantic search are provided and the corresponding entries in heterogeneous biological databases in semantic terms can be expediently searched. Key words

Gene Ontology; semantic search; BioDW

Various biological database systems including data capture, data storage, data retrieval and other data processing methods have been developed. These systems have become effective tools for today’s genomics and related studies. However, the highly distributed and heterogeneous characteristics of biological databases are inconvenient for retrieval of needed information from different data sources [1]. The integration of these heterogeneous databases has become a crucial issue in biological research. For this purpose, many renowned bioinformatics centers have presented their own solutions, e.g., the Entrez system developed in NCBI [2], the Sequence Retrieval System (SRS) of EBI [3]. Nevertheless, most of them aim at the heterogeneity of structure, hardly focusing on the integration of the content of data from different databases. In order to manage semantic heterogeneity in biological database systems, the meaning of the interchanged information has to be understood across the systems. The use of ontology for the explication of implicit knowledge becomes an approach to overcome the semantic Received: January 16, 2004 Accepted: March 9, 2004 This work was supported by the grants from the National High Technology and Development Program of China (No. 2002AA231011, No. 2002AA231041), the National Key Project for Basic Research (No. 2002CB512801) and the Major Project of Shanghai Commission of Science and Technology (No. 02DJ14013) *Corresponding author: Tel, 86-21-64836199; Fax, 86-21-64838882; Email, [email protected]

heterogeneity. By capturing knowledge about a subject in a shareable, computationally accessible and computable way they describe, ontology provides a model of concepts that can be used to form a semantic framework for many data storage, retrieval and analysis tasks. In practice, Gene Ontology (GO) describes gene products from three orthogonal aspects: molecular function, biological process and cellular component [4]. These vocabularies and their relationships are represented in the form of directed acyclic graphs (DAGs). Thus, the taxonomy in the GO may be referred to as a network in which each term or concept may represent a “child node” of one or more “parent nodes”. There are two types of child-to-parent relationships: “is a” and “part of”. The first type is defined when a child is an instance of a parent. The second type is used to describe the situation where a child node is a component of a parent. A biological data integration platform called BioDW has been developed [5,6]. This system can retrieve the data from different sources into a data warehouse represented by a relational data model, through the procedures of data extracting, transforming and cleaning. In this paper, using GO as a tool for biological clustering and constructing of DB2GO table to correlate GO annotated entries from BioDW data sources with GO terms as well as constructing semantic similarity table to record similarity scores derived from any pair of GO terms, we

366

Acta Biochimica et Biophysica Sinica

Vol. 36, No. 5

have provided semantic search methods based on GO for resolving semantic heterogeneity among biological databases.

through DB2GO table. After that a GO ID needs to be selected to start the process of semantic search in the current database. The organization of DB2GO table and semantic similarity table are described below.

Infrastructure of Semantic Search

DB2GO table

The flowchart of semantic search process in BioDW is shown in Fig. 1. In BioDW, there are two basic data access methods based on semantic search: one is from GO to database entries and another is from database entries to GO. For the first method: after browsing GO DAG tree or entering a keyword as a query to retrieve some GO terms/ IDs, the DB2GO table or semantic similarity table will help users to retrieve suitable entries from the designated target database. For the second method: from a known entry of a GO annotated database, such as SwissProt [7], users will retrieve a list of GO terms/IDs annotating this entry

Fig. 1

In order to retrieve the corresponding database entries annotated by GO quickly and conveniently, we extract and store all the relations between GO and entries from different databases into a unified table––DB2GO. The data origins of DB2GO can be divided into two types: (1) Group based upon the achievements accomplished by other database systems, such as Compugen’s GenBank [8] GO annotation and the GOA program developed by EBI, whereby we can acquire a series of files with records of relations between data sources entries and GO, such as InterPro2GO and EC2GO; (2) Group based on direct extracts from BioDW member

The flowchart of semantic search process in BioDW (A), (B), and (C) represent three different search paths.

May, 2004

Shun-Liang CAO et al.: Semantic Search among Heterogeneous Biological Databases Based on Gene Ontology

367

databases. By constructing DB2GO, the entries in the BioDW member databases are linked organically to the terms of GO. It is easy to find entries in the DB2GO table from different databases as they all are annotated by the same GO term. Since all these entries are annotated by the same GO term, they are semantically identified, and thus the semantic heterogeneity among databases is partially resolved.

operator, p(c) is the probability of finding a child of “c” in the taxonomy, i.e., the number of the children of “c” vs. that of total terms. In order to carry out the comparison of semantic similarity in a convenient way, we can calculate the similarity of every pair of terms in GO (each term corresponds to a node in the DAG of GO) according to the Resnik’s equation and store the results in the semantic similarity table.

Semantic similarity table

Search Types in BioDW

Various methods have been developed for quantifying the semantic similarity in terms of ontology [9]. The recent advances in this field are exploiting either information theoretic or information content principles [10]. It has been demonstrated that these types of approaches are less sensitive, and in some cases not sensitive at all [11]. Information content allows us to measure semantic similarity between terms based on the assumption that the more information two terms share in common, the more similar they are. In this situation, the information shared by two terms can be calculated using the information content of the terms and subsuming them in the ontology. Such a semantic similarity model was proposed by Resnik [11] and defined as Eqn (1):

{− log[ p(c)]} sim(ci , c j ) = c∈max S ( c ,c ) i

j

(1)

where sim(ci,cj) comprises the set of parent terms shared by both terms ci,cj, and “max” represents the maximum

Fig. 2

Browsing GO In usual case, user may want to retrieve all the protein entries annotated by a GO term with a GO ID from a certain database, e.g., SwissProt. If the user does not know the corresponding GO ID of the specific function in GO or the corresponding term, he can carry out the data retrieval by browsing GO. This process is illustrated in Fig. 2. Starting from the function root GO ID (GO:0003674), one can browse its child terms step by step to get the end terms’ GO ID (GO:0003680). Then, search through DB2GO table allows the retrieval of the corresponding entries annotated by GO terms in BioDW member databases. If the member database (SwissProt) is selected, one can retrieve the entries annotated by this GO ID (GO: 003680) in the database (namely P13483, P17096 and P32908, etc.).

Retrieval of related entries from a target database in BioDW, such as SwissProt, by browsing the DAG structure of GO

368

Acta Biochimica et Biophysica Sinica

Using GO as a starting point This kind of search is conducted as follows: first, select a target database in which semantic search will be carried out, for example, SwissProt; second, input a keyword to carry out full-text search in the GO; finally, a list of GO records provided where the input keyword can be found. Select and click on a record in the search result set and search out in the DB2GO table. The entries in the corresponding database annotated by the selected GO term will be retrieved, as illustrated in Fig. 3. In this instance, “ATP binding” is input as a GO keyword for search of related entries in SwissProt. In the first output window, “ATP binding” related GO IDs are listed in the table: GO:0005524 and GO:0003918. When GO:0005524 is selected, the entries in SwissProt annotated by this GO term are listed out as O03080, O04093 and O06220. Semantic similarity search First, select a database, for example, SwissProt; then input the entry name or accession No. in the selected database as a query. According to the DB2GO table, one can retrieve a list of GO terms annotating the query entry. Select a certain GO ID, and carry out the semantic similarity search, a list of GO terms can be retrieved which are same as or similar to this GO term and in the order of semantic similarity score. Click these records, and one will finally obtain entries annotated by these GO terms in the designated database. For example, when P07311 from SwissProt is input to retrieve semantic similar entries in SwissProt, a list of GO terms annotating these entries: GO:0003998, GO:0006796 and GO:0016787, will be obtained as shown in Fig. 4.

Fig. 3

Vol. 36, No. 5

When GO:0003998 is selected, semantic similarity search is initiated, and the most similar GO terms to GO:0003998 are shown in a new window as follows: GO:0003998 itself and GO:0000210 as its most similar GO term. Thus the entries in SwissProt annotated with the GO:0003998 are retrieved as the most similar entries in molecular function to P07311 semantically, such as O35031, P07032 and O029440; the entries annotated by GO:0000210 are retrieved as the second most similar entry to P07311, such as P53614.

Discussion and Conclusions In general, the similarity of gene products is measured by sequence alignment methods, such as BLAST [12] and its derivatives. Sequence alignment methods have become very important tools in bioinformatics as they can infer functions of an unknown sequence from the known one. However, all these methods may be complicated if the target database size increases. For example, sequences containing many repetitive elements or LCRs (low-complexity regions) are likely to produce false positive and seemingly unrelated matches . The patterns representing homology become less specific as subsets expand due to continued introduction of sequence variations. Accuracy of similarity is dependent on the quality of the sequence and not on other empirical observation or determined relationships, for example, molecular functions. In the case where the entries of data sources have been annotated by GO, the method of semantic similarity based on GO can get over the limitation of sequence alignment. Any entry that is annotated by GO has a definite meaning.

Retrieval of entries from SwissProt, by entering a GO keyword as the query

The server will reload the selected retrieved GO terms to the target database, SwissProt and retrieve entries annotated by the current GO terms from the database.

May, 2004

Shun-Liang CAO et al.: Semantic Search among Heterogeneous Biological Databases Based on Gene Ontology

Fig. 4

369

Retrieval of related semantic similarity entries from a target database in BioDW, such as SwissProt

After selecting the database, semantic similarity search employs the entry name or access number as the query term. The resulting GO terms in the first search cycle should be used to retrieve semantic similar GO terms further on. The process similar to the one described in Fig. 3 can retrieve a series of entries from the target database.

Therefore, with the help of the semantic similarity of GO, we can acquire the precise similarity of the data source entries annotated by GO. Thus, semantic similarity measure based upon GO is a good supplement to the method of sequence alignment. Semantic heterogeneity is one of the great challenges confronting the integration of biological data. It also remains a challenge for the majority of biological researcher in addressing the question how to carry out the semantic search in those heterogeneous data sources in a rapid and efficient way. In BioDW, we combine the technology of data warehouse and the GO in an organic way. Our method uses GO as an important tool for data clustering, and adopts the method of semantic similarity based on GO to give a solution for semantic heterogeneity among data sources. This method provides multifarious ways of semantic search based on the semantic similarity of GO, thus overcoming the disadvantages of sequence alignment methods such as BLAST. It presents a sound basis for further integration, analysis and mining of heterogeneous biological data, and also is a reasonable starting point for study of systems biology.

References 1 Davidson SB, Overton C, Buneman P. Challenges in integrating biological data sources. J Comput Biol, 1995, 2(4): 557–572 2 Schuler GD, Epstein JA, Ohkawa H, Kans JA. Entrez: Molecular biology database and retrieval system. Methods Enzymol, 1996, 266: 141–162 3 Zdobnov EM, Lopez R, Apweiler R, Etzold T. The EBI SRS server-new features. Bioinformatics, 2002, 18(8): 1149–1150 4 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP et al. Gene ontology: Tool for the unification of biology. Nat Genet, 2000, 25(1): 25–29 5 Cao SL, Qin L, Wang W, Zhu YY, Li YX. Applications of gene ontology in bio-data warehouse. In Proc Sixth Annual Bio-Ontologies Meeting, Brisbane, Australia, 2003: 33–36 6 Cao SL, Li R, Zhang ZP, Zhu YY, Li YX. BioDW: A platform of the integrated bioinformatics data warehouse. Comput Sci, 2003, 30(10.B): 104– 106 7 Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ et al. The SWISSPROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, 2003, 31(1): 365–370 8 Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res, 2003, 31(1): 23–27 9 Zhong JW, Zhu HP, Li JM, Yu Y. Conceptual graph matching for semantic search. In Conceptual Structures: Integration and Interfaces, Springer Verlag, London, 2002: 92–106 10 Budanitsky A, Hirst G. Semantic distance in WordNet: An experimental,

370

Acta Biochimica et Biophysica Sinica application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, Pittsburgh (2001)

11 Resnik P. Using information content to evaluate semantic similarity in a taxonomy. In Proc of IJCAI-95, Montreal, Canada, 1995: 448–453

Vol. 36, No. 5

12 Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res, 1997, 25(17): 3389-3402

Edited by Yi-Fei WANG