Towards the automatic generation of biomedical sources schema

Fleur Mougin a, Anita Burgun a, Olivier Loréal b, Pierre Le Beux a

a Laboratoire d’Informatique Médicale, Faculté de Médecine, Université de Rennes 1, France
b INSERM U522, CHU Pontchaillou, 35033 RENNES Cedex, France
Abstract

Biologists and physicians need to access biological and medical data for their experiments and research. This information is available on the Internet, scattered over many heterogeneous data sources. Collecting information is consequently tedious and time consuming, and must be improved. To cope with this difficulty, our overall objective is to build a mediator-based system to integrate heterogeneous biomedical data sources. This first requires the automatic generation of source schemas, which is the goal of this work. To that end, we describe an algorithm based on information extraction: it extracts meta-information from each source in order to infer its schema. Our system enables users to access relevant, specific and up-to-date data. To solve the semantic heterogeneity of the data sources, we are considering the creation of an ontology. Finally, the management of source evolution is discussed.

Keywords

Computational biology, artificial intelligence, semantics, systems integration, knowledge representation, Semantic Web

Introduction

Biologists and physicians need to access genomic, biological and medical data sources when they discover new genes, sequences or experimental measurements, in order to compare them with what already exists. This information is available on the Internet, and a critical point is that these sources are multiple and distributed over many servers throughout the world. In addition, these sources are heterogeneous. Consequently, their exploitation depends on the particularities of each source, which lie at several levels [1], such as:

• The content of the source: e.g., OMIM represents a comprehensive and constantly updated catalog of inherited diseases, while Kegg provides metabolic pathways. However, both data sources are relevant for biologists and physicians.

• The implementation type: in particular, relational database management systems (RDBMS), object-oriented management systems, or simply flat files for unstructured sources.

• The output format provided by the tool available to query the source: for example, OMIM returns text, Kegg graphics, and Genbank XML. Moreover, even if two sources export data in the same format, the results differ since there is no common and universal model. Even Genbank and DDBJ, which contain the same information and both export XML, do not produce identical results; e.g., the Genbank output gives the PubMed ID of the cross-reference(s), whereas DDBJ does not.

• Semantics: some difficulties have to be considered, such as semantic translations (e.g., it is useful and sometimes necessary to convert genetic distance units of measure, which can be expressed in kilobases or centimorgans depending on the source). Other semantic problems are related to the way sources describe their data. Again in the DDBJ and Genbank XML outputs, we have noticed, for instance, that “Homo sapiens” is represented in different ways: DDBJ describes it as an organism, while Genbank is more precise, dividing it into the genus “Homo” and the species “sapiens”. A solution for this kind of inconsistency must be found.

Information collection and processing are thus tiresome tasks for biologists and physicians. One challenge for bioinformatics researchers is to offer Internet users global, centralized and homogeneous access to these data sources. To manage these heterogeneities and the widespread nature of the sources, it is necessary to build an integration system. Among the different existing kinds of such systems [2], the mediator-based approach has been described as the one which best suits the requirements of molecular biology [3]. Its architecture is organized in three layers [4]. Briefly, the components are:

• users, who query the global system;

• mediators, which analyse the user query, choose the wrapper(s) able to answer it, and transmit the query to them. They also recover the wrapper responses, process them (when several wrappers are involved) and provide a homogeneous result to the user;

• wrappers, which receive the query from a mediator and translate it to query the integrated data sources. Once they obtain the answer, they translate it to make it understandable by the mediator and transmit the response.

In this way, several mediator-based system projects have been carried out to integrate biomedical data sources. Among them, TAMBIS [5] and BACIIS [6] offer much functionality, especially useful facilities to guide users in querying these systems. The architecture of such systems is well adapted to the needs of the biomedical domain, since it avoids a complicated storage procedure for huge source data, which are constantly evolving (every day for certain sources!). Given the sources' characteristics, such storage would make the system unworkable. However, as the above systems do not address source schema evolution, they require frequent manual updates. With a static wrapper layer, it is not possible to ensure that the user can still access up-to-date information. Indeed, as long as the source schema is not modified according to the source changes, the information provided by such systems is obsolete.

Consequently, our overall objective is to create a mediator-based system which manages biomedical source evolution in a dynamic way. It consists of three steps: the description of the sources, the constitution of the mediated schema and, finally, the interaction between users and the mediator. We address here the first step, the wrapper layer. Our specific aim is to generate the data source schema automatically, so that it can be updated each time a source evolves. We describe in detail our method and the results obtained for the automatic generation of data source schemas.

Materials and Methods

Biomedical sources

As we wanted to deal with sources from both the medical and the biological domains, we opted for the integration of many different sources, chosen with the help of a domain expert. For each of them, we recovered information, sometimes available on its Web site. Examples of such sources are given in Table 1. We indicate their name, the kind of information they contain, the cross-references to the other sources we integrate (here, we only give the cross-references between the sources of this sample) and the URL from which it is possible to query the source. This URL gives direct access to the contents of the source, saving us from having to visit its Web site to recover the information we need. XXX corresponds to the word(s) we want to find information on (e.g., a gene name or a pathology). Note that the choice of this word is not as simple as one might imagine, since it differs according to the queried source. For instance, certain sources require the gene symbol, whereas others require the gene name.

Table 1 - Sample of our integrated biomedical sources (name, content, cross-references and query URL of each source)

Method to describe data source schemas

Meta-information

We want to create the schema of each source automatically by means of meta-information, which informs us about the content of the source. Therefore, to describe biomedical data sources efficiently, we need to identify some meta-information about them in a dynamic way. Among the information which has to be known, as advocated in [7], there is, for example, the name of the source. This information corresponds to metadata, i.e. general data about data. But there is also meta-information such as keywords, which are useful for high-level searches, and cross-references to other data sources. This information is more informative concerning the content and structure of the source, which is what we need.
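As an illustration of the query mechanism described in Materials and Methods, the substitution of the queried word for the XXX placeholder in a source's query URL can be sketched as follows. This is a minimal sketch: the URL templates below are invented placeholders, not the actual entries of Table 1.

```python
from urllib.parse import quote

# Illustrative query-URL templates; the real templates come from Table 1.
# "XXX" marks where the queried word is substituted, as described in the text.
QUERY_TEMPLATES = {
    "OMIM": "https://example.org/omim/search?term=XXX",    # placeholder URL
    "Genecards": "https://example.org/genecards/XXX",      # placeholder URL
}

def build_query_url(source: str, word: str) -> str:
    """Substitute the (URL-encoded) query word for the XXX placeholder."""
    template = QUERY_TEMPLATES[source]
    return template.replace("XXX", quote(word))
```

Keeping the templates in a table mirrors Table 1, so that integrating a new source only requires adding one entry; which word to substitute (gene symbol versus gene name) still depends on the queried source.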
However, this meta-information is usually unavailable, or hard to exploit when provided, which is why we are working on another way to recover this information, essential for data source description.

Information extraction

With the query URL of each source (given in Table 1), we have direct and dynamic access to the data. Our aim is to extract meta-information from the output provided by the source query tool. The method differs according to the structuring level of each source's output format:
HTML format

We first constituted a limited corpus (n=100) of randomly chosen biomedical terms. It contains pathology names, gene names and their corresponding symbols. For each element of the corpus, we query the sources and obtain the resulting Web pages. We thus have a set of HTML examples per source, on which we apply an information extraction process to recover relevant information.

Our method relies on the exploitation of intra-Web-site redundancy. There are many similarities between the different pages (common blocks) of a Web site, such as the header and footer, which are usually identical on each page. We use this redundancy to automatically identify the common general elements in the set of HTML examples and then infer the overall format of the data source. Apart from irrelevant elements, such as page headers, we believe that the similar terms which can be found are pertinent for source description. The programs which query each source are CGI (Common Gateway Interface) programs: they construct Web pages with an identical HTML structure and consequently present each result in the same format. Moreover, we assume that certain structural HTML tags contain significant and general information useful to induce the data source schema, contrary to purely presentational ones, which only control appearance style. By combining these two assumptions, it is possible to recover meta-information from certain specific HTML tags and, according to the similarity of the extracted terms, finally infer the source schema.

Concretely, the algorithm (Figure 1) takes the set of HTML examples of a source as input and outputs the terms useful for the data source description. The mechanism is the following: it extracts information from specific HTML tags and associates each of them with a term or group of terms. Once every example has been scanned, the algorithm identifies the (tag, term(s)) couples which are common to most of the examples. These terms correspond to relevant information useful for the creation of the source schema.

Figure 1 - The algorithm operating cycle

With the resulting meta-information, we create a schema for each source. We have chosen the RDF Schema (RDFS) language (http://www.w3.org/TR/rdf-schema/) to describe it, since it enables automated processing of Web resources, which is a key feature in our system. Finally, it provides the possibility of categorizing terms, which is also necessary to describe our sources. According to the HTML tag overlap obtained with the algorithm, it is possible to define a hierarchy between the extracted terms and reproduce it in RDFS.

XML format

For the sources providing this kind of output, it is possible to use the associated DTD or XML Schema to extract relevant information. Indeed, as advocated in [8], we can use the DTD or the content of XML documents as ontology sources. Thus, we extract information about the structure and content of the source if the DTD is available. Otherwise, given a set of XML output examples, we apply the algorithm presented above without suppressing any tag (every XML tag can be relevant). With the overlap of the XML tags, or the way the DTD is organized, the creation of the RDF source schema is easy since the hierarchy is already well structured.

Results

We began our experimentation with the OMIM and Genecards sources to validate the algorithm. We analyse the output of the algorithm and check whether the extracted meta-information is relevant to describe the source. With the resulting observations and corrections, we improve the mechanism of the algorithm and, in the same way, its results. An example of the kind of terms the algorithm extracts is the “Gene function” of a gene in the OMIM source. Figure 2 shows an example of OMIM query tool output and, on the right, the corresponding extracted terms. We have not yet validated our approach, due to the lack of a general schema, common to every source, against which we could easily evaluate our results.

Figure 2 - Example of OMIM's extracted meta-information

From these results, we can see what kind of queries our system will be able to answer. For instance, the system easily finds the function of a given gene (found in OMIM) and the associated reference citations (links to MEDLINE). Therefore, biologists and physicians obtain, through the meta-information, the specific data they need for their research. The system avoids drowning them in a result so massive that it risks hiding the relevant information: users only obtain the part which precisely corresponds to the gene function, and not the complete resulting HTML page.

Discussion

Interest of automatic source schema generation

We want to manage the wrapper layer of a mediator-based system in a dynamic way, to ensure biologists and physicians that they access up-to-date and workable information. Our method generates the source schema automatically by means of meta-information. The use of meta-information to federate biological sources was also exploited in [9], but that information was pre-defined and static: unlike ours, this work does not address the constant evolution of biomedical sources. Concerning the mediator-based approach, another work is described in [10]. Their method is the inverse of ours, since they begin by creating a mediated schema and then exploit the cross-references between the integrated sources to define the source schemas. However, their queries are limited because they are linked to the pre-defined mediated entities and not to the sources themselves. Moreover, the problem of inheritance between entities is not addressed. Since our method is inductive, we use the integrated sources to describe the mediated schema and thus enable every kind of query to our system; and the problem of inheritance is solved by the choice of RDFS.

The information extraction technique we opted for is used in many studies and for diverse objectives. Existing tools, such as those developed by Stuckenschmidt et al. [11], also generate meta-information from the resources of a Web site. To do so, they use a source ontology. However, this is not suitable for our purpose since it requires manual work, which we want to avoid. Recent research has also used intra-Web-site redundancy to extract pertinent information from site pages. Our approach is similar to that described in [12] in that it is also automatic. But they consider that the common elements are not useful for them; this is the main difference with our work, which is precisely based on the intra-page similarity of a Web site. The reason is that we want to create the source schema, not to exploit the actual content of each page, contrary to the work mentioned above.

Addressing semantic heterogeneity

Among the four categories of heterogeneity mentioned in the introduction, three have been solved. The information presented in Table 1 enables us to manage the content heterogeneity. With the information extraction method, we solve the problem of output format heterogeneity. In the same way, we get round the implementation type differences between the sources, since we describe their schemas independently of their management systems. The heterogeneity which has not yet been addressed lies at the semantic level. Indeed, the resulting meta-information is effective for the description of data sources, but it is not sufficient to solve the problem of semantic heterogeneity [13]. For that, the use of an ontology and of semantic Web technology will be necessary. The notion of the semantic Web was introduced by Tim Berners-Lee [14]: it “is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation”. The necessity of adding semantics to the syntactical description of information has thus emerged. Our approach does not address this issue yet. To fill this gap, we plan to create an ontology in the OWL language (http://www.w3.org/TR/2003/WD-owl-ref-20030221/), which is semantic Web oriented, to organize the extracted source meta-information and homogenize this knowledge. This ontology will play the role of a mediator between the different wrappers, i.e. the integrated sources. We want to find matches between the meta-information of the different sources to ensure the semantic homogeneity of our system. For example, with regard to their respective contents, a semantic correspondence exists between the extracted term “Pathogenesis” in OMIM and “Disorders & Mutations”, which is a meta-information item of Genecards. To that end, we want to use existing ontologies or thesauri of the biomedical domain, as documented in other works such as [4,15].

Part of the semantic heterogeneity will be addressed by BioMeKe [16], an ontology-based system that we develop in our laboratory. It combines knowledge from the medical domain, through the integration of the Unified Medical Language System®, with knowledge from molecular biology, through the Gene Ontology™. Consequently, we do not have to manage the access to, and schema description of, these massive data sources: we can interact with BioMeKe to recover the relevant information related to them.

Evolution of the system

A significant issue for an ontology in the context of the semantic Web is its evolution [17]. Indeed, it would be inefficient to create a static ontology when it interacts with components which change automatically. In our system, it will be possible to extend and modify the ontology according to wrapper schema changes. The mediator will be notified of such changes and will trigger an ontology management process. For instance, if a new term is extracted from one of the integrated sources, we will enrich the ontology with this new concept if no existing concept is semantically equivalent. And since the wrapper schema generation is automatic, the evolution of the ontology will be at least semi-automatic, or even automatic.

Concerning the evolution of the integrated sources, the possible modification of one of them is handled by our dynamic generation process for the corresponding schema. A trigger will provoke the automatic update of the schema whenever necessary. If a new source needs to be integrated, it will not require much work: the person who wants to add it only has to give its name, its query URL and the kind of data it contains. The algorithm then uses the corpus to constitute a set of examples and extracts the new terms and/or concepts, and the source schema is finally inferred.
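The term-extraction step underlying this schema generation can be sketched as follows. This is a minimal reconstruction under stated assumptions: the paper does not publish the algorithm's code, the particular set of relevant tags and the 80% support threshold are our own choices, and only the innermost enclosing tag of each text run is inspected.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags assumed to carry structural meta-information; this particular set is an
# assumption, the paper only says that "specific HTML tags" are exploited.
RELEVANT_TAGS = {"h1", "h2", "h3", "b", "strong", "th", "dt"}

class TagTermExtractor(HTMLParser):
    """Collect the (tag, term) couples of one HTML example page."""
    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags
        self.couples = set()  # (tag, term) couples found in this page

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        term = data.strip()
        # Keep a term only when its innermost enclosing tag is a relevant one.
        if term and self.stack and self.stack[-1] in RELEVANT_TAGS:
            self.couples.add((self.stack[-1], term))

def extract_schema_terms(examples, min_support=0.8):
    """Keep the (tag, term) couples common to most of the example pages."""
    counts = Counter()
    for html in examples:
        parser = TagTermExtractor()
        parser.feed(html)
        counts.update(parser.couples)  # each example contributes a couple once
    threshold = min_support * len(examples)
    return {couple for couple, n in counts.items() if n >= threshold}
```

Couples such as ("b", "Gene function") that recur across the result pages of a source become candidate schema terms, while page-specific content falls below the support threshold and is discarded.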
Open issues

Concerning the evolution process, the possible deletion of a source is not handled by our system. If it occurs, the corresponding wrapper will be suppressed, but it is more complicated to manage the changing ontology. Indeed, if a concept of the ontology is only linked to the deleted source, this concept will either have to be deleted, or kept if it remains useful for the query process. And in any case, will the ontology still be consistent?

A limitation of our work is the management of the more particular output formats provided by some data source query tools. We have not yet found how to exploit them. For instance, Kegg contains relevant information on metabolic pathways, but its graphics format is not exploitable with the method presented above.

During the elaboration of the algorithm, we noticed that the terms extracted from a data source can differ according to the type of the query word. In particular, the output meta-information of OMIM that we identified is not the same depending on whether the word is a pathology or a gene (e.g., “Gene function” appears only for a gene, and “Clinical features” is extracted only if the word is a pathology). This observation opens interesting perspectives. For instance, if a non-expert user queries the global system with a word without knowing whether it denotes a pathology or a gene, the system can provide this information by exploiting the content of the OMIM output for this word. Concretely, if “Gene function” appears in the resulting page, our system can inform the user that the queried word is a gene. This is a new way to infer biomedical knowledge.

Conclusion

We have presented an approach to generate source schemas, i.e. the wrapper descriptions, automatically, by extracting meta-information from a set of HTML examples obtained from each source. It ensures that users access up-to-date information, whatever the kind of changes affecting the integrated sources. To address the semantic heterogeneity of the diverse schemas, we plan to create an ontology adapted to the semantic Web requirements. We want to use this method to integrate biomedical sources in a mediator-based system.

Acknowledgments

This work was funded by Région Bretagne (PRIR) and University of Rennes I (BQR 2003).

References

[1] Bry F, Kröger P. A Molecular Biology Database Digest. PMS-FB-2001-3, Institute for Computer Science, University of Munich, 2001
[2] Busse S, Kutsche RD, Leser U, Weber H. Federated Information Systems: concepts, terminology and architectures. Technical Report Nr. 99-9, TU Berlin, 1999
[3] Wiederhold G. Mediators in the Architecture of Future Information Systems. IEEE Computer, March 1992. 25(3): p. 38-49
[4] Karp PD. A strategy for database interoperation. J Comput Biol, 1995. 2(4): p. 573-586
[5] Stevens R, Baker PG, Bechhofer S, Paton NW, Goble CA, Brass A. TAMBIS: transparent access to multiple bioinformatics information sources. Bioinformatics, 2000. 16(2): p. 184-195
[6] Ben Miled Z, Wang Y, Li N, Bukhres O, Martin J, Nayar A, Oppelt R. BAO, a biological and chemical ontology for information integration. Bioinformatics, 2002. 1: p. 60-73
[7] Markowitz VM, Chen IMA, Kosky AS, Szeto E. Facilities for exploring molecular biology databases on the Web: a comparative study. Pac Symp Biocomput, 1997. p. 256-267
[8] Nenadic G, Mima H, Spasic I, Ananiadou S, Tsujii J. Terminology-driven literature mining and knowledge acquisition in biomedicine. Int J Med Inf, Dec 2002. 67(1-3): p. 33-48
[9] Cheung KH, Nadkarni PM, Shin DG. A metadata approach to query interoperation between molecular biology databases. Bioinformatics, 1998. 14(6): p. 486-497
[10] Mork P, Halevy A, Tarczy-Hornoch P. A model for data integration systems of biomedical data applied to online genetic databases. Proceedings of the AMIA Annual Symposium, Washington, DC, USA, Nov 3-7 2001. p. 473-477
[11] Stuckenschmidt H, Van Harmelen F. Ontology-based metadata generation from semi-structured information. In Proceedings of the First International Conference on Knowledge Capture (K-CAP'01), Sheridan Printing, 2001. p. 440-444
[12] Lin SH, Ho JM. Discovering informative content blocks from Web documents. SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. p. 588-593
[13] Kashyap V, Sheth A. Semantic heterogeneity in global information systems: the role of metadata, context and ontologies. In Papazoglou MP, Schlageter G, editors, Cooperative Information Systems, Academic Press, San Diego, 1998. p. 139-178
[14] Berners-Lee T, Hendler J, Lassila O. The Semantic Web. Scientific American, May 2001
[15] Davidson SB, Overton C, Buneman P. Challenges in integrating biological data sources. J Comput Biol, 1995. 2(4): p. 557-572
[16] Marquet G, Golbreich C, Burgun A. From an ontology-based search engine towards a mediator for medical and biological information integration. In ISWC Workshop on Semantic Integration, Florida, USA, October 20 2003
[17] Klein M, Fensel D. Ontology versioning for the Semantic Web. In Proceedings of the International Semantic Web Working Symposium, Stanford University, California, USA, July 30 - Aug 1 2001. p. 75-91

Address for correspondence

Fleur Mougin
Laboratoire d’Informatique Médicale
CHU Pontchaillou
2, rue Henri Le Guilloux
35033 RENNES
FRANCE
[email protected]