A MEDIATOR SYSTEM FOR SEMANTIC WEB REASONING


Liviu Badea, Doina Tilivea, Anca Hotaran

Abstract. In this paper, we describe an initial prototype of a full-fledged mediator system for integrated reasoning about data and knowledge using Semantic Web technology. In order to prove the feasibility of our approach, we present a real-life application in the domain of bioinformatics which combines not only very large, but also semantically and structurally complex data and knowledge sources distributed on the Web. This work is part of a larger research project, ROTEL, aimed at combining Semantic Web and Natural Language technology.

Keywords: Semantic Web technologies, natural language technology, bioinformatics

1. INTRODUCTION AND MOTIVATION

The huge number of information sources available on the Web has created unprecedented opportunities for their joint processing and exploitation¹. Although virtually any conceivable user query could potentially be answered using existing sources, the automated processing necessary to achieve this is limited by existing Web technology, which was initially designed for browsing by human users rather than for machine processing. Recently, we have been witnessing important changes on the Web, such as:
- an increasing amount of structured content hidden in the so-called "deep web" (as opposed to knowledge in textual form, which used to prevail in the recent past),
- annotation schemas like Google Co-op and Flickr² that enable people to add labels to the content of web pages,
- Web 2.0 sites which allow contributors to share information easily (e.g. Wikipedia), and sites like Google Base³ that allow users to load any structured data into a central repository,
- many applications producing XML exchange data, huge numbers of publicly accessible Web services, and description standards for using them (e.g. WSDL),
- advanced Natural Language Processing methods for Information Extraction from documents, as well as associated lexical resources (like WordNet) and upper-level ontologies (SUMO, MILO, DOLCE, etc.).

The vast heterogeneous collections of structured data hidden in the "deep web" reside behind pages that are dynamically created in response to HTML forms, using structured data from hidden databases. The deep web is possibly larger than the current WWW and contains very high-quality content. An estimate [1] of the number of deep web sources, based on a random sample of 25 million web pages from the Google index, yields 647,000 sources; scaled to a web of 1 billion pages, this amounts to around 25 million deep web sources.

1 Public estimates of the index sizes of the main search engines are at over a billion pages [1].
2 Flickr (http://www.flickr.com), Google Co-op (http://www.google.com/coop).
3 Google Base (http://base.google.com).


While the above-mentioned changes of Web content have created an opportunity for integrated and structured data management on the Web, dealing with the heterogeneity of web sources presents many new challenges, because traditional data integration techniques are not appropriate for this task. The main problems are the mismatch between the schemas and content of the various web sources and the huge number of possibly very large sources. Existing ontologies, which should in principle facilitate the structural and semantic interoperability of the sources, cannot fully solve these problems, due to the different granularities of the data in the various sources as well as the different knowledge models employed.

Many types of web applications can take advantage of combining information from multiple sources: e-activities, citizen public information services, science informatics, enterprise data integration⁴, etc. For example, the integration of biological data is just one phase of the molecular biology and genomic research process, but without automating this process, it may consume most of a biologist's efforts. The sheer amount of relevant biological data⁵ has made manual integration practically infeasible due to several factors:
- The high diversity of the available data, which covers several biological and genomic research fields, including sequence data, gene expression measurements, disease characteristics, molecular structures, information about protein interactions, pathways, etc.
- The representational heterogeneity of the data. Several aspects of the same entities can be contained in several sources but represented in a variety of ways. Each source may have its own schema, resulting in structural differences, or each source may refer to the same semantic concept with a different term or identifier, which can lead to naming and semantic discrepancies.
- The autonomous and web-based character of the sources. Their schemas or data can change unexpectedly, and since nearly all sources are web-based, they depend on network access bandwidth and are subject to network instability.
- The diversity of interfaces and querying capabilities offered by the different sources. The sources may allow only limited access to their content and impose restrictions on the form of queries. Many sources provide data in plain text format, so that the query language is limited to simple keyword searches.

For these reasons, the combined use of bioinformatics sources is difficult (if not simply impossible) for biologists, and more sophisticated query tools are badly needed. The same is true for other application domains. For example, querying the vast amounts of legislative texts involves combining plain text queries (mostly keyword searches) with more structured queries referring to the tangle of cross-references between the relevant legislative documents. A significant amount of reasoning, as well as background knowledge in the form of ontologies, is essential in this process.

In this paper, we describe an initial prototype of a full-fledged mediator system for integrated reasoning about data and knowledge using Semantic Web technology. In order to illustrate the specificity of the problems related to the integration of data on the web, we then briefly present a real-life application in the domain of bioinformatics.

4 An average company has about 50 databases and spends around 35% of its IT budget on integration efforts.
5 The number of biomolecular and genomic sources available on the web has grown exponentially over the past decade [2], now exceeding 500 (with 220 data sources for pathways alone).


2. THE MEDIATOR SYSTEM

Our approach aims at dealing with the main Semantic Web problems in a mediator-type architecture [5], using automated reasoning for the intelligent integration of information sources and for answering user queries. The integrated sources can be of any type: (semi-)structured or even plain text. Information extraction from documents containing mainly plain text is a challenging and laborious task. Natural Language Processing techniques are essential in this context and are pursued in our project by the Institute for Research in Artificial Intelligence of the Romanian Academy and the "A.I. Cuza" University of Iasi (due to space limitations, this very complex research will be described elsewhere). In this paper we focus on deep web sources, which produce dynamic HTML pages with a partially stable structure and useful content. Specialized source "wrappers" are developed to extract the information content of these pages into a structured format.

In addition, the mediator solves the problem of source schema alignment using mapping rules between the local source schemas and the mediator's global schema (described by specific domain ontologies), while data content alignment (of syntactically different data items that refer to the same entity) is addressed using specific dictionaries and lexicons.

Our model of integration is a combination of remote-source mediation and data warehousing (to avoid repetitive remote accesses over the Web, only the relevant data is cached locally). The mediator system identifies the sources that are relevant to the query and produces alternative sequences of source accesses (plans) before actually querying these sources. The associated query planner handles logical constraints of the domain as well as source access limitations expressed by so-called source capabilities [5]. As opposed to traditional database query languages, Web sources provide only limited query capabilities. For example, a specific Web interface to a database or a Web service may allow only certain types of selections and may also require certain parameters to be known at query time. The query planner uses this information to determine the correct dataflow in the query plan (see the capability sketch at the end of this section).

Because reasoning in general is based on combining knowledge, Semantic Web reasoning has to deal with combining knowledge distributed on the Web. The distributed nature of the relevant knowledge in turn places significant limitations on the reasoners, due to the limited data transfer speeds of the current Web in combination with the large size of existing datasets. We address this problem with local warehouses for critical sources (response times can be improved from minutes to seconds in this case), in combination with the exploitation of the specific capabilities of the sources (filters/selections, availability of batch processing facilities, etc.).

For describing the domain ontology and rules, the system uses F-logic [6], a frame-oriented logic language widely employed in the Semantic Web community, and in particular the Flora2⁶ implementation [4] of F-logic. Although Flora2 is a very efficient deductive engine, in some applications involving large data sources, and especially combinations of many sources, finding the solutions of a complex query can become practically intractable due to the memory limits of an "in-memory" inference engine. In order to solve this problem, all intermediate query results (provided by the wrappers in XML format) are cached in a local database supporting the automated mapping of semi-structured information to a relational representation. Most predicate combinations needed to answer a query are then performed via database joins, leading to significant improvements in mediator response time.

6 The tabling mechanism of Flora2, a highly optimized compilation technique developed for Prolog, is similar to the Magic Sets method for bottom-up evaluation in database query engines.
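For illustration, a capability entry for the TRED source might be recorded along the following lines; the notation and attribute names here are hypothetical and merely sketch the kind of information the planner exploits:

/* hypothetical capability record for TRED; attribute names are illustrative */
'TRED':source[
   provides -> interaction,   /* relation exported by the source */
   bound    -> {tf},          /* parameters that must be known at query time */
   free     -> {gene},        /* attributes returned by the source */
   filters  -> {gene},        /* selections the source can apply itself */
   batch    -> true           /* the source accepts batched requests */
].

Given such a record, the planner will only schedule a TRED access at a point in the plan where a binding for the transcription factor tf is already available.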


3. A TYPICAL APPLICATION IN THE DOMAIN OF BIOINFORMATICS

Determining the molecular-level details of complex diseases is a challenging issue. Traditional genetic methods are inapplicable since, typically, there is no single gene responsible for the disease. Rather, a complex interplay of pathways is usually involved, so that many different genetic defects may affect the same pathway. The study of complex diseases has been revolutionized by the advent of whole-genome measurements of gene expression using microarrays. However, the initial enthusiasm related to such microarray data has been tempered by the difficulty of interpreting the data without additional knowledge, which somehow has to be used in the data analysis process. We thus need to integrate, at a deep semantic level, the existing domain knowledge with the partial results from data analysis⁷.

The application [7] uses various data and knowledge sources usually available on the web via form-based HTML interfaces or services. The structured content of these sources is extracted by specialized XQuery wrappers. We initially integrated the following sources⁸:
- NCBI Gene. The e-utilities interface to the NCBI Gene database returns information about genes, such as gene names, descriptions, domains, literature references, Gene Ontology (GO) annotations, and the pathways and interactions in which these genes are known to be involved.
- TRED. The Transcriptional Regulatory Element Database contains knowledge about transcription factor binding sites in gene promoters. Such information is essential for determining potentially co-expressed genes and for linking them to signaling pathways.
- Biocarta. A pathway repository containing mostly graphical representations of pathways contributed by an open community of researchers.

The above sources contain complementary information about genes, their interactions and pathways, none of which can be exploited to its full potential in isolation. For example, the GO annotations of genes can be used to extract the functional roles of the genes, but do not allow us to determine their interactions and pathway membership. These can only be extracted from interaction or pathway data sources, such as TRED or Biocarta.

7 In this application we use a selection of 359 genes resulting from the data analysis of a public pancreatic cancer dataset produced in the Pollack lab at Stanford [3].
8 NCBI e-utilities (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html), NCBI Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene), TRED (http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=home), Biocarta (www.biocarta.com), GO (http://www.geneontology.org).
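For instance, once wrapped, an NCBI Gene interaction record reaches the mediator as a frame in the local source schema. Such a record might look as follows; the object identifier and all attribute values below are purely illustrative:

/* illustrative wrapper output in the local schema of the NCBI Gene source;
   the identifier and all values are made up for the example */
i4711:interaction[gene        -> 'TP53',
                  other_gene  -> 'MDM2',
                  description -> 'interacts with promoter region',
                  pubs        -> 'PMID:12345678']@'NCBI_gene'.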



Figure 1. The architecture of the application

Since the sources are heterogeneous, we use so-called "mapping rules" to describe their content in terms of a common representation or ontology. For example, we can retrieve direct gene interactions either from the gene-centred NCBI Gene database or from TRED:

di(I):direct_interaction[gene1->G1, gene2->G2, int_type->IntType,
                         source->'ncbi_gene', description->Desc]
   <- I:interaction[gene->G1, other_gene->G2, description->Desc, pubs->PM]@'NCBI_gene',
      if str_sub('promoter', Desc, _) then IntType = 'protein-to-DNA'
                                      else IntType = 'protein-to-protein'.

di(I):direct_interaction[gene1->G1, gene2->G2, int_type->'protein-to-DNA', source->'tred']
   <- I:interaction[tf->G1, gene->G2]@'TRED'.
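With these mapping rules in place, the user can ask for the direct interaction partners of a gene without knowing which source provides the answer; a query in the same notation might be (the gene name is only an example):

/* all recorded direct interactions of TP53, whatever their source */
I:direct_interaction[gene1->'TP53', gene2->G2, int_type->IntType, source->Src].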

While certain knowledge is more or less explicit in the sources (e.g. the interaction type is 'protein-to-DNA' if the description of the NCBI interaction contains the substring 'promoter'), in other cases we may have to describe implicit knowledge about the sources (e.g. the TRED database contains only interactions of type 'protein-to-DNA', but this is nowhere explicitly recorded in the data).

Although the wrappers and the mapping rules are in principle sufficient for formulating and answering any query to the sources, it is normally convenient to construct a more complex model that is as close as possible to the conceptual model of the users. This is achieved using so-called "model rules", which refer to the common representation extracted by the mapping rules to define the conceptual view (model) of the problem. For example, we may want to query the system about "functional" interactions between two genes, which can be due either to a direct interaction or to membership in the same pathway:

fi(I):functional_interaction[gene1->G1, gene2->G2, int_type->IntType]
   <- I:direct_interaction[gene1->G1, gene2->G2, int_type->IntType]
    ; I1:pathway[name->P, gene->G1, role->R1],
      I2:pathway[name->P, gene->G2, role->R2],
      I = p(I1, I2),
      interaction_type(R1, R2, IntType) /* can be transcriptional, coexpression, etc. */.
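A typical query over this conceptual model retrieves the functional interactions among the genes selected by the data analysis step. Assuming a hypothetical predicate selected_gene marking the 359 genes of interest, such a query could be sketched as:

/* functional interactions among the selected genes (selected_gene is assumed) */
I:functional_interaction[gene1->G1, gene2->G2, int_type->IntType],
G1:selected_gene, G2:selected_gene.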

Other rules could be formulated to derive new and interesting knowledge from the primary source data, such as defining other gene classes of interest (receptors, ligands, transcription regulators) and using them for finding signaling chains of the form ligand → receptor → signal transducer → ... → transcription factor (a sketch of such a rule is given below). Another significant problem is the determination of transcription factors and their targets, since the targets' co-expression can reveal the groups of genes that are differentially co-regulated in the disease.
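A recursive sketch of such a signaling-chain rule follows; the class names (ligand, transcription_factor) and the chain relation are hypothetical and assumed to be defined by other model rules:

/* base case: two genes linked by a single functional interaction */
chain(G1, G2) <- I:functional_interaction[gene1->G1, gene2->G2].
/* recursive case: extend the chain through an intermediate gene */
chain(G1, G3) <- I:functional_interaction[gene1->G1, gene2->G2], chain(G2, G3).
/* a signaling chain leads from a ligand to a transcription factor */
signaling_chain(L, T) <- L:ligand, T:transcription_factor, chain(L, T).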

4. CONCLUSIONS AND FUTURE WORK

The mediator system has proved to be a useful tool for creating a global "picture" of the interactions among the genes differentially expressed in the pancreatic cancer dataset, combining not only very large, but also semantically and structurally complex data and web sources. Our initial experiments confirmed the practical feasibility of our approach: the system was able to deal with the complete data sources mentioned above⁹ for the selection of 359 "interesting" genes¹⁰. The response time for complex queries involving large datasets and combinatorial reasoning was acceptable. Moreover, as far as we know, other existing approaches are either slower¹¹ or cannot deal with such datasets at all. The improvements in response time are due to the combination of mediation and warehousing for critical sources, and to the use of a query planner which exploits source capability descriptions to construct feasible plans. The use of a local database to alleviate the "in-memory" reasoning limitation of the inference engine significantly enhances the performance of the system.

There are still other technical issues whose improvement would lead to a significantly better Semantic Web reasoning system, mainly related to query planning, such as the exploitation of more information about source capabilities or the implementation of strategies for answer ranking. Furthermore, queries combining many (possibly alternative) sources can produce combinatorial explosions in the number of plans, requiring effective techniques for generating diverse plans and ranking them.

ACKNOWLEDGEMENTS

This work has been supported by the ROTEL project (CEEX contract 29/2005). We are grateful to the REWERSE Network of Excellence of the EC.

References
1. J. Madhavan et al. Web-scale Data Integration: You Can Only Afford to Pay As You Go. Conference on Innovative Data Systems Research (CIDR), 2007.
2. T. Hernandez, S. Kambhampati. Integration of Bioinformatic Sources: Current Approaches and Systems. ACM SIGMOD Record, Vol. 33, No. 3, September 2004.
3. M. D. Bashyam et al. Array-based comparative genomic hybridization identifies localized DNA amplifications and homozygous deletions in pancreatic cancer. Neoplasia, 7(6):556-562, June 2005.
4. G. Yang, M. Kifer, C. Zhao. FLORA-2: A Rule-Based Knowledge Representation and Inference Infrastructure for the Semantic Web. ODBASE, November 2003.
5. L. Badea, D. Tilivea, A. Hotaran. Semantic Web Reasoning for Ontology-Based Integration of Resources. Proc. PPSWR 2004, pp. 61-75, Springer Verlag.
6. M. Kifer, G. Lausen, J. Wu. Logical Foundations of Object-Oriented and Frame-Based Languages. Journal of the ACM, 42:741-843, 1995.
7. L. Badea. Semantic Web Reasoning for Analyzing Gene Expression Profiles. Proc. PPSWR 2006, LNCS 4187, pp. 78-89, Springer Verlag.

9 NCBI Gene interactions (2239), TRED interactions (10717), Biocarta gene-to-pathway relations (5493), NCBI gene-to-pathway relations (622), other pathway relations (5095), GO annotations (2394), domains (614).
10 The number of potential interactions (64261) would have made the task impossible for human exploration.
11 In the case of systems based on plain Prolog (with no tabling or similar optimizations).
