Semantic-enhanced Information Search and Retrieval

Wang Wei, Payam M. Barnaghi, Andrzej Bargiela
School of Computer Science and IT, The University of Nottingham Malaysia Campus
{eyx6ww, payam.barnaghi}@nottingham.edu.my, [email protected]

Abstract

Information Retrieval (IR) techniques have been studied extensively since the late 1940s and have achieved great success, as evidenced by popular online search engines. However, classical information retrieval models have also been criticised for emphasising computation over word occurrences while ignoring semantics (e.g. the meaning of words and the search context). Research on the Semantic Web in recent years has provided an opportunity to move from mere word-computing to semantic-enhanced information search and retrieval. In this paper, we describe a methodology that combines Semantic Web technologies, information extraction and social network analysis techniques to elicit semantics from available data in order to develop a semantic-enhanced information search and retrieval system.

1. Introduction

Information retrieval techniques have been developed continuously over the past decades. Advances in academic research have been partially transferred into practical successes, following the revolution in computing technologies and the tremendous growth of the Web. Online search engines and various vertical search systems are well-known examples. Classical information retrieval models such as the Boolean Retrieval Model, the Vector Space Model, and the Probabilistic Model have been researched extensively and underlie some current information retrieval systems. Despite their popularity and success, these classical models, which are mostly based on word statistics, also have drawbacks; for instance, the search context and the semantics of words are usually not taken into account, which lowers user satisfaction with the search results.

In recent years, the Semantic Web has been advocated as an extension of the current Web. It is intended to provide better computer-human cooperation and machine-to-machine interaction [1]. The Semantic Web aims to achieve better data automation, reuse, and interoperability. The ontology is one of the most important concepts in the Semantic Web framework. The Resource Description Framework/Schema (RDF(S)) 1 and the Web Ontology Language (OWL) 2 are W3C-recommended data representation models used to represent ontologies. Encoding expert knowledge in machine-processable formats, building ontologies, and providing web services make enhanced and automated machine-to-machine interaction feasible in different applications.

The classical information retrieval models, which view information (more specifically, documents) as a “bag of words”, concentrate on occurrences of terms and seldom exploit the semantics of the queries and of the documents or information to be retrieved. Information retrieval models that utilise the semantics of user queries and document collections should better serve the information needs of most current applications. Several pioneering works employ Semantic Web technologies for information search and retrieval, such as TAP [4], [5], KIM [6], OWLIR [7], [8], Squiggle [9] and Swoogle [10]. In the current research we aim to distil semantics from the provided metadata and knowledge-base (i.e. ontologies) to establish a semantic-enhanced information retrieval framework. In particular, we discuss a methodology that incorporates social network analysis and semantic inference into such a framework.

The paper is organised as follows. Section 2 reviews some of the classical information retrieval models and key concepts in the Semantic Web, and describes some of the information retrieval systems developed using Semantic Web technologies. In Section 3, we present our approaches to extracting semantics from network analysis and the underlying ontology, and propose an enhanced information search and retrieval framework. Section 4 concludes the paper and discusses future work.

1 http://www.w3.org/TR/rdf-schema/
2 http://www.w3.org/TR/owl-guide/

2. Related Work

Contemporary search engines deployed in various systems and on the current Web are an unprecedented success that has changed the way people obtain information. In the light of recent developments on the Semantic Web, a great number of research projects have been carried out to improve various applications using Semantic Web technologies, in order to achieve enhanced information automation, reusability and interoperability. In this section we describe information search and retrieval on the Semantic Web and briefly discuss classical information retrieval models.

2.1. Information Search and Retrieval on the Semantic Web

Here, we review some recent work on improving information retrieval by utilising Semantic Web technologies. TAP [4], [5] creates an infrastructure for applications on the Semantic Web by providing a set of simple mechanisms for sites to publish data (with semantics) and for applications to consume this data. It improves information search and retrieval results in two ways: on the one hand, it provides a simple mechanism to help the semantic search module understand the denotation of the query; on the other hand, it augments the search results by considering the search context and exploring closely related objects based on this context. KIM [6] introduces a holistic architecture for semantic annotation, indexing and retrieval of documents. It aims to achieve fully automatic annotation and to improve search and retrieval by integrating information extraction (IE) (using GATE [11]), information retrieval and Semantic Web technologies. Finin et al. [7], [8] view document representation on the Semantic Web as a combination of text, which is suitable for indexing by current Web search engines, and semantic mark-up, which can be used to perform inference over a knowledge-base. They propose an integrated approach that combines inference capability with traditional information retrieval techniques, and implemented a prototype system, called OWLIR [7], for retrieving university event announcements. Squiggle [9] is another framework for building domain-specific semantic search applications. It provides capabilities for annotating, indexing, and retrieving multimedia items based upon the SKOS 3 ontology. Swoogle [10] is a semantic search engine for retrieving Semantic Web documents. Its primary use is searching the Web and locating relevant ontologies in order to help users access, explore and query Semantic Web documents.

2.2. Semantic Search and Classical Methods

Classical information retrieval models have been extended to models such as Latent Semantic Indexing (LSI) [2] and machine learning based models [3] (i.e. artificial neural networks, symbolic learning and genetic algorithms). However, it has been shown that these models based on formal mathematical theories do not necessarily surpass the classical models [2]. In classical information retrieval models, matching between queries and documents is formally defined, but it is semantically imprecise. Most of these models make the simplifying assumption that words in documents are independent. Clearly, this assumption is not accurate. In a semantic-enhanced information search and retrieval method, words carry meaning, or semantics, when they are mentioned in documents in a specific context. Intuitively, it is natural to adopt a networked view of words. Furthermore, we perceive that there may be hidden semantics which can be extracted from the provided metadata. For instance, in the retrieval of scientific publications, not only the keywords appearing in the documents are important; the expertise and authority of the authors in their respective research fields and the reputation of the journal or conference are also significant parameters in deciding the relevancy of documents. We regard this kind of information as plausible semantics which can be utilised for an enhanced information search and retrieval process.

3 http://www.w3.org/TR/swbp-skos-core-guide

3. Semantic-enhanced Information Search and Retrieval Framework

As discussed earlier, we see the possibility of obtaining semantics from the available metadata and the ontology in order to retrieve more relevant information in a query and response process. The document representation within the framework consists of three components: the original text, metadata, and a semantic annotation, which can be produced automatically using IE techniques and enriched through an inference process. Figure 1 provides a logical view of an annotated document representation. In this section, we first discuss our approach to extracting such semantics from social network analysis, here called scientific collaboration network analysis, and from the ontology representations. We then discuss our proposed framework for semantic-enhanced information search and retrieval.

3.1. Networked Analysis for Information Search and Retrieval

Objects are related to each other in their respective environments or networks. The characteristics of an object depend not only on its intrinsic properties but also on its relative position and the role it plays. The study of such networks is often referred to as social network analysis. Early social network analysis concentrated on statistical and mathematical perspectives such as the “small world” phenomenon and the “six degrees of separation” [13], and on various scientific collaboration networks [14], [15], [16], [17].

Figure 1. A logical view of the document representation

The past few years have witnessed great interest in social network analysis from the Semantic Web research community. This is because Semantic Web technologies provide a formal framework for integrating distributed data into a centralised model to create socially aware, intelligent applications [20]. Much interesting research has emphasised the usefulness of social network analysis in various applications, such as Flink [18], trust-based email filtering and viral marketing [20]. We suggest that social network analysis can also be used to develop a complementary ranking scheme for the information search and retrieval process.

When retrieving documents from a large collection of scientific publications, such as the biomedical literature in PubMed 4, keyword queries return an overwhelming number of articles. For example, “Breast Cancer” retrieves around 220,000 articles, whereas a user who is familiar with PubMed indexing can retrieve far fewer documents using terms from the Medical Subject Headings (MeSH) 5. This suggests that keywords alone are not good discriminators for retrieving highly relevant information from large repositories. In this work, we assume that users always prefer documents that contain trustworthy content and are written by authors with high expertise. Consider two articles indexed with similar keywords, one written by an undergraduate student and the other by a well-known researcher. The article written by the student may be ranked higher and retrieved before the other one; however, this is not the user’s intention. We suggest that the expertise or trust of documents or authors can be calculated from the results of network analysis and used as an important parameter in an alternative document ranking scheme. The expertise of authors could be calculated from their research contribution to the area, for example the number of quality publications and co-authors, their relative positions in the network (e.g. the closeness and betweenness centrality measures), and the value of their propagated trust [19]. By accommodating these considerations, the search results will be more satisfactory and authoritative, even if few keywords from the query appear in the document.
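The ranking idea above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a co-authorship graph built from publication metadata, uses closeness centrality alone as the expertise signal (betweenness, publication quality and propagated trust are omitted), and blends it with a keyword score using a hypothetical `weight` parameter. The paper identifiers, author names and keyword scores are invented for illustration.

```python
from collections import deque


def build_coauthor_graph(papers):
    """Build an undirected co-authorship graph: author -> set of co-authors."""
    graph = {}
    for authors in papers.values():
        for author in authors:
            graph.setdefault(author, set())
        for author in authors:
            graph[author].update(a for a in authors if a != author)
    return graph


def closeness(graph, source):
    """Closeness centrality: reachable authors / sum of shortest-path lengths."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search over the unweighted graph
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def rerank(hits, papers, graph, weight=0.5):
    """Blend each keyword score with the best closeness among the paper's authors."""
    blended = []
    for doc_id, keyword_score in hits:
        expertise = max(closeness(graph, a) for a in papers[doc_id])
        blended.append((doc_id, (1 - weight) * keyword_score + weight * expertise))
    return sorted(blended, key=lambda item: item[1], reverse=True)


# Hypothetical collection: paper id -> author list.
papers = {
    "p1": ["A", "B"],
    "p2": ["A", "C"],
    "p3": ["A", "D"],
    "p4": ["D", "E"],
    "p5": ["E"],
}
graph = build_coauthor_graph(papers)
hits = [("p5", 0.9), ("p2", 0.8)]  # hypothetical keyword-only scores
ranking = rerank(hits, papers, graph)
```

In this toy collection, author "A" sits at the centre of the collaboration network, so paper "p2" overtakes "p5" after blending even though its keyword score is lower.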

4 http://www.ncbi.nlm.nih.gov/entrez/
5 http://www.nlm.nih.gov/mesh/

3.2. Ontology-based Information Search and Retrieval

While social network analysis focuses on the actors in a network constructed by integrating author-related information from different sources, the ontology focuses on more general concepts and their relationships. By constructing and populating an ontology for a particular domain, one can develop a knowledge-base which encodes expert knowledge for that domain. Using information extraction techniques, one can identify the terms in a document collection that also occur in the knowledge-base. Relations among the terms identified in the documents can then be recognised by referring to the ontology. This is a fundamental difference from the classical information retrieval models, which often assume independence of terms. The classical models and techniques are well established; nevertheless, they are incapable of supporting logical inference. In contrast, one of the most salient aspects of Semantic Web technologies (i.e. OWL and ontologies) is their inference capability. The work described in [4], [5], [6], [7], [8], [9] and many others has shown that, by accommodating an inference process, information search and retrieval systems can augment the search results, disambiguate query concepts, support complex question answering, improve retrieval precision, and provide meaningful search results to the user. However, this requires the search and inference processes to be tightly integrated [6], [7], [9], so that the retrieval process can exploit both today’s broad-coverage text-based search techniques and the semantic inference capabilities. This raises questions about when and how much to reason over the contents [7].

In [7] and [8], inference is performed at three stages: at indexing time, when the semantic annotation is enriched from the knowledge-base by expanding the concepts mentioned in the text with related concepts in the knowledge-base; at query processing time, when the system tries to identify the concepts denoted by the original query terms and then expands the identified concepts with related concepts; and at result evaluation time, when the retrieved results are filtered with regard to the trust of sources [19]. However, without careful consideration the expanded search results might be irrelevant. We try to address this problem by pre-processing the query terms and semantic annotations. The inference approach discussed above does not take the context of the search and the document into account. We refer to this as blind inference, which often treats the various relationships in the ontology equally.
We adopt an approach which applies statistical processing to the relationships among the indexing and query terms during the inference process in order to obtain more precise results. For query processing, we first identify the corresponding concepts in the knowledge-base and determine the meaningful relationships between the query terms. For example, with blind inference the single-term query “Semantic Web” would retrieve general publications on the Semantic Web with very broad coverage, because no additional information is available. For a query including the terms “Semantic Web, OWL”, the relationship between the two terms can be identified from the knowledge-base as a triple (the namespace is omitted here): [Semantic-Web, contains, OWL]. By treating different predicates differently through rules (e.g. the “contains” predicate has different semantics from the “relatedTo” predicate in the triple [Semantic-Web, relatedTo, Artificial-Intelligence]), the system can infer that the user is likely to be looking for documents related to the more specific term “OWL”. The system understands that the concept “Semantic-Web” is broader than “OWL” and chooses to infer on the concept “OWL” only, which results in retrieving documents related to OWL DL and Description Logic, whereas blind inference would also expand the query terms with Artificial Intelligence.

Another inference scenario in query processing arises when the relationships among the query terms are not stated directly in the knowledge-base, as in the query “Semantic Web, information retrieval”. For this case we adapt the idea of Semantic Association from Sheth et al., who define a Semantic Association as a meaningful complex relationship between entities, events and concepts [12].
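The contrast between blind inference and predicate-aware inference can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's rule engine: the knowledge-base is a toy list of triples, the two triples attached to "OWL" are hypothetical, and the only rule modelled is that a "contains" predicate between two query concepts marks its subject as the broader concept, which is then excluded from expansion.

```python
# Hypothetical toy knowledge-base of (subject, predicate, object) triples.
# The first two triples come from the text; the last two are assumptions.
TRIPLES = [
    ("Semantic-Web", "contains", "OWL"),
    ("Semantic-Web", "relatedTo", "Artificial-Intelligence"),
    ("OWL", "relatedTo", "OWL-DL"),
    ("OWL", "relatedTo", "Description-Logic"),
]


def neighbours(concept, triples):
    """Concepts reachable from `concept` through any predicate."""
    return {o for s, _p, o in triples if s == concept}


def blind_expand(query, triples):
    """Blind inference: expand every query concept, treating all predicates equally."""
    expanded = set(query)
    for concept in query:
        expanded |= neighbours(concept, triples)
    return expanded


def focused_expand(query, triples, narrowing=("contains",)):
    """Expand only the more specific concepts when a narrowing predicate links two query terms."""
    query = set(query)
    broader = {
        s for s, p, o in triples
        if p in narrowing and s in query and o in query
    }
    expanded = set(query)
    for concept in query - broader:
        expanded |= neighbours(concept, triples)
    return expanded


blind = blind_expand(["Semantic-Web", "OWL"], TRIPLES)
focused = focused_expand(["Semantic-Web", "OWL"], TRIPLES)
```

With these toy triples, `blind` includes "Artificial-Intelligence" while `focused` does not; for the single-term query "Semantic-Web" both functions return the same set, because no relationship between query terms is available to narrow the expansion.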

Figure 2. Semantic Association between two entities “Semantic-Web” and “Information-Retrieval”

Figure 2 shows the concepts related to the Semantic Web and Information Retrieval in the ontology. There is no direct relation between “Semantic-Web” and “Information-Retrieval”; nevertheless, there is a semantic association between them, represented as [“Semantic-Web”, “Ontology”, “Semantic-Search”, “Information-Retrieval”]. By identifying the semantic association we can find the semantically related terms “Ontology” and “Semantic-Search”, which can be used as improved query expansion terms.

At the indexing stage, traditional methods perform inference over the concepts in the semantic annotation. The result is a bag of words, which are then used as index terms. We argue that this causes many irrelevant results to be retrieved. Using information extraction techniques, the important terms in documents can be recognised. However, some terms are mentioned only occasionally and should not be given high weights. Here we suggest an approach which performs statistical analysis over the semantic annotations and word occurrences and then performs inference during document indexing.
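For the query-processing case in which two query concepts are not directly related, one simple way to find a semantic association such as the one between “Semantic-Web” and “Information-Retrieval” is a shortest-path search over the ontology graph. The sketch below is a simplification: semantic associations as defined in [12] cover richer relationships than a shortest path, the edge list is a hypothetical fragment of the ontology, and edge labels and directions are ignored.

```python
from collections import deque

# Hypothetical fragment of the ontology as undirected concept pairs.
EDGES = [
    ("Semantic-Web", "Ontology"),
    ("Ontology", "Semantic-Search"),
    ("Semantic-Search", "Information-Retrieval"),
    ("Semantic-Web", "OWL"),
]


def semantic_association(start, goal, edges):
    """Return the shortest concept path from start to goal, or None if unconnected."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    parent = {start: None}
    queue = deque([start])
    while queue:  # breadth-first search guarantees a shortest path
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


path = semantic_association("Semantic-Web", "Information-Retrieval", EDGES)
# Intermediate concepts on the path serve as query expansion terms.
expansion_terms = path[1:-1] if path else []
```

Here `expansion_terms` holds the intermediate concepts "Ontology" and "Semantic-Search", which are the expansion terms named in the text.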

Figure 3. Represented concepts in the annotations of an arbitrary document

In classical information retrieval models the number of occurrences of a word in a document, i.e. the term frequency (tf), is used as an indicator of its importance and in the computation of the term weight. We also consider the frequency with which a concept appears in the RDF statements derived from the semantic annotation. For example, the semantic annotations can be converted into an RDF file, which can be represented as a directed labelled graph as shown in Figure 3. The concept “Semantic-Web” is involved in two statements (one with “Ontology” and another with “OWL”), while the concepts “Description-Logic” and “First-Order-Logic” are each involved in three relations. Intuitively, the importance of a concept within a document depends not only on how many times it is mentioned syntactically but also on the number of roles it plays with other concepts across all statements in the document. In our example, the system would be able to conclude that the document is more related to “Description-Logic” than to “Fuzzy-Logic”, a term that is mentioned only occasionally in the document. To clarify the idea we give some definitions below. We assume that the semantic annotations of the document are represented as a set of RDF statements.

Definition 1. Significant Term (t): A term in the semantic annotations of a document that can be identified in the ontology or knowledge-base.

Definition 2. Agglomeration Ratio (Ar): The number of statements in which a term appears in a document divided by the total number of statements in that document:

\[ A_r = \frac{NS(t, d)}{|S_d|} \tag{1} \]

This is a criterion to measure the relevancy of a concept to the topic of a document.

Definition 3. Semantic Score of a concept (SCse): Measures the importance of a concept from the semantic annotation point of view. The value of the semantic score is defined based on the Agglomeration Ratio.

Definition 4. Statistical Score of a concept (SCst): Measures the importance of a concept based on the term frequency.

The final score of a term t in a particular document d is defined as follows:

\[ SC_t = a \cdot SC_{se} + b \cdot SC_{st} = a \cdot \frac{NS(t, d)}{|S_d|} + b \cdot tf \tag{2} \]

where SCt is the score of the term in the document, NS(t, d) is the number of statements in document d that contain t, |Sd| is the total number of statements identified in that document, and a and b are coefficients. We set both a and b to 0.5, a value chosen heuristically. The computation of the score for a concept considers only the local statistics within a document and not the global statistics, because the intention is to compute a score for selecting indexing terms, not for ranking documents. Indexing terms can then be selected by choosing those terms whose scores exceed a defined threshold. Table 1 gives an example of computing term scores for selecting key index terms. The scores for the terms “Ontology” and “Fuzzy Logic” are clearly much lower than the others, and thus these terms should not be used as key indexing terms.

Table 1. An example of computing terms scores for selecting key index terms
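The definitions and the scoring formula can be made concrete with a short sketch. It rests on several assumptions that the text does not fix: the RDF statements are modelled as plain (subject, predicate, object) tuples, the term frequency is assumed to be normalised to the range [0, 1] so that it is comparable with the Agglomeration Ratio, and the statements, tf values and threshold are hypothetical rather than the values behind Figure 3 or Table 1.

```python
def agglomeration_ratio(term, statements):
    """Ar = NS(t, d) / |Sd|: share of the document's statements that involve the term."""
    if not statements:
        return 0.0
    involved = sum(1 for s, _p, o in statements if term in (s, o))
    return involved / len(statements)


def term_score(term, statements, tf, a=0.5, b=0.5):
    """SCt = a * SCse + b * SCst, with SCse = Ar and SCst = tf (assumed normalised)."""
    return a * agglomeration_ratio(term, statements) + b * tf


def select_index_terms(statements, tf_by_term, threshold, a=0.5, b=0.5):
    """Return the terms whose score exceeds the threshold, highest score first."""
    scores = {
        term: term_score(term, statements, tf, a, b)
        for term, tf in tf_by_term.items()
    }
    selected = [term for term, score in scores.items() if score > threshold]
    return sorted(selected, key=lambda term: scores[term], reverse=True)


# Hypothetical semantic annotation of one document.
statements = [
    ("Semantic-Web", "relatedTo", "Ontology"),
    ("Semantic-Web", "contains", "OWL"),
    ("OWL", "basedOn", "Description-Logic"),
    ("Description-Logic", "fragmentOf", "First-Order-Logic"),
]
# Hypothetical normalised term frequencies.
tf_by_term = {
    "Semantic-Web": 1.0,
    "OWL": 0.8,
    "Description-Logic": 0.6,
    "Ontology": 0.2,
    "Fuzzy-Logic": 0.1,
}
index_terms = select_index_terms(statements, tf_by_term, threshold=0.4)
```

In this example "Fuzzy-Logic" appears in no statement, so its score comes from term frequency alone and falls below the threshold, as does "Ontology", which takes part in only one statement.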

Lastly, an inference process would be performed at the result evaluation stage in order to obtain more trustworthy answers based on trust analysis [7], [19], the ideas of which are discussed in the previous section. The results would also be filtered based on the semantic annotation to improve retrieval precision.

4. Conclusion and Future Work

As an extension of the current Web, the Semantic Web provides a structured data and knowledge representation framework for Web information. This enables enhanced applications to be developed by accommodating various reasoning and analysis processes based on the represented data and knowledge. Several works have been carried out in different areas to improve traditional applications prevailing on the current Web. In this paper, we have described an effort to develop a semantic-enhanced framework which applies Semantic Web technologies to improve the results obtained from traditional information search and retrieval methods. We have explained how the system can analyse semantics to fine-tune the information search and retrieval process. The paper has introduced a collaboration network analysis that can be constructed from the metadata of a collection of scientific publications. The results of the collaboration network analysis could be used as a complementary document ranking scheme. The paper has discussed general ideas and approaches for enhancing recall and precision in information search and retrieval over a focused repository. However, the proposed framework needs to be tested with large data sets to evaluate the feasibility of the approach. In particular, we plan to apply this method to biomedicine and scientific literature repositories.

References

[1] T. Berners-Lee, J. Hendler, and O. Lassila, “The Semantic Web”, Scientific American, May 2001.
[2] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Harlow: Addison-Wesley, 1999.
[3] H. Chen, “Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms”, Journal of the American Society for Information Science, vol. 46, no. 3, pp. 194-216, 1995.
[4] R. Guha, R. McCool, and E. Miller, “Semantic Search”, In proceedings of the 12th international conference on World Wide Web, pp. 700-709, 2003.
[5] R. Guha and R. McCool, “TAP: a Semantic Web platform”, Computer Networks, vol. 42, no. 5, pp. 557-577, 2003.
[6] A. Kiryakov, B. Popov, I. Terziev, D. Manov, and D. Ognyanoff, “Semantic Annotation, Indexing, and Retrieval”, In proceedings of International Semantic Web Conference, pp. 484-499, 2003.
[7] T. Finin, J. Mayfield, C. Fink, A. Joshi, and R. S. Cost, “Information Retrieval on the Semantic Web”, In proceedings of the 38th International Conference on System Sciences, 2005.
[8] J. Mayfield and T. Finin, “Information retrieval on the semantic web: integrating inference and retrieval”, In proceedings of SIGIR 2003 Semantic Web Workshop, 2003.
[9] I. Celino, E. D. Valle, D. C., and A. Turati, “Squiggle: a Semantic Search Engine for Indexing and Retrieval of Multimedia Content”, In proceedings of the 1st International Workshop on Semantic-Enhanced Multimedia Presentation Systems, 2006.
[10] L. Ding, T. Finin, A. Joshi, R. Pan, S. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs, “Swoogle: A Semantic Web Search and Metadata Engine for the Semantic Web”, In proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 652-659, 2004.
[11] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, “GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications”, In proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), 2002.
[12] A. Sheth, I. B. Arpinar, and V. Kashyap, “Relationships at the Heart of Semantic Web: Modeling, Discovering, and Exploiting Complex Semantic Relationships”, Enhancing the Power of the Internet: Studies in Fuzziness and Soft Computing, Springer-Verlag, 2002.
[13] S. Milgram, “The small world problem”, Psychology Today, vol. 1, pp. 61-67, 1967.
[14] M. E. J. Newman, D. J. Watts, and S. H. Strogatz, “Random Graph Models of Social Networks”, In proceedings of the National Academy of Sciences of the USA, pp. 2566-2572, 1999.
[15] J. W. Grossman, “The Evolution of the Mathematical Research Collaboration Graph”, Congressus Numerantium, vol. 158, pp. 201-212, 2002.
[16] M. E. J. Newman, “Scientific Collaboration Networks I. Network Construction and Fundamental Results”, The American Physical Society, 2001.
[17] M. E. J. Newman, “Scientific Collaboration Networks II. Shortest Paths, Weighted Networks, and Centrality”, The American Physical Society, 2001.
[18] P. Mika, “Flink: Semantic Web Technology for the extraction and analysis of social network”, Journal of Web Semantics, no 2, vol. 3, 2005.
[19] J. Golbeck, B. Parsia, and J. Hendler, “Trust Networks on the Semantic Web”, In proceedings of cooperative intelligent agents, 2003.
[20] S. Staab, P. Domingos, P. Mika, J. Golbeck, L. Ding, T. Finin, A. Joshi, A. Nowak, R., R. Vallacher, “Social Networks Applied”, IEEE Intelligent Systems, vol. 20, no. 1, pp. 80-93, 2005.