SWISE: Semantic Web based Intelligent Search Engine

Faizan Shaikh, Usman A. Siddiqui, Iram Shahzadi

Syed I. Jami, Zubair A. Shaikh

Department of Computer Science, National University of Computer & Emerging Sciences, Karachi, Pakistan

Center for Research in Ubiquitous Computing, National University of Computer & Emerging Sciences, Karachi, Pakistan
{imran.jami, zubair.shaikh}@nu.edu.pk

Abstract -- Most search engines answer user queries by matching keywords: they search web pages for the required information and filter out unnecessary pages with advanced algorithms. Such engines can answer topic-wise queries efficiently and effectively using state-of-the-art algorithms, and their main focus is solving these queries with close-to-accurate results in a short time. However, because their results depend entirely on the information available in web pages, they are weak at answering intelligent queries: they either return inaccurate results or results that are accurate but possibly unreliable. With keyword-based searches they usually return results from blogs (if available) or other discussion boards, with which the user cannot be satisfied due to the lack of trust in such sources. Obtaining trusted results requires searching pages that maintain such information from authoritative sources, which in turn requires including domain knowledge in the web pages to help search engines answer intelligent queries. The layered model of the Semantic Web provides a solution to this problem by offering tools and technologies that enable machine-readable semantics in current web content.

Keywords: Semantic Web, Search Engines, Ontology, RDF Graphs

I. INTRODUCTION

Search engines have become one of the most important and interesting areas for users of the World Wide Web (WWW), yet the commercially available search engines do not completely serve the needs and demands of users. The problems that typical search engines suffer can be cast into two major areas. First, they do not provide the reliability that the user demands. For example, when a user issues a query such as "Which is the best university in my city?", the search engine returns thousands of results, but it is difficult for the user to find out which source is reliable; the user has to sift through all the retrieved pages to find only the reliable results. Secondly, the relevancy of the provided results is not up to the mark: against the previous query, the engine returns results such as "Scholarships in best university of my city" or "Admissions in best university of my city". These problems occur because of the structure of the current Web, in which documents and pages are merely linked to each other by hyperlinks. Today's web is built for presentation: it is easy for a human to understand what a particular webpage provides, but it is not easy for a machine to understand the content of that webpage. The major reason is the absence of machine-understandable semantic information from the Web. This can be remedied by transforming the contents of the current web through a framework known as the Semantic Web [1], a web with meaning.

In this paper, we propose a semantic web based search engine named SWISE. We use the power of XML meta-tags deployed on the web page to search for the queried information. The XML page consists of built-in and user-defined tags, and the metadata of the pages is extracted from this XML into RDF. The RDF graphs [2] are populated through XForms. These tags help the system obtain answers from reliable sources (for example, the controlling authority for universities). For the relevancy factor, we use the power of ontology [1] to group the domain information of our interest. In this work we focus only on the domain of universities, but the system is capable of adding more domains by incorporating their respective ontologies. SPARQL is used for retrieving results for our sample queries. We incorporated W3C [3] tools to make our system applicable on all platforms: semantic interoperability is achieved by using ontologies, while the use of XML/RDF ensures machine understandability.

II. RELATED WORK

Google [4], Yahoo [5] and Bing [6] handle queries by processing keywords, which makes them keyword-based search engines; they only search the information given on the web page. Recently, some research groups have started delivering results from semantics-based search engines, though most of these are still in their initial stages. Hakia [7] is a general-purpose semantic search engine that searches structured text such as Wikipedia. Hakia calls itself a "meaning-based (semantic) search engine" [8]: it tries to provide search results based on meaning match rather than on the popularity of search terms. The presented news, blogs, credible sources and galleries are processed by Hakia's proprietary core semantic technology called QDEXing [7]. It can process any kind of digital artefact with its SemanticRank technology using third-party API feeds [9]. A single user query brings results from repositories including the Web, news, blogs, video, images, Hakia galleries and credible sources [9]. For short queries the site displays results in categories instead of the standard list shown by current search engines; for longer queries, Hakia highlights relevant phrases or sentences. The results are fairly relevant and reliable, but Hakia does not reveal its internal technology. Because Hakia takes the query and finds results in many categories (galleries, videos, etc.), it takes more time than the usual search engines to retrieve results [7].

SenseBot [10] represents a new type of search engine that prepares a text summary in response to the user's search query. SenseBot extracts the most relevant results from the Web using Semantic Web technologies and then summarizes them for the user by topic. It uses text-mining algorithms to parse (human-readable) Web pages and identify key semantic concepts [8]. A coherent summary is then produced from the retrieved documents [10], and this summary itself becomes the main result of the search. The results are still not fully relevant, because the summarization may divert the results from the actual demands of the user [8]. The sources from which the results come are usually news agencies, so reliability is also somewhat lacking.

Powerset [11] does not search simply on keywords alone, but also tries to understand the semantic meaning behind the search phrase as a whole. Powerset's first product is a search and discovery experience for Wikipedia. It attempts to use natural language processing to understand the nature of the question and return pages containing the answer [9]. It gives more accurate results and aggregates information from across multiple articles [8]. The results returned by Powerset are more reliable and relevant than those of the other semantic search engines; however, its scope is limited to articles of Wikipedia.

DeepDyve [12] is a powerful, professional research engine that lets users access expert content from the "Deep Web", the part of the internet that is not indexed by traditional search engines. It indexes every word in a document, but also computes the factorial combination of words and phrases in the document and uses industrial-strength statistical techniques to assess the "informational impact" of these combinations [8]. The presentation of search results is complex: users are given many advanced options for refining, sorting or saving the search, although the results themselves are relatively easy to navigate. The results are available only to paid customers, not to the general public.

III. PROPOSED MODEL

The problems described in the previous sections can be resolved by maintaining a metadata repository for pages that contain domain knowledge from trusted sources. Instead of searching keywords on the web page, the search engine then searches this metadata for the required information. In this work we developed a search engine based on this concept: it first locates the pages and then obtains the result by searching their metadata. The metadata recording can be either manual or automated. A manual system requires input of information from the administrator of the web site; this solution is improper since it can compromise reliability and efficiency. An automated system can be developed by employing agents that gather information from the trusted web sites.

Figure 1: Design Architecture of SWISE

The interoperability issues can be resolved by using W3C-compliant tools. For representing domain knowledge, W3C proposes ontologies in OWL [1], while metadata can be represented in graphs as RDF triples [2]. This approach ensures heterogeneity at the data, schema and device levels. In the next subsections we discuss each component of figure 1.

IV. SEARCH ENGINE

The interface for our search engine is shown in figure 2. When the user submits a query, the search engine runs the SPARQL query engine [13] to search for the relevant tags in our maintained semantic web documents.

Figure 2: Search Engine Interface

This search engine answers queries related to real-world problems by showing only reliable and close-to-relevant results. Existing keyword-based search engine techniques cannot answer intelligent queries. For example, when querying for the best property agent for buying or selling, one needs to ask a qualified person (e.g. an estate agent) instead of a person walking on the road; the information obtained this way is more reliable than information from other sources. To adopt this feature in a search engine, it is required to use tagged information to categorize the relevant resources. SWISE uses these tags to identify the proper resource (for example, a property agent) on the Web.

V. SEMANTIC WEB DOCUMENTS

Semantic Web documents are a mixture of classes and the relationships among them. They hold the metadata that describes digital objects identified by URIs. Conceptually they hold two kinds of information: schema and instances. Fully understanding such a document requires a complex set of tasks depending on the granularity of the information retrieved by search engines. The information in these documents is represented as a graph structure. Our semantic web document, as shown in figure 1, maintains the metadata of web pages from reliable sources only, as specified by the respective controlling authority. The relevant information is organized as graphs on the basis of a developed schema, and the schema is developed using an ontology. To test our system we selected the domain of higher education in Pakistan due to the high availability of relevant information.
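To make this graph structure concrete, the following minimal sketch uses the Jena toolkit [16] (introduced in Section VI) to load such a semantic web document and print its triples. The file name and the current Apache Jena package names are illustrative assumptions, not the exact artifacts of our implementation.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.Statement;
    import org.apache.jena.rdf.model.StmtIterator;
    import org.apache.jena.riot.RDFDataMgr;

    public class InspectDocument {
        public static void main(String[] args) {
            // Load a semantic web document (hypothetical file name) into an RDF graph.
            Model model = RDFDataMgr.loadModel("university-metadata.rdf");

            // Every statement in the graph is a (subject, predicate, object) triple.
            StmtIterator it = model.listStatements();
            while (it.hasNext()) {
                Statement st = it.nextStatement();
                System.out.println(st.getSubject() + " -- " + st.getPredicate() + " -> " + st.getObject());
            }
        }
    }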

A. Ontology Development

Ontologies are widely used as a technique for the representation and reuse of knowledge. An ontology can be defined as an "explicit formal specification of a shared conceptualisation" [1]. Ontologies are extensively used to share a common understanding of the structure of information among people, machines and applications. For the development of our ontology, we adopted an approach known as METHONTOLOGY [14], which is used to build ontologies from scratch. The main steps in developing an ontology through this approach are [14]:

• Specify the purpose and scope of the ontology.

• Collect domain knowledge by brainstorming, formal and informal analysis of texts, and knowledge acquisition tools.

• Identify a glossary of terms covering all possibly useful knowledge in the given domain.

• Group terms according to concepts and verbs, which are then classified into hierarchies of relationships.

• Search for any already existing ontologies that can be reused or adapted.

• Implement the ontology, codified in a formal language.

As a first step, the purpose of designing an ontology for the search engine is to achieve relevancy in the search results; the scope of this ontology is limited to the universities of Pakistan. In the second phase, we collected the domain knowledge related to our scope. In this regard we selected the Higher Education Commission (HEC) of Pakistan [15], the most authoritative body from which to obtain reliable domain information. Every university must be chartered by HEC in order to start its degree programs, and HEC publishes rankings of the universities of Pakistan every year on the basis of pre-specified criteria; we therefore used the criteria specified by HEC for collecting the domain information of universities in Pakistan. In the third phase, we retrieved the list of terms available at HEC [15] and developed the relationships among them using a structure similar to RDF triples. In the next phase, we defined the hierarchies of relationships that can exist among the classes of our ontology. These concepts, facts and relationships are implemented using Protégé, as shown in figure 3. Protégé allows us to define the classes that exist in our ontology and the relationships among them, i.e. parent-child or sibling relationships. In this phase we also identified disjoint subclasses: classes which do not share any common instance must be declared disjoint. There are two basic types of properties in Protégé: an Object Property, which has individuals in its domain and range, and a Datatype Property, whose value is a data literal.

Figure 3: University Ontology Diagram in Protégé

Figure 4: Domain Range Relationship of Entities
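To show how these modelling decisions look in code, the sketch below builds a fragment of such an ontology with Jena's ontology API instead of Protégé. The namespace, class and property names are our assumptions based on figures 3 and 4, not the exact ontology.

    import org.apache.jena.ontology.*;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.XSD;

    public class UniversityOntology {
        static final String NS = "http://www.example.org/university#"; // hypothetical namespace

        public static void main(String[] args) {
            OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);

            // Classes and parent-child (subclass) relationships.
            OntClass university = m.createClass(NS + "University");
            OntClass faculty    = m.createClass(NS + "FacultyMember");
            OntClass professor  = m.createClass(NS + "Professor");
            OntClass lecturer   = m.createClass(NS + "Lecturer");
            faculty.addSubClass(professor);
            faculty.addSubClass(lecturer);

            // Sibling classes that share no common instance are declared disjoint.
            professor.addDisjointWith(lecturer);

            // Object Property: individuals in both its domain and its range.
            ObjectProperty hasFaculty = m.createObjectProperty(NS + "hasFaculty");
            hasFaculty.addDomain(university);
            hasFaculty.addRange(faculty);

            // Datatype Property: its value is a data literal.
            DatatypeProperty numPhd = m.createDatatypeProperty(NS + "numberOfPhDProfessors");
            numPhd.addDomain(university);
            numPhd.addRange(XSD.integer);

            m.write(System.out, "RDF/XML"); // serialize the ontology
        }
    }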

B. Metadata Repository

The second part of our semantic web documents block is the set of instances of the ontology described in the previous section. These instances are represented as metadata containing information about the target web pages. We used W3C-based tools to ensure semantic interoperability; in this regard we used OWL/RDF to represent the metadata in a graph-based structure. This representation expresses information as 'subject', 'predicate' and 'object', analogous to English grammar: the subject and object are instances of classes, while the predicate defines the relationship or property between them. In terms of mapping, the set of subjects forms the domain, which is mapped through relationships to the set of objects as the range. This relationship graph is shown in figure 4. As an example, consider the following set of information extracted from the HEC criteria for universities:

• University has FYP lab

• University has research labs

• University has PhD professors

• University has assistant professors

The corresponding schema, extracted from figures 3 and 4, is shown in figure 5.

Figure 5: Code Snippet of University Ontology in OWL/RDF
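As a hedged sketch of what such a snippet expresses, the facts listed above can be recorded as RDF triples with Jena; the resource name, property names and numeric values here are purely illustrative assumptions, not the content of figure 5.

    import org.apache.jena.rdf.model.*;

    public class PopulateMetadata {
        static final String NS = "http://www.example.org/university#"; // hypothetical namespace

        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();

            // Properties mirroring the HEC criteria listed above.
            Property hasFYPLab      = m.createProperty(NS, "hasFYPLab");
            Property hasResearchLab = m.createProperty(NS, "hasResearchLab");
            Property phdProfessors  = m.createProperty(NS, "numberOfPhDProfessors");
            Property asstProfessors = m.createProperty(NS, "numberOfAssistantProfessors");

            // One university instance (subject) with its facts (predicate, object).
            Resource uni = m.createResource(NS + "SomeUniversity");
            uni.addLiteral(hasFYPLab, true);
            uni.addLiteral(hasResearchLab, true);
            uni.addLiteral(phdProfessors, 25);   // illustrative value
            uni.addLiteral(asstProfessors, 60);  // illustrative value

            m.write(System.out, "RDF/XML"); // one entry of the metadata repository
        }
    }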

The repository contains a huge OWL/RDF graph holding information (metadata) about the universities of Pakistan. The information is extracted only from reliable sources, while relevancy is provided by the ontology, which restricts the search to tagged information. A crawler to gather the required fields is under development; currently this metadata is populated manually using XForms.

VI. RESULTS

We tested this system against a set of different queries from the domain of higher education. The SPARQL query engine, a semantic web tool for querying RDF graph structures, is used for querying. We used the Jena toolkit [16] in Java to build the query interface for the user through SPARQL. Following are some of the intelligent queries that we tested on the system:

• Which university has the highest number of PhDs?

• How many PhDs are working in a given university?

• How many faculty members of a given university are at the rank of professor/assistant professor/lecturer?

• Which university has the highest number of students in a particular program?

• Which universities are running PhD programs?

• How much funding is allocated by a particular university for research/teaching/infrastructure/extracurricular activities?

Figure 6 shows the SPARQL code snippet that queries for the number of PhD faculty members.
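A query of this kind might look like the following Jena sketch; the SPARQL property name and data file reuse the illustrative assumptions of the earlier sketches, not the exact query of figure 6.

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class PhdQuery {
        public static void main(String[] args) {
            Model model = RDFDataMgr.loadModel("university-metadata.rdf"); // hypothetical file

            // SPARQL: list each university with its number of PhD professors, highest first.
            String q =
                "PREFIX uni: <http://www.example.org/university#> " +
                "SELECT ?university ?phds " +
                "WHERE { ?university uni:numberOfPhDProfessors ?phds . } " +
                "ORDER BY DESC(?phds)";

            try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(q), model)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.nextSolution();
                    System.out.println(row.getResource("university") + " : " + row.getLiteral("phds").getInt());
                }
            }
        }
    }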

Figure 6: SPARQL Query

The results are delivered from our metadata, which is the combined view of reliable pages only.

VII. CONCLUSION

The Semantic Web is considered a Web of Data. It is not a newer version of the Web; rather, it advocates the conversion of existing Web content into machine-readable form. Machines require semantic information to establish relationships among content. The major limitation of current search engines is the absence of these semantics from current Web content, which results in the retrieval of a huge number of results, most of them neither reliable nor relevant. In this work we use W3C-compliant Semantic Web tools to search for semantic information on pages, which enables our system to work on any platform. Many extensions to this system are possible. Information from other domains can be included by proposing their ontologies. Currently the metadata feed is a manual process using XForms; we are working towards automating it, for which there are two possible solutions: semantic crawlers and RSS feeds. Furthermore, queries should be parsed with a natural language processing framework, which requires implementing a complex set of algorithms.

REFERENCES

[1] G. Antoniou and F. van Harmelen, A Semantic Web Primer (Cooperative Information Systems), 2nd ed. The MIT Press, 2008.
[2] F. Manola, E. Miller, and B. McBride, "RDF Primer," W3C Recommendation, Vol. 10, 2004.
[3] "World Wide Web Consortium (W3C)," http://www.W3C.org
[4] "Google Search Engine," http://www.google.com
[5] "Yahoo Search Engine," http://www.yahoo.com
[6] "Bing Search Engine," http://www.bing.com
[7] D. Tümer, M. A. Shah, and Y. Bitirim, "An Empirical Evaluation on Semantic Search Performance of Keyword-Based and Semantic Search Engines: Google, Yahoo, Msn and Hakia," 4th International Conference on Internet Monitoring and Protection (ICIMP '09), 2009.
[8] "Top 5 Semantic Search Engines," http://www.pandia.com/
[9] H. Dietze and M. Schroeder, "GoWeb: a semantic search engine for the life science web," BMC Bioinformatics, Vol. 10, Suppl. 10, p. S7, 2009.
[10] "SenseBot Semantic Search Engine," http://www.sensebot.net/
[11] "Powerset: Wikipedia Search Engine," http://www.powerset.com/
[12] "DeepDyve: The largest online rental service for scientific, technical and medical research," http://www.deepdyve.com/
[13] E. Prud'hommeaux and A. Seaborne, "SPARQL Query Language for RDF," W3C Working Draft, Vol. 20, 2006.
[14] O. Corcho, M. Fernández-López, A. Gómez-Pérez, and A. López-Cima, "Building legal ontologies with METHONTOLOGY and WebODE," Law and the Semantic Web, pp. 142-157, 2003.
[15] "Higher Education Commission (HEC) of Pakistan," http://www.hec.gov.pk
[16] B. McBride, "Jena: A semantic web toolkit," IEEE Internet Computing, Vol. 6, No. 6, pp. 55-59, 2002.