Domain and Range Identifier Module for Semantic Web Search Engines

Tarik Alafif and Sreela Sasi
Computer and Information Science Department, Gannon University, Erie, U.S.A.
Email: {alafif001, sasi001}@gannon.edu

Abstract—Semantic web search engines are the new generation of conventional web search engines and bring precise and meaningful information from the Internet. These search engines answer user queries using Semantic Web Documents (SWDs) that are found in an ontologies database. A query is likely to have more than one range in a domain. Semantic web search engines such as Hakia, Swoogle, and Watson do not identify the domains and ranges of the user’s query while retrieving search results. Hence, the retrieved search results are not restricted to a single range from the domain of the user query. In this paper, a novel Domain and Range Identifier (DRI) module is proposed that can be incorporated into existing semantic web search engines to resolve this problem. The DRI module uses ontologies represented as SWDs to validate the domain, and to identify and classify the ranges in that domain. This helps semantic web search engines to retrieve more focused information based on the user’s preferences. A Graphical User Interface is developed for users to select a range and to obtain only the relevant search results.

Keywords-Semantic Web Search Engines, Domain and Range Identifier, Ontologies, Web Ontology Language, Semantic Web Documents

I. INTRODUCTION

Web search engines are used to explore and retrieve information from the World Wide Web, but they have difficulty retrieving only relevant websites as search results. This problem can be addressed with semantic web technologies. The semantic web, often referred to as Web 3.0, is an extension of Web 2.0 that was defined by Tim Berners-Lee [1]. Information in the semantic web is represented as metadata: a set of defined terms and relations within a given knowledge domain. Such a set of metadata is called an “ontology”. Ontologies have been formalized as a standard for the semantic web by the W3C. They are created in plain text format and are represented using the Web Ontology Language (OWL) [2]. OWL is based on the XML standard, which allows developers to define new semantic description tags and to modify them. An OWL document is considered a form of Semantic Web Document (SWD) that describes resources on the web [3]. SWDs are not intended for users to view; they are designed to be read and understood by software applications on other machines. They are elements of the semantic web and exist on the web alongside HTML documents.

SWDs use different syntax forms and languages to represent data and knowledge. Some of these languages are Resource Description Framework (RDF), RDF Schema (RDFS), RDF/XML, N3, N-Triples, SPARQL, and OWL [4]. Semantic web search engines can understand these documents and perform information retrieval from the web without any loss of meaning. A semantic web search engine uses ontologies to understand information about objects and to identify how they relate to each other logically. It utilizes SWDs for retrieving search results. Most semantic search engines combine machine learning, natural language processing, web mining, and information retrieval techniques for processing queries and retrieving search results. There are many conventional and semantic search engines available for use.

II. EXISTING SEARCH ENGINES

Google is one of the leading conventional web search engines of Web 2.0. It uses a crawler to collect HTML documents from the web and store them in a database [5]. An indexer then scans the text of each document to build an inverted file: it writes the entries to a temporary file and then sorts that file into term-number order. Google uses the PageRank algorithm for retrieving and ranking search results [6]. Google is not a semantic web search engine, because it does not process SWDs and does not deal with any domain knowledge; it uses only HTML documents as web pages. Google still cannot retrieve only relevant search results from the web to satisfy users’ growing demand.

Hakia is a semantic web search engine [7]. Its search structure is based on text. It provides search results for the user’s query based on “meaning match” rather than on the popularity of the text. Hakia uses a semantic web technology called QDEXing to present the search results by category. These categories include galleries, news, blogs, videos, and “credible”, among others.

Swoogle is another semantic web search engine [8]. It has a crawler that discovers, collects, and analyzes SWDs. It then uses the same PageRank algorithm as Google for retrieving and ranking search results. Swoogle helps users to find ontologies that contain the terms specified in the user query.

Watson is another semantic web search engine [9]. It discovers and collects SWDs using a web crawler, and then analyzes and indexes each document based on the ontologies it contains. Both Swoogle and Watson retrieve ontologies as search results, and both are still under research.

Hawking and Pokorny have proposed generic architectures for traditional web search engines [5][9]. These architectures consist of website servers, a URL database, a crawler, an indexer, and a user interface. The authors explain the functions of a web search engine’s components and the data processing behind web crawling and searching.

Lv, Kobayashi, Agusa, Wu, and Zhu have introduced an approach that attaches content descriptions to web images using RDF [10]. In their paper, they state some problems that cannot be avoided. One of these problems arises when searching with a keyword that has several ranges in a domain, such as “eclipse”, which is associated with software, car navigation, and astronomy. This paper addresses that problem using the semantic web: a DRI module first validates the domain and then identifies and classifies the ranges in that domain.

The traditional and semantic web search engines such as Google, Hakia, Swoogle, and Watson do not identify the domain or the ranges in that domain. Hakia retrieves websites from all the ranges of the user query. Swoogle and Watson, on the other hand, retrieve only ontologies as search results because they are still under research. In this paper, a DRI module is proposed that can be incorporated into existing semantic web search engines to resolve the problem of having different ranges for a user’s query. The DRI is an intelligent web mining and information extraction approach that uses ontologies. This approach helps semantic web search engines to retrieve only relevant search results.
The module identifies the domain of the user query and then classifies the ranges of that domain. The third section of this paper shows the DRI module inside the semantic web search engine architecture. The fourth section explains the process flow for the DRI module. In the fifth section, the details of a Graphical User Interface (GUI) developed for this research are given. Conclusions and future work are presented in the sixth section, followed by the references used for this research.

III. ARCHITECTURE OF THE SEMANTIC WEB SEARCH ENGINE USING DRI MODULE

The basic structure of a semantic web search engine is shown in Figure 1. A novel DRI module is incorporated inside this semantic web search engine. The architecture consists of a user interface, the DRI module, an ontologies database, an indexer, a semantic web crawler, and servers.

A. User Interface

A user interface is developed for users to enter their search queries on the semantic web. Each query is passed on to the DRI module for analysis and retrieval of the search results, which are then displayed to the user.

Figure 1. Architecture of the semantic web search engine using DRI module

B. Domain and Range Identifier (DRI) Module

The main function of the DRI module is to identify the domain of the user query. It also identifies and classifies the ranges of that domain using ontologies. When the user enters a query, the DRI module first checks whether a matching domain exists in the ontologies database. If there is no domain, the module asks the user to enter another query. Otherwise, the module identifies and classifies the various ranges in that domain and offers them to the user as options. The search results are then displayed for the chosen range.

C. Ontologies Database

The ontologies database stores all the SWDs that are brought in by the semantic web crawler. These SWDs include ontologies describing the domain, range, link, and content for each HTML document on a website.

D. Indexer

The indexer’s main function is to index the SWDs that have been crawled from the web and stored by the semantic web crawler. It also sorts and ranks the search results based on content, complexity, quality, and relation to other resources.

E. Semantic Web Crawler

The semantic web crawler is a program that regularly crawls, visits, and collects every SWD on the web. The main goal of the crawler is to build a queue of SWDs to visit in the future. It also adds them to the ontologies database or updates the existing entries.

F. Server

Servers are dedicated physical computers that provide services over the network. They host websites and SWDs, which are served in parallel. Generally, users publish their websites and SWDs containing ontologies on these servers.

IV. QUERY FLOW PROCESS USING DRI MODULE

The flow process of the user “query” inside the semantic web search engine is shown in Figure 2.
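The decision logic described for the DRI module in Section III can be illustrated with a minimal Python sketch. It is not the authors' implementation, which was written in PHP and is not reproduced here. The dictionary layout, the function names (identify_ranges, results_for_range, run_query), and the example.org links are assumptions made for illustration; only the three ranges of “eclipse” come from the text.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the ontologies database:
# domain keyword -> range label -> result links.
# The "eclipse" ranges follow the example in the text; the URLs are made up.
OntologiesDB = Dict[str, Dict[str, List[str]]]

SAMPLE_DB: OntologiesDB = {
    "eclipse": {
        "software": ["http://example.org/software/eclipse-ide"],
        "car navigation": ["http://example.org/navigation/eclipse-unit"],
        "astronomy": ["http://example.org/astronomy/solar-eclipse"],
    },
}


def identify_ranges(db: OntologiesDB, query: str) -> Optional[List[str]]:
    """Validate the query's domain and return its ranges, sorted by name.

    Returns None when the domain is not in the database, in which case
    the caller should ask the user for another query.
    """
    ranges = db.get(query.strip().lower())
    if ranges is None:
        return None
    return sorted(ranges)


def results_for_range(db: OntologiesDB, query: str, chosen: str) -> List[str]:
    """Return only the links stored under the range the user selected."""
    ranges = db.get(query.strip().lower(), {})
    return list(ranges.get(chosen, []))


def run_query(
    db: OntologiesDB,
    query: str,
    choose: Callable[[List[str]], str],
) -> Optional[List[str]]:
    """One pass through the flow; `choose` stands in for the GUI selection."""
    ranges = identify_ranges(db, query)
    if ranges is None:
        return None  # no domain found: prompt for another query
    return results_for_range(db, query, choose(ranges))
```

In this sketch, a call such as run_query(SAMPLE_DB, "eclipse", lambda ranges: ranges[0]) returns only the links filed under the first listed range, while a query with no matching domain returns None so that the interface can prompt for a new query.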

V. SIMULATION EXAMPLE FOR THE CONCEPT

Open source code for a semantic web search engine is not currently available. Hence, an application that implements a portion of the semantic search engine has been developed as part of this research. This application implements a DRI module in PHP to identify and classify the meaning of the user “query” within a range. The semantic web crawler discovers and visits the SWDs on the web and then stores them in the ontologies database. For simulation, ontologies are created for specific keywords of the user “query” that have more than one range, as shown in Figures 3, 4, 5, and 6.
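As an illustration of how such a simulation could be set up, the following Python sketch crawls a small in-memory collection of SWDs and files each described page under its domain and range. It is a sketch under stated assumptions, not the authors' PHP code, and it does not reproduce the ontologies in Figures 3 to 6. The dri: vocabulary, its namespace URI, the document URLs, and the function names parse_swd and crawl are all hypothetical; only the rdf: namespace and the standard-library calls are real.

```python
import xml.etree.ElementTree as ET
from collections import deque

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DRI_NS = "http://example.org/dri#"  # hypothetical vocabulary, not from the paper

SWD_TEMPLATE = """<rdf:RDF xmlns:rdf="{rdf}" xmlns:dri="{dri}">
  <rdf:Description rdf:about="{link}">
    <dri:domain>{domain}</dri:domain>
    <dri:range>{range}</dri:range>
    <dri:content>{content}</dri:content>
  </rdf:Description>
</rdf:RDF>"""


def make_swd(link, domain, range_, content):
    """Build one made-up RDF/XML document describing a single HTML page."""
    return SWD_TEMPLATE.format(rdf=RDF_NS, dri=DRI_NS, link=link,
                               domain=domain, range=range_, content=content)


# Simulated web: SWD URL -> RDF/XML text. All entries are invented.
SIMULATED_WEB = {
    "http://example.org/swd/1.rdf": make_swd(
        "http://example.org/pages/eclipse-ide.html",
        "eclipse", "software", "Integrated development environment"),
    "http://example.org/swd/2.rdf": make_swd(
        "http://example.org/pages/solar-eclipse.html",
        "eclipse", "astronomy", "Solar and lunar eclipses"),
    "http://example.org/swd/3.rdf": make_swd(
        "http://example.org/pages/eclipse-navi.html",
        "eclipse", "car navigation", "In-car navigation units"),
}


def parse_swd(xml_text):
    """Extract (domain, range, link) triples from one SWD."""
    root = ET.fromstring(xml_text)
    records = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        link = desc.get(f"{{{RDF_NS}}}about")
        domain = desc.findtext(f"{{{DRI_NS}}}domain")
        range_ = desc.findtext(f"{{{DRI_NS}}}range")
        if link and domain and range_:
            records.append((domain.strip().lower(), range_.strip().lower(), link))
    return records


def crawl(web, seeds):
    """Visit each queued SWD once and add its records to the database."""
    queue = deque(seeds)
    visited = set()
    database = {}  # domain -> range -> list of links
    while queue:
        url = queue.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        for domain, range_, link in parse_swd(web[url]):
            links = database.setdefault(domain, {}).setdefault(range_, [])
            if link not in links:
                links.append(link)
    return database
```

Calling crawl(SIMULATED_WEB, list(SIMULATED_WEB)) yields a nested dictionary in which “eclipse” maps to three ranges, each with its own list of links. A structure of this kind is what a DRI-style lookup would need in order to offer the user a choice of range.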