Ontology based web retrieval


Raman Kumar Goyal¹, Vikas Gupta², Vipul Sharma³, Pardeep Mittal⁴
¹Lecturer (Information Technology), RIEIT, Railmajra; ²AP (CSE), RIEIT, Railmajra; ³Student, UIET, Panjab University, Chandigarh; ⁴AP (CSE), BFCET, Bathinda

Abstract: Web crawling is the process used by search engines to collect pages from the Web. Building a Web crawler that downloads the most relevant Web pages from such a large Web is still a major challenge in the field of Information Retrieval Systems. Most Web crawlers use a keyword-based approach for retrieving information from the Web, but they also retrieve many irrelevant Web pages. With the use of semantics, more relevant pages can be downloaded, and semantics can be provided by ontologies. This paper proposes an algorithm for an ontology-based Web crawler such that only relevant Web pages are retrieved. The algorithm is tested against different relevancy limits by comparing the harvest rate and recall at each relevancy limit. We also show that the harvest rate of this crawler is much higher than that of a simple breadth-first crawler.

Keywords: Web Crawling, Search Engines

I INTRODUCTION

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Most Web search engines and crawlers use keywords for retrieving information from the Web. Much information on the Web is stored in relational databases. Data stored in relational databases is invisible to search engines and Web crawlers, but it still participates in the creation of Websites. This data can be obtained with the help of ontologies. A crawler that crawls only through specific pages is known as a focused crawler. It crawls the particular focused portions of the World Wide Web quickly. Focused crawling is suitable where only a particular domain is of interest, not the whole Web. In order to perform focused crawling, domain knowledge is needed, and this domain knowledge is represented in the form of ontologies. An ontology [1] is a formal, explicit specification of a shared conceptualization. An ontology provides a common vocabulary of an area and defines, with different levels of formality, the meaning of terms and the relationships between them. Tim Berners-Lee, creator of the World Wide Web, foresees a future when the Web will be more than just a collection of Web pages. In this future, computers themselves will be able to consider the meaning, or semantics, of information sources on the Web, known as the Semantic Web [2].

II RELATED WORK AND OVERVIEW OF OUR APPROACH

S. Ganesh et al. [6] proposed a new metric that solved, to an optimal level, the major problem of finding the relevancy of pages before the process of crawling. Chang et al. [4] present an intelligent focused crawler algorithm in which an ontology is embedded to evaluate a page's relevance to the topic. Debajyoti et al. [5] proposed a relevancy calculation algorithm with a relevancy limit. We use an open source platform known as Protégé [3]. For building the ontology we have taken help from the study of the Energy Security Ontology by Rosella [7]. Energy is the root class in our ontology. It has six subclasses: Market, Energy Security, Infrastructures, Country, Energy Sources, and Environmental Consequences. Energy Security has two further subclasses: Risks and Solutions. Energy Sources has three further subclasses: Non Renewable, Renewable, and Nuclear. Non Renewable has a further subclass, Fossil Fuel. Market has four subclasses: Commercial Sector, Residential Sector, Industrial Sector, and Transportation. The Nuclear, Renewable, and Non Renewable classes are made disjoint from each other; that is, no instance belonging to one of these classes can fall into another. A sketch of this class hierarchy in code is given below.
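For illustration, the hierarchy above can be written down in code. The sketch below uses the owlready2 Python library, which is our own choice for this example (the authors built the ontology with Protégé); the ontology IRI is a placeholder.

    from owlready2 import Thing, get_ontology, AllDisjoint

    # Placeholder IRI; the actual ontology was built interactively in Protégé.
    onto = get_ontology("http://example.org/energy.owl")

    with onto:
        class Energy(Thing): pass
        # Six direct subclasses of the root class Energy.
        class Market(Energy): pass
        class EnergySecurity(Energy): pass
        class Infrastructures(Energy): pass
        class Country(Energy): pass
        class EnergySources(Energy): pass
        class EnvironmentalConsequences(Energy): pass
        # Subclasses of Energy Security.
        class Risks(EnergySecurity): pass
        class Solutions(EnergySecurity): pass
        # Subclasses of Energy Sources.
        class NonRenewable(EnergySources): pass
        class Renewable(EnergySources): pass
        class Nuclear(EnergySources): pass
        class FossilFuel(NonRenewable): pass
        # Subclasses of Market.
        class CommercialSector(Market): pass
        class ResidentialSector(Market): pass
        class IndustrialSector(Market): pass
        class Transportation(Market): pass
        # No instance of one of these classes may belong to another.
        AllDisjoint([NonRenewable, Renewable, Nuclear])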

Fig. 1: OWLViz view of classes generated by Protégé

III ALGORITHM

Level  Ontology Term               Weight
1      Energy                      0.1
2      Country                     0.6
2      Market                      0.6
2      Environmental Consequences  0.6
2      Infrastructures             0.6
2      Energy Security             0.6
2      Energy Sources              0.6
3      Residential Sector          1.0
3      Commercial Sector           1.0
3      Industrial Sector           1.0
3      Transportation              1.0
3      Risks                       1.0
3      Solutions                   1.0
3      Renewable                   1.0
3      Non Renewable               1.0
3      Nuclear                     1.0
3      Fossil Fuel                 1.0

Table 1: Ontology Weight Table

Let P be a Web page and let Relevance_P = 0 be its initial relevance score. Limit is a numerical value that we set for checking the relevancy of a Web page; different results are obtained by crawling the same Website against different limits.

1. Read the first term T from our ontology (from Table 1), that is, "Energy", and assign it the weight W given in the table.
2. Calculate how many times the term T and its synonyms occur in the Web page P. Let this number of occurrences be Frequency.
3. Multiply the number of occurrences calculated at step 2 by the weight W and call the result Score: Score = Frequency * W.
4. Add this score to Relevance_P, so the new value is Relevance_P = Relevance_P + Score.
5. Select the next term and weight from the weight table and go to step 2, until all the terms in the weight table have been visited.
6. If Relevance_P < Limit, the Web page P is discarded; otherwise the page is downloaded.
7. End.
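A minimal sketch of this relevancy calculation in Python follows. The synonym map is a hypothetical stand-in (the paper does not specify its synonym source), and pages are treated as plain lowercase text.

    import re

    # Weights from Table 1 (level 1: 0.1, level 2: 0.6, level 3: 1.0).
    WEIGHTS = {
        "energy": 0.1,
        "country": 0.6, "market": 0.6, "environmental consequences": 0.6,
        "infrastructures": 0.6, "energy security": 0.6, "energy sources": 0.6,
        "residential sector": 1.0, "commercial sector": 1.0,
        "industrial sector": 1.0, "transportation": 1.0, "risks": 1.0,
        "solutions": 1.0, "renewable": 1.0, "non renewable": 1.0,
        "nuclear": 1.0, "fossil fuel": 1.0,
    }

    def relevance(page_text, limit, synonyms=None):
        """Return the page's relevance score, or None if it falls below limit."""
        text = page_text.lower()
        synonyms = synonyms or {}      # hypothetical term -> [synonyms] map
        relevance_p = 0.0
        for term, weight in WEIGHTS.items():
            variants = [term] + synonyms.get(term, [])
            # Frequency: occurrences of the term and its synonyms in the page.
            frequency = sum(len(re.findall(r"\b" + re.escape(v) + r"\b", text))
                            for v in variants)
            relevance_p += frequency * weight      # Score = Frequency * W
        return relevance_p if relevance_p >= limit else None

For example, relevance(page_text, limit=7) returns the score for a page that would be downloaded and None for a page that would be discarded.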

IV ARCHITECTURE OF CRAWLER

We have developed a simple crawler that uses a breadth-first algorithm. The implementation of our simple crawler is as follows (a sketch of the crawling loop in code is given at the end of this section):

1. Get the seed URL.
2. If the Web page is valid, that is, of a defined type (html, php, jsp, etc.), add it to the queue.
3. Parse the content.
4. Get the response from the server; if it is OK, add the Web page to the index and cache the file to a folder. With the help of the cache and index, searching can be done.

Fig. 2: Architecture of Simple Crawler (fetchers, HTML parser, lists, indexers, indexes, and searchers over the Web database)

In the case of the ontology-based Web crawler, after parsing, the parsed content is matched against the ontology; if the page is relevant it is indexed, otherwise it is not considered. So the first three steps are the same as in the simple crawler. The next steps are as follows:

5. Get the response from the server; if it is OK, read the Protégé file of the ontology and match the content of the Web page against the terms of the ontology.
6. Compute the relevance score of the Web page as described in the algorithm and, if it is relevant, add the Web page to the index and cache the file to a folder.
7. With the help of the cache and index, searching can be done.
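To make the crawling loop concrete, the following is a minimal sketch of a breadth-first crawler using only the Python standard library. It is our own illustrative simplification, not the authors' implementation: the VALID_TYPES check, the in-memory index dict, and the max_pages cap are assumptions standing in for the fetcher, parser, list, and indexer components of Fig. 2.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects href targets of <a> tags (the HTML parser component)."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    VALID_TYPES = (".html", ".php", ".jsp", "/")   # simplified validity check

    def crawl(seed_url, max_pages=100):
        queue, seen, index = deque([seed_url]), {seed_url}, {}
        while queue and len(index) < max_pages:
            url = queue.popleft()                  # breadth-first: FIFO queue
            try:
                with urlopen(url, timeout=10) as resp:
                    if resp.status != 200:         # step 4: response must be OK
                        continue
                    page = resp.read().decode("utf-8", errors="ignore")
            except OSError:
                continue
            index[url] = page                      # index and cache the page
            parser = LinkParser()
            parser.feed(page)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen and absolute.endswith(VALID_TYPES):
                    seen.add(absolute)
                    queue.append(absolute)         # step 2: enqueue valid pages
        return index

For the ontology-based variant, the relevance computation of Section III would be applied to each fetched page, and pages whose score falls below the limit would be skipped before indexing.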

Fig. 3: Architecture of Ontology Based Web Crawler (fetchers, HTML parser, and Web database as in the simple crawler, with the parsed content passed to a relevance computation against the ontology; a page reaches the indexers, indexes, and searchers only if its score is greater than the limit)

V EXPERIMENTATION & EVALUATION

Harvest rate [8] is a common measure of how well a focused crawler performs. The harvest rate is hr = r/p, where r is the number of relevant documents retrieved by the crawler and p is the total number of documents retrieved by the crawler. Recall is used to compare the ontology crawler at different relevancy limits. Recall is rc = r/t, where r is the number of relevant documents retrieved by the crawler and t is the number of documents in the database. We have analyzed, on the basis of harvest rate and recall, the first 100 Web pages retrieved by each crawler on the following three URLs:
1. http://www.eere.energy.gov
2. http://www.nrel.gov
3. http://www.eeca.govt.nz

The ontology crawler is analyzed with three relevancy limits: 5, 7, and 9. Fig. 4 shows the comparison of the harvest rates of the different crawlers over the first 100 pages of the three URLs above. Fig. 5 shows the comparison of recall at the different relevancy limits. The number of documents in the database is taken as the number of relevant documents retrieved by the simple crawler.

Fig. 4: Comparison of Harvest Rate of different Crawlers (y-axis: harvest rate, 0 to 0.6; bars for the simple crawler and the ontology crawler at relevancy limits 5, 7, and 9)
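As a small numerical illustration of the two measures defined above (the numbers are invented for illustration and are not taken from our experiments):

    def harvest_rate(r, p):
        # hr = r / p: relevant documents retrieved over all documents retrieved.
        return r / p

    def recall(r, t):
        # rc = r / t: relevant documents retrieved over documents in the database.
        return r / t

    # If a crawler retrieves p = 100 pages of which r = 45 are relevant, and the
    # simple crawler's 90 relevant documents define the database:
    print(harvest_rate(45, 100))   # 0.45
    print(recall(45, 90))          # 0.5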

Fig. 5: Comparison of Recall of different crawlers (y-axis: recall, 0 to 1; bars for the ontology crawler at relevancy limits 5, 7, and 9)

The harvest rate with relevancy limit 9 is the highest, but its recall is too low. Therefore, relevancy limit 7 is the most suitable for our crawler. Comparing the ontology crawler with relevancy limit 7 against the simple crawler on the first 300 Web pages, we find that the harvest rate of our ontology crawler is much higher than that of the simple crawler, as shown in Fig. 6.

Fig. 6: Graph of harvest rate (0 to 0.6) for the ontology crawler and the simple crawler at 100, 200, and 300 pages crawled

VI CONCLUSION & FUTURE SCOPE

The main aim of our paper is to retrieve relevant Web pages and discard the irrelevant ones. We have developed an ontology-based crawler that retrieves Web pages according to a relevancy calculation algorithm and discards irrelevant Web pages. The harvest rate of our crawler is much higher than that of the simple crawler. There are certain issues in our crawler that can be addressed. In our relevancy calculation algorithm, we have to set the weights of the ontology terms manually. A mechanism can be devised such that, after reading the ontology and visiting a certain number of Web pages, the crawler assigns the weights of the ontology terms automatically. The processing time of the Web crawler can also be improved. In our algorithm the ontology remains static; the ontology could be evolved dynamically by adding new concepts and relations while visiting Web pages.

References
[1] T. R. Gruber, "Towards principles for the design of ontologies used for knowledge sharing," International Journal of Human-Computer Studies, 1995.
[2] A. Johannes Pretorius, "Ontologies: Introduction and Overview," Vrije Universiteit Brussel, 2004.
[3] Matthew Horridge, Holger Knublauch, Alan Rector, Robert Stevens, Chris Wroe, "A Practical Guide To Building OWL Ontologies Using The Protégé-OWL Plugin and CO-ODE Tools, Edition 1.0," The University of Manchester, 2004.
[4] Chang Su, Yang Gao, Jianmei Yang, Bin Luo, "An Efficient Adaptive Focused Crawler Based on Ontology Learning," Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, IEEE, 2005.
[5] Debajyoti Mukhopadhyay, Arup Biswas, Sukanta Sinha, "A New Approach to Design Domain Specific Ontology Based Web Crawler," 10th International Conference on Information Technology, IEEE, 2007.
[6] S. Ganesh, M. Jayaraj, G. Aghila, "Ontology Based Web Crawler," International Conference on Information Technology: Coding and Computing (ITCC 2004), vol. 2, pp. 337-341, IEEE, 2004.
[7] Rosella, "An information guided spidering: A domain specific case study," 2007.
[8] Gautam Pant, Padmini Srinivasan, and Filippo Menczer, "Crawling the Web," Indiana University, 2004.