A New Approach to Design a Domain Specific Web Search Crawler Using Multilevel Domain Classifier

Sukanta Sinha 1,4, Rana Dattagupta 2, and Debajyoti Mukhopadhyay 3,4

1 Tata Consultancy Services Ltd., Victoria Park Building, Salt Lake, Kolkata 700091, India, [email protected]
2 Computer Science Dept., Jadavpur University, Kolkata 700032, India, [email protected]
3 Information Technology Dept., Maharashtra Institute of Technology, Pune 411038, India, [email protected]
4 WIDiCoReL Research Lab, Green Tower, C-9/1, Golf Green, Kolkata 700095, India

Abstract. Nowadays, publishing information on the Internet has become commonplace, and as a result the volume of available information has become huge. To handle this volume, Web researchers have introduced various types of search engines. Efficient Web-page crawling and resource-repository building mechanisms are an important part of a search engine, and various Web search crawler mechanisms have already been introduced for various search engines. In this paper, we introduce a new design and development mechanism for a domain-specific Web search crawler which uses multilevel domain classifiers, crawls Web-pages related to multiple domains, uses parallel crawling, etc. Two domain classifiers are used to identify domain-specific Web-pages, one after the other, i.e., at two levels; that is why we call this Web search crawler a multilevel domain-specific Web search crawler.

Keywords: Domain specific search, Multilevel classifier, Ontology, Ontology based search, Relevance value, Search engine.

1 Introduction

Keyword searching is a very popular mechanism for finding information on the Internet [1-2]. However, the Internet has become an ocean of various types of information, and finding a relevant Web-page in this huge reservoir for a user-given search query is no trivial task. To overcome this situation, Web researchers have introduced various types of search engines. The Web-page crawling mechanism plays a big role in producing an efficient Web-page repository, which in turn leads to better search results for a user-given search query. Various types of Web-page crawling mechanisms have already been introduced by Web researchers, among them the focused crawler [3-5], domain-specific crawler [6], multi domain-specific crawler [7], hierarchical crawler [8], and parallel crawler [9-12].


In our approach, we introduce a new mechanism for constructing a Web search crawler which follows a parallel crawling approach and supports multiple domains. To construct our prototype we use two classifiers: a Web-page content classifier and a Web-page Uniform Resource Locator (URL) classifier. Based on these two classifiers we customize our crawler inputs and create a meta-domain, i.e., a domain about domains. The Web-page content classifier identifies relevant and irrelevant Web-pages, i.e., domain-specific Web-pages for domains like Cricket, Football, Hockey, Computer Science, etc., and the URL classifier classifies URL extension domains like .com, .edu, .net, .in, etc.

The paper is organized in the following way. In Section 2, related work on domain extraction as well as parallel crawling is discussed. The proposed architecture for a domain-specific Web search crawler using a multilevel domain classifier is given in Section 3; all the components of our architecture are also discussed in that section. Experimental analyses and the conclusion of our paper are given in Sections 4 and 5 respectively.

2 Related Works

To find a geographical location on the globe, we usually follow a geographical map. In the same way, to find a Web-page on the World Wide Web (WWW), we usually use a Web search engine. Web crawler design is an important job for collecting Web search engine resources from the WWW [6, 8, 13-14]; a better resource repository leads to better performance of the Web search engine. In this section, we describe a few related works, after the following definitions.

Definition 2.1: Ontology – A set of domain-related key information, kept in an organized way based on its importance.

Definition 2.2: Relevance Value – A numeric value for each Web-page, generated on the basis of the term weight values, term synonyms, and the number of occurrences of ontology terms existing in that Web-page.

Definition 2.3: Seed URL – A set of base URLs from which the crawler starts to crawl down Web-pages from the Internet.

Definition 2.4: Weight Table – A table with two columns: the first column holds ontology terms and the second column holds the weight value of each ontology term. An ontology term weight value lies between '0' and '1'.

Definition 2.5: Syntable – A table with two columns: the first column holds ontology terms and the second column holds the synonyms of each ontology term. If more than one synonym exists for a particular ontology term, they are kept comma (,) separated.

Definition 2.6: Relevance Limit – A predefined static relevance cut-off value used to recognize whether a Web-page is domain-specific or not.
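To make these definitions concrete, the fragment below sketches what a weight table, syntable, and relevance limit might look like for a 'computer science' domain. This is our illustrative reading, not the paper's implementation: the weights match the worked example in Section 2.3, while the synonyms and the cut-off value are hypothetical.

```python
# Hypothetical excerpt of a 'computer science' ontology's support tables.

# Weight table (Definition 2.4): ontology term -> weight in [0, 1].
weight_table = {
    "student": 0.4,
    "lecturer": 0.8,
    "associate professor": 1.0,
}

# Syntable (Definition 2.5): ontology term -> comma-separated synonyms
# (the synonyms here are invented for illustration).
syntable = {
    "student": "pupil,learner",
    "lecturer": "instructor",
}

# Relevance limit (Definition 2.6): a static cut-off; the value 3.0 is
# an assumption, since the paper leaves it to experimental tuning.
relevance_limit = 3.0
```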

2.1 Domain Extraction Based on URL Extension

Finding domains based on the URL extension is a faster approach, but the URL extension does not always identify a perfectly domain-specific Web-page; in addition, we cannot tell the content of a Web-page from its URL. One of the most practical examples is that of a digital library, where many universities publish book lists with links to online bookstores like www.amazon.com. According to its URL extension, such a Web-page belongs to the commercial (.com) domain, but the URL is very popular in the educational (.edu) domain. To overcome this type of situation, we need to consider the content of the Web-page.

2.2 Domain Specific Parallel Crawling

In a parallel crawling mechanism, multiple Web-pages are crawled and downloaded at a time, because multiple crawlers run simultaneously. Hence, it is a quick Web-page download approach. Using the parallel crawling mechanism we can download Web-pages faster, but we cannot tell whether the downloaded Web-pages belong to our domains or not.
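As an illustration of the idea (not the authors' implementation), the sketch below fetches several Web-pages in parallel with a thread pool; the worker count, timeout, and seed URLs are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def download(url):
    """Fetch one Web-page; errors are swallowed so one bad URL
    does not stop the other parallel crawlers."""
    try:
        with urlopen(url, timeout=10) as response:
            return url, response.read().decode("utf-8", errors="ignore")
    except OSError:
        return url, None

seed_urls = ["http://example.edu", "http://example.com"]  # illustrative seeds

# Several fetches run simultaneously, so pages download in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = dict(pool.map(download, seed_urls))
```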

2.3 Domain Extraction Based on Web-Page Content

Finding domains based on the Web-page content is a strong approach, but it is time-consuming when no parallel crawling mechanism is applied while downloading the Web-pages. To find the domain from Web-page content, the content is first parsed and all ontology terms as well as syntable terms are extracted [15-18]. Each distinct ontology term's count is then multiplied by the respective ontology term weight value, taken from the weight table; for a syntable term, the weight value of the corresponding ontology term is used. Finally, the summation of these individual term weightages is taken, and this value is called the relevance value of the Web-page. If this relevance value is greater than the predefined relevance limit of a domain, then the Web-page belongs to that domain; otherwise the Web-page is discarded, i.e., it does not belong to our domain. In Fig. 1 we show the mechanism for finding a domain based on Web-page content. Here, we use the 'computer science' ontology, with the syntable and weight table of the computer science ontology, to decide whether a Web-page belongs to the computer science domain. Suppose the considered Web-page contains the term 'student' 3 times, 'lecturer' 2 times, and 'associate professor' 2 times, and the weight values of student, lecturer, and associate professor in the computer science domain are 0.4, 0.8, and 1.0 respectively. Then the relevance value becomes (3*0.4 + 2*0.8 + 2*1.0) = 4.8. Now, if 4.8 is greater than the relevance limit, we say the considered Web-page belongs to the computer science domain; otherwise we discard the Web-page.
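A minimal sketch of this relevance-value computation, reproducing the worked example above; the helper name and its synonym-lookup argument are our own, not the paper's code.

```python
def relevance_value(term_counts, weight_table, syn_to_term):
    """Sum of (occurrence count * ontology term weight) over the page's
    terms; a syntable term contributes its ontology term's weight."""
    value = 0.0
    for term, count in term_counts.items():
        term = syn_to_term.get(term, term)   # map synonym -> ontology term
        value += count * weight_table.get(term, 0.0)
    return value

# Worked example from the text: 3*0.4 + 2*0.8 + 2*1.0 = 4.8
counts = {"student": 3, "lecturer": 2, "associate professor": 2}
weights = {"student": 0.4, "lecturer": 0.8, "associate professor": 1.0}
print(round(relevance_value(counts, weights, {}), 2))  # -> 4.8
```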


Fig. 1. Web-page Relevance Calculation Mechanism

3 Proposed Approach

In our approach, we have built a new Web search crawler model which supports a parallel crawling mechanism and identifies the proper domain by using a Web-page content classifier and a Web-page URL classifier. In Section 3.1 we give the basics of an ontology. In Sections 3.2 and 3.3 we describe the Web-page content classifier and the Web-page URL classifier respectively. In Section 3.4 we explain our user interface, and Section 3.5 depicts the construction mechanism of our prototype. Finally, in Section 3.6, we give the Web-page retrieval mechanism of our prototype based on user-given inputs.

3.1 Introduction to Ontology

An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain. Ontologies are used in artificial intelligence, the Semantic Web, software engineering, biomedical informatics, library science, and information architecture as a form of knowledge representation about the world or some part of it. An ontology is a formal description of concepts and the relationships between them; definitions associate the names of entities in the ontology with human-readable text that describes what the names mean. Each domain can be represented by an ontology, and each ontology contains a set of key information of that domain, formally called ontology terms. We have assigned a weight to each ontology term. The strategy for assigning weights is that a more domain-specific term carries more weight, while terms common to more than one domain carry less weight. The ontology term weight value lies between '0' and '1'.


3.2 Classifier 1: Web-Page Content Classifier

The Web-page content classifier classifies a Web-page's domain with respect to its content (Fig. 2(a)). The domains are cricket, computer science, football, etc., classified according to their predefined domain ontology, weight table, and syntable. The ontology contains the key terms of a particular domain in an organized way, the weight table contains the weight value of each ontology term, and the syntable contains the synonyms of each ontology term. When a Web-page's content is received, we parse it, extract the ontology terms as well as the synonyms of each ontology term, and get a distinct count for each. We obtain each ontology term's relevance value by multiplying its distinct count by the respective ontology term weight value, and then take the summation of those term relevance values, which is formally called the Web-page relevance value. If the relevance value of the Web-page is larger than the predefined Web-page relevance limit, we consider that the Web-page belongs to that domain.
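Assuming one (weight table, synonym map, relevance limit) triple per candidate domain, the first-level decision could be sketched as follows, reusing the hypothetical relevance_value helper from Section 2.3:

```python
def classify_content(term_counts, domain_tables):
    """Return the first domain whose Web-page relevance value exceeds
    its predefined relevance limit, or None for an irrelevant page.
    domain_tables: {domain: (weight_table, syn_to_term, relevance_limit)}"""
    for domain, (weights, syn_to_term, limit) in domain_tables.items():
        if relevance_value(term_counts, weights, syn_to_term) > limit:
            return domain
    return None  # discard: the page belongs to none of our domains
```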

Fig. 2. (a) Web-page Content Classifier (b) Web-page URL Classifier

3.3 Classifier 2: Web-Page URL Classifier

The Web-page URL classifier classifies Web-page URL domains like .com, .edu, .in, etc. (refer Fig. 2(b)). The Web crawler crawls down Web-pages, and we extract all the hyperlink URLs from the already-crawled Web-page content by doing a lexical analysis on keywords like 'href'; those URLs are then sent to the Web-page URL classifier, which parses all the extracted Web-page URLs and classifies them according to their URL extension domain.
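The sketch below is one hedged reading of this second-level step: a regular expression stands in for the 'href' lexical analysis, and URLs are routed by matching the host against a fixed set of extension domains.

```python
import re
from urllib.parse import urlparse

EXTENSION_DOMAINS = (".com", ".edu", ".net", ".in")  # considered URL domains

def extract_links(page_content):
    """Lexical 'href' analysis: pull hyperlink URLs out of raw HTML."""
    return re.findall(r'href=["\'](http[^"\']+)["\']', page_content)

def classify_url(url):
    """Return the URL-extension domain, or None to discard the URL."""
    host = urlparse(url).netloc
    for ext in EXTENSION_DOMAINS:
        if host.endswith(ext):
            return ext  # route to the crawler for this extension
    return None
```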

3.4 User Interface

In our proposed search engine, we let Web searchers customize their search results by selecting classifier-1 and classifier-2 inputs. We use radio buttons for classifier 1, i.e., at a time the Web searcher can select at most one domain for the Web-content classifier (refer Fig. 3), and check boxes for classifier 2, i.e., the Web searcher can select more than one Web-page URL extension domain. To get the best search results from our proposed search prototype, Web searchers need some basic knowledge about the classifier-2 inputs with respect to the selected classifier-1 input. Suppose a Web searcher has selected the 'Computer Science' domain as the classifier-1 input; then the classifier-2 inputs should be .edu, .in, or .net, under the assumption that .com is a commercial domain and no Web-page belonging to the 'Computer Science' domain exists there. After providing the required inputs, i.e., the search string and the classifier-1 and classifier-2 inputs, the Web searcher has to click on the "Go" button to get the search results. In addition, if a Web searcher does not have basic knowledge of the classifier-2 inputs and selects all the options, our prototype still produces the search results, but it takes a few extra seconds due to traversing a larger number of schema data (refer Fig. 4).

Fig. 3. A Part of User Interface

3.5 Proposed Algorithm

The proposed algorithm briefly describes the construction of the multilevel domain-specific Web search crawler. We have divided our algorithm into modules: Module 1 gives the Web-page URL classifier and Module 2 describes the Web-page content classifier. Module 1 is invoked inside Module 2, and Module 2 is invoked by the main domain-specific Web search crawler method.

Module1: Web-pageURLClassifier(Web-page URL List)
1. begin
2. while (Web-page URL List is not empty) do steps 3-5
3.   extract URL extension;
4.   find URL extension domain;
5.   if (Web-page URL extension belongs to a different domain) discard(URL);
     else pass URL to the respective crawler input;
6. end;

Module2: Web-pageContentClassifier(Web-page)
1. begin
2. parse Web-page content;
3. calculate Web-page relevance value;
4. if (Web-page belongs to a different domain) discard(Web-page);
   else store Web-page in the respective domain repository;
        extract URLs from Web-page content;
        call Web-pageURLClassifier(Web-page URL List);
5. end;

DomainSpecificWebSearchCrawler()
1. begin
2. extract a URL from the seed URL queue;
3. download the Web-page;
4. call Web-pageContentClassifier(Web-page);
5. end;
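Put together, a single crawler instance from the pseudocode above might look like the following sketch. It reuses the hypothetical helpers from the earlier sketches, and count_ontology_terms (the page parser that produces term counts) is assumed rather than shown.

```python
from collections import deque

def crawl(seed_urls, domain_tables, repository, url_queues, my_extension):
    """One crawler instance: download a page, classify its content
    (level 1), then classify its extracted URLs (level 2)."""
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        _, page = download(url)              # fetch helper from Sec. 2.2 sketch
        if page is None:
            continue
        term_counts = count_ontology_terms(page)   # assumed parser, not shown
        domain = classify_content(term_counts, domain_tables)
        if domain is None:
            continue                         # discard irrelevant Web-page
        repository.setdefault(my_extension, {}) \
                  .setdefault(domain, []).append({"url": url, "text": page})
        for link in extract_links(page):     # Module1: URL classification
            ext = classify_url(link)
            if ext == my_extension:
                queue.append(link)           # this crawler's own input
            elif ext is not None:
                url_queues[ext].append(link) # feed the other parallel crawlers
```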

A pictorial diagram of the domain-specific Web search engine resource collector is shown in Fig. 4. In our approach, we divide our data repository into multiple schemas based on the number of URL extension domains considered, and to collect resources for each schema we follow the parallel crawling mechanism. For example, Fig. 4 shows .net, .edu, and .com crawlers, which expect .net, .edu, and .com seed URLs respectively. Each crawler runs individually, and all are connected to the WWW. Initially, based on its first seed URL, every crawler downloads the Web-page content and sends it to the first-level classifier, i.e., the Web-page content classifier, which classifies the Web-page's domain and stores the page in the respective domain section, i.e., cricket, football, hockey, etc. At the second level, we extract all the hyperlinks existing in the crawled Web-page content by doing a lexical analysis on keywords like 'href' and send all the links to classifier 2, i.e., the Web-page URL classifier. After classification, each hyperlink is sent to its respective crawler input. In this way, classifier 1 identifies the Web-page domain and classifier 2 continuously supplies the parallel crawlers with inputs.


Fig. 4. Proposed architecture of Domain Specific Web Search Engine Resource Collector

3.6 Web-Page Retrieval Mechanism Based on the User Input

Web-page retrieval from the search engine's resources plays an important role in a Web search engine. To retrieve Web-pages from our Web-page repository, we need to find the schema and domain based on the user-given classifier-1 and classifier-2 inputs (refer Fig. 3). As discussed in Section 3.4, at a time the user can select only one classifier-1 input and multiple classifier-2 inputs: the classifier-1 input gives the user-selected domain, and the classifier-2 inputs give the schemas from which we fetch the Web-pages. After identifying the domain and schemas, we take the search string and parse it to find the ontology terms. Based on those ontology terms, we perform a Web-page search operation on the resources identified by the classifier-1 and classifier-2 inputs.
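As a hedged sketch of this retrieval step, assume the repository is keyed first by URL-extension schema and then by domain (as in the crawl sketch above), and that query parsing is a plain ontology-term match; the structure and names are illustrative, not the paper's implementation.

```python
def search(query, domain, extensions, repository, weight_table):
    """Return pages of the user-selected domain (classifier-1 input),
    looking only in the schemas selected via classifier-2 inputs."""
    query_terms = [t for t in query.lower().split() if t in weight_table]
    results = []
    for ext in extensions:                       # e.g. [".edu", ".in"]
        for page in repository.get(ext, {}).get(domain, []):
            if any(term in page["text"] for term in query_terms):
                results.append(page["url"])
    return results
```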

4 Experimental Analyses

In this section, we present some experimental studies and discuss how to set up our system. Section 4.1 explains our experimental procedure, Section 4.2 gives an overview of our prototype's time complexity, and Section 4.3 shows the experimental results of our system.

4.1 Experiment Procedure

The performance of our system depends on various parameters, which need to be set up before running the system. The considered parameters are the domain relevance limit, the weight value assignments, the ontology terms, etc. These parameters are assigned by tuning our system through experiments. We assigned 20 seed URLs to each crawler input to start the initial crawling.

Scope Filter. To ensure that our crawler downloads only files with textual content, not irrelevant files like images and video, we added a filter when performing the tests.

Harvest Rate. Harvest rate [3] is a common measure of how well a focused crawler performs. It is expressed as HR = r/t, where HR is the harvest rate, 'r' is the number of relevant pages found, and 't' is the number of pages downloaded.

4.2 Complexity Analysis

A few assumptions are made to calculate the time complexity of our system:

a) We deal with 'n' terms, which include both ontology terms and their synonyms.
b) 'd' is the time taken to download a Web-page, because Internet speed is a big factor in downloading a Web-page.
c) We deal with 'm' URL extension domains.
d) On average, we assume we receive 'p' hyperlink URLs in a Web-page's content.
e) Constant time complexities are denoted by Ci, where 'i' is a positive integer.

Our prototype's time complexity analysis is given in Fig. 5.

Fig. 5. Line by line complexity analysis

From the analysis we find that for a single crawler the complexity becomes [2O(n) + O(m*p) + d]. Now, we have found 'm*p' always
