IEEE International Conference on Advances in Engineering & Technology Research (ICAETR - 2014), August 01-02, 2014, Dr. Virendra Swarup Group of Institutions, Unnao, India

Virtualized Dynamic URL Assignment Web Crawling Model

Wani Rohit Bhaginath
Department of CE and IT, VJTI, Matunga, Mumbai
rohit.b.wani@gmail.com

Sandip Shingade
Department of CE and IT, VJTI, Matunga, Mumbai
stshingade@vjti.org.in

Mahesh Shirole
Department of CE and IT, VJTI, Matunga, Mumbai
mrshirole@vjti.org.in

Abstract- Web search engines are software systems that help to retrieve information from the net by accepting input in the form of a query and providing the results as files, pages, images or information. These search engines rely heavily on web crawlers that interact with millions of web pages given a seed URL or a list of seed URLs. However, these crawlers demand a large amount of computing resources, and the efficiency of web search engines depends upon the performance of the crawling processes. Despite the continuous improvement in crawling processes, there is still a need for more efficient and low-cost crawlers. Most of the crawlers existing today have a centralized coordinator, which brings the disadvantage of single point failure. Taking into consideration the shortfalls of the existing crawlers, this paper proposes an architecture of a distributed web crawler. The architecture addresses two issues of the existing web crawlers: the first is to create a low-cost web crawler using the concept of virtualization of cloud computing; the second is a balanced load distribution based on dynamic assignment of the URLs. The first issue is addressed using multi-core machines, where each multi-core processor is divided into a number of virtual machines (VMs) that can perform different crawling tasks in parallel. The second issue is addressed using a clustering algorithm that assigns requests to the machines as per the availability of the clusters, thereby realizing balance among components according to their real-time condition. This paper discusses the distributed architecture and the details of the implementation of the proposed algorithm.

Keywords- Crawler, Dynamic assignment, Seeds, Virtualization, Clustering algorithm, K-means clustering.

I. INTRODUCTION

With the improvement of social information and the rapid development of the Internet, the World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present, and it is still growing [12]. The Web search engine, the basic information acquisition tool, is becoming increasingly important because of this explosion in size and the increasing demand of users for finding information [1][2][4]. Web search engines are information retrieval software systems that help in finding the information stored on the internet by taking input query words and retrieving the information based on matching criteria. Some search engines mine data available in databases or open directories. The search results are generally presented as search engine results pages [3]. The information may be in the form of web pages, images or other types of files. There are many search engines available in the market, like Google, Yahoo!, Bing, MSN Search, etc. Web crawlers are the vital component of any search engine. The Web crawler is an Internet bot that systematically browses the web after taking the URLs as the seed, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks [6]. A Web crawler may also be called a Web spider, an ant, an automatic indexer or a Web scutter. Web crawlers demand a large amount of computing resources for crawling data from the web, and the data are very dynamic in nature in terms of both size and modifications [4].

The Web crawler consumes the maximum time in the searching process. Hence the crawling process should be performed continuously to maintain the highest freshness of its search outcomes [6]. A crawler for a large search engine has to address two issues. First, it should have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs to have a highly optimized system architecture, i.e., one that is robust against crashes, manageable, and considerate of resources and web servers [1]. The performance of the crawling process has been improved by using a parallel web crawler instead of a batch crawler. However, the existing parallel crawler has a single point coordinator with high chances of data redundancy, leading to crawling of the same URLs multiple times, which affects the performance [3]. In order to prevent parallel web crawlers from grabbing overlapping, redundant data, a framework that divides the entire web into several parts and distributes each part to one web crawler was proposed and implemented in Mercator [5][7][8]. However, distributing the web statically may lead to load unbalancing, since some crawlers might remain idle, leading to underutilization of the resources. Recently, a remarkable gain in computing resource performance has been achieved due to the introduction of parallelism in processor architectures, in the form of multi-core processors, pipelining, multitasking, and multithreading. The cores can operate in parallel and thus programs run faster than on a traditional single-core processor [2][4]. This paper proposes the use of virtualization technology to improve the computing power of multi-core processors, thereby enhancing the performance of the crawling process.


Moreover, this approach takes care of the load imbalance issues by using dynamic assignment of the URLs within a distributed framework, thereby eliminating the single point of failure in search engines. This paper presents a model of the processor-farm methodology with dynamic assignment of URLs to the crawler. The paper is organized as follows: related work is explained in Section II; Section III describes the proposed architecture; Section IV describes the benefits of the system; Section V presents experiments and results; and Section VI concludes the paper.

II. RELATED WORK

A. Web crawler

The web search engine has three major components: (i) Web crawler, (ii) document analyzer and indexer, and (iii) search processor. A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier [1][3][4]. URLs from the frontier are recursively visited according to a set of policies. Fig. 1 outlines the crawling process of a basic crawler [6].

Fig. 1. Crawling process of a basic crawler.

The crawler consists of multiple processes running on different machines connected by a high-speed network. Each of these crawling processes consists of multiple worker threads, and each worker thread performs repeated work cycles. The crawler initially obtains the seed URL (the first time) or takes a URL from the frontier, and then gives the URL to the DNS Resolver to obtain the IP address of the server corresponding to the given URL. The frontier is a data structure that contains the list of the URLs to be crawled. The Fetcher is an HTTP library that connects to the corresponding web server and tries to download web pages. These pages are passed to the parser, which filters out the contents of the page to remove the HTML tags and other text, thereby extracting the URLs to which the page points. Next, the new URLs are checked against the database of blacklisted URLs for their validity, and the URL that has just been visited is marked as fetched in the URL database. When a URL passes through the URL Filter, an appropriate page ranking algorithm is applied to rank and index the URLs. One of the techniques used is to find the URL with the maximum number of other URLs pointing to it and rank on the basis of the obtained numbers. The indexed URLs are then stored in the Frontier, and the Frontier provides the un-fetched URLs to the DNS resolver [4][1][6].
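The work cycle just described can be summarized in code. The following is a minimal, single-threaded sketch of that loop (frontier, fetch, parse, blacklist filter) using only Python's standard library; the seed URL, the blacklisted file extensions, and the page limit are illustrative assumptions, not values prescribed by the paper.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

BLACKLISTED_EXTENSIONS = (".pdf", ".jpeg", ".xml", ".xsl", ".zip")  # illustrative

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags, discarding tags and text."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)          # the crawl frontier: URLs still to visit
    fetched = set()                      # URLs already marked as fetched
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        if url in fetched or url.lower().endswith(BLACKLISTED_EXTENSIONS):
            continue                     # skip duplicates and blacklisted file types
        try:
            with urlopen(url, timeout=10) as response:   # the "Fetcher"
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                     # DNS failure, unreachable server, etc.
        fetched.add(url)                 # mark as fetched in the URL database
        parser = LinkExtractor()         # the "Parser": drops tags, keeps hyperlinks
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in fetched:
                frontier.append(absolute)
    return fetched

if __name__ == "__main__":
    print(crawl(["http://www.vjti.ac.in/"]))  # seed URL used only as an example
```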

B. Frameworks for Parallel Crawler

A parallel crawler consists of C-procs, which are instances of the same crawling process running independently and simultaneously. Each C-proc is a symbolic representation of a basic crawler: they perform the functions of basic web crawlers, downloading pages from the Web, storing the pages locally, extracting URLs from the downloaded pages, and following the links. Since the processes crawl independently, there is a chance that two processes might crawl the same URL. In order to prevent parallel web crawlers from grabbing overlapping, redundant data, Wu and Lai have listed two general frameworks: the first is a dynamic assignment framework with a central coordinator; the second is a static assignment framework [7].

The dynamic assignment framework adopted by Google has a central node, the URL Server, as the coordinator, and three further machines responsible for grabbing web pages and communicating with the URL Server directly. Because of the central coordinator, the complete database is stored at a central point, which in the case of failure can create a bottleneck at the URL Server; moreover, any failure or crash of the URL Server can bring down the entire system. Thus, single point failure and scalability restrictions are the major drawbacks of this framework [7][8].

On the contrary, the Mercator framework by Najork and Heydon divides the entire web into several parts and assigns each part to a separate crawler. This design is an example of static assignment; there is no central component. However, the assignment algorithm of the static framework is not responsive to the instantaneous state of the system, thereby causing system load imbalance [7].

C. Virtualization Virtualization, refers to the act of creating a virtual (rather than actual) version of something in computing. Virtualization is not limited to a virtual computer hardware platform,

978-1-4799-6393-5/14/$31.00 ©2014 IEEE

IEEEInternational Conference on Advances in Engineering & Technology Research (ICAETR - 2014), August 01-02, 2014, Dr. Virendra Swarup Group ofInstitutions, Unnao, India

operating system (OS), storage device, or computer network

contains the URLs status (URL name, URL status, URL

resources.

metadata). Fig. 2 outlines the virtual machine communication

hidden

The physical characteristics of the resources are

to

simplify

applications,

systems,

using VMCI in the proposed architecture. These virtual

or end users interact with these resources.

the

way

in

which

other

machines act like an independent crawlers and retrieve the

Virtualization is an abstraction used in practice for benefits to

HTML pages.

save cost, reduce footprint, and consolidate systems. The VM creates a virtualization layer to translate the request of the hardware layer, thereby emulating a physical computing environment. It also manages requests for resources like CPU, memory, hard disk, network and other hardware resources [2][4]. Different forms of virtualization have been developed like

Shared memory through VMCI

Database to store URLs

guest OS-based, shared kernel, kernel-level, and hypervisor virtualization. Virtual Machines (VMs) have many advantages over regular method of installation of the operating systems and software. The main advantages of the VMs are they provide isolation between the applications running on different VMs, and there will be no interference of the Virtual machines with

the

machines.

host operating system Thus,

virtual

machine

or other

similar virtual

provides

the

efficient

utilization of hardware resources leading to the decrease in the cost.

where 1:9It9l

Fig. 2.

Proposed Architecture of web crawler using Virtualization.

These virtual machines also act like independent clusters physically, and the URL belonging to the same clusters are

D. Virtual Machine Communication Interface.

then provided to the corresponding virtual machines thereby

The virtual machines communicate through the network

saving some amount of crawling and speeding up of the

layer. The network layer adds overhead to performance of

process. The crawling process consists of different processes.

VMs. However, if the communication between the virtual

The complete flow of the crawling process is explained in the

machines takes place using a high speed infrastructure that

Fig.3. The crawler initially needs to be a set of seed URLs.

provides efficient interface between the machine and the host

The number of the seed URLs can be different for each VM.

operating system, thus, the overhead can be reduced. Such an infrastructure

Interface.

is

called

Virtual Machine Communication

It provides the socket interface similar to the one

that is provided by TCP/UDP with using the VMCI Id

The crawling process starts with injector providing the seed URLs. These seed URLs are then used by clustering for the cluster formation. The clustering process first calculates the hash values of the URL strings using the formula: n

numbers. VMCI provides both connectionless and connection W=

oriented communication. VMCI needs installation of VMware or Vspehere, and changes to the .vmx file to enable the VMCI. The VMCI provides the virtual machines with a shared region

where

where the machines can put and get the data, and also a

the length of the string, and

datagram API to send small messages among the virtual

string.

machines. VMCI minimizes the overhead and also reduces the number of the task required for communication [II]. III. The
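VMCI itself is exposed through VMware's SDK, but on a Linux guest the same style of addressing (a context ID plus a port instead of an IP address) is available through vsock sockets. The sketch below illustrates the datagram-style exchange the paper relies on; the context ID, port number, and message format are assumptions for illustration only, and running it requires a hypervisor/kernel transport that supports vsock datagrams.

```python
import socket

PORT = 5000                # illustrative service port, not taken from the paper
PEER_CID = 3               # illustrative context ID of the receiving VM

def send_url(url: str, peer_cid: int = PEER_CID, port: int = PORT) -> None:
    """Send one URL to another VM as a small datagram, in the spirit of VMCI's datagram API."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_DGRAM) as sock:
        sock.sendto(url.encode("utf-8"), (peer_cid, port))

def receive_urls(port: int = PORT) -> None:
    """Listen for URLs forwarded by other crawler VMs."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_DGRAM) as sock:
        sock.bind((socket.VMADDR_CID_ANY, port))
        while True:
            data, (sender_cid, _) = sock.recvfrom(4096)
            print(f"URL from VM {sender_cid}: {data.decode('utf-8')}")

if __name__ == "__main__":
    send_url("http://www.vjti.ac.in/")   # works only inside a VM with vsock support
```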

III. PROPOSED CRAWLER MODEL

The proposed architecture uses virtualization and dynamic URL assignment to improve the performance of the crawler. This crawler model uses the concept of the processor-farm methodology. In this framework, the multiple cores of a machine are used as virtual machines, and each virtual machine performs crawling, thus giving a distributed framework. These virtual machines interact with each other through a shared region of memory using the VMCI (Virtual Machine Communication Infrastructure), which provides the API they need to communicate with each other at the time of crawling. The shared region holds the database that contains the URL status (URL name, URL status, URL metadata). Fig. 2 outlines the virtual machine communication using VMCI in the proposed architecture. These virtual machines act like independent crawlers and retrieve the HTML pages.

Fig. 2. Proposed architecture of the web crawler using virtualization: the virtual machines share a URL database through VMCI shared memory.

These virtual machines also act like physically independent clusters, and the URLs belonging to the same cluster are provided to the corresponding virtual machine, thereby saving some amount of crawling and speeding up the process. The complete flow of the crawling process is explained in Fig. 3. The crawler initially needs a set of seed URLs, and the number of seed URLs can be different for each VM.

The crawling process starts with the injector providing the seed URLs. These seed URLs are then used by the clustering step for cluster formation. The clustering process first calculates the hash value of each URL string using the formula:

$W = \sum_{i=1}^{n} c_i$

where $W$ represents the weight of the string, $n$ is the length of the string, and $c_i$ (for $1 \le i \le n$) is the ASCII value of the $i$-th character of the URL string.
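A direct transcription of this weight function sums the character (ASCII) values of the URL string. The helper names below are ours, and the mapping from a weight to a cluster index (taking the weight modulo the number of virtual machines) is only one plausible reading of how the threshold-based assignment could be realized; the paper does not fix an implementation.

```python
def url_weight(url: str) -> int:
    """W = sum of c_i for i = 1..n, where c_i is the character (ASCII) value."""
    return sum(ord(ch) for ch in url)

def cluster_of(url: str, num_clusters: int) -> int:
    """Illustrative mapping from a URL's weight to one of the VM clusters."""
    return url_weight(url) % num_clusters

print(url_weight("http://www.vjti.ac.in/"))      # weight of the sample URL
print(cluster_of("http://www.vjti.ac.in/", 4))   # cluster index in 0..3
```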

The hash value is computed over the Uniform Resource Identifier (URI), a string of characters used to identify a name or a resource on the Internet; this identification enables interaction with resource representations in a network using specific protocols. For example, consider the sample URL http://www.vjti.ac.in/computer/there?name=q1#abc, which decomposes as: http:// → scheme, www.vjti.ac.in → domain, computer/there → path, ?name=q1#abc → query and fragment.
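The same decomposition can be obtained with Python's standard URL parser; this is only to illustrate the scheme/domain/path/query/fragment split used above, not a component of the proposed system.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.vjti.ac.in/computer/there?name=q1#abc")
print(parts.scheme)    # 'http'            -> scheme
print(parts.netloc)    # 'www.vjti.ac.in'  -> domain
print(parts.path)      # '/computer/there' -> path
print(parts.query)     # 'name=q1'         -> query
print(parts.fragment)  # 'abc'             -> fragment
```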

Algorithm Name: rh_crawling
Input: List of seed URLs
Output: Page rank of the URLs

Step 1: Seed URLs are provided to the initial node.
Step 2: Create clusters of the seed URLs using K-means clustering.
Step 3: Assign the URLs to the different nodes based on the clusters (refer to the cluster_info table).
Step 4: On every corresponding node, add the URLs to the queue.
filter: while (queue not empty)
    Dequeue operation.
    Check whether the dequeued URL is in the blacklisted URLs (.pdf, .jpeg, .xml, .xsl, .zip);
    if (present) goto filter else goto Step 5.
end while
Step 5: Calculate the page rank using the formula (refer to the urls_database for the backward link count and the number of outgoing links):

$P(B) = \sum_{K} \frac{P(K)}{N(K)}$

where K is a page that has a forward link to B, P(K) is the page rank of K, and N(K) is the number of forward (outgoing) links from K.
Step 6: if (URL present in the urls_database)
    if (forward count is zero) goto filter
    else fetch the page and make an entry in the url_links table.
Step 7: Parse the page to extract the links and obtain the forward links. Insert the entries into the url_links table.
Step 8: Perform clustering at each node using the K-means clustering algorithm.
Step 9: if (URL belongs to a different cluster) distribute it to the corresponding cluster else goto filter.
Step 10: end.

Fig. 3. Flow diagram of the proposed crawling: seed URLs are injected, clustered, and distributed to VM 1, VM 2, ..., VM n, and each VM adds its results to the database.
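Step 5's formula can be computed from the link relation once forward links are known. The following is a small sketch of one scoring pass under $P(B) = \sum_{K} P(K)/N(K)$; the in-memory link table and the uniform initial scores are assumptions for illustration (the paper keeps these counts in its urls_database and url_links tables).

```python
from collections import defaultdict

def rank_pass(links, scores):
    """One application of P(B) = sum over K of P(K) / N(K), where K ranges over
    pages with a forward link to B and N(K) is the number of outgoing links of K."""
    new_scores = defaultdict(float)
    for source, targets in links.items():
        if not targets:
            continue
        share = scores[source] / len(targets)   # P(K) / N(K)
        for target in targets:
            new_scores[target] += share
    return dict(new_scores)

# Hypothetical forward-link table: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
scores = {page: 1.0 for page in links}          # uniform starting scores
print(rank_pass(links, scores))                  # {'B': 0.5, 'C': 1.5, 'A': 1.0}
```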

The hash value of the domain and scheme is calculated to identify the cluster to which the URL belongs, and the URL is then assigned to a virtual machine depending upon the threshold value. The threshold value is decided as per the availability of the virtual machines. The clustering algorithm used is the K-means clustering algorithm. The URL clusters thus formed are then distributed to the different virtual machines (clusters). The injector is needed only the first time, to provide the seed URLs, at which point only one virtual machine is needed; after that, the distribution unit distributes the URL clusters to the other virtual machines.

Each virtual machine then initially filters all the URLs of its cluster to identify the blacklisted URLs, and applies the page ranking algorithm to rank the URLs. In our proposed approach, the Apache Lucene crawler is used to provide the ranking of the URLs. The most common ranking method is used to order the URLs: the lowest number is allotted to the URL that has the maximum number of URLs pointing to it (a lower number indicates a URL to be crawled earlier). The first URL of the ranked list is then taken and crawled to obtain its content, which is forwarded to the HTML parser to remove all the images, HTML tags and other data, thereby extracting further URLs.
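One way to realize the K-means step on the URL weights is sketched below. Treating each URL's weight as a one-dimensional feature, the choice of cluster count, and the use of scikit-learn are illustrative assumptions on our part; the paper names the algorithm but does not fix an implementation.

```python
from sklearn.cluster import KMeans  # assumes scikit-learn is available

def url_weight(url: str) -> int:
    """Weight of a URL string: the sum of its character (ASCII) values."""
    return sum(ord(ch) for ch in url)

def cluster_urls(urls, num_vms=4):
    """Group URLs into num_vms clusters by K-means on their weights,
    so each cluster can be handed to one virtual machine."""
    features = [[url_weight(u)] for u in urls]        # one 1-D feature per URL
    model = KMeans(n_clusters=num_vms, n_init=10, random_state=0).fit(features)
    clusters = {i: [] for i in range(num_vms)}
    for url, label in zip(urls, model.labels_):
        clusters[int(label)].append(url)
    return clusters

seeds = [
    "http://www.vjti.ac.in/",
    "http://www.vjti.ac.in/computer/",
    "http://example.org/a",
    "http://example.org/b/c",
    "http://example.net/",
]
print(cluster_urls(seeds, num_vms=2))
```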


This entire process occurs in parallel on each of the virtual machines. After that, the resulting URLs are again provided to the clustering phase to identify their new clusters. If a URL belongs to another cluster, it is provided to the corresponding VM using the Virtual Machine Communication Infrastructure. The entire process within the VMs is outlined in Fig. 4; the algorithm for the complete process flow was given above.
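The re-clustering decision at the end of each cycle, namely whether to keep a URL locally or hand it to the VM that owns its cluster, can be expressed as a small routing check. The sketch below is our illustration of that decision; the in-memory cluster table, node IDs, and weight-threshold ranges are hypothetical stand-ins for the paper's cluster_info table.

```python
# Hypothetical cluster_info contents: cluster id -> (low, high) weight range and owning node.
CLUSTER_INFO = {
    0: {"range": (0, 1500),     "node_id": "VM1"},
    1: {"range": (1500, 2200),  "node_id": "VM2"},
    2: {"range": (2200, 10**9), "node_id": "VM3"},
}

def url_weight(url: str) -> int:
    return sum(ord(ch) for ch in url)

def owner_of(url: str) -> str:
    """Find the VM whose threshold range contains this URL's weight."""
    w = url_weight(url)
    for info in CLUSTER_INFO.values():
        low, high = info["range"]
        if low <= w < high:
            return info["node_id"]
    return "VM1"                      # fallback owner for out-of-range weights

def route(url: str, my_node: str):
    """Keep the URL if it belongs to this node's cluster, otherwise mark it for forwarding."""
    owner = owner_of(url)
    return ("keep", url) if owner == my_node else ("forward_to:" + owner, url)

print(route("http://www.vjti.ac.in/computer/there?name=q1#abc", my_node="VM1"))
```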

Fig. 4. Virtual machines working as crawlers.

IV. BENEFITS OF THE SYSTEM

Advantages of the proposed system over the existing ones are as follows:
• In a VM-based Web crawling system multiple machines operate simultaneously, so it performs faster than the same system with no VMs installed, i.e., it reduces crawling CPU time and increases processor utilization.
• Since the URLs of the same server fall under the same cluster, the dynamic assignment shortens the crawling phase if the earlier crawling phase results are cached.
• The proposed architecture removes the crawling of the same URLs by two different independent crawlers, as compared to the earlier system implementations of the distributed framework.
• There is no single coordinator, and hence the single point of failure is removed. Failure of any VM will make its part of the URL space inaccessible for some period, till the machine recovers, but it will not cause the entire system to halt.
• By using the shared memory region for maintaining the URL database, the need to copy the URLs to each crawler is removed.
• The crawler passes on only those URLs that do not fall under its own cluster. Hence the communication between the crawlers consists only of URLs, and the communication overhead is very low.

A. Database Table Schema

Cluster Info table: This table is used by the nodes to decide whether a URL falls in the threshold range of a cluster. The schema of the table is shown below.

cluster_info: | Cluster_no | Threshold_range | Node_id |

Urls database table: This table stores information about each URL: its page rank, the number of links that the URL points to, and the number of URLs that point to the URL.

urls_database: | Url_id | Url | Pg_rnk | Frwrd_count | Backwrd_count |

Url Link table: This table provides information about the relationship between the URLs. It is used to identify the number of links that point to a given link.

url_links: | Forwardlink_id | Backwardlink_id |
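A possible relational rendering of these three tables is given below as a small SQLite sketch. The column names follow the paper's tables, but the SQL types, key constraints, and database file name are our assumptions; the paper lists only the column names.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS cluster_info (
    cluster_no      INTEGER PRIMARY KEY,
    threshold_range TEXT,      -- range of URL weights handled by the owning node
    node_id         INTEGER
);
CREATE TABLE IF NOT EXISTS urls_database (
    url_id        INTEGER PRIMARY KEY,
    url           TEXT,
    pg_rnk        REAL,        -- page rank of the URL
    frwrd_count   INTEGER,     -- number of links this URL points to
    backwrd_count INTEGER      -- number of URLs pointing to this URL
);
CREATE TABLE IF NOT EXISTS url_links (
    forwardlink_id  INTEGER,   -- page containing the link
    backwardlink_id INTEGER    -- page the link points to
);
"""

connection = sqlite3.connect("crawler.db")   # illustrative local database file
connection.executescript(SCHEMA)
connection.close()
```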

V. EXPERIMENTS AND RESULTS

In order to evaluate the performance of the proposed architecture, two types of crawlers are run to crawl the same number and type of documents, and their results are compared against the CPU time. The times required are plotted to obtain the speedup factor achieved by the crawler on the virtual machines. It is well recognized that the document crawling time depends on the document content and on network performance, and that crawling time increases with an increasing number of crawled documents. Therefore, we have estimated the average document crawling time (t_avg). The computer used in this work comprises a single high-speed multi-core processor, an Intel® Core™ i5-2300 CPU @ 2.80 GHz, with 4 GB memory, a 500 GB hard drive, and Ubuntu (an open source Linux-based OS). Crawling times for different pre-specified numbers of Web documents, ranging from 75 to 600 in steps of 75, are estimated.


The crawling process starts with the single URL of vjti.ac.in. For each fetched URL, a number of Web documents are retrieved. Since each Web document contains a number of URLs, these are extracted, parsed and added to the crawled database, i.e., the crawled database is updated. The program keeps records of both the actual number of crawled documents and the actual number of fetched URLs. When VM-based crawling starts, a single VM is active; after crawling, it collects URLs and provides them for clustering. Subsequently, four clusters are created and three more VMs are made active, giving a total of four VMs. These VMs crawl the URLs of their clusters and then distribute the URLs belonging to the different clusters. The graph of the speedup factor (S_factor) obtained is shown in Fig. 5. The speedup factor S_factor is calculated as:

$S_{factor} = \dfrac{t_{avg}(\text{single-node crawler})}{t_{avg}(\text{VM-based crawler})}$
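Under that definition, the speedup factor is simply the ratio of the two measured averages. A tiny sketch follows, with made-up timing numbers in place of the paper's measurements:

```python
def average_crawl_time(total_seconds: float, documents: int) -> float:
    """t_avg: total crawling time divided by the number of crawled documents."""
    return total_seconds / documents

# Hypothetical measurements for the same 300 documents (not the paper's data).
t_avg_single = average_crawl_time(total_seconds=450.0, documents=300)
t_avg_vms    = average_crawl_time(total_seconds=180.0, documents=300)

s_factor = t_avg_single / t_avg_vms      # speedup of the VM-based crawler
print(f"S_factor = {s_factor:.2f}")      # 2.50 for these made-up numbers
```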

Fig. 5. Comparison of the single node crawler with the 4-VM based crawler.

As shown in the graph of Fig. 5, the speedup of the machine with a single node is compared with that of the core running four virtual machines. Initially there is very little difference between the times taken by the two crawlers; subsequently, the time required by the single node increases linearly as the number of crawled documents increases, whereas the core running the virtual machines requires less time as the number of documents grows. Thus, the time required to crawl a given number of documents is reduced by increasing the number of virtual machines on the nodes, thereby increasing the computing capacity of the node. However, there is a limit on the number of virtual machines that can be added, because beyond a certain value an increase in the number of virtual machines makes no difference, owing to the increase in communication overhead.

Fig. 6. Comparison of the speedup factor with the number of VMs.

Another experiment is performed on the same seed of URLs with two different cores, one running three virtual machines with the crawler and the other running four. The comparison is shown in Fig. 6, which compares the speedup achieved with different numbers of virtual machines. The figure shows that the speedup achieved using three or four virtual machines is almost the same: although increasing the number of virtual machines reduces the computation time of each VM, the communication between the VMs increases the total crawling time.

VI. CONCLUSION

This paper has introduced a distributed controller architecture that eliminates the problem of single point failure in the central controller and adopts a dynamic URL assignment technique. Using the concept of virtualization, the resources and infrastructure required are reduced, and the number of virtual machines can be increased or decreased as needed. Moreover, failure transparency is provided, since the failure of one VM will not hinder the handling of other requests. Thus, high-performance multi-core processors with VMs installed can be used to develop a cost-effective Web crawler that enhances the performance of current Web crawlers. There is a need to empirically evaluate the proposed framework and algorithm on large infrastructure with a large set of seed URLs. The proposed work can further be extended to secure crawling.

REFERENCES

[1] Hussein Al-Bahadili, "Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization," International Journal on Web Service Computing (IJWSC), Vol. 4, No. 1, March 2013.
[2] Pooja Gupta, Kalpana Johari, Second International Conference on Emerging Trends in Engineering and Technology (ICETET-09), 2009.
[3] Akassh A. Mishra, Chinmay Kamat, "Migration of Search Engine Process into the Cloud," International Journal of Computer Applications (0975-8887), Vol. 19, No. 1, April 2011.
[4] Junghoo Cho, Hector Garcia-Molina, "Parallel Crawlers," Proceedings of the 11th International World Wide Web Conference (WWW 2002), ACM 1-58113-449-5/02/0005.


[5] M. Najork and A. Heydon, "On High-Performance Web Crawling," Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, California, September 2001.
[6] Christopher Olston, Marc Najork, "Web Crawling," Foundations and Trends in Information Retrieval, Vol. 4, No. 3, 2010.
[7] Min Wu, Junliang Lai, "The Research and Implementation of Parallel Web Crawler in Cluster," 2010 International Conference on Computational and Information Sciences, 2010.
[8] A. Guerriero, F. Ragni, C. Martines, "A Dynamic URL Assignment Method for Parallel Web Crawler," 2010 IEEE International Conference.
[9] Liyang Yu, "A Developer's Guide to Semantic Web."
[10] Hussein Al-Bahadili, Hamzah Qtishat, "Application of VM-Based Computations to Speedup the Web Crawling Process on Multi-Core Processors."
[11] http://pubs.vmware.com/vmci-sdk/.
[12] Vladislav Shkapenyuk and Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler," Proceedings of the 18th International Conference on Data Engineering (ICDE '02), pp. 357-369.
