IEEE International Conference on Advances in Engineering & Technology Research (ICAETR - 2014), August 01-02, 2014, Dr. Virendra Swarup Group of Institutions, Unnao, India
Virtualized Dynamic URL Assignment Web Crawling Model

Wani Rohit Bhaginath, Sandip Shingade, Mahesh Shirole
Department of CE and IT, VJTI, Matunga, Mumbai
rohit.b.wani@gmail.com, stshingade@vjti.org.in, mrshirole@vjti.org.in
Abstract— Web search engines are software systems that help to retrieve information from the web by accepting input in the form of a query and providing the results as files, pages, images, or other information. These search engines rely heavily on web crawlers that interact with millions of web pages given a seed URL or a list of seed URLs. However, these crawlers demand a large amount of computing resources. The efficiency of web search engines depends upon the performance of the crawling processes. Despite continuous improvement in crawling processes, there is still a need for more efficient and lower-cost crawlers. Most of the crawlers existing today have a centralized coordinator, which brings the disadvantage of a single point of failure. Taking into consideration the shortfalls of the existing crawlers, this paper proposes an architecture for a distributed web crawler. The architecture addresses two issues of the existing web crawlers: the first is to create a low-cost web crawler using the concept of virtualization from cloud computing; the second is a balanced load distribution based on dynamic assignment of the URLs. The first issue is addressed using multi-core machines, where each multi-core processor is divided into a number of virtual machines (VMs) that can perform different crawling tasks in parallel. The second issue is addressed using a clustering algorithm that assigns requests to the machines as per the availability of the clusters, thereby realizing balance among the components according to their real-time condition. This paper discusses the distributed architecture and the details of the implementation of the proposed algorithm.

Keywords— Crawler, Dynamic assignment, Seeds, Virtualization, Clustering algorithm, K-means clustering.

I. INTRODUCTION
With the improvement of social information and the rapid development of the Internet, the World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present, and it is still growing [12]. The Web search engine, the basic information acquisition tool, is becoming increasingly important because of the explosion in size and the increasing demand of users for finding information [1][2][4]. Web search engines are information retrieval software systems that help in finding information stored on the Internet by taking input query words and retrieving the information based on matching criteria. Some search engines mine data available in databases or open directories. The search results are generally presented as search engine results pages [3]. The information may be in the form of web pages, images, or other types of files. There are many search engines available in the market, like Google, Yahoo!, Bing, MSN Search, etc.

Web crawlers are a vital component of any search engine. A web crawler is an Internet bot that systematically browses the web: it takes URLs as the seed, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks [6]. A web crawler may also be called a web spider, an ant, an automatic indexer, or a web scutter. Web crawlers demand a large amount of computing resources for crawling data from the web, and the data are very dynamic in nature in terms of size and modifications [4].

The web crawler consumes the maximum time in the searching process. Hence the crawling process should be performed continuously to keep the search outcomes as up to date as possible [6]. A crawler for a large search engine has to address two issues. First, it should have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs to have a highly optimized system architecture, i.e., one that is robust against crashes, manageable, and considerate of resources and web servers [1]. The performance of the crawling process has been improved by using a parallel web crawler instead of a batch crawler. However, the existing parallel crawler has a single point coordinator with high chances of data redundancy, leading to crawling of the same URLs multiple times and thus affecting the performance [3]. In order to prevent a parallel web crawler from grabbing overlapping, redundant data, a framework that divides the entire web into several parts and distributes each part to one web crawler had been proposed and implemented in Mercator [5][7][8]. However, distributing the web statically may lead to load unbalancing, since some crawlers might remain idle, leading to underutilization of resources. Recently, a remarkable gain in computing performance has been achieved due to the introduction of parallelism in processor architectures, in the form of multi-core processors, pipelining, multitasking, and multithreading. The cores can operate in parallel, and thus programs run faster than on traditional single-core processors [2][4]. This paper proposes the use of virtualization technology to exploit the computing power of multi-core processors, thereby enhancing the performance of the crawling process.
Moreover, this approach takes care of the load imbalance issues by using dynamic assignment of the URLs within a distributed framework, thereby eliminating the single point of failure in search engines. This paper presents a model of the processor-farm methodology with dynamic assignment of URLs to the crawlers. The paper is organized as follows: related work is explained in Section II; Section III describes the proposed architecture; Section IV describes the benefits of the system; finally, Section V concludes the paper.

II. RELATED WORK

A. Web crawler

The web search engine has three major components: (i) web crawler, (ii) document analyzer and indexer, and (iii) search processor. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier [1][3][4]. URLs from the frontier are recursively visited according to a set of policies. Fig. 1 outlines the crawling process of a basic crawler [6].

Fig. 1. Crawling process of a basic crawler.

The crawler consists of multiple processes running on different machines connected by a high-speed network. Each of these crawling processes consists of multiple worker threads, and each worker thread performs repeated work cycles. The crawler initially obtains the seed URL (on the first cycle) or takes a URL from the frontier, and then gives the URL to the DNS resolver to obtain the IP address of the server hosting the URL. The frontier is a data structure that contains the list of URLs to be crawled. The fetcher is an HTTP library that connects to the corresponding web server and tries to download the web pages. These pages are passed to the parser, which filters the contents of the page to remove HTML tags and other text, thereby extracting the URLs to which the page points. Next, the new URLs are checked against the database of blacklisted URLs for their validity. Along with this, the URL that has just been visited is marked as fetched in the URL database. When a URL passes through the URL filter, an appropriate page-ranking algorithm is applied to rank and index the URLs. One common technique is to find the URL that has the maximum number of other URLs pointing to it and to rank the URLs on the basis of the obtained counts. The indexed URLs are then stored in the frontier. The frontier provides the un-fetched URLs to the DNS resolver [1][4][6].
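As an illustration of this work cycle, the following minimal, single-threaded sketch uses only Python standard-library modules. It is not the implementation described in this paper: the frontier, blacklist, and fetched set are plain in-memory structures, and all names (crawl, LinkExtractor, the page limit) are hypothetical.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

BLACKLISTED_EXTENSIONS = (".pdf", ".jpeg", ".xml", ".xsl", ".zip")  # file types excluded from crawling

class LinkExtractor(HTMLParser):
    # Parser step: drop tags and collect the URLs the page points to.
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)          # crawl frontier: URLs still to visit
    fetched = set()                      # URL database: URLs already marked as fetched
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        if url in fetched or url.lower().endswith(BLACKLISTED_EXTENSIONS):
            continue                     # URL filter / blacklist check
        try:
            with urlopen(url, timeout=10) as resp:              # fetcher: HTTP download
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        fetched.add(url)                 # mark as fetched in the URL database
        extractor = LinkExtractor(url)
        extractor.feed(html)             # parser: extract hyperlinks
        for link in extractor.links:
            if link not in fetched:
                frontier.append(link)    # newly found URLs go back into the frontier
    return fetched

if __name__ == "__main__":
    print(crawl(["http://www.vjti.ac.in/"], max_pages=5))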
B. Frameworks for Parallel Crawlers

A parallel crawler consists of C-proc's, which are multiple instances of the same crawling process running independently. Each C-proc is essentially a basic crawler; the C-proc's work simultaneously, download pages from the web, store the pages locally, extract URLs from the downloaded pages, and follow the links. Since the processes crawl independently, two processes might crawl the same URL. In order to prevent parallel web crawlers from grabbing overlapping, redundant data, Wu and Lai have listed two general frameworks: the first is a dynamic assignment framework with a central coordinator; the second is a static assignment framework [7].

The dynamic assignment framework, adopted by Google, has a central node, the URL server, as the coordinator, and another three separate machines are responsible for grabbing web pages and communicating with the URL server directly. Because of the central coordinator, the complete database is stored at a central point; this may lead to a bottleneck at the URL server, and any failure or crash of the URL server can lead to failure of the entire system. Thus, single point of failure and scalability restrictions are the major drawbacks of this framework [7][8].

On the contrary, the Mercator framework by Najork and Heydon divides the entire web into several parts and assigns each part to a separate crawler. This design is an example of static assignment; there is no central component. However, the assignment algorithm of the static framework does not respond to the instantaneous state of the system, thereby causing load imbalance [7].
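As a small illustration of the static assignment idea (and of why it cannot react to load), the sketch below partitions URLs among C-proc's by hashing the host name. The function name and the number of crawlers are hypothetical, not values from the frameworks cited above.

from urllib.parse import urlparse
import hashlib

def assign_crawler(url, num_crawlers):
    # Static assignment: the owning C-proc is fixed by a hash of the host,
    # independent of how busy each crawler currently is.
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

urls = [
    "http://www.vjti.ac.in/computer/there?name=q1",
    "http://www.vjti.ac.in/admissions",
    "http://example.org/index.html",
]
for u in urls:
    print(u, "-> C-proc", assign_crawler(u, 4))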
C. Virtualization

Virtualization refers to the act of creating a virtual (rather than actual) version of something in computing. Virtualization is not limited to a virtual computer hardware platform,
operating system (OS), storage device, or computer network resources. The physical characteristics of the resources are hidden to simplify the way in which applications, systems, or end users interact with these resources. Virtualization is an abstraction used in practice for benefits such as saving cost, reducing footprint, and consolidating systems. The VM creates a virtualization layer to translate requests to the hardware layer, thereby emulating a physical computing environment. It also manages requests for resources like CPU, memory, hard disk, network, and other hardware resources [2][4]. Different forms of virtualization have been developed, such as guest OS-based, shared kernel, kernel-level, and hypervisor virtualization. Virtual machines (VMs) have many advantages over the regular method of installing operating systems and software. The main advantages of VMs are that they provide isolation between the applications running on different VMs, and that a virtual machine does not interfere with the host operating system or with other virtual machines. Thus, virtual machines provide efficient utilization of hardware resources, leading to a decrease in cost.

D. Virtual Machine Communication Interface

The virtual machines communicate through the network layer, which adds overhead to the performance of the VMs. However, if the communication between the virtual machines takes place over a high-speed infrastructure that provides an efficient interface between the machine and the host operating system, this overhead can be reduced. Such an infrastructure is called the Virtual Machine Communication Interface (VMCI). It provides a socket interface similar to the one provided by TCP/UDP, but using VMCI Id numbers. VMCI provides both connectionless and connection-oriented communication. VMCI needs an installation of VMware or vSphere, and changes to the .vmx file to enable it. VMCI provides the virtual machines with a shared region where the machines can put and get data, and also a datagram API to send small messages among the virtual machines. VMCI minimizes the overhead and reduces the number of tasks required for communication [11].
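As a rough analogue of this socket-style interface, the sketch below uses the vsock address family that Python exposes on Linux guests (socket.AF_VSOCK), where endpoints are addressed by a context ID (CID) and a port instead of an IP address. This is for illustration only; it is not the VMware VMCI SDK API used by the paper, and the CID and port values are hypothetical.

import socket

VMCI_PORT = 9999   # hypothetical service port
PEER_CID = 3       # hypothetical context ID (CID) of the peer VM

def serve_once():
    # Listen for one message from another VM over a vsock stream socket.
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as srv:
        srv.bind((socket.VMADDR_CID_ANY, VMCI_PORT))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            print("received:", conn.recv(4096).decode())

def send_url(url):
    # Forward a URL that belongs to another VM's cluster.
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as cli:
        cli.connect((PEER_CID, VMCI_PORT))
        cli.sendall(url.encode())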
III. PROPOSED CRAWLER MODEL

The proposed architecture uses the concepts of virtualization and dynamic URL assignment to improve the performance of the crawler. This crawler model uses a processor-farm methodology. In this framework, the multiple cores of the machines are used as virtual machines. Each virtual machine performs crawling, thus giving out a distributed framework. These virtual machines interact with each other through a shared region or memory using VMCI (Virtual Machine Communication Interface), which provides the API needed for them to communicate with each other at the time of crawling. The shared region holds the database that contains the URL status (URL name, URL status, URL metadata). Fig. 2 outlines the virtual machine communication using VMCI in the proposed architecture. These virtual machines act like independent crawlers and retrieve the HTML pages.

Fig. 2. Proposed architecture of the web crawler using virtualization (VMs communicating through memory shared via VMCI, with a database to store URLs).

These virtual machines also act like physically independent clusters, and the URLs belonging to the same cluster are then provided to the corresponding virtual machine, thereby saving some amount of crawling and speeding up the process. The complete flow of the crawling process is shown in Fig. 3. The crawler initially needs a set of seed URLs; the number of seed URLs can be different for each VM.

The crawling process starts with the injector providing the seed URLs. These seed URLs are then used by the clustering step for cluster formation. The clustering process first calculates the hash values of the URL strings using the formula

W = \sum_{i=1}^{n} c_i

where c_i is the ASCII value of the i-th character of the URL string (1 \le i \le n), n is the length of the string, and W represents the weight of the string.

The hash value is computed over the Uniform Resource Identifier (URI), which is a string of characters used to identify a name or a resource on the Internet. This identification enables interaction with resource representations in a network using specific protocols. For example, consider the sample URL http://www.vjti.ac.in/computer/there?name=q1#abc, where http:// is the scheme, www.vjti.ac.in is the domain, computer/there is the path, and ?name=q1#abc is the query and fragment.
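To make the weight computation and the URL decomposition concrete, the following sketch computes W as the sum of the character codes of the URL string and splits the sample URL into its components with Python's urllib.parse. The helper name is illustrative only and is not part of the paper's implementation.

from urllib.parse import urlparse

def url_weight(url):
    # W = sum of the ASCII values c_i of the characters of the URL string.
    return sum(ord(ch) for ch in url)

sample = "http://www.vjti.ac.in/computer/there?name=q1#abc"
parts = urlparse(sample)
print("W        =", url_weight(sample))
print("scheme   =", parts.scheme)      # http
print("domain   =", parts.netloc)      # www.vjti.ac.in
print("path     =", parts.path)        # /computer/there
print("query    =", parts.query)       # name=q1
print("fragment =", parts.fragment)    # abc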
Fig. 3. Flow diagram of the proposed crawling process (URLs are distributed to VM 1 ... VM n, and each VM adds its results to the database).

The hash value of the domain and scheme is calculated to identify the cluster to which the URL belongs, and thus the URL is assigned, depending upon the threshold value, to a virtual machine. The threshold value is decided as per the availability of the virtual machine. The clustering algorithm used is the K-means clustering algorithm. The URL clusters thus formed are then distributed to the different virtual machines (clusters). The injector is needed only the first time, to provide the seed URLs, when only one virtual machine is needed. After that, the distribution unit of the URL clusters distributes the URLs to the other virtual machines.
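A minimal sketch of this clustering step is given below: it groups the scalar URL weights W into k clusters with a plain one-dimensional K-means and maps each cluster to a virtual machine. It is illustrative only; the number of clusters, the iteration count, and the helper names are assumptions, not values from the paper.

import random

def kmeans_1d(weights, k, iterations=20, seed=0):
    # Cluster scalar URL weights into k groups (one group per VM).
    random.seed(seed)
    centroids = random.sample(weights, k)
    assignment = [0] * len(weights)
    for _ in range(iterations):
        # assignment step: each weight goes to its nearest centroid
        assignment = [min(range(k), key=lambda c: abs(w - centroids[c])) for w in weights]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [w for w, a in zip(weights, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignment, centroids

urls = ["http://www.vjti.ac.in/", "http://www.vjti.ac.in/computer/",
        "http://example.org/a", "http://example.org/b/c"]
weights = [sum(ord(ch) for ch in u) for u in urls]
assignment, centroids = kmeans_1d(weights, k=2)
for u, vm in zip(urls, assignment):
    print(u, "-> VM", vm)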
Algorithm Name: rh_crawling
Input: List of seed URLs
Output: Page rank of the URLs
Step 1: Seed URLs are provided to the initial node.
Step 2: Create clusters of the seed URLs using K-means clustering.
Step 3: Assign the URLs to the different nodes based on the clusters (refer to the cluster_info table).
Step 4: On every corresponding node, add the URLs to the queue.
    filter: while (queue not empty)
        Dequeue a URL.
        Check whether the dequeued URL is in the blacklisted URLs (.pdf, .jpeg, .xml, .xsl, .zip);
        if (present) goto filter else goto Step 5.
    end while
Step 5: Calculate the page rank using the formula (refer to the urls_database table for the backward link count and the number of outgoing links):
    P(B) = \sum_{K} P(K) / N(K)
    where K is a page that has a forward link to B, P(K) is the page rank of K, and N(K) is the number of forward (outgoing) links from K.
Step 6: if (URL present in the urls_database)
        if (forward count is zero) goto filter
        else fetch the page and make an entry in the url_links table.
Step 7: Parse the page to extract the links and obtain the forward links. Insert the entries into the url_links table.
Step 8: Perform clustering at each node using the K-means clustering algorithm.
Step 9: if (URL belongs to a different cluster) distribute it to the corresponding cluster else goto filter.
Step 10: end.
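The rank computation of Step 5 can be sketched as follows, assuming the url_links table has been loaded into a simple forward-link dictionary. The data, function name, and iteration count are hypothetical; the sketch repeats the accumulation a few times so the ranks settle, whereas the paper's step applies the formula with the counts stored in urls_database.

def compute_ranks(forward_links, iterations=10):
    # P(B) = sum over pages K linking to B of P(K) / N(K),
    # where N(K) is the number of outgoing links of K.
    pages = set(forward_links) | {p for links in forward_links.values() for p in links}
    ranks = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_ranks = {p: 0.0 for p in pages}
        for k, outgoing in forward_links.items():
            if not outgoing:
                continue
            share = ranks[k] / len(outgoing)   # P(K) / N(K)
            for b in outgoing:
                new_ranks[b] += share
        ranks = new_ranks
    return ranks

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(compute_ranks(links))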
Each of the virtual machines then initially filters all the URLs of its cluster to identify the blacklisted URLs and applies the page-ranking algorithm to rank the URLs. In our proposed approach, the Apache Lucene crawler is used, which provides the ranking of the URLs. The most common ranking method is used to identify the URL to crawl, where the lowest number is allotted to the URL that has the maximum number of URLs pointing to it (a lower number indicates that the URL is to be crawled first). The first URL of the ranked URLs is then taken and crawled to obtain the content of the URL. It is then forwarded to the HTML parser to remove all the images, HTML tags, and other data, thereby extracting the other URLs. This entire process occurs in parallel on each of the virtual machines.
After that, the resulting URLs are again provided to the clustering phase to identify their new clusters. If a URL belongs to another cluster, it is provided to the corresponding VM using the Virtual Machine Communication Interface. The entire process within the VMs is outlined in Fig. 4; the algorithm for the entire process flow is given above as rh_crawling.
Fig. 4. Virtual machines working as crawlers.

A. Database Table Schema

Cluster info table: This table is used by the nodes to decide whether a URL falls in the threshold range of a cluster. The schema of the table is shown below.

cluster_info (Cluster_no, Threshold_range, Node_id)

Urls database table: This table stores information about the URLs: the page rank of the URL, the number of links that the URL points to, and the number of URLs that point to the URL.

urls_database (Url_id, Url, Pg_rnk, Frwrd_count, Backwrd_count)

Url link table: This table provides information about the relationships between the URLs. It is used to identify the number of links that point to a given link.

url_links (Forwardlink_id, Backwardlink_id)
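A possible realization of these three tables is sketched below with SQLite, purely for illustration; the column names follow the tables above, while the data types, keys, and choice of DBMS are assumptions not stated in the paper.

import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative; the paper does not name a specific database system
conn.executescript("""
CREATE TABLE cluster_info (
    cluster_no      INTEGER PRIMARY KEY,
    threshold_range TEXT,        -- range of URL weights handled by the cluster (assumed representation)
    node_id         INTEGER      -- VM (node) that owns the cluster
);
CREATE TABLE urls_database (
    url_id        INTEGER PRIMARY KEY,
    url           TEXT,
    pg_rnk        REAL,          -- page rank of the URL
    frwrd_count   INTEGER,       -- number of links this URL points to
    backwrd_count INTEGER        -- number of URLs pointing to this URL
);
CREATE TABLE url_links (
    forwardlink_id  INTEGER,     -- source page of the link (assumed meaning)
    backwardlink_id INTEGER      -- destination page of the link (assumed meaning)
);
""")
conn.commit()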
IV. BENEFITS OF THE SYSTEM

Advantages of the proposed system over the existing ones are as follows:
• In a VM-based web crawling system, multiple machines operate simultaneously, so it performs faster than the same system with no VMs installed, i.e., it reduces crawling CPU time and increases processor utilization.
• Since the URLs of the same server fall under the same cluster, the dynamic assignment reduces the crawling phase if the earlier crawling phase results are cached.
• The proposed architecture removes the crawling of the same URLs by two different independent crawlers, as compared to the earlier system implementation in the distributed framework.
• There is no single coordinator, and hence the single point of failure is removed. Failure of any VM will make its part of the URLs inaccessible for some period until the machine recovers, but it will not cause the entire system to halt.
• By using the shared memory region for maintaining the URL database, the need to copy the URLs to each crawler is removed.
• A crawler passes on only those URLs that do not fall under its own cluster. Hence the communication between the crawlers consists only of URLs, and thus the communication overhead is very low.
V. EXPERIMENTS AND RESULTS

In order to evaluate the performance of the proposed architecture, two types of crawlers are run to crawl the same number and type of documents. Their results are compared in terms of CPU time. The times required are plotted to obtain the speedup factor achieved by the crawler on the virtual machines. It is well recognized that the document crawling time depends on the document content and network performance, and that crawling time increases with an increasing number of crawled documents. Therefore, we have estimated the average document crawling time (t_avg). The computer used in this work comprises a high-speed single multi-core processor, an Intel® Core™ i5-2300 CPU @ 2.80 GHz, 4 GB memory, a 500 GB hard drive, and Ubuntu (an open-source Linux-based OS). Crawling times for different pre-specified numbers of Web documents, ranging from 75 to 600 in steps of 75, are estimated.
The crawling process starts with the single URL of vjti.ac.in. For each fetched URL, a number of Web documents are retrieved, and since each Web document contains a number of URLs, they are extracted, parsed, and added to the crawled database, i.e., the crawled database is updated. The program keeps records of both the actual number of crawled documents and the actual number of fetched URLs. When the VM crawling starts, a single VM is active; after crawling, it collects URLs and provides them for clustering. Subsequently, four clusters are created and three more VMs are made active, thus giving a total of four VMs. These VMs crawl the URLs of their own clusters and then distribute the URLs belonging to different clusters. The graph of the speedup factor (S_factor) obtained is shown in Fig. 5. The speedup factor S_factor is calculated as
S_factor = t_single / t_VM

where t_single is the crawling time of the single-node crawler and t_VM is the crawling time of the VM-based crawler for the same number of documents.
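Assuming the speedup factor is this ratio of the single-node crawling time to the VM-based crawling time, it can be computed from measured times as in the short sketch below; the timing values are placeholders, not the paper's measurements.

def average_crawl_time(total_time_s, documents):
    # t_avg: average document crawling time.
    return total_time_s / documents

def speedup_factor(t_single_s, t_vm_s):
    # S_factor = t_single / t_VM for the same number of documents.
    return t_single_s / t_vm_s

# placeholder measurements (not from the paper)
print(average_crawl_time(300.0, 150))
print(speedup_factor(t_single_s=300.0, t_vm_s=110.0))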
Fig. 5. Comparison of the single-node crawler with the 4-VM based crawler.

As shown in the graph of Fig. 5, the speedup factor of the machine with a single node is compared with that of the core having four virtual machines. Initially there is very little difference between the times taken by the two crawlers; subsequently the time required by the single node increases linearly as the number of crawled documents increases, whereas the core with virtual machines requires less time as the number of documents increases. Thus, the time required to crawl a given number of documents is reduced by increasing the number of virtual machines on the nodes, thereby increasing the computing capacity of the node. However, there is a limit on the number of virtual machines that can be added, because after a certain point an increase in the number of virtual machines will not make any difference, due to the increase in communication overhead.

Fig. 6. Comparison of the speedup factor with the number of VMs.

Another experiment is performed on the same seed of URLs with two different cores, one having three virtual machines running the crawler and the other having four virtual machines running the crawler. The comparison is shown in Fig. 6, which compares the speedup achieved by using different numbers of virtual machines. The figure shows that the speedup achieved by using three or four virtual machines is almost the same. Though increasing the number of virtual machines reduces the computation time of each VM, the communication between the VMs increases the total crawling time.
VI. CONCLUSION

This paper has introduced a distributed controller architecture that eliminates the problem of the single point of failure of a central controller. It adopts a dynamic URL assignment technique. Using the concept of virtualization, the resources and the infrastructure required are reduced, and the number of virtual machines can be increased or reduced as needed. Moreover, failure transparency is provided, since the failure of one VM will not hinder the handling of the other requests. Thus, high-performance multi-core processors with VMs installed can be used to develop a cost-effective web crawler that enhances the performance of current web crawlers. There is a need to empirically evaluate the proposed framework and algorithm on a large infrastructure with a large set of seed URLs. The proposed work can further be extended for secure crawling.

REFERENCES

[1] Hussein Al-Bahadili, "Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization", International Journal on Web Service Computing (IJWSC), Vol. 4, No. 1, March 2013.
[2] Pooja Gupta, Kalpana Johari, Second International Conference on Emerging Trends in Engineering and Technology, ICETET-09.
[3] Akassh A. Mishra, Chinmay Kamat, "Migration of Search Engine Process into the Cloud", International Journal of Computer Applications (0975-8887), Volume 19, No. 1, April 2011.
[4] Junghoo Cho, Hector Garcia-Molina, "Parallel Crawlers", Proceedings of the 11th International World Wide Web Conference (WWW 2002), ACM 1-58113-449-5/02/0005.
[5] M. Najork and A. Heydon, "On High Performance Web Crawling", Compaq Systems Research Center, 130 Lytton Avenue, Palo Alto, California, September 2001.
[6] Christopher Olston, Marc Najork, "Web Crawling", Foundations and Trends in Information Retrieval, Vol. 4, No. 3 (2010).
[7] Min Wu, Junliang Lai, "The Research and Implementation of Parallel Web Crawler in Cluster", 2010 International Conference on Computational and Information Sciences.
[8] A. Guerriero, F. Ragni, C. Martines, "A Dynamic URL Assignment Method for Parallel Web Crawler", 2010 IEEE International Conference.
[9] Liyang Yu, "A Developer's Guide to Semantic Web".
[10] Hussein Al-Bahadili, Hamzah Qtishat, "Application of VM-Based Computations to Speedup the Web Crawling Process on Multi-Core Processors".
[11] http://pubs.vmware.com/vmci-sdk/
[12] Vladislav Shkapenyuk and Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler", Proceedings of the 18th International Conference on Data Engineering (ICDE '02), pp. 357-369.