IglooG: A Distributed Web Crawler Based on Grid Service

Fei Liu, Fan-yuan Ma, Yun-ming Ye, Ming-lu Li, and Jia-di Yu

Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P. R. China, 200030
{liufei001, my-fy, ymm, li-ml, yujiadi}@sjtu.edu.cn

Abstract. A web crawler is a program used to download documents from web sites. This paper presents the design of a distributed web crawler on a grid platform. The distributed crawler is based on our previous work, Igloo. Each crawler is deployed as a grid service to improve the scalability of the system. Information services in our system are in charge of distributing URLs to balance the load across the crawlers and are also deployed as grid services. The information services are organized as a peer-to-peer overlay network. According to the ID of the crawler and the semantic vector of the crawled page, which is computed by Latent Semantic Indexing, a crawler can decide whether to transmit a URL to an information service or to keep it for itself. We present an implementation of the distributed crawler based on Igloo and simulate a Grid environment to evaluate the load balance across the crawlers and the crawl speed. Both the theoretical analysis and the experimental results show that our system is a high-performance and reliable system.

1 Introduction

Search engines have played a very important role in the growth of the Web, and a web crawler forms an integral part of any search engine. The basic task of a crawler is to fetch pages, parse them to obtain more URLs, and then fetch those URLs to obtain even more URLs. In this process the crawler can also log the pages or perform other operations on the fetched pages according to the requirements of the search engine. Most of these auxiliary tasks are orthogonal to the design of the crawler itself. The explosive growth of the web has rendered the simple task of crawling the web nontrivial. The architecture of current crawlers [1][2] is based on a centralized design. Centralized solutions are known to have problems such as link congestion, a single point of failure, and expensive administration. To address the shortcomings of centralized search engines, there have been several proposals [3, 4] to build decentralized search engines over peer-to-peer networks. Peer-to-peer systems are massively distributed computing systems in which nodes communicate directly with one another to distribute tasks, exchange information, and accomplish work. The challenge in using such a distributed model is to distribute the computation efficiently while avoiding the overheads of synchronization and consistency maintenance. Scalability is also an important issue for such a model to be usable.

To improve the quality of service, we adopt grid services as the distributed environment. Several crawlers can run on one node, and the number of crawlers is determined by the bandwidth and computing ability of the node. The information services are organized in a P2P network, CAN [5]. URLs are collected by an information service according to the semantic vectors of the URLs and the ID of the information service. The semantic vectors of URLs are computed with Latent Semantic Indexing (LSI). In this way IglooG can scale up to the entire web; it has been used to fetch tens of millions of web documents.

The rest of the paper is organized as follows. Section 2 introduces related work on crawlers. Section 3 introduces Latent Semantic Indexing. Section 4 proposes the architecture of IglooG. Section 5 describes the experiments and results. We conclude in Section 6 with lessons learned and future work.

2 Related Work

The first crawler, Matthew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA Mosaic [6]. Several papers about web crawling were presented at the first two World Wide Web conferences [7, 8, 9]. However, at the time the web was two to three orders of magnitude smaller than it is today, so those systems did not address the scaling problems inherent in a crawl of today's web. All of the popular search engines use crawlers that must scale up to substantial portions of the web. However, due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility.

The Google search engine is a distributed system that uses multiple machines for crawling [10, 11]. The crawler consists of five functional components running in different processes. A URL server process reads URLs out of a file and forwards them to multiple crawler processes. Each crawler process runs on a different machine, is single-threaded, and uses asynchronous I/O to fetch data from up to 300 web servers in parallel. The crawlers transmit downloaded pages to a single store server process, which compresses the pages and stores them to disk. The pages are then read back from disk by an indexer process, which extracts links from HTML pages and saves them to a different disk file. A URL resolver process reads the link file, derelativizes the URLs contained therein, and saves the absolute URLs to the disk file that is read by the URL server. Typically, three to four crawler machines are used, so the entire system requires between four and eight machines.

The Internet Archive also uses multiple machines to crawl the web [12, 13]. Each crawler process is assigned up to 64 sites to crawl, and no site is assigned to more than one crawler. Each single-threaded crawler process reads a list of seed URLs for its assigned sites from disk into per-site queues, and then uses asynchronous I/O to fetch pages from these queues in parallel. Once a page is downloaded, the crawler extracts the links contained in it. If a link refers to the site of the page it was contained in, it is added to the appropriate site queue; otherwise it is logged to disk. Periodically, a batch process merges these logged "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process.

In the area of extensible web crawlers, Miller and Bharat's SPHINX system [14] provides some of the same customizability features as Mercator. In particular, it provides a mechanism for limiting which pages are crawled, and it allows customized document-processing code to be written. However, SPHINX is targeted towards site-specific crawling and therefore is not designed to be scalable.

3 Latent Semantic Indexing (LSI)

Literal matching schemes such as the Vector Space Model (VSM) suffer from synonyms and noise in descriptions. LSI overcomes these problems by using statistically derived conceptual indices instead of terms for retrieval. It uses singular value decomposition (SVD) [15] to transform a high-dimensional term vector into a lower-dimensional semantic vector. Each element of a semantic vector corresponds to the importance of an abstract concept in the description or query. Let $N$ be the number of descriptions in the collection and $d$ be the number of descriptions containing a given word. The inverse description frequency (IDF) is defined as

$IDF = \log(N / d)$    (1)

The vector for a description $D_o$ is constructed as

$D_o = (T_1 \cdot IDF_1, T_2 \cdot IDF_2, \ldots, T_n \cdot IDF_n)$    (2)

where $T_i$ takes the value 1 or 0 depending on whether or not word $i$ occurs in the description $D_o$. The vectors computed for the descriptions form a description matrix $S$. Suppose the number of returned descriptions is $m$; the description matrix is then $S = [S_1, S_2, \ldots, S_m]$. Based on this description matrix $S$, singular value decomposition (SVD) is used to extract the relationship pattern between descriptions and to define thresholds for finding matched services. The algorithm is described as follows. Since $S$ is a real matrix, it has an SVD $S = U_{m \times m} \Sigma_{m \times n} V_{n \times n}^{T}$, where $U$ and $V$ are orthogonal matrices. They can be written as $U_{m \times m} = [u_1, u_2, \ldots, u_m]$ and $V_{n \times n} = [v_1, v_2, \ldots, v_n]$, where $u_i$ $(i = 1, \ldots, m)$ is an $m$-dimensional vector $u_i = (u_{1,i}, u_{2,i}, \ldots, u_{m,i})$ and $v_i$ $(i = 1, \ldots, n)$ is an $n$-dimensional vector $v_i = (v_{1,i}, v_{2,i}, \ldots, v_{n,i})$. Suppose $\mathrm{rank}(S) = r$ and the singular values of $S$ are

$\beta_1 \geq \beta_2 \geq \ldots \geq \beta_r \geq \beta_{r+1} = \ldots = \beta_n = 0$.

For a given threshold $\varepsilon$ $(0 < \varepsilon \leq 1)$, we choose a parameter $k$ such that $(\beta_k - \beta_{k-1}) / \beta_k \geq \varepsilon$. We then denote

$U_k = [u_1, u_2, \ldots, u_k]_{m \times k}$, $V_k = [v_1, v_2, \ldots, v_k]_{n \times k}$, $\Sigma_k = \mathrm{diag}(\beta_1, \beta_2, \ldots, \beta_k)$, $S_k = U_k \Sigma_k V_k^{T}$.

$S_k$ is the best approximation to $S$ and contains the main information among the descriptions. In this algorithm, how well a description matches a query is measured by the similarity between them. To measure description similarity based on $S_k$, we choose the $i$th row $R_i$ of the matrix $U_k \Sigma_k$ as the coordinate vector of description $i$ in a $k$-dimensional subspace:

$R_i = (u_{i,1} \beta_1, u_{i,2} \beta_2, \ldots, u_{i,k} \beta_k)$, $i = 1, 2, \ldots, m$.

The similarity between description $i$ and query $j$ is defined as

$sim(R_i, R_j) = \dfrac{|R_i \cdot R_j|}{\|R_i\|_2 \, \|R_j\|_2}$    (3)
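To make the pipeline concrete, the following Python sketch (an illustrative example using NumPy, not the authors' implementation; the toy matrix and the relative-gap rule used to pick k are assumptions standing in for the ε criterion above) builds the IDF-weighted description matrix, truncates its SVD, and compares two descriptions with the cosine similarity of Eq. (3).

```python
import numpy as np

def description_matrix(term_presence: np.ndarray) -> np.ndarray:
    """Rows are descriptions, columns are terms; entries are T_i * IDF_i as in Eqs. (1)-(2)."""
    n_desc = term_presence.shape[0]
    doc_freq = np.maximum((term_presence > 0).sum(axis=0), 1)  # descriptions containing each term
    idf = np.log(n_desc / doc_freq)
    return term_presence * idf

def semantic_coordinates(S: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Truncated SVD: keep the first k singular values, chosen here by a relative-gap
    threshold eps (an assumed stand-in for the paper's epsilon rule), and return the
    rows of U_k * Sigma_k as k-dimensional description coordinates."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    gaps = (sigma[:-1] - sigma[1:]) / np.maximum(sigma[:-1], 1e-12)
    hits = np.flatnonzero(gaps >= eps)
    k = int(hits[0]) + 1 if hits.size else len(sigma)
    return U[:, :k] * sigma[:k]

def similarity(r_i: np.ndarray, r_j: np.ndarray) -> float:
    """Cosine similarity of two semantic vectors, as in Eq. (3)."""
    return abs(r_i @ r_j) / (np.linalg.norm(r_i) * np.linalg.norm(r_j) + 1e-12)

# Toy example: 4 descriptions over 5 terms (binary term occurrence).
T = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 1, 1],
              [0, 1, 0, 1, 0]], dtype=float)
coords = semantic_coordinates(description_matrix(T))
print(round(similarity(coords[0], coords[1]), 3))
```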

4 The Implementation of IglooG

4.1 The Web Crawler Service

We wrap each web crawler as a grid service and deploy it on the grid platform. This paper uses the crawler of Igloo as the single crawler from which IglooG is constructed. We first introduce the architecture of the single crawler (Fig. 1).

Fig. 1. The structure of a single crawler (DNS resolver, robots.txt resolver, HTTP module with per-thread URL queues, URL extractor, URL filter, URL dispatcher, URL manager, crawling policy, URL database, and web page database)

Each crawler obtains the IP address of a host from a URL via DNS. It then downloads the web page through the HTTP module if the robots.txt rules allow access to the URL. The URL extractor extracts URLs from the downloaded page, and the URL filter checks whether each URL satisfies the crawling restrictions. The crawler then uses a hash function to compute the hash ID of the URL and inserts the URL into its URL database. The crawling policy ranks pages so that more important resources are crawled earlier; we adopt the PageRank [16] method to evaluate the importance of web pages. The HTTP module consists of several download threads, each with its own URL queue. A sketch of this control flow is given below.
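As a rough illustration of the control flow just described, here is a minimal Python sketch of one download thread's step (the function names, user-agent string, and dispatch callback are hypothetical; the real Igloo crawler is wrapped as a grid service and also handles politeness, retries, and page storage, which are omitted here).

```python
import hashlib
import urllib.robotparser
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags (the role of the URL extractor)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def allowed_by_robots(url: str) -> bool:
    """Robots.txt resolver: ask the site whether we may fetch this URL."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return False
    return rp.can_fetch("IglooG", url)


def url_id(url: str) -> int:
    """Hash ID of a URL, used when deciding where the URL should be managed."""
    return int(hashlib.sha1(url.encode("utf-8")).hexdigest(), 16)


def crawl_one(url: str, url_filter, dispatch):
    """One step of a download thread: check robots, fetch, extract, filter, hash, dispatch."""
    if not allowed_by_robots(url):
        return None
    with urllib.request.urlopen(url, timeout=10) as resp:    # DNS resolution + HTTP module
        page = resp.read().decode("utf-8", errors="replace")
    extractor = LinkExtractor()
    extractor.feed(page)
    for link in extractor.links:
        absolute = urljoin(url, link)
        if url_filter(absolute):                             # URL filter
            dispatch(absolute, url_id(absolute))             # URL dispatcher / manager
    return page


# Example wiring: accept only HTTP(S) links and print which bucket each URL hashes to.
if __name__ == "__main__":
    crawl_one("http://example.com/",
              url_filter=lambda u: u.startswith("http"),
              dispatch=lambda u, h: print(h % 8, u))
```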

4.2 The Information Service

IglooG is designed to be used in a Grid environment. An information service is in charge of collecting information about resources and of distributing URLs; it also adjusts the distribution of URLs so that the crawlers are well load-balanced. The system we design is intended for large-scale web page downloading, so the number of information services is large, and organizing them is a challenge. These information services are treated as index services in GT3 and are defined as services that speak two basic protocols: GRIP [17] is used to access information about resource providers, while GRRP [17] is used to notify register nodes of the availability of this information. Each resource has two attributes: the resource type and the value of the resource. A crawler, being a special resource, is recorded in an information service; the number of URLs in its crawling queue and the IP address of the node it runs on form the value of the crawler. Fig. 2 shows an example of the GRIP data model:

Type: computer
Value: location: 192.168.1.199; children: cpu, memory, storage, files

  Type: cpu       Value: frequency: 1.4 GHz; load average: 30%
  Type: memory    Value: total: 256 MB; load average: 30%
  Type: storage   Value: total: 160 GB; free: 120 GB
  Type: file      Value: file1: filename1; file2: filename2

Fig. 2. GRIP data model
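To illustrate how a crawler could be recorded as a special resource in an information service, the following sketch models a GRIP-style record carrying a type, a value, and child resources (the Resource class and the register_crawler helper are hypothetical illustrations; GT3's actual index-service API is not shown).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """A GRIP-style record: a resource type, its attribute values, and child resources."""
    rtype: str
    value: Dict[str, str] = field(default_factory=dict)
    children: List["Resource"] = field(default_factory=list)


def register_crawler(index: List[Resource], node_ip: str, queue_length: int) -> Resource:
    """Record a crawler as a special resource: its value is the node's IP and the
    number of URLs waiting in its crawling queue (used later for load balancing)."""
    crawler = Resource("crawler", {"location": node_ip, "queue_length": str(queue_length)})
    index.append(crawler)
    return crawler


# A node resource resembling Fig. 2, with a crawler registered alongside it.
index_service: List[Resource] = []
node = Resource("computer", {"location": "192.168.1.199"}, children=[
    Resource("cpu", {"frequency": "1.4GHz", "loadaverage": "30%"}),
    Resource("memory", {"total": "256MB", "loadaverage": "30%"}),
    Resource("storage", {"total": "160GB", "free": "120GB"}),
])
index_service.append(node)
register_crawler(index_service, node_ip="192.168.1.199", queue_length=1200)
print([r.rtype for r in index_service])
```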

We organize these information services in CAN. Our design centers around a virtual 4-dimensional Cartesian coordinate space. At any point in time, the entire coordinate space is dynamically partitioned among all the information services in the system, such that every service owns an individual, distinct zone within the overall space. In our system one node can start at most one information service, and we take the IP address of a node as the identifier of the information service running on it. An IP address can be regarded as a point in the virtual 4-dimensional Cartesian coordinate space Sa = {(0,0,0,0), ..., (255,255,255,255)}, whose four axes we call x, y, z, and w. The first information service R1 holds the whole space Sa. When the second information service R2 joins, Sa is divided into two equal parts: one part is controlled by R1 and the other by R2, with R1 keeping the half whose central point is closer to R1. R1 records the IP of R2 and the zone controlled by R2, and R2 records the IP of R1 and the zone controlled by R1. In this way the neighbor relationship between R1 and R2 is set up. After the information service overlay network contains m services [R1, R2, ..., Rm], the (m+1)th service joins and splits the zone controlled by one node Rn (1 ≤ n ≤ m). A simplified sketch of this zone bookkeeping follows.
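This Python fragment is an illustrative sketch only (it assumes axis-aligned halvings, lets the joining service take the upper half, and ignores routing and the closer-central-point rule described above). It maps a service's IP to a point in Sa and splits an existing zone when a new service joins, with both services recording each other as neighbors.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int, int, int]


def ip_to_point(ip: str) -> Point:
    """An IPv4 address is read directly as a point in the 4-D space Sa."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a, b, c, d)


@dataclass
class Zone:
    low: Point                       # inclusive lower corner
    high: Point                      # inclusive upper corner
    owner_ip: str
    neighbors: List["Zone"] = None

    def contains(self, p: Point) -> bool:
        return all(l <= x <= h for l, x, h in zip(self.low, p, self.high))


def split(zone: Zone, new_ip: str, axis: int) -> Zone:
    """Halve `zone` along `axis`; the joining service takes the upper half and
    both services record each other as neighbors."""
    mid = (zone.low[axis] + zone.high[axis]) // 2
    new_low = list(zone.low)
    new_low[axis] = mid + 1
    new_zone = Zone(tuple(new_low), zone.high, new_ip, neighbors=[zone])
    old_high = list(zone.high)
    old_high[axis] = mid
    zone.high = tuple(old_high)
    zone.neighbors = (zone.neighbors or []) + [new_zone]
    return new_zone


# R1 initially owns the whole space Sa; R2 joins and the space is divided in two.
r1 = Zone((0, 0, 0, 0), (255, 255, 255, 255), owner_ip="192.168.1.199")
r2 = split(r1, "192.168.1.200", axis=0)
print(r1.low, r1.high, "|", r2.low, r2.high)
print(r1.contains(ip_to_point("10.0.0.1")), r2.contains(ip_to_point("200.0.0.1")))
```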
