DSphere: A Source-Centric Approach to Crawling, Indexing and Searching the World Wide Web

Bhuvan Bamba, Ling Liu, James Caverlee, Vaibhav Padliya, Mudhakar Srivatsa, Mahesh Palekar, Joseph Patrao, Suiyang Li and Aameek Singh

College of Computing, Georgia Institute of Technology, Atlanta, Georgia 30332
{bhuvan, lingliu, caverlee, vaibhav, mudhakar, mpalekar, jpatrao, suiyang, aameek}@cc.gatech.edu

Abstract

We describe DSphere (short for Decentralized Information Sphere), a decentralized, source-centric approach to crawling, indexing, searching and ranking documents on the World Wide Web. Unlike most existing search technologies, which depend heavily on a page-centric view of the Web, we advocate a source-centric view and propose a decentralized architecture for crawling, indexing and searching the Web in a distributed, source-specific fashion. We demonstrate a DSphere prototype system for source-based crawling, indexing and searching of the Web. A fully decentralized crawler assigns each peer the responsibility of crawling a specific set of documents, referred to as a source collection. Each peer also maintains an index over its crawled collections to support full-text keyword search over them. Documents are ranked using link analysis; because traditional link analysis techniques suffer from slow refresh rates and vulnerability to Web spam, we propose a source-based link analysis algorithm that computes fast and accurate ranking scores for all crawled documents.

1. Introduction

The amount of data available on the World Wide Web continues to grow at an astonishing speed. Engineering Web data, and making this massively distributed data repository searchable with respectable latency and service coverage, remains one of the big challenges of the 21st century. On one hand, it is widely recognized that search engines are extremely useful for searching the vast and rapidly growing amount of Web data. On the other hand, it is also well known that conventional search engines to date reach only a relatively small portion (10%-20%) of the Web. Most current search systems manage Web crawlers with a centralized client-server model, in which a centralized system assigns crawling jobs using centralized repositories. Such systems suffer from a number of problems, including link congestion, low fault tolerance, poor scalability and expensive administration.

We develop DSphere, a system that performs crawling, indexing, searching and ranking over a fully decentralized network of peers. Each peer in the DSphere network is responsible for crawling a specific set of documents, referred to as a source collection, and for maintaining an index over its crawled collections to support full-text keyword search over them. The DSphere crawler exploits the geographical proximity of peers to Web resources for faster crawling, and distributes the crawling task among a network of peers that exchange information to avoid URL duplication and content duplication. A source collection may be defined as a set of documents belonging to a particular domain; for example, a peer may be assigned the responsibility of crawling all or a subset of the documents in the www.cc.gatech.edu domain. We have developed two versions of the crawler: the first [6] uses a structured P2P approach based on a Distributed Hash Table (DHT) protocol, and the second [4] uses an unstructured P2P approach based on the Gnutella protocol. The design of our decentralized crawlers aims to provide extremely fast crawling with built-in scalability and fault tolerance.

For ranking purposes, traditional TF-IDF based measures are supplemented by link analysis techniques. Several search engines have successfully deployed such techniques, as exemplified by the popularity of Google's PageRank algorithm [5]. However, PageRank and other existing link-based approaches suffer from a number of known problems, such as slow update time, vulnerability to Web spam, and lack of support for levels of abstraction of the Web beyond the flat page-based view. DSphere enhances link-based ranking through collection-based link analysis [2]. Motivated by the strong locality of Web links, we view the Web graph as a set of interlinked collections layered on top of the interlinked Web pages. The source-based approach significantly speeds up the refresh of link analysis scores and provides higher resilience to a number of known Web spam attacks [3].

2. System Overview

The DSphere project aims at building a fast, scalable peer-to-peer (P2P) search system. Like any other search system, DSphere has three main functional components: crawling, indexing, and search. In contrast to most existing search systems, DSphere implements these components on a distributed, decentralized computing architecture with a network of peers. The design of the DSphere system offers three unique features:

[Fig. 1. Architecture of a DSphere peer, showing the Crawler (Crawl-Jobs and Processing-Jobs lists, Downloader and Extractor workers, Seen-URLs and Seen-Content structures, Routing Table/Distribution Function), the Indexer over downloaded collections, the Search component with SourceRank/PageRank scores and a Links DB, and the inter-peer interface for page duplication, URL exchange, and query/reply messages.]
• A scalable, fault-tolerant and extremely fast, fully decentralized P2P crawler for crawling the World Wide Web.
• Distributed indexing and full-text keyword search built on top of the Apache Lucene API [1].
• Efficient and spam-resilient ranking techniques based on source-based link analysis.

The system architecture for a single peer in DSphere is shown in Fig. 1. In the rest of this section we briefly describe the design and working model of each of these components.

2.1. Decentralized Crawler

The decentralized crawler consists of a network of peers located in geographically disparate regions across the world. Each peer is responsible for crawling a portion of the Web, chosen based on the geographical proximity between the peer and the collections to be crawled. We have developed two versions of the crawler: one using a structured P2P topology with Distributed Hash Table (DHT) based routing, called Apoidea, and another using an unstructured, Gnutella-like topology and routing protocol, called PeerCrawl.

As shown in Fig. 1, the crawler component of a DSphere peer comprises three units: the Storage Manager, the Job Scheduler and the Interface Manager. The Storage Manager manages the data structures maintained within each peer, including the to-be-crawled list, the set of URLs already crawled, and so forth. The Job Scheduler controls the Crawl-Jobs and Processing-Jobs lists, the Seen-URLs and Seen-Content structures, the Routing Table, and the workers, all implemented at each peer. The choice of data structures is crucial for efficient operation, as the crawler is both network and CPU/disk intensive. The Worker unit contains the Downloader and Extractor threads (a sketch of this pipeline follows). The Downloader thread picks up domains from the Crawl-Jobs list and queries its neighbor list to determine which peer can crawl a particular domain fastest; URLs are passed on to a neighboring peer if that peer can access the domain's content faster. The Extractor thread picks up pages from the Processing-Jobs list and extracts the URLs they contain. Extracted URLs are normalized to ensure that all logical URLs are mapped to the correct physical URLs.
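The paper includes no code for the Worker unit, so the following is only a minimal Python sketch of how the Crawl-Jobs/Processing-Jobs pipeline described above could be wired up. All names (crawl_jobs, extractor, normalize) and the exact-match Seen-URLs set are our illustrative assumptions, not DSphere internals.

```python
import queue
import threading
import urllib.parse
import urllib.request

crawl_jobs = queue.Queue()        # to-be-crawled URLs (Crawl-Jobs list)
processing_jobs = queue.Queue()   # fetched pages awaiting extraction (Processing-Jobs list)
seen_urls = set()                 # Seen-URLs structure, here a simple exact-match set
seen_lock = threading.Lock()

def normalize(base_url, link):
    """Map a logical URL to a canonical physical URL."""
    url = urllib.parse.urljoin(base_url, link)
    url, _fragment = urllib.parse.urldefrag(url)   # drop #fragments
    return url.rstrip("/")

def downloader():
    """Downloader thread: fetch pages and hand them to the extractor."""
    while True:
        url = crawl_jobs.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                processing_jobs.put((url, resp.read().decode("utf-8", "replace")))
        except OSError:
            pass   # a production crawler would log and retry
        crawl_jobs.task_done()

def extractor(extract_links):
    """Extractor thread: pull links out of pages, enqueue unseen URLs."""
    while True:
        base_url, page = processing_jobs.get()
        for link in extract_links(page):       # extract_links: caller-supplied parser
            url = normalize(base_url, link)
            with seen_lock:
                if url not in seen_urls:       # URL-duplicate check
                    seen_urls.add(url)
                    crawl_jobs.put(url)
        processing_jobs.task_done()
```

In the real system, unseen URLs whose domains belong to other peers would be batched to those peers under the division of labor described below, rather than enqueued locally.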

The Interface Manager handles two types of communication among peers in the network: (1) crawling-specific communication, such as page duplication checks, crawl job distribution, and exchange of URLs; and (2) topology and routing protocol communication, which controls the entry and exit of peers and maintains the desired network topology.

Due to space constraints, we describe only the most important feature of the DSphere decentralized crawler: the division of labor among the peers. We describe below how the division of labor is carried out in both the DHT-based P2P crawler, Apoidea, and the unstructured P2P crawler, PeerCrawl. Interested readers may refer to [6, 4] for more detailed descriptions.

The structured decentralized crawler [6] uses the DHT protocol to partition the Web space among all peers in the network. Peers and the domains to be crawled are hashed to a common m-bit space. A peer is responsible for those URLs whose domain name hashes onto it; Fig. 2(a) illustrates the logic with an example. Assume there are three peers in the system, Peer A, B and C; their IP addresses are hashed onto the m-bit space, occupying three different points on the modulo-2^m ring. Domain names are hashed onto the same space and occupy slots on the ring. Peer A is responsible for the space between Peer C and itself, Peer B for the space between Peer A and itself, and Peer C for all remaining domains. In this example, Peer A is responsible for the domain www.cc.gatech.edu; if any other peer encounters URLs belonging to that domain, it batches them and periodically sends them to Peer A. A sketch of this assignment appears below. However, such hash-based random assignment of URLs to peers fails to exploit the geographical proximity between peers and the domains they are assigned to crawl.
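As an illustration of the Apoidea-style assignment just described, here is a hedged Python sketch of consistent hashing on an m-bit ring. The use of MD5, the 32-bit ring and the function names are our assumptions, not code from Apoidea.

```python
import hashlib

M = 32                                  # identifiers live in [0, 2^M)

def ring_hash(key: str) -> int:
    """Hash a peer IP or a domain name onto the m-bit identifier ring."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % (1 << M)

def responsible_peer(domain: str, peer_ips: list) -> str:
    """A domain belongs to the first peer at or after its hash, clockwise."""
    d = ring_hash(domain)
    peers = sorted(peer_ips, key=ring_hash)
    for ip in peers:
        if ring_hash(ip) >= d:
            return ip
    return peers[0]                     # wrap around the ring

# Example mirroring Fig. 2(a): whichever of the three peers owns the slot
# after hash("www.cc.gatech.edu") is responsible for the whole domain; the
# other peers batch URLs from that domain to it.
owner = responsible_peer("www.cc.gatech.edu",
                         ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
```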

[Fig. 2. Division of labor in (a) the structured P2P system and (b) the unstructured P2P system: peers A, B and C and example domains (www.cc.gatech.edu, www.unsw.edu.au, www.iitm.ac.in) are hashed onto the same identifier space, and URLs such as www.cc.gatech.edu/research, www.cc.gatech.edu/admissions and www.cc.gatech.edu/people are batched to the peer responsible for their domain; in (b), points X, Y and Z delimit the peers' ranges.]

The design of Apoidea improves the standard DHT-based crawling assignment to ensure that a domain is crawled by a peer close to it.

Similarly, the unstructured decentralized crawler, PeerCrawl, divides the labor using a hash-based URL Distribution Function that determines the domains to be crawled by a particular peer. The IP addresses of peers and the domain names are hashed to the same m-bit space. A URL U is crawled by peer P if its domain lies within the range of P. The range of peer P, denoted Range(P), spans from h(P) - 2^n to h(P) + 2^n, where h is a hash function (such as MD5) and n is a system parameter that depends on the number of peers in the system; in our first prototype of DSphere, we use the number of neighbor peers of P as the value of n. Thus, whenever a peer joins or leaves the network, the ranges of all peers change dynamically. Consider the example in Fig. 2(b), where the points X, Y and Z are computed from the neighbor counts of peers A, B and C, respectively. Peer A is responsible for all domains mapped between points X and Y, Peer B for all domains mapped between points Y and Z, and Peer C for the rest. A peer broadcasts all URLs for which it is not responsible to the other peers in the system, and every peer not responsible for a URL's domain simply drops it. The distribution function is adjusted dynamically as peers enter and exit the system, with a subsequent redistribution of the load among the remaining peers. Such redistribution may lead to overlapping domain ranges for neighboring peers, which can cause the same domain to be crawled by two different peers. A sketch of the range test appears below; due to space constraints, we refer readers to [4] for further details.
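The following Python sketch shows one way the PeerCrawl range test described above could be implemented: a peer crawls a URL when the hash of the URL's domain falls within 2^n of the peer's own hash on the m-bit space. The wrap-around handling and the exact hash choice are our assumptions.

```python
import hashlib

M = 32                                       # size of the m-bit hash space

def h(key: str) -> int:
    """MD5-based hash onto the m-bit space (the paper suggests MD5)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % (1 << M)

def in_range(peer_ip: str, domain: str, n: int) -> bool:
    """True if the domain lies in [h(P) - 2^n, h(P) + 2^n], with wrap-around."""
    center, d = h(peer_ip), h(domain)
    ring = 1 << M
    dist = min((d - center) % ring, (center - d) % ring)   # distance on the ring
    return dist <= (1 << n)

# A peer keeps a discovered URL if in_range(my_ip, domain_of(url), n) holds,
# where n tracks its neighbor count; otherwise it broadcasts the URL, and
# every receiving peer applies the same test and drops out-of-range URLs.
```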

2.2. Indexer and Search

The Indexer and Search components of DSphere are implemented using the Apache Lucene API. Each peer maintains an index over the collections (sets of domains) it has crawled. Apache Lucene provides a full-text inverted index, which enables keyword search over the downloaded content. The search function can use flooding mechanisms (as in Gnutella) or inverted lists and Bloom filters (as in a DHT-based system). The system can return incremental results, on the assumption that the user may be satisfied with a partial set of matching results. We are working on several enhancements to the search mechanism to lower the communication costs incurred while retrieving relevant results, and we are investigating source-specific search mechanisms to complement our source-centric crawling, indexing and search algorithms.
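DSphere's index is built on Apache Lucene, whose code we do not reproduce here; the toy Python inverted index below only illustrates the concept behind a per-peer full-text index with conjunctive keyword search, and is not Lucene code.

```python
from collections import defaultdict

index = defaultdict(set)              # term -> posting list of document URLs

def add_document(url: str, text: str) -> None:
    """Index a crawled page under every term it contains (naive tokenizer)."""
    for term in text.lower().split():
        index[term].add(url)

def search(query: str) -> set:
    """Return documents from this peer's collections matching every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index[terms[0]])
    for term in terms[1:]:
        results &= index[term]         # conjunctive (AND) keyword semantics
    return results
```

In the distributed setting, such per-peer results would be merged with results obtained from other peers via flooding or DHT routing, as described above.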

2.3. Source-Based Link Analysis

To provide augmented relevance ranking and search capabilities, DSphere relies on source-based link analysis. Rather than assigning a single global authority score to each Web page, as Google's PageRank does, DSphere assigns two scores: (1) each collection of pages (or source) is assigned an importance score based on an analysis of the inter-source link structure; and (2) each page within a source is assigned an importance score based on an analysis of intra-source links. This approach is motivated by the strong locality of Web links, which often reflects organizational responsibility (e.g., pages within one domain tend to link to pages within the same domain) [2]. SourceRank calculations are significantly faster than flat page-level approaches like PageRank, which is especially important given the size and growth rate of the Web. We have additionally incorporated a suite of spam-resilient countermeasures into the source-based ranking model to support rankings that are more difficult to manipulate than traditional page-based approaches [3]. One such countermeasure is hijack-resistant influence flow for determining the strength of the "vote" from one source to another; it limits the impact of legitimate pages that have been hijacked by Web spammers to link to spammer-controlled pages. A second countermeasure is spam-proximity influence throttling, which mitigates link-based attacks by tuning down the influence of malicious Web spammers, even when they behave collectively.

[Fig. 3. Crawler GUI]

In practice, we treat each domain (such as www.cc.gatech.edu) as a source. So, in Fig. 2(a), Peer A will calculate a local PageRank-based score for each page in the www.cc.gatech.edu domain. Using source-based link analysis, we then calculate a source score for the www.cc.gatech.edu domain by accumulating, from other peers, information about all sources that link to this domain. This can be done by providing all the source-based links to a single peer, which maintains the source-based Web graph and calculates the SourceRank of every source from this graph. Since peers on the Web have different capabilities, we select more powerful peers to maintain this source-based Web graph, and we replicate the graph across multiple peers to handle peer exits from the network. A sketch of the two-level computation follows.
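To make the two-level scheme concrete, here is a hedged sketch: a generic PageRank-style iteration that a peer can run over its intra-source page graph, and that the designated peer can run over the inter-source graph. The damping factor, iteration count and final combination rule are conventional choices of ours, not values from the paper; the actual SourceRank formulation is given in [2].

```python
def rank(graph: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """PageRank-style scores for `graph` (node -> list of out-neighbors).
    Works unchanged for pages within a source or for whole sources.
    Sink-node mass is simply dropped in this simplified sketch."""
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in graph.items():
            if outs:
                share = damping * score[u] / len(outs)
                for v in outs:
                    nxt[v] += share
        score = nxt
    return score

# Peer A's local page scores for www.cc.gatech.edu and the global source
# scores could then be combined, e.g. (one plausible rule, not the paper's):
#   final(p) = source_score[source_of(p)] * page_score[p]
```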

3. Demonstration Overview

We demonstrate the functionality and unique features of the DSphere design and walk through each component of the system. During the demonstration, a network of peers will be set up remotely (if a network connection is available) to illustrate the features of the system.

• We display the features of the DSphere decentralized P2P crawlers, including the methodology employed to resolve URL duplicates, the distribution of crawl jobs among a network of peers, the crawling speed, and the quality of the crawled pages. We also show the graphical user interface developed to monitor the performance of the system; an example panel of this tool is shown in Fig. 3.
• We display the indexing and search capabilities integrated into the P2P system, along with comprehensive performance data for these components. The indexer and search functions can be executed easily through the graphical user interface.
• We provide a detailed analysis of our source-based link analysis approach and demonstrate how it outperforms existing algorithms in terms of refresh rate and robustness to Web spam. We display the performance of this approach on documents crawled from several Georgia Tech domains, such as www.cc.gatech.edu.

4. Conclusions and Future Work

Engineering Web data presents a number of unique challenges to data management in terms of scale, speed of evolution, growth rate, inherent heterogeneity, and the high degree of linkage among data. The DSphere project is geared towards developing a fully decentralized P2P system for crawling, indexing and searching Web data. By distributing the crawling, indexing and link analysis jobs among a network of peers, we can provide much faster performance than conventional search systems. With decentralized crawling we are able to cover a much larger portion of the Web. By distributing the indexing functionality among a large number of peers, we greatly reduce the maintenance cost of such a large-scale search system while improving its efficiency. Furthermore, by exploiting source-based link analysis techniques, we provide faster and more robust ranking methods. Ongoing research and development efforts focus on efficient algorithms for full-text keyword search and on user-friendly GUI tools for visualizing the results of source-based link analysis.

Acknowledgement. This work is partially supported by the NSF CISE ITR, CSR, and CyberTrust programs, an AFOSR grant, an IBM SUR grant and an IBM Faculty Partnership award.

References

[1] Apache Lucene. http://lucene.apache.org/java.
[2] J. Caverlee, L. Liu, and W. B. Rouse. Source-based link analysis of the Web: A parameterized approach. Technical report, Georgia Institute of Technology, 2006.
[3] J. Caverlee, S. Webb, and L. Liu. Countering Web spam attacks using link-based analysis. Technical report, Georgia Institute of Technology, 2006.
[4] V. J. Padliya and L. Liu. PeerCrawl: A decentralized peer-to-peer architecture for crawling the World Wide Web. Technical report, Georgia Institute of Technology, May 2006.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Computer Science Department, Stanford University, 1998.
[6] A. Singh, M. Srivatsa, L. Liu, and T. Miller. Apoidea: A decentralized peer-to-peer architecture for crawling the World Wide Web. In SIGIR 2003 Workshop on Distributed Information Retrieval. ACM, August 2003.
