
Exploiting Replication in Peer-to-Peer Search over Distributed Digital Libraries

Christian Zimmer, Srikanta Bedathur, Christos Tryfonopoulos, and Gerhard Weikum
{czimmer, bedathur, trifon, weikum}@mpi-inf.mpg.de
Department for Databases and Information Systems, Max-Planck-Institute for Informatics, 66123 Saarbrücken, Germany

Abstract. Existing peer-to-peer (P2P) networks suffer from dynamics: high churn, with peers joining and leaving at unknown rates and often without notification, as well as high data dynamics, with new data appearing and older data disappearing. We develop replication strategies for the existing P2P search engine Minerva, where each participating peer or digital library manages its own local document collection. A distributed directory on top of a Chord DHT stores per-term summaries of all peers. Given this scenario of a conceptually global but physically distributed directory, we design different replication strategies that can be integrated into the query execution process. In this paper, we explore two algorithms: Successor-Directory Replication (SDR) and Top-Result Replication (TRR). Detailed experiments with varying parameters and under different conditions show that our proposed approaches can effectively handle dynamics in a P2P search engine.

1 Introduction

The peer-to-peer (P2P) paradigm has received increasing attention in the last few years and has become popular mainly in the context of file sharing systems such as KaZaA or Gnutella. Lately, P2P researchers have become particularly interested in distributed information retrieval, since a P2P approach allows handling huge amounts of data in a distributed and self-organizing manner. In such a setting, all peers are equal and all functionality is distributed among the participating peers to avoid any single point of failure. These properties offer enormous potential for powerful search capabilities in terms of scalability, efficiency, and resilience to failures and dynamics. Beyond that, a peer-to-peer search engine can benefit from the intellectual input of a large user community, e.g., user-specific bookmarks, query logs, click streams, and user profiles. Unstructured search architectures often use flooding-based query routing mechanisms and suffer from scalability issues and missing query results. On the other hand, structured P2P networks are limited to exact-match queries on single keywords. Recent scientific work [7, 16] has focused on extending such simple architectures in order to support more complex user queries. However, none of these approaches deals with text queries that consist of a variable number of keywords, and they are inappropriate for full-fledged web search where keyword queries should return a ranked result list of the most relevant matches.

A peer-to-peer search engine across a network of digital libraries has to address two conflicting challenges: (i) providing high-quality results in terms of precision and recall; (ii) enabling high scalability, so that the number of participating peers is unlimited and the system supports handling huge amounts of data. In an effort to satisfy these conflicting goals, we have proposed our search engine prototype Minerva [1]: the architecture, system design, and implementation of Minerva aim for high-quality results while ensuring high scalability. Our search system uses a DHT-based overlay network to efficiently manage compact metadata that digital libraries (or peers) publish about their local indexes. This metadata is exploited to select promising peers for a given query so as to limit the number of involved digital libraries. This way, network traffic is reduced and the system scales gracefully. As the underlying overlay network, we have implemented a Chord-style [13] DHT that allows efficient lookups. One fundamental issue of all distributed P2P networks is handling the two basic facets of dynamics: churn and data dynamics. Churn denotes the joining and leaving of peers at unknown rates, and at least the leaving often happens without notification. The P2P system has to handle churn by quickly updating the network structure. Data dynamics is also an important issue: new data arrivals, modifications, and the disappearance of other data may affect data availability and have a negative impact on user-perceived recall. These issues are connected to each other because each leaving or joining peer causes a decrease or increase of the overall data collection. To handle these aspects of dynamics (especially churn), replication strategies can help. In this paper, we present new ideas on replication in a P2P search engine, using our Minerva system to develop different replication approaches. The Minerva search engine uses two kinds of information: data information, i.e., the documents from the local data collections of each peer, and directory information, i.e., the statistics about the local data collections included in the distributed directory. Thus, our replication strategies include both replication of data and replication of directory information. The Successor-Directory Replication (SDR) approach disseminates the statistics to more than one directory peer such that the departure of a peer does not cause an inevitable loss of directory information. In combination with our Top-Result Replication (TRR) approach, where a querying peer includes the top results in its local data collection, these replication strategies increase the system's resistance against dynamics and churn. The rest of the paper is structured as follows: Section 2 discusses related work on structured P2P overlay networks, collection selection in distributed IR, and replication in P2P systems. The Minerva system architecture is explained in Section 3. Our replication approaches are presented in Section 4, and the experimental evaluation in Section 5. Finally, Section 6 concludes the paper and outlines future work.

2 Related Work

Recent research on structured P2P systems, such as Chord [13], is typically based on various forms of distributed hash tables (DHTs) and supports mappings from keys to locations in a decentralized manner such that routing scales well with the number of peers in the system.

Galanx [18] is a P2P search engine using the Apache HTTP server and BerkeleyDB. The Web site servers are the peers of this architecture; pages are stored only where they originate. In contrast, our approach leaves it to the peers to decide to what extent they want to crawl interesting fractions of the Web and build their own local indexes. Another system is PlanetP [6], a publish-subscribe service for P2P communities that supports content-ranking search. PlanetP distinguishes local indexes and a global index that describes all peers and their shared information. The global index is replicated using a gossiping algorithm. The system appears to be limited to a few thousand peers, although there are attempts to improve scalability by using a hierarchical network construction. Odissea [15] assumes a two-layered search engine architecture with a global index structure distributed over the nodes in the system. A single node holds the complete, Web-scale index for a given text term (i.e., keyword or word stem). The system outlined in [11] uses a fully distributed inverted text index in which every participant is responsible for a specific subset of terms and manages the respective index structures. Particular emphasis is put on minimizing the bandwidth used during multi-keyword searches. Later, OverCite [14] was proposed as a distributed alternative to the scientific literature digital library CiteSeer. This functionality was made possible by utilizing a DHT infrastructure to harness distributed resources (storage, computational power, etc.). In [9], different replication strategies in unstructured P2P networks are suggested: owner replication, path replication, and random replication. [4] compares two replication strategies, uniform and proportional replication, in an unstructured scenario and observes that the optimal strategy lies between the two. Beyond that, a uniform index caching mechanism (UIC), as suggested in [12], caches query results on all peers along the inverse query path. The DiCAS (Distributed Caching and Adaptive Search) approach [17] for unstructured networks distributes the cached results among neighboring peers, and queries are forwarded to peers with a high probability of providing the desired cached results, such that the network search traffic is reduced significantly at the same query success rate. [5] uses replication to increase the availability of shared data in weakly structured P2P systems.

3 System Architecture

Our Minerva system [2] is a fully operational distributed search engine consisting of autonomous peers, where each peer has a local document collection from its own (thematically focused) Web crawls or imported from external sources that fall into the user's thematic interest profile. The local data collection is indexed by inverted lists, one for each keyword or term (e.g., word stem), containing document identifiers such as URLs and relevance scores based on term frequency statistics. A conceptually global but physically distributed directory, which is layered on top of a Chord-style [13] distributed hash table (DHT), manages aggregated information about the peers' local knowledge in compact form. Unlike [8], we use the Chord DHT to partition the term space, such that every peer is responsible for the statistics and metadata of a randomized subset of terms within the directory. We do not distribute the actual index lists or even documents across the directory. The Chord DHT offers a lookup method to determine the peer that is responsible for a particular term. This way, the DHT allows very efficient and scalable access to the global statistics for each term.
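To make the term-to-peer assignment concrete, the following Python sketch illustrates a Chord-style, consistent-hashing lookup of the directory peer responsible for a term. It is a simplified illustration under our own assumptions (SHA-1 hashing, a 32-bit identifier ring, synthetic peer names), not Minerva's actual implementation.

```python
import hashlib
from bisect import bisect_left

def chord_id(key: str, bits: int = 32) -> int:
    """Map a term or peer name onto the Chord identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

class DirectoryRing:
    """Toy Chord ring: each term is assigned to the clockwise successor peer of its hash."""
    def __init__(self, peer_names):
        # sort peers by their position on the identifier ring
        self.ring = sorted((chord_id(p), p) for p in peer_names)
        self.ids = [pid for pid, _ in self.ring]

    def lookup(self, term: str) -> str:
        """Return the peer responsible for a term (first peer at or after its hash)."""
        idx = bisect_left(self.ids, chord_id(term))
        return self.ring[idx % len(self.ring)][1]  # wrap around the ring

ring = DirectoryRing(["peer-%d" % i for i in range(50)])
print(ring.lookup("replication"))  # directory peer responsible for this term
```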

3.1 Directory Maintenance

In the publishing process, each peer distributes per-term summaries (posts) of its local index to the global directory. The DHT determines the peer currently responsible for a term, and this peer maintains a peerlist of all posts for this term. Each post includes the peer's address together with statistics used to calculate IR-style measures for the term (e.g., the size of the inverted list for the term). These statistics are used to identify the most promising peers for a particular query. We use a time-to-live (TTL) technique that invalidates posts that have not been updated (or reconfirmed) for a tunable period of time.
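A minimal sketch of this post maintenance is shown below. The PeerList class and the statistics fields (df, list_size, max_tf) are hypothetical stand-ins for whatever a directory peer actually stores; only the TTL-based invalidation mirrors the mechanism described above.

```python
import time

class PeerList:
    """Toy directory entry: per-term posts with a time-to-live (TTL)."""
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self.posts = {}  # peer address -> (statistics, last-confirmed timestamp)

    def publish(self, peer_addr: str, stats: dict) -> None:
        # a new post (or a reconfirmation) refreshes the timestamp
        self.posts[peer_addr] = (stats, time.time())

    def active_posts(self) -> dict:
        """Drop posts that have not been reconfirmed within the TTL window."""
        now = time.time()
        self.posts = {p: (s, t) for p, (s, t) in self.posts.items()
                      if now - t <= self.ttl}
        return {p: s for p, (s, _) in self.posts.items()}

entry = PeerList(ttl_seconds=1800)
entry.publish("peer-7", {"df": 1200, "list_size": 1200, "max_tf": 48})
print(entry.active_posts())
```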

3.2 Query Execution

A multi-term query is processed as follows. In the first step, the query is executed locally using the peer's local index. If the user considers this local result unsatisfactory, the peer issues a peerlist request to the directory for each query term separately, looking up potentially promising remote peers. From the retrieved lists, a certain number of most promising peers for the complete query is computed (e.g., by simple intersection of the lists), and the query is forwarded to these peers. This step is referred to as query routing. For efficiency reasons, the query initiator can decide not to retrieve the complete peerlists, but only a subset.
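The routing step can be illustrated by the following sketch, which uses plain set intersection over hypothetical peerlists (with a union fallback when no peer covers all terms). Minerva additionally ranks candidate peers using the published statistics, which is omitted here.

```python
def route_query(query_terms, directory, max_peers=10):
    """Pick candidate peers for a multi-term query by intersecting
    the per-term peerlists retrieved from the directory."""
    peerlists = [set(directory[t]) for t in query_terms if t in directory]
    if not peerlists:
        return []
    candidates = set.intersection(*peerlists)   # peers covering all query terms
    if not candidates:                          # fall back to the union
        candidates = set.union(*peerlists)
    return sorted(candidates)[:max_peers]

directory = {"opera": ["p1", "p4", "p9"], "concert": ["p4", "p9", "p12"]}
print(route_query(["opera", "concert"], directory))  # -> ['p4', 'p9']
```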

[Figure 1 shows three stages over the Chord-DHT distributed directory of peers dir(a)-dir(e): Metadata Dissemination, where peers publish per-term metadata posts (meta(b)-meta(e)); Metadata Retrieval, where the query initiator for query "a,b" requests peerlist(a) and peerlist(b); and Query Execution, where the query is forwarded to the selected peers.]

Fig. 1. System Architecture with Directory Maintenance and Query Execution.

For scalability, the query originator typically decides to issue the query to a small number of peers based on a calculated benefit/cost ratio. Once this decision is made, the query is executed on each of the remote peers, using that peer’s local scoring/ranking method and top-k search algorithm. Each peer returns its top-k results (typically with k equal to 10 or 100), and the results are merged into a global ranking, using either the query originator’s statistical model or global statistics derived from directory information. Figure 1 shows the three steps of query execution including the dissemination of per-term summaries to the global directory.
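For illustration, the merging step can be sketched as follows. Combining scores for the same document by taking the maximum, and omitting cross-peer score normalization, are simplifying assumptions of this sketch rather than the system's actual merging model.

```python
from collections import defaultdict

def merge_results(per_peer_results, k=10):
    """Merge the local top-k lists returned by remote peers into one global
    ranking. Scores for the same document (e.g., a URL indexed by several
    peers) are combined by taking the maximum."""
    global_scores = defaultdict(float)
    for peer, ranked_list in per_peer_results.items():
        for doc_id, score in ranked_list:
            global_scores[doc_id] = max(global_scores[doc_id], score)
    merged = sorted(global_scores.items(), key=lambda kv: kv[1], reverse=True)
    return merged[:k]

results = {
    "p4": [("urlA", 0.91), ("urlB", 0.72)],
    "p9": [("urlB", 0.80), ("urlC", 0.64)],
}
print(merge_results(results, k=3))  # global top-3 across both peers
```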

3.3 Query Routing

The prior literature on query routing (or resource selection) has mostly focused on IR-style statistics about the document corpora (most notably, CORI [3], or the decision-theoretic framework of [10]). These techniques have been shown to work well on disjoint data collections, but are insufficient to cope with a large number of autonomous peers that crawl the Web independently of each other, resulting in a certain degree of overlap as popular information may be indexed by many peers. We have developed an overlap-aware query routing strategy based on compact statistical synopses to overcome this problem [1].
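One common way to build such compact synopses is min-wise hashing; the sketch below illustrates the general idea of estimating overlap between two peers' collections from small signatures. It is only an illustration of the concept and not necessarily the exact synopsis construction used in [1].

```python
import hashlib

def minhash_synopsis(doc_ids, num_hashes=64):
    """Compact synopsis of a peer's document identifiers: for each of
    num_hashes (salted) hash functions, keep only the smallest hash value."""
    synopsis = []
    for i in range(num_hashes):
        synopsis.append(min(
            int(hashlib.sha1(f"{i}:{d}".encode()).hexdigest(), 16)
            for d in doc_ids))
    return synopsis

def estimated_overlap(syn_a, syn_b):
    """Fraction of matching minima estimates the Jaccard similarity
    (resemblance) of the two underlying collections."""
    matches = sum(1 for a, b in zip(syn_a, syn_b) if a == b)
    return matches / len(syn_a)

a = minhash_synopsis({"url%d" % i for i in range(0, 800)})
b = minhash_synopsis({"url%d" % i for i in range(400, 1200)})
print(round(estimated_overlap(a, b), 2))  # roughly the true Jaccard of 1/3
```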

4 The Proposed Replication Strategies

Replication usually refers to the provision of redundant resources, e.g., documents, in the network to improve reliability and fault tolerance. In this section, we introduce two different replication approaches within the Minerva system architecture and focus on failure resilience and high availability, to avoid the loss of data (documents and directory entries) caused by leaving or failing peers. Other goals, including load balancing, performance or response time improvements, and query result improvement, remain to be investigated in the future. We distinguish between two types of replication: (i) data replication and (ii) directory replication. Data replication means that documents are replicated on more than one peer by adding them to different local datasets. Documents can be downloaded from other local data collections or proactively disseminated to other peers. Directory replication aims at the replication of peerlists, such that more than one peer is responsible for a certain term. Whenever a peer disappears without notifying the directory, the directory entries of this peer and, of course, its local documents are lost. The replication strategies proposed below try to avoid the loss of both kinds of data: we need to replicate directory information as well as documents.

4.1 Successor-Directory Replication (SDR)

The directory replication approach focuses on storing the statistical entries of peerlists on more than one directory peer. Possible approaches for achieving this property are:

– Building up two or more DHT networks and posting the per-term statistics to all of them. If a lookup in the first DHT yields no result, the initiating peer can query the other networks until a lookup succeeds. The main disadvantage is the overhead required to manage all these data structures;
– Using more than one lookup method in the same distributed network. Here, the overhead of managing several DHTs disappears, but each lookup needs a different routing involving more peers;
– Storing all directory entries additionally at the successors. This approach exploits the advantages of consistent hashing: each directory peer receiving a statistical per-term entry forwards it to its direct successors. In this scenario, when a directory peer fails, its successor is already responsible for the term and the querying lookup does not need to reroute the request messages.

Our Successor-Directory Replication (SDR) approach uses the latter strategy to replicate the directory. The replication degree determines the number of successors and thus the number of replicated directory entries. This approach limits the replication message overhead in comparison to the other strategies.
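The following Python sketch illustrates the SDR publishing step under simplifying assumptions: the responsible peer is found with a plain hash instead of a real DHT lookup, and directory peers are modeled as dictionaries. It only conveys the idea of storing a post at the responsible peer plus its next successors.

```python
def publish_with_sdr(term, post, ring_peers, replication_degree):
    """Store the post for a term at the responsible directory peer and at its
    next `replication_degree` successors on the ring. `ring_peers` are toy
    directory peers ordered by ring position; the responsible index would
    normally be found via a DHT lookup."""
    responsible_index = hash(term) % len(ring_peers)        # stand-in for lookup()
    holders = []
    for offset in range(replication_degree + 1):            # responsible peer + successors
        peer = ring_peers[(responsible_index + offset) % len(ring_peers)]
        peer.setdefault(term, []).append(post)               # store the post
        holders.append(peer)
    return holders

peers = [dict() for _ in range(8)]                           # toy directory peers
publish_with_sdr("replication", {"peer": "p7", "df": 1200}, peers, replication_degree=2)
print(sum(1 for p in peers if "replication" in p))           # -> 3 copies (SDR-2)
```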

4.2 Top-Result Replication (TRR)

For data replication, we follow the Top-Result Replication (TRR) strategy. The obvious solution for perfect replication would be for each peer to store every document in the network. In this case there is no data loss, but the communication and storage costs would explode. For this reason, we need a strategy that balances replication cost against the gain in availability. According to this strategy, we choose to replicate only the important documents, i.e., the results of queries. We argue that peers are only interested in result documents, which means that the loss of unimportant documents does not seriously affect the retrieval performance of the system. Our approach works as follows: a querying peer merges the local result lists for a query and computes a final list of top documents for the query request. These are the interesting documents; the querying peer downloads them and includes them in its local data collection. This way, if a peer leaves the network suddenly, the query processing algorithm is still able to deliver the top document results. However, this result replication needs a controlling mechanism to prevent excessive and unnecessary replication. One solution is to count the document occurrences in the system and to limit them to a predefined value, but this requires additional messaging effort and additional statistics. Our approach instead approximates the number of occurrences by analyzing the local result lists: if a document is already included in the local results of a certain (small) number of peers, it is not replicated by the querying peer even though it is a top result for the request.
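A minimal sketch of this behavior is given below; the occurrence_limit parameter is a hypothetical stand-in for the predefined occurrence value mentioned above, and the actual document download is abstracted away.

```python
def top_result_replication(merged_top_docs, local_result_lists,
                           local_collection, occurrence_limit=3):
    """The querying peer adds the merged top documents to its own collection,
    but skips documents that already appear in the local result lists of
    `occurrence_limit` or more peers, approximating how widely replicated
    the document already is."""
    for doc_id in merged_top_docs:
        occurrences = sum(1 for peer_list in local_result_lists.values()
                          if doc_id in peer_list)
        if occurrences < occurrence_limit:
            local_collection.add(doc_id)      # download and index the document
    return local_collection

lists = {"p4": {"urlA", "urlB"}, "p9": {"urlB", "urlC"}, "p12": {"urlB"}}
print(top_result_replication(["urlA", "urlB", "urlC"], lists, set(), 3))
# urlB is already held by 3 peers and is therefore not replicated
```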

5 Experimental Evaluation

For the experimental evaluation of our replication approaches, we used a Web data collection from a focused crawl containing documents from different categories. Our experiments investigated Successor-Directory Replication (SDR) and Top-Result Replication (TRR) and showed that these approaches can deal with dynamics in P2P systems.

5.1 Experimental Setup

The Web data collection contains 253,875 documents from a focused web crawl. Each document is assigned to one of 10 categories, e.g., Travel, Finance, or Sports. The smallest category has about 18,000 documents, the largest about 35,000. There are more than 700,000 different terms in all the documents. In all experimental series, we used 50 digital library peers (5 peers per category), each containing 1,000 random documents from one category. To avoid overly specialized peers and unrealistic results, we added another 1,000 random documents from all categories to each peer's document set. We extracted 20 queries from the documents whose terms are strong representatives of the categories (the terms appear very frequently in documents of one category and infrequently in documents of the other categories). All queries contain two, three, or four query terms; example queries are biology institute, quantum einstein atom model, or opera concert voice. For all 20 queries we computed the ideal results as the result of a local query execution by a centralized search engine whose data collection includes all documents from the 50 peers in the network. The size of the ideal results is limited to an experiment-specific parameter I. As the result quality measure, we use the relative recall, i.e., the fraction of ideal result documents included in the results of the P2P query processing. Besides the size I of the ideal results, the P2P result and the relative recall are influenced, among others, by two parameters: (i) the maximum number L of local results a peer returns to the querying peer, and (ii) the maximum number G of global results the querying peer returns as query results.
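For illustration, relative recall can be computed as in the following sketch; the identifiers and numbers in the example are made up and are not taken from our experiments.

```python
def relative_recall(p2p_results, ideal_results, I=50):
    """Relative recall: fraction of the top-I ideal (centralized) results
    that also appear in the P2P result list."""
    ideal_top = set(ideal_results[:I])
    return len(ideal_top & set(p2p_results)) / len(ideal_top)

ideal = ["url%d" % i for i in range(100)]       # centralized (ideal) ranking
p2p = ["url3", "url7", "url120", "url42"]       # distributed query result
print(relative_recall(p2p, ideal, I=50))        # -> 0.06 (3 of the top 50 found)
```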

5.2 Experimental Results

For this experiment, we fixed the parameters for the maximum number of ideal, local, and global results (I = 50, L = 50, and G = 100), based on previous experimental series. We investigated the relative recall depending on the number of inactive peers in the system. In all query executions, at most 10 peers return their local results to the query initiator. All values shown in the figures represent the average of five experimental runs with random inactive peers. Figure 2(a) shows the scenario where we use only Successor-Directory Replication without Top-Result Replication. Here, the peer statistics are replicated using the successor strategy with a given replication degree. The curve for SDR-0 means that only the peer responsible for a term stores the per-term statistics, and SDR-2 denotes that, in addition, the two successors store the directory entry. Using the resulting replicated directory, we executed the queries and obtained the relative recall for the complete network. The following rounds consider the sudden disappearance of some peers in the network, such that the directory entries and the documents of these peers are lost. The Chord overlay network ensures proper routing behavior. In the experiment shown in Figure 2(b), we consider only Top-Result Replication, ignore any kind of directory replication, and proceed as follows: all peers publish their statistics to the directory, and subsequently a certain number of random peers execute the queries and add the top-result documents to their local document sets, such that the most important documents are replicated on more than one peer. TRR-0 means that there is no replication, and TRR-3 denotes that three peers have executed the queries and replicated the top documents. Before the queries are executed, we assume that the directory statistics have already been refreshed. In Figure 2(c), we combine the two replication approaches using the same replication degree. The curve TRR-1 / SDR-1 denotes that we have replicated the top-result documents on one peer and stored each directory entry on the next successor as well. Again, our experiments show the relative recall for different fractions of inactive peers in the system, up to 60%.

[Figure 2 contains four plots of average relative recall (y-axis, 0 to 1) over the percentage of inactive peers (x-axis, 0 to 60%): (a) SDR, with curves SDR-0 to SDR-5; (b) TRR, with curves TRR-0 to TRR-5; (c) SDR & TRR, with combined curves SDR-0 / TRR-0 to SDR-5 / TRR-5; (d) Enhanced SDR & TRR, with the same combinations.]

Fig. 2. Relative Recall using different Replication Strategies.

The experimental series concerning our replication approaches showed interesting outcomes:

– If we consider only the Successor-Directory Replication approach, we see that a higher replication degree avoids a strong decrease in relative recall. But there are limits imposed by the loss of documents: even if we fully replicate the directory such that no directory information is lost, we still miss suitable documents because the peers holding these documents have left the system.
– On the other hand, Top-Result Replication shows promising results but cannot prevent recall loss either. By replicating the important documents we can still reach them, because they remain available on active peers, but the directory cannot tell us where to find them. So, just replicating documents cannot avoid a decrease in relative recall.
– The combination of the two replication approaches addresses both directory loss and document loss. The curve for TRR-5 / SDR-5 shows that even if 60% of the peers are inactive, we reach a relative recall of over 40%.

In our experimental setting, but also in a real-world setting, the number of inactive peers causes a corresponding decrease in the number of resource providers that respond to a query. For example, in the case where 60% of the peers are inactive, when asking a maximum of ten peers for their local query results we can expect that only about four of them are still active and return their local results. There are two possible solutions, but both contradict our experimental settings: (i) we ask only active peers, but this would require asking peers until ten of them have responded to our request; under this solution, a parallel query execution would no longer be possible; (ii) the directory information about live peers is kept very accurate, even if peers suddenly leave the DHT, but this would require a high update frequency leading to higher network costs. To avoid these assumptions, we conducted an experiment in which the network learns the expected number of inactive peers from previous query executions. This way, the query-initiating peer can choose how many peers to contact so that it can expect local results from 10 peers. For example, if 20 random peers out of the 50 participating peers are inactive, the querying peer needs to send the query request to approximately 17 peers to expect 10 peers answering with their local results. In Figure 2(d), the relative recall is shown for the same setup as in Figure 2(c), but now we increase the number of requested peers depending on the number of inactive peers in previous query executions, such that we expect 10 peers to deliver their local results. Again, all recall values are averaged over 5 runs with random peers leaving the network. As expected, our strategy of replicating directory entries and documents at the same time results in an almost constant recall level for SDR-5 / TRR-5. To sum up our experimental study, we have observed that only the combination of replicating directory entries and documents avoids a strong decrease in relative recall. The Top-Result Replication approach is suitable for avoiding the replication of unimportant documents, whereas accurate directory entries help us locate the peers holding the requested documents.
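The number of requested peers can be derived with a straightforward expectation argument, sketched below; this is an illustration of the calculation, not necessarily the exact estimator used in the experiments.

```python
import math

def peers_to_contact(target_responders, total_peers, inactive_peers):
    """Estimate how many peers must be asked so that, in expectation,
    `target_responders` of them are still active and return local results."""
    active_fraction = (total_peers - inactive_peers) / total_peers
    return math.ceil(target_responders / active_fraction)

# 20 of 50 peers inactive: contact ~17 peers to expect 10 local result lists
print(peers_to_contact(10, 50, 20))   # -> 17
```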

6 Conclusion and Future Work

In this paper, we have presented a study of replication strategies for a P2P search engine over distributed digital libraries. Our replication approaches try to avoid the loss of information caused by failing peers. We distinguished the loss of routing knowledge due to lost directory entries from the loss of the actual documents due to the failure of the providing peers. The combination of our two replication approaches, Successor-Directory Replication (SDR) and Top-Result Replication (TRR), avoids a dramatic decrease in relative recall even if 60% of the participating peers become inactive. Future work needs to adapt the system behavior, i.e., the replication degree, to the network dynamics: if peers fail at a low rate, the system can decrease the replication degree, whereas a network with higher dynamics needs a higher replication degree. To enhance this approach, we can also consider other replication strategies based on document frequencies in the whole network. In all cases, the user behavior has a high influence on the selected strategy. Finally, utilizing replication to achieve load balancing is an interesting direction that was not addressed here. This goal can be achieved by applying different replication strategies, e.g., replicating frequently requested directory entries on more directory peers than infrequently requested ones.

References

1. Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer. Improving Collection Selection with Overlap-Awareness. In SIGIR, 2005.
2. Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer. MINERVA: Collaborative P2P Search. In VLDB, 2005.
3. James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching Distributed Collections with Inference Networks. In SIGIR, 1995.
4. Edith Cohen and Scott Shenker. Replication Strategies in Unstructured Peer-to-Peer Networks. In SIGCOMM, 2002.
5. Francisco Matias Cuenca-Acuna, Richard P. Martin, and Thu D. Nguyen. Autonomous Replication for High Availability in Unstructured P2P Systems. In SRDS, 2003.
6. Francisco Matias Cuenca-Acuna, Christopher Peery, Richard P. Martin, and Thu D. Nguyen. PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing Communities. In HPDC, 2003.
7. Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, and Ion Stoica. Querying the Internet with PIER. In VLDB, 2003.
8. Jinyang Li, Boon Thau Loo, Joseph M. Hellerstein, M. Frans Kaashoek, David R. Karger, and Robert Morris. On the Feasibility of Peer-to-Peer Web Indexing and Search. In IPTPS, 2003.
9. Qin Lv, Pei Cao, Edith Cohen, Kai Li, and Scott Shenker. Search and Replication in Unstructured Peer-to-Peer Networks. In ICS, 2002.
10. Henrik Nottelmann and Norbert Fuhr. Evaluating Different Methods of Estimating Retrieval Quality for Resource Selection. In SIGIR, 2003.
11. Patrick Reynolds and Amin Vahdat. Efficient Peer-to-Peer Keyword Searching. In Middleware, 2003.
12. Kunwadee Sripanidkulchai. The Popularity of Gnutella Queries and its Implications on Scalability.
13. Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In SIGCOMM, 2001.
14. Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David R. Karger, Robert Morris, and Scott Shenker. OverCite: A Cooperative Digital Research Library. In IPTPS, 2005.
15. Torsten Suel, Chandan Mathur, Jo-wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, and Kulesh Shanmugasundaram. ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval. In WebDB, 2003.
16. Peter Triantafillou and Theoni Pitoura. Towards a Unifying Framework for Complex Query Processing over Structured Peer-to-Peer Data Networks. In DBISP2P, 2003.
17. Chen Wang, Li Xiao, Yunhao Liu, and Pei Zheng. Distributed Caching and Adaptive Search in Multilayer P2P Networks. In ICDCS, 2004.
18. Yuan Wang, Leonidas Galanis, and David J. DeWitt. GALANX: An Efficient Peer-to-Peer Search Engine System.
