Understanding the Practical Limits of the Gnutella P2P System: An Analysis of Query Terms and Object Name Distributions William Acosta Surendar Chandra University of Notre Dame {wacosta, surendar}@nd.edu ABSTRACT A number of prior efforts analyzed the behavior of popular peer-to-peer (P2P) systems and proposed ways for maintaining the overlays as well as methods for searching for contents using these overlays. However, little was known about how successful users could be in locating the shared objects in these systems. There might be a mismatch between the way content creators named objects and the way such objects were queried by the consumers. Our aim was to examine the terms used in the queries and shared object names in the Gnutella file-sharing system. We analyzed the object names of over 20 million objects collected from 40,000 peers as well as terms from over 230,000 queries. We observed that almost half (44.4%) of the queries had no matching objects in the system regardless of the overlay or search mechanism used to locate the objects. We also evaluated the query success rates against random peer groups of various sizes (200, 1K, 2K, 3K, 4K, 5K, 10K and 20K peers sampled from the full 40,000 peers). We showed that the success rates increased rapidly from 200 to 5,000 peers, but only exhibited modest improvements when increasing the number of peers beyond 5,000. Finally, we observed Zipf-like distributions for the query terms and the object names. However, the relative popularity of a term in the object names did not correlate with the term’s popularity in the query workload. This observation affected the ability of hybrid P2P systems to guide searches by creating a synopsis of the peer object names. A synopsis created by using the distribution of terms in the object names need not represent relevant terms for the query. Our results can be used to guide the design of future P2P systems that are optimized for the observed object names and user query behavior.
Keywords: Peer-to-Peer, P2P, Query Workload, Shared File Distribution, Query Success Rate
1. INTRODUCTION P2P systems are popular mechanisms for distributing content among a large number of users. The effectiveness of these systems depends on several factors: a) the annotations provided by the content providers to describe the objects, b) the queries issued by the users against these annotations, c) an overlay created by the P2P system that keeps track of the various content providers and users, and d) the search mechanism used to locate the shared objects. For example, structured P2P systems such as Chord,1 Pastry,2 and CAN3 use a distributed hash table mechanism to maintain the overlay (factor c) and route the queries (factor d). On the other hand, unstructured P2P systems4, 5 build the overlay among peers (factor c) and use mechanisms such as flooding, random walks,6 or hybrid approaches7 to locate the contents (factor d). The effectiveness of P2P search mechanisms is limited by the ability to match the annotations created by content providers with the queries from consumers who are searching for these contents. Searches in P2P systems can fail when the system does not have any contents that match users’ interests (e.g., searching for a top 10 song in Gnutella is likely more fruitful than searching for a song by a little known artist) or when a mismatch occurs between the contents’ annotations and the queries issued by users (e.g., the content provider named the song using only the last name of the artist while the query searched by the first name of the artist). The former situation can potentially be solved when content providers share more objects that are of interest to users. The latter condition can be addressed with better annotations. Peers are independent and do not adhere to a single naming convention. Systems such as iTunes8 provide a richer variety of annotations (e.g., name, kind, album, genre, etc.)9 while systems such as Gnutella and Chord use a single name that encodes the description or identifier of the object.
In this work, we analyzed the query and object distributions of data captured from the Gnutella network. The Gnutella network remains popular; from 2004 to 2006, Gnutella quadrupled in size.10 For our analysis, we collected about 20 million object names from over 40,000 peers. We also collected over 230,000 queries that were routed through our data collection peer. The data was collected in October, 2006 as well as in April, 2007. For our analysis, we considered the efficacy of queries against all the objects. The behavior of actual P2P systems will be limited by their ability to query all
the objects using a specific overlay and search mechanism. For example, a TTL bounded flood only accesses a fraction of the peers based on the TTL value used. Hence, our analysis gives an upper bound on the system behavior. Our analysis showed that 44.4% of the queries could not be resolved even if we forwarded the queries to a large number of peers (40K peers sharing over 20M objects in our study). We sampled the full 40,000 peer data set to generate peer groups of various sizes (200, 1K, 2K, 3K, 4K, 5K, 10K and 20K peers) and evaluated the queries against these peer groups. Our results showed that the success rate increased consistently as the peer group size grew from 200 to 5,000 peers, but exhibited only a modest increase beyond 5,000 peers. Using just 12% of the peers (5,000 of the 40,000) lowered the query success rate only modestly, by about 15% (from 56% to 40%). Next we analyzed the number of matching objects for a particular query as well as the distribution of these objects. Our analysis showed that over 70% of the queries resulted in fewer than ten matching objects (among the 40,000 nodes queried). Fewer than 3.5% of the queries had matching objects on more than 1% of the nodes. These observations have important implications for the choice of structured, unstructured or hybrid systems for a given application scenario. Hybrid P2P systems7, 11, 12 use a synopsis of shared objects at each peer in order to choose structured or unstructured query mechanisms. The relationship between object annotations and query terms has an impact on the efficacy of the synopsis. We analyzed the annotations of the shared objects and showed that there was a large discrepancy between the terms used in queries and the distribution of keywords in the object annotations. The popularity of file terms and query terms followed a Zipf-like distribution. However, the relative popularity of a term in the object names did not correlate well with its popularity in the query workload.
Our results showed that for the top 10,000 terms in the object annotations, the mean rank for the term in the query workload was 1,876 positions lower than in the object annotations. Terms that occurred frequently in the object annotations were, on average, less popular in the query workload. This discrepancy affects the ability to create a synopsis used to guide searches. A synopsis created by examining only the object annotations may not represent terms that are relevant to queries. These observations can be incorporated into future enhancements of P2P systems. For example, to improve search response time, a synopsis can be created to provide one-hop routing and thus bypass both the structured and unstructured searches in hybrid P2P systems. Such a synopsis should be created by analyzing the relative popularity of the query terms and the object names. This would allow the synopsis to minimize the number of terms that need to be maintained and thus increase its effectiveness, since terms that appear in the enhanced synopsis would be representative of the terms expected in the query workload. In our recent work,13 we employ this approach to create a query-adaptive synopsis that adapts the contents of the synopsis to track changes in relative popularity of query and file terms. The rest of this paper is organized as follows. In Section 2 we place our contributions in the context of related work. Section 3 then describes the questions our analysis intends to answer and our experimental methodology. We then present our results in Section 4 and provide concluding remarks in Section 5.
2. RELATED WORK
Loo et al.7 used a query trace based on captured Gnutella traffic to motivate the use of hybrid approaches for search in P2P networks. Their trace was generated by capturing query traffic at 30 Gnutella ultrapeers and contained over 670K objects and 230K queries. The set of objects in the system was extracted from the query results. They selected 700 queries and replayed them through Gnutella several times from different ultrapeers to analyze Gnutella’s search performance. They showed that 6% of queries received no results. Because only queries with responses were selected to be replayed, the 6% of queries with no results represent a failure of Gnutella to locate the desired objects. Their analysis does not take into account the fact that some queries may have no matching objects in the system at all. By contrast, our approach to evaluating query performance is independent of the P2P overlay or search mechanism used and thus provides a more complete representation of the system’s query behavior. Prior studies examined search performance for rare objects. They considered objects with few or no replicas to be rare objects. Qiao et al.14 created objects with 50-byte long random string filenames and inserted them into the Gnutella network, thus guaranteeing that each object had no other replicas in the system. The authors found that Gnutella achieved only a 1.5% success rate for finding these objects. However, an object may have few replicas and yet not be considered rare if it matched many queries. For example, an object named “britney-spears-live-concert-for-my-cousinsbirthday.mp3” is likely a unique object in the system. However, any search that contains the terms “britney spears concert” will match this object. Rather than focus on whether the system can locate an individual object efficiently, we examined actual queries issued by users to determine if the query targeted a rare or popular object. Determining the popularity distribution of objects is difficult. In Gia,15 the authors used the number of download requests for an object as a measure of its popularity to claim that “most queries are for hay, not needles”. The authors captured query traffic in Gnutella and offered for download the objects that matched the top 50 queries seen. They showed that more than 50% of the requests were for objects with more than 100 replicas. This approach can evaluate the popularity of requested objects. However, it does not give insight into the popularity of objects with respect to the query workload. Our analysis examined the distribution of matching objects for each query and showed that only 17% of the queries had more than 100 matching objects. Gummadi et al.16 note that file popularity distributions in the Kazaa P2P file-sharing system do not follow a Zipf distribution. However, the authors’ analysis was based on the number of downloads for a given file. They note that these systems exhibit fetch-at-most-once behavior for downloading files, which results in non-Zipfian popularity distributions. Our analysis was focused on the distribution of files across peers rather than on the number of downloads, as this provided a means to determine how many files were available in the system to match each query. Saroiu et al.17 presented an early study of user behavior in Gnutella. They found that most users were connected on low capacity network connections and that the structure of the original Gnutella overlay followed a power law degree distribution. Following this study, the Gnutella developers created a new version of the protocol (v0.6).4
The new protocol defined an explicit hierarchy whereby high capacity peers served as ultrapeers and low capacity peers acted as leaf peers. In the new architecture, the ultrapeers were responsible for most of the query routing burden while leaf peers, due to their low capacity, performed little or no query routing. Stutzbach et al.18 examined the modern Gnutella overlay and found that although it was growing, only a small set of peers (ultrapeers) were long lived and thus formed a stable core of peers for the overlay. Our prior research19 showed that over a four year period the Gnutella network evolved to adopt the new v0.6 protocol; however, the new architecture had little net benefit on bandwidth utilization or query success rate. Because the new architecture still suffered from poor query performance and high query-routing bandwidth utilization, we were interested in determining the practical limits of query performance. Zhao et al.20 analyzed the Gnutella network and found that file popularity followed a Zipf distribution. Our study differs in that we were interested in the properties of the shared objects and query workload from Gnutella and identified how the relationship between them affects overall query performance. We use our findings to propose a new informed search mechanism that can exploit the shared object and query workload properties to improve search performance. Pucha et al.21 examined the byte-level similarity of files shared by clients in BitTorrent.22 They found that byte-level similarity existed for the popular file types even though the file names do not indicate that the objects are identical. The authors exploit this similarity to improve download performance for BitTorrent. Our work differs from Pucha’s work in that we sought to examine how the contents were located and how many queries had matching contents.
Because search mechanisms could only rely on the terms used in the file annotations, our primary concern was to identify the relationship between the terms used to locate the contents and the terms used to describe the contents. Before a mechanism can exploit the similarity of shared file data for improved download (Pucha’s work), the query must locate matching contents (our work).
3. EXPERIMENTAL METHODOLOGY AND SETUP In the following sections we describe the questions that our analysis answers, our experimental setup and methodology for evaluating queries.
3.1 Questions Answered Question 1: Are frequently occurring terms in the object filenames also popular search terms? We were interested in determining whether the popularity rank of a term in one distribution correlated to its rank in the other. Question 2: What is the average practical limit for query success rates? We were interested in determining the percentage of queries that were answerable. Question 3: Does the query success rate scale with the size of the network? Querying more peers should reach more unique objects and in turn increase the overall query success rate.
Question 4: Do users typically issue queries for popular objects? Prior studies assumed that many queries were for popular objects. We were specifically interested in quantifying the number of queries for popular objects. Question 5: When the query reached more peers, what proportion of the relevant objects available in the system were returned for each query (i.e., recall)? We were interested in determining whether reaching a larger subset of nodes increased the number of matching objects returned. Question 6: How do queries and objects with non-ASCII characters affect the system’s behavior? We were interested in determining the prevalence of queries and objects with non-ASCII characters and whether they affected the overall performance of searches.
3.2 Data Capturing Methodology In this section we describe our data capturing methodology. First we describe the process for discovering shared objects in the system. We then provide details on our query capturing client. 3.2.1 Discovering Shared Objects We developed a file crawler that crawled the Gnutella network and queried each peer for a list of shared objects. The crawler first performed a topology crawl to discover peers connected to the Gnutella network. It then performed a file crawl by connecting to each discovered peer and requesting a list of its shared files. Each responding peer returned a list of shared objects that contained the filename and file size for each object. Our crawl was performed in October 2006 and discovered 18.6M objects shared by 41,910 peers. We performed an additional crawl in April 2007 and discovered over 21M objects shared by 37,572 peers. 3.2.2 Capturing Query Traffic We modified the source code of Phex,23 an open source Gnutella client written in Java, to log traffic on the network. The client connected to the network and logged every incoming and outgoing query that passed through it. It should be noted that the query-capturing client does not actively probe the system; rather, only queries and query responses that were routed through it were logged. We simultaneously ran the file discovery crawl and the query traffic capturing to ensure an accurate representation of the system’s state. We captured queries for 30 days from mid-September to mid-October 2006 and found that the mean number of queries per day was approximately 200K. In April 2007, we captured 3 days’ worth of queries and found similar results. In this paper, we present results based on 24 hour traces from October 1, 2006 and April 26, 2007. 3.2.3 Handling Multi-Lingual Queries and File Names In Gnutella, all object names and all transmitted queries were represented using UTF-8.
UTF-8 is a multi-byte encoding of Unicode.24 UTF-8 encodes ASCII characters using a single byte.
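The single-byte property for ASCII can be illustrated with a short sketch. The helper names and the sample strings below are hypothetical and are not taken from our data set; the sketch only shows how one might flag queries or filenames that contain non-ASCII characters.

```python
def utf8_length(text: str) -> int:
    """Number of bytes needed to encode `text` as UTF-8."""
    return len(text.encode("utf-8"))


def has_non_ascii(text: str) -> bool:
    """True if at least one character lies outside the 7-bit ASCII range."""
    return not text.isascii()


# Hypothetical sample strings: pure ASCII, Latin with an accent, and CJK.
samples = ["madonna", "beyonc\u00e9", "\u97f3\u697d"]

for s in samples:
    # ASCII-only strings need exactly one byte per character;
    # other characters need two or more bytes each.
    print(repr(s), len(s), utf8_length(s), has_non_ascii(s))
```

For ASCII-only input the character count and the byte count coincide, so comparing the two is an equivalent way to detect non-ASCII content.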
3.3 Query Evaluation Methodology A key aspect of our approach was to remain agnostic to both overlay topology and search mechanism in evaluating the system’s ability to resolve queries. Rather than using an overlay with a search algorithm to resolve queries, queries were resolved using an oracle with perfect knowledge of all objects in the system. For any given query, the oracle returned all objects in the dataset that matched all the terms in the query. This approach allowed our evaluation to remain agnostic to both overlay topology and P2P search mechanism.
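The oracle described above can be sketched as an inverted index over all filenames, where a query matches an object only if every query term appears in the object's name. This is a minimal illustration and not our actual evaluator: the tokenizer (lowercasing and splitting on non-word characters) is a simplifying assumption rather than the tokenization specified by the Gnutella protocol, and all function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase and split on runs of non-word characters
    # or underscores; the real protocol tokenization may differ.
    return {t for t in re.split(r"[\W_]+", text.lower()) if t}


def build_index(filenames):
    """Map each term to the set of object ids whose name contains it."""
    index = defaultdict(set)
    for obj_id, name in enumerate(filenames):
        for term in tokenize(name):
            index[term].add(obj_id)
    return index


def oracle_match(index, query):
    """Return ids of all objects that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def success_rate(index, queries):
    """Fraction of queries with at least one matching object."""
    hits = sum(1 for q in queries if oracle_match(index, q))
    return hits / len(queries)


files = [
    "Britney-Spears-Live-Concert.mp3",
    "spears_greatest_hits.mp3",
    "holiday photo.jpg",
]
index = build_index(files)
print(oracle_match(index, "britney spears concert"))
print(success_rate(index, ["britney spears concert", "spears jpg", "spears"]))
```

Because the index covers every object regardless of which peer holds it, the resulting success rate does not depend on any overlay or routing decision.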
4. EXPERIMENTAL RESULTS First we present an analysis of the terms that occurred in queries and object annotations and compare the relative popularity distributions of these terms. We then show our analysis of query evaluation and conclude with an analysis of the multi-lingual characteristics of both queries and shared objects.
Figure 1. Relationship between the query terms and the file terms. Only queries whose terms were in M = F ∩ Q were matched.
Figure 2. Boundary conditions for M: (a) M = ∅; (b) M = Q. If M = ∅, no queries were matched. If M = Q, then there is a potential to match all queries. Because Q is not static, M also varies with time.
4.1 Question 1: File Annotations and Query Term Analysis We were interested in understanding the annotations of the shared objects as they relate to the terms used in queries. The nature of this relationship affects the ability of queries to be resolved against matching objects effectively. For a given set of files shared by peers and a query trace for a specified time interval, we denote the set of terms that appear in annotations of the shared files as F, and the set of terms that appear in the query trace as Q. F was expected to be relatively static, while Q was expected to vary over time and to depend on the specific sampling interval. The set of shared matching terms M = F ∩ Q is the intersection of file and query terms at a given time (illustrated in Figure 1). If M = ∅, then no queries could be matched because no files exist with terms that matched the terms used in the queries (illustrated in Figure 2(a)). If Q ⊆ F, then M = Q and thus all queries could potentially be matched (illustrated in Figure 2(b)).
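The sets F, Q and M defined above map directly onto set operations. The following sketch uses hypothetical filenames and queries, and a simplified tokenizer that is an assumption rather than the protocol's own. Note that having all terms in M is only a necessary condition for a match: the terms must also co-occur in a single filename.

```python
import re


def terms(strings):
    """Collect the set of distinct terms across a list of strings."""
    out = set()
    for s in strings:
        # Assumption: lowercase, split on non-word characters/underscores.
        out.update(t for t in re.split(r"[\W_]+", s.lower()) if t)
    return out


# Hypothetical toy data standing in for the crawl and the query trace.
filenames = ["led_zeppelin-stairway.mp3", "holiday photo.jpg"]
queries = ["stairway to heaven", "zeppelin mp3"]

F = terms(filenames)   # terms appearing in file annotations
Q = terms(queries)     # terms appearing in the query trace
M = F & Q              # shared matching terms, M = F ∩ Q


def potentially_matchable(query):
    """A query can only match if all of its terms lie in M."""
    q = terms([query])
    return bool(q) and q.issubset(M)


print(sorted(M))
print([potentially_matchable(q) for q in queries])
```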
Figure 3. Relationship between the popular query terms Q′ and the popular file terms F′. Only queries whose terms were in M′ = F′ ∩ Q′ were matched.
Unlike uninformed search mechanisms such as flooding and random walks, newer search mechanisms make informed query routing decisions.11, 12, 25, 26 For example, an informed search mechanism can consult a summary (synopsis) of the content for each neighbor and only forward the query to those neighbors that have terms in their synopsis that match the query. To remain scalable and distribute the synopsis effectively to many peers, the synopsis size must be bounded. Because the size of the synopsis is bounded, its ability to resolve queries effectively depends on the terms it contains. Typical systems build the synopsis using only the popular file terms at each peer. However, creating a suitable synopsis requires examining both the file terms and query term distributions. Consider a synopsis created with the set of popular
file terms F′. Queries routed using this synopsis would only be resolved if the query terms appear in the synopsis and thus intersect with F′ (F′ ∩ Q). Popular query terms Q′, by definition, appear in many queries. Therefore, a synopsis consisting of the terms in F ∩ Q′ has the ability to resolve many queries because it contains the terms that occur in the popular queries. However, current models only consider the popular file terms F′ when creating the synopsis. Because only queries whose terms are in the intersection M′ = F′ ∩ Q′ can be resolved (illustrated in Figure 3), it is important to understand the relationship between F′ and Q′.
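To make the difference between the two ways of filling a bounded synopsis concrete, the sketch below compares a synopsis built from the most popular file terms with one built from the file terms that are most popular in the query workload. The term counts, the bound k and all function names are hypothetical, and the sketch is not the query-adaptive synopsis of our related work; it only illustrates why the choice of terms matters.

```python
from collections import Counter


def top_k(counts, k):
    """The k most frequent terms of a Counter, as a set."""
    return {term for term, _ in counts.most_common(k)}


def synopsis_file_only(file_counts, k):
    """Bounded synopsis holding only the popular file terms."""
    return top_k(file_counts, k)


def synopsis_query_aware(file_counts, query_counts, k):
    """Bounded synopsis holding file terms ranked by query popularity."""
    shared = Counter(
        {t: c for t, c in query_counts.items() if t in file_counts}
    )
    return top_k(shared, k)


def coverage(synopsis, queries):
    """Fraction of queries whose terms all appear in the synopsis."""
    hits = sum(1 for q in queries if q and set(q).issubset(synopsis))
    return hits / len(queries)


# Hypothetical term counts for one peer and for the query workload.
file_counts = Counter(
    {"mp3": 900, "track": 400, "01": 350, "zeppelin": 5, "stairway": 3}
)
query_counts = Counter({"zeppelin": 40, "stairway": 30, "heaven": 20, "mp3": 2})
queries = [["zeppelin"], ["stairway"], ["zeppelin", "stairway"], ["heaven"]]

k = 2
print(coverage(synopsis_file_only(file_counts, k), queries))
print(coverage(synopsis_query_aware(file_counts, query_counts, k), queries))
```

In this toy case the file-only synopsis spends its two slots on terms that rarely appear in queries, while the query-aware synopsis covers three of the four queries with the same budget.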
Figure 4. Cumulative distribution of number of terms in shared filenames. 18.6M total objects discovered in October, 2006. [Plot omitted: x-axis: Number of terms in filename; y-axis: Cumulative distribution.]
Our aim was to examine the relationship between the file and query term distributions in order to understand how search mechanisms could be modified to match queries to the shared files more effectively. First, we examined the filenames of shared objects. For our analysis, the terms used in the shared filenames were the only annotations available for queries. We tokenized the filenames as specified in the Gnutella protocol4 and counted the number of terms for each filename. Figure 4 plots the cumulative distribution of the number of terms in the shared filenames discovered in October, 2006. From Figure 4, we observe that over 90% of the files had fewer than 10 terms. An analysis of filenames from the April 2007 data showed similar results (not illustrated in this paper). Next, we tracked the number of times that each term occurred across all discovered objects. We also investigated the number of times that each term occurred in the query trace. We ranked each term based on file and query occurrences and measured the difference in rank. This analysis gives us insight into whether the most popular terms in the distribution of objects were the ones that were used most in the queries. Among the 233,000 queries, we discovered 83,000 unique query keywords. Out of 18.6M files, there were 1.2M unique tokens occurring in the filenames. We then cross-referenced the list of query keywords and filename tokens. About 61,000 terms appeared in both queries and filenames. This represented about 75% of the query keywords, but only 5% of the filename tokens. The majority of filename tokens did not appear in any of the queries we captured in the 24 hour traces from both October, 2006 and April, 2007. We plot the distribution of the number of occurrences for each term in both queries and filenames in Figures 5(a) and 5(b) respectively. The plots are in a log-log scale.
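The counting and cross-referencing steps above can be sketched with term counters. The tokenizer is again a simplifying assumption (not the Gnutella protocol's tokenization), and the filenames and queries are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on non-word characters/underscores.
    return [t for t in re.split(r"[\W_]+", text.lower()) if t]


def term_counts(strings):
    """Count how often each term occurs across all strings."""
    counts = Counter()
    for s in strings:
        counts.update(tokenize(s))
    return counts


def overlap_report(file_counts, query_counts):
    """Size of the shared vocabulary and its share of each vocabulary."""
    shared = file_counts.keys() & query_counts.keys()
    return {
        "shared": len(shared),
        "frac_query": len(shared) / len(query_counts),
        "frac_file": len(shared) / len(file_counts),
    }


filenames = [
    "01 - Stairway To Heaven.mp3",
    "led_zeppelin_live.mp3",
    "holiday.jpg",
]
queries = ["stairway heaven", "zeppelin live", "beatles"]

file_counts = term_counts(filenames)
query_counts = term_counts(queries)
print(overlap_report(file_counts, query_counts))
```

With the rounded counts reported above, the same ratios come to roughly 61,000 / 83,000 ≈ 0.73 of the query keywords and 61,000 / 1,200,000 ≈ 0.05 of the filename tokens.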
We see a linear trend for the number of occurrences of each term in both filenames and query traces, indicating a Zipf-like distribution. Next, we examined these distributions more closely to determine whether the popularity of a term in the set of shared objects correlates with its popularity in the query workload. We assigned each term a rank for both queries and filenames based on the number of occurrences of that term in the query trace and shared object dataset respectively. We plotted the difference between the query and filename popularity ranks of the top 100, 1,000, and 10,000 terms in Figures 6(a), 6(b), and 6(c) respectively. Perfect correlation between these ranks would result in a straight line at y = 0. The distance by which each point deviates from 0 indicates the difference in rank and thus a decrease in correlation between the two ranks. From Figure 6(a), we note that the top 25 terms show little difference in rank. However, beyond the top 25 terms, the ranks diverge rapidly. When we expand the analysis to include the top 1,000 terms of the object annotations (Figure 6(b)), the rank for a term showed significant variance. Further expanding the analysis to the top 10,000 terms (Figure 6(c)) revealed a mean difference of -1,876 between the popularity rank of a term in the object annotations and its popularity rank in the query workload. Although the term popularity distributions followed Zipf-like trends, the relative popularity of a term in the set of shared objects does not correlate well with the popularity of the term in the query workload. This observation indicates that improving search performance using a synopsis of the objects at each peer requires an examination of the query workload.
Figure 5. Distribution of term occurrence counts in the query trace (24 hours in October, 2006) and the term occurrence counts across all objects (18M Objects from October, 2006). Plots are in log-log scale. [Plots omitted: (a) Query Terms, (b) File Terms; x-axis: Number of occurrences; y-axis: Number of terms.]
Figure 6. Difference between term rank in the shared object dataset and the query workload for the top 100, 1,000 and 10,000 terms in the shared object annotations (October, 2006 data). [Plots omitted: (a) Top 100 Terms, (b) Top 1,000 Terms, (c) Top 10,000 Terms; x-axis: Term rank in object annotations; y-axis: Difference in rank against query trace.]
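The rank comparison can be sketched as follows. Two details are assumptions made for this illustration and are not stated in the text: the difference is computed as file rank minus query rank (so a term that is less popular in queries yields a negative value), and terms that never occur in the query trace are skipped. Ties in occurrence counts are broken alphabetically. The term counts are hypothetical.

```python
from collections import Counter


def rank_terms(counts):
    """Map each term to its 1-based popularity rank (1 = most frequent)."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: i for i, (term, _) in enumerate(ordered, start=1)}


def rank_differences(file_counts, query_counts, top_n):
    """For the top_n file terms, return (term, file_rank, difference)."""
    file_rank = rank_terms(file_counts)
    query_rank = rank_terms(query_counts)
    top = sorted(file_rank, key=file_rank.get)[:top_n]
    return [
        (t, file_rank[t], file_rank[t] - query_rank[t])
        for t in top
        if t in query_rank  # assumption: skip terms absent from queries
    ]


# Hypothetical occurrence counts.
file_counts = Counter({"mp3": 100, "track": 50, "love": 20, "zeppelin": 5})
query_counts = Counter({"zeppelin": 30, "love": 20, "mp3": 10})

diffs = rank_differences(file_counts, query_counts, top_n=4)
print(diffs)

# Mean difference over the compared terms, analogous to a single
# summary value for a plot such as Figure 6(c).
mean_diff = sum(d for _, _, d in diffs) / len(diffs)
print(mean_diff)
```

A plot of the third tuple element against the second corresponds to the layout of Figure 6, where perfectly correlated ranks would lie on the line y = 0.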
4.2 Questions 2 & 3: Analysis of Query Success Rate Limits Our goal was to determine the limits for query performance. We first analyzed the global success rate of queries. We then investigated the change in global success rate as we increased the number of nodes, and thus objects, in the network. We then examined the number of matching objects returned per query, followed by an examination of the recall for queries as a function of the network size. We evaluated the success rate for queries against networks of varying size. Our dataset consisted of traces and discovered objects captured in October, 2006 and April, 2007. The October, 2006 data consisted of 40K peers with 18.6M shared objects. The April, 2007 data consisted of 37K peers with 21M shared objects. We selected random samples of each data set where each sample had an increasing number of peers: 200, 1K, 2K, 3K, 4K, 5K, 10K, 20K, and then the full number of peers. We then evaluated a two hour trace of queries against each sample of peers. The success rate trends for the various network sizes are shown in Figures 7(a) and 7(b) for the 2006 and 2007 data respectively. From Figures 7(a) and 7(b), we observe that at most 56% of the queries in the 2006 data and 44.4% of the queries in the 2007 data were resolved when using the shared objects from all the nodes. Beyond 10,000 nodes, the success rate did not improve significantly as the network size increased for both 2006 and 2007 queries. Note that the success rates for 2007 were lower than in 2006. We speculate that this drop in success rate can be partially attributed to an increase in the number of queries with non-ASCII characters. The percentage of queries with non-ASCII characters increased from 8% to 17% between the October, 2006 and the April, 2007 data, yet the number of shared objects with non-ASCII characters in their filenames remained less than 1%. We further discuss the effect of non-ASCII characters on the query behavior in Section 4.5.
Figure 7. Success rate vs. nodes reached by the queries (2 hour query trace). Beyond 10,000 nodes there was little improvement in query success rates. [Plots omitted: (a) October, 2006 Data, (b) April, 2007 Data; x-axis: Number of nodes in the system; y-axis: Query success rate.]
Figure 8. Success rate vs. nodes reached by the queries (24 hour query trace). [Plots omitted: (a) October, 2006 Data, (b) April, 2007 Data; x-axis: Number of nodes in the system; y-axis: Query success rate.]
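The peer-group evaluation can be sketched as sampling a subset of peers and measuring the fraction of queries that match at least one object held by the sampled peers. This toy version is brute force and would not scale to the data set sizes discussed here. It draws each group independently, which is an assumption, as is the simplified tokenizer; the peer identifiers, filenames and queries are hypothetical.

```python
import random
import re


def tokenize(text):
    # Assumption: lowercase and split on non-word characters/underscores.
    return {t for t in re.split(r"[\W_]+", text.lower()) if t}


def success_rate(peer_files, peer_ids, queries):
    """Fraction of queries fully matched by some object on the peers."""
    objects = [tokenize(name) for pid in peer_ids for name in peer_files[pid]]
    query_terms = [tokenize(q) for q in queries]
    hits = sum(
        1
        for q in query_terms
        if q and any(q.issubset(obj) for obj in objects)
    )
    return hits / len(queries)


def success_by_group_size(peer_files, queries, sizes, seed=0):
    """Evaluate the queries against one random peer group per size."""
    rng = random.Random(seed)  # fixed seed for repeatable samples
    peers = sorted(peer_files)
    results = {}
    for n in sizes:
        group = rng.sample(peers, min(n, len(peers)))
        results[n] = success_rate(peer_files, group, queries)
    return results


peer_files = {
    "p1": ["led_zeppelin-stairway.mp3"],
    "p2": ["holiday photo.jpg"],
    "p3": ["beatles - help.mp3"],
}
queries = ["zeppelin stairway", "beatles help", "madonna"]

results = success_by_group_size(peer_files, queries, sizes=[1, 3])
print(results)
```

The value for the full peer set plays the role of the upper bound: no overlay or search mechanism restricted to these peers could resolve more of the queries.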
We then examined whether this trend was consistent for longer traces and performed the same evaluation for 24 hour query traces from 2006 and 2007. Note that due to computation requirements, we were unable to evaluate the 24 hour traces for all network sizes. We ran our query evaluator on a quad-processor Itanium-2 Linux system with 8GB of memory. Even with aggressive pre-processing and caching to reduce the memory footprint of the term index, our query evaluator required several days to process a 2 hour trace for networks larger than 10K nodes. Figures 8(a) and 8(b) show the success rate trends of a 24 hour query trace from 2006 and 2007 respectively for the network sizes that we were able to evaluate. The figures show a trend similar to that of the 2 hour query trace analysis for network sizes up to 5,000 nodes. Note that the success rates for the queries from the 24 hour trace were slightly lower than for a 2 hour trace. This is due to small variance in query performance over the course of a day. A full 24 hour trace gives a more accurate representation of the overall query success rate. In the next section, we discuss our examination of the distribution of matching objects for each query.
4.3 Question 4: Matching Objects per Query and Object Replication Ratios In addition to the overall query success rate, we were interested in examining the distribution of matching objects per query. Understanding this distribution gives insight into the relative popularity and replication of objects in the system. We plotted the cumulative distribution of the number of matching objects for a 2 hour query trace evaluated against the full set of 18.6M objects (October, 2006 data) in Figure 9(a). The analysis of data from April, 2007 yielded similar results and thus we omit graphs showing the 2007 results. From Figure 9(a) we observe that 88.3% of all queries had fewer than 500 matching files, with only 17% having more than 100 matching files. Most queries (over 70%) were for objects with low replication rates (fewer than 10 matching objects).
Figure 9. Cumulative distribution of the number of matching files returned per query using October, 2006 data: 2 hour query trace, 18.6M objects, and 40K peers. [Plots omitted: (a) All Queries, (b) Successfully Matched Queries; x-axis: Number of matching objects; y-axis: Cumulative distribution.]
Figure 10. Cumulative distribution of the number of nodes with matching files per query. (October, 2006 data: 2 hour query trace, 18.6M objects, and 40K peers). [Plot omitted: x-axis: Number of nodes with matching objects; y-axis: Cumulative distribution.]
Our earlier analysis in Section 4.2 showed that 44.4% of queries against the full set of 18.6M objects had no matches. The large number of queries with no matches skewed the distribution downward; the mean and median number of matching objects were closer to 0. We then focused on analyzing the distribution of matching objects for just the queries with matching objects in the system. Figure 9(b) plots the cumulative distribution of the number of matching objects per successfully matched query. We note that the majority of successfully matched queries had only one matching object. Over 70% of all queries had fewer than 10 matching files. The mean number of matching files per successful query was over 5,600. However, the mean was skewed high due to several queries with unusually large numbers of matching files∗. The median number of matching files per successful query was much lower: 18 matching files. Less than 0.5% of all queries and less than 1.4% of the successful queries had over 5,000 matching files.
Next we examined the number of nodes with matching objects for each query. We plot the cumulative distribution in Figure 10. We observe that approximately 70% of the queries were resolved by fewer than ten nodes. Over 90% of the queries were resolved by fewer than 100 nodes. Our results showed that few objects had high replication ratios. We define replication ratio as the proportion of nodes in the system with matching objects. Our results showed that less than 3.5% of queries had replication ratios higher than 1%. These results imply a system where most queries were for rare objects, with most queries resolved by a small number of nodes. Next, we examined how the network size affected the quality of the results for a given query by measuring recall.
∗ These queries contained a single search keyword. The top such queries were for “mp3” (over 4,000,000 matches), “zip” (over 800,000 matches), and “jpg” (over 700,000 matches).
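As a small illustration of the two quantities used in this section, the sketch below computes a replication ratio and shows how a single very broad query can pull the mean number of matches far above the median. The function names and the sample counts are hypothetical and chosen only to demonstrate the effect; they are not taken from our traces.

```python
from statistics import mean, median


def replication_ratio(nodes_with_match, total_nodes):
    """Proportion of nodes in the system that hold at least one matching object."""
    return nodes_with_match / total_nodes


def summarize_matches(match_counts):
    """Summarize per-query match counts, separating out the successful queries."""
    successful = [c for c in match_counts if c > 0]
    return {
        "success_rate": len(successful) / len(match_counts),
        "mean_successful": mean(successful) if successful else None,
        "median_successful": median(successful) if successful else None,
    }


# Illustrative (made-up) match counts: four failed queries, five ordinary
# queries, and one single-keyword query that matches millions of files.
counts = [0, 0, 0, 0, 1, 1, 2, 18, 40, 4_000_000]
summary = summarize_matches(counts)
```

In a 40,000-node system, a replication ratio of 1% corresponds to `replication_ratio(400, 40_000)`, that is, matching objects on 400 nodes. For the sample counts above, the one outlier dominates the mean of the successful queries while the median stays small, which is the same effect that makes the median the more representative statistic here.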
4.4 Question 5: Query Recall Analysis
[Figure 11 plots: (a) Recall CDF, x-axis: Query recall (0 to 1), y-axis: Cumulative distribution, with curves for 200, 1,000, 2,000, 4,000, and 10,000 nodes; (b) Mean & Median Recall, x-axis: Number of nodes in the system (0 to 10,000), y-axis: Query recall.]
Figure 11. CDF and average recall for queries performed on networks of increasing size.
Our previous results from Section 4.2 showed that, beyond a certain threshold, the global success rate for queries was not significantly improved by visiting more nodes. We then investigated whether visiting more nodes improved the number of relevant objects returned for each query (recall). Recall is a measure of how many relevant objects are retrieved as a proportion of the total number of relevant objects in the system. As an example, consider a query for “spiderman movie”. If the system contained 100 objects with “spiderman” and “movie” as part of their filename (annotations) and the query returned 60 of them, then the recall for the query would be 60/100 = 0.6.
We evaluated the queries on a subset of the 40,000 nodes that we crawled in October, 2006. We identified all objects that matched the query and then re-issued the query against the full 40,000 nodes in the system. We tracked the matching objects from the full 40,000 nodes and compared that set to the matching objects from the original subset of nodes. The analysis of data from April, 2007 yielded similar results and thus we omit graphs showing the 2007 results.
We plot the cumulative distribution of recall for various network sizes in Figure 11(a). Higher values indicate better recall (perfect recall, where all relevant objects are returned, has a value of 1). We prefer curves where most queries have a higher recall, as this indicates that queries are locating more of the objects relevant to the query. As the network size increased, the recall distribution showed a trend toward higher recall. With a network of 10,000 nodes, over 50% of the queries had a recall of 0.5 or greater; that is, most queries located at least half of the matching objects in the system. We then examined the trend of the average recall as the size of the network increased. We plot the mean and median recall for a 2 hour query trace as a function of network size in Figure 11(b).
We observe that, unlike success rate, recall did improve linearly with increasing network size. Although global search performance did not improve significantly with more nodes, individual queries with positive matches in the system benefited from more nodes by having access to more relevant matching objects.
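The recall computation described in this section can be sketched as follows. This is a minimal illustration rather than our evaluation code: the function names are hypothetical, objects are represented by arbitrary hashable identifiers, and queries with no matches in the full system are simply excluded from the averages, which is an assumption.

```python
from statistics import mean, median


def recall(subset_matches, full_matches):
    """Fraction of the relevant objects in the full system found in the subset.

    Returns None when the full system holds no relevant objects, because
    recall is undefined for such queries.
    """
    if not full_matches:
        return None
    return len(subset_matches & full_matches) / len(full_matches)


def recall_summary(per_query_matches):
    """Mean and median recall over (subset_matches, full_matches) pairs."""
    values = [recall(sub, full) for sub, full in per_query_matches]
    defined = [v for v in values if v is not None]
    return mean(defined), median(defined)


# The "spiderman movie" example: 100 relevant objects, 60 of them returned.
full = set(range(100))
returned = set(range(60))
example_recall = recall(returned, full)
```

For the example in the text, `example_recall` evaluates to 0.6. Repeating `recall_summary` for each network size would give the two curves of Figure 11(b).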
4.5 Question 6: Analysis of Non-ASCII Characters in Queries and Object Annotations
With the world-wide rise in Internet usage, the future trend is toward multi-lingual environments [27]. We were interested in determining how much multi-lingual queries and files affected overall query performance. We showed that queries with non-ASCII characters represented a significant portion (up to 17%) of the queries in our study. However, our study showed that less than 1% (0.13%) of the files had any non-ASCII characters. We then evaluated only those queries with non-ASCII characters against the full 20M object dataset. We showed that the success rate of these queries was only 0.08%. Even though the query distribution contained many queries with non-ASCII characters, the object distribution contained few objects with such characters, resulting in a low success rate for queries with non-ASCII characters.
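A simple way to classify a query or file name as containing non-ASCII characters is to test each code point, as in the sketch below. It assumes the strings have already been decoded to Unicode text; the function names and the sample queries are hypothetical and serve only as an illustration.

```python
def has_non_ascii(text):
    """True if any character lies outside the 7-bit ASCII range."""
    return any(ord(ch) > 127 for ch in text)


def non_ascii_fraction(strings):
    """Fraction of the given strings that contain a non-ASCII character."""
    if not strings:
        return 0.0
    return sum(1 for s in strings if has_non_ascii(s)) / len(strings)


# Illustrative (made-up) queries: two of the four contain non-ASCII characters.
sample_queries = ["spiderman movie", "música latina", "日本語 ドラマ", "linux iso"]
fraction = non_ascii_fraction(sample_queries)
```

Applying the same function to the query trace and to the crawled file names yields the two percentages compared in this section. A query with non-ASCII terms can only succeed if some file name contains the same terms, so a large gap between the two fractions directly limits the success rate of such queries.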
4.6 Summary of Results
Question 1: Are frequently occurring terms in the object filenames also popular search terms? We showed that the popularity distribution of terms in both the object annotations and query traces followed a Zipf-like distribution. However, we showed that the relative popularity of a term in the object annotations did not correlate well with its popularity in the query workload. Additionally, we showed that fewer than 5% of the terms found in the object annotations occurred in the query trace.
Question 2: What is the average practical limit for query success rates? Our analysis showed that evaluating queries against 40,000 crawled peers from 2006 yielded a success rate of 56%.
Question 3: Does the query success rate scale with the size of the network? Our experiments showed that increasing the network size from 200 to 10,000 peers yielded significant increases in success rate. However, beyond 10,000 nodes, the increase in success rate was less significant.
Question 4: Do users typically issue queries for popular objects? We showed that only 17% of queries had more than 100 matching objects. Additionally, we showed that 70% of the queries had matching objects on ten or fewer nodes. Our analysis also showed that only 3.5% of the queries had matching objects on more than 1% of the peers in the system.
Question 5: What proportion of the relevant objects available in the system were returned for each query (recall) as the query reached more peers? Our analysis showed that, on average, a query that reached more peers was able to retrieve a higher proportion of relevant objects (higher recall). Moreover, the median recall increased linearly with the size of the network, in contrast to the query success rate.
Question 6: How do queries and objects with non-ASCII characters affect the system’s behavior? We showed that queries with non-ASCII characters made up a significant percentage of the total queries (17%) in the April, 2007 dataset. This represented a doubling in the percentage of these queries (8% to 17%) from the October, 2006 dataset. Despite the significant number of queries with non-ASCII characters, we showed that less than 1% of the objects discovered in our crawls for both 2006 and 2007 contained any non-ASCII characters.
5. CONCLUSION
In general, P2P systems cannot realistically control the amount or the quality of annotations and queries. We showed that search in P2P systems has a practical limit on success rate. Almost half (44.4%) of the queries had no matching objects in the system regardless of the overlay or search mechanism used to locate the objects. Additionally, the query success rate only dropped modestly (15%) when the number of nodes used to evaluate the queries was small (12% of the total). We also showed that the distribution of terms in the query workload and the shared objects’ annotations followed a Zipf-like distribution. However, the relative popularity of a term in the shared object names did not correlate with the term’s popularity in the query workload. Although we do not address the question of how best to name objects or issue queries, our results could be used to design more effective overlays and search mechanisms that are optimized for the observed workload. Our future work is focused on designing such overlays and search mechanisms.
ACKNOWLEDGMENTS
This work was supported in part by the U.S. National Science Foundation (CNS-0447671).
REFERENCES
1. I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A scalable peer-to-peer lookup service for internet applications,” in Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, pp. 149–160, ACM Press, 2001.
2. A. I. T. Rowstron and P. Druschel, “Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems,” in Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg, pp. 329–350, Springer-Verlag, 2001.
3. S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, “A scalable content-addressable network,” in Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, pp. 161–172, ACM Press, 2001.
4. “Gnutella protocol v0.6.” http://rfc-gnutella.sourceforge.net/src/rfc-0_6-draft.html.
5. W. Acosta and S. Chandra, “Improving search using a fault-tolerant overlay in unstructured p2p systems,” in Proceedings of the IEEE International Conference on Parallel Processing (ICPP’07), 2007.
6. Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, “Search and replication in unstructured peer-to-peer networks,” in Proceedings of the 16th international conference on Supercomputing, pp. 84–95, ACM Press, 2002.
7. B. T. Loo, R. Huebsch, I. Stoica, and J. M. Hellerstein, “The case for a hybrid p2p search infrastructure,” in Proceedings of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS’04), 2004.
8. “itunes.” http://www.apple.com/itunes.
9. S. Chandra and X. Yu, “Share with thy neighbors,” in Proceedings of the SPIE/ACM Multimedia Computing and Networking Conference (MMCN’07), 2007.
10. A. H. Rasti, D. Stutzbach, and R. Rejaie, “On the long-term evolution of the two-tier gnutella overlay,” in IEEE Global Internet, 2006.
11. M. Zaharia and S. Keshav, “Gossip-based search selection in hybrid peer-to-peer networks,” in Proceedings of the 5th International Workshop on Peer-to-Peer Systems (IPTPS’06), 2006.
12. C. Tang and S. Dwarkadas, “Hybrid global-local indexing peer-to-peer information retrieval,” in Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI’04), 2004.
13. W. Acosta and S. Chandra, “Exploiting the properties of query workload and file name distributions to improve p2p synopsis-based searches,” in Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’08), April 2008.
14. Y. Qiao and F. E. Bustamante, “Structured and unstructured overlays under the microscope - a measurement-based view of two p2p systems that people use,” in Proceedings of the USENIX Annual Technical Conference, 2006.
15. Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker, “Making gnutella-like p2p systems scalable,” in Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pp. 407–418, ACM Press, 2003.
16. K. P. Gummadi, R. J. Dunn, S. Saroiu, S. D. Gribble, H. M. Levy, and J. Zahorjan, “Measurement, modeling, and analysis of a peer-to-peer file-sharing workload,” in Proceedings of the nineteenth ACM symposium on Operating systems principles, pp. 314–329, ACM Press, 2003.
17. S. Saroiu, P. K. Gummadi, and S. D. Gribble, “A measurement study of peer-to-peer file sharing systems,” in Proceedings of Multimedia Computing and Networking 2002 (MMCN ’02), (San Jose, CA, USA), January 2002.
18. D. Stutzbach, R. Rejaie, and S. Sen, “Characterizing unstructured overlay topologies in modern p2p file-sharing systems,” IEEE/ACM Transactions on Networking, 2007.
19. W. Acosta and S. Chandra, “Trace driven analysis of the long term evolution of gnutella peer-to-peer traffic,” in Proceedings of the Passive and Active Measurement Conference (PAM’07), 2007.
20. S. Zhao, D. Stutzbach, and R. Rejaie, “Characterizing files in the modern gnutella network: A measurement study,” in Proceedings of the SPIE/ACM Multimedia Computing and Networking (MMCN ’06), (San Jose, CA), 2006.
21. H. Pucha, D. G. Andersen, and M. Kaminsky, “Exploiting similarity for multi-source downloads using file handprints,” in Proceedings 4th Symposium on Networked System Design and Implementation (NSDI ’07), 2007.
22. “Bittorrent protocol specification.” http://bitconjurer.org/BitTorrent/protocol.html.
23. “The phex gnutella client.” http://phex.kouk.de.
24. “Unicode.” http://unicode.org/.
25. B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, and I. Stoica, “Enhancing p2p file-sharing with an internet-scale query processor,” in Proceedings of the 30th Very Large Data Bases Conference (VLDB’04), 2004.
26. P. Reynolds and A. Vahdat, “Efficient peer-to-peer keyword searching,” in Proceedings of International Middleware Conference (Middleware’03), 2003.
27. “http://www.internetworldstats.com/stats3.htm.”