Public Domain P2P File-sharing Networks Measurements and Modeling
Jaime Lloret, Juan R. Diaz, Jose M. Jimenez and Fernando Boronat
Department of Communications, Polytechnic University of Valencia (Spain)

Abstract
Since P2P file-sharing networks became extremely popular among Internet users, many researchers have tried to model them. This article deals with the modeling of public domain P2P file-sharing networks in terms of parameters such as their number of users and the number of files they hold over time. To do so, we measured five public P2P file-sharing networks (Gnutella, FastTrack, OpenNap, eDonkey and MP2P), tracking their evolution for three years. The results are discussed and compared with measurements taken by other authors. They could be used to design new P2P networks, to test their performance or to optimize P2P network parameters.
1. Introduction and motivation
P2P file-sharing is one of the most popular P2P variants and the one with the largest number of users. Many of those users merely interact with their P2P network to download files without sharing anything, but many others intend to share their files with the whole community regardless of who downloads them. It is also useful to distinguish between a P2P file-sharing network and a desktop P2P file-sharing application, whose development does not necessarily run in parallel. A P2P file-sharing network is a set of rules and interactions that allows desktop P2P file-sharing applications to communicate. A desktop P2P file-sharing application is a computer application that allows a user to interact with other users in the same P2P file-sharing network. The simplicity of the desktop application, the languages it supports and/or the type of download offered by the network are often the factors that make a P2P file-sharing network very popular or, on the other hand, make it disappear. Sometimes the reason for changing to a newly offered desktop P2P file-sharing application is its ability to join several networks simultaneously, as Shareaza [1], MLDonkey [2], Morpheus [3], giFT [4] and cP2Pc [5][6] do.

Some ISPs observed in 2002 that their networks rapidly became congested and that P2P traffic sometimes reached around 60% of the total traffic [7]. Although less striking, Internet2 administrators also reported on 16 February 2004 that 10.46% of the total traffic originated from P2P file-sharing networks [8]. CAIDA also shows that Internet traffic is mainly dominated by P2P file-sharing protocols and HTTP [9]. Because of the social impact of P2P file-sharing networks, both industry and academia are spending time and money analyzing several aspects of these networks.

Several measurements of public domain P2P file-sharing networks have been published. Some were taken in a misleading manner (e.g., the number of connected users is calculated solely from the number of users that downloaded a certain desktop P2P file-sharing application [10, 11]); others just give the average number of users for a certain period of time [12]. Some studies have even analyzed which type of P2P file-sharing network is used by Internet users in different regions of the world [13]. However, none of those papers and measurements has studied the number of users and the number of files from inside the network, using its protocol.

There are many public domain P2P file-sharing networks. We have chosen one totally decentralized architecture (Gnutella [14]), one partially decentralized architecture with superpeers (FastTrack [15]) and three partially decentralized architectures with servers (OpenNap [16], eDonkey [17] and MP2P [18]). We measure their number of users and the number of files inside them, because their protocols give us this information without broadcasting to or flooding the network. How the chosen P2P file-sharing networks work is described in reference [19]. Moreover, those parameters vary over time, so their evolution must be measured in order to design new P2P file-sharing networks or to test the performance of new ones.
This paper is structured as follows. Section 2 explains how we took our measurements, summarizes them and discusses the measurements that other authors have used for modeling. Section 3 presents the mathematical expressions obtained from our measurements. Section 4 gives our conclusions.
2. Measurements
We chose the most suitable clients to take accurate measurements from the selected P2P file-sharing networks. The choice was made bearing in mind that the client should provide the most information on the architecture and the highest update frequency for the measured parameters. Once a desktop file-sharing application joins its network, it receives a message containing the number of users in the network, the number of files shared and, in some cases, the total amount of data shared inside the network. This information is refreshed periodically. Using the protocol signature, we captured the messages containing those parameters every hour of the day. The Gnutella network was analyzed using the Limewire client [20]. In the FastTrack network, the measurements were taken with the KaZaA Lite client [21]. The Xnap client [22] was used to analyze the OpenNap network. The eMule client [23] was used to analyze the eDonkey network. Finally, the MP2P architecture was analyzed by means of the Piolet client [24].
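As an illustration of how such hourly captures could be stored for later analysis, the following is a minimal sketch. The record layout, the field names and the CSV format are assumptions made for this example; the paper does not describe its storage format, and the sample values are hypothetical.

```python
import csv
import io
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass
class Sample:
    """One hourly capture (hypothetical record layout)."""
    timestamp: str            # capture time, e.g. ISO 8601 with GMT+01:00 offset
    network: str              # e.g. "Gnutella"
    users: int                # number of users reported by the network
    files: int                # number of shared files reported by the network
    shared_gb: Optional[int]  # total shared data; None when the network does not report it


def append_sample(out, sample: Sample) -> None:
    """Append one capture as a CSV row to any writable text stream."""
    csv.writer(out).writerow(astuple(sample))


# Hypothetical values, used only to exercise the function.
buffer = io.StringIO()
append_sample(buffer, Sample("2006-03-01T18:00:00+01:00", "Gnutella",
                             2_300_000, 130_000_000, None))
print(buffer.getvalue().strip())
```

Keeping one row per network and hour makes it straightforward to derive the weekly and yearly series discussed in the following sections.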
2.1. Evolution
We took measurements from March 2003 to March 2006 for the Gnutella, FastTrack, OpenNap and eDonkey networks, and from March 2004 for MP2P, because its number of users was increasing significantly. Figure 1 shows their evolution in terms of the number of users joined to the analyzed networks, and figure 2 shows the number of files shared inside the networks over the years. We can observe that there is no strict relationship between the number of online users in a P2P file-sharing network and the number of shared files in it: the OpenNap and MP2P networks had the same number of users in March 2004, yet the OpenNap network had three times more shared files than the MP2P network. Likewise, the eDonkey network has always had more users than the OpenNap network, yet OpenNap had more shared files than eDonkey between March 2003 and January 2004. Detailed comments on the evolution of P2P file-sharing networks and their content can be found in reference [25].
We also measured the total amount of data shared inside the P2P networks and observed that it does not depend on the number of files shared inside the network. FastTrack is the network with the most shared files; however, eDonkey is the network with the largest total amount of shared data, because many of its files are large, such as videos, DVD images, and so on.
2.2. Number of users and files inside the P2P networks
Figures 3a, 3b, 3c, 3d and 3e show an example of the measurements of the number of users inside the analyzed P2P networks, taken during a week. All figures correspond to the most significant ones among all the data obtained. Time values are in the GMT+01:00 time zone. We observe that the hours when all the measured architectures have the most users are between 18:00 and 5:00. We can also observe that all graphs have a regular waveform that is repeated every day. This can be seen for all the P2P networks we have measured, and it is clearer when there are many users inside the P2P network and when the network is stable (i.e. the users leave the network voluntarily, not because the network has failed). Measurements taken from other P2P networks with a lower number of users show irregular graphs. On the other hand, the higher the number of users, the lower the precision of the parameters obtained from the P2P network. Measurements taken in 2006 show that the eDonkey network can have variations of ± 9 million users in one hour. Gnutella can have variations of ± 350,000 users in one hour, FastTrack and OpenNap can have variations of ± 150,000 users in one hour and MP2P can have variations of ± 10,000 users in one hour. Figures 4a, 4b, 4c, 4d and 4e show an example of the measurements of the number of files shared inside the analyzed P2P networks, taken during a week. All of them correspond to the most significant ones among all the data obtained. The same observations apply as for the figures of the number of users. For a detailed analysis of public domain P2P file-sharing networks during a week, see reference [26]. Table 1 shows the type of files shared, the correlation between the number of users and their files, and the hours when the most users are connected.
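The two quantities discussed above, the largest change within one hour and the correlation between users and files, can be computed from hourly series as in the following sketch. The sample values are hypothetical, and the Pearson coefficient is used here only as a simple stand-in, since the paper does not state how its correlation was calculated.

```python
import math


def max_hourly_change(samples):
    """Largest absolute difference between two consecutive hourly samples."""
    return max(abs(later - earlier) for earlier, later in zip(samples, samples[1:]))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly samples for one network (not measured data).
users = [2_100_000, 2_150_000, 2_300_000, 2_420_000, 2_380_000, 2_200_000]
files = [121_000_000, 124_000_000, 133_000_000, 140_000_000, 137_000_000, 127_000_000]

print("largest hourly change in users:", max_hourly_change(users))
print("users/files correlation:", round(pearson(users, files), 3))
```

A coefficient close to 1 would correspond to the "High" entries of Table 1, although the thresholds separating "High" from "Medium" are not given in the paper.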
Figure 1. Number of users in analyzed networks. (Plot: Users vs. Time, March 2003 to March 2006; series: Gnutella, FastTrack, OpenNap, eDonkey, MP2P.)
Figure 2. Number of files in analyzed networks. (Plot: Files vs. Time, March 2003 to March 2006; series: Gnutella, FastTrack, OpenNap, eDonkey, MP2P.)
Figure 3a. Number of users in Gnutella. Figure 3b. Number of users in FastTrack. Figure 3c. Number of users in OpenNap. Figure 3d. Number of users in eDonkey. Figure 3e. Number of users in MP2P. (Plots: Users vs. Hours over one week.)
Figure 4a. Number of files in Gnutella. Figure 4b. Number of files in FastTrack. Figure 4c. Number of files in OpenNap. Figure 4d. Number of files in eDonkey. Figure 4e. Number of files in MP2P. (Plots: Files vs. Hours over one week.)

Table 1. Analyzed networks summary
Network   | Type of files | Cross-correlation between users and files | Peak hours for connected users
Gnutella  | All           | High                                      | 12:00-17:00
FastTrack | All           | Medium                                    | 20:00-23:00
OpenNap   | All           | Medium                                    | 18:00-22:00
eDonkey   | All           | High                                      | 20:00-1:00
MP2P      | Audio         | High                                      | 3:00-5:00

2.3. Number of files per user
File replication in many P2P file-sharing networks follows a power law, more concretely Zipf's law, as can be read in references [27] and [28]. Zipf's law implies that a few users hold most of the files in the network. Our measurements show that, in all analyzed P2P networks, the number of files shared per user is greater than in previous years. This makes sense: the number of shared files in a P2P file-sharing network tends to grow because the most popular files tend to replicate in the network. However, we have to take into account that hard disk capacity is limited, so many users record downloaded files on optical disks and then delete those files from their hard disks. Even so, the number of files per user has grown in all networks. Figure 5 shows the average number of files per user over the years for all analyzed P2P networks. We can see that the OpenNap network is the one with the most files per user. The most stable network is MP2P. As the measurements vary over time, researchers have to take these types of parameters into account when modeling their P2P networks.
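The average number of files per user is simply the number of shared files divided by the number of connected users at the same instant. The sketch below shows the calculation on hypothetical yearly snapshots (the figures are invented for illustration and are not measured values); it also shows how the average can grow even when the number of users decreases.

```python
def files_per_user(total_files: int, total_users: int) -> float:
    """Average number of shared files per connected user."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return total_files / total_users


# Hypothetical (files, users) snapshots for one network, one per year.
snapshots = {
    2004: (90_000_000, 300_000),
    2005: (100_000_000, 250_000),
}

for year, (files, users) in sorted(snapshots.items()):
    print(year, round(files_per_user(files, users)))
```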
Figure 5. Files per user in all analyzed P2P networks. (Plot: Files per User vs. Time, March 2003 to March 2006; series: Gnutella, FastTrack, OpenNap, eDonkey, MP2P.)

Subhabrata Sen and Jia Wang [29] took measurements in the Gnutella and FastTrack networks from September 2001 to December 2001. Between these dates both networks were growing, as they state in their paper. There was an average of 197,445 users in the Gnutella network and an average of 4,450,149 users in the FastTrack network. Comparing those figures with our measurements, we can observe that between those dates and our first measurements the number of users began to decrease. Currently, the number of users in both Gnutella and FastTrack is growing again. They also observed a traffic pattern for P2P networks between evening and midnight, whereas the present paper shows that peak hours differ between P2P networks.

Beverly Yang and Hector Garcia-Molina [30] used experimental results taken from the OpenNap network at the end of 2000. Users in their research had a fixed average of 168 files per user. As our measurements show, this average has varied over the years. They used this number to model the expected number of servers in all architectures; because it can vary greatly over the years, it is not a good parameter for modeling new P2P networks.

S. Saroiu, P. K. Gummadi and S. D. Gribble show in reference [12] the number of shared files versus the total amount of data shared inside the P2P network. These numbers have varied significantly because users in P2P file-sharing networks now tend to share and download large files, as we observed in reference [31]. Moreover, the number of shared files per host varies not only between P2P file-sharing networks but also over the years, as we can see in figure 5.

P. K. Gummadi et al. [32], in order to model P2P file-sharing workloads, used a value of 1000 users and a value of 40000 objects shared inside the network. These numbers are very far from real values, and they also imply an average of 40 files per user. As we have seen in our measurements, no P2P network since 2003 has had fewer than 52 files per user (eDonkey in March 2006), and there are others with very large values (872 files per user for OpenNap in March 2006).

3. Modeling
In order to model public domain P2P file-sharing networks, we have examined the measurements presented in section 2 and we state the following assumptions:
- The maximum and the minimum number of users inside the network during a day vary over time, so max_users(D) and min_users(D) depend on the day D; we therefore assume the model is valid only when it is applied to periods of less than 24 hours.
- For every network, the graphs of the number of users and of the number of files behave like a cosine waveform.
- The waveform oscillates between max_users(D) and min_users(D), so its amplitude is given by U(D) (see equation 1).

\[ U(D) = \frac{\mathrm{max\_users}(D) - \mathrm{min\_users}(D)}{2} \qquad (1) \]

- The centre of the oscillation is given by V(D) (see equation 2).

\[ V(D) = \frac{\mathrm{max\_users}(D) + \mathrm{min\_users}(D)}{2} \qquad (2) \]

Equation 3 is the one that best models the number of users inside a P2P network for a certain period of time.

\[ \mathrm{users}(t) = V(D) + U(D)\,\cos\!\left(\frac{360\,t}{24\,\alpha} + \beta\right) \qquad (3) \]

The same reasoning applies to the number of files inside the network, which gives equation 4.

\[ \mathrm{files}(t) = W(D) + Z(D)\,\cos\!\left(\frac{360\,t}{24\,\delta} + \lambda\right) \qquad (4) \]

where W(D) and Z(D) are given by equations 5 and 6, respectively.

\[ W(D) = \frac{\mathrm{max\_files}(D) + \mathrm{min\_files}(D)}{2} \qquad (5) \]

\[ Z(D) = \frac{\mathrm{max\_files}(D) - \mathrm{min\_files}(D)}{2} \qquad (6) \]
If we apply the α, β, δ and λ values given in table 2 for the analyzed P2P networks, we can see in figures 6a, 6b, 6c, 6d and 6e (number of users) and 7a, 7b, 7c, 7d and 7e (number of files) that the model closely approximates the real values.
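Equations 3 and 4 share the same form, so they can be evaluated with a single function, as in the sketch below, which uses the Table 2 parameters. Two points are assumptions of this example and are not stated in the paper: the cosine argument is evaluated in radians (as Python's math.cos does), and the daily maximum and minimum are hypothetical values chosen only to exercise the function. The paper also does not state the unit or the origin of t.

```python
import math

# Table 2 parameters per network: (alpha, beta) for users, (delta, lam) for files.
TABLE_2 = {
    "Gnutella":  (55, 2.5, 55, 2.5),
    "FastTrack": (55, 0.34, 55, 0.5),
    "OpenNap":   (55, 1.5, 55, 1.5),
    "eDonkey":   (55, 1.25, 55, 1.25),
    "MP2P":      (55, -1.12, 55, -1.12),
}


def daily_wave(t, day_max, day_min, scale, phase):
    """Equations 3 and 4: centre + amplitude * cos(360*t / (24*scale) + phase)."""
    centre = (day_max + day_min) / 2     # V(D) or W(D), equations 2 and 5
    amplitude = (day_max - day_min) / 2  # U(D) or Z(D), equations 1 and 6
    # Assumption: the argument is taken in radians.
    return centre + amplitude * math.cos((360 * t) / (24 * scale) + phase)


def period(scale):
    """Period of the wave, in the units of t, under the radian assumption."""
    return 2 * math.pi * 24 * scale / 360


# Hypothetical daily extremes for the number of users (not measured data).
DAY_MAX, DAY_MIN = 2_400_000, 1_900_000
alpha, beta, _delta, _lam = TABLE_2["Gnutella"]
curve = [daily_wave(t, DAY_MAX, DAY_MIN, alpha, beta) for t in range(24)]

print(f"period for alpha = {alpha}: {period(alpha):.2f}")
print(f"modelled users at t = 0: {curve[0]:,.0f}")
```

Under the radian reading, α = 55 gives a period of about 23 units of t, which is close to one day if t is expressed in hours; a reading in degrees would instead give a period of 24α. This is only an interpretation of the formula as printed.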
Table 2. α, β, δ and λ values for analyzed networks
Network   | α  | β     | δ  | λ
Gnutella  | 55 | 2.5   | 55 | 2.5
FastTrack | 55 | 0.34  | 55 | 0.5
OpenNap   | 55 | 1.5   | 55 | 1.5
eDonkey   | 55 | 1.25  | 55 | 1.25
MP2P      | 55 | -1.12 | 55 | -1.12

Figure 6a. Model vs Real users Gnutella. Figure 6b. Model vs Real users FastTrack. Figure 6c. Model vs Real users OpenNap. Figure 6d. Model vs Real users eDonkey. Figure 6e. Model vs Real users MP2P. (Plots: Users vs. Hours; series: Real Values, Model.)
Figure 7a. Model vs Real files Gnutella. Figure 7b. Model vs Real files FastTrack. Figure 7c. Model vs Real files OpenNap. Figure 7d. Model vs Real files eDonkey. Figure 7e. Model vs Real files MP2P. (Plots: Files vs. Hours; series: Real Values, Model.)

Gnutella and eDonkey are the networks that fit our model worst, because their values vary greatly. We can see that α=δ=55 for all networks, which means that all networks have the same frequency. This is because all networks tend to have their maximum and minimum peaks around the same time every day. β and λ are the delays of the cosine. Their values differ from one network to another because P2P networks have their maximum and minimum peaks at different hours of the day, so we suppose that users join these P2P networks from different countries. β and λ are the same for users and files in all P2P networks except FastTrack. We suppose this is because the protocol has a delay between the information obtained about the number of users in the network and the information about their files.

4. Conclusions
P2P file-sharing networks can be characterized by some parameters, such as the number of users, shared files and total size of shared information.
Measurements have varied over the years, so they cannot be used as fixed parameters to design new P2P networks, as was done in [30]. Moreover, the number of users used for modeling in [27] is now obsolete, because the eDonkey network currently has more users than the values used there, so the values taken into account for this kind of model have to be higher. In P2P networks whose number of users is stable or decreasing, the number of files per user grows because users replicate the shared files in the network. We have seen that some P2P networks are able to support variations of up to 9 million users in one hour, so this number has to be taken into account when a new P2P network is designed. Our results have been compared and discussed with previous measurements taken by other authors in order to show the evolution of P2P networks. This paper demonstrates that other authors' measurements for P2P networks (such as the average number of files per user, the most popular files and the type of files shared by Internet users) have varied over the years, so they cannot be used as fixed parameters. The degree of correlation between the number of users and their files is different for each P2P file-sharing network. To reach this conclusion we calculated the number of files per user over a week, finding more correlation in MP2P, eDonkey and Gnutella than in FastTrack and OpenNap. The shape of the graphs showing the number of users, files or total size of shared files does not depend on the degree of decentralization of the architecture. As shown, there are regular graphs both in decentralized and in partially centralized architectures. The number of users of some of the older architectures is decreasing because of the appearance of new P2P networks that attract users from older ones. The total number of users connecting to P2P file-sharing networks is growing. Therefore, the number of users increasing Internet traffic through the use of these networks is also growing. The graphs obtained allow us to establish a timetable for each architecture (considering the maximum values for connected users and shared files) showing when it is most probable to obtain the desired content. This model could be applied to other systems that depend on users from different countries.
5. References
[1] Shareaza, http://www.shareaza.com
[2] MLDonkey, http://mldonkey.berlios.de/
[3] Morpheus, http://www.morpheus.com
[4] giFT: Internet File Transfer, http://gift.sourceforge.net/
[5] Benno J. Overeinder, Etienne Posthumus, Frances M.T. Brazier. Integrating Peer-to-Peer Networking and Computing in the AgentScape Framework. 2nd International Conference on Peer-to-Peer Computing (P2P'02). September 2002.
[6] Ihor Kuz, Maarten van Steen. cP2Pc: Integrating P2P networks. Linux Journal, Issue 110, June 2003. Available at www.nlnet.nl/project/cp2pc/20030620-cp2pc.pdf
[7] Peer-to-Peer File Sharing: The impact of file sharing on service provider networks. Sandvine Incorporated, 2002.
[8] Internet2 NetFlow, Weekly reports, info at http://netflow.internet2.edu/weekly/20040216/
[9] The CAIDA website, at http://www.caida.org
[10] United States House of Representatives Committee on Government Reform, Staff Report. File-sharing programs and peer-to-peer networks privacy and security risks. United States House of Representatives Committee on Government. May 2003.
[11] AssetMetrix Research Labs, Corporate. P2P (Peer-To-Peer) Usage and Risk Analysis. Technical Report. 2003.
[12] Stefan Saroiu, P. Krishna Gummadi and Steven D. Gribble. A Measurement Study of Peer-to-Peer File Sharing Systems. Department of Computer Science & Engineering, University of Washington, Technical Report UW-CSE-01-06-02. 2002.
[13] Sandvine Incorporated. Regional Characteristics of P2P: File sharing as a multi-application, multi-national phenomenon. White Paper. October 2003.
[14] Eytan Adar and Bernardo Huberman. Free riding on Gnutella. First Monday, 5(10), October 2000.
[15] Nathaniel Leibowitz, Matei Ripeanu and Adam Wierzbicki. Deconstructing the Kazaa Network. 3rd IEEE Workshop on Internet Applications (WIAPP'03), San Jose, CA. June 2003.
[16] Open Source Napster Server (OpenNap), at http://opennap.sourceforge.net/
[17] Oliver Heckmann and Axel Bock. The eDonkey 2000 Protocol. Technical Report KOM-TR-08-2002, Multimedia Communications Lab, Darmstadt University of Technology, December 2002.
[18] MP2P, http://www.blubster.com/protocol1.html
[19] J. Lloret Mauri, G. Fuster, J. R. Diaz Santos, M. Esteve Domingo. Analysis and Characterization of Peer-To-Peer Filesharing Networks. WSEAS Transactions on Systems, Issue 7, Volume 3, Pp. 2574-2579, September 2004.
[20] Limewire, http://www.limewire.com/developer/
[21] KaZaA Lite, info at http://www.k-lite.tk/
[22] Xnap, info at http://xnap.sourceforge.net/
[23] Emule, info at http://www.emule-project.net/
[24] Piolet, info at http://www.piolet.com/
[25] J. Lloret, J. R. Diaz, J. M. Jimenez, M. Esteve. Public Domain P2P File Sharing Networks Content and their Evolution. The IASTED International Conference on Communication Systems and Networks 2005, Benidorm, Valencia, September 2005.
[26] J. Lloret, B. Molina, C. Palau, M. Esteve. Public Peer-To-Peer Filesharing Networks' Evaluation. The 2nd IASTED International Conference on Communication and Computer Networks 2004, Cambridge, MA (USA), November 2004.
[27] Z. Ge, D. R. Figueiredo, S. Jaiswal, J. Kurose, D. Towsley. Modeling Peer-Peer File Sharing Systems. Proceedings of IEEE Infocom 2003, April 2003.
[28] Qin Lv, Pei Cao, Edith Cohen, Kai Li and Scott Shenker. Search and replication in unstructured peer-to-peer networks. Proceedings of the 6th International Conference on Supercomputing, ACM Press. Pp. 84-95. 2002.
[29] Subhabrata Sen and Jia Wang. Analyzing peer-to-peer traffic across large networks. IEEE/ACM Transactions on Networking (TON). Volume 12, Issue 2. Pp. 219-232. 2004.
[30] B. Yang, H. Garcia-Molina. Comparing Hybrid Peer-to-Peer Systems. Proceedings of the 27th International Conf. on Very Large Data Bases. Pp. 561-570. October 2001.
[31] J. Lloret, G. Fuster, J. R. Diaz, M. Esteve. Analysis and Characterization of Peer-To-Peer Filesharing Networks. WSEAS Transactions on Systems, Issue 7, Volume 3, Pp. 2574-2579, September 2004.
[32] Krishna P. Gummadi, Richard J. Dunn, Stefan Saroiu, Steven D. Gribble, Henry M. Levy, John Zahorjan. Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload. Proceedings of the 19th ACM Symposium on Operating Systems Principles. Pp. 314-329. 2003.