Peer-to-Peer Architecture Case Study: Gnutella Network∗
Matei Ripeanu
[email protected]
Department of Computer Science, The University of Chicago
1100 E. 58th Street, Chicago IL 60637
Tel: (773) 955-4040 (ext. 57395) Fax: (773) 702-8487
Abstract
Despite recent excitement generated by the peer-to-peer (P2P) paradigm and the surprisingly rapid deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it an interesting P2P architecture to study. Like most other P2P applications, Gnutella builds, at the application level, a virtual network with its own routing mechanisms. The topology of this virtual network and the routing mechanisms used have a significant influence on application properties such as performance, reliability, and scalability. We have built a "crawler" to extract the topology of Gnutella's application level network. In this paper we analyze the topology graph and evaluate generated network traffic. The two major findings we focus on are: (1) although Gnutella is not a pure power-law network, its current configuration has the benefits and the drawbacks of a power-law structure, and (2) the Gnutella virtual network topology does not match well the underlying Internet topology, hence leading to ineffective use of the physical networking infrastructure. These findings guide us to propose changes to the Gnutella protocol and implementations that may bring significant performance and scalability improvements. Although the Gnutella network might fade, we believe the P2P paradigm is here to stay. In this light, our findings as well as our measurement and analysis techniques bring precious insight into P2P system design tradeoffs.
Keywords: peer-to-peer system evaluation, self-organized networks, power-law network, topology analysis.
1. Introduction
Peer-to-peer systems (P2P) have emerged as a significant social and technical phenomenon over the last year. They provide infrastructure for communities that share CPU cycles (e.g., SETI@Home, Entropia) and/or storage space (e.g., Napster, FreeNet, Gnutella), or that support collaborative environments (Groove). Two factors have fostered the recent explosive growth of such systems: first, the low cost and high availability of large numbers of computing and storage resources, and second, increased network connectivity. As these trends continue, the P2P paradigm is bound to become more popular.

Unlike traditional distributed systems, P2P networks aim to aggregate large numbers of computers that join and leave the network frequently and that might not have permanent network (IP) addresses. In
∗ An extended version of this paper was published as University of Chicago Technical Report TR-2001-26.
pure P2P systems, individual computers communicate directly with each other and share information and resources without using dedicated servers. A common characteristic of this new breed of systems is that they build, at the application level, a virtual network with its own routing mechanisms. The topology of the virtual network and the routing mechanisms used have a significant impact on application properties such as performance, reliability, and, in some cases, anonymity. The virtual topology also determines the communication costs associated with running the P2P application, both at individual hosts and in the aggregate. Note that the decentralized nature of pure P2P systems means that these properties are emergent properties, determined by entirely local decisions made by individual resources, based only on local information: we are dealing with a self-organized network of independent entities.

These considerations have motivated us to conduct a detailed study of the topology and protocols of a popular P2P system: Gnutella. In this study, we benefited from Gnutella's large existing user base and open architecture, and, in effect, use the public Gnutella network as a large-scale, if uncontrolled, testbed. We proceeded as follows. First, we captured the network topology, its generated traffic, and dynamic behavior. Then, we used this raw data to perform a macroscopic analysis of the network, to evaluate costs and benefits of the P2P approach, and to investigate possible improvements that would allow better scaling and increased reliability.

Our measurements and analysis of the Gnutella network are driven by two primary questions. The first concerns its connectivity structure. Recent research [1,8,7] shows that networks as diverse as natural networks formed by molecules in a cell, networks of people in a social group, or the Internet, organize themselves so that most nodes have few links while a tiny number of nodes, called hubs, have a large number of links.
[14] finds that networks following this organizational pattern (power-law networks) display an unexpected degree of robustness: the ability of their nodes to communicate is unaffected even by extremely high failure rates. However, error tolerance comes at a high price: these networks are vulnerable to attacks, i.e., to the selection and removal of a few nodes that provide most of the network's connectivity. We show here that, although Gnutella is not a pure power-law network, it preserves good fault tolerance characteristics while being less dependent than a pure power-law network on highly connected nodes that are easy to single out (and attack).

The second question concerns how well (if at all) the Gnutella virtual network topology maps to the physical Internet infrastructure. There are multiple reasons to analyze this issue. First, it is a question of crucial importance for Internet Service Providers (ISP): if the virtual topology does not follow the physical infrastructure, then the additional stress on the infrastructure and, consequently, the costs for ISPs, are immense. This point has been raised on various occasions [9,12] but, as far as we know, we are the first to provide a quantitative evaluation of P2P application and Internet topology (mis)matching. Second, the scalability of any P2P application is ultimately determined by its efficient use of underlying resources.

We are not the first to analyze the Gnutella network. In particular, the Distributed Search Solutions (DSS) group [15] has published results of their Gnutella surveys [4,5], and others have used their data to analyze Gnutella users' behavior [2] and to analyze search protocols for power-law networks [6]. However, our network crawling and analysis technology (developed independently of this work) goes significantly further in terms of scale (both spatial and temporal) and sophistication. While DSS presents only raw facts about the network, we analyze the generated network traffic, find patterns in network organization, and investigate its efficiency in using the underlying network infrastructure.

The rest of the paper is structured as follows: the next section succinctly describes the Gnutella protocol and application.
Section 3 introduces the crawler we developed to discover Gnutella's virtual network topology. In Section 4 we analyze the network and answer the questions introduced in the previous paragraphs. We conclude in Section 5.
2. Gnutella Protocol: Design Goals and Description
The Gnutella protocol [3] is an open, decentralized group membership and search protocol, mainly used for file sharing. The term Gnutella also designates the virtual network of Internet accessible hosts running Gnutella-speaking applications (this is the "Gnutella network" we measure) and a number of smaller, and often private, disconnected networks.

Like most P2P file sharing applications, the Gnutella protocol was designed to meet the following goals:
o Ability to operate in a dynamic environment. P2P applications operate in dynamic environments, where hosts may join or leave the network frequently. They must achieve flexibility in order to keep operating transparently despite a constantly changing set of resources.
o Performance and Scalability. The P2P paradigm shows its full potential only on large-scale deployments where the limits of the traditional client/server paradigm become obvious. Moreover, scalability is important as P2P applications exhibit what economists call the "network effect" [10]: the value of a network to an individual user increases with the total number of users participating in the network. Ideally, when increasing the number of nodes, aggregate storage space and file availability should grow linearly, response time should remain constant, while search throughput should remain high or grow.
o Reliability. External attacks should not cause significant data or performance loss.
o Anonymity. Anonymity is valued as a means to protect the privacy of people seeking or providing information that may not be popular.

Gnutella nodes, called servents by developers, perform tasks normally associated with both SERVers and cliENTS. They provide client-side interfaces through which users can issue queries and view search results, accept queries from other servents, check for matches against their local data set, and respond with corresponding results. These nodes are also responsible for managing the background traffic that spreads the information used to maintain network integrity.

In order to join the system a new node/servent initially connects to one of several known hosts that are almost always available (e.g., gnutellahosts.com).
Once attached to the network (having one or more open connections with nodes already in the network), nodes send messages to interact with each other. Messages can be broadcasted (i.e., sent to all nodes with which the sender has open TCP connections) or simply back-propagated (i.e., sent on a specific connection on the reverse of the path taken by an initial, broadcasted, message). Several features of the protocol facilitate this broadcast/back-propagation mechanism. First, each message has a randomly generated identifier. Second, each node keeps a short memory of the recently routed messages, used to prevent re-broadcasting and implement back-propagation. Third, messages are flagged with time-to-live (TTL) and "hops passed" fields.

The messages allowed in the network are:
Group Membership (PING and PONG Messages). A node joining the network initiates a broadcasted PING message to announce its presence. When a node receives a PING message it forwards it to its neighbors and initiates a back-propagated PONG message. The PONG message contains information about the node such as its IP address and the number and size of shared files.
Search (QUERY and QUERY RESPONSE Messages). QUERY messages contain a user specified search string that each receiving node matches against locally stored file names. QUERY messages are
broadcasted. QUERY RESPONSES are back-propagated replies to QUERY messages and include information necessary to download a file.
File Transfer (GET and PUSH Messages). File downloads are done directly between two peers using GET/PUSH messages.

To summarize: to become a member of the network, a servent (node) has to open one or many connections with nodes that are already in the network. In the dynamic environment where Gnutella operates, nodes often join and leave and network connections are unreliable. To cope with this environment, after joining the network, a node periodically PINGs its neighbors to discover other participating nodes. Using this information, a disconnected node can always reconnect to the network. Nodes decide where to connect in the network based only on local information, thus forming a dynamic, self-organizing network of independent entities. This virtual, application level network has Gnutella servents at its nodes and open TCP connections as its links. In the following sections we present our solution to discover this network topology and analyze its characteristics.

3. Data Collection: The Crawler
We have developed a crawler that joins the network as a servent and uses the membership protocol (the PING-PONG mechanism) to collect topology information. In this section we briefly describe the crawler and discuss other issues related to data collection.

The crawler starts with a list of nodes, initiates a TCP connection to each node in the list, sends a generic join-in message (PING), and discovers the neighbors of the node it contacted based on the replies it gets back (PONG messages). Newly discovered neighbors are added to the list. For each discovered node the crawler stores its IP address, port, the number of files and the total space shared. We started with a short, publicly available list of initial nodes, but in time we have incrementally built our own list with more than 400,000 nodes that have been active at one time or another.

We first developed a sequential version of the crawler.
Using empirically determined optimal values for the connection establishment timeout as well as for the connection listening timeout (the time interval the crawler waits to receive PONGs after it has sent a PING), a sequential crawl of the network proved slow: about 50 hours even for a small network (4000 nodes). This slow search speed has two disadvantages: not only is it not scalable, but, because of the dynamic network behavior, the result of our crawl is far from a network topology snapshot.

In order to reduce the crawling time, we next developed a distributed crawling strategy. Our distributed crawler has a client/server architecture: the server is responsible for managing the list of nodes to be contacted, assembling the final graph, and assigning work to clients. Clients receive a small list of initial points and discover the network topology around these points. Although we could use a large number of clients (easily in the order of hundreds), we decided to use only up to 50 clients in order to reduce the invasiveness of our search. These techniques have allowed us to reduce the crawling time to a couple of hours even for a large list of starting points and a discovered topology graph with more than 30,000 active nodes.

Note that in the following we use a conservative definition of network membership: we exclude the nodes that, although reported as part of the network, our crawler could not connect to. This situation might occur when the local servent is configured to allow only a limited number of TCP connections or when the node leaves the network before the crawler contacts it.
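The crawl loop described in this section can be sketched as a breadth-first traversal. This is an illustrative sketch only, not the crawler used in this study: the `ping` callback is a hypothetical stand-in for opening a TCP connection, sending a PING and collecting PONG replies, and it is assumed to return `None` when a node cannot be contacted.

```python
from collections import deque

def crawl(seeds, ping):
    """Breadth-first crawl of an overlay network.

    `ping(node)` is a hypothetical callback: it returns the list of
    neighbors reported in PONG replies, or None if the node is unreachable.
    """
    pending = deque(seeds)            # nodes still to be contacted
    known = set(seeds)                # every node ever added to the list
    topology = {}                     # contacted node -> reported neighbors
    while pending:
        node = pending.popleft()
        neighbors = ping(node)
        if neighbors is None:
            continue                  # conservative membership: drop unreachable nodes
        topology[node] = set(neighbors)
        for neighbor in neighbors:
            if neighbor not in known: # newly discovered neighbors join the list
                known.add(neighbor)
                pending.append(neighbor)
    # Keep only links whose two endpoints were both actually contacted.
    return {u: {v for v in vs if v in topology} for u, vs in topology.items()}

# Toy network: node "d" is reported by "b" but refuses connections.
FAKE_NETWORK = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": None}
graph = crawl(["a"], FAKE_NETWORK.get)
```

In this toy run node "d" is excluded from the resulting graph, mirroring the conservative membership definition above. A distributed variant would hand disjoint slices of the pending list to several clients and merge their partial graphs on the server.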
4. Gnutella Network Analysis
We start by briefly presenting Gnutella network growth trends and dynamic behavior. Our data shows that (Section 4.1), although over the past 6 months Gnutella overhead traffic has been decreasing, currently the generated traffic volume represents a significant percentage of total Internet traffic and is a major obstacle to further growth. We continue with a macroscopic analysis of the network: we study first connectivity patterns (Section 4.2) and then the mapping of the Gnutella topology to the underlying networking infrastructure (Section 4.3).

Figure 1 presents the growth of the Gnutella network in the past 6 months. We ran our crawler during November 2000, February/March 2001, and May 2001. While in November 2000 the largest connected component of the network we found had 2,063 hosts, this grew to 14,949 hosts in March and 48,195 hosts in May 2001. Although Gnutella's failure to scale has been predicted time and again, the number of nodes in the largest network component grew about 25 times (admittedly from a low base) in the past 6 months.

We identify three factors that allowed the network this exceptional growth in response to user pressure. First, as we argue in Section 4.1, careful engineering led to significant overhead traffic decreases over the last six months. Second, the network connectivity of Gnutella participating machines improved significantly. Our rough estimate (based on tracing DNS host names) is that the number of DSL or cable-connected machines grew twice as fast as the overall network size. While in November 2000 about 24% of the nodes were DSL or cable modem enabled, this number grew to about 41% six months later. Finally, the efforts made to better use available networking resources by sending nodes with low available bandwidth to the edges of the network eventually paid off.

It is worth mentioning that the number of connected components is relatively small: the largest connected component always includes more than 95% of the active nodes discovered, while the second biggest connected component usually has less than 10 nodes.
[Figure 1 plot: "Gnutella Network Growth". Y axis: number of nodes in the largest network component ('000). X axis: crawl dates, from 11/20/00 to 05/29/01.]
Figure 1: Gnutella network growth. The plot presents the number of nodes in the largest connected component in the network. Data collected during Nov. 2000, Feb./March 2001 and May 2001. We found a significantly larger network around Memorial Day (24-28 May) and Thanksgiving, when apparently more people are online.

[Figure 2 plot: "Message Frequency". Y axis: messages per second. X axis: minute. Series: Ping, Push, Query, Other.]
Figure 2: Generated traffic (messages/sec) in Nov. 2000 classified by message type over a 376 minute period. Note that overhead traffic (PING messages, that serve only to maintain network connectivity) formed more than 50% of the traffic. The only 'true' user traffic is QUERY messages. Traffic became more efficient by May 2001.

Using records of successive crawls, we investigate the dynamic graph structure over time. We discover that about 40% of the nodes leave the network in less than 4 hours, while only 25% of the nodes are alive for more than 24 hours. Given this dynamic behavior, it is important to find the appropriate tradeoff between discovery time and invasiveness of our crawler. Increasing the number of parallel crawling tasks reduces discovery time but increases the burden on the application. Obviously, the Gnutella map our crawler produces is not an exact 'snapshot' of the network. However, we argue that the network graph we obtain is close to a snapshot in a statistical sense: all properties of the network (size, diameter, average connectivity, and connectivity distribution) are preserved.
4.1 Estimate of Gnutella Generated Traffic
We used a modified version of the crawler to eavesdrop on the traffic generated by the network. In Figure 2 we classify, according to message type, the traffic that goes across one randomly chosen link in November 2000. After adjusting for message size, we find that, on average, only 36% of the total traffic (in bytes) is user-generated traffic (QUERY messages). The rest is overhead traffic: 55% is used to maintain group membership (PING and PONG messages) while 9% contains either non-standard messages (1%) or PUSH messages broadcast by servents that are not compliant with the latest version of the protocol. Apparently, by June 2001, these engineering problems were solved with the arrival of newer Gnutella implementations: generated traffic contains 92% QUERY messages, 8% PING messages and insignificant levels of other message types.

Given the small diameter of the network (any two nodes are generally less than 7 hops away, see Figure 3), the message time-to-live (TTL=7) preponderantly used, and the flooding-based routing algorithm employed, most links support similar traffic. We verified this theoretical conclusion by measuring the traffic at multiple, randomly chosen, nodes. As a result, the total Gnutella generated traffic is proportional to the number of connections in the network. Based on our measurements we estimate the total traffic (excluding file transfers) for a large Gnutella network as 1 Gbps (170,000 connections for a 50,000-node Gnutella network times 6 Kbps per connection) or about 330 TB/month. To put this traffic volume into perspective we note that it amounts to about 1.7% of total traffic in US Internet backbones in December 2000 (as reported in [16]). We infer that the volume of generated traffic is an important obstacle for further growth and that efficient use of underlying network infrastructure is crucial for better scaling and wider deployment.
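The argument that flooding loads most links similarly can be illustrated with a small simulation. The sketch below is a simplified model under stated assumptions, not the Gnutella wire protocol: the `seen` set stands in for each node's memory of recently routed message identifiers, a node forwards a message only the first time it sees it, and the TTL is decremented once per hop.

```python
from collections import deque

def flood(graph, source, ttl):
    """Flood one message from `source` over `graph` (node -> neighbor list).

    Returns the set of nodes reached and, for every link, how many times
    the message was transmitted over it.
    """
    seen = {source}                   # nodes that have already routed this message
    link_load = {}                    # frozenset({u, v}) -> number of transmissions
    queue = deque([(source, None, ttl)])
    while queue:
        node, sender, remaining = queue.popleft()
        if remaining == 0:
            continue                  # TTL exhausted: the message is not forwarded
        for neighbor in graph[node]:
            if neighbor == sender:
                continue              # never echo on the incoming connection
            link = frozenset((node, neighbor))
            link_load[link] = link_load.get(link, 0) + 1
            if neighbor not in seen:  # duplicates are received but not re-broadcast
                seen.add(neighbor)
                queue.append((neighbor, node, remaining - 1))
    return seen, link_load

# Toy ring of four servents; with TTL=7 the message crosses every link.
RING = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
reached, load = flood(RING, "A", ttl=7)
```

When the TTL is at least the network diameter, every link in this model carries each flooded message once or twice, which is the intuition behind estimating total traffic as the number of connections times a per-connection rate.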
[Figure 3 plot: "Search characteristics". Y axis: percent of node pairs (%). X axis: node-to-node shortest path (hops), 1 to 12.]
Figure 3: Distribution of node-to-node shortest paths. Each line represents one network measurement. Note that, although the largest network diameter (the longest node-to-node path) is 12, more than 95% of node pairs are at most 7 hops away.

[Figure 4 plot: "Graph connectivity". Y axis: number of links ('000). X axis: number of nodes, 0 to 50,000.]
Figure 4: Average node connectivity. Each point represents one Gnutella network. Note that, as the network grows, the average number of connections per node remains constant (average node connectivity is 3.4 connections per node).
One interesting feature of the network is that, over a seven-month period, with the network scaling up almost two orders of magnitude, the average number of connections per node remained constant (Figure 4). Assuming this invariant holds, it is possible to estimate the generated traffic for larger networks and find scalability limits based on available bandwidth.
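As a worked example of this estimate, the sketch below reproduces the arithmetic of Section 4.1 from the reported figures (3.4 connections per node, 6 Kbps per connection) and extrapolates it to other network sizes. The 30-day month and the decimal units (1 Kbps = 1000 bits/s, 1 TB = 10^12 bytes) are assumptions made here for illustration.

```python
CONNECTIONS_PER_NODE = 3.4            # average connectivity reported in Figure 4
KBPS_PER_CONNECTION = 6               # measured traffic per connection (Section 4.1)
SECONDS_PER_MONTH = 30 * 24 * 3600    # assumption: 30-day month

def estimated_traffic(nodes):
    """Return (Gbps, TB/month) for a network of `nodes` servents,
    assuming connectivity and per-connection traffic stay constant."""
    connections = nodes * CONNECTIONS_PER_NODE
    bits_per_second = connections * KBPS_PER_CONNECTION * 1000
    gbps = bits_per_second / 1e9
    tb_per_month = bits_per_second / 8 * SECONDS_PER_MONTH / 1e12
    return gbps, tb_per_month

gbps, tb_month = estimated_traffic(50_000)   # about 1.02 Gbps and 330 TB/month
```

Because the model is linear in the number of nodes, doubling the network size doubles the estimated traffic; a bandwidth budget therefore translates directly into a bound on network size.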
4.2. Connectivity and Reliability in Gnutella Network. Power-law Distributions.
When analyzing global connectivity and reliability patterns in the Gnutella network, it is important to keep in mind the self-organized network behavior: users decide only the maximum number of connections a node should support, while nodes decide whom to connect to or when to drop/add a connection based only on local information.

Recent research [1,7,8,13] shows that many natural networks such as molecules in a cell, species in an ecosystem, and people in a social group organize themselves as so-called power-law networks. In these networks most nodes have few links and a tiny number of hubs have a large number of links. More specifically, in a power-law network the fraction of nodes with $L$ links is proportional to $L^{-k}$, where $k$ is a network dependent constant.

This structure helps explain why networks ranging from metabolisms to ecosystems to the Internet are generally highly stable and resilient, yet prone to occasional catastrophic collapse [14]. Since most nodes (molecules, Internet routers, Gnutella servents) are sparsely connected, little depends on them: a large fraction can be taken away and the network stays connected. But, if just a few highly connected nodes are eliminated, the whole system could crash. One implication is that these networks are extremely robust when facing random node failures, but vulnerable to well-planned attacks.

Given the diversity of networks that exhibit power-law structure and their properties we were interested to determine whether Gnutella falls into the same category. Figure 5 presents the connectivity distribution in Nov. 2000. Although data are noisy (due to the small size of the networks), we can easily recognize the signature of a power-law distribution: the connectivity distribution appears as a line on a log-log plot. [6,4] confirm that early Gnutella networks were power-law. Later measurements (Figure 6) however, show that more recent networks move away from this organization: there are too few nodes with low connectivity to form a pure power-law network.
In these networks the power-law distribution is preserved for nodes with more than 10 links while nodes with fewer links follow an almost constant distribution.

[Figure 5 plot: "Node connectivity distribution". Y axis: number of nodes (log scale). X axis: number of links (log scale).]
Figure 5: Connectivity distribution during November 2000. Each series of points represents one Gnutella network topology we discovered at different times during that month. Note the log scale on both axes. Gnutella nodes organized themselves into a power-law network.

[Figure 6 plot: "Node connectivity distribution". Y axis: number of nodes (log scale). X axis: number of links (log scale).]
Figure 6: Connectivity distributions during March 2001. Each series of points represents one Gnutella network topology discovered during March 2001. Note the log scale on both axes. Networks crawled during May/June 2001 show a similar pattern.
An interesting issue is the impact of this new, multi-modal distribution on network reliability. We believe that the more uniform connectivity distribution preserves the network capability to deal with random node failures while reducing the network dependence on highly connected, easy to single out (and attack) nodes.
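The "line on a log-log plot" signature used in this section can be checked numerically. The sketch below estimates the exponent $k$ by an ordinary least-squares fit of log(count) against log(links). This is a simple heuristic chosen here for illustration and is not necessarily the procedure used to produce Figures 5 and 6; the input format (a mapping from number of links to number of nodes) is also an assumption.

```python
import math

def fit_power_law_exponent(degree_counts):
    """Estimate k in count ~ L^-k from a {links: node_count} mapping,
    using a least-squares line through the points on log-log axes."""
    points = [(math.log(links), math.log(count))
              for links, count in degree_counts.items()
              if links > 0 and count > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx                 # slope is -k for a power law

# Synthetic distribution that follows count = 1000 * L^-2 exactly.
synthetic = {links: 1000 * links ** -2.0 for links in range(1, 50)}
k = fit_power_law_exponent(synthetic)
```

For a multi-modal distribution like the one in Figure 6, fitting only the nodes with more than 10 links and comparing against a fit over all nodes would expose the departure from a pure power law.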
4.3. Internet Infrastructure and Gnutella Network
Peer-to-peer computing brings an important change to the way we use the Internet: it enables computers sitting at the edges of the network to act as both clients and servers. As a result, P2P applications radically change the amount of bandwidth the average Internet user consumes. Most Internet Service Providers (ISP) use flat rates to bill their clients. If P2P applications become ubiquitous, they could break the existing business models of many ISPs and force them to change their pricing scheme [9].

Given the considerable traffic volume a P2P application generates (see our Gnutella estimates in the previous section) it is crucial that it makes good use of available networking resources. The scalability of a P2P application is ultimately determined by how efficiently it uses the underlying network. Gnutella's store-and-forward architecture makes the virtual network topology extremely important. The larger the mismatch between the network infrastructure and the P2P application's virtual topology, the bigger the "stress" on the infrastructure. In the following we investigate whether the self-organizing Gnutella network shapes its topology to map well on the physical infrastructure.

Let us first present an example to highlight the importance of a "fitting" virtual topology. In Figure 7, eight hosts participate in a Gnutella-like network. We use black, solid lines to present the underlying network infrastructure and blue, dotted lines for the application virtual topology. In the left picture, the virtual topology closely matches the infrastructure. In the right picture, the virtual topology, although functionally similar, does not match the infrastructure. In this case the traffic the link D-E has to support is six times higher.
[Figure 7 diagram: two panels, each showing eight hosts labeled A to H.]
Figure 7: Gnutella's virtual network topology (blue, dotted arrows) mapping on the underlying network infrastructure (black). Left picture: perfect mapping. Right picture: inefficient mapping; link D-E needs to support six times higher traffic.

Unfortunately, it is prohibitively expensive to map Gnutella on the detailed Internet topology (firstly, due to the inherent difficulty of extracting Internet topology and secondly, due to the computational scale of the problem). Instead, we proceed with two high level experiments that highlight the mismatch between the topologies of the two networks.

The Internet is a collection of Autonomous Systems (AS) which are connected by routers. ASs, in turn, are collections of local area networks under a single technical administration. From an ISP point of view traffic crossing AS borders is more expensive than local traffic. We found that only 2-5% of Gnutella connections link nodes located within the same AS, although more than 40% of these nodes
are located in the top ten ASs. This indicates that, although unforced by node distribution, most Gnutella generated traffic crosses AS borders and is thus more expensive to handle.

In the second experiment we assume that the hierarchical organization of domain names mirrors that of the Internet infrastructure. For example, it is likely that communication costs between two hosts in the "uchicago.edu" domain are significantly smaller than between "uchicago.edu" and "sdsc.edu." The underlying assumption here is that domain names express some sort of organizational hierarchy and that organizations tend to build networks that exploit locality within that hierarchy.

In order to study how well the Gnutella virtual topology maps onto the Internet partitioning as defined by domain names, we divide the Gnutella virtual topology graph into clusters, i.e., subgraphs with high interior connectivity. Given the flooding-like routing algorithm used by Gnutella, it is within these clusters that most load is generated. We are therefore interested to see how well these clusters map on the partitioning defined by the domain naming scheme.

We use a simple clustering algorithm based on the connectivity distribution described earlier: we define as clusters subgraphs formed by one hub with its adjacent nodes. If two clusters have more than 25% nodes in common, we merge them. After the clustering is done, we (1) assign nodes that are included in more than one cluster only to the largest cluster and (2) form a last cluster with nodes that are not included in any other cluster.

We define the entropy [24] of a set $C$, containing $|C|$ hosts, each labeled with one of the $n$ distinct domain names, as:
$$E(C) = \sum_{i=1}^{n} \left( -p_i \log(p_i) - (1 - p_i) \log(1 - p_i) \right),$$

where $p_i$ is the probability of randomly picking a host with domain name $i$.

We then define the entropy of a clustering of a graph of size $|C|$, clustered in $k$ clusters $C_1, C_2, \ldots, C_k$ of sizes $|C_1|, |C_2|, \ldots, |C_k|$, with $|C| = |C_1| + |C_2| + \ldots + |C_k|$, as:

$$E(C_1, C_2, \ldots, C_k) = \sum_{i=1}^{k} \frac{|C_i|}{|C_1| + |C_2| + \ldots + |C_k|} \, E(C_i)$$
We base our reasoning on the property that $E(C) \geq E(C_1, C_2, \ldots, C_k)$ no matter how the clusters $C_1, C_2, \ldots, C_k$ are chosen. If the clustering matches the domain partitioning, then we should find that $E(C) \gg E(C_1, C_2, \ldots, C_k)$. Conversely, if the clustering $C_1, C_2, \ldots, C_k$ has the same level of randomness as the initial set $C$, then the entropy should remain largely unchanged. Essentially, the entropy function is used here to measure how well the two partitions applied on the set of nodes match: the first partition uses the information contained in domain names, while the second uses the clustering heuristic. Note that a large class of data mining and machine learning algorithms based on information gains (ID3, C4.5, etc. [25]) use a similar argument to build their decision trees.

We performed this analysis on 10 topology graphs collected during February/March 2001. We detected no significant decrease in entropy after performing the clustering (all decreases were within less than 8% from the initial entropy value). Consequently, we conclude that Gnutella nodes cluster in a way that is completely independent from the Internet structure. Assuming that the Internet domain name structure roughly matches the underlying topology (the cost of sending data within a domain is smaller than that of sending data across domains), we conclude that the self-organizing Gnutella network does not efficiently use the underlying physical infrastructure.
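The entropy comparison can be made concrete with a short sketch of the two definitions above. The natural logarithm and the representation of a cluster as a list of domain-name labels are assumptions made here for illustration; domain names absent from a cluster contribute zero and are therefore skipped.

```python
import math
from collections import Counter

def entropy(labels):
    """E(C): sum over domain names of -p*log(p) - (1-p)*log(1-p)."""
    n = len(labels)
    total = 0.0
    for count in Counter(labels).values():
        p = count / n
        for q in (p, 1 - p):
            if q > 0:                 # 0*log(0) is taken as 0
                total -= q * math.log(q)
    return total

def clustering_entropy(clusters):
    """E(C_1..C_k): size-weighted average of the per-cluster entropies."""
    size = sum(len(c) for c in clusters)
    return sum(len(c) / size * entropy(c) for c in clusters if c)

hosts = ["uchicago.edu"] * 4 + ["sdsc.edu"] * 4
matching = [["uchicago.edu"] * 4, ["sdsc.edu"] * 4]    # clusters follow domains
unrelated = [["uchicago.edu", "sdsc.edu"] * 2] * 2     # clusters ignore domains
```

In this toy case `clustering_entropy(matching)` drops to zero, while `clustering_entropy(unrelated)` equals `entropy(hosts)`. The second outcome is the pattern reported for the measured Gnutella graphs, where the decrease stayed below 8%.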
5. Summary and Potential Improvements
Sociological circumstances that have fostered the success of the Gnutella network might change and the network might fade. P2P, however, "is one of those rare ideas that is simply too good to go away" [18]. Despite recent excitement generated by this paradigm and the surprisingly rapid deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it an interesting P2P architecture to study. Our measurement and analysis techniques can be used for most P2P systems to enhance general understanding of design tradeoffs.

Our analysis shows that Gnutella node connectivity follows a multimodal distribution, combining a power law and a quasi-constant distribution. This property keeps the network as reliable as a pure power-law network when assuming random node failures, and makes it harder for a malicious adversary to attack. Gnutella takes few precautions to ward off potential attacks. For example, the network topology information that we obtain here is easy to obtain and would permit highly efficient denial-of-service attacks. Some form of security mechanism that would prevent an intruder from gathering topology information appears essential for the long-term survival of the network.

We have estimated that, as of June 2001, the network generates about 330 TB/month only to remain connected and broadcast user queries. This traffic volume represents a significant fraction of the total Internet traffic and makes the future growth of the Gnutella network particularly dependent on efficient network usage. We have also documented the topology mismatch between the self-organized, application level Gnutella network and the underlying physical networking infrastructure. We believe this has major implications for the scalability of the Internet (or, equivalently, for the business models of ISPs). This problem must be solved if Gnutella or similarly built systems are to reach larger deployment.
We see two directions for improvement. First, we observe that the application-level topology determines the volume of generated traffic, the search success rate, and the application reliability. We imagine an agent that constantly monitors the network and intervenes by asking servents to drop or add links as necessary to keep the network topology efficient. Additionally, agents (or nodes) could learn about the underlying physical network and build the virtual application topology accordingly. Note that implementing this idea requires some minimal protocol modifications.

A second, orthogonal, direction is to replace flooding with a smarter (less expensive in terms of communication costs) routing and group communication mechanism. Recent research projects such as Chord [19], CAN [21], SDS [23] or OceanStore [22] focus on building Internet scale overlay networks and offer a vast array of choices future Gnutella implementations could build on.

6. Acknowledgements
I am grateful to Ian Foster, Adriana Iamnitchi, Larry Lidz, Conor McGrath, Dustin Mitchell, and Alain Roy for their insightful comments and generous support. This work started as a joint class project with Yugo Nakai and Xuehai Zhang. Rongguan Jin, Knox McMurry, and Yan Wang participated in refining this work. This research was supported by the National Science Foundation under contract ITR-0086044.

7. References
[1] M. Faloutsos, P. Faloutsos, C. Faloutsos, On Power-Law Relationships of the Internet Topology, SIGCOMM, 1999.
[2] E. Adar, B. Huberman, Free riding on Gnutella, First Monday, Vol 5-10, Oct. 2, 2000.
[3] The Gnutella protocol specification v4.0, http://dss.clip2.com/GnutellaProtocol04.pdf
[4] DSS Group, Gnutella: To the Bandwidth Barrier and Beyond, http://dss.clip2.com, Nov. 6, 2000.
[5] DSS Group, Bandwidth Barriers to Gnutella Network Scalability, http://dss.clip2.com, Sept. 8, 2000.
[6] Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, B. Huberman, Search in Power-Law Networks.
[7] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener, Graph structure in the web, 8th International WWW Conference, May 15-19, Amsterdam.
[8] A. Barabasi and R. Albert, Emergence of scaling in random networks, Science, 286(509), 1999.
[9] Todd Spangler, The Hidden Cost Of P2P, Interactive Week, February 26, 2001.
[10] M. Katz, C. Shapiro, Systems Competition and Network Effects, Journal of Economic Perspectives, vol. 8, no. 2, pp. 93-115, 1994.
[11] T. Cover, J. Thomas, Elements of Information Theory, Wiley, 1991.
[12] B. St. Arnaud, Scaling Issues on Internet Networks, Technical Report, CANARIE Inc.
[13] A. Barabási, R. Albert, H. Jeong, G. Bianconi, Power-law distribution of the World Wide Web, Science 287, (2000).
[14] R. Albert, H. Jeong, A. Barabási, Attack and tolerance in complex networks, Nature 406, 378 (2000).
[15] http://dss.clip2.com
[16] K. Coffman, A. Odlyzko, Internet growth: Is there a "Moore's Law" for data traffic?, Handbook of Massive Data Sets, J. Abello & all editors, Kluwer, 2001.
[17] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.
[18] The Economist, Invention is the Easy Bit, The Economist Technology Quarterly, 6/23/01.
[19] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, SIGCOMM 2001.
[20] S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker, A Scalable Content-Addressable Network. Submitted for publication, 2000.
[21] Ben Zhao, John Kubiatowicz, Anthony Joseph, Tapestry: An infrastructure for wide-area fault tolerant location and routing.
[22] Zachary Ives, Alon Levy, Jayant Madhavan, Rachel Pottinger, Stefan Saroiu, Igor Tatarinov, Shiori Betzler, Qiong Chen, Ewa Jaslikowska, Jing Su, W.T. Theodora Yeung, Self-Organizing Data Sharing Communities with SAGRES, SIGMOD 2000, Dallas, TX.
[23] Todd D. Hodes, Steven E. Czerwinski, Ben Y. Zhao, Anthony D. Joseph, Randy H. Katz, An Architecture for Secure Wide-Area Service Discovery, ACM Baltzer Wireless Networks: selected papers from MobiCom 1999.
[24] T. Cover, J. Thomas, Elements of Information Theory, Wiley, 1991.
[25] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.