Peer-to-Peer Architecture Case Study: Gnutella Network∗

Matei Ripeanu
[email protected]
Department of Computer Science, The University of Chicago
1100 E. 58th Street, Chicago IL 60637
Tel: (773) 955-4040 (ext. 57395) Fax: (773) 702-8487

Abstract

Despite recent excitement generated by the peer-to-peer (P2P) paradigm and the surprisingly rapid deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it an interesting P2P architecture to study. Like most other P2P applications, Gnutella builds, at the application level, a virtual network with its own routing mechanisms. The topology of this virtual network and the routing mechanisms used have a significant influence on application properties such as performance, reliability, and scalability. We have built a “crawler” to extract the topology of Gnutella’s application level network. In this paper we analyze the topology graph and evaluate generated network traffic. The two major findings we focus on are: (1) although Gnutella is not a pure power-law network, its current configuration has the benefits and the drawbacks of a power-law structure, and (2) the Gnutella virtual network topology does not match well the underlying Internet topology, hence leading to ineffective use of the physical networking infrastructure. These findings guide us to propose changes to the Gnutella protocol and implementations that may bring significant performance and scalability improvements. Although the Gnutella network might fade, we believe the P2P paradigm is here to stay. In this light, our findings as well as our measurement and analysis techniques bring precious insight into P2P system design tradeoffs.

Keywords: peer-to-peer system evaluation, self-organized networks, power-law network, topology analysis.

∗ An extended version of this paper was published as University of Chicago Technical Report TR-2001-26.

1. Introduction

Peer-to-peer systems (P2P) have emerged as a significant social and technical phenomenon over the last year. They provide infrastructure for communities that share CPU cycles (e.g., SETI@Home, Entropia) and/or storage space (e.g., Napster, FreeNet, Gnutella), or that support collaborative environments (Groove). Two factors have fostered the recent explosive growth of such systems: first, the low cost and high availability of large numbers of computing and storage resources, and second, increased network connectivity. As these trends continue, the P2P paradigm is bound to become more popular.

Unlike traditional distributed systems, P2P networks aim to aggregate large numbers of computers that join and leave the network frequently and that might not have permanent network (IP) addresses. In pure P2P systems, individual computers communicate directly with each other and share information and resources without using dedicated servers. A common characteristic of this new breed of systems is that they build, at the application level, a virtual network with its own routing mechanisms. The topology of the virtual network and the routing mechanisms used have a significant impact on application properties such as performance, reliability, and, in some cases, anonymity. The virtual topology also determines the communication costs associated with running the P2P application, both at individual hosts and in the aggregate. Note that the decentralized nature of pure P2P systems means that these properties are emergent properties, determined by entirely local decisions made by individual resources, based only on local information: we are dealing with a self-organized network of independent entities.

These considerations have motivated us to conduct a detailed study of the topology and protocols of a popular P2P system: Gnutella. In this study, we benefited from Gnutella’s large existing user base and open architecture, and, in effect, use the public Gnutella network as a large-scale, if uncontrolled, testbed. We proceeded as follows. First, we captured the network topology, its generated traffic, and dynamic behavior. Then, we used this raw data to perform a macroscopic analysis of the network, to evaluate costs and benefits of the P2P approach, and to investigate possible improvements that would allow better scaling and increased reliability.

Our measurements and analysis of the Gnutella network are driven by two primary questions. The first concerns its connectivity structure. Recent research [1,8,7] shows that networks as diverse as natural networks formed by molecules in a cell, networks of people in a social group, or the Internet, organize themselves so that most nodes have few links while a tiny number of nodes, called hubs, have a large number of links.
[14] finds that networks following this organizational pattern (power-law networks) display an unexpected degree of robustness: the ability of their nodes to communicate is unaffected even by extremely high failure rates. However, error tolerance comes at a high price: these networks are vulnerable to attacks, i.e., to the selection and removal of a few nodes that provide most of the network’s connectivity. We show here that, although Gnutella is not a pure power-law network, it preserves good fault tolerance characteristics while being less dependent than a pure power-law network on highly connected nodes that are easy to single out (and attack).

The second question concerns how well (if at all) the Gnutella virtual network topology maps to the physical Internet infrastructure. There are multiple reasons to analyze this issue. First, it is a question of crucial importance for Internet Service Providers (ISP): if the virtual topology does not follow the physical infrastructure, then the additional stress on the infrastructure and, consequently, the costs for ISPs, are immense. This point has been raised on various occasions [9,12] but, as far as we know, we are the first to provide a quantitative evaluation of P2P application and Internet topology (mis)matching. Second, the scalability of any P2P application is ultimately determined by its efficient use of underlying resources.

We are not the first to analyze the Gnutella network. In particular, the Distributed Search Solutions (DSS) group [15] has published results of their Gnutella surveys [4,5], and others have used their data to analyze Gnutella users’ behavior [2] and to analyze search protocols for power-law networks [6]. However, our network crawling and analysis technology (developed independently of this work) goes significantly further in terms of scale (both spatial and temporal) and sophistication. While DSS presents only raw facts about the network, we analyze the generated network traffic, find patterns in network organization, and investigate its efficiency in using the underlying network infrastructure.

The rest of the paper is structured as follows: the next section succinctly describes the Gnutella protocol and application. Section 3 introduces the crawler we developed to discover Gnutella’s virtual network topology. In Section 4 we analyze the network and answer the questions introduced in the previous paragraphs. We conclude in Section 5.

2. Gnutella Protocol: Design Goals and Description

The Gnutella protocol [3] is an open, decentralized group membership and search protocol, mainly used for file sharing. The term Gnutella also designates the virtual network of Internet accessible hosts running Gnutella-speaking applications (this is the “Gnutella network” we measure) and a number of smaller, and often private, disconnected networks.

Like most P2P file sharing applications, the Gnutella protocol was designed to meet the following goals:

o Ability to operate in a dynamic environment. P2P applications operate in dynamic environments, where hosts may join or leave the network frequently. They must achieve flexibility in order to keep operating transparently despite a constantly changing set of resources.
o Performance and Scalability. The P2P paradigm shows its full potential only on large-scale deployments where the limits of the traditional client/server paradigm become obvious. Moreover, scalability is important as P2P applications exhibit what economists call the “network effect” [10]: the value of a network to an individual user increases with the total number of users participating in the network. Ideally, when increasing the number of nodes, aggregate storage space and file availability should grow linearly, response time should remain constant, while search throughput should remain high or grow.
o Reliability. External attacks should not cause significant data or performance loss.
o Anonymity. Anonymity is valued as a means to protect the privacy of people seeking or providing information that may not be popular.

Gnutella nodes, called servents by developers, perform tasks normally associated with both SERVers and cliENTS. They provide client-side interfaces through which users can issue queries and view search results, accept queries from other servents, check for matches against their local data set, and respond with corresponding results. These nodes are also responsible for managing the background traffic that spreads the information used to maintain network integrity.

In order to join the system a new node/servent initially connects to one of several known hosts that are almost always available (e.g., gnutellahosts.com). Once attached to the network (having one or more open connections with nodes already in the network), nodes send messages to interact with each other. Messages can be broadcasted (i.e., sent to all nodes with which the sender has open TCP connections) or simply back-propagated (i.e., sent on a specific connection on the reverse of the path taken by an initial, broadcasted, message). Several features of the protocol facilitate this broadcast/back-propagation mechanism. First, each message has a randomly generated identifier. Second, each node keeps a short memory of the recently routed messages, used to prevent re-broadcasting and implement back-propagation. Third, messages are flagged with time-to-live (TTL) and “hops passed” fields.

The messages allowed in the network are:

o Group Membership (PING and PONG Messages). A node joining the network initiates a broadcasted PING message to announce its presence. When a node receives a PING message it forwards it to its neighbors and initiates a back-propagated PONG message. The PONG message contains information about the node such as its IP address and the number and size of shared files.
o Search (QUERY and QUERY RESPONSE Messages). QUERY messages contain a user specified search string, which each receiving node matches against locally stored file names. QUERY messages are broadcasted. QUERY RESPONSES are back-propagated replies to QUERY messages and include information necessary to download a file.
o File Transfer (GET and PUSH Messages). File downloads are done directly between two peers using GET/PUSH messages.

To summarize: to become a member of the network, a servent (node) has to open one or many connections with nodes that are already in the network. In the dynamic environment where Gnutella operates, nodes often join and leave and network connections are unreliable. To cope with this environment, after joining the network, a node periodically PINGs its neighbors to discover other participating nodes. Using this information, a disconnected node can always reconnect to the network. Nodes decide where to connect in the network based only on local information, thus forming a dynamic, self-organizing network of independent entities. This virtual, application level network has Gnutella servents as its nodes and open TCP connections as its links. In the following sections we present our solution to discover this network topology and analyze its characteristics.

3. Data Collection: The Crawler

We have developed a crawler that joins the network as a servent and uses the membership protocol (the PING-PONG mechanism) to collect topology information. In this section we briefly describe the crawler and discuss other issues related to data collection.

The crawler starts with a list of nodes, initiates a TCP connection to each node in the list, sends a generic join-in message (PING), and discovers the neighbors of the node it contacted based on the replies it gets back (PONG messages). Newly discovered neighbors are added to the list. For each discovered node the crawler stores its IP address, port, the number of files and the total space shared. We started with a short, publicly available list of initial nodes, but in time we have incrementally built our own list with more than 400,000 nodes that have been active at one time or another.

We first developed a sequential version of the crawler.
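The discovery loop just described can be illustrated with a minimal sketch. It is not the crawler used in this study: the function `ping` is a hypothetical stand-in for "open a TCP connection, send a PING, and collect PONG replies until the listening timeout expires", and the toy network at the end is invented purely for illustration.

```python
from collections import deque


def crawl(seed_nodes, ping):
    """Breadth-first discovery of an overlay topology.

    `ping(node)` is a hypothetical callback: it returns the list of
    neighbors reported for `node`, or None if the node is unreachable.
    """
    to_visit = deque(seed_nodes)
    seen = set(seed_nodes)
    reachable = set()
    edges = set()
    while to_visit:
        node = to_visit.popleft()
        neighbors = ping(node)
        if neighbors is None:
            continue  # nodes we cannot connect to are left out of the graph
        reachable.add(node)
        for peer in neighbors:
            edges.add(frozenset((node, peer)))
            if peer not in seen:
                seen.add(peer)  # newly discovered neighbor joins the list
                to_visit.append(peer)
    # Keep only links whose two endpoints were both contacted successfully.
    edges = {edge for edge in edges if edge <= reachable}
    return reachable, edges


# Toy network (invented): "d" is advertised by "b" but cannot be contacted.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": None}
nodes, links = crawl(["a"], toy.get)
```

Because each node is contacted one after another and every contact may wait for a full timeout, the running time of such a loop grows with the number of nodes visited.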
Using empirically determined optimal values for connection establishment timeout as well as for connection listening timeout (the time interval the crawler waits to receive PONGs after it has sent a PING), a sequential crawl of the network proved slow: about 50 hours even for a small network (4000 nodes). This slow search speed has two disadvantages: not only is it not scalable, but because of the dynamic network behavior, the result of our crawl is far from a network topology snapshot.

In order to reduce the crawling time, we next developed a distributed crawling strategy. Our distributed crawler has a client/server architecture: the server is responsible for managing the list of nodes to be contacted, assembling the final graph, and assigning work to clients. Clients receive a small list of initial points and discover the network topology around these points. Although we could use a large number of clients (easily in the order of hundreds), we decided to use only up to 50 clients in order to reduce the invasiveness of our search. These techniques have allowed us to reduce the crawling time to a couple of hours even for a large list of starting points and a discovered topology graph with more than 30,000 active nodes.

Note that in the following we use a conservative definition of network membership: we exclude the nodes that, although reported as part of the network, our crawler could not connect to. This situation might occur when the local servent is configured to allow only a limited number of TCP connections or when the node leaves the network before the crawler contacts it.
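The crawler's PING/PONG exchange relies on the broadcast and back-propagation rules of Section 2. As a rough illustration of those rules, the sketch below simulates one broadcast on an in-memory graph. It makes several assumptions that are ours rather than the protocol specification's: the TTL is decremented once per hop, a node never echoes a message back on the link it arrived on, and the per-node memory of routed messages is collapsed into shared `seen` and `came_from` tables.

```python
import uuid
from collections import deque


def flood(graph, origin, ttl=7):
    """Simulate one broadcast; return (msg_id, hops, came_from, sent)."""
    msg_id = uuid.uuid4().hex   # randomly generated message identifier
    seen = {origin}             # nodes that have already routed msg_id
    came_from = {origin: None}  # reverse path, used for back-propagation
    hops = {origin: 0}          # "hops passed" when the message arrived
    sent = 0                    # messages put on links, duplicates included
    queue = deque([(origin, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:
            continue            # TTL exhausted: do not forward further
        for peer in graph[node]:
            if peer == came_from[node]:
                continue        # assumption: no echo on the incoming link
            sent += 1
            if peer in seen:
                continue        # duplicate is dropped, not re-broadcast
            seen.add(peer)
            came_from[peer] = node
            hops[peer] = hops[node] + 1
            queue.append((peer, remaining - 1))
    return msg_id, hops, came_from, sent


def back_propagate(came_from, responder):
    """Path a reply (e.g. a PONG) follows back to the originator."""
    path = [responder]
    while came_from[path[-1]] is not None:
        path.append(came_from[path[-1]])
    return path


# Invented four-node ring: a-b, b-c, c-d, d-a.
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
_, hops, came_from, sent = flood(ring, "a")
reply_path = back_propagate(came_from, "c")
```

Even in this tiny ring the broadcast puts more messages on links than there are nodes to reach, because duplicates are only discarded after they have been transmitted.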


4. Gnutella Network Analysis

We start by briefly presenting Gnutella network growth trends and dynamic behavior. Our data shows that (Section 4.1), although over the past 6 months Gnutella overhead traffic has been decreasing, currently the generated traffic volume represents a significant percentage of total Internet traffic and is a major obstacle to further growth. We continue with a macroscopic analysis of the network: we study first connectivity patterns (Section 4.2) and then the mapping of the Gnutella topology to the underlying networking infrastructure (Section 4.3).

Figure 1 presents the growth of the Gnutella network in the past 6 months. We ran our crawler during November 2000, February/March 2001, and May 2001. While in November 2000 the largest connected component of the network we found had 2,063 hosts, this grew to 14,949 hosts in March and 48,195 hosts in May 2001. Although Gnutella’s failure to scale has been predicted time and again, the number of nodes in the largest network component grew about 25 times (admittedly from a low base) in the past 6 months.

We identify three factors that allowed the network this exceptional growth in response to user pressure. First, as we argue in Section 4.1, careful engineering led to significant overhead traffic decreases over the last six months. Second, the network connectivity of Gnutella participating machines improved significantly. Our rough estimate (based on tracing DNS host names) is that the number of DSL or cable-connected machines grew twice as fast as the overall network size. While in November 2000 about 24% of the nodes were DSL or cable modem enabled, this number grew to about 41% six months later. Finally, the efforts made to better use available networking resources by sending nodes with low available bandwidth to the edges of the network eventually paid off.

It is worth mentioning that the number of connected components is relatively small: the largest connected component always includes more than 95% of the active nodes discovered, while the second biggest connected component usually has less than 10 nodes.

[Figure 1 plot: "Gnutella Network Growth"; x-axis: crawl dates from 11/20/00 to 05/29/01; y-axis: number of nodes in the largest network component ('000).]

Figure 1: Gnutella network growth. The plot presents the number of nodes in the largest connected component in the network. Data collected during Nov. 2000, Feb./March 2001 and May 2001. We found a significantly larger network around Memorial Day (24-28 May) and Thanksgiving, when apparently more people are online.

[Figure 2 plot: "Message Frequency"; x-axis: minute; y-axis: messages per second; series: Ping, Push, Query, Other.]

Figure 2: Generated traffic (messages/sec) in Nov. 2000 classified by message type over a 376 minute period. Note that overhead traffic (PING messages, that serve only to maintain network connectivity) formed more than 50% of the traffic. The only ‘true’ user traffic is QUERY messages. Traffic became more efficient by May 2001.

Using records of successive crawls, we investigate the dynamic graph structure over time. We discover that about 40% of the nodes leave the network in less than 4 hours, while only 25% of the nodes are alive for more than 24 hours. Given this dynamic behavior, it is important to find the appropriate tradeoff between discovery time and invasiveness of our crawler. Increasing the number of parallel crawling tasks reduces discovery time but increases the burden on the application. Obviously, the Gnutella map our crawler produces is not an exact ‘snapshot’ of the network. However, we argue that the network graph we obtain is close to a snapshot in a statistical sense: all properties of the network (size, diameter, average connectivity, and connectivity distribution) are preserved.

4.1 Estimate of Gnutella Generated Traffic

We used a modified version of the crawler to eavesdrop on the traffic generated by the network. In Figure 2 we classify, according to message type, the traffic that goes across one randomly chosen link in November 2000. After adjusting for message size, we find that, on average, only 36% of the total traffic (in bytes) is user-generated traffic (QUERY messages). The rest is overhead traffic: 55% is used to maintain group membership (PING and PONG messages) while 9% contains either non-standard messages (1%) or PUSH messages broadcast by servents that are not compliant with the latest version of the protocol. Apparently, by June 2001, these engineering problems were solved with the arrival of newer Gnutella implementations: generated traffic contains 92% QUERY messages, 8% PING messages and insignificant levels of other message types.

Given the small diameter of the network (any two nodes are generally less than 7 hops away, see Figure 3), the message time-to-live (TTL=7) preponderantly used, and the flooding-based routing algorithm employed, most links support similar traffic. We verified this theoretical conclusion by measuring the traffic at multiple, randomly chosen, nodes. As a result, the total Gnutella generated traffic is proportional to the number of connections in the network. Based on our measurements we estimate the total traffic (excluding file transfers) for a large Gnutella network as 1 Gbps (170,000 connections for a 50,000-node Gnutella network times 6 Kbps per connection) or about 330 TB/month. To put this traffic volume into perspective we note that it amounts to about 1.7% of total traffic in US Internet backbones in December 2000 (as reported in [16]). We infer that the volume of generated traffic is an important obstacle for further growth and that efficient use of underlying network infrastructure is crucial for better scaling and wider deployment.

[Figure 3 plot: "Search characteristics"; x-axis: node-to-node shortest path (hops), 1 to 12; y-axis: percent of node pairs (%).]

Figure 3: Distribution of node-to-node shortest paths. Each line represents one network measurement. Note that, although the largest network diameter (the longest node-to-node path) is 12, more than 95% of node pairs are at most 7 hops away.

[Figure 4 plot: "Graph connectivity"; x-axis: number of nodes, 0 to 50,000; y-axis: number of links ('000), 0 to 200.]

Figure 4: Average node connectivity. Each point represents one Gnutella network. Note that, as the network grows, the average number of connections per node remains constant (average node connectivity is 3.4 connections per node).

One interesting feature of the network is that, over a seven-month period, with the network scaling up almost two orders of magnitude, the average number of connections per node remained constant (Figure 4). Assuming this invariant holds, it is possible to estimate the generated traffic for larger networks and find scalability limits based on available bandwidth.
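The arithmetic behind this estimate can be reproduced in a few lines. The sketch below takes the two measured values from this section (6 Kbps per connection, 3.4 connections per node) and, as assumptions of ours, uses decimal units (1 Kbps = 1000 bit/s, 1 TB = 10^12 bytes) and a 30-day month; the function name is illustrative only.

```python
KBPS_PER_CONNECTION = 6         # average traffic per connection (Section 4.1)
AVG_CONNECTIONS_PER_NODE = 3.4  # average node connectivity (Figure 4)


def estimate_traffic(nodes, days_per_month=30):
    """Return (connections, Gbps, TB per month) for a network of `nodes`.

    Assumes traffic is proportional to the number of connections and that
    the average connectivity stays constant as the network grows.
    """
    connections = nodes * AVG_CONNECTIONS_PER_NODE
    bits_per_second = connections * KBPS_PER_CONNECTION * 1000
    bytes_per_month = bits_per_second / 8 * 86400 * days_per_month
    return connections, bits_per_second / 1e9, bytes_per_month / 1e12


connections, gbps, tb_per_month = estimate_traffic(50_000)
```

For 50,000 nodes this gives 170,000 connections, about 1.02 Gbps, and about 330 TB per month, matching the figures quoted above. Under the same assumptions the estimate scales linearly with the number of nodes.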

4.2 Connectivity and Reliability in Gnutella Network. Power-law Distributions.

When analyzing global connectivity and reliability patterns in the Gnutella network, it is important to keep in mind the self-organized network behavior: users decide only the maximum number of connections a node should support, while nodes decide whom to connect to or when to drop/add a connection based only on local information.

Recent research [1,7,8,13] shows that many natural networks such as molecules in a cell, species in an ecosystem, and people in a social group organize themselves as so called power-law networks. In these networks most nodes have few links and a tiny number of hubs have a large number of links. More specifically, in a power-law network the fraction of nodes with $L$ links is proportional to $L^{-k}$, where $k$ is a network dependent constant.

This structure helps explain why networks ranging from metabolisms to ecosystems to the Internet are generally highly stable and resilient, yet prone to occasional catastrophic collapse [14]. Since most nodes (molecules, Internet routers, Gnutella servents) are sparsely connected, little depends on them: a large fraction can be taken away and the network stays connected. But, if just a few highly connected nodes are eliminated, the whole system could crash. One implication is that these networks are extremely robust when facing random node failures, but vulnerable to well-planned attacks.

Given the diversity of networks that exhibit power-law structure and their properties we were interested to determine whether Gnutella falls into the same category. Figure 5 presents the connectivity distribution in Nov. 2000. Although data are noisy (due to the small size of the networks), we can easily recognize the signature of a power-law distribution: the connectivity distribution appears as a line on a log-log plot. [6,4] confirm that early Gnutella networks were power-law. Later measurements (Figure 6) however, show that more recent networks move away from this organization: there are too few nodes with low connectivity to form a pure power-law network. In these networks the power-law distribution is preserved for nodes with more than 10 links while nodes with fewer links follow an almost constant distribution.

[Figure 5 plot: "Node connectivity distribution"; x-axis: number of links (log scale), 1 to 100; y-axis: number of nodes (log scale), 1 to 10000.]

Figure 5: Connectivity distribution during November 2000. Each series of points represents one Gnutella network topology we discovered at different times during that month. Note the log scale on both axes. Gnutella nodes organized themselves into a power-law network.

[Figure 6 plot: "Node connectivity distribution"; x-axis: number of links (log scale), 1 to 100; y-axis: number of nodes (log scale), 1 to 10000.]

Figure 6: Connectivity distributions during March 2001. Each series of points represents one Gnutella network topology discovered during March 2001. Note the log scale on both axes. Networks crawled during May/June 2001 show a similar pattern.

An interesting issue is the impact of this new, multi-modal distribution on network reliability. We believe that the more uniform connectivity distribution preserves the network capability to deal with random node failures while reducing the network dependence on highly connected, easy to single out (and attack) nodes.
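The "line on a log-log plot" signature used in this section can be checked numerically. The following sketch estimates the exponent $k$ with an ordinary least-squares fit on log-transformed data. This is a simple illustrative estimator chosen by us, not necessarily the procedure behind Figures 5 and 6, and the sample counts are synthetic.

```python
import math


def fit_power_law(degree_counts, min_links=1):
    """Fit nodes(L) ~ c * L**(-k) by least squares on a log-log scale.

    `degree_counts` maps a number of links L to the number of nodes that
    have L links. Returns (k, log10_c). Only degrees >= min_links are used.
    """
    points = [
        (math.log10(links), math.log10(count))
        for links, count in degree_counts.items()
        if links >= min_links and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


# Synthetic example: an exact power law with k = 2 and c = 1000.
synthetic = {links: 1000 * links ** -2 for links in (1, 2, 4, 5, 10)}
k, log_c = fit_power_law(synthetic)
```

The `min_links` parameter allows the fit to be restricted to the tail of the distribution, for instance to nodes with at least 10 links, where the later measurements still follow a power law.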

4.3 Internet Infrastructure and Gnutella Network

Peer-to-peer computing brings an important change to the way we use the Internet: it enables computers sitting at the edges of the network to act as both clients and servers. As a result, P2P applications radically change the amount of bandwidth the average Internet user consumes. Most Internet Service Providers (ISP) use flat rates to bill their clients. If P2P applications become ubiquitous, they could break the existing business models of many ISPs and force them to change their pricing scheme [9].

Given the considerable traffic volume a P2P application generates (see our Gnutella estimates in the previous section) it is crucial that it employs available networking resources well. The scalability of a P2P application is ultimately determined by how efficiently it uses the underlying network. Gnutella’s store-and-forward architecture makes the virtual network topology extremely important. The larger the mismatch between the network infrastructure and the P2P application’s virtual topology, the bigger the “stress” on the infrastructure. In the following we investigate whether the self-organizing Gnutella network shapes its topology to map well on the physical infrastructure.

Let us first present an example to highlight the importance of a “fitting” virtual topology. In Figure 7, eight hosts participate in a Gnutella-like network. We use black, solid lines to present the underlying network infrastructure and blue, dotted lines for the application virtual topology. In the left picture, the virtual topology closely matches the infrastructure. In the right picture, the virtual topology, although functionally similar, does not match the infrastructure. In this case the traffic the link D-E has to support is six times higher.

[Figure 7 diagram: two panels, each showing eight hosts labeled A through H.]

Figure 7: Gnutella’s virtual network topology (blue, dotted arrows) mapping on the underlying network infrastructure (black). Left picture: perfect mapping. Right picture: inefficient mapping; link D-E needs to support six times higher traffic.

Unfortunately, it is prohibitively expensive to map Gnutella on the detailed Internet topology (firstly, due to the inherent difficulty of extracting Internet topology and secondly, due to the computational scale of the problem). Instead, we proceed with two high level experiments that highlight the mismatch between the topologies of the two networks.

The Internet is a collection of Autonomous Systems (AS) which are connected by routers. ASs, in turn, are collections of local area networks under a single technical administration. From an ISP point of view traffic crossing AS borders is more expensive than local traffic. We found that only 2-5% of Gnutella connections link nodes located within the same AS, although more than 40% of these nodes are located in the top ten ASs. This indicates that, although unforced by node distribution, most Gnutella generated traffic crosses AS borders, being thus more expensive to handle.

In the second experiment we assume that the hierarchical organization of domain names mirrors that of the Internet infrastructure. For example, it is likely that communication costs between two hosts in the “uchicago.edu” domain are significantly smaller than between “uchicago.edu” and “sdsc.edu.” The underlying assumption here is that domain names express some sort of organizational hierarchy and that organizations tend to build networks that exploit locality within that hierarchy.

In order to study how well the Gnutella virtual topology maps onto the Internet partitioning as defined by domain names, we divide the Gnutella virtual topology graph into clusters, i.e., subgraphs with high interior connectivity. Given the flooding-like routing algorithm used by Gnutella, it is within these clusters that most load is generated. We are therefore interested to see how well these clusters map on the partitioning defined by the domain naming scheme.

We use a simple clustering algorithm based on the connectivity distribution described earlier: we define as clusters subgraphs formed by one hub with its adjacent nodes. If two clusters have more than 25% nodes in common, we merge them. After the clustering is done, we (1) assign nodes that are included in more than one cluster only to the largest cluster and (2) form a last cluster with nodes that are not included in any other cluster.

We define the entropy [24] of a set $C$, containing $|C|$ hosts, each labeled with one of the $n$ distinct domain names, as:

$$E(C) = \sum_{i=1}^{n} \left( -p_i \log(p_i) - (1 - p_i) \log(1 - p_i) \right),$$

where $p_i$ is the probability of randomly picking a host with domain name $i$.

We then define the entropy of a clustering of a graph of size $|C|$, clustered in $k$ clusters $C_1, C_2, \ldots, C_k$ of sizes $|C_1|, |C_2|, \ldots, |C_k|$, with $|C| = |C_1| + |C_2| + \ldots + |C_k|$, as:

$$E(C_1, C_2, \ldots, C_k) = \sum_{i=1}^{k} \frac{|C_i|}{|C_1| + |C_2| + \ldots + |C_k|} \, E(C_i)$$

We base our reasoning on the property that $E(C) \geq E(C_1, C_2, \ldots, C_k)$ no matter how the clusters $C_1, C_2, \ldots, C_k$ are chosen. If the clustering matches the domain partitioning, then we should find that $E(C) \gg E(C_1, C_2, \ldots, C_k)$. Conversely, if the clustering $C_1, C_2, \ldots, C_k$ has the same level of randomness as in the initial set $C$, then the entropy should remain largely unchanged. Essentially, the entropy function is used here to measure how well the two partitions applied on the set of nodes match: the first partition uses the information contained in domain names, while the second uses the clustering heuristic. Note that a large class of data mining and machine learning algorithms based on information gains (ID3, C4.5, etc. [25]) use a similar argument to build their decision trees.

We performed this analysis on 10 topology graphs collected during February/March 2001. We detected no significant decrease in entropy after performing the clustering (all decreases were within less than 8% of the initial entropy value). Consequently, we conclude that Gnutella nodes cluster in a way that is completely independent from the Internet structure. Assuming that the Internet domain name structure roughly matches the underlying topology (the cost of sending data within a domain is smaller than that of sending data across domains), we conclude that the self-organizing Gnutella network does not efficiently use the underlying physical infrastructure.
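To make the entropy comparison concrete, the sketch below computes $E(C)$ and $E(C_1, \ldots, C_k)$ for lists of domain labels. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function names and the eight-host example are ours and purely illustrative.

```python
import math
from collections import Counter


def entropy(domains):
    """E(C) for a list holding one domain label per host."""
    total = len(domains)
    result = 0.0
    for count in Counter(domains).values():
        p = count / total
        # Sum -p*log(p) - (1-p)*log(1-p), skipping terms where q == 0.
        for q in (p, 1 - p):
            if q > 0:
                result -= q * math.log(q)
    return result


def clustering_entropy(clusters):
    """Size-weighted average of E(C_i) over the non-empty clusters."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(
        len(cluster) / total * entropy(cluster)
        for cluster in clusters
        if cluster
    )


# Invented example: four hosts in each of two domains.
u, s = "uchicago.edu", "sdsc.edu"
hosts = [u] * 4 + [s] * 4
matched = [[u] * 4, [s] * 4]        # clustering follows the domains
mixed = [[u, u, s, s], [u, u, s, s]]  # clustering ignores the domains

initial = entropy(hosts)
after_matched = clustering_entropy(matched)
after_mixed = clustering_entropy(mixed)
```

In this example the clustering that follows the domain names drives the entropy to zero, while the clustering that ignores them leaves it unchanged. The second case is the pattern the measurements above correspond to, where the decrease stayed below 8%.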


5. Summary and Potential Improvements

Sociological circumstances that have fostered the success of the Gnutella network might change and the network might fade. P2P, however, “is one of those rare ideas that is simply too good to go away” [18]. Despite recent excitement generated by this paradigm and the surprisingly rapid deployment of some P2P applications, there are few quantitative evaluations of P2P systems behavior. The open architecture, achieved scale, and self-organizing structure of the Gnutella network make it an interesting P2P architecture to study. Our measurement and analysis techniques can be used for most P2P systems to enhance general understanding of design tradeoffs.

Our analysis shows that Gnutella node connectivity follows a multi-modal distribution, combining a power law and a quasi-constant distribution. This property keeps the network as reliable as a pure power-law network when assuming random node failures, and makes it harder to attack by a malicious adversary. Gnutella takes few precautions to ward off potential attacks. For example, the network topology information that we obtain here is easy to obtain and would permit highly efficient denial-of-service attacks. Some form of security mechanism that would prevent an intruder from gathering topology information appears essential for the long-term survival of the network.

We have estimated that, as of June 2001, the network generates about 330 TB/month only to remain connected and broadcast user queries. This traffic volume represents a significant fraction of the total Internet traffic and makes the future growth of the Gnutella network particularly dependent on efficient network usage. We have also documented the topology mismatch between the self-organized, application level Gnutella network and the underlying physical networking infrastructure. We believe this has major implications for the scalability of the Internet (or, equivalently, for the business models of ISPs). This problem must be solved if Gnutella or similarly built systems are to reach larger deployment.
We see two directions for improvement. First, we observe that the application-level topology determines the volume of generated traffic, the search success rate, and the application reliability. We imagine an agent that constantly monitors the network and intervenes by asking servents to drop or add links as necessary to keep the network topology efficient. Additionally, agents (or nodes) could learn about the underlying physical network and build the virtual application topology accordingly. Note that implementing this idea requires some minimal protocol modifications.

A second, orthogonal, direction is to replace flooding with a smarter (less expensive in terms of communication costs) routing and group communication mechanism. Recent research projects such as Chord [19], CAN [21], SDS [23] or OceanStore [22] focus on building Internet scale overlay networks and offer a vast array of choices future Gnutella implementations could build on.

6. Acknowledgements

I am grateful to Ian Foster, Adriana Iamnitchi, Larry Lidz, Conor McGrath, Dustin Mitchell, and Alain Roy for their insightful comments and generous support. This work started as a joint class project with Yugo Nakai and Xuehai Zhang. Rongguan Jin, Knox McMurry, and Yan Wang participated in refining this work. This research was supported by the National Science Foundation under contract ITR-0086044.

7. References

[1] M. Faloutsos, P. Faloutsos, C. Faloutsos, On Power-Law Relationships of the Internet Topology, SIGCOMM, 1999.
[2] E. Adar, B. Huberman, Free riding on Gnutella, First Monday, Vol. 5-10, Oct. 2, 2000.
[3] The Gnutella protocol specification v4.0 - http://dss.clip2.com/GnutellaProtocol04.pdf
[4] DSS Group, Gnutella: To the Bandwidth Barrier and Beyond, http://dss.clip2.com, Nov. 6, 2000.
[5] DSS Group, Bandwidth Barriers to Gnutella Network Scalability, http://dss.clip2.com, Sept. 8, 2000.
[6] Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, B. Huberman, Search in Power-Law Networks.
[7] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener, Graph structure in the web, 8th International WWW Conference, May 15-19, Amsterdam.
[8] A. Barabasi and R. Albert, Emergence of scaling in random networks, Science, 286(509), 1999.
[9] Todd Spangler, The Hidden Cost Of P2P, Interactive Week, February 26, 2001.
[10] M. Katz, C. Shapiro, Systems Competition and Network Effects, Journal of Economic Perspectives, vol. 8, no. 2, pp. 93-115, 1994.
[11] T. Cover, J. Thomas, Elements of Information Theory, Wiley, 1991.
[12] B. St. Arnaud, Scaling Issues on Internet Networks, Technical Report, CANARIE Inc.
[13] A. Barabási, R. Albert, H. Jeong, G. Bianconi, Power-law distribution of the World Wide Web, Science 287, (2000).
[14] R. Albert, H. Jeong, A. Barabási, Attack and tolerance in complex networks, Nature 406, 378 (2000).
[15] http://dss.clip2.com
[16] K. Coffman, A. Odlyzko, Internet growth: Is there a "Moore's Law" for data traffic?, Handbook of Massive Data Sets, J. Abello & all editors, Kluwer, 2001.
[17] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.
[18] The Economist, Invention is the Easy Bit, The Economist Technology Quarterly 6/23/01.
[19] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, SIGCOMM 2001.
[20] S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker, A Scalable Content-Addressable Network. Submitted for publication, 2000.
[21] Ben Zhao, John Kubiatowicz, Anthony Joseph, Tapestry: An infrastructure for wide-area fault-tolerant location and routing.
[22] Zachary Ives, Alon Levy, Jayant Madhavan, Rachel Pottinger, Stefan Saroiu, Igor Tatarinov, Shiori Betzler, Qiong Chen, Ewa Jaslikowska, Jing Su, W.T. Theodora Yeung, Self-Organizing Data Sharing Communities with SAGRES, SIGMOD 2000, Dallas, TX.
[23] Todd D. Hodes, Steven E. Czerwinski, Ben Y. Zhao, Anthony D. Joseph, Randy H. Katz, An Architecture for Secure Wide-Area Service Discovery, ACM Baltzer Wireless Networks: selected papers from MobiCom 1999.
[24] T. Cover, J. Thomas, Elements of Information Theory, Wiley, 1991.
[25] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000.