COMPARING THE PERFORMANCE OF MPI ON THE CRAY RESEARCH T3E AND IBM SP-2 ¹

by Glenn R. Luecke and James J. Coyle
[email protected]@iastate.edu
Iowa State University
Ames, Iowa 50011-2251, USA

Waqarul Haque
[email protected]
University of Northern British Columbia
Prince George, British Columbia, Canada V2N 4Z9

February 8, 1997

Abstract: This paper reports the performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Communication tests were performed for messages of sizes 8 bytes, 1 KB, 100 KB, and 10 MB with 2, 4, 8, 16, 32 and 64 processors. Both machines provided a very high level of concurrency for the nearest neighbor communication tests and moderate concurrency on the broadcast operations. On the tests used, the T3E significantly outperformed the SP-2 with most performance tests being at least three times faster than the SP-2.
INTRODUCTION
Message Passing Interface, MPI, [5, 8] is rapidly becoming the standard for writing scientific programs with explicit message passing, in preference to PVM [3] or the BLACS [2]. The communication network of a parallel computer plays a vitally important role in its overall performance, see [4, 6]. There are so many different ways that communication may occur when running scientific programs that it is not possible to test all of them. However, the MPI communication tests developed for this study have been designed to test those communication patterns that we feel are likely to occur in scientific programs. The purpose of this study is to evaluate the communication performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing.

¹ A follow-on study is planned that will evaluate the performance of MPI on the SGI ORIGIN 2000, IBM’s follow-on to the SP-2, and the Cray Research T3E-900.

DESCRIPTION OF THE PERFORMANCE TESTS AND RESULTS
All our communication tests have been written in Fortran with calls to MPI routines for the message passing. These tests were run with message sizes ranging from 8 bytes to 10 MB and with the number of processors ranging from 2 to 64. Some of these communication patterns take a very short amount of time to execute, so they are looped to obtain a wall-clock time of at least one second in order to obtain accurate timings. The time of the communication pattern is then obtained by dividing the total time by the number of iterations. All timings were done using a wall-clock timer. Tests were run at least ten times and the best performance numbers are reported.

Tests for the Cray T3E were run on a 64 processor T3E located at Cray Research’s corporate headquarters in Eagan, Minnesota. The peak theoretical performance of each processor is 600 mflops. The communication network has a bandwidth of 350 MB/second and a latency of 1.5 microseconds. The T3E communication network is a bi-directional 3-D torus. For more information on the T3E see http://www.cray.com. The operating system used was UNICOS/mk version 1.3.1 and the Fortran compiler used was cf90 version 3.0 with the O2 optimization level.
Tests for the IBM SP-2 were run at the Maui High Performance Computing Center. The peak theoretical performance of each of these processors is 267 mflops. The communication network has a peak bi-directional bandwidth of 40 MB/second with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Wide (thin) nodes have a 256 (64) KB data cache, 256 (64) bit path from memory to the data cache, and 256 (128) bit path from the data cache to the processor bus. The IBM SP-2 uses a bi-directional multistage interconnection network and may be configured with thin and/or wide nodes. Performance tests were done separately for thin nodes and for wide nodes. For more information about the SP-2, see http://www.mhpcc.edu/training/workshop/html/ibmhwsw/ibmhwsw.html. The IBM SP-2 did not have 64 wide nodes available for our tests, so no performance results could be obtained with this many wide nodes. AIX version 4.1 was used and the Fortran compiler used was xlf version 3.2.4 with the O2 optimization level.

Communication Test 1 (Table 1):
The first communication test sends a message from one processor to another; the second processor then sends a single-integer message back to the sender indicating that the message was received. This test is different from the COMMS1 test described in section 3.3.1 of [4] in that the COMMS1 test sends a message to a processor and then that same message is sent back to the first processor. Table 1 gives performance rates in KB per second. Notice that the T3E consistently outperforms the wide nodes by more than a factor of three. Thin node performance was 5% to 14% lower than wide node performance.
   Message    IBM SP-2    IBM SP-2    Cray T3E     Ratio
   Size in   thin node   wide node              T3E/wide
     Bytes
         8         106         114         374       3.3
     1,000       5,625       6,519      20,374       3.1
   100,000      27,619      31,349     111,742       3.6
10,000,000      32,304      33,878     111,551       3.3

Table 1: Processor to processor communication rate in KB/second.
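
The timing procedure described for all tests (loop the pattern until at least one second of wall-clock time has elapsed, divide by the number of iterations, and report the best of at least ten runs) can be illustrated with a minimal Python sketch. This is not the Fortran/MPI code used for the measurements. The names time_pattern and rate_kb_per_second are hypothetical, and the conversion assumes 1 KB = 1,000 bytes, which is an assumption based on the message sizes listed rather than something stated for the rates.

```python
import time


def time_pattern(pattern, min_seconds=1.0, trials=10):
    """Return the best (smallest) per-iteration wall-clock time over several trials."""
    best = float("inf")
    for _ in range(trials):
        iterations = 0
        start = time.perf_counter()
        # Repeat the pattern until at least min_seconds of wall-clock time has elapsed.
        while True:
            pattern()
            iterations += 1
            elapsed = time.perf_counter() - start
            if elapsed >= min_seconds:
                break
        # Time of one execution of the pattern = total time / number of iterations.
        best = min(best, elapsed / iterations)
    return best


def rate_kb_per_second(message_bytes, seconds):
    """Convert a message size and a time to KB/second (assumes 1 KB = 1,000 bytes)."""
    return message_bytes / 1000.0 / seconds


# Usage with a stand-in workload and shortened parameters so the demo runs quickly.
per_call = time_pattern(lambda: sum(range(100)), min_seconds=0.05, trials=3)
print(per_call > 0)
```

In a real MPI program, pattern would wrap the send and receive calls of the test; the synchronization needed around the timed region is omitted here.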
Communication Test 2 (Table 2):
We next measure communication rates for sending a message from one processor to all of the other processors. This test is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define a Normalized Broadcast Rate as

    (total data rate)/(N-1)

where N is the total number of processors involved in the communication and where the total data rate is the total amount of data sent on the communication network per unit time, measured in KB per second. Let R be the data rate when sending a message from one processor to another and let D be the total data rate for broadcasting the same message to the N-1 other processors. If the broadcast operation and communication network were able to concurrently transmit the messages, then D = R*(N-1) and thus the Normalized Broadcast Rate would remain constant as N varied for a given message size. Thus, the rate at which the Normalized Broadcast Rate decreases as N increases indicates how far the broadcast operation is from being ideal. Table 2 gives the Normalized Broadcast Rates for the T3E and for both wide and thin nodes on the SP-2. Notice that in all cases, instead of being constant for a given message size, the Normalized Broadcast Rate decreases significantly as the number of processors increases. Also notice that for all message sizes the T3E is roughly 3 to 5 times faster than wide nodes but this factor decreases as the number of processors used increases.
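
To make the definition concrete, the following Python sketch computes the Normalized Broadcast Rate and the fraction of the ideal rate that is retained at a given N. It is an illustration only, with hypothetical function names, and it uses the N = 2 entry of Table 2 as a stand-in for R, which is an assumption made for this example.

```python
def normalized_broadcast_rate(total_data_rate, n):
    """(total data rate) / (N - 1), in the same units as total_data_rate."""
    if n < 2:
        raise ValueError("a broadcast needs at least two processors")
    return total_data_rate / (n - 1)


def fraction_of_ideal(normalized_rate_at_n, rate_r):
    """Ideal concurrency gives D = R * (N - 1), i.e. a normalized rate equal to R."""
    return normalized_rate_at_n / rate_r


# Table 2, Cray T3E, 100,000-byte messages (KB/second).
t3e_rate_n2 = 98729.0   # N = 2, used here as a stand-in for R (assumption)
t3e_rate_n64 = 17833.0  # N = 64

# Total data rate D implied by the N = 64 table entry.
total_rate_n64 = t3e_rate_n64 * (64 - 1)

print(normalized_broadcast_rate(total_rate_n64, 64))
print(round(fraction_of_ideal(t3e_rate_n64, t3e_rate_n2), 2))
```

A fraction of 1.0 would correspond to the ideal case in which all N-1 messages are transmitted concurrently.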
   Message   Number of    IBM SP-2    IBM SP-2    Cray T3E     Ratio
   Size in  Processors  thin nodes  wide nodes              T3E/wide
     bytes
         8           2          75          78         345       4.4
         8           4          52          55         195       3.5
         8           8          31          38         101       2.7
         8          16          20          24          47       2.0
         8          32          12          14          16       1.1
         8          64           5          na           4
     1,000           2       4,587       5,184      23,609       4.6
     1,000           4       2,711       3,374      13,485       4.0
     1,000           8       1,837       2,433       7,380       3.0
     1,000          16       1,366       1,717       3,759       2.2
     1,000          32         833       1,073       1,398       1.3
     1,000          64         378          na         412
   100,000           2      27,181      30,685      98,729       3.2
   100,000           4       9,755      10,826      51,874       4.8
   100,000           8       5,057       5,516      35,357       6.4
   100,000          16       4,650       8,268      26,632       3.2
   100,000          32       3,190       4,049      21,545       5.3
   100,000          64       1,943          na      17,833
10,000,000           2      32,298      33,614     100,336       3.0
10,000,000           4      10,970      11,332      53,220       4.7
10,000,000           8       5,507       5,680      36,821       6.5
10,000,000          16      10,638      16,472      27,321       1.7
10,000,000          32       9,662      14,523      22,371       1.5
10,000,000          64       8,147          na      18,466

Table 2: Normalized Broadcast Rates in KB/second.

To see that there actually is concurrency occurring in the broadcast operation, define the Log Normalized Broadcast Rate as

    (total data rate)/Log(N),

where N is the number of processors involved in the communication and Log(N) is the log base 2 of N. Thus, if binary tree parallelism were being utilized, the Log Normalized Broadcast Rate would be constant for a given message size as N varies. Table 2-b gives the Log Normalized Broadcast Rates and does in fact show that concurrency is being utilized in the broadcast operation for both machines.
   Message   Number of    IBM SP-2    IBM SP-2    Cray T3E     Ratio
   Size in  Processors  thin nodes  wide nodes              T3E/wide
     bytes
         8           2          75          78         345       4.4
         8           4          78          83         293       3.5
         8           8          72          89         236       2.7
         8          16          75          90         176       2.0
         8          32          74          87          99       1.1
         8          64          53          na          42
     1,000           2       4,587       5,184      23,609       4.6
     1,000           4       4,067       5,061      20,228       4.0
     1,000           8       4,286       5,677      17,220       3.0
     1,000          16       5,123       6,439      14,096       2.2
     1,000          32       5,165       6,653       8,668       1.3
     1,000          64       3,969          na       4,326
   100,000           2      27,181      30,685      98,729       3.2
   100,000           4      14,633      16,239      77,811       4.8
   100,000           8      11,800      12,871      82,500       6.4
   100,000          16      17,438      31,005      99,870       3.2
   100,000          32      19,778      25,104     133,579       5.3
   100,000          64      20,402          na     187,247
10,000,000           2      32,298      33,614     100,336       3.0
10,000,000           4      16,455      16,998      79,830       4.7
10,000,000           8      12,850      13,253      85,916       6.5
10,000,000          16      39,893      61,770     102,454       1.7
10,000,000          32      59,904      90,043     138,700       1.5
10,000,000          64      85,544          na     193,893

Table 2-b: Log Normalized Broadcast Rates in KB/second.
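
The relationship between Table 2 and Table 2-b can be checked with a short calculation: multiplying a Normalized Broadcast Rate by (N-1) recovers the total data rate, and dividing that by the log base 2 of N gives the log normalized rate. The Python sketch below illustrates this; the function name is hypothetical, and because the table entries are rounded, small differences can appear for other rows.

```python
import math


def log_normalized_rate(normalized_broadcast_rate, n):
    """Convert a Table 2 entry into the corresponding Table 2-b entry."""
    total_data_rate = normalized_broadcast_rate * (n - 1)  # undo the (N - 1) normalization
    return total_data_rate / math.log2(n)                  # normalize by log base 2 of N


# IBM SP-2 thin nodes, 8-byte messages, N = 4: Table 2-b lists 78.
print(log_normalized_rate(52, 4))

# Cray T3E, 10,000,000-byte messages, N = 64: Table 2-b lists 193,893.
print(round(log_normalized_rate(18466, 64)))
```

For N = 2 the two normalizations coincide, since N-1 and the log base 2 of N both equal 1, which is why the N = 2 rows of the two tables are identical.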
Communication Test 3 (Table 3):
Communication Test 3 measures the rates for broadcasting a message from a processor to all other processors and then having the receiving processors return this same message back to the originating processor. To eliminate the possibility of an optimizing compiler recognizing that the same message is being sent back and hence need not be sent, one element of the message is altered by the receiving processor prior to sending the message back. Notice that this communication pattern causes significantly more data traffic on the communication network than the simple broadcast operation described in Communication Test 2. Table 3 gives the Normalized Broadcast Rates for this operation. Notice that the trends in Table 3 are similar to those of Table 2 but that the data rates for 10 MB messages with 16, 32 and 64 processors drop off significantly on the SP-2.
   Message   Number of    IBM SP-2    IBM SP-2    Cray T3E     Ratio
   Size in  Processors  thin nodes  wide nodes              T3E/wide
     bytes
         8           2         103         124         426       3.4
         8           4          64          74         202       2.7
         8           8          40          45          96       2.1
         8          16          23          26          37       1.4
         8          32          12          13          12       0.9
         8          64           5          na           3
     1,000           2       5,925       7,676      28,811       3.8
     1,000           4       3,665       4,735      11,482       2.4
     1,000           8       2,243       2,841       5,009       1.8
     1,000          16       1,229       1,578       1,584       1.0
     1,000          32         638         835         418       0.5
     1,000          64         294          na         106
   100,000           2      27,226      31,479     108,573       3.4
   100,000           4      11,292      12,778      68,032       5.3
   100,000           8       5,629       6,420      42,286       6.6
   100,000          16       2,768       3,419      25,039       7.3
   100,000          32       1,466       1,731      14,037       8.1
   100,000          64         758          na       7,568
10,000,000           2      32,109      33,766     108,512       3.2
10,000,000           4      13,117      13,542      70,423       5.2
10,000,000           8       6,518       6,764      47,711       7.1
10,000,000          16       3,333       3,964      29,399       7.4
10,000,000          32       1,899       2,016      16,962       8.4
10,000,000          64         946          na       9,310

Table 3: Normalized Broadcast Rates in KB/second for a broadcast and return.
Communication Test 4 (Table 4):
The rest of the communication tests are designed to measure communication between “neighboring” processors. As above, let N be the total number of processors and assume that they have been numbered from 1 to N. This communication test sends a message from processor i to processor (i+1) mod N, for i = 1, 2, …, N. Observe that the data rates for this test will increase proportionally with N since communication can (hopefully) be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, we define the Normalized Data Rate to be

    (total data rate)/N.

In an ideal parallel computer, the Normalized Data Rate for the above communication would be constant since all communication would be done concurrently. Thus, the degree to which the Normalized Data Rate is not constant indicates how far from ideal this type of communication can be performed by the parallel computer. Table 4 gives the Normalized Data Rates for the above communication in KB/second. Notice that for all message sizes the data rate scales well for both machines and that the T3E significantly outperforms the SP-2.
   Message   Number of    IBM SP-2    IBM SP-2    Cray T3E     Ratio
   Size in  Processors  thin nodes  wide nodes              T3E/wide
     bytes
         8           2          67          75         259       3.5
         8           4          67          72         231       3.2
         8           8          65          70         228       3.2
         8          16          61          67         226       3.4
         8          32          55          66         223       3.7
         8          64          43          na         222
     1,000           2       3,994       4,936      15,845       3.2
     1,000           4       3,642       4,671      14,382       3.1
     1,000           8       3,644       4,507      14,247       3.2
     1,000          16       3,475       4,303      14,142       3.3
     1,000          32       3,444       3,964      14,093       3.6
     1,000          64       3,294          na      13,864
   100,000           2      13,612      15,832      57,801       3.7
   100,000           4      13,554      15,694      57,126       3.6
   100,000           8      13,364      15,645      57,166       3.7
   100,000          16      13,290      15,389      57,079       3.7
   100,000          32      13,197      14,899      56,869       3.8
   100,000          64      12,630          na      55,955
10,000,000           2      16,091      16,893      56,001       3.3
10,000,000           4      16,056      16,939      47,475       2.8
10,000,000           8      15,722      16,808      40,355       2.4
10,000,000          16      15,791      16,767      40,336       2.4
10,000,000          32      15,497      16,739      40,307       2.4
10,000,000          64      13,245          na      40,126

Table 4: Normalized Data Rates in KB/second for processor i to processor (i+1) mod N.
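
The ring pattern of Communication Test 4 and its normalization can be sketched in a few lines of Python. This is an illustration with hypothetical function names, and it numbers processors from 0 to N-1 (an assumption made so that the mod N arithmetic always names an existing processor). The last lines use entries from Table 4 to show how much of the N = 2 rate is retained at N = 64.

```python
def ring_destinations(n):
    """Destination of each processor in the ring pattern, using 0-based ranks."""
    return [(i + 1) % n for i in range(n)]


def normalized_data_rate(total_data_rate, n):
    """(total data rate) / N."""
    return total_data_rate / n


def retained_fraction(rate_at_n, rate_at_2):
    """How much of the N = 2 normalized rate survives at a larger N (1.0 is ideal)."""
    return rate_at_n / rate_at_2


print(ring_destinations(4))

# Table 4, 100,000-byte messages, N = 64 versus N = 2 (KB/second).
print(round(retained_fraction(55955, 57801), 3))  # Cray T3E
print(round(retained_fraction(12630, 13612), 3))  # IBM SP-2 thin nodes
```

Values close to 1.0 are consistent with the observation that the data rate scales well on both machines for this pattern.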
Communication Test 5 (Table 5):
This next communication pattern sends a message from processor i to each of its neighbors (i-1) mod N and (i+1) mod N, for i = 1, 2, …, N, where N is the number of processors used in the test. Thus, the amount of data being moved on the network will be twice that of the previous test. Table 5 shows the performance results. Notice that the T3E is able to handle the extra data traffic better than the SP-2. Also notice that the data rates scale well as the number of processors used increases.
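
As an illustration of the two-neighbor pattern, the Python sketch below lists the destinations of each processor and compares the number of messages with the one-neighbor pattern of Communication Test 4. The function names are hypothetical and ranks are numbered from 0 to N-1, which is an assumption of this sketch.

```python
def neighbor_destinations(n):
    """For each 0-based rank i, the destinations (i - 1) mod n and (i + 1) mod n."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def messages_per_step(n, destinations_per_processor):
    """Number of messages sent when every processor sends to a fixed number of peers."""
    return n * destinations_per_processor


pattern = neighbor_destinations(4)
print(pattern[0])  # rank 0 sends to ranks 3 and 1

# Twice as many messages as the one-neighbor ring pattern.
print(messages_per_step(4, 2) / messages_per_step(4, 1))
```

Note that for N = 2 the two neighbors coincide (both are the other processor); how the tests treat that case is not stated in the text.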
   Message   Number of    IBM SP-2    IBM SP-2    Cray T3E     Ratio
   Size in  Processors  thin nodes  wide nodes              T3E/wide
     bytes
         8           2          90         100         487       4.9
         8           4          85          95         478       5.1
         8           8          78          92         475       5.2
         8          16          69          89         470       5.3
         8          32          65          84         414       4.9
         8          64          54          na         386
     1,000           2       6,177       7,374      27,599       3.7
     1,000           4       5,420       6,965      26,804       3.8
     1,000           8       5,166       6,576      26,810       4.1
     1,000          16       4,809       6,616      26,294       4.0
     1,000          32       4,775       6,088      21,783       3.6
     1,000          64       3,845          na      20,773
   100,000           2      14,793      17,715     100,498       5.7
   100,000           4      15,849      20,493      94,391       4.6
   100,000           8      15,969      20,445     100,140       4.9
   100,000          16      15,706      20,389     100,059       4.9
   100,000          32      15,233      19,828      99,950       5.0
   100,000          64      14,914          na      99,212
10,000,000           2      17,106      18,638     104,615       5.6
10,000,000           4      18,821      22,556      81,668       3.6
10,000,000           8      18,334      22,396      72,293       3.2
10,000,000          16      18,122      22,437      72,372       3.2
10,000,000          32      17,799      22,111      72,483       3.3
10,000,000          64      16,518          na      72,512

Table 5: Normalized Data Rates for processor i to processors (i+1) mod N & (i-1) mod N.

Communication Test 6 (Table 6):
To describe this last communication pattern, let N be the number of processors used for the test and let P be a permutation of the first N positive integers. This communication pattern can then be described as sending a message from processor P(i) to processor P((i+1) mod N). Notice that this is exactly the same as the communication pattern described in Communication Test 4 if one were to reorder the processor numbering from 1, 2, …, N to P(1), P(2), …, P(N). Thus, the purpose of this test is to determine the impact on performance of reordering the numbering of the processors. Of course, it is likely that the performance will depend on the particular permutation selected. Comparing Table 6 with Table 4, one observes that the Normalized Data Rates for both the T3E and SP-2 can in fact change significantly depending on the permutation selected.
   Message   Number of    IBM SP-2    IBM SP-2    Cray T3E     Ratio
   Size in  Processors  thin nodes  wide nodes              T3E/wide
     bytes
         8           2          88          99         483       4.9
         8           4          86          96         472       4.9
         8           8          79          94         461       4.9
         8          16          75          91         440       4.8
         8          32          69          85         408       4.8
         8          64          55          na         398
     1,000           2       5,957       7,318      28,127       3.8
     1,000           4       5,697       6,851      27,067       4.0
     1,000           8       5,353       6,770      26,578       3.9
     1,000          16       5,256       6,593      21,808       3.3
     1,000          32       5,046       6,339      20,775       3.3
     1,000          64       4,972          na      20,150
   100,000           2      14,599      17,868     100,425       5.6
   100,000           4      16,015      20,453      90,487       4.4
   100,000           8      15,793      20,474      97,556       4.8
   100,000          16      14,866      20,417      96,906       4.7
   100,000          32      15,554      20,328      89,936       4.4
   100,000          64      14,176          na      80,868
10,000,000           2      17,171      19,413      99,158       5.1
10,000,000           4      18,275      22,565      69,569       3.1
10,000,000           8      17,311      22,520      76,064       3.8
10,000,000          16      17,111      22,378      67,416       3.0
10,000,000          32      17,010      22,179      59,208       2.7
10,000,000          64      14,667          na      54,516

Table 6: Normalized Data Rates for processor P(i) to processor P((i+1) mod N).
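
The statement that the permuted pattern of Communication Test 6 is the ring pattern of Communication Test 4 under a renumbering of the processors can be illustrated with a small Python sketch. The function names and the example permutation are hypothetical, and the sketch uses a 0-based permutation of 0, …, N-1 rather than the 1-based numbering of the text.

```python
def permuted_ring(perm):
    """(source, destination) pairs P(i) -> P((i + 1) mod N) for a 0-based permutation."""
    n = len(perm)
    return [(perm[i], perm[(i + 1) % n]) for i in range(n)]


def relabel(pairs, perm):
    """Rename processor P(i) as i; the result should be the plain ring of Test 4."""
    position = {p: i for i, p in enumerate(perm)}
    return [(position[src], position[dst]) for src, dst in pairs]


perm = [2, 0, 3, 1]  # an arbitrary example permutation (hypothetical)
pairs = permuted_ring(perm)
print(pairs)                 # which physical processors exchange messages
print(relabel(pairs, perm))  # the same pattern seen as a ring i -> (i + 1) mod N
```

The logical pattern is unchanged by the permutation; only the physical pairs of processors that communicate differ, which is what the test is designed to expose.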
CONCLUSIONS
The purpose of this study was to evaluate the communication performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests written in MPI. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Both machines showed a very high level of concurrency on the nearest neighbor communication tests and moderate concurrency on the broadcast communication tests. On the SP-2, thin node performance was roughly 10% lower than wide node performance on our tests. The T3E significantly outperformed the SP-2, with most performance tests being at least three times faster than wide nodes on the SP-2.

ACKNOWLEDGMENTS
Computer time on the Maui High Performance Computing Center’s SP-2 was sponsored by the Phillips Laboratory, Air Force Materiel Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.

We would like to thank Cray Research Inc. for allowing us to use their T3E at their corporate headquarters in Eagan, Minnesota, USA.
REFERENCES
1. Cray MPP Fortran Reference Manual, SR-2504 6.2.2, Cray Research, Inc., June 1995.
2. J. Dongarra, R. Whaley, A User’s Guide to the BLACS v1.0, Computer Science Department Technical Report CS-95-281, University of Tennessee, 1995. (Available as LAPACK Working Note 94 at: http://www.netlib.org/lapack/lawns/lawn94.ps)
3. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine, A Users’ Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994.
4. R. Hockney, M. Berry, Public International Benchmarks for Parallel Computers: PARKBENCH Committee, Report-1, February 7, 1994.
5. W. Gropp, E. Lusk, A. Skjellum, USING MPI, The MIT Press, 1994.
6. G. Luecke, J. Coyle, W. Haque, J. Hoekstra, H. Jespersen, Performance Comparison of Workstation Clusters for Scientific Computing, SUPERCOMPUTER, vol XII, no. 2, pp 4-20, March 1996.
7. Optimization and Tuning Guide for Fortran, C, and C++ for AIX version 4, second edition, IBM, June 1996.
8. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference, The MIT Press, 1996.