
COMPARING THE PERFORMANCE OF MPI ON THE CRAY RESEARCH T3E AND IBM SP-2 ¹

by
Glenn R. Luecke and James J. Coyle
[email protected]@iastate.edu
Iowa State University
Ames, Iowa 50011-2251, USA

Waqar ul Haque
[email protected]
University of Northern British Columbia
Prince George, British Columbia, Canada V2N 4Z9

February 8, 1997

Abstract: This paper reports the performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Communication tests were performed for messages of sizes 8 bytes, 1 KB, 100 KB, and 10 MB with 2, 4, 8, 16, 32 and 64 processors. Both machines provided a very high level of concurrency for the nearest neighbor communication tests and moderate concurrency on the broadcast operations. On the tests used, the T3E significantly outperformed the SP-2 with most performance tests being at least three times faster than the SP-2.

INTRODUCTION

Message Passing Interface, MPI, [5, 8] is rapidly becoming the standard for writing scientific programs with explicit message passing rather than PVM [3] or the BLACS [2]. The communication network of a parallel computer plays a vitally important role in its overall performance, see [4, 6]. There are so many different ways that communication may occur when running scientific programs that it is not possible to test all of them. However, the MPI communication tests that have been developed have been designed to test those communication patterns that we feel are likely to occur in scientific programs. The purpose of this study is to evaluate communication performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing.

¹ A follow-on study is planned that will evaluate the performance of MPI on the SGI ORIGIN 2000, IBM's follow-on to the SP-2, and the Cray Research T3E-900.

DESCRIPTION OF THE PERFORMANCE TESTS AND RESULTS

All our communication tests have been written in Fortran with calls to MPI routines for the message passing. These tests were run with message sizes ranging from 8 bytes to 10 MB and with the number of processors ranging from 2 to 64. Some of these communication patterns take a very short amount of time to execute so they are looped to obtain a wall-clock time of at least one second in order to obtain accurate timings. The time of the communication pattern is then obtained by dividing the total time by the number of iterations. All timings were done using a wall-clock timer. Tests were run at least ten times and the best performance numbers are reported.

Tests for the Cray T3E were run on a 64 processor T3E located at Cray Research's corporate headquarters in Eagan, Minnesota. The peak theoretical performance of each processor is 600 mflops. The communication network has a bandwidth of 350 MB/second and latency of 1.5 microsecond. The T3E communication network is a bi-directional 3-D torus. For more information on the T3E see http://www.cray.com. The operating system used was UNICOS/mk version 1.3.1 and the Fortran compiler used was cf90 version 3.0 with the O2 optimization level.
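The timing procedure described earlier in this section (loop a short pattern until at least one second of wall-clock time has elapsed, divide by the iteration count, repeat at least ten times and keep the best result) can be sketched as follows. This is a minimal illustration in Python rather than the authors' Fortran code. The function names, the doubling strategy for choosing the iteration count, and the convention that 1 KB equals 1,000 bytes are assumptions made for this sketch.

```python
import time


def time_pattern(pattern, min_seconds=1.0):
    """Return seconds per call, looping until the total wall-clock time is long enough."""
    iterations = 1
    while True:
        start = time.perf_counter()           # wall-clock timer
        for _ in range(iterations):
            pattern()
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return elapsed / iterations       # total time divided by number of iterations
        iterations *= 2                       # too short to time accurately, so loop more


def best_time(pattern, repeats=10, min_seconds=1.0):
    """Repeat the timing and keep the best (smallest) time per call."""
    return min(time_pattern(pattern, min_seconds) for _ in range(repeats))


def rate_kb_per_second(message_bytes, seconds):
    """Data rate in KB/second, assuming 1 KB = 1,000 bytes."""
    return message_bytes / 1000.0 / seconds
```

In an MPI program, `pattern` would be the communication pattern under test and the timer would typically be `MPI_Wtime`.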

Tests for the IBM SP-2 were run at the Maui High Performance Computing Center. The peak theoretical performance of each of these processors is 267 mflops. The communication network has a peak bi-directional bandwidth of 40 MB/second with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Wide (thin) nodes have a 256 (64) KB data cache, 256 (64) bit path from memory to the data cache, and 256 (128) bit path from the data cache to the processor bus. The IBM SP-2 uses a bi-directional multistage interconnection network and may be configured with thin and/or wide nodes. Performance tests were done separately for thin nodes and for wide nodes. For more information about the SP-2, see http://www.mhpcc.edu/training/workshop/html/ibmhwsw/ibmhwsw.html. The IBM SP-2 did not have 64 wide nodes available for our tests so no performance results could be obtained with this many wide nodes. AIX version 4.1 was used and the Fortran compiler used was xlf version 3.2.4 with the O2 optimization level.

Communication Test 1 (Table 1):

The first communication test sends a message from one processor to another then the second processor sends a single-integer message back to the sender indicating that the message was received. This test is different from the COMMS1 test described in section 3.3.1 of [4] in that the COMMS1 test sends a message to a processor and then that same message is sent back to the first processor. Table 1 gives performance rates in KB per second. Notice that the T3E consistently outperforms the wide nodes by more than a factor of three. Thin node performance ranged from 5% to 14% less than wide nodes.

Message size    IBM SP-2    IBM SP-2  Cray T3E     Ratio
  (in bytes)   thin node   wide node            T3E/wide
           8         106         114       374       3.3
       1,000       5,625       6,519    20,374       3.1
     100,000      27,619      31,349   111,742       3.6
  10,000,000      32,304      33,878   111,551       3.3

Table 1: Processor to processor communication rate in KB/second.
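The two summary statements about Table 1 (the T3E/wide ratio and the 5% to 14% gap between thin and wide nodes) can be checked directly from the tabulated rates. The short sketch below does this arithmetic. The rates are copied from Table 1, while the function and variable names are illustrative choices.

```python
# Rates copied from Table 1, in KB/second.
# Key: message size in bytes. Value: (SP-2 thin, SP-2 wide, T3E).
TABLE_1 = {
    8: (106, 114, 374),
    1_000: (5_625, 6_519, 20_374),
    100_000: (27_619, 31_349, 111_742),
    10_000_000: (32_304, 33_878, 111_551),
}


def t3e_over_wide(thin, wide, t3e):
    """Ratio of the T3E rate to the SP-2 wide node rate."""
    return t3e / wide


def thin_deficit_percent(thin, wide, t3e):
    """How far the thin node rate falls below the wide node rate, in percent."""
    return 100.0 * (wide - thin) / wide


for size, rates in TABLE_1.items():
    print(f"{size:>10} bytes: T3E/wide = {t3e_over_wide(*rates):.1f}, "
          f"thin is {thin_deficit_percent(*rates):.0f}% below wide")
```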

Communication Test 2 (Table 2):

We next measure communication rates for sending a message from one processor to all of the other processors. This test is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define a Normalized Broadcast Rate as

(total data rate)/(N-1)

where N is the total number of processors involved in the communication and where the total data rate is the total amount of data sent on the communication network per unit time and measured in KB per second. Let R be the data rate when sending a message from one processor to another and let D be the total data rate for broadcasting the same message to the N-1 other processors. If the broadcast operation and communication network were able to concurrently transmit the messages, then D = R*(N-1) and thus the Normalized Broadcast Rate would remain constant as N varied for a given message size. Thus, the rate at which the Normalized Broadcast Rate decreases as N increases indicates how far the broadcast operation is from being ideal. Table 2 gives the Normalized Broadcast Rates for the T3E and for both wide and thin nodes on the SP-2. Notice that in all cases instead of being constant for a given message size, the Normalized Broadcast Rate decreases significantly as the number of processors increases. Also notice that for all message sizes the T3E is roughly 3 to 5 times faster than wide nodes but this factor decreases as the number of processors used increases.
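The normalization defined above can be made concrete with a few lines of arithmetic. The sketch below computes the Normalized Broadcast Rate and compares measured values against the ideal case D = R*(N-1). The 8-byte T3E figures are copied from Table 2. Treating the N = 2 entry as the point-to-point rate R is an assumption made only for this illustration, and the function names are hypothetical.

```python
def normalized_broadcast_rate(total_data_rate, n):
    """(total data rate)/(N-1), in the same units as total_data_rate."""
    if n < 2:
        raise ValueError("a broadcast needs at least two processors")
    return total_data_rate / (n - 1)


def ideal_total_rate(point_to_point_rate, n):
    """D = R*(N-1): total rate if all N-1 messages were transmitted concurrently."""
    return point_to_point_rate * (n - 1)


# T3E, 8-byte messages: Normalized Broadcast Rates from Table 2 (KB/second).
T3E_8_BYTES = {2: 345, 4: 195, 8: 101, 16: 47, 32: 16, 64: 4}

r = T3E_8_BYTES[2]  # assumption: use the N = 2 rate as R
for n, measured in T3E_8_BYTES.items():
    ideal = normalized_broadcast_rate(ideal_total_rate(r, n), n)  # equals r for every N
    print(f"N={n:2d}: measured {measured:3d} KB/s, {100 * measured / ideal:5.1f}% of ideal")
```

In the ideal case the last column would read 100% for every N, so the falling percentages show how far the broadcast is from fully concurrent transmission.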

Message size   Number of    IBM SP-2    IBM SP-2  Cray T3E     Ratio
  (in bytes)  processors  thin nodes  wide nodes            T3E/wide
           8           2          75          78       345       4.4
           8           4          52          55       195       3.5
           8           8          31          38       101       2.7
           8          16          20          24        47       2.0
           8          32          12          14        16       1.1
           8          64           5          na         4
       1,000           2       4,587       5,184    23,609       4.6
       1,000           4       2,711       3,374    13,485       4.0
       1,000           8       1,837       2,433     7,380       3.0
       1,000          16       1,366       1,717     3,759       2.2
       1,000          32         833       1,073     1,398       1.3
       1,000          64         378          na       412
     100,000           2      27,181      30,685    98,729       3.2
     100,000           4       9,755      10,826    51,874       4.8
     100,000           8       5,057       5,516    35,357       6.4
     100,000          16       4,650       8,268    26,632       3.2
     100,000          32       3,190       4,049    21,545       5.3
     100,000          64       1,943          na    17,833
  10,000,000           2      32,298      33,614   100,336       3.0
  10,000,000           4      10,970      11,332    53,220       4.7
  10,000,000           8       5,507       5,680    36,821       6.5
  10,000,000          16      10,638      16,472    27,321       1.7
  10,000,000          32       9,662      14,523    22,371       1.5
  10,000,000          64       8,147          na    18,466

Table 2: Normalized Broadcast Rates in KB/second.

To see that there actually is concurrency occurring in the broadcast operation, define the Log Normalized Broadcast Rate as

(total data rate)/Log(N),

where N is the number of processors involved in the communication and Log(N) is the log base 2 of N. Thus, if binary tree parallelism were being utilized, the Log Normalized Data Rate would be constant for a given message size as N varies. Table 2-b gives the Log Normalized Data Rates and does in fact show that concurrency is being utilized in the broadcast operation for both machines.
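Since both normalizations divide the same total data rate, a Log Normalized rate can be derived from a Normalized Broadcast Rate by multiplying by (N-1) and dividing by the log base 2 of N. The sketch below applies this to the 8-byte T3E column and compares the result with the corresponding Table 2-b entries. The values are copied from the two tables, and the function name is an illustrative choice.

```python
import math


def log_normalized_rate(normalized_broadcast_rate, n):
    """Convert (total data rate)/(N-1) into (total data rate)/log2(N)."""
    total_data_rate = normalized_broadcast_rate * (n - 1)
    return total_data_rate / math.log2(n)


# T3E, 8-byte messages, in KB/second.
T3E_8_BYTES_TABLE_2 = {2: 345, 4: 195, 8: 101, 16: 47, 32: 16, 64: 4}
T3E_8_BYTES_TABLE_2B = {2: 345, 4: 293, 8: 236, 16: 176, 32: 99, 64: 42}

for n, rate in T3E_8_BYTES_TABLE_2.items():
    derived = log_normalized_rate(rate, n)
    print(f"N={n:2d}: derived {derived:6.1f}, Table 2-b reports {T3E_8_BYTES_TABLE_2B[n]}")
```

The derived values agree with the tabulated ones to within rounding.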

Message size   Number of    IBM SP-2    IBM SP-2  Cray T3E     Ratio
  (in bytes)  processors  thin nodes  wide nodes            T3E/wide
           8           2          75          78       345       4.4
           8           4          78          83       293       3.5
           8           8          72          89       236       2.7
           8          16          75          90       176       2.0
           8          32          74          87        99       1.1
           8          64          53          na        42
       1,000           2       4,587       5,184    23,609       4.6
       1,000           4       4,067       5,061    20,228       4.0
       1,000           8       4,286       5,677    17,220       3.0
       1,000          16       5,123       6,439    14,096       2.2
       1,000          32       5,165       6,653     8,668       1.3
       1,000          64       3,969          na     4,326
     100,000           2      27,181      30,685    98,729       3.2
     100,000           4      14,633      16,239    77,811       4.8
     100,000           8      11,800      12,871    82,500       6.4
     100,000          16      17,438      31,005    99,870       3.2
     100,000          32      19,778      25,104   133,579       5.3
     100,000          64      20,402          na   187,247
  10,000,000           2      32,298      33,614   100,336       3.0
  10,000,000           4      16,455      16,998    79,830       4.7
  10,000,000           8      12,850      13,253    85,916       6.5
  10,000,000          16      39,893      61,770   102,454       1.7
  10,000,000          32      59,904      90,043   138,700       1.5
  10,000,000          64      85,544          na   193,893

Table 2-b: Log Normalized Broadcast Rates in KB/second.

Communication Test 3 (Table 3):

Communication Test 3 measures the rates for broadcasting a message from a processor to all other processors and then having the receiving processors return this same message back to the originating processor. To eliminate the possibility of an optimizing compiler recognizing that it is the same message being sent back and hence need not be sent back, one element of the message is altered by the receiving processor prior to sending the message back. Notice that this communication pattern causes significantly more data traffic on the communication network than the simple broadcast operation described in Communication Test 2. Table 3 gives the Normalized Broadcast Rates for this operation. Notice that the trends in Table 3 are similar to those of Table 2 but that the data rates for 10 MB messages with 16, 32 and 64 processors drop off significantly on the SP-2.

Message size   Number of    IBM SP-2    IBM SP-2  Cray T3E     Ratio
  (in bytes)  processors  thin nodes  wide nodes            T3E/wide
           8           2         103         124       426       3.4
           8           4          64          74       202       2.7
           8           8          40          45        96       2.1
           8          16          23          26        37       1.4
           8          32          12          13        12       0.9
           8          64           5          na         3
       1,000           2       5,925       7,676    28,811       3.8
       1,000           4       3,665       4,735    11,482       2.4
       1,000           8       2,243       2,841     5,009       1.8
       1,000          16       1,229       1,578     1,584       1.0
       1,000          32         638         835       418       0.5
       1,000          64         294          na       106
     100,000           2      27,226      31,479   108,573       3.4
     100,000           4      11,292      12,778    68,032       5.3
     100,000           8       5,629       6,420    42,286       6.6
     100,000          16       2,768       3,419    25,039       7.3
     100,000          32       1,466       1,731    14,037       8.1
     100,000          64         758          na     7,568
  10,000,000           2      32,109      33,766   108,512       3.2
  10,000,000           4      13,117      13,542    70,423       5.2
  10,000,000           8       6,518       6,764    47,711       7.1
  10,000,000          16       3,333       3,964    29,399       7.4
  10,000,000          32       1,899       2,016    16,962       8.4
  10,000,000          64         946          na     9,310

Table 3: Normalized Broadcast Rates in KB/second for a broadcast and return.

Communication Test 4 (Table 4):

The rest of the communication tests are designed to measure communication between "neighboring" processors. As above, let N be the total number of processors and assume that they have been numbered from 1 to N. This communication test sends a message from processor i to processor (i+1) mod N, for i = 1, 2, …, N. Observe that the data rates for this test will increase proportionally with N since communication can (hopefully) be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, we define the Normalized Data Rate to be

(total data rate)/N.

In an ideal parallel computer, the Normalized Data Rate for the above communication would be constant since all communication would be done concurrently. Thus, the degree to which the Normalized Data Rate is not constant indicates how far from ideal this type of communication can be performed by the parallel computer. Table 4 gives the Normalized Data Rates for the above communication in KB/second. Notice that for all message sizes the data rate scales well for both machines and that the T3E significantly outperforms the SP-2.
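The neighbor pattern and its normalization can be illustrated with a short sketch. It assumes 0-based ranks 0, …, N-1 (as MPI uses) rather than the 1-to-N numbering in the text, so that (i+1) mod N always names a valid processor. The 8-byte rates are copied from Table 4, and the function names are hypothetical. The "scaling efficiency" helper is not a quantity defined in the paper. It simply expresses each Normalized Data Rate as a fraction of the N = 2 value.

```python
def ring_destinations(n):
    """Map each rank i to (i + 1) mod n, using 0-based ranks 0..n-1."""
    return {i: (i + 1) % n for i in range(n)}


def normalized_data_rate(total_data_rate, n):
    """(total data rate)/N."""
    return total_data_rate / n


def scaling_efficiency(rates_by_n):
    """Each Normalized Data Rate as a fraction of the value at the smallest N."""
    baseline = rates_by_n[min(rates_by_n)]
    return {n: rate / baseline for n, rate in rates_by_n.items()}


# 8-byte Normalized Data Rates from Table 4 (KB/second).
T3E_8_BYTES = {2: 259, 4: 231, 8: 228, 16: 226, 32: 223, 64: 222}
THIN_8_BYTES = {2: 67, 4: 67, 8: 65, 16: 61, 32: 55, 64: 43}

print(ring_destinations(4))  # {0: 1, 1: 2, 2: 3, 3: 0}
for name, rates in (("T3E", T3E_8_BYTES), ("SP-2 thin", THIN_8_BYTES)):
    eff = scaling_efficiency(rates)
    print(name, {n: f"{100 * e:.0f}%" for n, e in eff.items()})
```

An ideal machine would show 100% at every N.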

Message size   Number of    IBM SP-2    IBM SP-2  Cray T3E     Ratio
  (in bytes)  processors  thin nodes  wide nodes            T3E/wide
           8           2          67          75       259       3.5
           8           4          67          72       231       3.2
           8           8          65          70       228       3.2
           8          16          61          67       226       3.4
           8          32          55          66       223       3.7
           8          64          43          na       222
       1,000           2       3,994       4,936    15,845       3.2
       1,000           4       3,642       4,671    14,382       3.1
       1,000           8       3,644       4,507    14,247       3.2
       1,000          16       3,475       4,303    14,142       3.3
       1,000          32       3,444       3,964    14,093       3.6
       1,000          64       3,294          na    13,864
     100,000           2      13,612      15,832    57,801       3.7
     100,000           4      13,554      15,694    57,126       3.6
     100,000           8      13,364      15,645    57,166       3.7
     100,000          16      13,290      15,389    57,079       3.7
     100,000          32      13,197      14,899    56,869       3.8
     100,000          64      12,630          na    55,955
  10,000,000           2      16,091      16,893    56,001       3.3
  10,000,000           4      16,056      16,939    47,475       2.8
  10,000,000           8      15,722      16,808    40,355       2.4
  10,000,000          16      15,791      16,767    40,336       2.4
  10,000,000          32      15,497      16,739    40,307       2.4
  10,000,000          64      13,245          na    40,126

Table 4: Normalized Data Rates for processor i to processor (i+1) mod N.

Communication Test 5 (Table 5):

This next communication pattern sends a message from a processor i to each of its neighbors (i-1) mod N and (i+1) mod N, for i = 1, 2, …, N, and where N is the number of processors used in the test. Thus, the amount of data being moved on the network will be twice that of the previous test. Table 5 shows the performance results. Notice that the T3E is able to handle the extra data traffic better than the SP-2. Also notice that the data rates scale well as the number of processors used increases.
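One way to quantify how well each machine absorbs the doubled traffic is to divide a Table 5 rate by the corresponding Table 4 rate. The sketch below does this for N = 2 with values copied from the two tables. It assumes that both tables are normalized in the same way, which the text does not state explicitly, so the resulting "gain" is an illustrative figure rather than one reported by the authors.

```python
# Normalized Data Rates in KB/second for N = 2, copied from Tables 4 and 5.
# Key: message size in bytes. Value: (SP-2 wide nodes, Cray T3E).
TABLE_4_N2 = {
    8: (75, 259),
    1_000: (4_936, 15_845),
    100_000: (15_832, 57_801),
    10_000_000: (16_893, 56_001),
}
TABLE_5_N2 = {
    8: (100, 487),
    1_000: (7_374, 27_599),
    100_000: (17_715, 100_498),
    10_000_000: (18_638, 104_615),
}


def gain(two_neighbor_rate, one_neighbor_rate):
    """Rate with two neighbors divided by rate with one neighbor."""
    return two_neighbor_rate / one_neighbor_rate


for size in TABLE_4_N2:
    wide4, t3e4 = TABLE_4_N2[size]
    wide5, t3e5 = TABLE_5_N2[size]
    print(f"{size:>10} bytes: SP-2 wide gain {gain(wide5, wide4):.2f}, "
          f"T3E gain {gain(t3e5, t3e4):.2f}")
```

Under that assumption, a gain close to 2 means the second message costs almost no extra time. For every message size the T3E gain is larger than the SP-2 wide node gain.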

Message size   Number of    IBM SP-2    IBM SP-2  Cray T3E     Ratio
  (in bytes)  processors  thin nodes  wide nodes            T3E/wide
           8           2          90         100       487       4.9
           8           4          85          95       478       5.1
           8           8          78          92       475       5.2
           8          16          69          89       470       5.3
           8          32          65          84       414       4.9
           8          64          54          na       386
       1,000           2       6,177       7,374    27,599       3.7
       1,000           4       5,420       6,965    26,804       3.8
       1,000           8       5,166       6,576    26,810       4.1
       1,000          16       4,809       6,616    26,294       4.0
       1,000          32       4,775       6,088    21,783       3.6
       1,000          64       3,845          na    20,773
     100,000           2      14,793      17,715   100,498       5.7
     100,000           4      15,849      20,493    94,391       4.6
     100,000           8      15,969      20,445   100,140       4.9
     100,000          16      15,706      20,389   100,059       4.9
     100,000          32      15,233      19,828    99,950       5.0
     100,000          64      14,914          na    99,212
  10,000,000           2      17,106      18,638   104,615       5.6
  10,000,000           4      18,821      22,556    81,668       3.6
  10,000,000           8      18,334      22,396    72,293       3.2
  10,000,000          16      18,122      22,437    72,372       3.2
  10,000,000          32      17,799      22,111    72,483       3.3
  10,000,000          64      16,518          na    72,512

Table 5: Normalized Data Rates for processor i to processors (i+1) mod N & (i-1) mod N.

Communication Test 6 (Table 6):

To describe this last communication pattern, let N be the number of processors used for the test and let P be a permutation of the first N positive integers. This communication pattern can then be described as sending a message from processor P(i) to processor P((i+1) mod N). Notice that this is exactly the same as the communication pattern described in Communication Test 4 if one were to reorder the processor numbering from 1, 2, …, N to P(1), P(2), …, P(N). Thus, the purpose of this test is to determine the impact on performance of reordering the numbering of the processors. Of course, it is likely that the performance will depend on the particular permutation selected. Comparing Table 6 with Table 4, one observes that the Normalized Data Rates for both the T3E and SP-2 can in fact change significantly depending on the permutation selected.
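The permuted pattern can be sketched as a list of (sender, receiver) pairs. The example below assumes 0-based ranks and draws a pseudo-random permutation with a fixed seed. The paper does not say which permutation was used, so the permutation, the seed, and the function names here are purely illustrative.

```python
import random


def permuted_ring(permutation):
    """Return (sender, receiver) pairs P(i) -> P((i + 1) mod N) for 0-based i."""
    n = len(permutation)
    return [(permutation[i], permutation[(i + 1) % n]) for i in range(n)]


def random_permutation(n, seed=0):
    """A reproducible permutation of the ranks 0..n-1 (the seed is an arbitrary choice)."""
    ranks = list(range(n))
    random.Random(seed).shuffle(ranks)
    return ranks


# With the identity permutation the pattern reduces to Communication Test 4.
print(permuted_ring(list(range(4))))  # [(0, 1), (1, 2), (2, 3), (3, 0)]

# With a shuffled numbering, logical neighbors need not be close in the network.
for sender, receiver in permuted_ring(random_permutation(8)):
    print(f"processor {sender} sends to processor {receiver}")
```

Every processor still sends exactly one message and receives exactly one message. Only the pairing changes, which is why any difference from Table 4 reflects the placement of processors in the network.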

Message size   Number of    IBM SP-2    IBM SP-2  Cray T3E     Ratio
  (in bytes)  processors  thin nodes  wide nodes            T3E/wide
           8           2          88          99       483       4.9
           8           4          86          96       472       4.9
           8           8          79          94       461       4.9
           8          16          75          91       440       4.8
           8          32          69          85       408       4.8
           8          64          55          na       398
       1,000           2       5,957       7,318    28,127       3.8
       1,000           4       5,697       6,851    27,067       4.0
       1,000           8       5,353       6,770    26,578       3.9
       1,000          16       5,256       6,593    21,808       3.3
       1,000          32       5,046       6,339    20,775       3.3
       1,000          64       4,972          na    20,150
     100,000           2      14,599      17,868   100,425       5.6
     100,000           4      16,015      20,453    90,487       4.4
     100,000           8      15,793      20,474    97,556       4.8
     100,000          16      14,866      20,417    96,906       4.7
     100,000          32      15,554      20,328    89,936       4.4
     100,000          64      14,176          na    80,868
  10,000,000           2      17,171      19,413    99,158       5.1
  10,000,000           4      18,275      22,565    69,569       3.1
  10,000,000           8      17,311      22,520    76,064       3.8
  10,000,000          16      17,111      22,378    67,416       3.0
  10,000,000          32      17,010      22,179    59,208       2.7
  10,000,000          64      14,667          na    54,516

Table 6: Normalized Data Rates for processor P(i) to processor P((i+1) mod N).

CONCLUSIONS

The purpose of this study was to evaluate communication performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests written in MPI. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Both machines showed a very high level of concurrency on the nearest neighbor communication tests and moderate concurrency on the broadcast communication tests. On the SP-2, thin nodes delivered data rates roughly 10% lower than wide nodes on our tests. The T3E significantly outperformed the SP-2 with most performance tests being at least three times faster than wide nodes on the SP-2.

ACKNOWLEDGMENTS

Computer time on the Maui High Performance Computing Center's SP-2 was sponsored by the Phillips Laboratory, Air Force Material Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government. We would like to thank Cray Research Inc. for allowing us to use their T3E at their corporate headquarters in Eagan, Minnesota, USA.

REFERENCES

1. Cray MPP Fortran Reference Manual, SR-2504 6.2.2, Cray Research, Inc., June 1995.

2. J. Dongarra, R. Whaley, A User's Guide to the BLACS v1.0, Computer Science Department Technical Report CS-95-281, University of Tennessee, 1995. (Available as LAPACK Working Note 94 at: http://www.netlib.org/lapack/lawns/lawn94.ps)

3. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine, A Users' Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994.

4. R. Hockney, M. Berry, Public International Benchmarks for Parallel Computers: PARKBENCH Committee, Report-1, February 7, 1994.

5. W. Gropp, E. Lusk, A. Skjellum, USING MPI, The MIT Press, 1994.

6. G. Luecke, J. Coyle, W. Haque, J. Hoekstra, H. Jespersen, Performance Comparison of Workstation Clusters for Scientific Computing, SUPERCOMPUTER, vol XII, no. 2, pp 4-20, March 1996.

7. Optimization and Tuning Guide for Fortran, C, and C++ for AIX version 4, second edition, IBM, June 1996.

8. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference, The MIT Press, 1996.
