A comparison of MPI performance on different MPPs

Michael Resch, Holger Berger and Thomas Boenisch

High Performance Computing Center Stuttgart, Parallel Computing Department, D-70550 Stuttgart, Germany

Abstract. Since MPI [1] has become a standard for message-passing on distributed memory machines, a number of implementations have evolved. Today an MPI implementation is available for all relevant MPP systems, a number of which are based on MPICH [2]. In this paper we present a performance comparison of several implementations of MPI on different MPPs. Results for the Cray T3E, the IBM RS/6000 SP, the Hitachi SR2201 and the Intel Paragon are presented. In addition we compare those results to the NEC SX-4, a shared memory PVP. The results cover latency and bandwidth for point-to-point communication as well as global communication and synchronization. This covers a wide range of MPI features used by typical numerical simulation codes. Finally we investigate a core conjugate gradient solver operation to show the behaviour of latency-hiding techniques on different platforms.

1 Introduction

Since MPI has become the standard for message-passing on MPPs, its performance is a critical measure for program development and buying decisions. A number of studies have been published recently [3, 4, 5, 6, 7, 8]. Most of them cover the performance of MPI on a single system or compare two systems. The aim of this study is to evaluate as many systems as possible. It covers results for the Intel Paragon, the IBM RS/6000 SP, the Cray T3E, the Hitachi SR2201 and the NEC SX-4, and is intended to give the user an idea of the costs of different MPI calls. The most important aspects are surely latency and bandwidth. For so-called "latency bound" algorithms a reduction of latency is critical to achieve acceptable efficiency on parallel systems, and it is latency that has been reduced dramatically during the last years, from a few hundred microseconds to a few tens of microseconds. Bandwidth, on the other hand, is important for balanced systems and for ensuring that remote memory access does not become too expensive. Although peak bandwidth and sustained bandwidth of MPPs have made good progress during the last years, they have only increased by about a factor of three.

2 Method of measurement

For the benchmark evaluation a package of test programs was implemented that allows us to investigate different message-passing operations using different methods of measurement. It is written in FORTRAN77 for maximum portability. Time measurement is based on MPI_WTIME. Although we are aware of the relative inaccuracy of this method compared to clock cycle counting, it was found that the accuracy was good enough for the measurements done. For each function tested, the average over 50 measurements was taken. For global communication, processes were synchronized initially by an MPI_Barrier.
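To illustrate the method, the following is a minimal FORTRAN77 sketch of a ping-pong measurement in the spirit of the benchmark package: rank 0 sends a message, rank 1 echoes it back, and half the averaged round-trip time is taken as the one-way transfer time. The program is only an illustration under these assumptions, not the original benchmark code; message size and names are chosen freely.

C     Ping-pong sketch between ranks 0 and 1; the one-way time is
C     half the round-trip time, averaged over 50 repetitions.
      PROGRAM PINGPONG
      INCLUDE 'mpif.h'
      INTEGER NREP, NBYTES
      PARAMETER (NREP = 50, NBYTES = 1048576)
      DOUBLE PRECISION BUF(NBYTES/8), T1, T2
      INTEGER IERR, RANK, I, STATUS(MPI_STATUS_SIZE)

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)

      T1 = MPI_WTIME()
      DO 10 I = 1, NREP
         IF (RANK .EQ. 0) THEN
            CALL MPI_SEND(BUF, NBYTES/8, MPI_DOUBLE_PRECISION, 1, 1,
     &                    MPI_COMM_WORLD, IERR)
            CALL MPI_RECV(BUF, NBYTES/8, MPI_DOUBLE_PRECISION, 1, 2,
     &                    MPI_COMM_WORLD, STATUS, IERR)
         ELSE IF (RANK .EQ. 1) THEN
            CALL MPI_RECV(BUF, NBYTES/8, MPI_DOUBLE_PRECISION, 0, 1,
     &                    MPI_COMM_WORLD, STATUS, IERR)
            CALL MPI_SEND(BUF, NBYTES/8, MPI_DOUBLE_PRECISION, 0, 2,
     &                    MPI_COMM_WORLD, IERR)
         END IF
   10 CONTINUE
      T2 = MPI_WTIME()

C     Average one-way time per message in microseconds
      IF (RANK .EQ. 0) THEN
         WRITE(*,*) 'time (us): ', (T2-T1) / (2*NREP) * 1.0D6
      END IF

      CALL MPI_FINALIZE(IERR)
      END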

3 Tested Hardware

Four MPP systems and one shared memory PVP were investigated. The NEC SX-4 is a 32 processor machine with 8 GB of main memory. Access to the shared memory is realized via a crossbar that couples the processors to 1024 memory banks. The CPU is rated at 125 MHz and offers a peak performance of 2 GFLOPS based on a vector unit and a scalar unit. Assuming that the transfer of data from one process space to another takes at least two copy operations, the peak performance for data transfer is 8 GB/s. The Cray T3E at Stuttgart is a 512 processor distributed memory machine. It is based on the DEC Alpha EV5 (21164) rated at 300 MHz with a peak performance of 600 MFLOPS. The interprocessor network is a 3D-torus with a theoretical bandwidth of 450 MB/s. The MPI version used was Cray MPI 1.1.0.2. The Hitachi SR2201 is a 32 processor distributed memory machine. It is based on a proprietary chip that provides a peak performance of 300 MFLOPS. The interprocessor network is a three-dimensional crossbar network with a peak bandwidth of 300 MB/s. The MPI version used was Hitachi MPI 02.01, which is based on MPICH 1.0.12. The IBM RS/6000 SP has 84 wide nodes rated at 77 MHz and 16 thin nodes rated at 66 MHz. The interprocessor network is the IBM high-performance switch, which provides a theoretical peak bandwidth of 150 MB/s. For our tests only wide nodes were used. The MPI version used was IBM MPI 1.1.6 using ppe.poe 2.1.0.11. The Intel Paragon XP/S is a 113 processor distributed memory machine. It is based on the Intel i860 XP rated at 50 MHz with a peak performance of 75 MFLOPS. The two-dimensional mesh network provides a peak bandwidth of 200 MB/s. The MPI version used was that of Intel, based on MPICH 1.0.12.

4 Results

As a basic model to evaluate the results, Hockney's model [9] was chosen:

    t = t0 + M/r∞                                    (1)

where t is the overall transfer time, t0 is the time measured for a zero-size message (the latency), M is the message size in bytes and r∞ is the asymptotic bandwidth measured in MB/s. Measurements on all machines showed that this is in general a good model to describe the behaviour of blocking communication as measured with the ping-pong method. However, it needs to be slightly modified to reflect the usage of different protocols for different message lengths. The model was modified as worked out by Xu et al. [3] to describe global communication behaviour:

    t = t0(n) + M/r∞(n)                              (2)

where n stands for the number of parallel processes used. Verification of this model requires many measurements since both n and M vary over a wide range. It has turned out that this model is too simple, and we have modified it as described below.

4.1 Point-to-point communication

Latency as presented here is the time for a zero-byte message. Asymptotic bandwidth was measured for a message size of 1 MB. All measurements were done using standard MPI_Send and MPI_Recv. The results for the latency show that the Intel Paragon can still keep up with modern architectures. Only the Cray T3E and the process based MPI on the NEC SX-4 show significantly better results.

Platform      Latency (t0, µs)   Bandwidth (r∞, MB/s)   Peak Bandwidth (MB/s)
Paragon       42                 60                      200
SP            38                 85                      150
SR2201        30                 216                     300
T3E           16                 308                     450
SX-4 proc      9                 1100                    8000
SX-4 thread   40                 6053                    8000

Table 1. Latency and asymptotic bandwidth for different hardware platforms.

Taking into consideration that latencies for PVM or MPI used to be in the range of 100 µs, programmers have seen latency improve by a factor of about 8 during the last 4 years. Bandwidth has grown only by about a factor of five. However, one has to take into account that the high bandwidth achieved on the T3E is only possible by setting buffer sizes correctly [10], while all other values were measured without any changes to parameters. The values given in table 1 are average values. A closer investigation of the results reveals that latency and bandwidth depend highly on the type of protocol used for a specific message size. We have investigated these values in detail for the Cray T3E and the Hitachi SR2201, as described in table 2.

SR2201
Size       Latency (t0, µs)   Bandwidth (r∞, MB/s)
0-128      30                 144
129-16K    48                 152
> 16K      142                216

T3E
Size       Latency (t0, µs)   Bandwidth (r∞, MB/s)
0-4K       16                 50
> 4K       56                 308

Table 2. Detailed analysis of t0 and r∞ for different hardware platforms.
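As an illustration of the model (using the large-message T3E parameters from table 2, i.e. t0 = 56 µs and r∞ = 308 MB/s, and taking 1 MB as 10^6 bytes), equation (1) predicts for a 1 MB message

    t = 56 µs + 10^6 B / (308 MB/s) ≈ 56 µs + 3247 µs ≈ 3.3 ms

so for large messages the transfer time is dominated almost entirely by the bandwidth term, while for short messages the latency term t0 dominates.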

4.2 Synchronization

MPI as a model supports asynchronous behaviour of codes. However, some algorithms may require synchronization, which may put a cost overhead on the parallel execution. In the following we try to figure out the cost of an MPI_Barrier in microseconds and in terms of floating point operations. The time in microseconds for an MPI_Barrier is given in table 3.

Platform      8     16    32    64    128   256
Paragon       28    40    55    66    -     -
SP            176   270   399   -     -     -
SR2201        210   330   490   -     -     -
T3E           3.5   2.7   3.3   3.5   6.9   7.7
SX-4 thread   264   371   3288  -     -     -

Table 3. Synchronization time in µs on different hardware platforms for an increasing number of processors.

It is obvious that only the T3E yields acceptable results for an MPI_Barrier. Taking a realistic value of 200-250 MFLOPS for the performance of the CPU, a synchronization does not cost more than 600-750 operations. The Intel Paragon, although a 4 year old architecture, still yields acceptable results; the costs in terms of operations are in the range of 600-2000 operations. For the SP this value is about 40000-60000, for the Hitachi about 42000-98000. For the SX-4, synchronization seems to be a serious problem for the programmer. Even if we consider that on 32 processors the time is excessively high because it is difficult to synchronize the whole machine, the cost in terms of operations goes up to about 750K operations for 16 processors.
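The conversion from time to operations used here is simply the barrier time multiplied by an assumed sustained CPU rate. For example, for the T3E on 32 processors, assuming 225 MFLOPS:

    3.3 µs × 225 MFLOPS ≈ 740 floating point operations

and for the SX-4 on 16 processors, using the 2 GFLOPS peak rate: 371 µs × 2 GFLOPS ≈ 742000 operations, which is the roughly 750K operations quoted above.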

4.3 Broadcast Operation

Aggregated bandwidth for a broadcast communication is shown in table 4.

Platform      8       16      32      64      128     256
Paragon       38      47.6    325.5   551.5   -       -
SP            166.3   167.3   210     -       -       -
SR2201        512.2   839.4   1353    -       -       -
T3E           565.4   874.8   1444    2571    4188    7400
SX-4 thread   9464    14535   13602   -       -       -

Table 4. Aggregated bandwidth (MB/s) for a broadcast on different hardware platforms for an increasing number of processors.

Obviously the SX-4 makes perfect use of the shared memory architecture. The distributed memory machines behave differently. The Paragon and the SP do not show acceptable bandwidth, while the T3E and the SR2201 seem to scale well. For those two machines further investigations are interesting. An evaluation of the model provided by Xu et al. [3] shows that it is not able to take into account sophisticated implementations of global operations as provided by the T3E and the SR2201.
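For reference, a minimal sketch of how such a broadcast measurement could look in the style of section 2 is given below. The definition of aggregated bandwidth as (n-1)·M/t (total data delivered to all receivers divided by the time) is an assumption made for this sketch; the paper does not state the exact formula, and the code is illustrative rather than the authors' benchmark.

C     Broadcast timing sketch: barrier before timing (section 2),
C     50 repetitions, aggregated bandwidth assumed to be (n-1)*M/t.
      PROGRAM BCASTBW
      INCLUDE 'mpif.h'
      INTEGER NREP, NBYTES
      PARAMETER (NREP = 50, NBYTES = 1048576)
      DOUBLE PRECISION BUF(NBYTES/8), T1, T2, BW
      INTEGER IERR, RANK, NPROCS, I

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)

      CALL MPI_BARRIER(MPI_COMM_WORLD, IERR)
      T1 = MPI_WTIME()
      DO 10 I = 1, NREP
         CALL MPI_BCAST(BUF, NBYTES/8, MPI_DOUBLE_PRECISION, 0,
     &                  MPI_COMM_WORLD, IERR)
   10 CONTINUE
      T2 = MPI_WTIME()

C     Aggregated bandwidth in MB/s (assumed definition, see text)
      BW = DBLE(NPROCS-1) * DBLE(NBYTES) * NREP
     &     / ((T2-T1) * 1.0D6)
      IF (RANK .EQ. 0) WRITE(*,*) 'aggregated MB/s: ', BW

      CALL MPI_FINALIZE(IERR)
      END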

Fig. 1. Broadcast versus the Xu model (left) and a tree algorithm (right) on the T3E.

One would expect that a broadcast is implemented using a tree-like algorithm. Consequently, we propose the following model for the broadcast:

    t = (t0 + M/r∞) · log2(n)                        (3)

Figures 1 and 2 indicate that the model of Xu does not take into consideration the optimized communication possible on modern network architectures and is thus far from correctly predicting the behaviour on the T3E and the SR2201. The modified model yields acceptable results.
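As a quick plausibility check, assuming that the large-message point-to-point parameters of the T3E from table 2 (t0 = 56 µs, r∞ = 308 MB/s) also apply to the broadcast, equation (3) predicts for a 1 MB broadcast on 256 processors

    t ≈ log2(256) · (56 µs + 3247 µs) = 8 · 3303 µs ≈ 26 ms

i.e. the broadcast time grows only logarithmically with the number of processors, which is broadly consistent with the scaling of the aggregated bandwidth observed for the T3E in table 4.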


Fig. 2. Broadcast versus the Xu model (left) and a tree algorithm (right) on the SR2201.

4.4 Gather/Scatter Operations

Standard gather and scatter operations are widely used in numerical codes. The most trivial implementation of such operations would be a simple loop of sends and receives. The results in table 5 indicate that gather and scatter are a problem on all machines, and with a growing number of processors the achievable bandwidth gets even worse. Obviously these operations have not been tuned by the MPI developers on any of the platforms. For an application it may well be an option to implement such operations on one's own, trying to take advantage of knowledge about the code, as sketched below. This may help to overlap communication and may be better than using MPI_Gather or MPI_Scatter.
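The following FORTRAN77 fragment is a minimal sketch of such a hand-rolled gather as a loop of sends and receives. The use of non-blocking receives on the root, the fixed limit of 256 processes and all names are illustrative assumptions; this is not the authors' implementation.

C     Hand-rolled gather: the root posts one receive per process and
C     could overlap other work before waiting; all other processes
C     simply send their contribution.  Assumes at most 256 processes.
      SUBROUTINE MYGATHER(SBUF, RBUF, N, ROOT, COMM)
      INCLUDE 'mpif.h'
      INTEGER N, ROOT, COMM
      DOUBLE PRECISION SBUF(N), RBUF(N,*)
      INTEGER RANK, NPROCS, IERR, I, J
      INTEGER REQS(256), STATS(MPI_STATUS_SIZE,256)

      CALL MPI_COMM_RANK(COMM, RANK, IERR)
      CALL MPI_COMM_SIZE(COMM, NPROCS, IERR)

      IF (RANK .EQ. ROOT) THEN
         DO 20 I = 0, NPROCS-1
            IF (I .NE. ROOT) THEN
               CALL MPI_IRECV(RBUF(1,I+1), N, MPI_DOUBLE_PRECISION,
     &                        I, 99, COMM, REQS(I+1), IERR)
            ELSE
C              Copy the root's own contribution directly
               DO 10 J = 1, N
                  RBUF(J,I+1) = SBUF(J)
   10          CONTINUE
               REQS(I+1) = MPI_REQUEST_NULL
            END IF
   20    CONTINUE
C        ... application work could be overlapped here ...
         CALL MPI_WAITALL(NPROCS, REQS, STATS, IERR)
      ELSE
         CALL MPI_SEND(SBUF, N, MPI_DOUBLE_PRECISION, ROOT, 99,
     &                 COMM, IERR)
      END IF
      END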

Gather
Platform      8      16     32     64     128    256
Paragon       15     16     16     15     -      -
SP            87     74     60     -      -      -
SR2201        147    138    113    -      -      -
T3E           318    332    187    199    241    244
SX-4 thread   2348   865    389    -      -      -

Scatter
Platform      8      16     32     64     128    256
Paragon       14     15     15     14     -      -
SP            87     35     40     -      -      -
SR2201        168    165    138    -      -      -
T3E           303    324    335    339    338    200
SX-4 thread   1027   479    203    -      -      -

Table 5. Aggregated bandwidth (MB/s) for gather and scatter on different hardware platforms for an increasing number of processors.

4.5 A CG Kernel

A typical operation in a CG kernel is the following:

    s = M r                                          (4)
    x = x + α u                                      (5)
    β = t · s                                        (6)

A matrix-vector multiply (4) is followed by a vector-update (5) that does not involve the vector calculated previously. In the next step a vector-product (6) has to be calculated from the initially calculated vector s and some other vector t. Assuming that t is stored completely on all nodes, one could calculate local parts of β and combine them by an MPI_Allreduce. This would cost one global synchronization across all nodes. Or one could switch the last two steps and simulate the MPI_Allreduce for β by a loop of MPI_Isends and MPI_Recvs, which could be overlapped with the vector update. First investigations on the T3E, the Paragon and the SR2201 show that overlapping communication with the vector-update does not lead to a significant reduction of computing time. Only for very small dimensions in the range of 100 or 250 is using MPI_Isend sometimes faster. However, typical dimensions for such a problem are in the range of 3000-5000. This behaviour may be due to the fact that the vector update can be done very fast, since it only needs a very small number of floating point operations and can be highly optimized by modern compilers.
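A minimal sketch of the overlapped variant described above is given below, assuming block-distributed vectors of local length NLOC and at most 256 processes; the data distribution, the variable names and the exact way the partial sums are exchanged are illustrative assumptions, not the authors' kernel.

C     Overlapped CG step: compute the local part of beta, start
C     non-blocking sends of the partial sum to all other processes,
C     perform the vector update x = x + alpha*u while the messages
C     are in transit, and finally receive and accumulate the remote
C     partial sums (a hand-made allreduce).
      SUBROUTINE CGSTEP(S, T, X, U, ALPHA, NLOC, BETA, COMM)
      INCLUDE 'mpif.h'
      INTEGER NLOC, COMM
      DOUBLE PRECISION S(NLOC), T(NLOC), X(NLOC), U(NLOC)
      DOUBLE PRECISION ALPHA, BETA, PART, REMOTE
      INTEGER RANK, NPROCS, IERR, I, P
      INTEGER REQS(256), STATS(MPI_STATUS_SIZE,256)
      INTEGER STATUS(MPI_STATUS_SIZE)

      CALL MPI_COMM_RANK(COMM, RANK, IERR)
      CALL MPI_COMM_SIZE(COMM, NPROCS, IERR)

C     Local contribution to beta = t . s
      PART = 0.0D0
      DO 10 I = 1, NLOC
         PART = PART + T(I) * S(I)
   10 CONTINUE
      BETA = PART

C     Start sending the partial sum to every other process
      DO 20 P = 0, NPROCS-1
         IF (P .NE. RANK) THEN
            CALL MPI_ISEND(PART, 1, MPI_DOUBLE_PRECISION, P, 77,
     &                     COMM, REQS(P+1), IERR)
         ELSE
            REQS(P+1) = MPI_REQUEST_NULL
         END IF
   20 CONTINUE

C     Overlap: vector update while the partial sums are in transit
      DO 30 I = 1, NLOC
         X(I) = X(I) + ALPHA * U(I)
   30 CONTINUE

C     Collect and accumulate the remote partial sums
      DO 40 P = 0, NPROCS-1
         IF (P .NE. RANK) THEN
            CALL MPI_RECV(REMOTE, 1, MPI_DOUBLE_PRECISION, P, 77,
     &                    COMM, STATUS, IERR)
            BETA = BETA + REMOTE
         END IF
   40 CONTINUE
      CALL MPI_WAITALL(NPROCS, REQS, STATS, IERR)
      END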

5 Summary

The results presented show that the reduction of latency continues. Latency-bound algorithms will profit from this development in the future. Advances in bandwidth are not of the same quality. Theoretical peak bandwidth has only grown by a factor of three, and sustained bandwidth can sometimes only keep pace if parameters are set appropriately. However, the appropriate parameter settings may vary from code to code, which makes it difficult to provide the user with a common rule.

Furthermore, the code may not be able to cope easily with the restrictions that buffer limits might impose. The user may have to find a balance between communication performance and the synchronization requirements caused by small or no buffering. For global communication the results are only partly good. Synchronization and broadcast operations are well implemented on modern hardware architectures, and users can benefit from sophisticated implementations. However, although MPI provides portability across different hardware architectures, users have to be aware that a synchronization on a PVP may be costly. For the gather and scatter operations the results are bad on all machines. There seems to be room for improvement on all platforms.

References

1. Message Passing Interface Forum: MPI: A Message Passing Interface Standard. University of Tennessee, Knoxville, USA, 1995.
2. William Gropp, Ewing Lusk, Nathan Doss, Anthony Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard", Parallel Computing 22 (1996), 789-828.
3. Zhiwei Xu, Kai Hwang, "Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2", IEEE Parallel & Distributed Technology, Spring 1996, 9-23.
4. Kai Hwang, Zhiwei Xu and Masahiro Arakawa, "Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing", IEEE Transactions on Parallel and Distributed Systems 7 (1996), 522-535.
5. Shahid H. Bokhari, "Multiphase Complete Exchange on Paragon, SP2, and CS-2", IEEE Parallel & Distributed Technology, Fall 1996, 45-59.
6. Zhiwei Xu, Kai Hwang, "Early prediction of MPP performance: The SP2, T3D, and Paragon experiences", Parallel Computing 22 (1996), 917-942.
7. Jose Miguel, Augustin Arruabarrena, Ramon Beivide and Jose Angel Gregorio, "Assessing the Performance of the New IBM SP2 Communication Subsystem", IEEE Parallel & Distributed Technology, Winter 1996, 12-22.
8. C. Calvin, L. Colombet, "Performance evaluation and modeling of collective communications on Cray T3D", Parallel Computing 22 (1996), 1413-1427.
9. R.W. Hockney, "The Communication Challenge for MPP: Intel Paragon and Meiko CS-2", Parallel Computing 20 (1994), 389-398.
10. Resch, M., Berger, H., Rabenseifner, R., Boenisch, T.: MPI Performance on the Cray T3E, BI, RUS, 1997.

