A Microbenchmark Suite for Mixed-mode OpenMP/MPI
Mark Bull, Jim Enright and Nadia Ameer
EPCC, University of Edinburgh
[email protected]
Overview
• Motivation
• Benchmark design
  – Point-to-point communication
  – Collective communication
• Hardware platforms
• Results
• Conclusions and future work
Motivation
• In the multicore era, almost every HPC system is a cluster of shared memory nodes.
• Mixed-mode OpenMP/MPI programming is seen as an important technique for achieving scalability on these systems
• Not many existing application benchmarks or microbenchmarks for this programming paradigm
  – NAS multizone
  – ASC Sequoia
  – LLNL Sphinx
• A microbenchmark suite is a useful tool to assess the quality of implementation of mixed-mode OpenMP/MPI
Benchmark design
Basic idea is to implement mixed-mode analogues of the typical operations found in MPI microbenchmarks, e.g. PingPong, HaloExchange, Barrier, Reduce, AlltoAll.
Two main considerations:
1. We wish to capture the effects of inter-thread communication and synchronisation. We therefore include the (possibly multi-threaded) reading and writing of buffers in the timing.
2. We wish to be able to compare the effects of changing the OpenMP thread to MPI process ratio while using the same total number of cores. This is done by appropriate choice of buffer sizes/message lengths (see the sketch below).
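To make the second consideration concrete, the hypothetical C sketch below (dataSize and the printed quantities are illustrative, not taken from the suite) shows how the per-process message length scales with the thread count, so that the total data volume depends only on threads × processes, i.e. on the total number of cores.

```c
/* Hypothetical illustration of consideration 2: each thread owns dataSize
 * elements, so a process's message length scales with its thread count and
 * the total data moved scales with threads x processes, which is constant
 * when a fixed number of cores is split differently between OpenMP threads
 * and MPI processes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int dataSize = 1024;              /* elements per thread (example value) */
    int nthreads = omp_get_max_threads();   /* OpenMP threads per MPI process */

    if (rank == 0)
        printf("per-process message: %d elements; "
               "total data proportional to %d (threads x processes)\n",
               dataSize * nthreads, nthreads * nprocs);

    MPI_Finalize();
    return 0;
}
```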
Point-to-Point benchmarks
• We implement PingPong, PingPing and HaloExchange, each in three different ways:
1. Master-only: MPI communication takes place in the master thread, outside of parallel regions.
2. Funnelled: MPI communication takes place in the master thread, inside parallel regions.
3. Multiple: MPI communication takes place concurrently in all threads inside parallel regions, using tags to identify messages.
Master-only PingPong

Process 0:
  Begin loop over repeats
    Begin OMP Parallel region
      Each thread writes to its part of pingBuf
    End OMP Parallel region
    MPI_Send( pingBuf )        (message size: dataSize * numThreads)
    MPI_Recv( pongBuf )
    Begin OMP Parallel region
      Each thread reads its part of pongBuf
    End OMP Parallel region
  End loop over repeats

Process 1:
  Begin loop over repeats
    MPI_Recv( pingBuf )
    Begin OMP Parallel region
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
    End OMP Parallel region
    MPI_Send( pongBuf )        (message size: dataSize * numThreads)
  End loop over repeats
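The pattern above can be sketched in C roughly as follows (a minimal sketch, assuming the standard MPI and OpenMP APIs; DATA_SIZE, REPEATS and the timing/checksum details are illustrative choices, not the suite's actual code). All MPI calls sit outside parallel regions, so only MPI_THREAD_FUNNELED support is requested.

```c
/* Master-only PingPong sketch: run with exactly two MPI processes, e.g.
 *   mpicc -fopenmp masteronly.c && mpirun -n 2 ./a.out                    */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread (illustrative) */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI is called only between parallel regions, so FUNNELED is enough */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int len = DATA_SIZE * omp_get_max_threads();   /* one message carries all threads' data */
    double *pingBuf = malloc(len * sizeof(double));
    double *pongBuf = malloc(len * sizeof(double));
    double checkSum = 0.0;

    double t0 = MPI_Wtime();
    for (int rep = 0; rep < REPEATS; rep++) {
        if (rank == 0) {
            /* each thread writes its part of pingBuf */
            #pragma omp parallel for
            for (int i = 0; i < len; i++) pingBuf[i] = (double)(i + rep);

            MPI_Send(pingBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(pongBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

            /* each thread reads its part of pongBuf */
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < len; i++) sum += pongBuf[i];
            checkSum += sum;
        } else if (rank == 1) {
            MPI_Recv(pingBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* each thread reads its part of pingBuf and writes its part of pongBuf */
            #pragma omp parallel for
            for (int i = 0; i < len; i++) pongBuf[i] = pingBuf[i];
            MPI_Send(pongBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("master-only pingpong: %.3g us per iteration (checksum %g)\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS, checkSum);

    free(pingBuf);
    free(pongBuf);
    MPI_Finalize();
    return 0;
}
```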
Funnelled PingPong

Process 0:
  Begin OMP Parallel region
    Begin loop over repeats
      Each thread writes to its part of pingBuf
      OMP Barrier
      OMP Master
        MPI_Send( pingBuf )    (message size: dataSize * numThreads)
        MPI_Recv( pongBuf )
      OMP End Master
      OMP Barrier
      Each thread reads its part of pongBuf
    End loop over repeats
  End OMP Parallel region

Process 1:
  Begin OMP Parallel region
    Begin loop over repeats
      OMP Master
        MPI_Recv( pingBuf )
      OMP End Master
      OMP Barrier
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
      OMP Barrier
      OMP Master
        MPI_Send( pongBuf )    (message size: dataSize * numThreads)
      OMP End Master
    End loop over repeats
  End OMP Parallel region
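A minimal C sketch of the funnelled variant is given below (again with illustrative DATA_SIZE/REPEATS values and per-thread slices of shared buffers; the suite's own source may differ). The OMP barriers around the master blocks are essential, since the master construct has no implied barrier.

```c
/* Funnelled PingPong sketch: all MPI calls are made by the master thread
 * inside the parallel region, so MPI_THREAD_FUNNELED is requested.
 * Run with exactly two MPI processes.                                     */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread (illustrative) */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int len = DATA_SIZE * omp_get_max_threads();
    double *pingBuf = malloc(len * sizeof(double));
    double *pongBuf = malloc(len * sizeof(double));

    double t0 = MPI_Wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int lo = tid * DATA_SIZE, hi = lo + DATA_SIZE;   /* this thread's slice */

        for (int rep = 0; rep < REPEATS; rep++) {
            if (rank == 0) {
                for (int i = lo; i < hi; i++) pingBuf[i] = (double)(i + rep);
                #pragma omp barrier
                #pragma omp master
                {
                    MPI_Send(pingBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(pongBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
                #pragma omp barrier          /* master has no implied barrier */
                double sum = 0.0;            /* each thread reads its slice   */
                for (int i = lo; i < hi; i++) sum += pongBuf[i];
                (void)sum;
            } else {
                #pragma omp master
                {
                    MPI_Recv(pingBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
                #pragma omp barrier
                for (int i = lo; i < hi; i++) pongBuf[i] = pingBuf[i];
                #pragma omp barrier
                #pragma omp master
                {
                    MPI_Send(pongBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
                }
            }
        }
    }
    if (rank == 0)
        printf("funnelled pingpong: %.3g us per iteration\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS);

    free(pingBuf);
    free(pongBuf);
    MPI_Finalize();
    return 0;
}
```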
Multiple PingPong

Process 0:
  Begin OMP Parallel region
    Begin loop over repeats
      Each thread writes to its part of pingBuf
      MPI_Send( pingBuf )      (numThreads messages of size dataSize)
      MPI_Recv( pongBuf )
      Each thread reads its part of pongBuf
    End loop over repeats
  End OMP Parallel region

Process 1:
  Begin OMP Parallel region
    Begin loop over repeats
      MPI_Recv( pingBuf )
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
      MPI_Send( pongBuf )      (numThreads messages of size dataSize)
    End loop over repeats
  End OMP Parallel region
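A minimal C sketch of the multiple variant follows (illustrative constants; here each thread's "part" of the buffers is held in thread-private arrays, which is one possible realisation). Every thread calls MPI inside the parallel region, so MPI_THREAD_MULTIPLE is required, and the thread number is used as the message tag.

```c
/* Multiple PingPong sketch: every thread sends/receives its own message,
 * matched by tag.  Run with exactly two MPI processes.                    */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread, i.e. per message */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();      /* identifies this thread's messages */
        int other = 1 - rank;
        /* each thread owns its part of the ping/pong buffers */
        double *ping = malloc(DATA_SIZE * sizeof(double));
        double *pong = malloc(DATA_SIZE * sizeof(double));

        for (int rep = 0; rep < REPEATS; rep++) {
            if (rank == 0) {
                for (int i = 0; i < DATA_SIZE; i++) ping[i] = (double)(i + rep);
                MPI_Send(ping, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
                MPI_Recv(pong, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                double sum = 0.0;
                for (int i = 0; i < DATA_SIZE; i++) sum += pong[i];
                (void)sum;
            } else {
                MPI_Recv(ping, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                for (int i = 0; i < DATA_SIZE; i++) pong[i] = ping[i];
                MPI_Send(pong, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
            }
        }
        free(ping);
        free(pong);
    }
    if (rank == 0)
        printf("multiple pingpong: %.3g us per iteration\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS);

    MPI_Finalize();
    return 0;
}
```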
Collective communication
• We implement Barrier, Reduce, AllReduce, Gather, Scatter and AlltoAll
• The total amount of data transferred is proportional to no. of threads × no. of processes.
Reduce

All processes (0, 1, …, p):
  Begin loop over repeats
    Begin OMP Parallel region with REDUCTION clause
      Each thread updates localReduceBuf
    End OMP Parallel region
    MPI_Reduce( sbuf = localReduceBuf, rbuf = globalReduceBuf )    (message size: dataSize)
    Root process: read globalReduceBuf
  End loop over repeats
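A minimal C sketch of this collective pattern is shown below (illustrative constants; the OpenMP array-section reduction used here needs OpenMP 4.5 or later, so the original suite, which predates that, may well implement the threaded reduction differently).

```c
/* Reduce sketch: threads combine contributions with an OpenMP reduction,
 * then MPI_Reduce combines the per-process results at the root.           */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements in the reduction buffer */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *localReduceBuf  = calloc(DATA_SIZE, sizeof(double));
    double *globalReduceBuf = calloc(DATA_SIZE, sizeof(double));
    double checkSum = 0.0;

    double t0 = MPI_Wtime();
    for (int rep = 0; rep < REPEATS; rep++) {
        /* each thread updates localReduceBuf; the REDUCTION clause sums the
           per-thread contributions element by element (OpenMP 4.5 syntax) */
        #pragma omp parallel reduction(+: localReduceBuf[0:DATA_SIZE])
        {
            int tid = omp_get_thread_num();
            for (int i = 0; i < DATA_SIZE; i++)
                localReduceBuf[i] += (double)(tid + i + rep);
        }

        /* message size: dataSize elements per process */
        MPI_Reduce(localReduceBuf, globalReduceBuf, DATA_SIZE, MPI_DOUBLE,
                   MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)                        /* root reads globalReduceBuf */
            for (int i = 0; i < DATA_SIZE; i++) checkSum += globalReduceBuf[i];
    }
    if (rank == 0)
        printf("reduce: %.3g us per iteration (checksum %g)\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS, checkSum);

    free(localReduceBuf);
    free(globalReduceBuf);
    MPI_Finalize();
    return 0;
}
```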
Scatter

All processes (root = process 0):
  Begin loop over repeats
    Process 0: write to scatterSend
    MPI_Scatter( sbuf = scatterSend, rbuf = scatterRecv, root = 0 )    (dataSize * numThreads per process)
    Begin OMP Parallel region
      Each thread reads its part of scatterRecv
    End OMP Parallel region
  End loop over repeats
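A minimal C sketch of the scatter pattern follows (illustrative constants and buffer names as in the diagram; not the suite's actual source). The root writes the full send buffer, every process receives dataSize * numThreads elements, and all threads then read their part of the receive buffer.

```c
/* Scatter sketch: root writes scatterSend, MPI_Scatter delivers one chunk
 * per process, and the receiving threads each read their part.            */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int recvCount = DATA_SIZE * omp_get_max_threads();   /* chunk per process */
    double *scatterRecv = malloc(recvCount * sizeof(double));
    double *scatterSend = NULL;
    if (rank == 0)
        scatterSend = malloc((size_t)recvCount * nprocs * sizeof(double));

    double t0 = MPI_Wtime();
    for (int rep = 0; rep < REPEATS; rep++) {
        if (rank == 0)                                    /* root writes scatterSend */
            for (int i = 0; i < recvCount * nprocs; i++)
                scatterSend[i] = (double)(i + rep);

        MPI_Scatter(scatterSend, recvCount, MPI_DOUBLE,
                    scatterRecv, recvCount, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* each thread reads its part of scatterRecv */
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < recvCount; i++) sum += scatterRecv[i];
        (void)sum;
    }
    if (rank == 0)
        printf("scatter: %.3g us per iteration\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS);

    free(scatterRecv);
    free(scatterSend);
    MPI_Finalize();
    return 0;
}
```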
Hardware platforms
• IBM Power 5: 16 cores/node, 4 nodes
  – IBM xl compiler for AIX, IBM MPI
• IBM Power 6: 32 cores/node, 4 nodes
  – IBM xl compiler for Linux, IBM MPI
• IBM BlueGene/P: 4 cores/node, 16 nodes
  – IBM xl compiler, IBM BlueGene MPI
• Cray XT5: 8 cores/node, 8 nodes
  – PGI compiler, Cray MPI
• Bull Novascale: 8 cores/node, 8 nodes
  – dual-socket, quad-core Xeon cluster, Infiniband interconnect
  – Intel compiler
  – two MPI libraries: Bull and Intel
IBM Power5 - MasterOnly PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 16, 8, 4 and 2 Threads with 2 Processes]
Novascale/Bull - MasterOnly PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 8, 4 and 2 Threads with 2 Processes]
Novascale/Intel – MasterOnly PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 8, 4 and 2 Threads with 2 Processes]
IBM Power 5 – Multiple PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 16, 8, 4, 2 and 1 Threads with 2 Processes]
IBM BlueGene/P – Multiple PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 4, 2 and 1 Threads with 2 Processes]
Cray XT5 - Multiple PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 8, 4, 2 and 1 Threads with 2 Processes]
IBM Power 5 - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 16 Threads/4 Processes, 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
IBM Power 6 - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 32 Threads/4 Processes, 16 Threads/8 Processes, 8 Threads/16 Processes, 4 Threads/32 Processes and 2 Threads/64 Processes]
IBM BlueGene/P - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 4 Threads/16 Processes and 2 Threads/32 Processes]
Cray XT5 - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
Novascale/Bull - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
Novascale/Intel - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
Conclusions & Future Work
• Benchmark suite shows interesting results
  – highlights problems in some implementations
  – may help application writers decide whether a mixed-mode approach is useful on a given platform
• Add other benchmarks
  – MultiPingPong: to compare, e.g. on 4-core nodes, 4 pairs of MPI processes with 4 pairs of threads, etc.
• Run on more platforms
  – NEC SX-9, Cray X2 (when supported), other clusters, …
• Release public version