A Microbenchmark Suite for Mixed-mode OpenMP/MPI
Mark Bull, Jim Enright and Nadia Ameer
EPCC, University of Edinburgh
[email protected]
Overview
• Motivation
• Benchmark design
  – Point-to-point communication
  – Collective communication
• Hardware platforms
• Results
• Conclusions and future work
Motivation
• In the multicore era, almost every HPC system is a cluster of shared memory nodes.
• Mixed-mode OpenMP/MPI programming is seen as an important technique for achieving scalability on these systems
• Not many existing application benchmarks or microbenchmarks for this programming paradigm
  – NAS multizone
  – ASC Sequoia
  – LLNL Sphinx
• A microbenchmark suite is a useful tool to assess the quality of implementation of mixed-mode OpenMP/MPI
Benchmark design
Basic idea is to implement mixed-mode analogues of the typical operations found in MPI microbenchmarks, e.g. PingPong, HaloExchange, Barrier, Reduce, AlltoAll.
Two main considerations:
1. We wish to capture the effects of inter-thread communication and synchronisation. We therefore include the (possibly multi-threaded) reading and writing of buffers in the timing.
2. We wish to be able to compare the effects of changing the OpenMP thread to MPI process ratio while using the same total number of cores. This is done by appropriate choice of buffer sizes/message lengths (see the sketch below).
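To make the second consideration concrete, the hypothetical C sketch below (dataSize and the printed quantities are illustrative, not taken from the suite) shows how the per-process message length scales with the thread count, so that the total data volume depends only on threads × processes, i.e. on the total number of cores.

```c
/* Hypothetical illustration of consideration 2: each thread owns dataSize
 * elements, so a process's message length scales with its thread count and
 * the total data moved scales with threads x processes, which is constant
 * when a fixed number of cores is split differently between OpenMP threads
 * and MPI processes. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int dataSize = 1024;              /* elements per thread (example value) */
    int nthreads = omp_get_max_threads();   /* OpenMP threads per MPI process */

    if (rank == 0)
        printf("per-process message: %d elements; "
               "total data proportional to %d (threads x processes)\n",
               dataSize * nthreads, nthreads * nprocs);

    MPI_Finalize();
    return 0;
}
```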
Point-to-Point benchmarks
• We implement PingPong, PingPing and HaloExchange, each in three different ways:
1. Master-only: MPI communication takes place in the master thread, outside of parallel regions.
2. Funnelled: MPI communication takes place in the master thread, inside parallel regions.
3. Multiple: MPI communication takes place concurrently in all threads inside parallel regions, using tags to identify messages.
Master-only PingPong

Process 0:
  Begin loop over repeats
    Begin OMP Parallel region
      Each thread writes to its part of pingBuf
    End OMP Parallel region
    MPI_Send( pingBuf )        (message size: dataSize * numThreads)
    MPI_Recv( pongBuf )
    Begin OMP Parallel region
      Each thread reads its part of pongBuf
    End OMP Parallel region
  End loop over repeats

Process 1:
  Begin loop over repeats
    MPI_Recv( pingBuf )
    Begin OMP Parallel region
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
    End OMP Parallel region
    MPI_Send( pongBuf )        (message size: dataSize * numThreads)
  End loop over repeats
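The pattern above can be sketched in C roughly as follows (a minimal sketch, assuming the standard MPI and OpenMP APIs; DATA_SIZE, REPEATS and the timing/checksum details are illustrative choices, not the suite's actual code). All MPI calls sit outside parallel regions, so only MPI_THREAD_FUNNELED support is requested.

```c
/* Master-only PingPong sketch: run with exactly two MPI processes, e.g.
 *   mpicc -fopenmp masteronly.c && mpirun -n 2 ./a.out                    */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread (illustrative) */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI is called only between parallel regions, so FUNNELED is enough */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int len = DATA_SIZE * omp_get_max_threads();   /* one message carries all threads' data */
    double *pingBuf = malloc(len * sizeof(double));
    double *pongBuf = malloc(len * sizeof(double));
    double checkSum = 0.0;

    double t0 = MPI_Wtime();
    for (int rep = 0; rep < REPEATS; rep++) {
        if (rank == 0) {
            /* each thread writes its part of pingBuf */
            #pragma omp parallel for
            for (int i = 0; i < len; i++) pingBuf[i] = (double)(i + rep);

            MPI_Send(pingBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(pongBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

            /* each thread reads its part of pongBuf */
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < len; i++) sum += pongBuf[i];
            checkSum += sum;
        } else if (rank == 1) {
            MPI_Recv(pingBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* each thread reads its part of pingBuf and writes its part of pongBuf */
            #pragma omp parallel for
            for (int i = 0; i < len; i++) pongBuf[i] = pingBuf[i];
            MPI_Send(pongBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("master-only pingpong: %.3g us per iteration (checksum %g)\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS, checkSum);

    free(pingBuf);
    free(pongBuf);
    MPI_Finalize();
    return 0;
}
```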
Funnelled PingPong

Process 0:
  Begin OMP Parallel region
    Begin loop over repeats
      Each thread writes to its part of pingBuf
      OMP Barrier
      OMP Master
        MPI_Send( pingBuf )    (message size: dataSize * numThreads)
        MPI_Recv( pongBuf )
      OMP End Master
      OMP Barrier
      Each thread reads its part of pongBuf
    End loop over repeats
  End OMP Parallel region

Process 1:
  Begin OMP Parallel region
    Begin loop over repeats
      OMP Master
        MPI_Recv( pingBuf )
      OMP End Master
      OMP Barrier
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
      OMP Barrier
      OMP Master
        MPI_Send( pongBuf )    (message size: dataSize * numThreads)
      OMP End Master
    End loop over repeats
  End OMP Parallel region
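A minimal C sketch of the funnelled variant is given below (again with illustrative DATA_SIZE/REPEATS values and per-thread slices of shared buffers; the suite's own source may differ). The OMP barriers around the master blocks are essential, since the master construct has no implied barrier.

```c
/* Funnelled PingPong sketch: all MPI calls are made by the master thread
 * inside the parallel region, so MPI_THREAD_FUNNELED is requested.
 * Run with exactly two MPI processes.                                     */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread (illustrative) */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int len = DATA_SIZE * omp_get_max_threads();
    double *pingBuf = malloc(len * sizeof(double));
    double *pongBuf = malloc(len * sizeof(double));

    double t0 = MPI_Wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int lo = tid * DATA_SIZE, hi = lo + DATA_SIZE;   /* this thread's slice */

        for (int rep = 0; rep < REPEATS; rep++) {
            if (rank == 0) {
                for (int i = lo; i < hi; i++) pingBuf[i] = (double)(i + rep);
                #pragma omp barrier
                #pragma omp master
                {
                    MPI_Send(pingBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(pongBuf, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
                #pragma omp barrier          /* master has no implied barrier */
                double sum = 0.0;            /* each thread reads its slice   */
                for (int i = lo; i < hi; i++) sum += pongBuf[i];
                (void)sum;
            } else {
                #pragma omp master
                {
                    MPI_Recv(pingBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
                #pragma omp barrier
                for (int i = lo; i < hi; i++) pongBuf[i] = pingBuf[i];
                #pragma omp barrier
                #pragma omp master
                {
                    MPI_Send(pongBuf, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
                }
            }
        }
    }
    if (rank == 0)
        printf("funnelled pingpong: %.3g us per iteration\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS);

    free(pingBuf);
    free(pongBuf);
    MPI_Finalize();
    return 0;
}
```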
Multiple PingPong

Process 0:
  Begin OMP Parallel region
    Begin loop over repeats
      Each thread writes to its part of pingBuf
      MPI_Send( pingBuf )      (numThreads messages of size dataSize)
      MPI_Recv( pongBuf )
      Each thread reads its part of pongBuf
    End loop over repeats
  End OMP Parallel region

Process 1:
  Begin OMP Parallel region
    Begin loop over repeats
      MPI_Recv( pingBuf )
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
      MPI_Send( pongBuf )      (numThreads messages of size dataSize)
    End loop over repeats
  End OMP Parallel region
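A minimal C sketch of the multiple variant follows (illustrative constants; here each thread's "part" of the buffers is held in thread-private arrays, which is one possible realisation). Every thread calls MPI inside the parallel region, so MPI_THREAD_MULTIPLE is required, and the thread number is used as the message tag.

```c
/* Multiple PingPong sketch: every thread sends/receives its own message,
 * matched by tag.  Run with exactly two MPI processes.                    */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread, i.e. per message */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();      /* identifies this thread's messages */
        int other = 1 - rank;
        /* each thread owns its part of the ping/pong buffers */
        double *ping = malloc(DATA_SIZE * sizeof(double));
        double *pong = malloc(DATA_SIZE * sizeof(double));

        for (int rep = 0; rep < REPEATS; rep++) {
            if (rank == 0) {
                for (int i = 0; i < DATA_SIZE; i++) ping[i] = (double)(i + rep);
                MPI_Send(ping, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
                MPI_Recv(pong, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                double sum = 0.0;
                for (int i = 0; i < DATA_SIZE; i++) sum += pong[i];
                (void)sum;
            } else {
                MPI_Recv(ping, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                for (int i = 0; i < DATA_SIZE; i++) pong[i] = ping[i];
                MPI_Send(pong, DATA_SIZE, MPI_DOUBLE, other, tag, MPI_COMM_WORLD);
            }
        }
        free(ping);
        free(pong);
    }
    if (rank == 0)
        printf("multiple pingpong: %.3g us per iteration\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS);

    MPI_Finalize();
    return 0;
}
```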
Collective communication
• We implement Barrier, Reduce, AllReduce, Gather, Scatter and AlltoAll
• The total amount of data transferred is proportional to no. of threads × no. of processes.
Reduce

All processes (0, 1, …, p):
  Begin loop over repeats
    Begin OMP Parallel region with REDUCTION clause
      Each thread updates localReduceBuf
    End OMP Parallel region
    MPI_Reduce( sbuf = localReduceBuf, rbuf = globalReduceBuf )    (message size: dataSize)
    Root process: read globalReduceBuf
  End loop over repeats
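A minimal C sketch of this collective pattern is shown below (illustrative constants; the OpenMP array-section reduction used here needs OpenMP 4.5 or later, so the original suite, which predates that, may well implement the threaded reduction differently).

```c
/* Reduce sketch: threads combine contributions with an OpenMP reduction,
 * then MPI_Reduce combines the per-process results at the root.           */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements in the reduction buffer */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *localReduceBuf  = calloc(DATA_SIZE, sizeof(double));
    double *globalReduceBuf = calloc(DATA_SIZE, sizeof(double));
    double checkSum = 0.0;

    double t0 = MPI_Wtime();
    for (int rep = 0; rep < REPEATS; rep++) {
        /* each thread updates localReduceBuf; the REDUCTION clause sums the
           per-thread contributions element by element (OpenMP 4.5 syntax) */
        #pragma omp parallel reduction(+: localReduceBuf[0:DATA_SIZE])
        {
            int tid = omp_get_thread_num();
            for (int i = 0; i < DATA_SIZE; i++)
                localReduceBuf[i] += (double)(tid + i + rep);
        }

        /* message size: dataSize elements per process */
        MPI_Reduce(localReduceBuf, globalReduceBuf, DATA_SIZE, MPI_DOUBLE,
                   MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)                        /* root reads globalReduceBuf */
            for (int i = 0; i < DATA_SIZE; i++) checkSum += globalReduceBuf[i];
    }
    if (rank == 0)
        printf("reduce: %.3g us per iteration (checksum %g)\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS, checkSum);

    free(localReduceBuf);
    free(globalReduceBuf);
    MPI_Finalize();
    return 0;
}
```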
Scatter

All processes (root = process 0):
  Begin loop over repeats
    Process 0: write to scatterSend
    MPI_Scatter( sbuf = scatterSend, rbuf = scatterRecv, root = 0 )    (dataSize * numThreads per process)
    Begin OMP Parallel region
      Each thread reads its part of scatterRecv
    End OMP Parallel region
  End loop over repeats
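A minimal C sketch of the scatter pattern follows (illustrative constants and buffer names as in the diagram; not the suite's actual source). The root writes the full send buffer, every process receives dataSize * numThreads elements, and all threads then read their part of the receive buffer.

```c
/* Scatter sketch: root writes scatterSend, MPI_Scatter delivers one chunk
 * per process, and the receiving threads each read their part.            */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define DATA_SIZE 1024      /* elements per thread */
#define REPEATS   1000

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int recvCount = DATA_SIZE * omp_get_max_threads();   /* chunk per process */
    double *scatterRecv = malloc(recvCount * sizeof(double));
    double *scatterSend = NULL;
    if (rank == 0)
        scatterSend = malloc((size_t)recvCount * nprocs * sizeof(double));

    double t0 = MPI_Wtime();
    for (int rep = 0; rep < REPEATS; rep++) {
        if (rank == 0)                                    /* root writes scatterSend */
            for (int i = 0; i < recvCount * nprocs; i++)
                scatterSend[i] = (double)(i + rep);

        MPI_Scatter(scatterSend, recvCount, MPI_DOUBLE,
                    scatterRecv, recvCount, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* each thread reads its part of scatterRecv */
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < recvCount; i++) sum += scatterRecv[i];
        (void)sum;
    }
    if (rank == 0)
        printf("scatter: %.3g us per iteration\n",
               1e6 * (MPI_Wtime() - t0) / REPEATS);

    free(scatterRecv);
    free(scatterSend);
    MPI_Finalize();
    return 0;
}
```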
Hardware platforms
• IBM Power 5: 16 cores/node, 4 nodes
  – IBM xl compiler for AIX, IBM MPI
• IBM Power 6: 32 cores/node, 4 nodes
  – IBM xl compiler for Linux, IBM MPI
• IBM BlueGene/P: 4 cores/node, 16 nodes
  – IBM xl compiler, IBM BlueGene MPI
• Cray XT5: 8 cores/node, 8 nodes
  – PGI compiler, Cray MPI
• Bull Novascale: 8 cores/node, 8 nodes
  – dual-socket, quad-core Xeon cluster, Infiniband interconnect
  – Intel compiler
  – two MPI libraries: Bull and Intel
IBM Power5 - MasterOnly PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 16, 8, 4 and 2 Threads with 2 Processes]
Novascale/Bull - MasterOnly PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 8, 4 and 2 Threads with 2 Processes]
Novascale/Intel – MasterOnly PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 8, 4 and 2 Threads with 2 Processes]
IBM Power 5 – Multiple PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 16, 8, 4, 2 and 1 Threads with 2 Processes]
IBM BlueGene/P – Multiple PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 4, 2 and 1 Threads with 2 Processes]
Cray XT5 - Multiple PingPong
[Figure: Ratio vs. Message Size (bytes); curves for 8, 4, 2 and 1 Threads with 2 Processes]
IBM Power 5 - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 16 Threads/4 Processes, 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
IBM Power 6 - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 32 Threads/4 Processes, 16 Threads/8 Processes, 8 Threads/16 Processes, 4 Threads/32 Processes and 2 Threads/64 Processes]
IBM BlueGene/P - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 4 Threads/16 Processes and 2 Threads/32 Processes]
Cray XT5 - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
Novascale/Bull - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
Novascale/Intel - Reduce
[Figure: Ratio vs. Message Size (bytes); curves for 8 Threads/8 Processes, 4 Threads/16 Processes and 2 Threads/32 Processes]
Conclusions & Future Work
• Benchmark suite shows interesting results
  – highlights problems in some implementations
  – may help application writers decide whether a mixed-mode approach is useful on a given platform
• Add other benchmarks
  – MultiPingPong: to compare, e.g. on 4-core nodes, 4 pairs of MPI processes with 4 pairs of threads, etc.
• Run on more platforms
  – NEC SX-9, Cray X2 (when supported), other clusters, …
• Release public version