A Microbenchmark Suite for Mixed-mode OpenMP/MPI
Mark Bull, Jim Enright and Nadia Ameer
EPCC, University of Edinburgh

[email protected]

Overview

• Motivation
• Benchmark design
  – Point-to-point communication
  – Collective communication
• Hardware platforms
• Results
• Conclusions and future work


Motivation

• In the multicore era, almost every HPC system is a cluster of shared-memory nodes.

• Mixed-mode OpenMP/MPI programming is seen as an important technique for achieving scalability on these systems.

• There are few existing application benchmarks or microbenchmarks for this programming paradigm:
  – NAS multizone
  – ASC Sequoia
  – LLNL Sphinx

• A microbenchmark suite is a useful tool to assess the quality of implementation of mixed-mode OpenMP/MPI.

Benchmark design

The basic idea is to implement mixed-mode analogues of the typical operations found in MPI microbenchmark suites, e.g. PingPong, HaloExchange, Barrier, Reduce, AlltoAll.

Two main considerations:

1. We wish to capture the effects of inter-thread communication and synchronisation. We therefore include the (possibly multi-threaded) reading and writing of buffers in the timing (see the timing sketch after this list).

2. We wish to be able to compare the effects of changing the OpenMP thread to MPI process ratio while using the same total number of cores. This is done by appropriate choice of buffer sizes/message lengths.
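To make the first consideration concrete, the sketch below shows a timed loop in which the (possibly multi-threaded) buffer reads and writes sit inside the timed region together with the MPI calls. This is an illustrative sketch only, not code from the suite; the kernel pointer, repeat count and function name are assumptions.

#include <mpi.h>

/* Illustrative timing harness: the per-repeat kernel (which contains
 * the OpenMP buffer reads/writes as well as the MPI communication)
 * is timed as a whole, so inter-thread effects are captured too.    */
double time_kernel(int nreps, void (*kernel)(void))
{
    MPI_Barrier(MPI_COMM_WORLD);         /* synchronise before timing */
    double t0 = MPI_Wtime();
    for (int r = 0; r < nreps; r++)
        kernel();
    double t1 = MPI_Wtime();
    return (t1 - t0) / nreps;            /* mean time per repeat */
}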

Point-to-Point benchmarks

• We implement PingPong, PingPing and HaloExchange, each in three different ways (an initialisation sketch follows the list):

  1. Master-only: MPI communication takes place in the master thread, outside of parallel regions.
  2. Funnelled: MPI communication takes place in the master thread, inside parallel regions.
  3. Multiple: MPI communication takes place concurrently in all threads inside parallel regions, using tags to identify messages.
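The three variants place different demands on the MPI library (our observation, not stated on the slide): Master-only and Funnelled need at least MPI_THREAD_FUNNELED support, while Multiple needs MPI_THREAD_MULTIPLE. A minimal initialisation sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request full thread support; Master-only and Funnelled would
     * only need MPI_THREAD_FUNNELED.                                */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "insufficient MPI thread support: %d\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... run the benchmarks ... */
    MPI_Finalize();
    return 0;
}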

Master-only PingPong

Process 0:
  Begin loop over repeats
    Begin OMP parallel region
      Each thread writes its part of pingBuf
    End OMP parallel region
    MPI_Send( pingBuf )      (message of size dataSize * numThreads)
    MPI_Recv( pongBuf )
    Begin OMP parallel region
      Each thread reads its part of pongBuf
    End OMP parallel region
  End loop over repeats

Process 1:
  Begin loop over repeats
    MPI_Recv( pingBuf )
    Begin OMP parallel region
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
    End OMP parallel region
    MPI_Send( pongBuf )      (message of size dataSize * numThreads)
  End loop over repeats
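A minimal C sketch of the process 0 side of this pattern; buffer layout, datatype and function name are illustrative, not taken from the suite. Process 1 mirrors it, receiving pingBuf first and sending pongBuf back.

#include <mpi.h>
#include <omp.h>

/* Master-only PingPong, process 0 side (sketch).  All MPI calls are
 * made outside parallel regions; each thread touches only its own
 * dataSize-element slice of the buffers.                            */
void masteronly_pingpong_rank0(double *pingBuf, double *pongBuf,
                               int dataSize, int numThreads, int nreps)
{
    int total = dataSize * numThreads;
    for (int r = 0; r < nreps; r++) {
        #pragma omp parallel
        {
            int me = omp_get_thread_num();
            for (int i = 0; i < dataSize; i++)        /* write own slice */
                pingBuf[me * dataSize + i] = (double)r;
        }
        MPI_Send(pingBuf, total, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(pongBuf, total, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        #pragma omp parallel
        {
            int me = omp_get_thread_num();
            volatile double sum = 0.0;
            for (int i = 0; i < dataSize; i++)        /* read own slice */
                sum += pongBuf[me * dataSize + i];
        }
    }
}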

Funnelled PingPong

Process 0:
  Begin OMP parallel region
    Begin loop over repeats
      Each thread writes its part of pingBuf
      OMP barrier
      OMP master:
        MPI_Send( pingBuf )      (message of size dataSize * numThreads)
        MPI_Recv( pongBuf )
      End OMP master
      OMP barrier
      Each thread reads its part of pongBuf
    End loop over repeats
  End OMP parallel region

Process 1:
  Begin OMP parallel region
    Begin loop over repeats
      OMP master:
        MPI_Recv( pingBuf )
      End OMP master
      OMP barrier
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
      OMP barrier
      OMP master:
        MPI_Send( pongBuf )      (message of size dataSize * numThreads)
      End OMP master
    End loop over repeats
  End OMP parallel region
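A sketch of the process 0 side of the funnelled variant (names illustrative, not from the suite): the MPI calls are made only by the master thread, but inside the parallel region, with OpenMP barriers ensuring the buffers are complete before the send and valid after the receive.

#include <mpi.h>
#include <omp.h>

/* Funnelled PingPong, process 0 side (sketch). */
void funnelled_pingpong_rank0(double *pingBuf, double *pongBuf,
                              int dataSize, int numThreads, int nreps)
{
    int total = dataSize * numThreads;
    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        for (int r = 0; r < nreps; r++) {
            for (int i = 0; i < dataSize; i++)        /* write own slice */
                pingBuf[me * dataSize + i] = (double)r;
            #pragma omp barrier          /* pingBuf complete on all threads */
            #pragma omp master
            {
                MPI_Send(pingBuf, total, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(pongBuf, total, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
            #pragma omp barrier          /* wait for master's MPI calls */
            volatile double sum = 0.0;
            for (int i = 0; i < dataSize; i++)        /* read own slice */
                sum += pongBuf[me * dataSize + i];
        }
    }
}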

Multiple PingPong

Process 0:
  Begin OMP parallel region
    Begin loop over repeats
      Each thread writes its part of pingBuf
      MPI_Send( pingBuf )      (numThreads messages of size dataSize)
      MPI_Recv( pongBuf )
      Each thread reads its part of pongBuf
    End loop over repeats
  End OMP parallel region

Process 1:
  Begin OMP parallel region
    Begin loop over repeats
      MPI_Recv( pingBuf )      (numThreads messages of size dataSize)
      Each thread reads its part of pingBuf
      Each thread writes its part of pongBuf
      MPI_Send( pongBuf )      (numThreads messages of size dataSize)
    End loop over repeats
  End OMP parallel region
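A sketch of the process 0 side of the multiple variant (names illustrative): every thread sends and receives its own dataSize-element slice, with the thread number used as the message tag, so the MPI library must have been initialised with MPI_THREAD_MULTIPLE.

#include <mpi.h>
#include <omp.h>

/* Multiple PingPong, process 0 side (sketch): numThreads concurrent
 * messages of dataSize elements, matched by thread-number tags.     */
void multiple_pingpong_rank0(double *pingBuf, double *pongBuf,
                             int dataSize, int nreps)
{
    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        double *myPing = pingBuf + me * dataSize;
        double *myPong = pongBuf + me * dataSize;
        for (int r = 0; r < nreps; r++) {
            for (int i = 0; i < dataSize; i++)        /* write own slice */
                myPing[i] = (double)r;
            MPI_Send(myPing, dataSize, MPI_DOUBLE, 1, me, MPI_COMM_WORLD);
            MPI_Recv(myPong, dataSize, MPI_DOUBLE, 1, me, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            volatile double sum = 0.0;
            for (int i = 0; i < dataSize; i++)        /* read own slice */
                sum += myPong[i];
        }
    }
}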

Collective communication

• We implement Barrier, Reduce, AllReduce, Gather, Scatter and AlltoAll.

• The total amount of data transferred is proportional to no. of threads × no. of processes.


Reduce

All processes (0 … p):
  Begin loop over repeats
    Begin OMP parallel region with REDUCTION clause
      Each thread updates localReduceBuf
    End OMP parallel region
    MPI_Reduce( sbuf = localReduceBuf, rbuf = globalReduceBuf )      (message of size dataSize)
    On the root process: read globalReduceBuf
  End loop over repeats
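A sketch of one process's part of the mixed-mode Reduce (names illustrative, not from the suite): the threads combine their contributions with an OpenMP reduction, then a single MPI_Reduce per process combines across processes and the root reads the result. The array-section reduction syntax shown needs OpenMP 4.5 or later; the original suite presumably expressed the per-element reduction differently.

#include <mpi.h>
#include <omp.h>

/* Mixed-mode Reduce (sketch): OpenMP reduction inside the process,
 * then MPI_Reduce of dataSize elements across processes.            */
void mixed_reduce(double *localReduceBuf, double *globalReduceBuf,
                  int dataSize, int nreps)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < dataSize; i++)
            localReduceBuf[i] = 0.0;
        #pragma omp parallel reduction(+: localReduceBuf[0:dataSize])
        {
            int me = omp_get_thread_num();
            for (int i = 0; i < dataSize; i++)     /* each thread's contribution */
                localReduceBuf[i] += (double)me;
        }
        MPI_Reduce(localReduceBuf, globalReduceBuf, dataSize, MPI_DOUBLE,
                   MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {                           /* root reads the result */
            volatile double check = 0.0;
            for (int i = 0; i < dataSize; i++)
                check += globalReduceBuf[i];
        }
    }
}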

Scatter

All processes (0 … p):
  Begin loop over repeats
    On process 0 (root): write to scatterSend
    MPI_Scatter( sbuf = scatterSend, rbuf = scatterRecv, root = 0 )      (dataSize * numThreads per process)
    Begin OMP parallel region
      Each thread reads its part of scatterRecv
    End OMP parallel region
  End loop over repeats
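A sketch of the Scatter benchmark (names and datatype illustrative): the root fills a send buffer holding dataSize * numThreads elements per process, MPI_Scatter delivers one chunk to each process, and the threads then read their own slices of the received chunk in parallel.

#include <mpi.h>
#include <omp.h>

/* Mixed-mode Scatter (sketch): dataSize * numThreads elements are
 * scattered to every process; each thread reads its own slice.      */
void mixed_scatter(double *scatterSend, double *scatterRecv,
                   int dataSize, int numThreads, int nreps)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int chunk = dataSize * numThreads;             /* elements per process */

    for (int r = 0; r < nreps; r++) {
        if (rank == 0)                             /* root fills the send buffer */
            for (int i = 0; i < chunk * nprocs; i++)
                scatterSend[i] = (double)i;
        MPI_Scatter(scatterSend, chunk, MPI_DOUBLE,
                    scatterRecv, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        #pragma omp parallel
        {
            int me = omp_get_thread_num();
            volatile double sum = 0.0;
            for (int i = 0; i < dataSize; i++)     /* read own slice */
                sum += scatterRecv[me * dataSize + i];
        }
    }
}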

Hardware platforms

• IBM Power5: 16 cores/node, 4 nodes
  – IBM XL compiler for AIX, IBM MPI
• IBM Power6: 32 cores/node, 4 nodes
  – IBM XL compiler for Linux, IBM MPI
• IBM BlueGene/P: 4 cores/node, 16 nodes
  – IBM XL compiler, IBM BlueGene MPI
• Cray XT5: 8 cores/node, 8 nodes
  – PGI compiler, Cray MPI
• Bull Novascale: 8 cores/node, 8 nodes
  – dual-socket, quad-core Xeon cluster, InfiniBand interconnect
  – Intel compiler
  – two MPI libraries: Bull and Intel

IBM Power5 - MasterOnly PingPong
[Figure: ratio vs. message size (bytes); curves for 16, 8, 4 and 2 threads with 2 processes]

Novascale/Bull - MasterOnly PingPong
[Figure: ratio vs. message size (bytes); curves for 8, 4 and 2 threads with 2 processes]

Novascale/Intel - MasterOnly PingPong
[Figure: ratio (log scale) vs. message size (bytes); curves for 8, 4 and 2 threads with 2 processes]

IBM Power 5 - Multiple PingPong
[Figure: ratio (log scale) vs. message size (bytes); curves for 16, 8, 4, 2 and 1 threads with 2 processes]

IBM BlueGene/P - Multiple PingPong
[Figure: ratio vs. message size (bytes); curves for 4, 2 and 1 threads with 2 processes]

Cray XT5 - Multiple PingPong
[Figure: ratio vs. message size (bytes); curves for 8, 4, 2 and 1 threads with 2 processes]

IBM Power 5 - Reduce
[Figure: ratio vs. message size (bytes); curves for 16 threads/4 processes, 8/8, 4/16 and 2/32]

IBM Power 6 - Reduce
[Figure: ratio vs. message size (bytes); curves for 32 threads/4 processes, 16/8, 8/16, 4/32 and 2/64]

IBM BlueGene/P - Reduce
[Figure: ratio vs. message size (bytes); curves for 4 threads/16 processes and 2 threads/32 processes]

Cray XT5 - Reduce
[Figure: ratio vs. message size (bytes); curves for 8 threads/8 processes, 4/16 and 2/32]

Novascale/Bull - Reduce
[Figure: ratio vs. message size (bytes); curves for 8 threads/8 processes, 4/16 and 2/32]

Novascale/Intel - Reduce
[Figure: ratio (log scale) vs. message size (bytes); curves for 8 threads/8 processes, 4/16 and 2/32]

Conclusions & Future Work

• The benchmark suite shows interesting results
  – highlights problems in some implementations
  – may help application writers decide whether a mixed-mode approach is useful on a given platform

• Add other benchmarks
  – MultiPingPong: to compare, e.g. on 4-core nodes, 4 pairs of MPI processes with 4 pairs of threads, etc.

• Run on more platforms
  – NEC SX-9, Cray X2 (when supported), other clusters, …

• Release a public version
