Overlapping Communication and Computation with High Level Communication Routines
On Optimizing Parallel Applications

Torsten Hoefler and Andrew Lumsdaine
Open Systems Lab, Indiana University, Bloomington, IN 47405, USA

Conference on Cluster Computing and the Grid (CCGrid'08), Lyon, France

21st May 2008


Introduction
(Thanks for the introduction, Manish!)
- Solving Grand Challenge Problems: not a Grid talk
- HPC-centric view: highly scalable, tightly coupled machines
- All processors will be multi-core
- All computers will be massively parallel
- All programmers will be parallel programmers
- All programs will be parallel programs
⇒ All (massively) parallel programs need optimized communication (patterns)


Fundamental Assumptions (I)
We need more powerful machines!
- Solutions for real-world scientific problems need huge processing power (Grand Challenges)
Capabilities of single PEs have fundamental limits
- The scaling/frequency race is currently stagnating
- Moore's law is still valid (number of transistors per chip)
- Instruction-level parallelism is limited (pipelining, VLIW, multi-scalar)
Explicit parallelism seems to be the only solution
- Single chips and transistors get cheaper
- Implicit transistor use (ILP, branch prediction) has its limits


Fundamental Assumptions (II)
Parallelism requires communication
- Local or even global data dependencies exist
- Off-chip communication becomes necessary
- Bridges a physical distance (many PEs)
Communication latency is limited
- It is widely accepted that the speed of light limits data transmission
- Example: minimal 0-byte latency for 1 m ≈ 3.3 ns ≈ 13 cycles on a 4 GHz PE
Bandwidth can hide latency only partially
- Bandwidth is limited (physical constraints)
- The problem of "scaling out" (especially for iterative solvers)


Assumptions about Parallel Program Optimization
Collective Operations
- Collective Operations (COs) are an optimization tool
- CO performance influences application performance
- Optimized implementation and analysis of COs is non-trivial
Hardware Parallelism
- More PEs handle more tasks in parallel
- Transistors/PEs take over communication processing
- Communication and computation could run simultaneously
Overlap of Communication and Computation
- Overlap can hide latency
- Improves application performance


Overview (I)
Theoretical Considerations
- a model for parallel architectures
- parametrize the model
- derive models for blocking (BC) and non-blocking collectives (NBC)
- prove optimality of collective operations in the model (?)
- show processor idle time during BC
- show limits of the model (InfiniBand, BG/L)
Implementation of NBC
- how to assess performance?
- highly portable, low performance
- InfiniBand: optimized, high performance, threaded


Overview (II)
Application Kernels
- FFT (strong data dependency)
- compression (parallel data analysis)
- Poisson solver (2D decomposition)
Applications
- show how the gains seen in microbenchmarks carry over to real-world applications
- ABINIT
- Octopus
- OSEM medical image reconstruction


The LogGP Model
Modelling Network Communication
- the LogP model family has the best tradeoff between ease of use and accuracy
- LogGP is the most accurate variant across different message sizes
Methodology
- assess the LogGP parameters for modern interconnects
- model collective communication on top of the parameters

[Figure: LogGP time diagram of a message from sender to receiver, separating CPU and network activity; parameters shown: send/receive overheads o_s and o_r, latency L, gap per message g, and gap per byte G.]

(a worked point-to-point example follows below)
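As a small worked example (my sketch, not code from the talk): under LogGP the end-to-end time of a single s-byte message is commonly taken as T(s) = o_s + L + (s-1)·G + o_r ≈ 2o + L + (s-1)·G. The C fragment below evaluates this; the parameter values are made up for illustration, not measured ones.

#include <stdio.h>

/* LogGP parameters, all in microseconds (G is per byte). Illustrative only. */
typedef struct {
    double L; /* network latency */
    double o; /* CPU send/receive overhead */
    double g; /* gap between consecutive messages */
    double G; /* gap per byte (inverse bandwidth) */
} loggp_t;

/* Time to deliver a single s-byte message under LogGP. */
static double loggp_p2p(loggp_t p, size_t s) {
    return 2.0 * p.o + p.L + (s > 0 ? (double)(s - 1) * p.G : 0.0);
}

int main(void) {
    loggp_t p = { .L = 5.0, .o = 2.0, .g = 6.0, .G = 0.0008 }; /* placeholders */
    for (size_t s = 1; s <= 65536; s *= 16)
        printf("T(%zu bytes) = %.2f us\n", s, loggp_p2p(p, s));
    return 0;
}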


TCP/IP - GigE/SMP
[Figure: measured G·s+g and o over message size s (0–64000 bytes) for MPICH2 and raw TCP; time in microseconds, up to ~600 µs.]

Myrinet/GM (preregistered/cached)
[Figure: measured G·s+g and o over message size s (0–64000 bytes) for Open MPI and raw Myrinet/GM; time in microseconds, up to ~350 µs.]

InfiniBand (preregistered/cached)
[Figure: measured G·s+g and o over message size s (0–64000 bytes) for Open MPI and raw OpenIB; time in microseconds, up to ~90 µs.]

Modelling Collectives

LogGP models (general), for P processes, message size m, and per-byte reduction cost γ:

  t_barr   = (2o + L) · ⌈log₂ P⌉
  t_allred = 2 · (2o + L + m·G) · ⌈log₂ P⌉ + m·γ · ⌈log₂ P⌉
  t_bcast  = (2o + L + m·G) · ⌈log₂ P⌉

Split into CPU and network parts:

  CPU: t_barr^CPU = 2o · ⌈log₂ P⌉,  t_allred^CPU = (4o + m·γ) · ⌈log₂ P⌉,  t_bcast^CPU = 2o · ⌈log₂ P⌉
  NET: t_barr^NET = L · ⌈log₂ P⌉,   t_allred^NET = 2 · (L + m·G) · ⌈log₂ P⌉,  t_bcast^NET = (L + m·G) · ⌈log₂ P⌉
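To make the split concrete, here is an illustrative sketch (mine, not from the talk) that evaluates the allreduce model and its CPU/network parts; note that the two parts sum to the total. All parameter values are placeholders.

#include <math.h>
#include <stdio.h>

/* Evaluate the LogGP allreduce model from the slide above.
 * L, o in microseconds; G, gamma in microseconds per byte; m in bytes. */
static double rounds(double P) { return ceil(log2(P)); }

static double t_allred(double L, double o, double G, double gamma,
                       double m, double P) {
    return 2.0 * (2.0 * o + L + m * G) * rounds(P) + m * gamma * rounds(P);
}
static double t_allred_cpu(double o, double gamma, double m, double P) {
    return (4.0 * o + m * gamma) * rounds(P);
}
static double t_allred_net(double L, double G, double m, double P) {
    return 2.0 * (L + m * G) * rounds(P);
}

int main(void) {
    double L = 5, o = 2, G = 0.0008, gamma = 0.001, m = 8192, P = 64;
    printf("total = %.1f us, cpu = %.1f us, net = %.1f us\n",
           t_allred(L, o, G, gamma, m, P),
           t_allred_cpu(o, gamma, m, P),
           t_allred_net(L, G, m, P));
    return 0;
}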

CPU Overhead - MPI_Allreduce, LAM/MPI 7.1.2
[Figure: CPU usage share of MPI_Allreduce over data size (1–100000) and communicator size (10–60); shares up to ~0.03.]

CPU Overhead - MPI_Allreduce, MPICH2 1.0.3
[Figure: CPU usage share of MPI_Allreduce over data size (1–100000) and communicator size (10–60); shares up to ~0.08.]

Implementation of Non-blocking Collectives
LibNBC for MPI
- single-threaded
- highly portable
- schedule-based design
LibNBC for InfiniBand
- single-threaded (first version)
- receiver-driven message passing
- very low overhead
Threaded LibNBC
- thread support requires MPI_THREAD_MULTIPLE
- completely asynchronous progress
- complicated due to scheduling issues
(a minimal usage sketch follows below)
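As a minimal usage sketch of the overlap pattern (my illustration): it is written against the nonblocking-collective interface in the style proposed by this work, using MPI_Ialltoall/MPI_Wait as later standardized in MPI-3 rather than LibNBC's own calls; the buffer sizes and compute() are placeholders.

#include <mpi.h>
#include <stdlib.h>

/* Placeholder for application work that is independent of the exchange. */
static void compute(void) { /* ... */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int np;
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    const int count = 1024;  /* elements per peer (placeholder) */
    double *sbuf = calloc((size_t)np * count, sizeof *sbuf);
    double *rbuf = calloc((size_t)np * count, sizeof *rbuf);

    MPI_Request req;
    /* Start the collective; it progresses while we compute. */
    MPI_Ialltoall(sbuf, count, MPI_DOUBLE, rbuf, count, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);
    compute();                          /* overlapped computation */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete before using rbuf */

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}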

LibNBC - Alltoall Overhead, 64 Nodes
[Figure: communication overhead in microseconds (up to ~60000) over message size (0–300 kilobytes); curves: Open MPI/blocking, LibNBC/Open MPI (1024), LibNBC/OF (waitonsend).]

First Example: 3D-FFT
Derivation from the "normal" implementation
- distribution identical to the "normal" 3D-FFT
- first FFT in z direction and index swap identical
Design goals to minimize communication overhead
- start communication as early as possible
- achieve maximum overlap time
Solution (condensed in the code sketch below)
- start MPI_Ialltoall as soon as the first xz plane is ready
- calculate the next xz plane
- start the next communication accordingly ...
- collect multiple xz planes (tile factor)
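A condensed sketch of this pipeline (my illustration, not the talk's code): the fft_1d_z/fft_1d_x stubs, plane count, and buffer sizes are placeholders, and for simplicity it posts one MPI_Ialltoall per plane, i.e. a tile factor of one.

#include <mpi.h>
#include <complex.h>
#include <stdlib.h>

#define NPLANES 4  /* xz planes per process (placeholder) */

/* Placeholders for the 1D FFT passes; a real code would call FFTW etc. */
static void fft_1d_z(double complex *plane, int n) { (void)plane; (void)n; }
static void fft_1d_x(double complex *plane, int n) { (void)plane; (void)n; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int np; MPI_Comm_size(MPI_COMM_WORLD, &np);

    int cnt = 256;  /* elements sent to each peer per plane (placeholder) */
    double complex *in[NPLANES], *out[NPLANES];
    for (int p = 0; p < NPLANES; p++) {
        in[p]  = calloc((size_t)np * cnt, sizeof **in);
        out[p] = calloc((size_t)np * cnt, sizeof **out);
    }

    MPI_Request req[NPLANES];
    /* Phase 1: transform each xz plane in z and start its exchange at once. */
    for (int p = 0; p < NPLANES; p++) {
        fft_1d_z(in[p], np * cnt);
        MPI_Ialltoall(in[p], cnt, MPI_C_DOUBLE_COMPLEX,
                      out[p], cnt, MPI_C_DOUBLE_COMPLEX,
                      MPI_COMM_WORLD, &req[p]);
    }
    /* Phase 2: wait for each plane only when its x transform needs the data. */
    for (int p = 0; p < NPLANES; p++) {
        MPI_Wait(&req[p], MPI_STATUS_IGNORE);
        fft_1d_x(out[p], np * cnt);
    }

    for (int p = 0; p < NPLANES; p++) { free(in[p]); free(out[p]); }
    MPI_Finalize();
    return 0;
}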

Transformation in z Direction
Data already transformed in y direction
[Figure: 3x3x3 grid of blocks along axes x, y, z; 1 block = 1 double value.]

Transformation in z Direction
Transform first xz plane in z direction
[Figure: the first xz plane is now hatched; the hatch pattern marks data transformed in both the y and z directions.]

Transformation in z Direction
Start MPI_Ialltoall of the first xz plane and transform the second plane
[Figure: the first plane is drawn in cyan, meaning its data is being communicated in the background.]

Transformation in z Direction
Start MPI_Ialltoall of the second xz plane and transform the third plane
[Figure: the data of two planes is not accessible due to ongoing communication.]

Transformation in x Direction
Start communication of the third plane and ... we need the first xz plane to go on ...
[Figure: all planes are in flight; no plane's data is locally accessible yet.]

Transformation in x Direction
... so MPI_Wait for the first MPI_Ialltoall, and transform the first plane
[Figure: a new hatch pattern marks data transformed in the x, y, and z directions.]

Transformation in x Direction
Wait for and transform the second xz plane
[Figure: the first plane's data could already be accessed for the next operation.]

Transformation in x Direction
Wait for and transform the last xz plane
Done! → 1 complete 1D-FFT overlaps each communication
[Figure: all planes fully transformed.]

3D-FFT Performance
[Figure: FFT communication overhead in seconds (up to ~0.7) on 2 to 64 nodes; curves: MPI/BL, MPI/NBC, OF/NBC.]

Parallel Data Compression
Second Example: Data-Parallel Loops
The blocking version computes all N/P local items first and communicates afterwards (a pipelined variant is sketched below):

for (i = 0; i < N/P; i++) {
    compute(i);
}
comm(N/P);
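A tiled variant that overlaps the communication of one finished tile with the computation of the next could look roughly like this (my sketch, in the same abstract style as the slide's loop; compute(), start_comm(), wait_comm(), req[], and the tile size T are all placeholders):

/* Pipelined sketch: split the N/P local items into tiles of T items,
 * start the communication for a finished tile immediately, and overlap
 * it with the computation of the following tile. */
for (int t = 0; t < (N/P)/T; t++) {
    for (int i = t*T; i < (t+1)*T; i++)
        compute(i);               /* compute one tile */
    start_comm(t, &req[t]);       /* hypothetical: e.g. an MPI_Ialltoall on tile t */
}
for (int t = 0; t < (N/P)/T; t++)
    wait_comm(&req[t]);           /* hypothetical: e.g. MPI_Wait */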

Parallel Compression Performance
[Figure: communication overhead in seconds (up to ~0.5) on 8 to 64 nodes; curves: MPI/BL, MPI/NBC, OF/NBC.]

Domain Decomposition
- nearest-neighbor communication
- can be implemented with MPI_Alltoallv
- we propose a new collective: MPI_Neighbor_xchg[v] (sketch below)
[Figure: a 2D domain split across a 4x3 process grid P0–P11; each process owns its process-local data and exchanges halo data with its neighbors.]
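For illustration, here is roughly what such a halo exchange looks like with MPI-3's MPI_Ineighbor_alltoallv, the standardized descendant of the MPI_Neighbor_xchg[v] proposal, on a Cartesian communicator; the grid and halo sizes are placeholders.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* 2D process grid with periodic boundaries (illustrative setup). */
    int dims[2] = {0, 0}, periods[2] = {1, 1}, np;
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Dims_create(np, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    /* 4 neighbors in a 2D Cartesian topology: -x, +x, -y, +y. */
    const int nn = 4, halo = 128;  /* halo elements per face (placeholder) */
    double *sbuf = calloc((size_t)nn * halo, sizeof *sbuf);
    double *rbuf = calloc((size_t)nn * halo, sizeof *rbuf);
    int scnt[4], rcnt[4], sdsp[4], rdsp[4];
    for (int i = 0; i < nn; i++) {
        scnt[i] = rcnt[i] = halo;
        sdsp[i] = rdsp[i] = i * halo;
    }

    MPI_Request req;
    MPI_Ineighbor_alltoallv(sbuf, scnt, sdsp, MPI_DOUBLE,
                            rbuf, rcnt, rdsp, MPI_DOUBLE, cart, &req);
    /* ... update interior points that need no halo data here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    /* ... now update boundary points using the received halos ... */

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}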

Parallel 3D-Poisson Solver - Speedup
[Figure: two speedup plots (up to ~100) over 8–96 CPUs: Gigabit Ethernet (blocking vs. non-blocking) and InfiniBand (blocking vs. non-blocking).]

Cluster: 128 nodes with 2 GHz Opteron 246
Interconnect: Gigabit Ethernet, InfiniBand
System size: 800x800x800 (1 node ≈ 5300 s)

Medical Image Reconstruction
- OSEM algorithm
- allreduction of the full image (overlap sketch below)
[Figure: communication overhead in seconds (up to ~7) on 8, 16, and 32 nodes for MPI_Allreduce(), NBC_Iallreduce(), and threaded NBC_Iallreduce().]
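The underlying pattern, as a rough sketch (mine, with MPI-3's MPI_Iallreduce standing in for NBC_Iallreduce; the image size, subset count, and per-subset work are placeholders). Double buffering keeps the buffer that is in flight untouched while the next subset is computed.

#include <mpi.h>
#include <stdlib.h>

#define NSUBSETS 8  /* OSEM subsets per iteration (placeholder) */

/* Placeholder: forward/backward projection work for one subset. */
static void process_subset(int s, double *img, int n) {
    (void)s; (void)img; (void)n;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    const int n = 1 << 20;  /* image size in doubles (placeholder) */
    double *buf[2] = { calloc(n, sizeof(double)), calloc(n, sizeof(double)) };
    double *sum = calloc(n, sizeof *sum);

    MPI_Request req = MPI_REQUEST_NULL;
    for (int s = 0; s < NSUBSETS; s++) {
        double *cur = buf[s % 2];
        process_subset(s, cur, n);          /* compute the local update */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* finish the previous reduce */
        /* Reduce this subset's image while the next subset is computed. */
        MPI_Iallreduce(cur, sum, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
        /* a real code would consume sum after the matching wait */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(buf[0]); free(buf[1]); free(sum);
    MPI_Finalize();
    return 0;
}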

Thank You for Your Attention
