Overlapping Communication and Computation with High Level Communication Routines
- On Optimizing Parallel Applications -
Torsten Hoefler and Andrew Lumsdaine
Open Systems Lab, Indiana University, Bloomington, IN 47405, USA
Conference on Cluster Computing and the Grid (CCGrid'08), Lyon, France
21st May 2008
Torsten Hoefler and Andrew Lumsdaine
Communication/Computation Overlap
Introduction
Solving Grand Challenge Problems (not a Grid talk; an HPC-centric view: highly scalable, tightly coupled machines). Thanks for the introduction, Manish!
- All processors will be multi-core
- All computers will be massively parallel
- All programmers will be parallel programmers
- All programs will be parallel programs
⇒ All (massively) parallel programs need optimized communication (patterns)
Fundamental Assumptions (I)
We need more powerful machines!
- Solutions for real-world scientific problems need huge processing power (Grand Challenges)
- Capabilities of single PEs have fundamental limits
  - The scaling/frequency race is currently stagnating
  - Moore's law is still valid (number of transistors per chip)
  - Instruction-level parallelism is limited (pipelining, VLIW, multi-scalar)
- Explicit parallelism seems to be the only solution
  - Single chips and transistors get cheaper
  - Implicit transistor use (ILP, branch prediction) has its limits
Fundamental Assumptions (II)
Parallelism requires communication
- Local or even global data dependencies exist
- Off-chip communication becomes necessary
  - Bridges a physical distance (many PEs)
- Communication latency is limited
  - It is widely accepted that the speed of light limits data transmission
  - Example: minimal 0-byte latency for 1 m ≈ 3.3 ns ≈ 13 cycles on a 4 GHz PE
- Bandwidth can hide latency only partially
  - Bandwidth is limited (physical constraints)
  - The problem of "scaling out" (especially for iterative solvers)
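The latency example above is easy to verify numerically; this small sketch simply redoes the arithmetic from the slide:

```python
# Back-of-the-envelope check of the latency example:
# light travels 1 m in ~3.3 ns, during which a 4 GHz PE executes ~13 cycles.
c = 299_792_458.0          # speed of light in m/s
distance = 1.0             # meters
t = distance / c           # seconds on the wire
ns = t * 1e9               # nanoseconds
cycles = t * 4e9           # cycles at 4 GHz
print(round(ns, 1), round(cycles))  # → 3.3 13
```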
Assumptions about Parallel Program Optimization
Collective Operations
- Collective operations (COs) are an optimization tool
- CO performance influences application performance
- Optimized implementation and analysis of COs is non-trivial
Hardware Parallelism
- More PEs handle more tasks in parallel
- Transistors/PEs take over communication processing
- Communication and computation could run simultaneously
Overlap of Communication and Computation
- Overlap can hide latency
- Improves application performance
Overview (I)
Theoretical Considerations
- a model for parallel architectures
- parametrize the model
- derive models for BC and NBC
- prove optimality of collectives in the model (?)
- show processor idle time during BC
- show limits of the model (IB, BG/L)
Implementation of NBC
- how to assess performance?
- highly portable, lower-performance
- IB-optimized, high-performance
- threaded
Overview (II)
Application Kernels
- FFT (strong data dependency)
- compression (parallel data analysis)
- Poisson solver (2D decomposition)
Applications
show how performance benefits in microbenchmarks carry over to real-world applications
- ABINIT
- Octopus
- OSEM medical image reconstruction
The LogGP Model
Modelling Network Communication
- the LogP model family has the best tradeoff between ease of use and accuracy
- LogGP is the most accurate across different message sizes
Methodology
- assess LogGP parameters for modern interconnects
- model collective communication
[Figure: LogGP time diagram — sender and receiver CPUs each pay overhead o, the network adds latency L, with gap g and per-byte gap G between transmissions]
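As a sketch of how the parametrized model is used: in LogGP, a single s-byte point-to-point message costs 2o + L + (s − 1)·G. The parameter values below are illustrative, not the measured values from the following plots:

```python
# Minimal sketch of the LogGP point-to-point cost model:
# overhead o at the sender, latency L on the wire, (s - 1) * G for the
# bytes after the first, and overhead o again at the receiver.

def loggp_p2p(s, L, o, G):
    """Time for a single s-byte message in the LogGP model."""
    return 2 * o + L + (s - 1) * G

# Illustrative parameters: L = 5 us, o = 2 us, G = 0.001 us/byte
t = loggp_p2p(10_000, L=5.0, o=2.0, G=0.001)
print(t)  # ≈ 18.999 us
```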
TCP/IP - GigE/SMP
[Figure: measured LogGP costs for TCP/IP over GigE/SMP — time in microseconds (0-600) vs. data size s in bytes (0-60000); curves: MPICH2 G·s+g, MPICH2 o, TCP G·s+g, TCP o]
Myrinet/GM (preregistered/cached)
[Figure: time in microseconds (0-350) vs. data size s in bytes (0-60000); curves: Open MPI G·s+g, Open MPI o, Myrinet/GM G·s+g, Myrinet/GM o]
InfiniBand (preregistered/cached)
[Figure: time in microseconds (0-90) vs. data size s in bytes (0-60000); curves: Open MPI G·s+g, Open MPI o, OpenIB G·s+g, OpenIB o]
Modelling Collectives
LogGP models - general:
  t_barr   = (2o + L) · ⌈log₂ P⌉
  t_allred = 2 · (2o + L + m·G) · ⌈log₂ P⌉ + m·γ · ⌈log₂ P⌉
  t_bcast  = (2o + L + m·G) · ⌈log₂ P⌉
CPU parts:
  t_barr^CPU   = 2o · ⌈log₂ P⌉
  t_allred^CPU = (4o + m·γ) · ⌈log₂ P⌉
  t_bcast^CPU  = 2o · ⌈log₂ P⌉
Network parts:
  t_barr^NET   = L · ⌈log₂ P⌉
  t_allred^NET = 2 · (L + m·G) · ⌈log₂ P⌉
  t_bcast^NET  = (L + m·G) · ⌈log₂ P⌉
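The split into CPU and network parts can be checked mechanically: for each collective, the CPU and NET terms must sum to the total. A small sketch with illustrative (not measured) parameter values:

```python
# Sketch of the slide's LogGP collective models, checking that the CPU and
# NET parts sum to the totals. m = message size, P = processes,
# gamma = per-byte reduction cost. All parameter values are illustrative.
from math import ceil, log2, isclose

def rounds(P):
    return ceil(log2(P))

def t_barr(L, o, P):
    return (2 * o + L) * rounds(P)

def t_allred(L, o, G, m, gamma, P):
    return 2 * (2 * o + L + m * G) * rounds(P) + m * gamma * rounds(P)

def t_bcast(L, o, G, m, P):
    return (2 * o + L + m * G) * rounds(P)

L, o, G, m, gamma, P = 5.0, 2.0, 0.001, 1024, 0.002, 64
r = rounds(P)  # 6 rounds for 64 processes
assert isclose(2 * o * r + L * r, t_barr(L, o, P))
assert isclose((4 * o + m * gamma) * r + 2 * (L + m * G) * r,
               t_allred(L, o, G, m, gamma, P))
assert isclose(2 * o * r + (L + m * G) * r, t_bcast(L, o, G, m, P))
print("CPU + NET == total for all three collectives")
```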
CPU Overhead - MPI_Allreduce, LAM/MPI 7.1.2
[Figure: CPU usage share (0-0.03) of MPI_Allreduce as a function of communicator size (10-60) and data size (1-100000 bytes)]
Implementation of Non-blocking Collectives
LibNBC for MPI
- single-threaded
- highly portable
- schedule-based design
LibNBC for InfiniBand
- single-threaded (first version)
- receiver-driven message passing
- very low overhead
Threaded LibNBC
- thread support requires MPI_THREAD_MULTIPLE
- completely asynchronous progress
- complicated due to scheduling issues
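As a toy illustration of the schedule-based design (not LibNBC's actual data structures): a nonblocking collective can be compiled into rounds of point-to-point operations, and each progress call executes one round. Here, a dissemination-barrier schedule:

```python
# Toy sketch of a schedule-based nonblocking collective: the schedule is a
# list of rounds, each round a list of (operation, peer) actions that a
# progress() call would execute asynchronously.
from math import ceil, log2

def dissemination_barrier_schedule(rank, P):
    """Rounds of (send-to, recv-from) peers for a dissemination barrier."""
    schedule = []
    for k in range(ceil(log2(P))):
        schedule.append([("send", (rank + 2**k) % P),
                         ("recv", (rank - 2**k) % P)])
    return schedule

sched = dissemination_barrier_schedule(rank=0, P=8)
print(len(sched))   # → 3   (⌈log₂ 8⌉ rounds)
print(sched[0])     # → [('send', 1), ('recv', 7)]
```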
LibNBC - Alltoall overhead, 64 nodes
[Figure: communication overhead in microseconds (0-60000) vs. message size in kilobytes (0-300); curves: Open MPI/blocking, LibNBC/Open MPI (1024), LibNBC/OF (waitonsend)]
First Example
Derivation from the "normal" implementation
- data distribution identical to the "normal" 3D-FFT
- first FFT in z direction and index swap are identical
Design Goals to Minimize Communication Overhead
- start communication as early as possible
- achieve maximum overlap time
Solution
- start MPI_Ialltoall as soon as the first xz-plane is ready
- calculate the next xz-plane
- start the next communication accordingly
- ...
- collect multiple xz-planes per communication (tile factor)
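The solution above can be sketched as a step schedule: compute xz-planes one at a time and start an all-to-all for every `tile` finished planes, waiting for all outstanding communications at the end. `pipelined_fft_steps` is a hypothetical helper for illustration, not LibNBC code:

```python
# Sketch of the pipelined 3D-FFT communication scheme: interleave plane
# FFTs with nonblocking all-to-alls over batches of `tile` planes.
def pipelined_fft_steps(nplanes, tile):
    steps = []
    for p in range(nplanes):
        steps.append(("fft_plane", p))                 # compute plane p
        if (p + 1) % tile == 0:                        # batch complete:
            steps.append(("start_ialltoall", p + 1 - tile, p))
    if nplanes % tile:                                 # leftover planes
        steps.append(("start_ialltoall",
                      nplanes - nplanes % tile, nplanes - 1))
    steps.append(("waitall",))                         # finish all comms
    return steps

for step in pipelined_fft_steps(nplanes=4, tile=2):
    print(step)
# fft_plane 0, fft_plane 1, start_ialltoall(0..1),
# fft_plane 2, fft_plane 3, start_ialltoall(2..3), waitall
```

A larger tile factor amortizes communication startup cost over more data but delays the first all-to-all, so the factor trades startup overhead against overlap time.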