Overlapping Communication and Computation with High Level Communication Routines
- On Optimizing Parallel Applications -
Torsten Hoefler and Andrew Lumsdaine
Open Systems Lab, Indiana University, Bloomington, IN 47405, USA
Conference on Cluster Computing and the Grid (CCGrid'08), Lyon, France
21st May 2008
Torsten Hoefler and Andrew Lumsdaine
Communication/Computation Overlap
Introduction
Solving Grand Challenge Problems (not a Grid talk; an HPC-centric view: highly scalable, tightly coupled machines). Thanks for the introduction, Manish!
- All processors will be multi-core
- All computers will be massively parallel
- All programmers will be parallel programmers
- All programs will be parallel programs
⇒ All (massively) parallel programs need optimized communication (patterns)
Fundamental Assumptions (I)
We need more powerful machines!
- Solutions for real-world scientific problems need huge processing power (Grand Challenges)
- Capabilities of single PEs have fundamental limits
  - The scaling/frequency race is currently stagnating
  - Moore's law is still valid (number of transistors per chip)
  - Instruction-level parallelism is limited (pipelining, VLIW, multi-scalar)
- Explicit parallelism seems to be the only solution
  - Single chips and transistors get cheaper
  - Implicit transistor use (ILP, branch prediction) has its limits
Fundamental Assumptions (II)
Parallelism requires communication
- Local or even global data dependencies exist
- Off-chip communication becomes necessary
  - Bridges a physical distance (many PEs)
- Communication latency is limited
  - It is widely accepted that the speed of light limits data transmission
  - Example: minimal 0-byte latency for 1 m ≈ 3.3 ns ≈ 13 cycles on a 4 GHz PE
- Bandwidth can hide latency only partially
  - Bandwidth is limited (physical constraints)
  - The problem of "scaling out" (especially for iterative solvers)
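The latency example above is easy to verify numerically; this small sketch simply redoes the arithmetic from the slide:

```python
# Back-of-the-envelope check of the latency example:
# light travels 1 m in ~3.3 ns, during which a 4 GHz PE executes ~13 cycles.
c = 299_792_458.0          # speed of light in m/s
distance = 1.0             # meters
t = distance / c           # seconds on the wire
ns = t * 1e9               # nanoseconds
cycles = t * 4e9           # cycles at 4 GHz
print(round(ns, 1), round(cycles))  # → 3.3 13
```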
Assumptions about Parallel Program Optimization
Collective Operations
- Collective operations (COs) are an optimization tool
- CO performance influences application performance
- Optimized implementation and analysis of COs is non-trivial
Hardware Parallelism
- More PEs handle more tasks in parallel
- Transistors/PEs take over communication processing
- Communication and computation could run simultaneously
Overlap of Communication and Computation
- Overlap can hide latency
- Improves application performance
Overview (I)
Theoretical Considerations
- a model for parallel architectures
- parametrize the model
- derive models for BC and NBC
- prove optimality of collectives in the model (?)
- show processor idle time during BC
- show limits of the model (IB, BG/L)
Implementation of NBC
- how to assess performance?
- highly portable, lower-performance
- IB-optimized, high-performance
- threaded
Overview (II)
Application Kernels
- FFT (strong data dependency)
- compression (parallel data analysis)
- Poisson solver (2D decomposition)
Applications
show how performance benefits in microbenchmarks carry over to real-world applications
- ABINIT
- Octopus
- OSEM medical image reconstruction
The LogGP Model
Modelling Network Communication
- the LogP model family has the best tradeoff between ease of use and accuracy
- LogGP is the most accurate across different message sizes
Methodology
- assess LogGP parameters for modern interconnects
- model collective communication
[Figure: LogGP time diagram — sender and receiver CPUs each pay overhead o, the network adds latency L, with gap g and per-byte gap G between transmissions]
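As a sketch of how the parametrized model is used: in LogGP, a single s-byte point-to-point message costs 2o + L + (s − 1)·G. The parameter values below are illustrative, not the measured values from the following plots:

```python
# Minimal sketch of the LogGP point-to-point cost model:
# overhead o at the sender, latency L on the wire, (s - 1) * G for the
# bytes after the first, and overhead o again at the receiver.

def loggp_p2p(s, L, o, G):
    """Time for a single s-byte message in the LogGP model."""
    return 2 * o + L + (s - 1) * G

# Illustrative parameters: L = 5 us, o = 2 us, G = 0.001 us/byte
t = loggp_p2p(10_000, L=5.0, o=2.0, G=0.001)
print(t)  # ≈ 18.999 us
```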
TCP/IP - GigE/SMP
[Figure: measured LogGP costs for TCP/IP over GigE/SMP — time in microseconds (0-600) vs. data size s in bytes (0-60000); curves: MPICH2 G·s+g, MPICH2 o, TCP G·s+g, TCP o]
Myrinet/GM (preregistered/cached)
[Figure: time in microseconds (0-350) vs. data size s in bytes (0-60000); curves: Open MPI G·s+g, Open MPI o, Myrinet/GM G·s+g, Myrinet/GM o]
InfiniBand (preregistered/cached)
[Figure: time in microseconds (0-90) vs. data size s in bytes (0-60000); curves: Open MPI G·s+g, Open MPI o, OpenIB G·s+g, OpenIB o]
Modelling Collectives
LogGP models - general:
  t_barr   = (2o + L) · ⌈log₂ P⌉
  t_allred = 2 · (2o + L + m·G) · ⌈log₂ P⌉ + m·γ · ⌈log₂ P⌉
  t_bcast  = (2o + L + m·G) · ⌈log₂ P⌉
CPU parts:
  t_barr^CPU   = 2o · ⌈log₂ P⌉
  t_allred^CPU = (4o + m·γ) · ⌈log₂ P⌉
  t_bcast^CPU  = 2o · ⌈log₂ P⌉
Network parts:
  t_barr^NET   = L · ⌈log₂ P⌉
  t_allred^NET = 2 · (L + m·G) · ⌈log₂ P⌉
  t_bcast^NET  = (L + m·G) · ⌈log₂ P⌉
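The split into CPU and network parts can be checked mechanically: for each collective, the CPU and NET terms must sum to the total. A small sketch with illustrative (not measured) parameter values:

```python
# Sketch of the slide's LogGP collective models, checking that the CPU and
# NET parts sum to the totals. m = message size, P = processes,
# gamma = per-byte reduction cost. All parameter values are illustrative.
from math import ceil, log2, isclose

def rounds(P):
    return ceil(log2(P))

def t_barr(L, o, P):
    return (2 * o + L) * rounds(P)

def t_allred(L, o, G, m, gamma, P):
    return 2 * (2 * o + L + m * G) * rounds(P) + m * gamma * rounds(P)

def t_bcast(L, o, G, m, P):
    return (2 * o + L + m * G) * rounds(P)

L, o, G, m, gamma, P = 5.0, 2.0, 0.001, 1024, 0.002, 64
r = rounds(P)  # 6 rounds for 64 processes
assert isclose(2 * o * r + L * r, t_barr(L, o, P))
assert isclose((4 * o + m * gamma) * r + 2 * (L + m * G) * r,
               t_allred(L, o, G, m, gamma, P))
assert isclose(2 * o * r + (L + m * G) * r, t_bcast(L, o, G, m, P))
print("CPU + NET == total for all three collectives")
```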
CPU Overhead - MPI_Allreduce, LAM/MPI 7.1.2
[Figure: CPU usage share (0-0.03) of MPI_Allreduce as a function of communicator size (10-60) and data size (1-100000 bytes)]
Implementation of Non-blocking Collectives
LibNBC for MPI
- single-threaded
- highly portable
- schedule-based design
LibNBC for InfiniBand
- single-threaded (first version)
- receiver-driven message passing
- very low overhead
Threaded LibNBC
- thread support requires MPI_THREAD_MULTIPLE
- completely asynchronous progress
- complicated due to scheduling issues
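As a toy illustration of the schedule-based design (not LibNBC's actual data structures): a nonblocking collective can be compiled into rounds of point-to-point operations, and each progress call executes one round. Here, a dissemination-barrier schedule:

```python
# Toy sketch of a schedule-based nonblocking collective: the schedule is a
# list of rounds, each round a list of (operation, peer) actions that a
# progress() call would execute asynchronously.
from math import ceil, log2

def dissemination_barrier_schedule(rank, P):
    """Rounds of (send-to, recv-from) peers for a dissemination barrier."""
    schedule = []
    for k in range(ceil(log2(P))):
        schedule.append([("send", (rank + 2**k) % P),
                         ("recv", (rank - 2**k) % P)])
    return schedule

sched = dissemination_barrier_schedule(rank=0, P=8)
print(len(sched))   # → 3   (⌈log₂ 8⌉ rounds)
print(sched[0])     # → [('send', 1), ('recv', 7)]
```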
LibNBC - Alltoall overhead, 64 nodes
[Figure: communication overhead in microseconds (0-60000) vs. message size in kilobytes (0-300); curves: Open MPI/blocking, LibNBC/Open MPI (1024), LibNBC/OF (waitonsend)]
First Example
Derivation from the "normal" implementation
- data distribution identical to the "normal" 3D-FFT
- first FFT in z direction and index swap are identical
Design Goals to Minimize Communication Overhead
- start communication as early as possible
- achieve maximum overlap time
Solution
- start MPI_Ialltoall as soon as the first xz-plane is ready
- calculate the next xz-plane
- start the next communication accordingly
- ...
- collect multiple xz-planes per communication (tile factor)
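The solution above can be sketched as a step schedule: compute xz-planes one at a time and start an all-to-all for every `tile` finished planes, waiting for all outstanding communications at the end. `pipelined_fft_steps` is a hypothetical helper for illustration, not LibNBC code:

```python
# Sketch of the pipelined 3D-FFT communication scheme: interleave plane
# FFTs with nonblocking all-to-alls over batches of `tile` planes.
def pipelined_fft_steps(nplanes, tile):
    steps = []
    for p in range(nplanes):
        steps.append(("fft_plane", p))                 # compute plane p
        if (p + 1) % tile == 0:                        # batch complete:
            steps.append(("start_ialltoall", p + 1 - tile, p))
    if nplanes % tile:                                 # leftover planes
        steps.append(("start_ialltoall",
                      nplanes - nplanes % tile, nplanes - 1))
    steps.append(("waitall",))                         # finish all comms
    return steps

for step in pipelined_fft_steps(nplanes=4, tile=2):
    print(step)
# fft_plane 0, fft_plane 1, start_ialltoall(0..1),
# fft_plane 2, fft_plane 3, start_ialltoall(2..3), waitall
```

A larger tile factor amortizes communication startup cost over more data but delays the first all-to-all, so the factor trades startup overhead against overlap time.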