Smoothed Particle Hydrodynamics simulations on multi-GPU systems
E. Rustico, G. Bilotta, A. Hérault, C. Del Negro and G. Gallo
Università di Catania, INGV Osservatorio Etneo, CNAM (Paris)
PDP 2012 - Garching, February 16, 2012
Outline
1. CUDA
2. SPH
3. Single-GPU SPH
4. Multi-GPU SPH
5. Load balancing
6. Videos
7. Results
8. Analysis
9. Conclusions
Evolution of GPUs
[Images from 1981, 1992, 2001 and 2011]
"Same" game, slightly different computational requirements...
[Chart: peak GFLOP/s of NVIDIA GPUs (NV30, NV35, NV40, G70, G71, G80, G80 Ultra, G92, GT200, Fermi) vs. Intel CPUs (3.0 GHz Core 2 Duo, 3.2 GHz Harpertown, Core i7-920, Arrandale), 2003-2010]
Gaming has been the main driving factor behind GPU computational power. GPUs are nowadays faster than CPUs for specific kinds of computations.
GPU programming Timeline
- 1980s: first Graphics Processing Units
- 1990s: almost every computer has its own graphics accelerator
- 2001: release of the first programmable shader; researchers mapped non-graphics problems onto graphics ones in order to exploit the computational power of GPUs for general-purpose problems
- 2007: release of CUDA, a software and hardware platform with explicit support for general-purpose programmability
[Diagram: GPU architecture - several multiprocessors, each with its own instruction unit, ALUs, shared memory and registers, all accessing the global and constant/texture memory; the board is connected to the host via PCIe]
A multi-core GPU can be seen as a parallel computer with shared memory
CUDA
Why CUDA?
- Simple programming language: an extension of C++
- Simple programming model: it is easy to quickly gain a consistent speedup
- Cheap hardware (CUDA-capable hardware since 2007; 3 TFLOPS for less than 1,000 €)
- Various optimized libraries available for common tasks (sorting, selection, linear algebra, FFT, image processing, computer vision...)
- ...and aggressive marketing! In November 2011, three CUDA-enabled clusters were among the top 5 of the TOP500
Why not CUDA?
- Bound to a specific manufacturer
- No complete abstraction from the hardware (unlike OpenCL)
- Theoretical peak power often smaller than the main competitor's (ATI)
Example: array initialization, serial CPU code
for (i = 0; i < 4096; i++)
    cpu_array[i] = i;
Example: array initialization, parallel GPU code
#define BLKSIZE 64
...
__global__ void init_kernel(int *gpuPointer) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    gpuPointer[index] = index;
}
...
init_kernel<<<4096 / BLKSIZE, BLKSIZE>>>(gpu_array);
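For completeness, a minimal host-side sketch of how such a kernel is typically allocated, launched and read back; the cudaMalloc/cudaMemcpy calls and the main() wrapper are an assumption for illustration, not the presentation's code.

#include <cstdio>
#include <cuda_runtime.h>

#define BLKSIZE 64
#define N 4096

__global__ void init_kernel(int *gpuPointer) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    gpuPointer[index] = index;
}

int main() {
    int *gpu_array = NULL;
    int cpu_array[N];

    cudaMalloc((void **)&gpu_array, N * sizeof(int));   // allocate device memory
    init_kernel<<<N / BLKSIZE, BLKSIZE>>>(gpu_array);   // one thread per element
    cudaMemcpy(cpu_array, gpu_array, N * sizeof(int),
               cudaMemcpyDeviceToHost);                 // copy the result back
    printf("cpu_array[42] = %d\n", cpu_array[42]);
    cudaFree(gpu_array);
    return 0;
}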
Model
SPH
Smoothed Particle Hydrodynamics is a mesh-free, Lagrangian numerical method for fluid modeling, originally developed by Gingold, Monaghan and Lucy for astrophysical simulations.
The fluid is discretized into a set of arbitrarily distributed particles, each carrying scalar and vector properties (mass, density, position, velocity, etc.).
It is possible to interpolate the value of a field A at any point r of the domain by convolving the field values with a smoothing kernel W :
A(\mathbf{r}) \simeq \sum_j \frac{m_j}{\rho_j} A_j \, W(|\mathbf{r} - \mathbf{r}_j|, h)
where
- m_j is the mass of particle j
- A_j is the value of the field A for particle j
- ρ_j is the density associated with particle j
- r denotes position
- W is the smoothing kernel function, chosen to have compact support with radius proportional to a given smoothing length h
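As an illustration of how this summation maps to per-particle GPU code, a minimal sketch, assuming a precomputed fixed-stride neighbor list and an unnormalized cubic-spline kernel; all names and the data layout are hypothetical, not the GPUSPH implementation.

// Illustrative smoothing kernel (cubic-spline shape, normalization omitted for brevity)
__device__ float W(float r, float h) {
    float q = r / h;
    if (q >= 2.0f) return 0.0f;
    if (q < 1.0f)  return 1.0f - 1.5f * q * q + 0.75f * q * q * q;
    float t = 2.0f - q;
    return 0.25f * t * t * t;
}

// One thread per particle: A(r_i) ~= sum_j (m_j / rho_j) A_j W(|r_i - r_j|, h)
__global__ void interpolate_field(const float4 *pos,       // xyz position, mass in .w
                                  const float  *rho,       // densities
                                  const float  *A,         // field values
                                  const int    *neibList,  // maxNeibs indices per particle, -1 = end
                                  int maxNeibs, int numParticles, float h,
                                  float *Aout)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 pi = pos[i];
    float acc = 0.0f;
    for (int n = 0; n < maxNeibs; n++) {
        int j = neibList[i * maxNeibs + n];
        if (j < 0) break;                      // end-of-list sentinel
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r = sqrtf(dx * dx + dy * dy + dz * dz);
        acc += (pj.w / rho[j]) * A[j] * W(r, h);
    }
    Aout[i] = acc;
}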
As the smoothing kernel W has compact support, each particle only interacts with a small set of neighbors. Because each particle accesses its neighbors multiple times during an iteration, it is convenient to compute the neighbor list once and store it.
Each iteration of the simulation consists of the following steps:
1. For each particle, compute and store the list of neighbors
2. For each particle, compute the interaction with all neighbors and compute the resulting force and the maximum allowable δt
3. Find the minimum allowable δt (to be used in the next iteration)
4. For each particle, integrate the force over the δt and compute the resulting position and velocity
The current implementation actually relies on a predictor-corrector integration scheme with adaptive δt, where steps 2-4 are performed twice per iteration.
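A host-side sketch of one such iteration, assuming one CUDA kernel per step; kernel names, arguments and the d_* buffers are hypothetical, and only a single integration pass is shown instead of the predictor-corrector pair.

// Hypothetical per-iteration host loop; the kernels and the d_* device buffers
// are assumed to be declared and initialized elsewhere.
while (t < t_end) {
    buildneibs<<<grid, block>>>(d_pos, d_neibList);          // 1. neighbor lists
    forces<<<grid, block>>>(d_pos, d_vel, d_neibList,
                            d_forces, d_dtCandidates);       // 2. forces + per-particle max dt
    float dt = minimum_scan(d_dtCandidates, numParticles);   // 3. global minimum dt (e.g. a CUDPP scan)
    euler<<<grid, block>>>(d_pos, d_vel, d_forces, dt);      // 4. integrate positions and velocities
    t += dt;
}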
Kernels
Single-GPU SPH
The SPH model exposes a high degree of parallelism, and its steps can be straightforwardly implemented as GPU functions (i.e. CUDA kernels):
1. BuildNeibs - computed every k-th iteration
2. Forces - the most expensive step
3. MinimumScan - computed with the CUDPP minimum scan
4. Euler - the fastest step (an illustrative sketch follows this list)
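For illustration, a minimal sketch of what the Euler step may look like as a CUDA kernel; the argument layout and the plain explicit update are assumptions, not the actual GPUSPH kernel.

// One thread per particle: advance velocity and position by dt.
__global__ void euler(float4 *pos, float4 *vel, const float4 *forces,
                      float dt, int numParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 f = forces[i];                 // acceleration computed by the Forces kernel
    float4 v = vel[i];
    v.x += f.x * dt;  v.y += f.y * dt;  v.z += f.z * dt;
    pos[i].x += v.x * dt;  pos[i].y += v.y * dt;  pos[i].z += v.z * dt;
    vel[i] = v;
}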
Fast Neighbor Search
The naïve algorithm to gather the neighbor list of each particle is O(n²) (quadratic in the number of particles). While the SPH model is gridless, organizing the particles in a virtual grid with cells as wide as the influence radius speeds up the search: only particles in the same cell or in adjacent cells need to be tested. Cells are completely transparent to the SPH model.
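To make the grid idea concrete, a minimal sketch of the kind of cell indexing typically used for this search (names and axis ordering are assumptions): each particle gets the linear index of the cell containing it, particles are sorted by this index, and the search only scans the surrounding cells.

// Hypothetical cell hash: map a particle position to the linear index of its cell.
// cellSide equals the influence radius, so all neighbors of a particle lie in its
// own cell or in one of the 26 adjacent cells.
__host__ __device__ inline unsigned int cellHash(float3 p, float3 worldOrigin,
                                                 float cellSide, uint3 gridSize)
{
    unsigned int cx = (unsigned int)((p.x - worldOrigin.x) / cellSide);
    unsigned int cy = (unsigned int)((p.y - worldOrigin.y) / cellSide);
    unsigned int cz = (unsigned int)((p.z - worldOrigin.z) / cellSide);
    return (cz * gridSize.y + cy) * gridSize.x + cx;   // linearized cell index
}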
CPU vs. GPU
Differences with respect to the CPU implementation:
- Data are stored in structure-of-arrays rather than array-of-structures fashion, to improve memory coalescing (cf. the CUDA Best Practices Guide); see the layout sketch below
- Each interaction is computed twice, as on the GPU this is faster than dealing with concurrent writes
- Negligible numerical differences: the GPU uses single-precision floats by default, and the order of additions may differ
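To illustrate the layout difference (field names and sizes are hypothetical):

#define N_PARTICLES 4096

// Array of structures (typical CPU layout): threads of a warp reading pos[i]
// touch memory locations that are far apart.
struct ParticleAoS { float4 pos; float4 vel; float rho; };
ParticleAoS particlesAoS[N_PARTICLES];

// Structure of arrays (GPU layout): thread i reads pos[i], so a warp accesses
// a contiguous, coalesced segment of memory.
struct ParticlesSoA {
    float4 pos[N_PARTICLES];
    float4 vel[N_PARTICLES];
    float  rho[N_PARTICLES];
};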
Results
The GPU-based implementation reached a 100× speedup over the reference CPU code (serial, single thread). Details and results in A. Hérault, G. Bilotta, and R. A. Dalrymple, SPH on GPU with CUDA, Journal of Hydraulic Research, 48(Extra Issue):74–79, 2010
Open source implementation: GPU-SPH www.ce.jhu.edu/dalrymple/GPUSPH
Motivation
Multi-GPU SPH
Why exploit multiple devices simultaneously?
- Performance: faster simulations
- Problem size: bigger simulations
How to exploit multiple devices simultaneously?
- Distribute the computational phases (pipeline; not scalable)
- Distribute the problem domain (scalable; boundary conditions must be handled)
- Greedy list-split (unfeasible: no guarantees for neighbors)
- Cell-based split (list-split + ordering)
Splitting the problem
To split the domain, we exploit the virtual grid used for the fast neighbor search.
Particles are enumerated in linear memory according to the Cartesian axis chosen for the split.
[Figure: two example splits of the simulation domain among GPU 0 through GPU 3]
Avoid multiple split planes in the same simulation.
[Figure: adjacent devices D and D+1 spanning slices s_n, s_n+1, s_n+2, s_n+3; each device holds its internal slices plus a read-only external copy of the neighboring device's edge slice (particles and cells shown)]
External particles: read-only copies from neighboring devices, needed to compute the forces for edge particles.
Internal particles: the device's own particles, needed by neighboring devices.
Total overlap: two cells, to be updated after every forces computation.
Up to CUDA 3.2 there is no direct device-to-device communication: the exchange is staged through a CPU buffer and requires multiple transfers, as sketched below.
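A minimal sketch of the staged transfer this implies, using a pinned host buffer; buffer names and sizes are hypothetical, and this only illustrates the pre-CUDA-4.0 pattern described above, not the exact GPUSPH code.

// Exchange one edge slice between device 0 and device 1 through the host.
// d0_edgeSlice, d1_externalSlice and sliceParticles are assumed to exist.
size_t sliceBytes = sliceParticles * sizeof(float4);
float4 *hostStage = NULL;
cudaMallocHost((void **)&hostStage, sliceBytes);    // pinned buffer for faster copies

cudaSetDevice(0);                                   // download the edge slice from device 0
cudaMemcpy(hostStage, d0_edgeSlice, sliceBytes, cudaMemcpyDeviceToHost);

cudaSetDevice(1);                                   // upload it as external particles on device 1
cudaMemcpy(d1_externalSlice, hostStage, sliceBytes, cudaMemcpyHostToDevice);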
1. BuildNeibs - reads all particles, writes only for internal ones
2. Forces - ditto
3. MinimumScan - reads only internal particles
4. Euler - writes both internal and external particles
Plus: after each execution of Forces, the overlapping borders must be exchanged with the neighboring devices. We need a technique to hide those transfers, or the overhead will "kill" the simulation.
Hiding transfers
[Diagram: kernel_forces_async issuing work on three GPU streams A, B, C]
kernel_forces_async:
- Launch kernel_forces on the edging internal slices (1, 2)
- Launch kernel_forces on the rest of the slices (3)
- Launch the download of the internal slices (forces)
- GPUThread BARRIER
- Launch the upload of the external slices (forces)
Key idea: exchange the slices as soon as they are ready, while the forces of the internal particles are still being computed, by using the asynchronous API. This know-how was acquired with a multi-GPU Cellular Automaton (MAGFLOW).
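A minimal sketch of how such an overlap can be expressed with the asynchronous CUDA API; the stream assignment, names and synchronization points are assumptions, not the exact GPUSPH code.

// Hypothetical sketch: overlap the border exchange with the bulk of the Forces
// computation. The forces kernel, grid/block sizes and all buffers (device d_*,
// pinned host h_*) are assumed to be declared elsewhere.
cudaStream_t edgeStream, bulkStream;
cudaStreamCreate(&edgeStream);
cudaStreamCreate(&bulkStream);

forces<<<edgeGrid, block, 0, edgeStream>>>(d_particles, edgeFirst, edgeCount);  // edging internal slices (1, 2)
forces<<<bulkGrid, block, 0, bulkStream>>>(d_particles, bulkFirst, bulkCount);  // rest of the slices (3)

// As soon as the edge forces are ready, download them without waiting for the bulk
cudaMemcpyAsync(h_edgeStage, d_edgeForces, edgeBytes,
                cudaMemcpyDeviceToHost, edgeStream);
cudaStreamSynchronize(edgeStream);
// ... GPUThread barrier: wait until the neighboring devices have downloaded theirs ...
cudaMemcpyAsync(d_externalForces, h_neighborStage, edgeBytes,
                cudaMemcpyHostToDevice, edgeStream);
cudaStreamSynchronize(edgeStream);      // external forces are now up to date
cudaStreamSynchronize(bulkStream);      // the bulk Forces have finished in the meantime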
[Timeline from 431.755 ms to 531.755 ms, 10 ms grid: Download dt, Sort, MinimumScan, BuildNeibs, Forces, async slice download, async slice upload, Euler]
Actual timeline, with kernel lengths to scale. Only one GPU with two neighboring devices is shown. The little dots mark the moments at which operations were issued on the CPU.
The timeline has been produced with a custom visualization tool
Overall simulator design:

GPUThread::runMultiGPU()
    Allocate CPU buffers
    Compute first split
    Launch SimulationThreads
    while (running)
        BARRIER
        choose global min. dt
        t += dt
        update moving boundaries
        check if need save_request
        check if simulation is over
        check balancing
        if (unbalanced) compute new balance
        BARRIER
    Deallocate CPU buffers

GPUThread::simulationThread()
    Allocate GPU buffers
    Upload data to GPU
    while (running)
        if (balancing_requests) give / take slices
        if (buildneibs_time)
            kernel_sort
            download internal slices
            GPUThread BARRIER
            upload external slices
            kernel_buildneibs
        foreach of 2 steps
            kernel_forces_async
            kernel_find_min_dt
            kernel_euler
        if (save_request) dump_subdomain
        BARRIER
        BARRIER
    Deallocate GPU buffers

One simulationThread (CPU thread) per device
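A minimal sketch of the "one CPU thread per device" pattern, using std::thread for illustration; the actual implementation uses its own GPUThread class and barriers.

#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Each host thread binds to one GPU; all CUDA calls it issues target that device.
void simulationThread(int device) {
    cudaSetDevice(device);
    // allocate GPU buffers, upload the subdomain, then run the per-iteration loop
    // (sort / buildneibs / forces_async / euler), synchronizing with the other
    // threads at the barriers shown above
}

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    std::vector<std::thread> threads;
    for (int d = 0; d < numDevices; d++)
        threads.emplace_back(simulationThread, d);
    for (auto &t : threads)
        t.join();
    return 0;
}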
Algorithm
Load balancing
It is important to balance the workload, as the fluid topology can greatly influence the overall performance.
[Chart: per-GPU workload for GPU1-GPU4, before and after load balancing (LB), compared with the average]
AG ← global avg. duration of kernel Forces
Ts ← avg. duration of one slice = AG / num. slices
foreach GPU
    Ag ← self avg. duration of kernel Forces
    delta ← Ag - AG
    if (abs(delta) > Ts * HLB_THRESHOLD)
        if (delta > 0) mark as GIVING_CANDIDATE
        else mark as TAKING_CANDIDATE
foreach GPU
    if a couple GIVING/TAKING is found
        mark GIVING_CANDIDATE as GIVING
        mark TAKING_CANDIDATE as TAKING
        foreach intermediate GPU
            mark as GIVING and TAKING
        break
[Diagram: GPU1-GPU4 workloads compared with the average; slices are moved from devices marked GIVE to devices marked TAKE]
Simple policy:
- One transfer at a time
- Time for one slice computed from the average
- Transfer between the first detected giving/taking pair
- Static threshold (no check for local minima nor ping-pong transfers)
BoreInABox
[Snapshots at t = 0.0 s, 0.2 s, 0.5 s, 0.64 s, 1.04 s, 1.68 s]
BoreInABox corridor variant
[Snapshots at t = 0.0 s, 0.16 s, 0.32 s, 0.48 s, 0.8 s, 2.0 s]
Videos available for download at: http://www.dmi.unict.it/~rustico/sphvideos/
Hardware platform
- TYAN-FT72 rack
- Dual Xeon, 16 cores
- 16 GB RAM
- 6 × GTX480 cards on PCIe 2.0
Performance improvement
Results - DamBreak3D (symmetric problem)
[Chart: total simulation time (s), time with load balancing (s) and ideal time (s) for 1 to 6 GPUs]
Results - BoreInABox (asymmetric problem)
[Chart: total simulation time (s), time with load balancing (s) and ideal time (s) for 1 to 6 GPUs]
Measuring the throughput
A possible performance metric abstracting from the number of particles and the simulation settings is the number of iterations computed, times the number of particles, divided by the execution time in seconds. In thousands, we can therefore measure kip/s (kilo-iterations times particles per second)
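In formula form, with an illustrative (made-up) example rather than a measured one:

\mathrm{kip/s} = \frac{\text{iterations} \times \text{particles}}{1000 \times \text{execution time [s]}}

For instance, 10,000 iterations of a 100,000-particle simulation completed in 100 s yield (10,000 × 100,000) / (1000 × 100) = 10,000 kip/s.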
Kip/s, with and without load balancing, with a symmetric problem (DamBreak3D):

DamBreak3D    Kip/s     LB kip/s    Ideal kip/s
1 GPU          9,977     -           -
2 GPUs        19,149    19,380      19,955
3 GPUs        27,802    27,191      29,932
4 GPUs        35,213    35,336      39,910
5 GPUs        39,784    42,578      49,887
6 GPUs        45,599    49,491      59,865

With an asymmetric problem (BoreInABox):

BoreInABox    Kip/s     LB kip/s    Ideal kip/s
1 GPU          8,770     -           -
2 GPUs        12,713    16,940      17,541
3 GPUs        16,275    24,115      26,311
4 GPUs        20,649    30,548      35,082
5 GPUs        25,657    36,800      43,852
6 GPUs        29,418    41,745      52,623
Measuring the efficiency
A common starting point to analyze a parallel implementation is to model the problem as composed of a parallelizable part and a serial, non-parallelizable one. Let α be the parallelizable fraction of the problem and β the non-parallelizable one, so that α + β = 1, with α, β non-negative real numbers.
Amdahl's law measures the speedup in terms of execution time and gives an upper bound inversely proportional to β. With a slight change of notation, Amdahl's law can be written as

S_A(N) = \frac{1}{\beta + \frac{\alpha}{N}}

where N is the number of cores used and S_A is the expected speedup.
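For instance (an illustrative calculation, not a measurement), with a parallelizable fraction α = 0.95 and N = 6 devices:

S_A(6) = \frac{1}{0.05 + 0.95/6} \approx 4.8

so even a small serial fraction caps the achievable speedup well below 6.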
[Chart: Amdahl's law speedup vs. number of cores (powers of two up to 65536) for parallel portions α = 0.50, 0.75, 0.90, 0.95]
The Karp–Flatt metric was proposed in 1990 as a measure of the efficiency of the parallel implementation of a problem. It is an experimentally determined estimate of the serial fraction β of a problem:

\beta_{KF} = \frac{\frac{1}{S_e} - \frac{1}{N}}{1 - \frac{1}{N}}

where S_e is the empirically measured speedup and N is the number of cores (GPUs). An efficient parallelization presents a constant β.
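For example, with illustrative numbers (not taken from the measurements reported below): a measured speedup S_e = 1.9 on N = 2 GPUs gives

\beta_{KF} = \frac{1/1.9 - 1/2}{1 - 1/2} \approx 0.053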
Measured β, with and without load balancing, with a symmetric problem (DamBreak3D):

DamBreak3D    β_KF, no LB    β_KF, LB
2 GPUs        0.053          0.053
3 GPUs        0.036          0.056
4 GPUs        0.048          0.037
5 GPUs        0.063          0.041
6 GPUs        0.061          0.040

With an asymmetric problem (BoreInABox):

BoreInABox    β_KF, no LB    β_KF, LB
2 GPUs        0.429          0.053
3 GPUs        0.289          0.056
4 GPUs        0.222          0.048
5 GPUs        0.181          0.048
6 GPUs        0.153          0.050
Future work
- Extend to GPU clusters, achieving a third level of parallelism (what is the best split?)
- Smarter load balancing (adaptive threshold and time analysis)
References
E. Rustico, G. Bilotta, A. Hérault, C. Del Negro, G. Gallo, Smoothed Particle Hydrodynamics simulations on multi-GPU systems, 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Special Session on GPU Computing and Hybrid Computing, 2012, Garching/Munich (DE)
G. Bilotta, E. Rustico, A. Hérault, A. Vicari, C. Del Negro and G. Gallo, Porting and optimizing MAGFLOW on CUDA, Annals of Geophysics, Vol. 54, n. 5, doi: 10.4401/ag-5341
A. Hérault, G. Bilotta, E. Rustico, A. Vicari and C. Del Negro, Numerical simulation of lava flow using a GPU SPH model, Annals of Geophysics, Vol. 54, n. 5, doi: 10.4401/ag-5343
A. Hérault, G. Bilotta, and R. A. Dalrymple, SPH on GPU with CUDA, Journal of Hydraulic Research, 48(Extra Issue):74–79, 2010
J. J. Monaghan, Smoothed particle hydrodynamics, Annual Review of Astronomy and Astrophysics, 30:543–574, 1992
Thank you ...that’s all, folks!
Eugenio Rustico http://www.dmi.unict.it/~rustico
[email protected]