Smoothed Particle Hydrodynamics simulations on multi-GPU systems
E. Rustico, G. Bilotta, A. Hérault, C. Del Negro and G. Gallo
Università di Catania, INGV Osservatorio Etneo, CNAM (Paris)
PDP 2012 - Garching, February 16, 2012
Outline
1. CUDA
2. SPH
3. Single-GPU SPH
4. Multi-GPU SPH
5. Load balancing
6. Videos
7. Results
8. Analysis
9. Conclusions
Evolution of GPUs
[Images from 1981, 1992, 2001 and 2011]
"Same" game, slightly different computational requirements...
[Chart: peak GFLOP/s of NVIDIA GPUs (NV30, NV35, NV40, G70, G71, G80, G80 Ultra, G92, GT200, Fermi) vs. Intel CPUs (3.0 GHz Core 2 Duo, 3.2 GHz Harpertown, Core i7-920, Arrandale), 2003-2010]
Gaming has been the main driving factor behind GPU computational power. GPUs are nowadays faster than CPUs for specific kinds of computations.
GPU programming Timeline
- 1980s: first Graphics Processing Units
- 1990s: almost every computer has its own graphics accelerator
- 2001: release of the first programmable shader; researchers mapped non-graphics problems onto graphics ones in order to exploit the computational power of GPUs for general-purpose problems
- 2007: release of CUDA, a software and hardware platform with explicit support for general-purpose programmability
[Diagram: GPU architecture - several multiprocessors, each with its own instruction unit, ALUs, shared memory and registers, all accessing the global and constant/texture memory; the board is connected to the host via PCIe]
A multi-core GPU can be seen as a parallel computer with shared memory
CUDA
Why CUDA?
- Simple programming language: an extension of C++
- Simple programming model: it is easy to quickly gain a consistent speedup
- Cheap hardware (CUDA-capable hardware since 2007; 3 TFLOPS for less than 1,000 €)
- Various optimized libraries available for common tasks (sorting, selection, linear algebra, FFT, image processing, computer vision...)
- ...and aggressive marketing! In November 2011, three CUDA-enabled clusters were among the top 5 of the TOP500
Why not CUDA?
- Bound to a specific manufacturer
- No complete abstraction from the hardware (unlike OpenCL)
- Theoretical peak power often smaller than the main competitor's (ATI)
Example: array initialization, serial CPU code
for (i = 0; i < 4096; i++)
    cpu_array[i] = i;
Example: array initialization, parallel GPU code
#define BLKSIZE 64
...
__global__ void init_kernel(int *gpuPointer) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    gpuPointer[index] = index;
}
...
init_kernel<<<4096 / BLKSIZE, BLKSIZE>>>(gpu_array);
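For completeness, a minimal host-side sketch of how such a kernel is typically allocated, launched and read back; the cudaMalloc/cudaMemcpy calls and the main() wrapper are an assumption for illustration, not the presentation's code.

#include <cstdio>
#include <cuda_runtime.h>

#define BLKSIZE 64
#define N 4096

__global__ void init_kernel(int *gpuPointer) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    gpuPointer[index] = index;
}

int main() {
    int *gpu_array = NULL;
    int cpu_array[N];

    cudaMalloc((void **)&gpu_array, N * sizeof(int));   // allocate device memory
    init_kernel<<<N / BLKSIZE, BLKSIZE>>>(gpu_array);   // one thread per element
    cudaMemcpy(cpu_array, gpu_array, N * sizeof(int),
               cudaMemcpyDeviceToHost);                 // copy the result back
    printf("cpu_array[42] = %d\n", cpu_array[42]);
    cudaFree(gpu_array);
    return 0;
}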
Model
SPH
Smoothed Particle Hydrodynamics is a mesh-free, Lagrangian numerical method for fluid modeling, originally developed by Gingold, Monaghan and Lucy for astrophysical simulations.
The fluid is discretized into a set of arbitrarily distributed particles, each carrying scalar and vector properties (mass, density, position, velocity, etc.).
It is possible to interpolate the value of a field A at any point r of the domain by convolving the field values with a smoothing kernel W :
A(\mathbf{r}) \simeq \sum_j \frac{m_j}{\rho_j} A_j \, W(|\mathbf{r} - \mathbf{r}_j|, h)
where
- m_j is the mass of particle j
- A_j is the value of the field A for particle j
- ρ_j is the density associated with particle j
- r denotes position
- W is the smoothing kernel function, chosen to have compact support with radius proportional to a given smoothing length h
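As an illustration of how this summation maps to per-particle GPU code, a minimal sketch, assuming a precomputed fixed-stride neighbor list and an unnormalized cubic-spline kernel; all names and the data layout are hypothetical, not the GPUSPH implementation.

// Illustrative smoothing kernel (cubic-spline shape, normalization omitted for brevity)
__device__ float W(float r, float h) {
    float q = r / h;
    if (q >= 2.0f) return 0.0f;
    if (q < 1.0f)  return 1.0f - 1.5f * q * q + 0.75f * q * q * q;
    float t = 2.0f - q;
    return 0.25f * t * t * t;
}

// One thread per particle: A(r_i) ~= sum_j (m_j / rho_j) A_j W(|r_i - r_j|, h)
__global__ void interpolate_field(const float4 *pos,       // xyz position, mass in .w
                                  const float  *rho,       // densities
                                  const float  *A,         // field values
                                  const int    *neibList,  // maxNeibs indices per particle, -1 = end
                                  int maxNeibs, int numParticles, float h,
                                  float *Aout)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 pi = pos[i];
    float acc = 0.0f;
    for (int n = 0; n < maxNeibs; n++) {
        int j = neibList[i * maxNeibs + n];
        if (j < 0) break;                      // end-of-list sentinel
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r = sqrtf(dx * dx + dy * dy + dz * dz);
        acc += (pj.w / rho[j]) * A[j] * W(r, h);
    }
    Aout[i] = acc;
}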
As the smoothing kernel W has compact support, each particle only interacts with a small set of neighbors. Because each particle accesses its neighbors multiple times during an iteration, it is convenient to compute the neighbor list once and store it.
Each iteration of the simulation consists of the following steps:
1. For each particle, compute and store the list of neighbors
2. For each particle, compute the interaction with all neighbors and compute the resulting force and the maximum allowable δt
3. Find the minimum allowable δt (to be used in the next iteration)
4. For each particle, integrate the force over the δt and compute the resulting position and velocity
The current implementation actually relies on a predictor-corrector integration scheme with adaptive δt, where steps 2-4 are performed twice per iteration.
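A host-side sketch of one such iteration, assuming one CUDA kernel per step; kernel names, arguments and the d_* buffers are hypothetical, and only a single integration pass is shown instead of the predictor-corrector pair.

// Hypothetical per-iteration host loop; the kernels and the d_* device buffers
// are assumed to be declared and initialized elsewhere.
while (t < t_end) {
    buildneibs<<<grid, block>>>(d_pos, d_neibList);          // 1. neighbor lists
    forces<<<grid, block>>>(d_pos, d_vel, d_neibList,
                            d_forces, d_dtCandidates);       // 2. forces + per-particle max dt
    float dt = minimum_scan(d_dtCandidates, numParticles);   // 3. global minimum dt (e.g. a CUDPP scan)
    euler<<<grid, block>>>(d_pos, d_vel, d_forces, dt);      // 4. integrate positions and velocities
    t += dt;
}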
Kernels
Single-GPU SPH
The SPH model exposes a high degree of parallelism, and its steps can be straightforwardly implemented as GPU functions (i.e. CUDA kernels):
1. BuildNeibs - computed every k-th iteration
2. Forces - the most expensive step
3. MinimumScan - computed with the CUDPP minimum scan
4. Euler - the fastest step (an illustrative sketch follows this list)
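For illustration, a minimal sketch of what the Euler step may look like as a CUDA kernel; the argument layout and the plain explicit update are assumptions, not the actual GPUSPH kernel.

// One thread per particle: advance velocity and position by dt.
__global__ void euler(float4 *pos, float4 *vel, const float4 *forces,
                      float dt, int numParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    float4 f = forces[i];                 // acceleration computed by the Forces kernel
    float4 v = vel[i];
    v.x += f.x * dt;  v.y += f.y * dt;  v.z += f.z * dt;
    pos[i].x += v.x * dt;  pos[i].y += v.y * dt;  pos[i].z += v.z * dt;
    vel[i] = v;
}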
Fast Neighbor Search
The naïve algorithm to gather the neighbor list of each particle is O(n²) (quadratic in the number of particles). While the SPH model is gridless, organizing the particles in a virtual grid with cells as wide as the influence radius speeds up the search: only particles in the same cell or in adjacent cells need to be tested. Cells are completely transparent to the SPH model.
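To make the grid idea concrete, a minimal sketch of the kind of cell indexing typically used for this search (names and axis ordering are assumptions): each particle gets the linear index of the cell containing it, particles are sorted by this index, and the search only scans the surrounding cells.

// Hypothetical cell hash: map a particle position to the linear index of its cell.
// cellSide equals the influence radius, so all neighbors of a particle lie in its
// own cell or in one of the 26 adjacent cells.
__host__ __device__ inline unsigned int cellHash(float3 p, float3 worldOrigin,
                                                 float cellSide, uint3 gridSize)
{
    unsigned int cx = (unsigned int)((p.x - worldOrigin.x) / cellSide);
    unsigned int cy = (unsigned int)((p.y - worldOrigin.y) / cellSide);
    unsigned int cz = (unsigned int)((p.z - worldOrigin.z) / cellSide);
    return (cz * gridSize.y + cy) * gridSize.x + cx;   // linearized cell index
}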
CPU vs. GPU
Differences with respect to the CPU implementation:
- Data are stored in structure-of-arrays rather than array-of-structures fashion, to improve memory coalescing (cf. the CUDA Best Practices Guide); see the layout sketch below
- Each interaction is computed twice, as on the GPU this is faster than dealing with concurrent writes
- Negligible numerical differences: the GPU uses single-precision floats by default, and the order of additions may differ
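To illustrate the layout difference (field names and sizes are hypothetical):

#define N_PARTICLES 4096

// Array of structures (typical CPU layout): threads of a warp reading pos[i]
// touch memory locations that are far apart.
struct ParticleAoS { float4 pos; float4 vel; float rho; };
ParticleAoS particlesAoS[N_PARTICLES];

// Structure of arrays (GPU layout): thread i reads pos[i], so a warp accesses
// a contiguous, coalesced segment of memory.
struct ParticlesSoA {
    float4 pos[N_PARTICLES];
    float4 vel[N_PARTICLES];
    float  rho[N_PARTICLES];
};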
Results
The GPU-based implementation reached a 100× speedup over the reference CPU code (serial, single thread). Details and results in A. Hérault, G. Bilotta, and R. A. Dalrymple, SPH on GPU with CUDA, Journal of Hydraulic Research, 48(Extra Issue):74–79, 2010
Open source implementation: GPU-SPH www.ce.jhu.edu/dalrymple/GPUSPH
Motivation
Multi-GPU SPH
Why exploit multiple devices simultaneously?
- Performance: faster simulations
- Problem size: bigger simulations
How to exploit multiple devices simultaneously?
- Distribute the computational phases (pipeline; not scalable)
- Distribute the problem domain (scalable; boundary conditions must be handled)
- Greedy list-split (unfeasible: no guarantees for neighbors)
- Cell-based split (list-split + ordering)
Splitting the problem
To split the domain, we exploit the virtual grid used for the fast neighbor search.
Particles are enumerated in linear memory according to the Cartesian axis chosen for the split.
[Figure: two example splits of the simulation domain among GPU 0 through GPU 3]
Avoid multiple split planes in the same simulation.
[Figure: adjacent devices D and D+1 spanning slices s_n, s_n+1, s_n+2, s_n+3; each device holds its internal slices plus a read-only external copy of the neighboring device's edge slice (particles and cells shown)]
External particles: read-only copies from neighboring devices, needed to compute the forces for edge particles.
Internal particles: the device's own particles, needed by neighboring devices.
Total overlap: two cells, to be updated after every forces computation.
Up to CUDA 3.2 there is no direct device-to-device communication: the exchange is staged through a CPU buffer and requires multiple transfers, as sketched below.
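A minimal sketch of the staged transfer this implies, using a pinned host buffer; buffer names and sizes are hypothetical, and this only illustrates the pre-CUDA-4.0 pattern described above, not the exact GPUSPH code.

// Exchange one edge slice between device 0 and device 1 through the host.
// d0_edgeSlice, d1_externalSlice and sliceParticles are assumed to exist.
size_t sliceBytes = sliceParticles * sizeof(float4);
float4 *hostStage = NULL;
cudaMallocHost((void **)&hostStage, sliceBytes);    // pinned buffer for faster copies

cudaSetDevice(0);                                   // download the edge slice from device 0
cudaMemcpy(hostStage, d0_edgeSlice, sliceBytes, cudaMemcpyDeviceToHost);

cudaSetDevice(1);                                   // upload it as external particles on device 1
cudaMemcpy(d1_externalSlice, hostStage, sliceBytes, cudaMemcpyHostToDevice);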
1. BuildNeibs - reads all particles, writes only for internal ones
2. Forces - ditto
3. MinimumScan - reads only internal particles
4. Euler - writes both internal and external particles
Plus: after each execution of Forces, the overlapping borders must be exchanged with the neighboring devices. We need a technique to hide those transfers, or the overhead will "kill" the simulation.
Hiding transfers
[Diagram: kernel_forces_async issuing work on three GPU streams A, B, C]
kernel_forces_async:
- Launch kernel_forces on the edging internal slices (1, 2)
- Launch kernel_forces on the rest of the slices (3)
- Launch the download of the internal slices (forces)
- GPUThread BARRIER
- Launch the upload of the external slices (forces)
Key idea: exchange the slices as soon as they are ready, while the forces of the internal particles are still being computed, by using the asynchronous API. This know-how was acquired with a multi-GPU Cellular Automaton (MAGFLOW).
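A minimal sketch of how such an overlap can be expressed with the asynchronous CUDA API; the stream assignment, names and synchronization points are assumptions, not the exact GPUSPH code.

// Hypothetical sketch: overlap the border exchange with the bulk of the Forces
// computation. The forces kernel, grid/block sizes and all buffers (device d_*,
// pinned host h_*) are assumed to be declared elsewhere.
cudaStream_t edgeStream, bulkStream;
cudaStreamCreate(&edgeStream);
cudaStreamCreate(&bulkStream);

forces<<<edgeGrid, block, 0, edgeStream>>>(d_particles, edgeFirst, edgeCount);  // edging internal slices (1, 2)
forces<<<bulkGrid, block, 0, bulkStream>>>(d_particles, bulkFirst, bulkCount);  // rest of the slices (3)

// As soon as the edge forces are ready, download them without waiting for the bulk
cudaMemcpyAsync(h_edgeStage, d_edgeForces, edgeBytes,
                cudaMemcpyDeviceToHost, edgeStream);
cudaStreamSynchronize(edgeStream);
// ... GPUThread barrier: wait until the neighboring devices have downloaded theirs ...
cudaMemcpyAsync(d_externalForces, h_neighborStage, edgeBytes,
                cudaMemcpyHostToDevice, edgeStream);
cudaStreamSynchronize(edgeStream);      // external forces are now up to date
cudaStreamSynchronize(bulkStream);      // the bulk Forces have finished in the meantime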
[Timeline from 431.755 ms to 531.755 ms, 10 ms grid: Download dt, Sort, MinimumScan, BuildNeibs, Forces, async slice download, async slice upload, Euler]
Actual timeline, with kernel lengths to scale. Only one GPU with two neighboring devices is shown. The little dots mark the moments at which operations were issued on the CPU.
The timeline has been produced with a custom visualization tool
Overall simulator design:

GPUThread::runMultiGPU()
    Allocate CPU buffers
    Compute first split
    Launch SimulationThreads
    while (running)
        BARRIER
        choose global min. dt
        t += dt
        update moving boundaries
        check if need save_request
        check if simulation is over
        check balancing
        if (unbalanced) compute new balance
        BARRIER
    Deallocate CPU buffers

GPUThread::simulationThread()
    Allocate GPU buffers
    Upload data to GPU
    while (running)
        if (balancing_requests) give / take slices
        if (buildneibs_time)
            kernel_sort
            download internal slices
            GPUThread BARRIER
            upload external slices
            kernel_buildneibs
        foreach of 2 steps
            kernel_forces_async
            kernel_find_min_dt
            kernel_euler
        if (save_request) dump_subdomain
        BARRIER
        BARRIER
    Deallocate GPU buffers

One simulationThread (CPU thread) per device
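A minimal sketch of the "one CPU thread per device" pattern, using std::thread for illustration; the actual implementation uses its own GPUThread class and barriers.

#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Each host thread binds to one GPU; all CUDA calls it issues target that device.
void simulationThread(int device) {
    cudaSetDevice(device);
    // allocate GPU buffers, upload the subdomain, then run the per-iteration loop
    // (sort / buildneibs / forces_async / euler), synchronizing with the other
    // threads at the barriers shown above
}

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    std::vector<std::thread> threads;
    for (int d = 0; d < numDevices; d++)
        threads.emplace_back(simulationThread, d);
    for (auto &t : threads)
        t.join();
    return 0;
}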
Algorithm
Load balancing
It is important to balance the workload, as the fluid topology can greatly influence the overall performance.
[Chart: per-GPU workload for GPU1-GPU4, before and after load balancing (LB), compared with the average]
AG ← global avg. duration of kernel Forces
Ts ← avg. duration of one slice = AG / num. slices
foreach GPU
    Ag ← self avg. duration of kernel Forces
    delta ← Ag - AG
    if (abs(delta) > Ts * HLB_THRESHOLD)
        if (delta > 0) mark as GIVING_CANDIDATE
        else mark as TAKING_CANDIDATE
foreach GPU
    if a couple GIVING/TAKING is found
        mark GIVING_CANDIDATE as GIVING
        mark TAKING_CANDIDATE as TAKING
        foreach intermediate GPU
            mark as GIVING and TAKING
        break
[Diagram: GPU1-GPU4 workloads compared with the average; slices are moved from devices marked GIVE to devices marked TAKE]
Simple policy:
- One transfer at a time
- Time for one slice computed from the average
- Transfer between the first detected giving/taking pair
- Static threshold (no check for local minima nor ping-pong transfers)
BoreInABox
[Snapshots at t = 0.0 s, 0.2 s, 0.5 s, 0.64 s, 1.04 s, 1.68 s]
BoreInABox corridor variant
[Snapshots at t = 0.0 s, 0.16 s, 0.32 s, 0.48 s, 0.8 s, 2.0 s]
Videos available for download at: http://www.dmi.unict.it/~rustico/sphvideos/
Hardware platform
- TYAN-FT72 rack
- Dual Xeon, 16 cores
- 16 GB RAM
- 6 × GTX480 cards on PCIe 2.0
Performance improvement
Results - DamBreak3D (symmetric problem)
[Chart: total simulation time (s), time with load balancing (s) and ideal time (s) for 1 to 6 GPUs]
Results - BoreInABox (asymmetric problem)
[Chart: total simulation time (s), time with load balancing (s) and ideal time (s) for 1 to 6 GPUs]
Measuring the throughput
A possible performance metric abstracting from the number of particles and the simulation settings is the number of iterations computed, times the number of particles, divided by the execution time in seconds. In thousands, we can therefore measure kip/s (kilo-iterations times particles per second)
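In formula form, with an illustrative (made-up) example rather than a measured one:

\mathrm{kip/s} = \frac{\text{iterations} \times \text{particles}}{1000 \times \text{execution time [s]}}

For instance, 10,000 iterations of a 100,000-particle simulation completed in 100 s yield (10,000 × 100,000) / (1000 × 100) = 10,000 kip/s.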
Kip/s, with and without load balancing, with a symmetric problem (DamBreak3D):

DamBreak3D    Kip/s     LB kip/s    Ideal kip/s
1 GPU          9,977     -           -
2 GPUs        19,149    19,380      19,955
3 GPUs        27,802    27,191      29,932
4 GPUs        35,213    35,336      39,910
5 GPUs        39,784    42,578      49,887
6 GPUs        45,599    49,491      59,865

With an asymmetric problem (BoreInABox):

BoreInABox    Kip/s     LB kip/s    Ideal kip/s
1 GPU          8,770     -           -
2 GPUs        12,713    16,940      17,541
3 GPUs        16,275    24,115      26,311
4 GPUs        20,649    30,548      35,082
5 GPUs        25,657    36,800      43,852
6 GPUs        29,418    41,745      52,623
Measuring the efficiency
A common starting point to analyze a parallel implementation is to model the problem as composed of a parallelizable part and a serial, non-parallelizable one. Let α be the parallelizable fraction of the problem and β the non-parallelizable one, so that α + β = 1, with α, β non-negative real numbers.
Amdahl's law measures the speedup in terms of execution time and gives an upper bound inversely proportional to β. With a slight change of notation, Amdahl's law can be written as

S_A(N) = \frac{1}{\beta + \frac{\alpha}{N}}

where N is the number of cores used and S_A is the expected speedup.
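For instance (an illustrative calculation, not a measurement), with a parallelizable fraction α = 0.95 and N = 6 devices:

S_A(6) = \frac{1}{0.05 + 0.95/6} \approx 4.8

so even a small serial fraction caps the achievable speedup well below 6.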
[Chart: Amdahl's law speedup vs. number of cores (powers of two up to 65536) for parallel portions α = 0.50, 0.75, 0.90, 0.95]
The Karp–Flatt metric was proposed in 1990 as a measure of the efficiency of the parallel implementation of a problem. It is an experimentally determined estimate of the serial fraction β of a problem:

\beta_{KF} = \frac{\frac{1}{S_e} - \frac{1}{N}}{1 - \frac{1}{N}}

where S_e is the empirically measured speedup and N is the number of cores (GPUs). An efficient parallelization presents a constant β.
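For example, with illustrative numbers (not taken from the measurements reported below): a measured speedup S_e = 1.9 on N = 2 GPUs gives

\beta_{KF} = \frac{1/1.9 - 1/2}{1 - 1/2} \approx 0.053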
Measured β, with and without load balancing, with a symmetric problem (DamBreak3D):

DamBreak3D    β_KF, no LB    β_KF, LB
2 GPUs        0.053          0.053
3 GPUs        0.036          0.056
4 GPUs        0.048          0.037
5 GPUs        0.063          0.041
6 GPUs        0.061          0.040

With an asymmetric problem (BoreInABox):

BoreInABox    β_KF, no LB    β_KF, LB
2 GPUs        0.429          0.053
3 GPUs        0.289          0.056
4 GPUs        0.222          0.048
5 GPUs        0.181          0.048
6 GPUs        0.153          0.050
Future work
- Extend to GPU clusters, achieving a third level of parallelism (what is the best split?)
- Smarter load balancing (adaptive threshold and time analysis)
References
E. Rustico, G. Bilotta, A. Hérault, C. Del Negro, G. Gallo, Smoothed Particle Hydrodynamics simulations on multi-GPU systems, 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Special Session on GPU Computing and Hybrid Computing, 2012, Garching/Munich (DE)
G. Bilotta, E. Rustico, A. Hérault, A. Vicari, C. Del Negro and G. Gallo, Porting and optimizing MAGFLOW on CUDA, Annals of Geophysics, Vol. 54, n. 5, doi: 10.4401/ag-5341
A. Hérault, G. Bilotta, E. Rustico, A. Vicari and C. Del Negro, Numerical simulation of lava flow using a GPU SPH model, Annals of Geophysics, Vol. 54, n. 5, doi: 10.4401/ag-5343
A. Hérault, G. Bilotta, and R. A. Dalrymple, SPH on GPU with CUDA, Journal of Hydraulic Research, 48(Extra Issue):74–79, 2010
J. J. Monaghan, Smoothed particle hydrodynamics, Annual Review of Astronomy and Astrophysics, 30:543–574, 1992
Thank you ...that’s all, folks!
Eugenio Rustico http://www.dmi.unict.it/~rustico
[email protected]