GPU Computing with CUDA Lecture 10 ... - Boston University

72 downloads 104 Views 1MB Size Report
‣Tomorrow's lab: implement O(N2) N body calculation on GPU. ‣Nyland L., Harris M., Prins J. “Fast N-Body Simulation with CUDA”. GPU Gems 3, Chapter 31.
GPU Computing with CUDA Lecture 10 - Applications - N body problem

Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1

Outline of lecture ‣ What is an N body problem? ‣ What kind of problems can be solved with N body approach? ‣ Fast calculation algorithms ‣ N-body problem on GPU

2

Introduction ‣ N body problem calculates the interaction between N bodies

xj xi

x = xi − xj

! |x|ij = (xi − xj )2 + (yi − yj )2 + (zi − zj )2

‣ Classic example: Gravitational Potential

N ! mj φi = |x| ij j=0

∇φ = g 3

Introduction ‣ Useful for anything with Green’s function! ‣ Poisson ∇ φ = f - Astrophysics - Fluid mechanics - Electrostatics 2

!

!



φ∇2 GdΩ +



‣ Helmholtz ∇ φ + k φ = f - Acoustics - Electromagnetics 2

2

G∇2 φdΩ = !

Γ

!

Gf dΩ



n · [G∇φ − (∇G)φ]dΓ =

φ=

!

!

Gf dΩ



Gf dΩ



‣ Poisson- Boltzmann ∇(!∇φ) + k 2 φ = f - Geophysics - Biophysics

4

Introduction ‣ Radial Basis Function interpolation

u(x) =

N ! i=0

αi φ(|x − xi |)

5

Order of N body problems ‣ Direct calculation is O(N2) N ! mj Φi = |x| ij j=0

6

Fast algorithms ‣ Multipole expansions

7

Fast algorithms Independent of i

Independent of j

8

Fast algorithms

9

Fast algorithms ‣ Tree Code - Barnes and Hut ‣ Fast Multipole Methods (FMM) - Greengard and Rokhlin

Tree code

FMM 10

Fast algorithms ‣ Flow of calculation

11

N body on GPU ‣ Tomorrow’s lab: implement O(N2) N body calculation on GPU ‣ Nyland L., Harris M., Prins J. “Fast N-Body Simulation with CUDA”. GPU Gems 3, Chapter 31. ‣ Lots of operations per load ‣ Much faster, but still O(N2)!

12

N body on GPU ‣ Look at the problem as a matrix vector product N ! mj Φi = |x| ij j=0

Φi =

N ! j=0

mj (|x|ij + !2 )

‣ Each thread will compute one row (not one element)

13

N body on GPU ‣ Data reuse: Tiling - Load tiles to shared memory for data reuse

14

N body on GPU ‣ Kernel Allocate shared memory Start loop on blocks Load to shared memory Start loop on elements Do sum 15

N body on GPU - Optimization ‣ Loop unrolling - Avoid unnecessary operation ‐ #pragma
unroll
32

‣ Separate last loop - The number of elements might not be multiple of the block size - Separate last loop to avoid unnecessary warps performing a calculation

‣ Vary block size

16

N body on GPU - Performance metric ‣ Compute bounded problem - Performance in FLOPS/s - Count number of floating point operations and divide by kernel execution time

17