High Resolution Fluid Dynamics Simulations on a GPU Cluster

High Resolution Fluid Dynamics Simulations on a GPU Cluster

Pierre Kestener¹, Alexei Kritsuk³, Sébastien Fromang², Patrick Hennebelle²

¹ CEA Saclay, DSM, Maison de la Simulation
² CEA Saclay, DSM, Service d'Astrophysique
³ UCSD, Laboratory for Computational Astrophysics

GPU Technology Conference (GTC) 2014, San Jose, March 26, 2014

1 / 44

Content
- Astrophysics motivation - protoplanetary disk simulations
- GPU implementation of high-order finite volume schemes
  - MHD using Constrained Transport for the magnetic field update
  - shearing box border condition
  - applications: MRI and interstellar turbulence

Optimization strategies
- Kepler optimization: registers and read-only data cache
- memory footprint optimization

Performance measurements on real-case applications
- Magneto-Rotational Instability (MRI) on the CURIE hybrid partition (up to 256 GPUs)
- MHD turbulence on the KEENELAND system (up to 500 GPUs)

Perspectives / future applications and developments

2 / 44

Motivation - Stencil applications - science cases
Porting finite volume / finite difference schemes (on regular grids) to GPU is reportedly easy, but what is the impact on performance of:
- complex numerical schemes: CFD can have a large number of FLOPS / cell
- data precision: single / double
- intrinsics instructions: e.g. fast math
- algorithm: split / unsplit schemes, 2D vs 3D
- physics: hydro / MHD / Riemann solver
- other numerical parts: shear border condition

Present work: take the full MHD numerical scheme (from the astrophysics code RAMSES) and port it to GPU; all refactoring allowed / put everything on the GPU.
Make the code production-ready for science cases on existing large GPU clusters:
- magneto-rotational instability (MRI)
- interstellar turbulence

3 / 44

Code RAMSES-GPU: MHD astrophysics applications
RAMSES-GPU is developed in CUDA/C for astrophysics MHD applications on regular grids

http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/
∼ 45k lines of code (out of which ∼ 10k in CUDA)
Science case applications:
- MRI in an accretion disk (Pm = 4): ∼ 500,000 h on CURIE (256 GPUs) at 800 × 1600 × 800
- MHD driven turbulence (Mach ∼ 10): ∼ 150,000 h on KEENELAND (U. Tennessee), up to resolution 2016³ (486 GPUs)

4 / 44



Summary

1  Astrophysics motivations - HPC applications
   What is MRI (Magneto-rotational Instability)?
   High resource requirements: need for GPU acceleration
   Compressible (M)HD and finite volume methods

2  CPU/GPU performances
   Memory footprint optimization
   MHD
   Conclusion

6 / 44


Accretion disks
Systems with an accretion disk:
- protostellar disks
- accretion disks in binary star systems
- Active Galactic Nuclei (AGN), the most luminous persistent sources in the universe
Understanding accretion disks
[Figure: AGN NGC 4261, source: HST/NASA/ESA]

These objects are difficult to observe. Main problem: matter falls (spirals) towards the central object, but there is a net outward transfer of angular momentum (AM).

Shear viscosity is not sufficient; an effective viscosity due to turbulence can explain the observed mass accretion rates. How is turbulence driven? Estimate AM transfer rates. What about B?

7 / 44


MRI - Shearing box setup
Magneto-Rotational Instability (MRI)

Long-term behavior study ⇒ long simulation runs (several 10⁶ time steps); the numerical scheme must be stable!
Use a Cartesian grid x, y, z to model R, φ, z in the neighborhood of R₀
Write the MHD equations in a rotating frame Ω₀
Differential rotation is modeled by shear border conditions in the radial direction and periodic boundaries in y and z
Ref: Balbus & Hawley, Rev. Mod. Phys. 70, 1-53 (1998)
8 / 44


MRI - Shearing box setup
Shearing-box setup: MHD equations in a rotating frame (Ω₀)
- magnetic field: induction equation
- inertial force terms
- (static) gravitational terms
- dissipative terms (viscosity + resistivity)
- shearing box border conditions (in the radial direction)

∂ρ/∂t + ∇·(ρv) = 0
ρ ∂v/∂t + ρ(v·∇)v + 2ρ Ω₀ × v = (∇ × B) × B / 4π − ∇P + ∇·T + ρg
∂B/∂t = ∇ × (v × B − η ∇ × B)
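For reference, the standard shearing-periodic radial boundary condition used in this kind of shearing-box setup (Hawley, Gammie & Balbus 1995) can be written as below; this is textbook background rather than an equation taken from the slides:

f(x, y, z, t) = f(x + Lx, y − q Ω₀ Lx t, z, t)
v_y(x, y, z, t) = v_y(x + Lx, y − q Ω₀ Lx t, z, t) + q Ω₀ Lx

where f is any fluid variable other than the azimuthal velocity, Lx is the radial box size, and q = −d ln Ω / d ln R (q = 3/2 for a Keplerian disk).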

9 / 44


MRI - Simulations

Movie representing the evolution of the magnetic field component B_y during 1 orbit
resolution is 800 × 1600 × 800, Pm = 4, Re = 25000
1 orbit takes 1 day of wall clock time on 256 GPUs (CURIE system - Fermi M2090)

10 / 44



MHD - MRI - simulation run setup

Technical features of a CPU run of an MHD shearing-box (SB) setup with dissipative terms¹, for a highly resolved MHD (Magneto-Rotational Instability) problem on the French machine TITANE (CCRT, in 2010²):
- resolution: 256 × 512 × 256 ∼ 33M cells
- time steps: 10⁶
- MPI processes: 1024
- wall clock time: 4 days, i.e. ∼ 100 hours (∼ 100,000 compute hours)
- effective performance in cell updates per second: 256 × 512 × 256 × 10⁶ / (100 × 3600) = 93.2M updates/s (that is ∼ 93,200 updates/s per CPU core); today's CPU performance is closer to 170M updates/s

¹ Intel Nehalem, quad-core, 2.93 GHz
² using the RAMSES (CPU) code
12 / 44


MHD - MRI - simulation run setup Problem : double space resolution space resolution : 512 × 1024 × 512 ⇒ total cells number is 8 times larger time steps : 2 ∗ 106 (same dynamics) ⇒ NEED 16 times mode computing resources (1.6Mh), just for one run By keeping the same sub-domain size per MPI proc, we need 8000 MPI proc during 200 hours (one week) ! Or 2000 MPI processes for one month. If perfect weak scaling, ⇒ : 8000 ∗ 93200 = 746Mupdate/s for the entire system

Can we implement a GPU-based solution? What for?
- solve the same problem faster
- solve bigger problems in a smaller amount of time

What would be the required amount of computing resources?
Solution: develop the code RAMSES-GPU using CUDA/C
13 / 44


Godunov finite volume scheme on GPU - stencil algorithms

Pseudo-algorithm for U_t + F(U)_x = 0:
for k do
  for j do
    for i do
      Get hydro state U_{i,j,k}
      Convert to primitive variables using the equation of state
      Compute slopes ∆_i
      Interpolate states at i − 1/2, i + 1/2, evolve by ∆t/2
      Solve the local Riemann problem, compute fluxes F^{n+1/2}_{i+1/2}
      Update U^{n+1}_{i,j,k} = U^n_{i,j,k} + ∆t (F^{n+1/2}_{i−1/2,j,k} − F^{n+1/2}_{i+1/2,j,k}) / ∆x
    end for
  end for
end for

Very large FLOPS / cell / time step: ∼ 700 FLOPS in hydro and ∼ 3500 FLOPS in MHD ⇒ lots of local variables, large register requirements ⇒ potential register spilling problems.
CUDA/C parallelization strategy: the loops over i and j are unfolded; each thread handles one z-column (see P. Micikevicius, ACM, 2009); a minimal kernel sketch is shown below.

Allow some refactoring between the CPU and GPU versions: split the for loops (then intermediate results need to be stored in GPU external memory).
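A minimal sketch of this thread-per-column strategy, applied to a scalar advection equation u_t + a u_x = 0 as a stand-in for the full scheme; the kernel and variable names are illustrative and not taken from RAMSES-GPU:

#include <cuda_runtime.h>

// One thread per (i,j) column; the k-loop runs inside the thread.
// First-order upwind (Godunov) update for u_t + a*u_x = 0, with a > 0.
__global__ void godunov_column_kernel(const float* __restrict__ u,
                                      float* __restrict__ u_new,
                                      int nx, int ny, int nz,
                                      float a, float dt_over_dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= nx - 1 || j >= ny) return;   // skip x ghost cells

    for (int k = 0; k < nz; ++k) {                 // sweep the z-column
        long idx = (long)k * nx * ny + (long)j * nx + i;
        float f_left  = a * u[idx - 1];            // upwind flux at i-1/2
        float f_right = a * u[idx];                // upwind flux at i+1/2
        // conservative update: u^{n+1} = u^n + dt/dx (F_{i-1/2} - F_{i+1/2})
        u_new[idx] = u[idx] + dt_over_dx * (f_left - f_right);
    }
}

// Launch example: a 16x16 thread tile covers the (i,j) plane.
//   dim3 block(16, 16);
//   dim3 grid((nx + 15) / 16, (ny + 15) / 16);
//   godunov_column_kernel<<<grid, block>>>(d_u, d_unew, nx, ny, nz, a, dt / dx);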

14 / 44


Godunov finite volume scheme on GPU - stencil algorithms
[Diagram: Input → kernel1 → Intermediate → kernel2 → Output pipeline, with "fuse / merge kernels" arrows pointing to the alternative single-kernel path Input → kernel3 → Output]
(Same remarks as on the previous slide: large FLOP count per cell, thread-per-z-column parallelization, and the option of splitting the loops at the cost of storing intermediate results in GPU external memory.)

15 / 44


Godunov finite volume scheme on GPU - stencil algorithms

Directionally split scheme ⇒ 1D stencil: only one GPU kernel (low memory storage required)
Directionally unsplit scheme ⇒ 3D stencil: better performance when the computation is split into 3 GPU kernels (larger memory storage)
GPU implementation/parallelization: naturally unfold the for loops over i and j; the loop over z is performed inside one GPU thread (efficient shared memory usage).

16 / 44


Hydrodynamics - GPU implementation

Directionally split hydrodynamics - 1D stencil
The entire scheme can fit into a single kernel using a single loop ⇒ very low memory consumption ⇒ very large sub-domains fit on each GPU

Directionally unsplit hydrodynamics - 3D stencil
Each stage of the algorithm requires more floating-point computation ⇒ more registers needed per thread.
In 2D, using a single kernel gives a 40% performance decrease compared to using 3 kernels (primitive variable conversion; slope and interpolation computations; Riemann fluxes and cell updates). In 3D, the performance decrease is ∼ 30%.
Several CUDA kernels ⇒ better design, much easier to debug, easier to insert a new kernel (e.g. for physics terms) than to modify existing kernels.

17 / 44


Directionally split Hydro - 3D performances
Number of 10⁶ cell updates / second versus domain size (single precision)

size          | GeForce GTX 560 Ti SP/fast | M2090 SP/fast
32x32x32      | 7.3                        | 18.8
64x64x64      | 37.6                       | 83.4
96x96x96      | 68.2                       | 100.7
128x128x128   | 90.6                       | 114.7
192x192x192   | 111.0                      | 133.0
225x225x225   | 117.8                      | 137.2

M2090 SP: 32.1 and 39.5; M2090 DP: 9.2 and 11.1 (reported for two of the domain sizes only).

- pure hydro on GPU is ∼ 10× faster (DP) than a single CPU core
- pure hydro on GPU is ∼ 100× faster (SP/fast) than a single CPU core
- the Tesla M2090 has a much better memory bandwidth
- need one large domain (or many small ones) to fully load the GPU
- huge performance gap when deactivating intrinsics or when using DP (not as strong in MHD)
- there is room for designing mixed-precision algorithms
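As a hedged illustration of what "fast" intrinsics mean here (not RAMSES-GPU code): nvcc's --use_fast_math maps divisions and square roots to approximate hardware intrinsics, and the same behaviour can be requested explicitly per call site.

// approximate single-precision intrinsics (device code)
__device__ float inv_sqrt_density(float rho)
{
    return rsqrtf(rho);          // hardware reciprocal square root
}

__device__ float pressure_ratio(float p1, float p2)
{
    return __fdividef(p1, p2);   // fast approximate SP division
}

// build examples (presumably what the "SP/fast" vs "SP" columns correspond to):
//   nvcc -O3 --use_fast_math ...
//   nvcc -O3 ...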

18 / 44


Directionally split Hydro - 3D performances
Number of 10⁶ cell updates / second versus domain size

size          | M2090 SP/fast | K20 SP/fast
32x32x32      | 18.8          | 26.6 (+41%)
64x64x64      | 83.4          | 95.2 (+14%)
96x96x96      | 100.7         | 175.1 (+73%)
128x128x128   | 114.7         | 178.7 (+55%)
192x192x192   | 133.0         | 226.7 (+70%)
225x225x225   | 137.2         | 210.5 (+53%)

M2090 SP: 32.1 and 39.5, DP: 9.2 and 11.1; K20 gains over the M2090: SP +70% and +92%, DP +30% and +30% (reported for two domain sizes).

Architecture: Kepler K20 versus Fermi M2090
Just rebuilding the application with the CUDA 5.5 toolchain for architecture 3.5:
- in SP/fast: Kepler is ∼ 50% faster than the M2090
- in SP: Kepler is ∼ 70% faster than the M2090
- in DP: Kepler is ∼ 30% faster than the M2090
19 / 44


Directionally split Hydro - 3D performances
Number of 10⁶ cell updates / second versus domain size

size          | M2090 SP/fast | K20 SP/fast
32x32x32      | 18.8          | 26.6 (+41%)
64x64x64      | 83.4          | 95.2 (+14%)
96x96x96      | 100.7         | 175.1 (+73%)
128x128x128   | 114.7         | 178.7 (+55%)
192x192x192   | 133.0         | 226.7 (+70%)
225x225x225   | 137.2         | 210.5 (+53%)

M2090 SP: 32.1 and 39.5, DP: 9.2 and 11.1; K20 (tuned) SP: 72.4 and 97.4, DP: 34.9 and 40.6 (reported for two domain sizes).

Architecture: Kepler K20 versus Fermi M2090
Rebuild the application with the CUDA 5.5 toolchain for architecture 3.5 and tune the flags: tune the maximum register count for Kepler, and take care of the read-only data cache ⇒ no more register spilling, DP performance is optimal!
In DP: Kepler is ∼ 3.5× faster than the M2090 (a sketch of these tuning knobs follows).
20 / 44
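A hedged illustration of the two tuning knobs mentioned above (illustrative kernel, not the RAMSES-GPU source): capping registers per thread with __launch_bounds__ / -maxrregcount, and routing read-only data through the sm_35 read-only data cache via const __restrict__ or __ldg().

// cap register usage and hint the read-only data cache (compute capability 3.5+)
__global__ void
__launch_bounds__(128, 4)                         // <= 128 threads/block, target 4 blocks/SM
trace_stage_kernel(const float* __restrict__ q,   // const + __restrict__ enables LDG loads
                   float* __restrict__ qtrace,
                   int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float qi = __ldg(&q[i]);                      // explicit load through the read-only cache
    qtrace[i] = qi;                               // placeholder for the real trace computation
}

// build example (illustrative values): nvcc -arch=sm_35 -maxrregcount=64 --use_fast_math ...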


Directionally split Hydro - 3D performances

Border updates
- on CPU, the border update is always insignificant compared to the numerical scheme (∼ 0.05%)
- on GPU, border updates are NOT accelerated at all: not enough workload for the GPU; ∼ 160 ms per time step whatever the domain size; the profile is strongly domain-size dependent; the maximum acceleration for small domain sizes is limited (Amdahl's law)
- on GPU, border updates ⇒ non-coalesced memory accesses

Memory consumption in the split hydro scheme
- 2D: 8N² (e.g. N = 2¹² = 4096 ⇒ ∼ 530 MB in single precision)
- 3D: 10N³ (e.g. N = 2⁸ = 256 ⇒ ∼ 671 MB in single precision)
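Reading these formulas as counts of single-precision (4-byte) values reproduces the quoted figures (my arithmetic, consistent with the slide):

8 × 4096² × 4 B ≈ 537 MB (quoted as ∼ 530 MB)
10 × 256³ × 4 B ≈ 671 MB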

21 / 44


Summary

1  Astrophysics motivations - HPC applications
   What is MRI (Magneto-rotational Instability)?
   High resource requirements: need for GPU acceleration
   Compressible (M)HD and finite volume methods

2  CPU/GPU performances
   Memory footprint optimization
   MHD
   Conclusion

22 / 44


Memory footprint optimization - MHD scheme

[Diagram: Input → kernel1 → Intermediate → kernel2 → Output]
Enable the largest ever performed MRI simulation, at resolution 800 × 1600 × 800.
With the original GPU implementation, the memory requirement was more than 5.8 GB per GPU (using 256 GPUs) ⇒ not feasible on the CURIE system.

23 / 44


Memory footprint optimization - MHD scheme

[Diagram: Input → kernel1 → Intermediate → kernel2 → Output, with a z-block loop wrapped around the kernels]
Enable the largest ever performed MRI simulation, at resolution 800 × 1600 × 800.
With the original GPU implementation, the memory requirement was more than 5.8 GB per GPU (using 256 GPUs) ⇒ not feasible on the CURIE system.
Memory footprint optimization: apply a z-blocking technique to all kernels involved ⇒ 1.9 GB per GPU (using 256 GPUs); a sketch of the idea follows.
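A hedged sketch of the z-blocking idea (illustrative names and a stand-in kernel, not the RAMSES-GPU API): intermediate arrays are sized for one z-slab only and reused by a host-side loop over slabs.

#include <algorithm>
#include <cuda_runtime.h>

// stand-in for one stage of the MHD scheme, working on a z-slab of nzSlab planes
__global__ void stage_kernel(const float* u, float* tmp,
                             int nx, int ny, int nzSlab)
{
    // ... compute on the slab ...
}

void run_with_z_blocking(const float* d_u, float* d_tmp,
                         int nx, int ny, int nz, int zBlock, int ghost)
{
    dim3 block(16, 16);
    dim3 grid((nx + 15) / 16, (ny + 15) / 16);
    for (int k0 = ghost; k0 < nz - ghost; k0 += zBlock) {
        int k1 = std::min(k0 + zBlock, nz - ghost);
        int nzSlab = (k1 - k0) + 2 * ghost;   // slab plus ghost planes
        // d_tmp only needs nx*ny*nzSlab elements instead of nx*ny*nz
        stage_kernel<<<grid, block>>>(d_u + (long)(k0 - ghost) * nx * ny,
                                      d_tmp, nx, ny, nzSlab);
    }
    cudaDeviceSynchronize();
}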

24 / 44


MHD - 3D Orszag-Tang - performances
Hardware configuration: Fermi M2090 - single GPU
Single-GPU/CPU run - number of 10⁶ cell updates / second in single precision

size          | CPU   | CPU (dissip) | GPU   | GPU (dissip) | GPU/CPU
32x32x32      | 0.184 | 0.161        | 6.03  | 4.97         | 32.7
48x48x48      | 0.195 | 0.168        | 9.63  | 8.49         | 49.4
64x64x64      | 0.201 | 0.175        | 11.94 | 10.43        | 59.4
96x96x96      | 0.204 | 0.176        | 12.60 | 11.20        | 61.7
128x128x128   | 0.210 | 0.184        | 13.64 | 12.08        | 64.5
150x150x150   | 0.218 | 0.183        | 14.60 | 13.00        | 67.0

On CPU, the cost of dissipative terms is ∼ 12% whatever the domain size.
On GPU, the dissipative terms impact is 18% (32³) and 11% (150³); the dissipative term computation needs to call the border routine (larger cost).
Double precision has a strong impact on GPU performance: ∼ ÷4 ⇒ register spilling problems.
25 / 44

Memory footprint optimization MHD Conclusion

Astrophysics motivations - HPC applications CPU/GPU performances

MHD - 3D Orszag-Tang - performances
Hardware configuration: Fermi M2090 - single GPU
Single-GPU/CPU run - number of 10⁶ cell updates / second in single precision (same table as on the previous slide)

Fermi M2090: double precision has a strong impact on GPU performance: ∼ ÷4 ⇒ register spilling problems.
Kepler: the register spilling problem is strongly reduced.
Kepler: one K20 GPU is ∼ 2.5× faster than a 16-core CPU node (2× Intel Xeon E5-2670).
26 / 44


MHD - MRI - 3D performances
MRI with dissipative terms (viscosity and resistivity) and shearing box
Hardware configuration: Fermi M2090 - single GPU
Number of 10⁶ cell updates / second in single precision

size        | CPU   | GPU  | GPU/CPU
16x32x16    | 0.138 | 1.65 | 11.8
32x64x32    | 0.156 | 5.56 | 37.0
48x96x48    | 0.163 | 8.02 | 50.1
64x96x64    | 0.169 | 8.96 | 52.7
64x128x64   | 0.168 | 9.12 | 53.6
96x150x96   | 0.175 | 10.1 | 59.4
96x192x96   | 0.170 | 10.2 | 60.0
128x128x64  | 0.173 | 9.3  | 53.7

27 / 44


MHD - MRI - 3D performances

MRI with dissipative terms (viscosity and resistivity) and shearing box
Hardware configuration: M2090 - multi-GPU (1 GPU per MPI process)
MPI performance: number of 10⁶ cell updates / second in single precision; sub-domain size per MPI process is 128 × 128 × 64

# MPI procs | global size      | CPU   | Efficiency CPU | GPU    | Efficiency GPU
1           | 128 × 128 × 64   | 0.173 | -              | 8.0    | -
8           | 256 × 256 × 128  | 1.39  | 100.6 %        | 38.4   | 60.0 %
64          | 512 × 512 × 256  | 11.1  | 100.1 %        | 306.7  | 59.9 %
256         | 512 × 1024 × 512 | 43.8  | 98.7 %         | 1066.0 | 52.1 %

GPU/CPU (total): 50.1

28 / 44


MHD - MRI - 3D performances
MRI with dissipative terms (viscosity and resistivity) and shearing box
Hardware configuration: M2090
MPI performance: number of 10⁶ cell updates / second in single precision, using a sub-domain size of 128 × 128 × 64 per MPI process

# MPI procs | global size      | CPU  | Efficiency CPU | GPU    | Efficiency GPU
256         | 512 × 1024 × 512 | 43.8 | 98.7 %         | 1066.0 | 52.1 %

Simulation sizing (estimated): CURIE/CPU, 2 × 10⁶ time steps, domain size 512 × 1024 × 512 ⇒ 1 week, 4000 cores
Simulation sizing (measured): CURIE/GPU, 2 × 10⁶ time steps, domain size 512 × 1024 × 512 ⇒ 5.8 days, 256 GPUs
Simulation sizing (actual): CURIE/GPU (DP), 0.5 × 10⁶ time steps, domain size 800 × 1600 × 800 ⇒ 14 days, 256 GPUs, largest ever performed
29 / 44


MHD - MRI - 3D performances

MRI with dissipative terms (viscosity and resistivity) and shearing box
Largest ever performed MRI simulation: 800 × 1600 × 800 (also done on BlueGene, but requiring 32k processors).
The GPU implementation has a rather low MPI parallelization efficiency; this is due to the communications (MPI + PCIe) for border updates (normal and shear); this strongly modifies the profile and decreases the maximum possible acceleration (Amdahl's law).
There is room for optimization in the MHD scheme: CUDA kernels are register-limited on Fermi.
Very large sub-domains are needed; strong scaling should drop faster (but we only have 288 GPUs on CURIE!)

30 / 44


MHD - MRI - 3D performances

MRI with dissipative terms (viscosity and resistivity) and shearing box
Profiling for a 64 MPI process configuration with a 48 × 96 × 48 sub-domain size (CURIE)

Function                     | % of total time
Godunov scheme               | 22.9 %
dissipative terms            | 17.5 %
borders (internal/external)  | 22.7 %
shearing border              | 36.9 %

31 / 44



MHD - MRI - 3D performances

MRI with dissipative terms (viscosity and resistivity) and shearing box
Border updates are dominated by MPI and CPU-GPU PCIe communications.
The border update is ∼ 60% of the total time (48 × 96 × 48 sub-domain)!! It was only 10% for single-GPU...
Once MPI + PCIe communications are integrated into the profile, strong scaling is quite good (a sketch of such a border exchange is given below).
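A hedged sketch of the kind of border (halo) exchange that dominates this profile: device-to-host copy over PCIe, MPI exchange, host-to-device copy back. Function and buffer names are illustrative, not taken from RAMSES-GPU:

#include <mpi.h>
#include <cuda_runtime.h>

// exchange packed x-ghost layers with the left/right MPI neighbors
void exchange_x_border(const float* d_sendbuf, float* d_recvbuf,
                       float* h_sendbuf, float* h_recvbuf,
                       int count, int left, int right, MPI_Comm comm)
{
    // 1. PCIe: bring the packed ghost layers back to the host
    cudaMemcpy(h_sendbuf, d_sendbuf, count * sizeof(float), cudaMemcpyDeviceToHost);
    // 2. MPI: swap ghost layers with the x-neighbors
    MPI_Sendrecv(h_sendbuf, count, MPI_FLOAT, right, 0,
                 h_recvbuf, count, MPI_FLOAT, left, 0,
                 comm, MPI_STATUS_IGNORE);
    // 3. PCIe: push the received ghost layers back to the device
    cudaMemcpy(d_recvbuf, h_recvbuf, count * sizeof(float), cudaMemcpyHostToDevice);
}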

33 / 44


MHD - MRI - 3D performances

MRI with dissipative terms (viscosity and resistivity) and shearing box
Work in progress:
- new physics modules (on-going): ambipolar diffusion, cooling/heating terms
- coupling to ParaView/Catalyst for in-situ visualization

Other possible developments:
- cylindrical/spherical coordinates (almost done)
- radiative HD/MHD? AMR? Interesting work by CLAMR (cell-based AMR on GPU with OpenCL)

34 / 44


MHD turbulence at Mach = 10 - 2016³ - Simulations

Figure: log-scale density plot of MHD turbulence at Mach = 10, resolution 2016³, on Keeneland: 486 GPUs (∼ 61%) out of 768 GPUs.
Double-precision run on Keeneland: MHD performance is 2.5 Mcell-updates/s per GPU; expected performance on Blue Waters is 5.5 Mcell-updates/s per GPU.
Large sub-domain 224 × 224 × 336 (5.1 GB / GPU); we can only run one job once a week!
A large GPU cluster is needed to stimulate work on porting algorithms and to reach state-of-the-art problems.
1000³ MHD runs can be done routinely on a small GPU cluster (∼ 64 GPUs).

35 / 44



MHD turbulence at Mach = 10 - 2016³ - Simulations

Figure: compensated Fourier power spectrum of v = ρ^(1/3) u and E_m - Mach = 10, resolution 2016³, on Keeneland (486 GPUs).

Special care is needed for parallel I/O at 2016³ (the Lustre file system parameters need some attention).
Ongoing work on the analysis of the turbulence data.
37 / 44


Conclusion
- GPU implementations provided for multiple finite volume schemes (hydrodynamics and MHD)
- performed some of the largest MHD simulations ever done
- entire application runs on the GPU
- excellent single-GPU performance
- double-precision MHD implementation: strong performance drop on Fermi due to register spilling / shear border condition; the situation is strongly improved on Kepler

Very good multi-GPU performance provided large sub-domain sizes are used.
CUDA/C++ code available: http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/index.html
Stratified MRI: on-going work with C. Gammie and B. Ryan
38 / 44


Stratified MRI - Simulations - ongoing recent work

Movie representing the evolution of the magnetic field component B_y at the onset of the Magneto-Rotational Instability in a stratified box (with a gravity field).

39 / 44


Extra slide: MHD - Orszag-Tang 3D - performances, MPI/GPU, no shear border, no dissipative terms
Hardware configuration: CURIE - multi-GPU (1 GPU per MPI process)
MPI performance: number of 10⁶ cell updates / second in single precision; sub-domain size per MPI process is 128 × 128 × 128

# MPI procs | global size       | CPU (total) | CPU per MPI proc | GPU (total) | GPU per MPI proc
1           | 128 × 128 × 128   | 0.21        | 0.21             | 13.6        | 13.6
8           | 256 × 256 × 256   | 1.68        | 0.21             | 95.3        | 11.9
64          | 512 × 512 × 512   | 13.4        | 0.21             | 750.7       | 11.7
128         | 1024 × 512 × 512  | 26.8        | 0.21             | 1498.3      | 11.7
256         | 1024 × 1024 × 512 | 53.5        | 0.21             | 2969.3      | 11.6

The weight of MPI communications in the profile is much less important than in the MRI case (because the shear-border and dissipative-term communications are very costly).
Scaling is also much better for the plain MHD MPI-GPU version than for MHD + shear border.
Sustained speedup of ∼ 55 of GPU versus CPU using the same number of MPI processes.

40 / 44


Extra slide: MHD - Orszag-Tang 3D - performances, MPI/GPU, no shear border, no dissipative terms
Hardware configuration: CURIE - multi-GPU (1 GPU per MPI process)
MPI performance: number of 10⁶ cell updates / second in single precision; sub-domain size per MPI process is 128 × 128 × 128 (same table as on the previous slide)

For compressible MHD, one needs to take into account how structure-function computations and other required pre-processing analyses will slow down the GPU...
41 / 44


RAMSES-GPU Code RAMSES-GPU available for download http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/download.html

42 / 44


Extra slide - RamsesGPU on Nvidia CARMA

The x86_64 CPU has a high electric power consumption; replace it with a low-power ARM architecture.
The GFlops/Watt ratio is over 6 for the Quadro 1000M GPU and around 1.5 for the Tegra 3 CPU; total consumption (5 + 45 Watt).
Just rebuild RAMSES-GPU with the cross toolchain for CUDA on ARM.
Thanks to E. Orlotti (NVIDIA) for providing the CARMA board.
43 / 44


Extra slide - RamsesGPU on NVIDIA CARMA
MHD (single precision): NVIDIA CARMA with or without dissipative terms
Single-GPU/CPU - number of 10⁶ cell updates / second

size          | CPU ᵃ | CPU (dissip) | GPU  | GPU (dissip) | GPU/CPU
32x32x32      | 0.106 | 0.093        | 1.19 | 0.98         | 10.5
48x48x48      | 0.125 | 0.108        | 1.72 | 1.35         | 12.5
64x64x64      | 0.130 | 0.118        | 1.94 | 1.57         | 13.3
96x96x96      | 0.139 | 0.124        | 2.13 | 1.70         | 13.7
128x128x128   | 0.141 | 0.125        | 2.35 | 1.93         | 15.4

ᵃ using 4 MPI tasks

ARM core performance: not so bad (SP only); Quadro 1000M ∼ 40 to 60 ARM cores.
44 / 44
