High Resolution Fluid Dynamics Simulations on a GPU Cluster
Pierre Kestener¹, Alexei Kritsuk³, Sébastien Fromang², Patrick Hennebelle²
¹ CEA Saclay, DSM, Maison de la Simulation   ² CEA Saclay, DSM, Service d'Astrophysique   ³ UCSD, Laboratory for Computational Astrophysics
GPU Technology Conference (GTC) 2014, San Jose, March 26, 2014
Content
- Astrophysics motivation - protoplanetary disk simulations
- GPU implementation of high-order finite volume schemes
  - MHD using Constrained Transport for the magnetic field update
  - shearing box border condition
  - applications: MRI and interstellar turbulence
- Optimization strategies
  - Kepler optimization: registers and read-only data cache
  - memory footprint optimization
- Performance measurements on real-case applications
  - Magneto-Rotational Instability (MRI) on the CURIE hybrid partition (up to 256 GPUs)
  - MHD turbulence on the KEENELAND system (up to 500 GPUs)
- Perspectives / future applications and developments
Motivation - Stencil applications - science cases
Porting finite volume / finite difference schemes (on regular grids) to GPU is often said to be easy, but what is the impact on performance of:
- complex numerical schemes: CFD can require a large number of FLOPS per cell
- data precision: single / double
- intrinsic instructions: e.g. fast math
- algorithm: split / unsplit schemes, 2D vs 3D
- physics: hydro / MHD / Riemann solver
- other numerical parts: shear border condition
Present work: take the full MHD numerical scheme (from the astrophysics code RAMSES) and port it to GPU; all refactoring allowed / put everything on the GPU. Make the code production-ready for science cases on existing large GPU clusters:
- magneto-rotational instability (MRI)
- interstellar turbulence
Code Ramses-GPU: MHD astrophysics applications
RAMSES-GPU is developed in CUDA/C for astrophysics MHD applications on regular grids.
http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/
∼ 45k lines of code (of which ∼ 10k in CUDA)
Science case applications:
- MRI in accretion disks (Pm = 4): ∼ 500,000 hours on CURIE (256 GPUs) at resolution 800 × 1600 × 800
- MHD driven turbulence (Mach ∼ 10): ∼ 150,000 hours on KEENELAND (U. Tennessee), up to resolution 2016³ (486 GPUs)
Summary
1. Astrophysics motivations - HPC applications
   - What is MRI (Magneto-Rotational Instability)?
   - High resource requirements: need for GPU acceleration
   - Compressible (M)HD and finite volume methods
2. CPU/GPU performances
   - Memory footprint optimization
   - MHD
   - Conclusion
Accretion disks
Systems with an accretion disk:
- protostellar disks
- accretion disks in binary star systems
- active galactic nuclei (AGN), the most luminous persistent sources in the universe
Understanding accretion disks (image: AGN NGC4261, source: HST/NASA/ESA):
- these objects are difficult to observe
- main problem: matter falls (spirals) towards the central object, but there is a net outgoing transfer of angular momentum (AM)
- shear viscosity is not sufficient; an effective viscosity due to turbulence can explain the observed mass accretion rates
- How is turbulence driven? Estimate AM transfer rates. What about B?
MRI - Shearing box setup
Magneto-Rotational Instability (MRI):
- long-term behavior study ⇒ long simulation runs (several 10⁶ time steps); the numerical scheme must be stable!
- use a Cartesian grid x, y, z to model R, φ, z in the neighborhood of R₀
- write the MHD equations in a frame rotating at Ω₀
- differential rotation is modeled by shear-periodic border conditions in the radial direction and periodic border conditions in y and z
Ref: Balbus & Hawley, Rev. Mod. Phys. 70, 1-53 (1998)
MRI - Shearing box setup
Shearing-box setup: MHD equations in a rotating frame (Ω₀), including
- magnetic field: induction equation
- inertial force terms
- (static) gravitational terms
- dissipative terms (viscosity + resistivity)
- shearing box border conditions (in the radial direction)
\[ \frac{\partial \rho}{\partial t} + \nabla\cdot(\rho \mathbf{v}) = 0 \]
\[ \rho\frac{\partial \mathbf{v}}{\partial t} + \rho(\mathbf{v}\cdot\nabla)\mathbf{v} + 2\rho\,\boldsymbol{\Omega}_0\times\mathbf{v} = \frac{(\nabla\times\mathbf{B})\times\mathbf{B}}{4\pi} - \nabla P + \nabla\cdot\mathbb{T} + \rho\,\mathbf{g} \]
\[ \frac{\partial \mathbf{B}}{\partial t} = \nabla\times\left(\mathbf{v}\times\mathbf{B} - \eta\,\nabla\times\mathbf{B}\right) \]
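The shear-periodic mapping applied at the radial boundaries is not written out on the slide; a sketch in the form commonly used in the shearing-box literature (sign conventions and the shear parameter q, here q = 3/2 for a Keplerian disk, may differ from the exact RAMSES-GPU implementation):
\[ f(x, y, z) = f(x \pm L_x,\; y \mp q\,\Omega_0 L_x\, t,\; z), \qquad v_y(x, y, z) = v_y(x \pm L_x,\; y \mp q\,\Omega_0 L_x\, t,\; z) \pm q\,\Omega_0 L_x , \]
where f stands for ρ, v_x, v_z and B, and L_x is the radial extent of the box.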
MRI - Simulations
Movie: evolution of the magnetic field component B_y during 1 orbit.
Resolution is 800 × 1600 × 800, Pm = 4, Re = 25000; 1 orbit takes 1 day of wall-clock time on 256 GPUs (CURIE system - Fermi M2090).
MHD - MRI - simulation run setup
Technical features of a CPU run of MHD SB with dissipative terms¹ of a highly resolved MHD (Magneto-Rotational Instability) problem on the French machine TITANE (CCRT, in 2010²):
- resolution: 256 × 512 × 256 ≈ 33M cells
- time steps: 10⁶
- MPI processes: 1024
- wall-clock time: 4 days, i.e. ∼ 100 hours (∼ 100,000 compute hours)
- effective performance: 256 × 512 × 256 × 10⁶ / (100 × 3600) ≈ 93.2 M cell updates/s (that is ∼ 93,200 updates/s per CPU core); today's CPU performance is closer to 170 M updates/s
¹ Intel Nehalem, quad-core, 2.93 GHz    ² using the RAMSES (CPU) code
MHD - MRI - simulation run setup
Problem: double the spatial resolution
- resolution: 512 × 1024 × 512 ⇒ the total number of cells is 8 times larger
- time steps: 2 × 10⁶ (same dynamics)
⇒ we need 16 times more computing resources (1.6 Mh), just for one run.
Keeping the same sub-domain size per MPI process, we need 8000 MPI processes during 200 hours (one week), or 2000 MPI processes for one month. Assuming perfect weak scaling, this requires 8000 × 93,200 ≈ 746 M updates/s for the entire system.
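Restating that sizing arithmetic as a worked estimate (only the figures quoted above are used):
\[ N_{\mathrm{updates}} = 512\times1024\times512 \;\times\; 2\times10^{6} \approx 5.4\times10^{14}, \qquad \frac{5.4\times10^{14}}{8000 \times 200\,\mathrm{h} \times 3600\,\mathrm{s/h}} \approx 9.3\times10^{4}\ \text{updates/s per process}, \]
\[ 8000 \times 93\,200 \approx 7.46\times10^{8}\ \text{updates/s} = 746\ \text{M updates/s for the whole system.} \]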
Can we implement a GPU-based solution? What for? Solve the same problem faster; solve bigger problems in less time.
What would the required amount of computing resources be? Solution: develop the RAMSES-GPU code using CUDA/C.
Godunov finite volume scheme on GPU - stencil algorithms
Pseudo-algorithm for U_t + F(U)_x = 0:

for k do
  for j do
    for i do
      get hydro state U_{i,j,k}
      convert to primitive variables using the equation of state
      compute slopes ∆_i
      interpolate states at i−1/2 and i+1/2, evolve them by ∆t/2
      solve the local Riemann problem, compute fluxes F^{n+1/2}_{i+1/2}
      update U^{n+1}_{i,j,k} = U^{n}_{i,j,k} + ∆t ( F^{n+1/2}_{i−1/2,j,k} − F^{n+1/2}_{i+1/2,j,k} ) / ∆x
    end for
  end for
end for
Very large FLOP count per cell per time step: ∼ 700 FLOPS in hydro and ∼ 3500 FLOPS in MHD ⇒ lots of local variables and large register requirements ⇒ potential register spilling problems.
CUDA/C parallelization strategy: the loops over i and j are unfolded; each thread receives a z-column to deal with (see P. Micikevicius, ACM, 2009); a sketch of this mapping is given below.
Some refactoring is allowed between the CPU and GPU versions: split the for loops (intermediate results then need to be stored in GPU global memory).
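A minimal sketch of this thread mapping (illustrative only: the kernel and array names and the dummy update are placeholders, not the actual RAMSES-GPU kernels); each thread handles one (i, j) column and marches along z:

// One thread per (i,j) column, serial loop over z.
// U_in / U_out are flattened 3D arrays (a single variable shown for brevity);
// nx, ny, nz include ghost cells.
__global__ void godunov_column_kernel(const float* __restrict__ U_in,
                                      float* __restrict__ U_out,
                                      int nx, int ny, int nz, float dtdx)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return;

  for (int k = 1; k < nz - 1; ++k) {
    int idx = i + nx * (j + ny * k);
    // placeholder for: primitive conversion, slopes, trace, Riemann solver
    float flux_m = 0.5f * (U_in[idx - 1] + U_in[idx]);  // dummy "flux" at i-1/2
    float flux_p = 0.5f * (U_in[idx] + U_in[idx + 1]);  // dummy "flux" at i+1/2
    U_out[idx] = U_in[idx] + dtdx * (flux_m - flux_p);
  }
}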
Godunov finite volume scheme on GPU - stencil algorithms
[Diagram: kernel organization - Input → kernel1 → intermediate buffer → kernel2 → Output; fusing/merging kernel1 and kernel2 yields a single kernel3 going directly from Input to Output.]
Godunov finite volume scheme on GPU - stencil algorithms
Directionally split scheme ⇒ 1D stencil: only one GPU kernel (low memory storage required).
Directionally unsplit scheme ⇒ 3D stencil: better performance when the computation is split into 3 GPU kernels (larger memory storage).
GPU implementation/parallelization: naturally unfold the for loops over i and j; the loop over z is performed inside each GPU thread (efficient shared memory usage).
Hydrodynamics - GPU implementation
Directionally split hydrodynamics - 1D stencil
The entire scheme fits into a single kernel using a single loop ⇒ very low memory consumption ⇒ very large sub-domains fit on each GPU.

Directionally unsplit hydrodynamics - 3D stencil
Each stage of the algorithm requires more floating-point computation ⇒ more registers needed per thread.
In 2D, using a single kernel costs a 40% performance decrease compared to using 3 kernels (primitive variable conversion; slope and interpolation computations; Riemann fluxes and cell updates). In 3D, the decrease is ∼ 30%.
Several CUDA kernels ⇒ better design, much easier to debug, and easier to insert a new kernel (e.g. for extra physics terms) than to modify existing kernels. A host-side sketch of this 3-kernel organization follows.
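A minimal host-side sketch of such a 3-kernel organization (illustrative: kernel names, buffer layout and launch geometry are placeholders, not the RAMSES-GPU API; the three kernels are assumed to be implemented elsewhere):

// d_U: conservative variables; d_Q: primitive variables; d_slopes: limited slopes.
// All buffers are pre-allocated on the device (allocation omitted for brevity).
__global__ void convert_to_primitive_kernel(const float* U, float* Q,
                                            int nx, int ny, int nz);
__global__ void compute_slopes_trace_kernel(const float* Q, float* slopes,
                                            int nx, int ny, int nz, float dtdx);
__global__ void riemann_update_kernel(float* U, const float* Q, const float* slopes,
                                      int nx, int ny, int nz, float dtdx);

void unsplit_hydro_step(float* d_U, float* d_Q, float* d_slopes,
                        int nx, int ny, int nz, float dt, float dx)
{
  dim3 block(16, 16);
  dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
  float dtdx = dt / dx;

  // 1) conservative -> primitive conversion (equation of state)
  convert_to_primitive_kernel<<<grid, block>>>(d_U, d_Q, nx, ny, nz);
  // 2) slopes + trace: interpolate face states and evolve them by dt/2
  compute_slopes_trace_kernel<<<grid, block>>>(d_Q, d_slopes, nx, ny, nz, dtdx);
  // 3) Riemann solves at cell faces + conservative update of U
  riemann_update_kernel<<<grid, block>>>(d_U, d_Q, d_slopes, nx, ny, nz, dtdx);
}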
Directionally split hydro - 3D performance
Number of 10⁶ cell updates / second versus domain size (single precision):

size            GTX 560 Ti SP/fast   M2090 SP/fast
32x32x32          7.3                  18.8
64x64x64         37.6                  83.4
96x96x96         68.2                 100.7
128x128x128      90.6                 114.7
192x192x192     111.0                 133.0
225x225x225     117.8                 137.2

Without fast math, the M2090 reaches 32.1-39.5 M updates/s in single precision and 9.2-11.1 M updates/s in double precision.
- pure hydro on GPU is ∼ 10× faster (DP) than a single CPU core, and ∼ 100× faster (SP/fast math)
- the Tesla M2090 has a much better memory bandwidth
- one large domain (or many small ones) is needed to fully load the GPU
- there is a huge performance gap when deactivating intrinsics or using DP (less pronounced in MHD)
- there is room for designing mixed-precision algorithms
Directionally split hydro - 3D performance
Number of 10⁶ cell updates / second versus domain size:

size            M2090 SP/fast   K20 SP/fast
32x32x32          18.8            26.6 (+41%)
64x64x64          83.4            95.2 (+14%)
96x96x96         100.7           175.1 (+73%)
128x128x128      114.7           178.7 (+55%)
192x192x192      133.0           226.7 (+70%)
225x225x225      137.2           210.5 (+53%)

Without fast math, the rebuilt K20 binary gains +70% to +92% over the M2090 in SP and about +30% in DP (M2090 reference: 32.1-39.5 SP, 9.2-11.1 DP).
Kepler K20 versus Fermi M2090: just rebuilding the application with the CUDA 5.5 toolchain for architecture 3.5 gives:
- SP/fast: Kepler is ∼ 50% faster than the M2090
- SP: Kepler is ∼ 70% faster than the M2090
- DP: Kepler is ∼ 30% faster than the M2090
Directionally split hydro - 3D performance (continued)
With tuned build flags (the SP/fast-math columns are as in the table above), the K20 reaches 72.4-97.4 M updates/s in SP (no fast math) and 34.9-40.6 M updates/s in DP, versus 32.1-39.5 (SP) and 9.2-11.1 (DP) on the M2090.
Kepler K20 versus Fermi M2090: rebuild the application with the CUDA 5.5 toolchain for architecture 3.5 and tune the flags. Tuning the maximum register count for Kepler and taking care to use the read-only data cache removes register spilling, and DP performance becomes optimal: in DP, Kepler is ∼ 3.5× faster than the M2090. A sketch of these tuning knobs is given below.
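A sketch of the kind of tuning meant here (illustrative register limit and kernel; the actual values and kernels used in RAMSES-GPU may differ):

// Build for Kepler (sm_35) and cap registers per thread to limit spilling,
// e.g.:  nvcc -arch=sm_35 -maxrregcount=64 -Xptxas -v ...
//
// Route read-only inputs through Kepler's read-only data cache, either via
// const __restrict__ pointers or explicitly with __ldg():
__global__ void flux_kernel(const float* __restrict__ q, float* flux, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < 1 || i >= n - 1) return;
  float qm = __ldg(&q[i - 1]);   // load through the read-only cache
  float qp = __ldg(&q[i + 1]);
  flux[i] = 0.5f * (qp - qm);    // dummy computation, placeholder for the real flux
}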
Directionally split hydro - 3D performance

Border updates
- on CPU, the border update is always negligible compared to the numerical scheme (∼ 0.05%)
- on GPU, border updates are NOT accelerated at all; there is not enough workload for the GPU: ∼ 160 ms per time step whatever the domain size; the profile is therefore strongly domain-size dependent, and the maximum acceleration for small domains is limited (Amdahl's law)
- on GPU, border updates imply non-coalesced memory accesses (see the sketch below)

Memory consumption of the split hydro scheme
- 2D: 8N² (e.g. N = 2¹² = 4096 ⇒ 530 MB in single precision)
- 3D: 10N³ (e.g. N = 2⁸ = 256 ⇒ 671 MB in single precision)
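A minimal sketch of why an x-direction ghost-cell fill is non-coalesced with an i-fastest memory layout (illustrative kernel with a 2-cell ghost width and periodic BC, not the RAMSES-GPU border routine): consecutive threads touch elements a full row apart, so the accesses cannot be coalesced.

// Fill the two x-boundary ghost layers (periodic boundary shown).
// Layout is i-fastest: idx = i + nx*(j + ny*k); threads vary in (j,k), so
// neighbouring threads access memory with stride nx (or nx*ny): non-coalesced.
__global__ void fill_x_ghosts_periodic(float* U, int nx, int ny, int nz)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  int k = blockIdx.y * blockDim.y + threadIdx.y;
  if (j >= ny || k >= nz) return;

  int row = nx * (j + ny * k);
  U[row + 0]      = U[row + nx - 4];  // left ghosts  <- rightmost interior cells
  U[row + 1]      = U[row + nx - 3];
  U[row + nx - 2] = U[row + 2];       // right ghosts <- leftmost interior cells
  U[row + nx - 1] = U[row + 3];
}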
Summary
1. Astrophysics motivations - HPC applications
   - What is MRI (Magneto-Rotational Instability)?
   - High resource requirements: need for GPU acceleration
   - Compressible (M)HD and finite volume methods
2. CPU/GPU performances
   - Memory footprint optimization
   - MHD
   - Conclusion
Memory footprint optimization - MHD scheme
[Diagram: kernel pipeline Input → kernel1 → intermediate buffer → kernel2 → Output, with a z-block loop wrapped around the kernels.]
Goal: enable the largest MRI simulation ever done, at resolution 800 × 1600 × 800.
With the original GPU implementation, the memory requirement was more than 5.8 GB per GPU (using 256 GPUs) ⇒ impossible on the CURIE system.
Memory footprint optimization: apply a z-blocking technique to all kernels involved ⇒ 1.9 GB per GPU (using 256 GPUs) ⇒ the run fits. A sketch of the z-blocking structure is given below.
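A minimal host-side sketch of the z-blocking idea (illustrative: block size, ghost width, kernel names and buffer handling are placeholders): intermediate buffers are sized for one slab of the z extent, and the kernel pipeline is replayed over successive slabs.

// The two kernels stand in for the real pipeline and are assumed to be
// implemented elsewhere; they only touch z-planes [k0 - ghost, k1 + ghost).
__global__ void trace_kernel(const float* U, float* scratch,
                             int nx, int ny, int nz, int k0, int k1, float dtdx);
__global__ void update_kernel(const float* scratch, float* U,
                              int nx, int ny, int nz, int k0, int k1, float dtdx);

void mhd_step_zblocked(float* d_U, float* d_scratch,
                       int nx, int ny, int nz, int zBlock, float dt, float dx)
{
  const int ghost = 3;                           // stencil width (illustrative)
  dim3 block(16, 16);
  dim3 grid((nx + 15) / 16, (ny + 15) / 16);

  for (int k0 = ghost; k0 < nz - ghost; k0 += zBlock) {
    int k1 = (k0 + zBlock < nz - ghost) ? k0 + zBlock : nz - ghost;  // slab [k0, k1)
    // d_scratch only needs to hold zBlock + 2*ghost planes, not the full nz,
    // which is where the memory saving comes from.
    trace_kernel<<<grid, block>>>(d_U, d_scratch, nx, ny, nz, k0, k1, dt / dx);
    update_kernel<<<grid, block>>>(d_scratch, d_U, nx, ny, nz, k0, k1, dt / dx);
  }
}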
MHD - 3D Orszag-Tang - performance
Hardware configuration: Fermi M2090, single GPU.
Single-GPU/CPU run - number of 10⁶ cell updates / second in single precision:

size            CPU     CPU (dissip)   GPU     GPU (dissip)   GPU/CPU
32x32x32        0.184   0.161           6.03    4.97          32.7
48x48x48        0.195   0.168           9.63    8.49          49.4
64x64x64        0.201   0.175          11.94   10.43          59.4
96x96x96        0.204   0.176          12.60   11.20          61.7
128x128x128     0.210   0.184          13.64   12.08          64.5
150x150x150     0.218   0.183          14.60   13.00          67.0
On CPU, the dissipative terms cost ∼ 12% whatever the domain size. On GPU, their impact goes from 18% (32³) to 11% (150³); computing the dissipative terms requires calling the border routine (larger cost). Double precision has a strong impact on GPU performance (∼ ÷4) because of register spilling problems.
Fermi M2090: double precision has a strong impact on GPU performance (∼ ÷4) because of register spilling. Kepler: the register spilling problem is strongly reduced; one K20 GPU is ∼ 2.5× faster than a 16-core CPU node (2 × Intel Xeon E5-2670).
MHD - MRI - 3D performance
MRI with dissipative terms (viscosity and resistivity) and shearing box.
Hardware configuration: Fermi M2090, single GPU.
Number of 10⁶ cell updates / second in single precision:

size           CPU     GPU     GPU/CPU
16x32x16       0.138    1.65   11.8
32x64x32       0.156    5.56   37.0
48x96x48       0.163    8.02   50.1
64x96x64       0.169    8.96   52.7
64x128x64      0.168    9.12   53.6
96x150x96      0.175   10.1    59.4
96x192x96      0.170   10.2    60.0
128x128x64     0.173    9.3    53.7
MHD - MRI - 3D performance
MRI with dissipative terms (viscosity and resistivity) and shearing box.
Hardware configuration: M2090, multi-GPU (1 GPU per MPI process).
MPI performance: number of 10⁶ cell updates / second in single precision; sub-domain size is 128 × 128 × 64.

# MPI proc   global size         CPU     CPU efficiency   GPU      GPU efficiency
1            128 × 128 × 64       0.173   -                   8.0   -
8            256 × 256 × 128      1.39    100.6 %            38.4   60.0 %
64           512 × 512 × 256     11.1     100.1 %           306.7   59.9 %
256          512 × 1024 × 512    43.8      98.7 %          1066.0   52.1 %

Total GPU/CPU speedup: ∼ 50×.
MHD - MRI - 3D performance
MRI with dissipative terms (viscosity and resistivity) and shearing box; M2090 hardware; MPI performance in single precision with a 128 × 128 × 64 sub-domain per process (cf. the 256-process row of the table above).
- Simulation sizing (estimated): CURIE/CPU, 2 × 10⁶ time steps, domain size 512 × 1024 × 512 ⇒ 1 week on 4000 cores
- Simulation sizing (measured): CURIE/GPU, 2 × 10⁶ time steps, domain size 512 × 1024 × 512 ⇒ 5.8 days on 256 GPUs
- Simulation sizing (actual): CURIE/GPU (DP), 0.5 × 10⁶ time steps, domain size 800 × 1600 × 800 ⇒ 14 days on 256 GPUs, the largest ever performed
MHD - MRI - 3D performance
MRI with dissipative terms (viscosity and resistivity) and shearing box:
- largest MRI simulation ever performed, 800 × 1600 × 800 (also done on a BlueGene, but requiring 32k processors)
- the GPU implementation has a rather low MPI parallelization efficiency; this is due to the communications (MPI + PCIe) for border updates (normal and shear), which strongly modify the profile and decrease the maximum possible acceleration (Amdahl's law)
- there is room for optimization in the MHD scheme: the CUDA kernels are register-limited on Fermi
- very large sub-domains must be used; strong scaling should drop faster (but we only have 288 GPUs on CURIE!)
MHD - MRI - 3D performance
MRI with dissipative terms (viscosity and resistivity) and shearing box.
Profiling for a 64 MPI process configuration with a 48 × 96 × 48 sub-domain size (CURIE):

Function                        % of total time
Godunov scheme                  22.9 %
dissipative terms               17.5 %
borders (internal/external)     22.7 %
shearing border                 36.9 %
MHD - MRI - 3D performance
MRI with dissipative terms (viscosity and resistivity) and shearing box:
- border updates are dominated by MPI and CPU-GPU PCIe communications (see the sketch below)
- border updates take ∼ 60% of the total time (48 × 96 × 48 sub-domain)! It was only 10% for the single-GPU case...
- once MPI + PCIe communications are integrated into the profile, strong scaling is quite good
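A minimal sketch of the exchange path that dominates here (illustrative: RAMSES-GPU's actual border routines, buffer packing and MPI topology are not shown): each ghost layer travels GPU → host over PCIe, then through MPI, then host → GPU on the neighbouring rank.

#include <mpi.h>
#include <cuda_runtime.h>

// Exchange one packed border buffer of 'count' floats with a neighbour rank.
// d_send / d_recv are device buffers already packed/unpacked by dedicated
// kernels; h_send / h_recv are (ideally pinned) host staging buffers.
void exchange_border(const float* d_send, float* d_recv,
                     float* h_send, float* h_recv,
                     int count, int neighbour, MPI_Comm comm)
{
  // 1) device -> host (PCIe)
  cudaMemcpy(h_send, d_send, count * sizeof(float), cudaMemcpyDeviceToHost);

  // 2) MPI exchange with the neighbouring sub-domain
  MPI_Sendrecv(h_send, count, MPI_FLOAT, neighbour, 0,
               h_recv, count, MPI_FLOAT, neighbour, 0,
               comm, MPI_STATUS_IGNORE);

  // 3) host -> device (PCIe)
  cudaMemcpy(d_recv, h_recv, count * sizeof(float), cudaMemcpyHostToDevice);
}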
MHD - MRI - 3D performance
MRI with dissipative terms (viscosity and resistivity) and shearing box.
Work in progress:
- new physics modules (ongoing): ambipolar diffusion, cooling/heating terms
- coupling to ParaView/Catalyst for in-situ visualization
Other possible developments:
- cylindrical/spherical coordinates (almost done)
- radiative HD/MHD?
- AMR? Interesting work by CLAMR (cell-based AMR on GPU with OpenCL)
MHD turbulence at Mach = 10 - 2016³ - simulations
Figure: log-scale density plot of MHD turbulence at Mach = 10, resolution 2016³, on Keeneland: 486 GPUs (∼ 61%) out of 768.
- double precision run on Keeneland: MHD performance is 2.5 M cell updates/s per GPU; the expected performance on BlueWaters is 5.5 M cell updates/s per GPU
- large sub-domains of 224 × 224 × 336 (5.1 GB per GPU)
- we can only run one such job per week! Large GPU clusters are needed to stimulate work on porting algorithms and to reach state-of-the-art problems
- 1000³ MHD runs can be done routinely on a small GPU cluster (∼ 64 GPUs)
MHD turbulence at Mach = 10 - 2016³ - simulations
Figure: compensated Fourier power spectrum of v = ρ^(1/3) u and of E_m - Mach = 10, resolution 2016³ on Keeneland (486 GPUs).
- parallel I/O at 2016³ requires special care (the Lustre file-system parameters need some attention)
- ongoing work on the analysis of the turbulence data
Conclusion
- GPU implementations of multiple finite volume schemes (hydrodynamics and MHD)
- performed some of the largest MHD simulations ever done
- the entire application runs on the GPU
- excellent single-GPU performance
- double-precision MHD: strong performance drop on Fermi due to register spilling / the shear border condition; the situation is strongly improved on Kepler
- very good multi-GPU performance provided large sub-domain sizes are used
- CUDA/C++ code available: http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/index.html
- stratified MRI: ongoing work with C. Gammie and B. Ryan
Stratified MRI - Simulations - ongoing work
Movie: evolution of the magnetic field component B_y at the onset of the Magneto-Rotational Instability in a stratified box (with a gravity field).
Extra slide: MHD - Orszag-Tang 3D - MPI/GPU performance
No shear border, no dissipative terms.
Hardware configuration: CURIE, multi-GPU (1 GPU per MPI process).
MPI performance: number of 10⁶ cell updates / second in single precision; sub-domain size is 128 × 128 × 128.

# MPI proc   global size          CPU     CPU / MPI proc   GPU       GPU / MPI proc
1            128 × 128 × 128       0.21    0.21               13.6    13.6
8            256 × 256 × 256       1.68    0.21               95.3    11.9
64           512 × 512 × 512      13.4     0.21              750.7    11.7
128          1024 × 512 × 512     26.8     0.21             1498.3    11.7
256          1024 × 1024 × 512    53.5     0.21             2969.3    11.6
- the weight of MPI communications in the profile is much smaller than in the MRI case (shear border and dissipative-term communications are very costly)
- scaling is also much better for this plain MHD MPI-GPU version than for MHD + shear border
- sustained speedup of ∼ 55× for GPU versus CPU using the same number of MPI processes
For compressible MHD, one must take into account how structure-function computations and other required pre-processing analyses will slow down the GPU.
RAMSES-GPU code: available for download at http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/download.html
Extra slide - RamsesGPU on the NVIDIA CARMA board
- the x86_64 CPU has a high electric power consumption; replace it with a low-power ARM architecture
- the GFlops/Watt ratio is over 6 for the Quadro 1000M GPU and around 1.5 for the Tegra 3 CPU; total consumption is 5 + 45 Watt
- just rebuild RamsesGPU with the cross toolchain for CUDA on ARM
Thanks to E. Orlotti (NVIDIA) for providing the CARMA board.
Extra slide - RamsesGPU on NVIDIA CARMA
MHD (single precision), NVIDIA CARMA, with and without dissipative terms.
Single-GPU/CPU - number of 10⁶ cell updates / second:

size           CPU (a)   CPU (dissip)   GPU    GPU (dissip)   GPU/CPU
32x32x32       0.106     0.093          1.19   0.98           10.5
48x48x48       0.125     0.108          1.72   1.35           12.5
64x64x64       0.130     0.118          1.94   1.57           13.3
96x96x96       0.139     0.124          2.13   1.70           13.7
128x128x128    0.141     0.125          2.35   1.93           15.4

(a) using 4 MPI tasks
ARM core performance is not so bad (SP only); the Quadro 1000M is equivalent to ∼ 40 to 60 ARM cores.