Lean and Mean: Finite element solvers beyond a trillion degrees of freedom
U. Rüde
Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, www.cerfacs.fr
Lehrstuhl für Simulation, Universität Erlangen-Nürnberg, www10.informatik.uni-erlangen.de
Joint work with D. Bartuschat, B. Gmeiner, D. Thoennes, H. Stengel, H. Köstler (FAU); C. Waluga, M. Huber, L. John, D. Drzisga, B. Wohlmuth (TUM); M. Mohr, S. Bauer, J. Weismüller, P. Bunge (LMU)
Prologue: Cost and Complexity

What is the cost of solving a linear system? Gaussian elimination: #Flops ~ 2/3 n³.
What is the cost of solving the discrete Poisson equation? C'mon: it's the mother of all PDEs! Yep, there are O(n) algorithms: wavelets, fast multipole, multigrid, … But what is the best constant published? For Poisson in 2D, second order: #Flops ~ 30 n (Stüben, 1982).
Assume a computer with 1 PetaFlops and 100k cores, and n = 10⁹: the expected time to solution is 3·10⁻⁵ sec (microseconds!). Standard computational practice in 2017 misses this by many orders of magnitude! Why do we have this huge gap from theory to practice? No, it cannot be explained by "cache effects" alone. This talk tries to understand where we stand.
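A back-of-the-envelope sketch of these two estimates (Python), assuming the 1 PFlops machine runs at full peak:

```python
# Illustrative cost estimates from the prologue (assumes the machine runs at peak).

n = 10**9            # unknowns
peak = 1e15          # assumed machine peak: 1 PFlop/s

flops_gauss = (2.0 / 3.0) * n**3       # dense Gaussian elimination
flops_mg    = 30 * n                   # multigrid, Poisson 2D, 2nd order (Stueben 1982)

print(f"Gaussian elimination: {flops_gauss:.1e} flops -> {flops_gauss / peak:.1e} s at peak")
print(f"Multigrid (30 n):     {flops_mg:.1e} flops -> {flops_mg / peak:.1e} s at peak")
# Multigrid: 3.0e10 flops -> 3.0e-5 s, i.e. tens of microseconds at full peak.
```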
The questions I want to ask: What is the minimal cost of solving a PDE?
(such as Poisson's or Stokes' equation in 3D)

Asymptotic results of the form

    Cost ~ C nᵖ   (Flops)

with unspecified constant C are inadequate to predict the real performance. Additionally, flops as a metric have become questionable.

How do we quantify true cost (i.e. the resources needed)?
Number of flops? Memory consumption? Memory bandwidth (aggregated)? Communication bandwidth? Communication latency? Power consumption?
Multi-PetaFlops Supercomputers

Sunway TaihuLight: SW26010 processor, 10,649,600 cores, 260 cores (1.45 GHz) per node, 32 GiB RAM per node, 125 PFlops Peak, power consumption 15.37 MW. TOP 500: #1
JUQUEEN: Blue Gene/Q architecture, 458,752 PowerPC A2 cores, 16 cores (1.6 GHz) per node, 16 GiB RAM per node, 5D torus interconnect, 5.8 PFlops Peak. TOP 500: #13
SuperMUC (phase 1): Intel Xeon architecture, 147,456 cores, 16 cores (2.7 GHz) per node, 32 GiB RAM per node, pruned tree interconnect, 3.2 PFlops Peak. TOP 500: #27
How big a PDE problem can we solve?

400 TByte main memory = 4·10¹⁴ Bytes = 5 vectors, each with 10¹³ elements of 8 Bytes (double precision).
A matrix-free implementation is necessary: even with a sparse matrix format, storing a matrix of dimension 10¹³ is not possible on Juqueen.
Which algorithm? Multigrid: Cost = C·N with C "moderate", e.g. C = 100. Does it parallelize well? What is the overhead?
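A quick check of this memory arithmetic (Python); the per-row size assumed for a sparse matrix is an illustrative assumption:

```python
# Memory budget on a Juqueen-class machine: how large can a matrix-free FE problem be?
mem_bytes = 400e12          # ~400 TByte aggregate main memory
bytes_per_double = 8
n_vectors = 5               # number of full-size vectors the solver keeps (as stated above)

n = mem_bytes / (n_vectors * bytes_per_double)
print(f"max unknowns: {n:.1e}")                     # 1.0e13

# Even one explicitly stored sparse matrix (assume ~15 nonzeros per row, ~12 bytes each)
# would exceed the budget by far -> matrix-free is mandatory.
matrix_bytes = n * 15 * 12
print(f"sparse matrix would need ~{matrix_bytes / 1e12:.0f} TByte")

# Multigrid work estimate: Cost = C * N with a "moderate" constant, e.g. C = 100.
print(f"flops for one solve: {100 * n:.1e}")
```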
HHG: A modern matrix-free architecture for FE

Structured refinement of an unstructured base mesh. Geometrical hierarchy: Volume, Face, Edge, Vertex.
Hierarchical Hybrid Grids (HHG). B. Bergen, F. Hülsemann, UR, G. Wellein: "Is 1.7·10¹⁰ unknowns the largest finite element system that can be solved today?", SuperComputing 2005 (ISC Award 2006). Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015.
Parallelize multigrid for tetrahedral finite elements: partition the domain, parallelize all operations on all grids, use clever data structures, matrix-free implementation.
Bey's tetrahedral refinement (unknown counts per macro element are sketched below).
Do not worry (so much) about coarse grids: idle processors? short messages? sequential dependency in the grid hierarchy?
Elliptic problems always require global communication and thus coarser grids for the global data transport.
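A small sketch (Python) of how quickly uniform tetrahedral (Bey-type) refinement generates unknowns per macro-tetrahedron; the counting formulas are the standard ones for uniform refinement, not code taken from HHG:

```python
# Tetrahedra and vertex counts for uniform (Bey) refinement of one macro-tetrahedron.
def counts(level):
    tets = 8**level                              # each refinement step splits a tet into 8
    k = 2**level                                 # subdivisions per edge
    verts = (k + 1) * (k + 2) * (k + 3) // 6     # vertices of the uniformly refined tetrahedron
    return tets, verts

for level in range(9):
    tets, verts = counts(level)
    print(f"level {level}: {tets:>9} tets, {verts:>9} vertices")
# A few refinement levels already give millions of unknowns per macro element,
# while the unstructured base mesh itself stays small.
```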
[Figure: HHG data structures. Interior, edge, and vertex primitives with their linearization and memory representation; ghost points are updated by copying within a process and by MPI messages between processes.]
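A schematic sketch of the ghost-point update between two linearized primitive buffers (Python/NumPy). The class layout and names are illustrative assumptions, not the actual HHG data structures; across processes the plain copy would become an MPI message, as in the figure:

```python
# Schematic ghost-point exchange between two neighboring primitives.
# Flat arrays stand in for the linearized memory representation; the layout
# is purely illustrative and NOT the HHG implementation.
import numpy as np

class Primitive:
    def __init__(self, n_interior, n_ghost):
        self.interior = np.zeros(n_interior)   # owned unknowns, linearized
        self.ghost = np.zeros(n_ghost)         # copies of neighbor values

def update_ghosts(dst, src, src_indices):
    # Same process: a plain copy suffices.
    # Across processes, exactly this contiguous buffer would be sent/received via MPI.
    dst.ghost[:] = src.interior[src_indices]

edge = Primitive(n_interior=4, n_ghost=0)
face = Primitive(n_interior=12, n_ghost=4)
edge.interior[:] = [1.0, 2.0, 3.0, 4.0]
update_ghosts(face, edge, src_indices=np.arange(4))
print(face.ghost)      # [1. 2. 3. 4.]
```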
HHG Parallel Grid Traversal
Application to Earth Mantle Convection Models. Ghattas, O., et al.: An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth's Mantle. Gordon Bell Prize 2015. Weismüller, J., Gmeiner, B., Ghelichkhan, S., Huber, M., John, L., Wohlmuth, B., UR, Bunge, H.-P. (2015). Fast asthenosphere motion in high-resolution global mantle flow models. Geophysical Research Letters, 42(18), 7429-7435.
HHG Solver for Stokes System

Motivated by the Earth mantle convection problem: Boussinesq model for mantle convection problems, derived from the equations for balance of forces, conservation of mass and energy:

    −∇·(2η ε(u)) + ∇p = ρ(T) g
    ∇·u = 0
    ∂T/∂t + u·∇T − ∇·(κ∇T) = γ

with u velocity, p dynamic pressure, T temperature, η viscosity of the material, ε(u) = ½(∇u + (∇u)ᵀ) strain rate tensor, ρ density, κ thermal conductivity, γ heat sources, g gravity vector.

Gmeiner, Waluga, Stengel, Wohlmuth, UR: Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SISC, 2015.

Solution of the Stokes equations: scale up to ~10¹³ nodes/DoFs, resolving the whole Earth mantle globally with 1 km resolution.
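A rough plausibility check of that figure (Python): the mantle volume at 1 km resolution, using standard Earth radii; the unknowns-per-node factor is an assumption:

```python
# How many grid nodes does a 1 km resolution of the whole Earth mantle imply?
import math

r_surface = 6371.0   # km, Earth radius
r_cmb = 3480.0       # km, core-mantle boundary

volume = 4.0 / 3.0 * math.pi * (r_surface**3 - r_cmb**3)   # ~9e11 km^3
nodes = volume / 1.0**3                                     # one node per km^3
print(f"mantle volume: {volume:.2e} km^3 -> ~{nodes:.0e} nodes at 1 km spacing")

# With a handful of unknowns per node (velocity components, pressure, temperature),
# this lands within an order of magnitude of the ~10^13 DoFs quoted above.
print(f"DoFs (assuming 5 unknowns per node): {5 * nodes:.1e}")
```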
Simulation of Earth Mantle Convection
Comparison of Stokes Solvers

Three solvers:
SCG: Schur complement CG (pressure correction) with MG
PMINRES: MINRES with MG preconditioner for the Vector-Laplacian
UMG: MG for the saddle point problem with Uzawa-type smoother

Time to solution, residual reduction by 10⁻⁸:

nodes    threads    DoFs        SCG iter    SCG time    PMINRES iter    PMINRES time    UMG iter    UMG time
1        24         6.6·10⁷     28 (3)      137.26      65 (3)          116.35          7           62.44
6        192        5.3·10⁸     26 (5)      135.85      64 (7)          133.12          7           70.97
48       1 536      4.3·10⁹     26 (15)     139.64      62 (15)         133.29          7           79.14
384      12 288     3.4·10¹⁰    26 (30)     142.92      62 (35)         138.26          7           92.76
3 072    98 304     2.7·10¹¹    26 (60)     154.73      62 (71)         156.86          7           116.44
24 576   786 432    2.2·10¹²    28 (120)    187.42      64 (144)        184.22          7           169.15

Table 5: Weak scaling results with time-to-solution (in sec.) for the Laplace-operator.
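As a reading aid for Table 5, the weak-scaling efficiency T(1 node)/T(N nodes) of each solver can be computed directly from the timings above (Python):

```python
# Weak-scaling efficiency T(1 node) / T(N nodes) for the three Stokes solvers of Table 5.
nodes = [1, 6, 48, 384, 3072, 24576]
times = {
    "SCG":     [137.26, 135.85, 139.64, 142.92, 154.73, 187.42],
    "PMINRES": [116.35, 133.12, 133.29, 138.26, 156.86, 184.22],
    "UMG":     [62.44, 70.97, 79.14, 92.76, 116.44, 169.15],
}
for solver, t in times.items():
    efficiency = [t[0] / ti for ti in t]
    print(solver, [f"{e:.2f}" for e in efficiency])
# The Uzawa multigrid (UMG) is fastest in absolute time throughout, although its
# weak-scaling efficiency drops more strongly at the largest scale.
```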
Instationary Computation

    −Δu + ∇p = Ra T b_r
    div u = 0
    ∂t T + u·∇T = Pe⁻¹ ΔT
Simplified Model

Dimensionless numbers:
• Rayleigh number (Ra): determines the …
• Peclet number (Pe): ratio of convective to diffusive transport

The mantle convection problem is defined by these equations on a spherical shell, together with suitable initial/boundary conditions.
Finite elements with 6.5×10⁹ DoF, 10,000 time steps, run time 7 days. Mid-size cluster: 288 compute cores in 9 nodes of LSS at FAU.
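One plausible reading of the time-stepping structure behind such a run, as a control-flow sketch (Python). The solver functions are dummies standing in for the actual FE components; the splitting into a Stokes solve followed by a temperature update is an assumption based on the model above:

```python
# Schematic of the coupled time loop: solve the Stokes system for the current
# temperature, then advance the temperature by one advection-diffusion step.
# The "solvers" below are dummies that only illustrate the control flow.
import numpy as np

def solve_stokes(T):
    # Placeholder: would solve the Stokes system with buoyancy right-hand side Ra*T*b_r.
    return np.zeros_like(T), np.zeros_like(T)

def advance_temperature(T, u, dt):
    # Placeholder: one step of  dT/dt + u.grad(T) = Pe^-1 * Laplace(T).
    return T

T = np.zeros(1000)          # dummy temperature field
dt = 1.0e-3                 # dummy time step
for step in range(10000):   # 10,000 time steps as in the run above
    u, p = solve_stokes(T)
    T = advance_temperature(T, u, dt)
```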
Stationary Flow Field
Exploring the Limits …

Gmeiner B., Huber M., John L., UR, Wohlmuth B.: A quantitative performance study for Stokes solvers at the extreme scale, Journal of Computational Science, 2016.

Multigrid with Uzawa smoother, optimized for minimal memory consumption. 10¹³ unknowns correspond to 80 TByte for the solution vector alone; Juqueen has 450 TByte of memory, so a matrix-free implementation is essential.

Setup (from the cited paper): the coarse grid consists of 240 tetrahedra for the case of 5 nodes and 80 threads; its number of degrees of freedom grows from 9.0·10³ to 4.1·10⁷ under weak scaling. Stokes system in the Laplace-operator formulation; the relative accuracies for the coarse grid solver (PMINRES and CG) are set to 10⁻³ and 10⁻⁴, respectively. All other solver parameters remain as previously described.

nodes     threads    DoFs        iter    time (s)    time w/o coarse grid (s)    coarse grid share (%)
5         80         2.7·10⁹     10      685.88      678.77                      1.04
40        640        2.1·10¹⁰    10      703.69      686.24                      2.48
320       5 120      1.2·10¹¹    10      741.86      709.88                      4.31
2 560     40 960     1.7·10¹²    9       720.24      671.63                      6.75
20 480    327 680    1.1·10¹³    9       776.09      681.91                      12.14

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry.

Annotation: this is the largest system solved to date.
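The last column of Table 10 follows directly from the two timing columns; a quick check (Python):

```python
# Coarse-grid share of the total solve time, recomputed from Table 10.
total = [685.88, 703.69, 741.86, 720.24, 776.09]   # total time (s)
no_cg = [678.77, 686.24, 709.88, 671.63, 681.91]   # time excluding the coarse grid (s)

for t, t_ncg in zip(total, no_cg):
    share = 100.0 * (t - t_ncg) / t
    print(f"{share:.2f} %")     # 1.04, 2.48, 4.31, 6.75, 12.14
# The coarse-grid solver becomes relatively more expensive under weak scaling,
# but even at 1.1e13 DoFs it stays around 12% of the total time.
```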
10¹² - these are BIG problems

computer generation       desired problem size N (DoF)    energy estimate (kWh): 1 nJoule × N² all-to-all communication    TerraNeo prototype (kWh)
gigascale (10⁹ FLOPS)     10⁶                             0.278 Wh (10 min of LED light)                                   0.13 Wh
terascale (10¹² FLOPS)    10⁹                             278 kWh (2 weeks of blow-drying hair)                            0.03 kWh
petascale (10¹⁵ FLOPS)    10¹²                            278 GWh (1 month of electricity for Berlin)                      27 kWh
exascale (10¹⁸ FLOPS)     10¹⁵                            278 PWh (100 years of world electricity production)              ?
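The energy column follows from the stated model of 1 nJoule per message and N² all-to-all communication; a quick check (Python):

```python
# Energy estimate: 1 nJ * N^2 (all-to-all communication), converted to Wh.
for N in [10**6, 10**9, 10**12, 10**15]:
    joules = 1e-9 * N**2
    watt_hours = joules / 3600.0
    print(f"N = {N:.0e}: {joules:.2e} J = {watt_hours:.3e} Wh")
# 10^6  -> 2.78e-1  Wh  (0.278 Wh)
# 10^9  -> 2.78e+5  Wh  (278 kWh)
# 10^12 -> 2.78e+11 Wh  (278 GWh)
# 10^15 -> 2.78e+17 Wh  (278 PWh)
```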
At extreme scale, optimal complexity is a must! No algorithm with O(N²) complexity is usable: no dense matrix-vector multiplication, no pairwise distance computation, no advanced graph partitioning (for load balancing), no …
Towards systematic performance engineering:
Parallel Textbook Efficiency Brandt, A. (1998). Barriers to achieving textbook multigrid efficiency (TME) in CFD. Thomas, J. L., Diskin, B., & Brandt, A. (2003). Textbook Multigrid Efficiency for Fluid Simulations*. Annual review of fluid mechanics, 35(1), 317-340. Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015 Huber, M., John, L., Pustejovska, P., UR, Waluga, C., & Wohlmuth, B. (2015). Solution Techniques for the Stokes System: A priori and a posteriori modifications, resilient algorithms. Proceedings of ICIAM, Beijing, 2015, arXiv preprint arXiv:1511.05759.
Textbook Multigrid Efficiency (TME). "Textbook multigrid efficiency means solving a discrete PDE problem with a computational effort that is only a small (less than 10) multiple of the operation count associated with the discretized equations itself." [Brandt, 98]. Work unit (WU) = single elementary relaxation. Classical algorithmic TME-factor: ops for solution / ops for work unit. New parallel TME-factor to quantify algorithmic efficiency combined with scalability.
New TME paradigm as parallel performance metric: analyse the cost of an elementary relaxation (micro-kernel benchmarks); the aggregate performance defines a WU; measure parallel solver performance; compute the TME factor.
Parallel TME

µ_sm : # of elementary relaxation steps on a single core per second
U : # of cores
U µ_sm : aggregate peak relaxation performance

T_WU(N, U) = N / (U µ_sm)   idealized time for a work unit
T(N, U)   time to solution for N unknowns on U cores

Parallel textbook efficiency factor:

    E_ParTME(N, U) = T(N, U) / T_WU(N, U) = T(N, U) · U µ_sm / N

It combines algorithmic and implementation efficiency.
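A minimal sketch (Python) of evaluating the parallel TME factor from measured quantities; the numbers plugged in are illustrative assumptions, chosen so the result lands near the (CC) value of about 26 reported later in Table 4:

```python
# Parallel textbook efficiency factor:
#   T_WU(N, U)     = N / (U * mu_sm)        idealized time for one work unit
#   E_ParTME(N, U) = T(N, U) / T_WU(N, U)   = T(N, U) * U * mu_sm / N
def e_par_tme(T, N, U, mu_sm):
    t_wu = N / (U * mu_sm)          # idealized work-unit time
    return T / t_wu

# Illustrative (made-up) numbers: 1e7 elementary relaxations per core and second,
# 4096 cores, 9e9 unknowns, 5.7 s time to solution -> factor of roughly 26.
print(e_par_tme(T=5.7, N=9e9, U=4096, mu_sm=1e7))
```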
TME and Parallel TME results

Three model problems: scalar with constant coefficients (CC), scalar with variable coefficients (VC), and Stokes solved via the Schur complement pressure correction (SF). Full multigrid (FMG) with the number of iterations chosen such that asymptotic optimality is maintained.

Setting / Measure    Grid points    Cores U    (CC) - FMG(2,2)    (VC) - FMG(2,2)    (SF) - FMG(2,1)
E_TME                -              -          6.5                6.5                31
E_SerTME             2·10⁶          1          15                 11                 64
E_NodeTME            3·10⁷          16         22                 13                 100
E_ParTME1            9·10⁹          4 096      26                 15                 118
E_ParTME2            2·10¹¹         16 384     22                 13                 -

Table 4: TME factors for different problem sizes and discretizations.

Algorithmically, E_TME = 6.5 for the scalar cases; the new parallel TME factors reach around 20 for the scalar PDEs and ≥100 for Stokes. This reflects that traversing the coarser grids of the multigrid hierarchy is less efficient than processing the large grids on the finest level.
Conclusion

10¹³ DoF for FE are possible. Multigrid scales to Peta (and beyond). HHG: a lean, matrix-free implementation. Excellent time to solution. Algorithmic resilience with multigrid.