Lean and Mean: Finite element solvers beyond a trillion degrees of freedom

U. Rüde

Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, www.cerfacs.fr
Lehrstuhl für Simulation, Universität Erlangen-Nürnberg, www10.informatik.uni-erlangen.de

Joint work with D. Bartuschat, B. Gmeiner, D. Thoennes, H. Stengel, H. Köstler (FAU); C. Waluga, M. Huber, L. John, D. Drzisga, B. Wohlmuth (TUM); M. Mohr, S. Bauer, J. Weismüller, P. Bunge (LMU)

Prologue: Cost and Complexity

What is the cost of solving a linear system? Gaussian elimination: #Flops ~ 2/3 n^3.
What is the cost of solving the discrete Poisson equation? C'mon, it's the mother of all PDEs! Yes, there are O(n) algorithms: wavelets, fast multipole, multigrid, ... But what is the best published constant? For 2D Poisson, second order: #Flops ~ 30 n (Stüben, 1982).
Assume a computer with 1 PetaFlops and 100k cores, and n = 10^9. Expected time to solution: 3×10^-5 sec (microseconds!). Standard computational practice in 2017 misses this by many orders of magnitude!
Why do we have this huge gap between theory and practice? No, it cannot be explained by "cache effects" alone. This talk tries to understand where we stand.
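A quick back-of-the-envelope check of this estimate (a sketch; the 30 n constant is the cited value for second-order 2D Poisson, the machine parameters are the ones assumed above):

```python
# Back-of-the-envelope: time to solve 2D Poisson with an O(n) multigrid solver,
# using the constant of 30 flops per unknown cited above (Stueben, 1982).

n = 1e9             # unknowns
flops_per_unknown = 30
peak_flops = 1e15   # 1 PetaFlops machine

total_flops = flops_per_unknown * n       # 3e10 flops
time_at_peak = total_flops / peak_flops   # seconds, assuming peak throughput

print(f"work: {total_flops:.1e} flops, ideal time: {time_at_peak:.1e} s")
# -> work: 3.0e+10 flops, ideal time: 3.0e-05 s  (i.e. 30 microseconds)
```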


The questions I want to ask: What is the minimal cost of solving a PDE (such as Poisson's or Stokes' equation in 3D)?

Asymptotic results of the form

    Cost ≈ C · n^p   (Flops)

with an unspecified constant C are inadequate to predict real performance. Additionally, flops as a metric have become questionable.

How do we quantify true cost (i.e., the resources needed)?
- Number of flops?
- Memory consumption?
- Memory bandwidth (aggregated)?
- Communication bandwidth?
- Communication latency?
- Power consumption?


Multi-PetaFlops Supercomputers

Sunway TaihuLight (TOP 500: #1)
  SW26010 processor, 10,649,600 cores
  260 cores (1.45 GHz) per node, 32 GiB RAM per node
  125 PFlops peak, power consumption 15.37 MW

JUQUEEN (TOP 500: #13)
  Blue Gene/Q architecture, 458,752 PowerPC A2 cores
  16 cores (1.6 GHz) per node, 16 GiB RAM per node
  5D torus interconnect, 5.8 PFlops peak

SuperMUC, phase 1 (TOP 500: #27)
  Intel Xeon architecture, 147,456 cores
  16 cores (2.7 GHz) per node, 32 GiB RAM per node
  Pruned tree interconnect, 3.2 PFlops peak


How big a PDE problem can we solve?

400 TByte main memory = 4×10^14 Bytes = 5 vectors, each with 10^13 elements at 8 Bytes (double precision).
A matrix-free implementation is necessary: even in a sparse matrix format, storing a matrix of dimension 10^13 is not possible on Juqueen.
Which algorithm? Multigrid: Cost = C·N with C "moderate", e.g. C = 100. Does it parallelize well? What is the overhead?
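The arithmetic behind this slide, as a small sketch (the sparse-matrix storage figure assumes roughly 15 nonzeros per row with 8-byte values and 4-byte column indices, which is an illustrative assumption, not a number from the talk):

```python
# Sketch: why 1e13 unknowns force a matrix-free implementation on Juqueen
# (numbers as on the slide: ~450 TByte aggregate memory, double precision).

bytes_per_double = 8
n = 1e13                          # unknowns

vector_bytes = n * bytes_per_double
print(f"one solution vector: {vector_bytes/1e12:.0f} TByte")        # ~80 TByte

# even a sparse matrix row (assumed ~15 nonzeros, value + column index) is far too big
nnz_per_row = 15
matrix_bytes = n * nnz_per_row * (8 + 4)
print(f"sparse matrix (CRS-like): {matrix_bytes/1e15:.1f} PByte")   # ~1.8 PByte >> 450 TByte

# multigrid work estimate: Cost = C * N with a "moderate" constant, e.g. C = 100
flops = 100 * n
print(f"multigrid work: {flops:.1e} flops")                         # 1e15 flops
```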


HHG: A modern matrix-free architecture for FE

Structured refinement of an unstructured base mesh.
Geometrical hierarchy: volume, face, edge, vertex.


Hierarchical Hybrid Grids (HHG)

B. Bergen, F. Hülsemann, UR, G. Wellein: ISC Award 2006: "Is 1.7×10^10 unknowns the largest finite element system that can be solved today?", SuperComputing, 2005.
Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015.

Parallelize multigrid for tetrahedral finite elements (Bey's tetrahedral refinement):
- partition the domain
- parallelize all operations on all grids
- use clever data structures
- matrix-free implementation (see the sketch below)

Do not worry (so much) about coarse grids: idle processors? short messages? sequential dependencies in the grid hierarchy?
Elliptic problems always require global communication and thus coarser grids for the global data transport.
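The matrix-free idea can be illustrated with a minimal sketch (this is not HHG code): on the regularly refined interior of a macro element the discrete operator reduces to a fixed stencil, so the operator is applied on the fly instead of being assembled and stored. The sketch uses a simple 7-point Laplace stencil on one structured block; block size and mesh width are arbitrary illustration values.

```python
import numpy as np

def apply_laplace_matrix_free(u, h):
    """Apply a 7-point Laplace stencil on the interior of a structured block;
    no global matrix is ever assembled or stored."""
    v = np.zeros_like(u)
    c = 1.0 / (h * h)
    v[1:-1, 1:-1, 1:-1] = c * (
        6.0 * u[1:-1, 1:-1, 1:-1]
        - u[:-2, 1:-1, 1:-1] - u[2:, 1:-1, 1:-1]
        - u[1:-1, :-2, 1:-1] - u[1:-1, 2:, 1:-1]
        - u[1:-1, 1:-1, :-2] - u[1:-1, 1:-1, 2:]
    )
    return v

u = np.random.rand(65, 65, 65)   # one refined macro-element block plus ghost layer
v = apply_laplace_matrix_free(u, h=1.0 / 64)
```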


Copying and MPI messages to update ghost points

[Figure: Edge and Vertex primitives with their interior and ghost unknowns, the index numbering, and the corresponding linearization & memory representation. Ghost points are updated by copying between primitives on the same process and by MPI messages between processes.]
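A minimal sketch of this communication pattern (not the HHG implementation): each process sends the interior values adjacent to an interface to its neighbour, which stores them in its ghost layer. The sketch uses mpi4py on a 1D block of unknowns; the block size and rank layout are made-up illustration values.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# each rank owns a 1D block of unknowns plus one ghost value on either side
n_local = 8
u = np.full(n_local + 2, float(rank))   # u[0] and u[-1] are ghost points

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# send my boundary values, receive the neighbours' values into my ghost layer
comm.Sendrecv(sendbuf=u[1:2], dest=left, recvbuf=u[-1:], source=right)
comm.Sendrecv(sendbuf=u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
```

Run with, e.g., `mpiexec -n 4 python ghost_exchange.py` (file name is illustrative); afterwards each rank holds its neighbours' boundary values in its ghost slots.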


HHG Parallel Grid Traversal


Application to Earth Mantle Convection Models

Ghattas, O., et al.: An Extreme-Scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth's Mantle. Gordon Bell Prize 2015.
Weismüller, J., Gmeiner, B., Ghelichkhan, S., Huber, M., John, L., Wohlmuth, B., UR, Bunge, H. P. (2015). Fast asthenosphere motion in high-resolution global mantle flow models. Geophysical Research Letters, 42(18), 7429-7435.


HHG Solver for Stokes System

Motivated by the Earth mantle convection problem: a Boussinesq model for mantle convection, derived from the equations for balance of forces, conservation of mass, and conservation of energy:

    −∇·(2η ε(u)) + ∇p = ρ(T) g
    ∇·u = 0
    ∂T/∂t + u·∇T − ∇·(κ∇T) = γ

with ε(u) = ½(∇u + (∇u)^T) and
    u      velocity
    p      dynamic pressure
    T      temperature
    η      viscosity of the material
    ε(u)   strain rate tensor
    ρ      density
    κ, γ, g  thermal conductivity, heat sources, gravity vector

Gmeiner, Waluga, Stengel, Wohlmuth, UR: Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SISC, 2015.

Solution of the Stokes equations: scale up to ~10^13 nodes/DoFs, resolving the whole Earth mantle globally with 1 km resolution.

Simulation of Earth Mantle Convection


Comparison of Stokes Solvers

Three solvers:
- Schur complement CG (pressure correction) with MG
- MINRES with MG preconditioner for the vector Laplacian
- MG for the saddle point problem with an Uzawa-type smoother (see the sketch below)
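To make the third variant concrete, here is a minimal numpy sketch of the basic Uzawa iteration for the saddle point system [A Bᵀ; B 0]. In HHG an Uzawa-type step is used as a smoother inside multigrid; the sketch instead uses exact sub-solves on a tiny dense test problem purely to show the velocity/pressure update structure (omega and the matrices are made up).

```python
import numpy as np

def uzawa(A, B, f, g, omega=0.5, iters=50):
    """Basic Uzawa iteration for  [A  B^T; B  0] [u; p] = [f; g]."""
    u, p = np.zeros(A.shape[0]), np.zeros(B.shape[0])
    for _ in range(iters):
        u = np.linalg.solve(A, f - B.T @ p)   # velocity update (exact solve here;
                                              # a few smoothing steps in practice)
        p = p + omega * (B @ u - g)           # pressure update with relaxation omega
    return u, p

# tiny illustrative saddle point problem
A = np.array([[4.0, 1.0], [1.0, 3.0]])        # SPD velocity block
B = np.array([[1.0, 2.0]])                    # divergence-like constraint
u, p = uzawa(A, B, f=np.array([1.0, 2.0]), g=np.array([0.0]))
```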

Time to solution, residual reduction by 10^-8:

  nodes    threads   DoFs        SCG iter   SCG time   PMINRES iter   PMINRES time   UMG iter   UMG time
  1        24        6.6·10^7    28 (3)     137.26     65 (3)         116.35         7          62.44
  6        192       5.3·10^8    26 (5)     135.85     64 (7)         133.12         7          70.97
  48       1 536     4.3·10^9    26 (15)    139.64     62 (15)        133.29         7          79.14
  384      12 288    3.4·10^10   26 (30)    142.92     62 (35)        138.26         7          92.76
  3 072    98 304    2.7·10^11   26 (60)    154.73     62 (71)        156.86         7          116.44
  24 576   786 432   2.2·10^12   28 (120)   187.42     64 (144)       184.22         7          169.15

Table 5: Weak scaling results with time-to-solution (in sec.) for the Laplace-operator formulation.


Instationary Computation

Simplified (dimensionless) model:

    −Δu + ∇p = Ra T r̂
    div u = 0
    ∂T/∂t + u·∇T = Pe⁻¹ ΔT

Dimensionless numbers:
- Rayleigh number (Ra): controls the vigor of the buoyancy-driven convection
- Peclet number (Pe): ratio of convective to diffusive transport

The mantle convection problem is defined by these equations on a spherical shell together with suitable initial/boundary conditions.

Finite elements with 6.5×10^9 DoF, 10 000 time steps, run time 7 days on a mid-size cluster: 288 compute cores in 9 nodes of LSS at FAU.
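A schematic of the instationary coupling loop (not the HHG code): in each time step the Stokes system would be solved for u given T, and the temperature is then advanced with the advection-diffusion equation. To keep the sketch self-contained and runnable, the Stokes solve is replaced by a prescribed 1D velocity; Pe, grid size, and time step are made-up illustration values.

```python
import numpy as np

Pe, nx, nsteps = 100.0, 200, 1000
h = 1.0 / nx
dt = 0.25 * min(h * h * Pe, h)          # crude stability bound for the explicit step
x = np.linspace(0.0, 1.0, nx)
T = np.exp(-200.0 * (x - 0.3) ** 2)     # initial temperature blob

def solve_stokes(T):
    # placeholder for the Stokes solve  -lap(u) + grad(p) = Ra*T*r,  div u = 0
    return np.full_like(T, 0.5)

for _ in range(nsteps):
    u = solve_stokes(T)                                    # velocity from current T
    gradT = (T - np.roll(T, 1)) / h                        # upwind difference for u > 0
    lapT = (np.roll(T, -1) - 2 * T + np.roll(T, 1)) / h**2
    T = T + dt * (-u * gradT + lapT / Pe)                  # explicit Euler step for T
```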


Stationary Flow Field


Exploring the Limits ...

Gmeiner B., Huber M., John L., UR, Wohlmuth B.: A quantitative performance study for Stokes solvers at the extreme scale, Journal of Computational Science, 2016.

Multigrid with Uzawa smoother, optimized for minimal memory consumption:
- 10^13 unknowns correspond to 80 TByte for the solution vector
- Juqueen has 450 TByte memory
- a matrix-free implementation is essential
- this is the largest system solved to date

The initial coarse grid consists of 240 tetrahedra for the case of 5 nodes and 80 threads. The number of degrees of freedom on the coarse grid grows from 9.0·10^3 to 4.1·10^7 in the weak scaling. The Stokes system is solved in the Laplace-operator formulation. The relative accuracies for the coarse grid solver (PMINRES and CG algorithm) are set to 10^-3 and 10^-4, respectively. All other parameters of the solver remain as previously described.

  nodes    threads   DoFs        iter   time [s]   time w/o c.g. [s]   c.g. time in %
  5        80        2.7·10^9    10     685.88     678.77              1.04
  40       640       2.1·10^10   10     703.69     686.24              2.48
  320      5 120     1.2·10^11   10     741.86     709.88              4.31
  2 560    40 960    1.7·10^12   9      720.24     671.63              6.75
  20 480   327 680   1.1·10^13   9      776.09     681.91              12.14

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry.

10^12 — these are BIG problems

Energy estimate: 1 nJoule × N^2 for an all-to-all communication pattern.

  computer generation   peak          desired problem size DoF = N   energy estimate (kWh)                              TerraNeo prototype (kWh)
  gigascale             10^9 FLOPS    10^6                           0.278 Wh  (10 min of LED light)                    0.13 Wh
  terascale             10^12 FLOPS   10^9                           278 kWh   (2 weeks of blow drying hair)            0.03 kWh
  petascale             10^15 FLOPS   10^12                          278 GWh   (1 month of electricity for Berlin)      27 kWh
  exascale              10^18 FLOPS   10^15                          278 PWh   (100 years of world electricity production)   ?
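The energy column follows directly from the slide's 1 nJ × N^2 assumption; a small sketch reproducing the numbers:

```python
# Reproduce the slide's energy estimate: 1 nJ per exchanged item and an
# all-to-all pattern costing N^2 such exchanges.

nJ = 1e-9                      # Joule per exchanged item (slide's assumption)
for exponent in (6, 9, 12, 15):
    N = 10.0 ** exponent
    joule = nJ * N * N
    kwh = joule / 3.6e6        # 1 kWh = 3.6e6 J
    print(f"N = 1e{exponent}: {kwh:.3e} kWh")
# N = 1e6:  2.778e-04 kWh  (= 0.278 Wh)
# N = 1e9:  2.778e+02 kWh
# N = 1e12: 2.778e+08 kWh  (= 278 GWh)
# N = 1e15: 2.778e+14 kWh  (= 278 PWh)
```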

At extreme scale, optimal complexity is a must! No algorithm with O(N^2) complexity is usable:
- no dense matrix-vector multiplication
- no pairwise distance computation
- no advanced graph partitioning (for load balancing)
- no ...


Towards systematic performance engineering:

Parallel Textbook Efficiency

Brandt, A. (1998). Barriers to achieving textbook multigrid efficiency (TME) in CFD.
Thomas, J. L., Diskin, B., & Brandt, A. (2003). Textbook Multigrid Efficiency for Fluid Simulations. Annual Review of Fluid Mechanics, 35(1), 317-340.
Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015.
Huber, M., John, L., Pustejovska, P., UR, Waluga, C., & Wohlmuth, B. (2015). Solution Techniques for the Stokes System: A priori and a posteriori modifications, resilient algorithms. Proceedings of ICIAM, Beijing, 2015, arXiv preprint arXiv:1511.05759.


Textbook Multigrid Efficiency (TME)

"Textbook multigrid efficiency means solving a discrete PDE problem with a computational effort that is only a small (less than 10) multiple of the operation count associated with the discretized equations itself." [Brandt, 98]

- work unit (WU) = single elementary relaxation
- classical algorithmic TME factor: ops for solution / ops for work unit
- new parallel TME factor to quantify algorithmic efficiency combined with scalability


New TME paradigm as a parallel performance metric:
- analyse the cost of an elementary relaxation: micro-kernel benchmarks; the aggregate performance defines a WU (see the sketch below)
- measure parallel solver performance
- compute the TME factor
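A toy micro-benchmark in the spirit of the first step (a sketch, not the benchmark used in the talk): measure how many elementary Jacobi-type relaxation updates one core performs per second; that rate plays the role of μ_sm in the parallel TME definition on the next slide. Grid size and sweep count are arbitrary.

```python
import time
import numpy as np

def measure_mu_sm(n=256, sweeps=20):
    """Crude estimate of mu_sm: elementary relaxation updates per second on one
    core (a Jacobi sweep of a 7-point stencil on an n^3 grid, vectorized via numpy)."""
    u = np.random.rand(n, n, n)
    f = np.zeros_like(u)
    t0 = time.perf_counter()
    for _ in range(sweeps):
        u[1:-1, 1:-1, 1:-1] = (f[1:-1, 1:-1, 1:-1]
                               + u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1]
                               + u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1]
                               + u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]) / 6.0
    elapsed = time.perf_counter() - t0
    updates = sweeps * (n - 2) ** 3
    return updates / elapsed

print(f"mu_sm ~ {measure_mu_sm():.2e} relaxation updates / second")
```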


Parallel TME

  µ_sm     number of elementary relaxation steps per single core per second
  U        number of cores
  U·µ_sm   aggregate peak relaxation performance

Idealized time for a work unit of N unknowns on U cores:

    T_WU(N, U) = N / (U · µ_sm)

With T(N, U) the time to solution for N unknowns on U cores, the parallel textbook efficiency factor is

    E_ParTME(N, U) = T(N, U) / T_WU(N, U) = T(N, U) · U · µ_sm / N

It combines algorithmic and implementation efficiency.
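Plugging the definition into code (a sketch; the numbers are placeholders for illustration, not measured values from the tables in this talk):

```python
def parallel_tme(T, N, U, mu_sm):
    """E_ParTME(N, U) = T(N, U) / T_WU(N, U)  with  T_WU = N / (U * mu_sm)."""
    T_WU = N / (U * mu_sm)       # idealized time for one work unit
    return T / T_WU

# illustrative placeholder numbers only
E = parallel_tme(T=100.0, N=2e11, U=16384, mu_sm=3e6)
print(f"parallel TME factor: {E:.1f}")
```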


TME and Parallel TME results

  Setting / Measure    E_TME   E_SerTME   E_NodeTME   E_ParTME1   E_ParTME2
  Grid points                  2·10^6     3·10^7      9·10^9      2·10^11
  Processor cores U            1          16          4 096       16 384
  (CC) - FMG(2,2)      6.5     15         22          26          22
  (VC) - FMG(2,2)      6.5     11         13          15          13
  (SF) - FMG(2,1)      31      64         100         118         -

Table 4: TME factors for different problem sizes and discretizations.

Three model problems: scalar with constant coefficients (CC), scalar with variable coefficients (VC), and Stokes solved via the Schur complement (SF) employing the pressure correction cycle. Full multigrid (FMG) with the number of iterations chosen such that asymptotic optimality is maintained; all cycles are kept as cheap as possible. E_TME = 6.5 (algorithmically, for the scalar cases); the new E_ParTME is around 20 for the scalar PDEs and ≥ 100 for Stokes. We observe that traversing the coarser grids of the multigrid hierarchy is less efficient than processing the large grids on the finest level.

Conclusion
- 10^13 DoF for FE are possible
- Multigrid scales to Peta (and beyond)
- HHG: a lean, matrix-free implementation
- Excellent time to solution
- Algorithmic resilience with multigrid
