What is the Largest Finite Element System That Can be Solved Today?

TERRA

What is the Largest Finite Element System That Can be Solved Today? U. Rüde (FAU Erlangen, [email protected]) joint work with B. Gmeiner, M. Huber, H. Stengel (FAU) C. Waluga, M. Huber, L. John, B. Wohlmuth (TUM) M. Mohr, S. Baumann, J. Weismüller, P. Bunge (LMU)

Lehrstuhl für Simulation FAU Erlangen-Nürnberg

International Conference On Preconditioning Techniques For Scientific And Industrial Applications 17-19 June 2015 Eindhoven, The Netherlands

What‘s the largest FE system solved?

- Uli Rüde

1

Optimal ⟺ Fast?
UR: New Mathematics for Extreme-scale Computational Science? SIAM News, Volume 48, Number 5, June 2015

Accuracy ⟺ Computational Cost
Higher order ⟺ Complexity of discrete system
Mesh: Unstructured ⟺ Structured
Adaptive refinement ⟺ Superconvergence
Efficiency on modern computer architectures
Algebraic Solver: Operation count ⟺ Parallel scalability
Cost metric: time to solution ⟺ energy consumption
Asymptotic estimates ⟺ quantified constants
Worst case ⟺ Average case
Bitwise reproducibility ⟺ Algorithmic Fault Tolerance

What‘s the largest FE system solved? - Uli Rüde 2

Two Multi-PetaFlops Supercomputers

JUQUEEN: Blue Gene/Q architecture, 458,752 PowerPC A2 cores, 16 cores (1.6 GHz) per node, 16 GiB RAM per node, 5D torus interconnect, 5.8 PFlops peak, TOP 500: #8

SuperMUC: Intel Xeon architecture, 147,456 cores, 16 cores (2.7 GHz) per node, 32 GiB RAM per node, pruned tree interconnect, 3.2 PFlops peak, TOP 500: #14

Ulrich Rüde

What is the problem? Exascale Computing

Designing Algorithms and Large Scale Simulation Software

Would you want to propel a Superjumbo with four strong jet engines (moderately parallel computing) or with 1,000,000 blow dryer fans (massively parallel multicore systems)?

SuperMUC: 3 PFlops

Coupled Physical Models at Extreme Scale - Ulrich Rüde 4

Multigrid: Algorithms for 10^12 unknowns

Goal: solve $A^h u^h = f^h$ using a hierarchy of grids

Relax on $A^h u^h = f^h$
Residual: $r^h = f^h - A^h u^h$
Restrict: $r^H = I_h^H r^h$
Solve $A^H e^H = r^H$ by recursion
Interpolate: $e^h = I_H^h e^H$
Correct: $u^h \leftarrow u^h + e^h$

Multigrid uses coarse grids to accomplish the inevitable global data exchange in the most efficient way possible.
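A minimal sketch of this coarse-grid correction cycle, written for a 1D Poisson model problem rather than for HHG's tetrahedral hierarchy (the function names, grid sizes, and the direct coarsest-grid solve are illustrative assumptions, not HHG code):

#include <stdio.h>
#include <stdlib.h>

/* Gauss-Seidel relaxation for -u'' = f with homogeneous Dirichlet BC */
static void smooth(double *u, const double *f, int n, double h, int sweeps) {
    for (int s = 0; s < sweeps; ++s)
        for (int i = 1; i < n - 1; ++i)
            u[i] = 0.5 * (u[i-1] + u[i+1] + h*h*f[i]);
}

/* residual r = f - A u */
static void residual(double *r, const double *u, const double *f, int n, double h) {
    r[0] = r[n-1] = 0.0;
    for (int i = 1; i < n - 1; ++i)
        r[i] = f[i] - (2.0*u[i] - u[i-1] - u[i+1]) / (h*h);
}

/* full weighting restriction: coarse point i sits at fine point 2i */
static void restrict_fw(double *rc, const double *rf, int nc) {
    rc[0] = rc[nc-1] = 0.0;
    for (int i = 1; i < nc - 1; ++i)
        rc[i] = 0.25*rf[2*i-1] + 0.5*rf[2*i] + 0.25*rf[2*i+1];
}

/* linear interpolation of the coarse correction, added to the fine iterate */
static void prolong_add(double *uf, const double *ec, int nc) {
    for (int i = 0; i < nc - 1; ++i) {
        uf[2*i]   += ec[i];
        uf[2*i+1] += 0.5*(ec[i] + ec[i+1]);
    }
    uf[2*(nc-1)] += ec[nc-1];
}

static void vcycle(double *u, const double *f, int n, double h) {
    if (n <= 3) {                     /* coarsest grid: solve directly */
        u[1] = 0.5 * h*h * f[1];
        return;
    }
    smooth(u, f, n, h, 2);            /* pre-smoothing */
    double *r  = calloc(n, sizeof *r);
    residual(r, u, f, n, h);
    int nc = (n + 1) / 2;
    double *rc = calloc(nc, sizeof *rc);
    double *ec = calloc(nc, sizeof *ec);
    restrict_fw(rc, r, nc);           /* restrict residual to the coarse grid */
    vcycle(ec, rc, nc, 2.0*h);        /* solve the coarse error equation by recursion */
    prolong_add(u, ec, nc);           /* interpolate and add the correction */
    smooth(u, f, n, h, 2);            /* post-smoothing */
    free(r); free(rc); free(ec);
}

int main(void) {
    int n = 129;                      /* 2^7 + 1 grid points */
    double h = 1.0 / (n - 1);
    double *u = calloc(n, sizeof *u), *f = malloc(n * sizeof *f);
    for (int i = 0; i < n; ++i) f[i] = 1.0;   /* constant right-hand side */
    for (int k = 0; k < 10; ++k) vcycle(u, f, n, h);
    printf("u(0.5) = %f\n", u[n/2]);  /* exact value is 0.125 */
    free(u); free(f);
    return 0;
}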



What‘s the largest FE system solved?

- Uli Rüde

5

Hierarchical Hybrid Grids (HHG) B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: ISC Award 2006: „Is 1.7× 1010 unknowns the largest finite element system that can be solved today?“, SuperComputing, 2005.

Parallelize „plain vanilla“ multigrid for tetrahedral finite elements:
• partition domain
• parallelize all operations on all grids
• use clever data structures
• matrix-free implementation (see the sketch below)

Do not worry (so much) about Coarse Grids

Bey‘s Tetrahedral Refinement

idle processors? short messages? sequential dependency in grid hierarchy?

Elliptic problems always require global communication. This cannot be accomplished by local relaxation, Krylov space acceleration, or domain decomposition without a coarse grid.
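The "matrix-free implementation" bullet above means that the discrete operator is applied stencil-wise on the structured interior of each refined primitive instead of assembling a global sparse matrix. A generic sketch for a 7-point Laplacian on an n³ block (HHG itself works with 15-point stencils on Bey-refined tetrahedra; all names and sizes here are illustrative, not HHG code):

#include <stdio.h>
#include <stdlib.h>

/* matrix-free application of the 7-point Laplacian on the interior of an
   n*n*n block; boundary entries of v are left untouched */
static void apply_laplacian(const double *u, double *v, int n, double h) {
    const double ih2 = 1.0 / (h * h);
    for (int k = 1; k < n - 1; ++k)
        for (int j = 1; j < n - 1; ++j)
            for (int i = 1; i < n - 1; ++i) {
                int idx = (k * n + j) * n + i;
                v[idx] = ih2 * (6.0 * u[idx]
                                - u[idx - 1]     - u[idx + 1]
                                - u[idx - n]     - u[idx + n]
                                - u[idx - n * n] - u[idx + n * n]);
            }
}

int main(void) {
    int n = 64;
    double h = 1.0 / (n - 1);
    double *u = malloc((size_t)n * n * n * sizeof *u);
    double *v = calloc((size_t)n * n * n, sizeof *v);
    for (int idx = 0; idx < n * n * n; ++idx) u[idx] = 1.0;  /* constant field */
    apply_laplacian(u, v, n, h);
    /* the Laplacian of a constant field vanishes in the interior */
    printf("v at center = %g\n", v[(n/2 * n + n/2) * n + n/2]);
    free(u); free(v);
    return 0;
}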

What‘s the largest FE system solved?

- Uli Rüde

6

HHG Data Structures

Geometrical Hierarchy: Volume, Face, Edge, Vertex

What‘s the largest FE system solved?

- Uli Rüde

7

Copying to update ghost points between the primitives (Vertex, Edge, Interior)

[Figure: node numbering of an edge primitive and its neighbouring elements, the linearization & memory representation of these nodes, and the MPI message used to update the ghost points]
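The ghost-point update can be pictured with a generic MPI halo exchange between neighbouring subdomains; this is not HHG's communication layer, and NLOC as well as the 1D strip decomposition are invented for the example:

#include <mpi.h>
#include <stdio.h>

#define NLOC 8                     /* owned points per rank (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NLOC+1] are ghost entries, u[1..NLOC] are owned */
    double u[NLOC + 2];
    for (int i = 0; i <= NLOC + 1; ++i)
        u[i] = rank;               /* fill owned values with the rank id */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send my first owned value to the left neighbour's right ghost,
       receive my right ghost from the right neighbour's first owned value */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* and the other direction */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost = %g, right ghost = %g\n", rank, u[0], u[NLOC + 1]);
    MPI_Finalize();
    return 0;
}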

What‘s the largest FE system solved?

- Uli Rüde

8

HHG Parallel Grid Traversal

What‘s the largest FE system solved?

- Uli Rüde

9

Parallel Efficiencies of HHG on Different Clusters

Gmeiner, Köstler, Stürmer, UR: Parallel multigrid on hierarchical hybrid grids: a performance study on current high performance computing clusters, Concurrency and Computation: Practice and Experience, 2014

[Figure: parallel efficiency of HHG (0.5 to 1.0) versus problem size (1.0E+07 to 1.0E+13) on JUGENE, JUQUEEN (1 node card, 1 midplane, 131 072 / 262 144 / 393 216 cores, hybrid parallel), and SuperMUC (1 island, hybrid parallel)]

What‘s the largest FE system solved? - Uli Rüde 10

Application to Earth Mantle Convection Models

What‘s the largest FE system solved?

- Uli Rüde

11

Mantle Convection

TERRA

Why Mantle Convection? • driving force for plate tectonics • mountain building and earthquakes

Why Exascale? • mantle has 10^12 km³ • inversion and UQ blow up cost

Why TerraNeo? • ultra-scalable and fast • sustainable framework

Challenges
• computer sciences: software design for future exascale systems
• mathematics: HPC performance oriented metrics
• geophysics: model complexity and uncertainty
• bridging disciplines: integrated co-design

Peter Bunge

Ulrich Rüde

Barbara Wohlmuth

12

HHG Solver for Stokes System

Motivated by the Earth Mantle convection problem: Boussinesq model for mantle convection problems, derived from the equations for balance of forces, conservation of mass and energy.

Gmeiner, Waluga, Stengel, Wohlmuth, UR: Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SISC, 2015.

$-\nabla\cdot(2\eta\,\epsilon(u)) + \nabla p = \rho(T)\,g, \qquad \nabla\cdot u = 0, \qquad \frac{\partial T}{\partial t} + u\cdot\nabla T - \nabla\cdot(\kappa\nabla T) = \gamma,$

with $\epsilon(u) = \tfrac12(\nabla u + (\nabla u)^T)$ and
u: velocity, p: dynamic pressure, T: temperature, η: viscosity of the material, ε(u): strain rate tensor, ρ: density, κ, γ, g: thermal conductivity, heat sources, gravity vector.

Solution of the Stokes equations

Scale up to ~10^12 nodes/DOFs: resolve the whole Earth mantle globally with 1 km resolution

What‘s the largest FE system solved?

- Uli Rüde

13

The Curse of Size

TERRA

Energy estimate per computer generation (1 nJoule × N² for all-to-all communication) vs. the TerraNeo prototype:

generation               desired problem size N (DoF)   energy estimate                                    TerraNeo prototype
gigascale (10^9 FLOPS)   10^6                           0.278 Wh (10 min of LED light)                     0.13 Wh
terascale (10^12 FLOPS)  10^9                           278 kWh (2 weeks blow drying hair)                 0.03 kWh
petascale (10^15 FLOPS)  10^12                          278 GWh (2 months electricity for Munich)          27 kWh
exascale (10^18 FLOPS)   10^15                          278 PWh (100 years world electricity production)   ?
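A quick arithmetic check of the terascale column of the estimate (my calculation, using the $1\,\mathrm{nJ}\cdot N^2$ model stated above):

$E(N) = 1\,\mathrm{nJ}\cdot N^2, \qquad E(10^9) = 10^{-9}\,\mathrm{J}\cdot 10^{18} = 10^{9}\,\mathrm{J} = \frac{10^{9}}{3.6\times 10^{6}}\,\mathrm{kWh} \approx 278\,\mathrm{kWh}.$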

At extreme scale: optimal complexity is a must!

Peter Bunge

Ulrich Rüde

Barbara Wohlmuth

14

Why Co-Design?

TERRA

No exascale simulation without maximal efficiency on all levels:
• model formulation - discretisation - solution algorithm
• core - node - cluster

TERRA NEO prototype — weak scaling from a standard PC over a department server to a petascale cluster:

DOF         Resolution   JUQUEEN (Rank 8, TOP 500, Nov 2014)   SuperMUC (Rank 14, TOP 500, Nov 2014)
                         Nodes     Threads    Time             Nodes     Threads    Time
8.4·10^7    32 km        1         30         30 s             1         4          16 s
6.4·10^8    16 km        4         240        38 s             2         30         20 s
5.2·10^9    8 km         30        1 920      40 s             15        240        24 s
4.4·10^10   4 km         240       15 360     44 s             120       1 920      27 s
3.4·10^11   2 km         1 920     122 880    48 s             960       15 360     34 s
2.8·10^12   1 km         15 360    983 040    54 s             7 680     122 880    41 s

High-Q Club Highest Scaling Codes on JuQUEEN

Peter Bunge

Ulrich Rüde

Barbara Wohlmuth

15

Coupled Flow-Transport Problem

… nonlinearities and inhomogeneities in the material. In dimensionless form, we obtain:

$-\Delta u + \nabla p = \mathrm{Ra}\, T\, \hat r, \qquad \operatorname{div} u = 0, \qquad \partial_t T + u\cdot\nabla T = \mathrm{Pe}^{-1}\,\Delta T$

Notation: u: velocity, T: temperature, p: pressure, $\hat r$: radial unit vector

Dimensionless numbers:
• Rayleigh number (Ra): determines the presence and strength of convection
• Peclet number (Pe): ratio of convective to diffusive transport

The mantle convection problem is defined by the model equations in the interior of the spherical shell as well as suitable initial/boundary conditions.

• 6.5×10^9 DoF
• 10 000 time steps
• run time 7 days
• Mid-size cluster: 288 compute cores in 9 nodes of LSS at FAU

Peter Bunge

Ulrich Rüde

Barbara Wohlmuth

16

Comparison of Stokes Solvers

Three solvers:
• Schur complement CG (SCG, pressure correction) with MG for the Vector-Laplace
• MINRES with MG preconditioner
• MG for the saddle point problem with Uzawa type smoother

                               SCG                   MINRES                Uzawa
Nodes    Threads   DoFs        iter      time        iter      time        iter     time
1        24        6.6·10^7    28 (3)    137.26 s    65 (3)    116.35 s    7 (1)    65.71 s
6        192       5.3·10^8    26 (5)    135.85 s    64 (7)    133.12 s    7 (3)    80.42 s
48       1 536     4.3·10^9    26 (15)   139.64 s    62 (15)   133.29 s    7 (5)    85.28 s
384      12 288    3.4·10^10   26 (30)   142.92 s    62 (35)   138.26 s    7 (10)   93.19 s
3 072    98 304    2.7·10^11   26 (60)   154.73 s    62 (71)   156.86 s    7 (20)   114.36 s
24 576   786 432   2.2·10^12   28 (120)  187.42 s    64 (144)  184.22 s    8 (40)   177.69 s

Table 5. Weak scaling results: time to solution for a residual reduction by 10^-8, including the coarse grid solver (coarse grid iterations in parentheses); without the coarse grid solver the differences shrink.

                               SCG                MINRES             Uzawa
Nodes    Threads   DoFs        iter   time        iter   time        iter   time
1        24        6.6·10^7    28     136.30 s    65     115.05 s    7      58.80 s
6        192       5.3·10^8    26     134.26 s    64     130.56 s    7      64.40 s
48       1 536     4.3·10^9    26     135.06 s    62     128.34 s    7      65.03 s
384      12 288    3.4·10^10   26     135.41 s    62     128.34 s    7      64.96 s
3 072    98 304    2.7·10^11   26     139.55 s    62     133.30 s    7      66.08 s
24 576   786 432   2.2·10^12   28     154.06 s    64     139.52 s    8      78.24 s

Table 6. Weak scaling results without coarse grid time.

What‘s the largest FE system solved? - Uli Rüde 17
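For orientation, the saddle point structure that all three solvers address can be written in standard notation (textbook notation, not taken verbatim from the cited paper):

$\begin{pmatrix} A & B^T \\ B & 0 \end{pmatrix}\begin{pmatrix} u \\ p \end{pmatrix} = \begin{pmatrix} f \\ 0 \end{pmatrix}, \qquad S = B A^{-1} B^T.$

In this notation the SCG variant runs CG on the pressure Schur complement $S$ (pressure correction) with multigrid for the velocity block $A$, MINRES iterates on the full saddle point matrix with a multigrid preconditioner, and the Uzawa variant applies multigrid to the saddle point system itself with an Uzawa-type smoother.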

Towards a holistic performance engineering methodology:

Parallel Textbook Multigrid Efficiency Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015

What‘s the largest FE system solved?

- Uli Rüde

18

Textbook Multigrid Efficiency (TME)

„Textbook multigrid efficiency means solving a discrete PDE problem with a computational effort that is only a small (less than 10) multiple of the operation count associated with the discretized equations itself.“ [Brandt, 98]

• work unit (WU) = single elementary relaxation
• classical algorithmic TME-factor: ops for solution / ops for work unit
• new parallel TME-factor to quantify algorithmic efficiency combined with implementation scalability

What‘s the largest FE system solved?

- Uli Rüde

19

Parallel TME

$\mu_{sm}$: # of elementary relaxation steps on a single core per second
$U$: # of cores
$U \mu_{sm}$: aggregate relaxation performance

Idealized time for a work unit: $T_{WU}(N, U) = \dfrac{N}{U \mu_{sm}}$

$T(N, U)$: time to solution for N unknowns on U cores

Parallel textbook efficiency factor:

$E_{ParTME}(N, U) = \dfrac{T(N, U)}{T_{WU}(N, U)} = \dfrac{T(N, U)\, U \mu_{sm}}{N}$

combines algorithmic and implementation efficiency.

What‘s the largest FE system solved? - Uli Rüde 20
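A small helper, only to make the formula concrete; the inputs are hypothetical (μ_sm uses the 176 M updates/s estimate from the smoother analysis later in the talk, the other numbers are invented):

#include <stdio.h>

/* parallel TME factor: E_ParTME = T(N,U) / T_WU(N,U), T_WU = N / (U * mu_sm) */
static double parallel_tme(double t_solve, double n_unknowns,
                           double n_cores, double mu_sm) {
    double t_wu = n_unknowns / (n_cores * mu_sm);  /* idealized work-unit time */
    return t_solve / t_wu;
}

int main(void) {
    /* hypothetical time to solution of 0.3 s for 9e9 unknowns on 4096 cores */
    double E = parallel_tme(0.3, 9.0e9, 4096.0, 1.76e8);
    printf("E_ParTME = %.1f work units\n", E);
    return 0;
}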

Algorithmic Performance: Residual reduction per work unit (WU)

[Figure: Euclidean residual norm (1.0E+00 down to 1.0E-09) versus work units (0 to 140) for V(1,1)-, V(3,3)-, V(3,3)(SOR)- and W(3,3)(SOR)-cycles]

Residual for the scalar constant coefficient case (CC) on a resolution of 8.6·10^9 grid points on 4 096 cores for different multigrid cycles.

What‘s the largest FE system solved?

- Uli Rüde

21

Analysing TME Efficiency: RB-GS Smoother

/* innermost loop of the red-black Gauss-Seidel sweep for the 15-point
   stencil; mp_*, tp_*, bp_* are the row offsets in the middle, top and
   bottom planes around the current point (e.g. mp_mr = middle plane, middle row) */
for (int i = 1; i < (tsize-j-k-1); i = i+2) {
  u[mp_mr+i] = c[0] * ( - c[1] *u[mp_mr+i+1] - c[2] *u[mp_tr+i-1]
                        - c[3] *u[mp_tr+i]   - c[4] *u[tp_br+i]
                        - c[5] *u[tp_br+i+1] - c[6] *u[tp_mr+i-1]
                        - c[7] *u[tp_mr+i]   - c[8] *u[bp_mr+i]
                        - c[9] *u[bp_mr+i+1] - c[10]*u[bp_tr+i-1]
                        - c[11]*u[bp_tr+i]   - c[12]*u[mp_br+i]
                        - c[13]*u[mp_br+i+1] - c[14]*u[mp_mr+i-1]
                        + f[mp_mr+i] );
}

This loop should execute on a single SuperMUC core at 720 M updates/sec (in theory: peak performance).
μ_sm = 176 M updates/sec (in practice: memory access bottleneck; RB ordering prohibits vector loads).
Thus the whole SuperMUC should perform U μ_sm = 147 456 × 176 M ≈ 26 T updates/sec.

What‘s the largest FE system solved?

- Uli Rüde

22
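The aggregate number follows from a one-line calculation (my arithmetic, using the figures above):

$U\,\mu_{sm} = 147\,456 \times 1.76\times 10^{8}\ \mathrm{updates/s} \approx 2.6\times 10^{13}\ \mathrm{updates/s} \approx 26\ \mathrm{T\ updates/s}.$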

Execution-Cache-Memory Model (ECM)

Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, accepted for publication in Journal of Numerical Mathematics: Theory, Methods and Applications, 2015
Hager et al.: Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience, 2013.

Björn Gmeiner

[Figure 5: ECM model for the 15-point stencil with variable coefficients (left) and constant coefficients (right) on an SNB core. An arrow indicates a 64 Byte cache line transfer. Run-times represent 8 elementary updates.]

What‘s the largest FE system solved? - Uli Rüde 23

ECM: single-chip prediction vs. measurement

[Figure 6: SNB single-chip performance scaling of the stencil smoothers on 256³ grid points — performance in MLups/s (0 to 1600) versus # cores (1 to 8), for the constant and variable coefficient cases; measured data and ECM prediction ranges are shown.]

… the above discussion demonstrates that, different from conventional wisdom, memory bandwidth is not the critical resource …

What‘s the largest FE system solved? - Uli Rüde 24

Björn Gmeiner

TME and Parallel TME results

Setting/Measure      E_TME   E_SerTME   E_NodeTME   E_ParTME1   E_ParTME2
Grid points                  2·10^6     3·10^7      9·10^9      2·10^11
Processor cores U            1          16          4 096       16 384
(CC) - FMG(2,2)      6.5     15         22          26          22
(VC) - FMG(2,2)      6.5     11         13          15          13
(SF) - FMG(2,1)      31      64         100         118         -

Table 4: TME factors for different problem sizes and discretizations.

• Three model problems: scalar constant coefficients (CC), scalar variable coefficients (VC), Stokes (SF) solved via Schur complement with pressure correction
• Full multigrid with #iterations chosen such that asymptotic optimality is maintained
• E_TME = 6.5 (algorithmically, for the scalar cases)
• E_ParTME around 20 for the scalar cases, and ≥ 100 for Stokes
• Traversing the coarser grids of the multigrid hierarchy is less efficient than processing the finest level, because of communication overhead and the deteriorating surface-to-volume ratio

What‘s the largest FE system solved? - Uli Rüde 25

TME Scalability as Guideline for Performance Engineering

Quantify the cost of an elementary relaxation:
• through systematic micro-kernel benchmarks
• optimize kernel performance, justify any performance degradation
• compute idealized aggregate performance for the large scale parallel system

Evaluate parallel solver performance:
• quantify the TM-scalability factor as overall measure of efficiency
• analyse discrepancies
• identify possible improvements
• improve the TM-scalability factor

What‘s the largest FE system solved?

- Uli Rüde

26

Exa-Scale Topic: Resilience

Fault Tolerance in Multigrid Solvers

Huber, M., Gmeiner, B., UR, Wohlmuth, B.: Resilience for Exascale Enabled Multigrid Methods, Copper Mountain Conference on Multigrid Methods, 2015

What‘s the largest FE system solved?

- Uli Rüde

27

Rationale for Resilience

With increasing complexity of systems and further miniaturization, faults will become more frequent.

Classification of faults:
• fail-stop
• silent error
Correction (e.g. ECC) leads to slowdown.

Traditional techniques:
• resilience through redundancy costs silicon and energy
• checkpoint restart too slow

Algorithm Based Fault Tolerance (ABFT)

What‘s the largest FE system solved?

- Uli Rüde

28

Model Scenario

$-\Delta u = f$ in $\Omega$, + BC

[Figure: residual (1.00E+00 down to 1.00E-16) versus iterations (0 to 20) for the runs with and without a fault, with the FAULT marked and the residual after the fault annotated]

What‘s the largest FE system solved?

- Uli Rüde

29

Iteration in Progress

What‘s the largest FE system solved?

- Uli Rüde

30

One Processor Faults

What‘s the largest FE system solved?

- Uli Rüde

31

... and pulls others down

What‘s the largest FE system solved?

- Uli Rüde

32

larger and larger parts are effected

What‘s the largest FE system solved?

- Uli Rüde

33

Disaster ....

What‘s the largest FE system solved?

- Uli Rüde

34

.... strikes!

What‘s the largest FE system solved?

- Uli Rüde

35

Alternative: Processor Faults

What‘s the largest FE system solved?

- Uli Rüde

36

Safety bolts are installed

What‘s the largest FE system solved?

- Uli Rüde

37

Superman is called for help

What‘s the largest FE system solved?

- Uli Rüde

38

Safety bolts removed

What‘s the largest FE system solved?

- Uli Rüde

39

Mission accomplished

What‘s the largest FE system solved?

- Uli Rüde

40

Superman‘s Task: Recover the solution in the faulty subdomain

Assumptions:
• fault detectable (not yet with current HW and SW)
• static data (rhs and stiffness matrix) are available

Observation: Dirichlet data for the faulty subdomain are available (redundancy of the HHG data structures for faces)

Task: Solve the local Dirichlet problem approximately by local multigrid iteration
• V-cycles
• W-cycles
• F-cycles

What‘s the largest FE system solved?

- Uli Rüde

41

Visualization of Residual Residual after fault

Residual after global V-cycle

Residual after recovery with local F-cycle

Residual after further global V-cycle

What‘s the largest FE system solved?

- Uli Rüde

42

… boundary data $u_\Gamma$ on $\Gamma$. Before continuing with the global multigrid iteration for solving (2.2), we approximate the auxiliary local problem by $n_F$ steps of a given iterative solver, see Alg. 1.

Local Recovery Strategy

Algorithm 1 Local recovery (LR) algorithm
1: Solve (2.2) by multigrid cycles.
2: if Fault has occurred then
3:   STOP solving (2.2).
4:   Recover the boundary data $u_{\Gamma_F}$ from line 4 in (3.1).
5:   Initialize $u_F$ with zero.
6:   Approximate line 5 in (3.1) by solving iteratively with $n_F$ steps:
7:     $A_{FF} u_F = f_F - A_{F\Gamma_F} u_{\Gamma_F}$
8:   RETURN to line 1 with new values $u_F$ in $\Omega_F$.
9: end if

[Block system (3.1), coupling the unknowns $u_I$ in the healthy domain $\Omega_I$, the interface values $u_{\Gamma_I}$, $u_{\Gamma_F}$, and the unknowns $u_F$ in the faulty domain $\Omega_F$]

Obviously, this algorithm is only a first step towards a fault tolerant multigrid solver and suffers from the fact that during the approximation of the local problem on $\Omega_F$, the processors assigned to the healthy domain $\Omega_I$ will wait idle. Since in practice $\Omega_F$ is small compared to $\Omega_I$, this will naturally result in many idle processors during the recovery. This performance loss will be addressed in Sec. 5. Nevertheless the local strategies will already provide valuable insight in the selection of the iterative solver for the recovery subproblem and in the choice of $n_F$.

4.1. Influence of the algorithmic parameters on the performance. To illustrate the method, we consider again the situation of Subsec. 2.2. We study five different iterative solvers: local V-cycles, local W-cycles, local F-cycles, each with …

What‘s the largest FE system solved? - Uli Rüde 43
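To make the LR idea concrete, here is a toy C illustration on a 1D Poisson problem, with plain Jacobi sweeps standing in for the multigrid cycles of Algorithm 1; the grid size, the fault location, and the number of recovery sweeps are invented for the example:

#include <stdio.h>
#include <string.h>

#define N 257                      /* grid points, u[0] = u[N-1] = 0 */

/* Jacobi sweeps on the index range [lo, hi); values outside the range
   act as (Dirichlet) boundary data for this range */
static void jacobi(double *u, const double *f, double h,
                   int lo, int hi, int sweeps) {
    static double tmp[N];
    for (int s = 0; s < sweeps; ++s) {
        memcpy(tmp, u, sizeof tmp);
        for (int i = lo; i < hi; ++i)
            u[i] = 0.5 * (tmp[i-1] + tmp[i+1] + h*h*f[i]);
    }
}

/* squared Euclidean norm of the residual of -u'' = f */
static double resid_norm(const double *u, const double *f, double h) {
    double s = 0.0;
    for (int i = 1; i < N - 1; ++i) {
        double r = f[i] - (2.0*u[i] - u[i-1] - u[i+1]) / (h*h);
        s += r * r;
    }
    return s;
}

int main(void) {
    double u[N] = {0}, f[N], h = 1.0 / (N - 1);
    for (int i = 0; i < N; ++i) f[i] = 1.0;

    int flo = N/2, fhi = flo + N/8;          /* "faulty" subdomain Omega_F */

    for (int k = 0; k < 400; ++k) {
        jacobi(u, f, h, 1, N - 1, 1);        /* global iteration */
        if (k == 200) {                      /* fault: local data are lost */
            for (int i = flo; i < fhi; ++i) u[i] = 0.0;
            printf("fault injected, residual^2 = %e\n", resid_norm(u, f, h));
            /* local recovery: iterate only in Omega_F, using the intact
               neighbouring values as Dirichlet data (n_F steps) */
            jacobi(u, f, h, flo, fhi, 500);
            printf("after local recovery, residual^2 = %e\n", resid_norm(u, f, h));
        }
    }
    printf("final residual^2 = %e\n", resid_norm(u, f, h));
    return 0;
}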

Quantitative Analysis (local strategy)

[Figure: relative residual (10^0 down to 10^-16) versus iterations (0 to 20), comparing the fault-free run with FAULT & RECOVERY by 2× V-cycle, 2× W-cycle, 2× F-cycle, 15× smoothing, 2000× smoothing, and 100× PCG]

Iterations What‘s the largest FE system solved?

- Uli Rüde

44

Fig. 4.1. Convergence of the relative residual for different …

Alternative Recovery Strategy: Asynchronous Global Iteration

Observation: During local recovery all healthy domain processors stay idle. Therefore:

Temporary decoupling of faulty and healthy domains. Asynchronously do:
• Iterate in the healthy domain: Dirichlet or Neumann BC on the interface to the faulty domain, V-cycles
• Iterate in the faulty domain: Dirichlet BC on the interface to the healthy domain, V-cycles

Re-couple faulty and healthy domains, perform global V-cycles

What‘s the largest FE system solved?

- Uli Rüde

45

Global Strategy

Algorithm 3 Dirichlet-Neumann recovery (DN) algorithm
1: Solve (2.2) by multigrid cycles.
2: if Fault has occurred then
3:   STOP solving (2.2).
4:   Compute the Neumann condition from line 2 in (3.2):
5:     $\lambda_{\Gamma_I}$, assembled from $A_{\Gamma_I I} u_I$ and $A_{\Gamma_I \Gamma_I} u_{\Gamma_I}$
6:   Initialize $u_F$ with zero.
7:   In parallel do:
8:   a) Perform $n_F$ cycles with acceleration $\eta_{super}$:
9:      Recover/update boundary values $u_{\Gamma_F}$ from line 5 in (3.2).
10:     Use 1 MG cycle to approximate line 6 in (3.2):
11:       $A_{FF} u_F = f_F - A_{F\Gamma_F} u_{\Gamma_F}$
12:  b) Perform $n_I$ cycles:
13:     Use 1 MG cycle to approximate lines 1+2 in (3.2):
14:       the coupled system for $(u_I, u_{\Gamma_I})$ with right-hand side $(f_I, \lambda_{\Gamma_I})$
15:     Update boundary values $u_\Gamma$ from line 3 in (3.2).
16:   RETURN to line 1 with values $u_I$ in $\Omega_I$, $u_\Gamma$ on $\Gamma$ and $u_F$ in $\Omega_F$.
17: end if

[Block system (3.2): the extended system coupling the healthy-domain unknowns $u_I$, the interface values, and the faulty-domain unknowns $u_F$]

Fig. 5.1. From left to right: Scenario (I), Scenario (II) and Scenario (III) computations. Scenario (I) has 2 million unknowns and the fault (red marked tetrahedron) is located with two faces at the boundary of the computational domain and affects about 16.7% of the total unknowns. Scenario (II) has 16 million unknowns, a corruption by the fault of 2.0% (like in Subsec. 4.1), and the faulty domain has one edge coinciding with the boundary. Scenario (III) has 56 million unknowns, the …

What‘s the largest FE system solved? - Uli Rüde 46

Additional time spans (in seconds) of different global recovery strategies during weak scaling, with a fault after $k_F = 7$ iterations. Time to solution in weak scaling: 7 multigrid cycles, fault, and global superman recovery (DD & DN).

DD / DN Strategy, n_I = 1
Size     No Rec   DD: η_super=1   2       4       8       DN: η_super=1   2       4       8
769³     13.00    12.97           10.30    2.66   -0.02   12.98           10.30    2.66    0.02
1 281³   13.05    12.79           10.37    2.52    0.28   12.78           10.40    2.53    0.23
2 305³   13.25    10.34            7.86    2.98    0.08   10.66            7.91    2.98    0.08
4 353³   10.89    10.81            5.50   -0.02    0.15   10.57            5.29   -0.21   -0.83

DD / DN Strategy, n_I = 2
Size     No Rec   DD: η_super=1   2       4       8       DN: η_super=1   2       4       8
769³     13.00    12.72            5.03   -0.10   -0.24   12.74            5.05   -0.09   -0.24
1 281³   13.05    12.52            5.00   -0.30   -0.03   12.53            5.01   -0.28   -0.03
2 305³   13.25    10.05            4.98    0.07   -0.22   10.11            5.03    0.07   -0.22
4 353³   10.89     7.81            2.52    2.28    2.49    7.63            2.33   -0.54   -0.35

DD / DN Strategy, n_I = 3
Size     No Rec   DD: η_super=1   2       4       8       DN: η_super=1   2       4       8
769³     13.00     9.95            2.28   -0.36   -0.48    9.99            2.31   -0.32   -0.47
1 281³   13.05     9.26            2.17   -0.56   -0.31    9.70            2.15   -0.54   -0.32
2 305³   13.25     9.78            2.15   -0.21   -0.52    9.81            2.15   -0.21   -0.51
4 353³   10.89     7.53            2.21    1.99    2.16    7.27           -0.65   -0.85   -0.68

DD / DN Strategy, n_I = 4
Size     No Rec   DD: η_super=1   2       4       8       DN: η_super=1   2       4       8
…

What‘s the largest FE system solved? - Uli Rüde 47

Conclusions and Outlook

• 2.8×10^12 unknowns in a fully implicit solve demonstrated on peta-scale machines
• But optimal algorithms are not necessarily fast
• Multigrid scales to peta-scale and beyond
• HHG: lean and mean implementation: excellent time to solution

Can we go to 10^13 DoF?

What‘s the largest FE system solved?

- Uli Rüde

48
