TERRA
What is the Largest Finite Element System That Can be Solved Today? U. Rüde (FAU Erlangen, [email protected])
Joint work with B. Gmeiner, M. Huber, H. Stengel (FAU); C. Waluga, M. Huber, L. John, B. Wohlmuth (TUM); M. Mohr, S. Baumann, J. Weismüller, P. Bunge (LMU)
Lehrstuhl für Simulation FAU Erlangen-Nürnberg
International Conference On Preconditioning Techniques For Scientific And Industrial Applications 17-19 June 2015 Eindhoven, The Netherlands
Optimal ⟺ Fast?
UR: New Mathematics for Extreme-scale Computational Science? SIAM News, Volume 48, Number 5, June 2015

Accuracy ⟺ Computational Cost
Higher order ⟺ Complexity of discrete system
Mesh: Unstructured ⟺ Structured
Adaptive refinement ⟺ Superconvergence
Efficiency on modern computer architectures
Algebraic Solver: Operation count ⟺ Parallel scalability
Cost metric: time to solution ⟺ energy consumption
Asymptotic estimates ⟺ quantified constants
Worst case ⟺ Average case
Bitwise reproducibility ⟺ Algorithmic Fault Tolerance
Two Multi-PetaFlops Supercomputers

JUQUEEN: Blue Gene/Q architecture; 458,752 PowerPC A2 cores; 16 cores (1.6 GHz) per node; 16 GiB RAM per node; 5D torus interconnect; 5.8 PFlops peak; TOP 500: #8

SuperMUC: Intel Xeon architecture; 147,456 cores; 16 cores (2.7 GHz) per node; 32 GiB RAM per node; pruned tree interconnect; 3.2 PFlops peak; TOP 500: #14

What is the problem? Exascale Computing
Designing Algorithms! Large Scale Simulation Software

Would you want to propel a Superjumbo with four strong jet engines (moderately parallel computing) or with 1,000,000 blow dryer fans (massively parallel multicore systems)?

SuperMUC: 3 PFlops
Multigrid: Algorithms for 10^12 unknowns

Goal: solve $A_h u_h = f_h$ using a hierarchy of grids

Relax on $A_h u_h = f_h$
Residual: $r_h = f_h - A_h u_h$
Restrict: $r_H = I_h^H r_h$
Solve: $A_H e_H = r_H$ (by recursion)
Interpolate: $e_h = I_H^h e_H$
Correct: $u_h \leftarrow u_h + e_h$

Multigrid uses coarse grids to accomplish the inevitable global data exchange in the most efficient way possible.
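To make the cycle structure above concrete, here is a minimal sketch of a recursive V-cycle for a 1D Poisson problem with a Gauss-Seidel smoother; it is an illustrative toy with made-up names, not the HHG implementation.

```c
#include <stdlib.h>

/* Toy V-cycle for -u'' = f on (0,1) with homogeneous Dirichlet boundaries.
 * n interior points (n = 2^k - 1), arrays of length n+2, h = 1/(n+1).
 * The operator A_h is the stencil (-1, 2, -1)/h^2, applied matrix-free. */

static void smooth(double *u, const double *f, int n, double h, int sweeps) {
    for (int s = 0; s < sweeps; ++s)                /* lexicographic Gauss-Seidel */
        for (int i = 1; i <= n; ++i)
            u[i] = 0.5 * (h * h * f[i] + u[i - 1] + u[i + 1]);
}

void vcycle(double *u, const double *f, int n, double h) {
    if (n <= 1) {                                   /* coarsest grid: solve directly */
        u[1] = 0.5 * h * h * f[1];
        return;
    }
    smooth(u, f, n, h, 2);                          /* pre-smoothing: relax on A_h u_h = f_h */

    int nc = (n - 1) / 2;                           /* next coarser grid */
    double *r  = calloc((size_t)n + 2,  sizeof *r);
    double *rc = calloc((size_t)nc + 2, sizeof *rc);
    double *ec = calloc((size_t)nc + 2, sizeof *ec);

    for (int i = 1; i <= n; ++i)                    /* residual r_h = f_h - A_h u_h */
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h);
    for (int i = 1; i <= nc; ++i)                   /* restrict by full weighting: r_H = I_h^H r_h */
        rc[i] = 0.25 * (r[2 * i - 1] + 2.0 * r[2 * i] + r[2 * i + 1]);

    vcycle(ec, rc, nc, 2.0 * h);                    /* solve A_H e_H = r_H by recursion */

    for (int i = 1; i <= nc; ++i) {                 /* interpolate e_h = I_H^h e_H and correct */
        u[2 * i]     += ec[i];
        u[2 * i - 1] += 0.5 * ec[i];
        u[2 * i + 1] += 0.5 * ec[i];
    }
    smooth(u, f, n, h, 2);                          /* post-smoothing */
    free(r); free(rc); free(ec);
}
```

Calling vcycle repeatedly from a zero initial guess reduces the residual by a grid-independent factor per cycle, which is what makes the hierarchy-based global data exchange so cheap.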
Hierarchical Hybrid Grids (HHG)
B. Bergen, F. Hülsemann, U. Rüde, G. Wellein: "Is 1.7×10^10 unknowns the largest finite element system that can be solved today?", SuperComputing 2005; ISC Award 2006.

Parallelize "plain vanilla" multigrid for tetrahedral finite elements:
• partition domain
• parallelize all operations on all grids
• use clever data structures
• matrix-free implementation

Bey's tetrahedral refinement

Do not worry (so much) about coarse grids:
• idle processors?
• short messages?
• sequential dependency in grid hierarchy?

Elliptic problems always require global communication. This cannot be accomplished by local relaxation, Krylov space acceleration, or domain decomposition without a coarse grid.
HHG Data Structures
Geometrical Hierarchy: Volume, Face, Edge, Vertex
(Figure: edge, vertex, and interior unknowns with their local numbering 0-14; linearization and memory representation of these unknowns; ghost points are updated either by copying or by an MPI message.)
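As an illustration of the ghost point update in the figure, the sketch below exchanges the linearized buffers of neighboring primitives either by a plain copy or by nonblocking MPI messages. The Neighbor struct and function names are assumptions made for this example, not the actual HHG data structures.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative ghost-point update for one grid level (not the HHG code).
 * Each neighboring primitive (face/edge/vertex) exposes a contiguous send
 * buffer and a ghost buffer of 'count' unknowns. */
typedef struct {
    int     rank;          /* owning MPI rank; equals myrank for local neighbors */
    int     count;         /* number of unknowns to exchange                     */
    double *send, *ghost;  /* linearized buffers                                 */
} Neighbor;

void update_ghosts(Neighbor *nb, int nnb, int myrank, MPI_Comm comm)
{
    MPI_Request *req = malloc(2 * (size_t)nnb * sizeof *req);
    int nreq = 0;

    for (int i = 0; i < nnb; ++i) {
        if (nb[i].rank == myrank) {
            /* neighbor lives in the same process: update ghosts by copying */
            memcpy(nb[i].ghost, nb[i].send, nb[i].count * sizeof(double));
        } else {
            /* remote neighbor: nonblocking receive and send of the ghost layer */
            MPI_Irecv(nb[i].ghost, nb[i].count, MPI_DOUBLE, nb[i].rank, 0, comm, &req[nreq++]);
            MPI_Isend(nb[i].send,  nb[i].count, MPI_DOUBLE, nb[i].rank, 0, comm, &req[nreq++]);
        }
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);   /* ghost layers are now consistent */
    free(req);
}
```

The point of the linearized memory representation in the figure is that each such update is a single contiguous message rather than many small ones.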
HHG Parallel Grid Traversal
Parallel Efficiency of HHG on Different Clusters
Gmeiner, Köstler, Stürmer, UR: Parallel multigrid on hierarchical hybrid grids: a performance study on current high performance computing clusters, Concurrency and Computation: Practice and Experience, 2014

(Figure: weak-scaling parallel efficiency of HHG versus problem size, 10^7 to 10^13 unknowns, on JUGENE, JUQUEEN, and SuperMUC; the y-axis ranges from 0.5 to 1.0. Marked configurations include 1 node card, 1 midplane, 1 island, and hybrid-parallel runs on 131 072, 262 144, and 393 216 cores.)
Application to Earth Mantle Convection Models
Mantle Convection
TERRA

Why Mantle Convection?
• driving force for plate tectonics
• mountain building and earthquakes

Why Exascale?
• the mantle has 10^12 km^3
• inversion and UQ blow up the cost

Why TerraNeo?
• ultra-scalable and fast
• sustainable framework

Challenges
• computer sciences: software design for future exascale systems
• mathematics: HPC performance oriented metrics
• geophysics: model complexity and uncertainty
• bridging disciplines: integrated co-design
HHG Solver for Stokes System

Motivated by the Earth mantle convection problem: Boussinesq model for mantle convection problems, derived from the equations for balance of forces, conservation of mass and energy.

Gmeiner, Waluga, Stengel, Wohlmuth, UR: Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SISC, 2015.

$-\nabla \cdot (2\eta\,\epsilon(u)) + \nabla p = \rho(T)\, g$
$\nabla \cdot u = 0$
$\partial T / \partial t + u \cdot \nabla T - \nabla \cdot (\kappa \nabla T) = \gamma$

Notation: $u$ velocity, $p$ dynamic pressure, $T$ temperature, $\eta$ viscosity of the material, $\epsilon(u) = \tfrac{1}{2}(\nabla u + (\nabla u)^T)$ strain rate tensor, $\rho$ density, $\kappa$, $\gamma$, $g$ thermal conductivity, heat sources, gravity vector.

Solution of the Stokes equations: scale up to ~10^12 nodes/DoFs, i.e. resolve the whole Earth mantle globally with 1 km resolution.
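A back-of-the-envelope check of that target size, using my own round numbers for the Earth's surface radius (about 6371 km) and the core-mantle boundary (about 3480 km), neither of which is quoted on the slide:

```latex
V_{\mathrm{mantle}} \;\approx\; \tfrac{4}{3}\pi\bigl(R_{\mathrm{surf}}^{3}-R_{\mathrm{CMB}}^{3}\bigr)
 \;\approx\; \tfrac{4}{3}\pi\bigl(6371^{3}-3480^{3}\bigr)\,\mathrm{km}^{3}
 \;\approx\; 9\times 10^{11}\,\mathrm{km}^{3}
```

At roughly one node per cubic kilometre this is consistent with the ~10^12 km^3 quoted earlier and with the ~10^12 DoFs targeted here.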
The Curse of Size
TERRA

Energy estimate for naive all-to-all communication: 1 nJ × N^2, where N is the desired problem size in DoF.

| computer generation | desired problem size N (DoF) | energy estimate, 1 nJ × N^2 | TerraNeo prototype |
| gigascale (10^9 FLOPS) | 10^6 | 0.278 Wh (10 min of LED light) | 0.13 Wh |
| terascale (10^12 FLOPS) | 10^9 | 278 kWh (2 weeks of blow drying hair) | 0.03 kWh |
| petascale (10^15 FLOPS) | 10^12 | 278 GWh (2 months of electricity for Munich) | 27 kWh |
| exascale (10^18 FLOPS) | 10^15 | 278 PWh (100 years of world electricity production) | ? |
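The energy column follows directly from the 1 nJ × N^2 model; as a worked check (my arithmetic, using 1 kWh = 3.6×10^6 J) for the tera- and petascale rows:

```latex
E(N) = 1\,\mathrm{nJ}\times N^{2}:\qquad
E(10^{9})  = 10^{18}\,\mathrm{nJ} = 10^{9}\,\mathrm{J}
           = \tfrac{10^{9}}{3.6\times 10^{6}}\,\mathrm{kWh} \approx 278\,\mathrm{kWh},\qquad
E(10^{12}) = 10^{24}\,\mathrm{nJ} = 10^{15}\,\mathrm{J} \approx 278\,\mathrm{GWh}.
```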
At extreme scale: optimal complexity is a must!
Why Co-Design?
TERRA

No exascale simulation without maximal efficiency on all levels:
• model formulation - discretisation - solution algorithm
• core - node - cluster

TERRA NEO prototype (slide annotations mark the machine class of the runs: standard PC, department server, petascale cluster).

DOF | Resolution | JUQUEEN (Rank 8, TOP 500 Nov 2014): Nodes / Threads / Time | SuperMUC (Rank 14, TOP 500 Nov 2014): Nodes / Threads / Time
8.4·10^7 | 32 km | 1 / 30 / 30 s | 1 / 4 / 16 s
6.4·10^8 | 16 km | 4 / 240 / 38 s | 2 / 30 / 20 s
5.2·10^9 | 8 km | 30 / 1 920 / 40 s | 15 / 240 / 24 s
4.4·10^10 | 4 km | 240 / 15 360 / 44 s | 120 / 1 920 / 27 s
3.4·10^11 | 2 km | 1 920 / 122 880 / 48 s | 960 / 15 360 / 34 s
2.8·10^12 | 1 km | 15 360 / 983 040 / 54 s | 7 680 / 122 880 / 41 s

High-Q Club: Highest Scaling Codes on JUQUEEN
Coupled Flow-Transport Problem
TERRA

The model includes nonlinearities and inhomogeneities in the material. In dimensionless form, we obtain:

$-\Delta u + \nabla p = \mathrm{Ra}\, T\, \hat{r}$
$\operatorname{div} u = 0$
$\partial_t T + u \cdot \nabla T = \mathrm{Pe}^{-1} \Delta T$

Notation: $u$ velocity, $T$ temperature, $p$ pressure, $\hat{r}$ radial unit vector.

Dimensionless numbers:
• Rayleigh number (Ra): determines the presence and strength of convection
• Peclet number (Pe): ratio of convective to diffusive transport

The mantle convection problem is defined by the model equations in the interior of the spherical shell together with suitable initial/boundary conditions.

Simulation shown:
• 6.5×10^9 DoF
• 10 000 time steps
• run time 7 days
• mid-size cluster: 288 compute cores in 9 nodes of LSS at FAU
Comparison of Stokes Solvers

Three solvers:
• Schur complement CG (SCG, pressure correction) with MG
• MINRES with MG for the Vector-Laplace preconditioner
• MG for the saddle point problem with Uzawa type smoother

Table 5. Weak scaling results: time to solution for a residual reduction by 10^-8, including the coarse grid solver.

Nodes | Threads | DoFs | SCG iter | SCG time | MINRES iter | MINRES time | Uzawa iter | Uzawa time
1 | 24 | 6.6·10^7 | 28 (3) | 137.26 s | 65 (3) | 116.35 s | 7 (1) | 65.71 s
6 | 192 | 5.3·10^8 | 26 (5) | 135.85 s | 64 (7) | 133.12 s | 7 (3) | 80.42 s
48 | 1 536 | 4.3·10^9 | 26 (15) | 139.64 s | 62 (15) | 133.29 s | 7 (5) | 85.28 s
384 | 12 288 | 3.4·10^10 | 26 (30) | 142.92 s | 62 (35) | 138.26 s | 7 (10) | 93.19 s
3 072 | 98 304 | 2.7·10^11 | 26 (60) | 154.73 s | 62 (71) | 156.86 s | 7 (20) | 114.36 s
24 576 | 786 432 | 2.2·10^12 | 28 (120) | 187.42 s | 64 (144) | 184.22 s | 8 (40) | 177.69 s

Table 6. Weak scaling results without the coarse grid time; without the coarse grid solver the differences between the solvers shrink.

Nodes | Threads | DoFs | SCG iter | SCG time | MINRES iter | MINRES time | Uzawa iter | Uzawa time
1 | 24 | 6.6·10^7 | 28 | 136.30 s | 65 | 115.05 s | 7 | 58.80 s
6 | 192 | 5.3·10^8 | 26 | 134.26 s | 64 | 130.56 s | 7 | 64.40 s
48 | 1 536 | 4.3·10^9 | 26 | 135.06 s | 62 | 128.34 s | 7 | 65.03 s
384 | 12 288 | 3.4·10^10 | 26 | 135.41 s | 62 | 128.34 s | 7 | 64.96 s
3 072 | 98 304 | 2.7·10^11 | 26 | 139.55 s | 62 | 133.30 s | 7 | 66.08 s
24 576 | 786 432 | 2.2·10^12 | 28 | 154.06 s | 64 | 139.52 s | 8 | 78.24 s
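For orientation on the third variant, a generic (inexact) Uzawa iteration for the discrete Stokes saddle point system can be written as below; this is the textbook form with a relaxation parameter ω and an approximate velocity solve Â⁻¹, not necessarily the specific Uzawa-type smoother used in the cited paper.

```latex
\begin{pmatrix} A & B^{T} \\ B & 0 \end{pmatrix}
\begin{pmatrix} u \\ p \end{pmatrix} =
\begin{pmatrix} f \\ 0 \end{pmatrix},
\qquad
u^{k+1} = u^{k} + \hat{A}^{-1}\!\left(f - A u^{k} - B^{T} p^{k}\right),
\qquad
p^{k+1} = p^{k} + \omega\, B\, u^{k+1},
```

where $\hat{A}^{-1}$ would typically be realized by a few multigrid smoothing steps on the velocity block.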
Towards a holistic performance engineering methodology:
Parallel Textbook Multigrid Efficiency Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015
Textbook Multigrid Efficiency (TME)

"Textbook multigrid efficiency means solving a discrete PDE problem with a computational effort that is only a small (less than 10) multiple of the operation count associated with the discretized equations itself." [Brandt, 98]

• work unit (WU) = single elementary relaxation
• classical algorithmic TME factor: ops for solution / ops for work unit
• new parallel TME factor to quantify algorithmic efficiency combined with implementation scalability
Parallel TME

$\mu_{sm}$: number of elementary relaxation steps per second on a single core
$U$: number of cores
$U \mu_{sm}$: aggregate relaxation performance

Idealized time for a work unit: $T_{WU}(N, U) = \dfrac{N}{U \mu_{sm}}$

$T(N, U)$: time to solution for $N$ unknowns on $U$ cores

Parallel textbook efficiency factor:
$E_{\mathrm{ParTME}}(N, U) = \dfrac{T(N, U)}{T_{WU}(N, U)} = \dfrac{T(N, U)\, U \mu_{sm}}{N}$

It combines algorithmic and implementation efficiency.
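A minimal sketch of how this factor could be evaluated from measurements. The update rate and core count below are the SuperMUC figures quoted later in the talk and the problem size matches the E_ParTME2 column of Table 4, but the time to solution is a made-up placeholder, so the printed factor is only meant to show the order of magnitude.

```c
#include <stdio.h>

/* E_ParTME(N, U) = T(N, U) / T_WU(N, U)  with  T_WU(N, U) = N / (U * mu_sm). */
static double e_par_tme(double N, double U, double mu_sm, double T)
{
    double T_WU = N / (U * mu_sm);    /* idealized time for one work unit */
    return T / T_WU;                  /* = T * U * mu_sm / N              */
}

int main(void)
{
    double mu_sm = 176e6;    /* single-core update rate measured for the RB-GS kernel (updates/s) */
    double U     = 16384;    /* cores, as in the E_ParTME2 column of Table 4                      */
    double N     = 2e11;     /* unknowns, as in the E_ParTME2 column of Table 4                   */
    double T     = 1.5;      /* time to solution in seconds: placeholder, not a measured value    */

    printf("aggregate rate U*mu_sm = %.3g updates/s\n", U * mu_sm);
    printf("E_ParTME               = %.1f work units\n", e_par_tme(N, U, mu_sm, T));
    return 0;
}
```

With these inputs the factor comes out in the low twenties, i.e. a solve would cost on the order of twenty ideal work units; the actually measured factors are given in Table 4 below.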
Algorithmic Performance

Residual reduction per work unit (WU).

(Figure: Euclidean residual, from 10^0 down to 10^-9, versus work units, 0 to 140, for the scalar constant-coefficient case (CC) on a resolution of 8.6·10^9 grid points on 4 096 cores, comparing different multigrid cycles: V(1,1), V(3,3), V(3,3) with SOR, and W(3,3) with SOR.)
Analysing TME Efficiency: RB-GS Smoother

for (int i = 1; i < (tsize-j-k-1); i = i+2) {
  /* red-black Gauss-Seidel update: the centre value is recomputed from the
     14 off-centre neighbours of the 15-point stencil and the right-hand
     side; the stride of 2 visits only the points of one colour */
  u[mp_mr+i] = c[0] * ( - c[1] *u[mp_mr+i+1] - c[2] *u[mp_tr+i-1]
                        - c[3] *u[mp_tr+i]   - c[4] *u[tp_br+i]
                        - c[5] *u[tp_br+i+1] - c[6] *u[tp_mr+i-1]
                        - c[7] *u[tp_mr+i]   - c[8] *u[bp_mr+i]
                        - c[9] *u[bp_mr+i+1] - c[10]*u[bp_tr+i-1]
                        - c[11]*u[bp_tr+i]   - c[12]*u[mp_br+i]
                        - c[13]*u[mp_br+i+1] - c[14]*u[mp_mr+i-1]
                        + f[mp_mr+i] );
}
(Figure: the 15-point stencil with the row offsets tp_mr, tp_br, mp_tr, mp_mr, mp_br, bp_tr, bp_mr used in the code.)
This loop should execute on a single SuperMUC core at 720 M updates/sec (in theory: peak performance). In practice μ_sm = 176 M updates/sec (memory access bottleneck; RB ordering prohibits vector loads). Thus the whole SuperMUC should perform U μ_sm = 147 456 × 176 M ≈ 26 T updates/sec.
Execution-Cache-Memory Model (ECM)
Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015.
Hager et al.: Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience, 2013.

(Figure: ECM model for the 15-point stencil with variable coefficients (left) and constant coefficients (right) on an SNB core. An arrow indicates a 64 Byte cache line transfer; run-times represent 8 elementary updates.)
ECM: single-chip prediction vs. measurement

(Figure 6: SNB single-chip performance scaling of the stencil smoothers on 256^3 grid points, constant and variable coefficients, for 1 to 8 cores; performance in MLups/s, measured data and ECM prediction ranges shown.)

Different from conventional wisdom, memory bandwidth is not the critical resource here.
TME and Parallel TME Results

Three model problems: scalar constant coefficients (CC), scalar variable coefficients (VC), and Stokes solved via the Schur complement (SF). Full multigrid (FMG-4CG-1V(2,1)) with the number of iterations chosen such that asymptotic optimality is maintained; in FMG the pressure correction cycle is employed, and all cycles are as cheap as possible while still keeping the ratio ≈ 2.

Setting/Measure | E_TME | E_SerTME | E_NodeTME | E_ParTME1 | E_ParTME2
Grid points | | 2·10^6 | 3·10^7 | 9·10^9 | 2·10^11
Processor cores U | | 1 | 16 | 4 096 | 16 384
(CC) - FMG(2,2) | 6.5 | 15 | 22 | 26 | 22
(VC) - FMG(2,2) | 6.5 | 11 | 13 | 15 | 13
(SF) - FMG(2,1) | 31 | 64 | 100 | 118 | -

Table 4: TME factors for different problem sizes and discretizations.

E_TME = 6.5 algorithmically for the scalar cases; the parallel TME factors are around 20 for the scalar problems and ≥ 100 for Stokes. Traversing the coarser grids of the multigrid hierarchy is less efficient than processing the finest level, partly because of communication overhead and the deteriorating surface-to-volume ratio.
TME Scalability as Guideline for Performance Engineering

Quantify the cost of an elementary relaxation:
• systematic micro-kernel benchmarks
• optimize kernel performance, justify any performance degradation
• compute the idealized aggregate performance for the large-scale parallel system

Evaluate parallel solver performance:
• quantify the TM-scalability factor as overall measure of efficiency
• analyse discrepancies
• identify possible improvements
• improve the TM-scalability factor
Exa-Scale Topic: Resilience
Fault Tolerance in Multigrid Solvers
Huber, M., Gmeiner, B., UR, Wohlmuth, B.: Resilience for Exascale Enabled Multigrid Methods, Copper Mountain Conference on Multigrid Methods, 2015
Rationale for Resilience

With increasing complexity of systems and further miniaturization, faults will become more frequent.

Classification of faults:
• fail-stop
• silent

Error correction (e.g. ECC) leads to slowdown.

Traditional techniques:
• resilience through redundancy costs silicon and energy
• checkpoint-restart is too slow

Algorithm Based Fault Tolerance (ABFT)
Model Scenario

$-\Delta u = f$ in $\Omega$, plus boundary conditions.

(Figure: residual, from 10^0 down to 10^-16, versus multigrid iterations, 0 to 20, for the cases "No Fault" and "Fault"; the annotation marks the residual after the fault.)
Iteration in Progress
One Processor Faults
... and pulls others down
larger and larger parts are affected
Disaster ....
.... strikes!
Alternative: Processor Faults
Safety bolts are installed
Superman is called for help
Safety bolts removed
Mission accomplished
Superman's Task: Recover the solution in the faulty subdomain

Assumptions:
• fault detectable (not yet with current HW and SW)
• static data (rhs and stiffness matrix) are available

Observation: Dirichlet data for the faulty subdomain are available (redundancy of the HHG data structures for faces).

Task: solve the local Dirichlet problem approximately by a local multigrid iteration:
• V-cycles
• W-cycles
• F-cycles
Visualization of Residual
• residual after fault
• residual after global V-cycle
• residual after recovery with local F-cycle
• residual after further global V-cycle
Local Recovery Strategy

... boundary data $u_\Gamma$ on $\Gamma$. Before continuing with the global multigrid iteration for solving (2.2), we approximate the auxiliary local problem by $n_F$ steps of a given iterative solver, see Alg. 1.

Algorithm 1 Local recovery (LR) algorithm
1: Solve (2.2) by multigrid cycles.
2: if Fault has occurred then
3:   STOP solving (2.2).
4:   Recover the boundary data $u_{\Gamma_F}$ from line 4 in (3.1).
5:   Initialize $u_F$ with zero.
6:   Approximate line 5 in (3.1) by solving iteratively with $n_F$ steps:
7:     $A_{FF} u_F = f_F - A_{F\Gamma_F} u_{\Gamma_F}$
8:   RETURN to line 1 with the new values $u_F$ in $\Omega_F$.
9: end if

(The block system (3.1) couples the unknowns $u_I$ in the healthy domain $\Omega_I$, the interface values $u_\Gamma$, $u_{\Gamma_F}$, and the unknowns $u_F$ in the faulty domain $\Omega_F$; the matrix is not reproduced here.)

Obviously, this algorithm is only a first step towards a fault tolerant multigrid solver and suffers from the fact that during the approximation of the local problem on $\Omega_F$, the processors assigned to the healthy domain $\Omega_I$ will wait idle. Since in practice $\Omega_F$ is small compared to $\Omega_I$, this will naturally result in many idle processors during the recovery. This performance loss will be addressed in Sec. 5. Nevertheless the local strategies will already provide valuable insight in the selection of the iterative solver for the recovery subproblem and in the choice of $n_F$.

4.1. Influence of the algorithmic parameters on the performance. To illustrate the method, we consider again the situation of Subsec. 2.2. We study five different iterative solvers: local V-cycles, local W-cycles, local F-cycles, each with ...
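A schematic rendering of Algorithm 1 as code: a compile-only skeleton whose types and function names (global_multigrid_cycle, recover_interface_values, and so on) are invented for this sketch and stand in for the corresponding steps of the algorithm, not for the interface of the actual solver.

```c
#include <stdbool.h>

/* Skeleton of the local recovery (LR) strategy of Algorithm 1; all types and
 * functions below are placeholders for the corresponding solver components. */
typedef struct Grid   Grid;     /* grid hierarchy (opaque here)     */
typedef struct Vector Vector;   /* distributed vector (opaque here) */

bool   fault_detected(void);
double global_residual(const Grid *g, const Vector *u, const Vector *f);
void   global_multigrid_cycle(Grid *g, Vector *u, const Vector *f);
void   recover_interface_values(Grid *g, Vector *u);   /* line 4: boundary data on Gamma_F */
void   set_zero_in_faulty_domain(Vector *u);           /* line 5: initialize u_F with zero */
void   local_multigrid_cycle(Grid *g, Vector *u, const Vector *f);  /* acts on Omega_F only */

void solve_with_local_recovery(Grid *g, Vector *u, const Vector *f,
                               double tol, int n_F)
{
    while (global_residual(g, u, f) > tol) {
        global_multigrid_cycle(g, u, f);            /* line 1: solve (2.2) by MG cycles */

        if (fault_detected()) {                     /* lines 2-3 */
            recover_interface_values(g, u);         /* line 4 */
            set_zero_in_faulty_domain(u);           /* line 5 */
            for (int k = 0; k < n_F; ++k)           /* lines 6-7: n_F local solver steps */
                local_multigrid_cycle(g, u, f);
            /* line 8: the global iteration continues with the recovered u_F */
        }
    }
}
```

During the n_F local steps the healthy-domain processors would indeed sit idle in this loop, which is exactly the drawback the text above points out and which the asynchronous strategy later in the talk addresses.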
Quantitative Analysis (local strategy)

(Figure 4.1: convergence of the relative residual, from 10^0 down to 10^-16, over the iterations, 0 to 20, for the fault-free case and for fault & recovery, comparing the local recovery variants 2× V-cycle, 2× W-cycle, 2× F-cycle, 15× smoothing, 2000× smoothing, and 100× PCG.)
Alternative Recovery Strategy: Asynchronous Global Iteration

Observation: during local recovery, all healthy-domain processors stay idle. Therefore: temporarily decouple the faulty and healthy domains and, asynchronously,

Iterate in the healthy domain:
• Dirichlet or Neumann BC on the interface to the faulty domain
• V-cycles

Iterate in the faulty domain:
• Dirichlet BC on the interface to the healthy domain
• V-cycles

Then re-couple the faulty and healthy domains and perform global V-cycles.
- Uli Rüde
45
Global Strategy

Algorithm 3 Dirichlet-Neumann recovery (DN) algorithm
1: Solve (2.2) by multigrid cycles.
2: if Fault has occurred then
3:   STOP solving (2.2).
4:   Compute the Neumann condition from line 2 in (3.2):
5:     $\lambda$ assembled from the healthy-side contributions $A_{\Gamma I} u_I$ and $A_{\Gamma\Gamma} u_\Gamma$
6:   Initialize $u_F$ with zero.
7:   In parallel do:
8:   a) Perform $n_F$ cycles with acceleration $\eta_{super}$:
9:      Recover/update boundary values $u_{\Gamma_F}$ from line 5 in (3.2).
10:     Use 1 MG cycle to approximate line 6 in (3.2):
11:       $A_{FF} u_F = f_F - A_{F\Gamma_F} u_{\Gamma_F}$
12:   b) Perform $n_I$ cycles:
13:      Use 1 MG cycle to approximate lines 1+2 in (3.2):
14:        the coupled system for $(u_I, u_\Gamma)$ with right-hand side $(f_I, \lambda)$
15:      Update boundary values $u_\Gamma$ from line 3 in (3.2).
16:   RETURN to line 1 with the values $u_I$ in $\Omega_I$, $u_\Gamma$ on $\Gamma$ and $u_F$ in $\Omega_F$.
17: end if

(Block system (3.2) extends (3.1) by the Neumann data $\lambda$ on the interface between the healthy domain $\Omega_I$ and the faulty domain $\Omega_F$; the matrix is not reproduced here.)

Fig. 5.1. From left to right: Scenario (I), Scenario (II) and Scenario (III) computations. Scenario (I) has 2 million unknowns and the fault (red marked tetrahedron) is located with two faces at the boundary of the computational domain and affects about 16.7% of the total unknowns. Scenario (II) has 16 million unknowns, a corruption by the fault of 2.0% (as in Subsec. 4.1), and the faulty domain has one edge coinciding with the boundary. Scenario (III) has 56 million unknowns, ...
Global Recovery: Weak Scaling

Additional time spans (in seconds) of different global recovery strategies during weak scaling with a fault after k_F = 7 iterations. "No Rec" = no recovery; DD and DN strategies with n_I healthy-domain cycles and acceleration factor η_super (time to solution in weak scaling; fault after 7 multigrid cycles and global "superman" recovery, DD & DN).

n_I = 1
Size | No Rec | DD η=1 | DD η=2 | DD η=4 | DD η=8 | DN η=1 | DN η=2 | DN η=4 | DN η=8
769^3 | 13.00 | 12.97 | 10.30 | 2.66 | -0.02 | 12.98 | 10.30 | 2.66 | 0.02
1 281^3 | 13.05 | 12.79 | 10.37 | 2.52 | 0.28 | 12.78 | 10.40 | 2.53 | 0.23
2 305^3 | 13.25 | 10.34 | 7.86 | 2.98 | 0.08 | 10.66 | 7.91 | 2.98 | 0.08
4 353^3 | 10.89 | 10.81 | 5.50 | -0.02 | 0.15 | 10.57 | 5.29 | -0.21 | -0.83

n_I = 2
Size | No Rec | DD η=1 | DD η=2 | DD η=4 | DD η=8 | DN η=1 | DN η=2 | DN η=4 | DN η=8
769^3 | 13.00 | 12.72 | 5.03 | -0.10 | -0.24 | 12.74 | 5.05 | -0.09 | -0.24
1 281^3 | 13.05 | 12.52 | 5.00 | -0.30 | -0.03 | 12.53 | 5.01 | -0.28 | -0.03
2 305^3 | 13.25 | 10.05 | 4.98 | 0.07 | -0.22 | 10.11 | 5.03 | 0.07 | -0.22
4 353^3 | 10.89 | 7.81 | 2.52 | 2.28 | 2.49 | 7.63 | 2.33 | -0.54 | -0.35

n_I = 3
Size | No Rec | DD η=1 | DD η=2 | DD η=4 | DD η=8 | DN η=1 | DN η=2 | DN η=4 | DN η=8
769^3 | 13.00 | 9.95 | 2.28 | -0.36 | -0.48 | 9.99 | 2.31 | -0.32 | -0.47
1 281^3 | 13.05 | 9.26 | 2.17 | -0.56 | -0.31 | 9.70 | 2.15 | -0.54 | -0.32
2 305^3 | 13.25 | 9.78 | 2.15 | -0.21 | -0.52 | 9.81 | 2.15 | -0.21 | -0.51
4 353^3 | 10.89 | 7.53 | 2.21 | 1.99 | 2.16 | 7.27 | -0.65 | -0.85 | -0.68

n_I = 4
Size | No Rec | DD η=1 | DD η=2 | DD η=4 | DD η=8 | DN η=1 | DN η=2 | DN η=4 | DN η=8
…
Conclusions and Outlook

• 2.8×10^12 unknowns in a fully implicit solve demonstrated with peta-scale machines
• but optimal algorithms are not necessarily fast
• multigrid scales to peta-scale and beyond
• HHG: lean and mean implementation, excellent time to solution

Can we do 10^13 DoF?