Supercomputing
The Supercomputer MACH-2
Expanding the Limits of Predictability
February 19, 2018: MACH-2 Opening Celebration

Ulrich Rüde
Lehrstuhl für Simulation (LSS), Universität Erlangen-Nürnberg, www10.informatik.uni-erlangen.de
CERFACS - Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, Toulouse, www.cerfacs.fr
[email protected]

MACH-2 is a massively parallel shared memory supercomputer operated by the Scientific Computing Administration of the Johannes Kepler University (JKU) Linz, Austria, on behalf of a consortium consisting of JKU Linz, the University of Innsbruck, the Paris Lodron University of Salzburg, Technische Universität Wien, and the Johann Radon Institute for Computational and Applied Mathematics (RICAM). The machine has been purchased for a cooperation project that runs from 2017 to 2021 and is supported by a grant of the Federal Ministry of Education, Science, and Research (BMBWF) in the frame of the HRSM 2016 call. The system was installed in October 2017 and has been fully operational since January 2018. MACH-2 is named after the Austrian physicist and philosopher Ernst Mach and replaces the previous MACH supercomputer that was operated jointly by JKU and the University of Innsbruck from 2011 to 2017.
Supercomputer MACH-2 Linz

MACH-2 (SGI UV 3000):
• 1,728 processor cores, Intel Xeon E5-4650V3 (2.1 GHz)
• ccNUMA architecture
• 7D enhanced hypercube interconnect
• 20 TB shared memory

MACH-2 is of type SGI UV 3000 of the former company Silicon Graphics International (SGI), now Hewlett Packard Enterprise. It belongs to the class of cache coherent Non-Uniform Memory Access (ccNUMA) machines, which are massively parallel supercomputers offering a shared memory model on top of scalable hardware. MACH-2 is housed in three racks of this kind with the following characteristics: 1,728 cores with 20 TB of global shared memory in 72 blades, where each blade carries two 12-core processors of type Intel Xeon E5-4650V3 (2.1 GHz, 30 MB L3 cache); 64 blades are equipped with 256 GB and 8 blades with 512 GB of memory, and the blades are connected by a NUMAlink 6 network in a 7D enhanced hypercube topology, which yields a "full bandwidth" interconnect. As mass storage it provides 4 SSD drives with 400 GB each, further SSD drives with 1.6 TB each, and 24 HDD drives with 10 TB each (260 TB mass storage in total).

JUQUEEN (Blue Gene/Q architecture):
• 458,752 PowerPC A2 cores, 16 cores (1.6 GHz) per node
• 16 GiB RAM per node
• 5D torus interconnect
• 5.8 PFlops peak
• TOP 500: #22

Sunway TaihuLight (SW26010 processor):
• 10,649,600 cores, 260 cores (1.45 GHz) per node
• 32 GiB RAM per node
• 125 PFlops peak
• Power consumption: 15.37 MW
• TOP 500: #1
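Simple division of the quoted figures puts the three machines into perspective: MACH-2 offers roughly 20 TB / 1,728 ≈ 12 GB of shared memory per core, JUQUEEN 16 GiB / 16 cores = 1 GiB per core, and Sunway TaihuLight 32 GiB / 260 cores ≈ 0.12 GiB per core.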
On the agenda today …
• Continuum models: finite elements and implicit solvers (geophysics)
• Particle based methods: rigid body dynamics (granular systems)
• Mesoscopic methods: lattice Boltzmann methods (complex flows)
• Perspectives: towards predictive science, computational science and engineering
Building block I:
Fast Implicit Solvers: Parallel Multigrid Methods for Earth Mantle Convection

Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., & Wohlmuth, B. (2015). Performance and scalability of hierarchical hybrid multigrid solvers for Stokes systems. SIAM Journal on Scientific Computing, 37(2), C143-C168.
Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., & Wohlmuth, B. (2015). Towards textbook efficiency for parallel multigrid. Numerical Mathematics: Theory, Methods and Applications, 8(1), 22-46.
Huber, M., John, L., Pustejovska, P., Rüde, U., Waluga, C., & Wohlmuth, B. (2015). Solution techniques for the Stokes system: A priori and a posteriori modifications, resilient algorithms. ICIAM 2015.
Instationary Computation
Simplified model (dimensionless form):

$$-\Delta u + \nabla p = Ra \, T \, \hat{r}$$
$$\operatorname{div} u = 0$$
$$\partial_t T + u \cdot \nabla T = Pe^{-1} \Delta T$$

Dimensionless numbers:
• Rayleigh number (Ra): determines the strength of the buoyancy-driven convection
• Peclet number (Pe): ratio of convective to diffusive transport

The mantle convection problem is defined by these equations on a spherical shell, together with suitable initial and boundary conditions.
Finite elements with 6.5×10^9 DoF, 10,000 time steps, run time 7 days
Mid-size cluster: 288 compute cores in 9 nodes of the LSS cluster at FAU
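Read as a time-stepping loop, a minimal sketch of one step (assuming a simple splitting; the cited runs may use a different scheme): given $T^n$, first solve the Stokes system for velocity and pressure,

$$-\Delta u^n + \nabla p^n = Ra \, T^n \, \hat{r}, \qquad \operatorname{div} u^n = 0,$$

then advance the temperature, for instance with an explicit Euler step,

$$T^{n+1} = T^n + \Delta t \left( Pe^{-1} \Delta T^n - u^n \cdot \nabla T^n \right),$$

and repeat for the 10,000 time steps quoted above.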
Stationary Flow Field
„Tera-Scale“ Applications: what is the largest system that we can solve today?

Bergen, B., Hülsemann, F., & UR (2005). Is 1.7·10^10 unknowns the largest finite element system that can be solved today? Proceedings of SC'05.
And now, 13 years later?
We have 400 TByte main memory = 4×10^14 Bytes = 5 vectors, each with N = 10^13 double precision elements.
A matrix-free implementation is necessary: even with a sparse matrix format, storing a matrix of dimension N = 10^13 is not possible.
Which algorithm? Multigrid: Cost = C·N with C "moderate", e.g. C = 100.
Does it parallelize well on that scale? Should we worry since κ = O(N^(2/3))?
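The matrix-free argument can be made concrete with a small sketch; the 7-point Laplacian stencil, the grid size, and the function name below are illustrative assumptions, not the actual solver code:

import numpy as np

def apply_laplacian(u, h):
    # Matrix-free application of the 3D 7-point Laplacian stencil:
    # the matrix is never assembled, each output value is computed
    # directly from neighboring grid values, so memory stays at a few
    # O(N) vectors instead of O(N) matrix rows plus index data.
    v = np.zeros_like(u)
    v[1:-1, 1:-1, 1:-1] = (
        6.0 * u[1:-1, 1:-1, 1:-1]
        - u[:-2, 1:-1, 1:-1] - u[2:, 1:-1, 1:-1]
        - u[1:-1, :-2, 1:-1] - u[1:-1, 2:, 1:-1]
        - u[1:-1, 1:-1, :-2] - u[1:-1, 1:-1, 2:]
    ) / h**2
    return v

# One operator application on a small grid (the runs discussed here use ~10^13 unknowns).
n = 64
u = np.random.rand(n, n, n)
v = apply_laplacian(u, h=1.0 / (n - 1))

The same principle carries over to smoothers and residual computations in multigrid, which is why only a handful of vectors of length N ever need to be stored.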
Exploring the Limits …

Gmeiner, B., Huber, M., John, L., UR, & Wohlmuth, B. (2016). A quantitative performance study for Stokes solvers at the extreme scale. Journal of Computational Science, 17, 509-521.

• Multigrid with Uzawa smoother
• Optimized for minimal memory consumption
• 10^13 unknowns correspond to 80 TByte for the solution vector
• Juqueen has 450 TByte memory
• Matrix-free implementation essential
• Three orders of magnitude more DoFs than the state of the art

Setup (from the cited paper): such systems typically appear in simulations for molecules, quantum mechanics, or geophysics. The coarse grid consists of 240 tetrahedrons for the case of 5 nodes and 80 threads; its number of degrees of freedom grows from 9.0·10^3 to 4.1·10^7 in the weak scaling. The relative accuracies for the coarse grid solver (PMINRES and CG algorithm) are set to 10^-3 and 10^-4, respectively; all other parameters of the solver remain as previously described.

nodes     threads    DoFs         iter   time     time w.c.g.   time c.g. in %
5         80         2.7·10^9     10     685.88   678.77        1.04
40        640        2.1·10^10    10     703.69   686.24        2.48
320       5 120      1.2·10^11    10     741.86   709.88        4.31
2 560     40 960     1.7·10^12     9     720.24   671.63        6.75
20 480    327 680    1.1·10^13     9     776.09   681.91        12.14

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry.
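As a quick arithmetic check on the table: over the 4096-fold increase from 5 to 20 480 nodes, the run time grows from 685.88 to 776.09, a parallel efficiency of about 88%; without the coarse grid it is 678.77 versus 681.91, about 99.5%, so the coarse-grid solver accounts for most of the loss at scale.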
Building block II:
The Lagrangian View: Granular media simulations
1 250 000 spherical particles, 256 processors, 300 300 time steps
runtime: 48 h (including data output)
visualization: texture mapping, ray tracing
computed with the PE physics engine

Pöschel, T., & Schwager, T. (2005). Computational granular dynamics: models and algorithms. Springer Science & Business Media.
Shaker scenario with sharp-edged hard objects
864 000 sharp-edged particles with a diameter between 0.25 mm and 2 mm.
PE marble run - rigid objects in complex geometry
Animation by Sebastian Eibl and Christian Godenschwager
But what can we predict? What does prediction here mean?
Parallel Computation

Key features of the parallelization:
• domain partitioning (see the sketch below)
• distribution of data
• contact detection
• synchronization protocol
• subdomain NBGS
• accumulators and corrections
• aggressive message aggregation
• nearest-neighbor communication

Iglberger, K., & UR (2010). Massively parallel granular flow simulations with non-spherical particles. Computer Science - Research and Development, 25(1-2), 105-113.
Iglberger, K., & UR (2011). Large-scale rigid body simulations. Multibody System Dynamics, 25(1), 81-95.
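As an illustration of the domain partitioning, the sketch below maps a particle position to its owning subdomain for a regular block decomposition; the regular process grid and the function name are assumptions for illustration, not the PE data structures. In such a scheme, particles that reach into a neighboring block additionally get shadow copies there, which is what the nearest-neighbor communication keeps up to date.

import numpy as np

def owner_block(position, domain_min, domain_max, blocks):
    # Map a particle position to the (i, j, k) index of the owning
    # subdomain in a regular block decomposition of the simulation box.
    rel = (np.asarray(position) - domain_min) / (domain_max - domain_min)
    idx = np.floor(rel * blocks).astype(int)
    return tuple(int(i) for i in np.clip(idx, 0, blocks - 1))

# Example: a 4 x 4 x 2 process grid over the unit cube (illustrative numbers).
domain_min, domain_max = np.zeros(3), np.ones(3)
blocks = np.array([4, 4, 2])
print(owner_block([0.30, 0.75, 0.10], domain_min, domain_max, blocks))  # (1, 3, 0)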
Scaling Results
Tobias Preclik, Ulrich Rüde

• Solver is algorithmically not optimal for dense systems, hence cannot scale unconditionally, but is highly efficient in many cases of practical importance
• Strong and weak scaling results for a constant number of iterations, performed on SuperMUC and Juqueen
• Largest ensembles computed: 2.8×10^10 non-spherical particles and 1.1×10^10 contacts
• Four orders of magnitude more particles and contacts than state-of-the-art implementations
• Granular gas: weak-scaling graphs on the Emmy cluster (RRZE Erlangen) and on the Juqueen supercomputer, reporting the average time per time step and 1000 particles as well as the parallel efficiency
• Breakup of compute times: time-step profiles for two weak-scaling executions of the granular gas on the Emmy cluster with 25^3 particles per process, executed with 5 × 2 × 2 = 20 processes on a single node and with 8 × 8 × 5 = 320 processes on 16 nodes
Building Block III:
Scalable Flow Simulations with the Lattice Boltzmann Method

Succi, S. (2001). The lattice Boltzmann equation: for fluid dynamics and beyond. Oxford University Press.
Feichtinger, C., Donath, S., Köstler, H., Götz, J., & Rüde, U. (2011). WaLBerla: HPC software design for computational engineering simulations. Journal of Computational Science, 2(2), 105-112.
Adaptive Mesh Refinement and Load Balancing
Isaac, T., Burstedde, C., Wilcox, L. C., & Ghattas, O. (2015). Recursive algorithms for distributed forests of octrees. SIAM Journal on Scientific Computing, 37(5), C497-C531.
Meyerhenke, H., Monien, B., & Sauerwald, T. (2009). A new diffusion-based multilevel algorithm for computing graph partitions. Journal of Parallel and Distributed Computing, 69(9), 750-761.
Schornbaum, F., & Rüde, U. (2015). Massively parallel algorithms for the lattice Boltzmann method on non-uniform grids. arXiv:1508.07982, SISC 2016, in print.
The stream step: move PDFs into neighboring cells
• Non-local part: linear propagation to neighbors (stream step)
• Local part: non-linear operator (collide step)
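A minimal D2Q9 sketch of the two parts (a generic textbook BGK formulation for illustration, not the waLBerla kernels; grid size, relaxation time, and boundaries are made up):

import numpy as np

# D2Q9 lattice: velocities c_i and weights w_i
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36])

def equilibrium(rho, u):
    # Second-order BGK equilibrium distribution.
    cu = np.einsum('id,xyd->xyi', c, u)        # c_i . u
    usq = np.sum(u**2, axis=-1)[..., None]     # |u|^2
    return w * rho[..., None] * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def stream(f):
    # Non-local, linear part: shift each population along its lattice
    # velocity (periodic boundaries for simplicity).
    for i, (cx, cy) in enumerate(c):
        f[..., i] = np.roll(np.roll(f[..., i], cx, axis=0), cy, axis=1)
    return f

def collide(f, tau=0.8):
    # Local, non-linear part: BGK relaxation towards equilibrium.
    rho = f.sum(axis=-1)
    u = np.einsum('xyi,id->xyd', f, c) / rho[..., None]
    return f - (f - equilibrium(rho, u)) / tau

# One time step on a small periodic grid.
nx, ny = 64, 32
f = equilibrium(np.ones((nx, ny)), np.zeros((nx, ny, 2)))
f = stream(collide(f))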
Performance on Coronary Arteries Geometry

Godenschwager, C., Schornbaum, F., Bauer, M., Köstler, H., & UR (2013). A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis (p. 35). ACM.

Weak scaling (color-coded processor assignment):
• 458,752 cores of JUQUEEN
• over a trillion (10^12) fluid lattice cells, cell size 1.27 µm (diameter of red blood cells: 7 µm)
• 2.1×10^12 cell updates per second, 0.41 PFlops

Strong scaling:
• 32,768 cores of SuperMUC
• cell size of 0.1 mm, 2.1 million fluid cells
• 6000+ time steps per second
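As simple arithmetic on the quoted figures: 2.1×10^12 cell updates per second on 458,752 cores is roughly 4.6×10^6 cell updates per core and second, or about two time steps per second on the full 10^12-cell geometry; the strong-scaling case delivers 2.1×10^6 cells × 6,000+ steps/s ≈ 1.3×10^10 cell updates per second.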
Flow through structure of thin crystals (filter)
Gil, A., Galache, J. P. G., Godenschwager, C., & UR (2017). Optimum configuration for accurate simulations of chaotic porous media with Lattice Boltzmann Methods considering boundary conditions, lattice spacing and domain size. Computers & Mathematics with Applications, 73(12), 2515-2528.
Multi-Physics Simulations for Particulate Flows
Parallel Coupling with waLBerla and PE

Ladd, A. J. (1994). Numerical simulations of particulate suspensions via a discretized Boltzmann equation. Part 1. Theoretical foundation. Journal of Fluid Mechanics, 271(1), 285-309.
Tenneti, S., & Subramaniam, S. (2014). Particle-resolved direct numerical simulation for gas-solid flow model development. Annual Review of Fluid Mechanics, 46, 199-230.
Bartuschat, D., Fischermeier, E., Gustavsson, K., & UR (2016). Two computational models for simulating the tumbling motion of elongated particles in fluids. Computers & Fluids, 127, 17-35.
Fluid-Structure Interaction
Direct simulation of particle-laden flows (4-way coupling)

Götz, J., Iglberger, K., Stürmer, M., & UR (2010). Direct numerical simulation of particulate flows on 294,912 processor cores. In Proceedings of Supercomputing 2010, IEEE Computer Society.
Götz, J., Iglberger, K., Feichtinger, C., Donath, S., & UR (2010). Coupling multibody dynamics and computational fluid dynamics on 8192 processor cores. Parallel Computing, 36(2), 142-151.
Simulation of suspended particle transport
Rettinger, C., Godenschwager, C., Eibl, S., Preclik, T., Schruff, T., Frings, R., & UR (2017). Fully Resolved Simulations of Dune Formation in Riverbeds. In: Kunkel J., Yokota R., Balaji P., Keyes D. (eds) High Performance Computing. ISC 2017. Lecture Notes in Computer Science, vol 10266. Springer, Cham
Building Block V
Volume of Fluids Method for Free Surface Flows
Joint work with Regina Ammer, Simon Bogner, Martin Bauer, Daniela Anderl, Nils Thürey, Stefan Donath, Thomas Pohl, C. Körner, A. Delgado

Körner, C., Thies, M., Hofmann, T., Thürey, N., & UR (2005). Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics, 121(1-2), 179-196.
Donath, S., Feichtinger, C., Pohl, T., Götz, J., & UR (2010). A Parallel Free Surface Lattice Boltzmann Method for Large-Scale Applications. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, 318.
Anderl, D., Bauer, M., Rauh, C., UR, & Delgado, A. (2014). Numerical simulation of adsorption and bubble interaction in protein foams using a lattice Boltzmann method. Food & Function, 5(4), 755-763.
Free Surface Flows
• Volume-of-fluids-like approach
• Flag field: compute only in fluid cells (a minimal sketch of the cell classification follows below)
• Special "free surface" conditions in interface cells
• Reconstruction of curvature for surface tension
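A minimal sketch of how such a flag field can be derived from a fill level, assuming the usual gas/interface/liquid classification; the thresholds and array layout are illustrative, not the waLBerla data structures:

import numpy as np

GAS, INTERFACE, LIQUID = 0, 1, 2

def classify_cells(fill):
    # Classify cells by fill level: empty cells are gas, full cells are liquid,
    # partially filled cells form the interface layer where the free-surface
    # boundary conditions are applied.
    flags = np.full(fill.shape, INTERFACE, dtype=np.uint8)
    flags[fill <= 0.0] = GAS
    flags[fill >= 1.0] = LIQUID
    return flags

# The LBM kernel then loops only over liquid and interface cells:
fill = np.clip(np.random.rand(32, 32, 32) * 1.2 - 0.1, 0.0, 1.0)
flags = classify_cells(fill)
active = flags != GAS   # boolean mask of cells where stream/collide is executed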
Simulation for hygiene products (for Procter & Gamble)
capillary pressure, inclination, surface tension, contact angle
Towards fully resolved 3-phase systems
Additive Manufacturing
Fast Electron Beam Melting
Electron Beam Melting Process
3D printing: EU project FastEBM with ARCAM (Gothenburg), TWI (Cambridge), FAU Erlangen
• Generation of powder bed
• Energy transfer by electron beam: penetration depth, heat transfer
• Flow dynamics: melting, melt flow, surface tension, wetting, capillary forces, contact angles, solidification
Ammer, R., Markl, M., Ljungblad, U., Körner, C., & UR (2014). Simulating fast electron beam melting with a parallel thermal free surface lattice Boltzmann method. Computers & Mathematics with Applications, 67(2), 318-330.
Ammer, R., UR, Markl, M., Jüchter, V., & Körner, C. (2014). Validation experiments for LBM simulations of electron beam melting. International Journal of Modern Physics C.
Simulation of Electron Beam Melting
High-speed camera footage shows the melting step for manufacturing a hollow cylinder
Simulating powder bed generation using the PE framework
waLBerla simulation
Perspectives beyond today …
• the science of „algorithmic modeling“
• towards predictive science
How big PDE problems A x = b can we solve?

400 TByte (Juqueen) main memory = 4×10^14 Bytes = 5 vectors, each with 10^13 elements (8 Byte = double precision).
Even with a sparse matrix format, storing a matrix of dimension 10^13 is not possible: a matrix-free implementation is necessary.
Which algorithm? Multigrid, with asymptotically optimal complexity: Cost = C·N, C „moderate“. Does it parallelize well? Overhead?

And now assume that we knew the inverse matrix, to compute x = A^-1 b …
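Even a known inverse would not help: applying a stored dense A^-1 to b takes about N² = (10^13)² = 10^26 multiply-adds and 8·N² = 8×10^26 bytes of storage, whereas a multigrid solve costs roughly C·N ≈ 100×10^13 = 10^15 operations (taking the C = 100 from the earlier estimate) and only a few vectors of memory.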
10^12: these are BIG problems

computer generation        desired problem size DoF = N   energy estimate (kWh), 1 nJoule × N² all-to-all communication   TerraNeo prototype (kWh)
gigascale: 10^9 FLOPS      10^6                           0.278 Wh (10 min of LED light)                                   0.13 Wh
terascale: 10^12 FLOPS     10^9                           278 kWh (2 weeks of blow drying hair)                            0.03 kWh
petascale: 10^15 FLOPS     10^12                          278 GWh (1 month of electricity for Berlin)                      27 kWh
exascale: 10^18 FLOPS      10^15                          278 PWh (100 years of world electricity production)              ?
At extreme scale: optimal complexity is a must! No algorithm with O(N²) complexity is usable:
• no matrix-vector multiplication
• no pairwise distance computation
• no complex graph partitioning
• no …
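The energy estimate column is simply the stated model of one nanojoule per pair in an all-to-all communication, evaluated at the four problem sizes. For example, for the terascale row: E(N) = 1 nJ × N² = 10^-9 J × (10^9)² = 10^9 J ≈ 10^9 / (3.6×10^6) kWh ≈ 278 kWh; analogously, 0.278 Wh for N = 10^6, 278 GWh for N = 10^12, and 278 PWh for N = 10^15.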
Towards Predictive Science
What do the PDEs look like? Equations of motion for the ECMWF weather model:
• East-west wind
• North-south wind
• Temperature
• Humidity
• Continuity of mass
• Surface pressure

And now: predict the weather for tomorrow!
The Two Three Principles of Science

Descriptive modeling:
• Theory: mathematical models, differential equations (Newton)
• Experiments: observation and prototypes (empirical sciences)

Computational Science: simulation, optimization, (quantitative) virtual reality
Computational Science opens the path to the scientific method of the 21st century: Predictive Science.
Computational Science and Engineering as an emerging discipline

The universal predictive power of computing marks a new era in the history of science.
Clarke's 3rd law: "Any sufficiently advanced technology is indistinguishable from magic."
The „big data“ thing can only become science with predictive models behind it.
Using predictive science for decision support.
Societal and political consequences?

SIAM News, December 2016.
Ulrich Rüde, Karen Willcox, Lois Curfman McInnes, Hans De Sterck, … Carol S. Woodward. Research and Education in Computational Science and Engineering. arXiv preprint arXiv:1610.02608, accepted to SIREV.
Thank you for your attention!
Thürey, N., Keiser, R., Pauly, M., & Rüde, U. (2009). Detail-preserving fluid control. Graphical Models, 71(6), 221-228.
Thürey, N., & UR (2009). Stable free surface flows with the lattice Boltzmann method on adaptively coarsened grids. Computing and Visualization in Science, 12(5), 247-263.