Supercomputing

The Supercomputer MACH-2

Expanding the Limits of Predictability


February 19, 2018: MACH-2 Opening Celebration

Ulrich Rüde
Lehrstuhl für Simulation (LSS), Universität Erlangen-Nürnberg, www10.informatik.uni-erlangen.de
CERFACS (Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique), Toulouse, www.cerfacs.fr
[email protected]

MACH-2 is a massively parallel shared memory supercomputer operated by the Scientific Computing Administration of the Johannes Kepler University (JKU) Linz, Austria, on behalf of a consortium consisting of JKU Linz, the University of Innsbruck, the Paris Lodron University of Salzburg, Technische Universität Wien, and the Johann Radon Institute for Computational and Applied Mathematics (RICAM). The machine has been purchased for a cooperation project that runs from 2017 to 2021 and is supported by a grant of the Federal Ministry of Education, Science, and Research (BMBWF) in the frame of the HRSM 2016 call. The system was installed in October 2017 and has been fully operational since January 2018. MACH-2 is named after the Austrian physicist and philosopher Ernst Mach and replaces the previous MACH supercomputer that was operated jointly by JKU and the University of Innsbruck from 2011 to 2017.

19.2.2018

Supercomputing and Predictability — Ulrich Rüde

Supercomputer MACH-2 Linz

MACH-2 (SGI UV 3000)
ccNUMA architecture
1,728 processor cores: Intel Xeon E5-4650 v3, 2.1 GHz
20 TB global shared memory
NUMAlink 6 interconnect in a 7D enhanced hypercube topology

MACH-2 is of type SGI UV 3000 of the former company Silicon Graphics International (SGI), now Hewlett Packard Enterprise. It belongs to the class of cache coherent Non-Uniform Memory Access (ccNUMA) systems, which are massively parallel supercomputers providing a shared memory model on top of scalable hardware. MACH-2 is housed in three racks and comprises 72 blades, each with two 12-core processors of type Intel Xeon E5-4650 v3 (2.1 GHz, 30 MB L3 cache); the blades are connected by a NUMAlink 6 network in a 7D enhanced hypercube topology ("full bandwidth"). Mass storage includes SSD drives with 400 GB and 1.6 TB as well as 24 HDD drives with 10 TB each (260 TB mass storage in total).

JUQUEEN
Blue Gene/Q architecture
458,752 PowerPC A2 cores
16 cores (1.6 GHz) per node
16 GiB RAM per node
5D torus interconnect
5.8 PFlops peak
TOP 500: #22

Sunway TaihuLight
SW26010 processors
10,649,600 cores
260 cores (1.45 GHz) per node
32 GiB RAM per node
125 PFlops peak
Power consumption: 15.37 MW
TOP 500: #1


On the agenda today …
Continuum models: finite elements and implicit solvers (geophysics)
Particle based methods: rigid body dynamics (granular systems)
Mesoscopic methods: lattice Boltzmann methods (complex flows)
Perspectives: towards predictive science, Computational Science and Engineering


Building block I:

Fast Implicit Solvers: Parallel Multigrid Methods for Earth Mantle Convection

Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., & Wohlmuth, B. (2015). Performance and scalability of hierarchical hybrid multigrid solvers for Stokes systems. SIAM Journal on Scientific Computing, 37(2), C143-C168.
Gmeiner, B., Rüde, U., Stengel, H., Waluga, C., & Wohlmuth, B. (2015). Towards textbook efficiency for parallel multigrid. Numerical Mathematics: Theory, Methods and Applications, 8(01), 22-46.
Huber, M., John, L., Pustejovska, P., Rüde, U., Waluga, C., & Wohlmuth, B. (2015). Solution Techniques for the Stokes System: A priori and a posteriori modifications, resilient algorithms. ICIAM 2015.


Instationary Computation

Simplified model (dimensionless form):

$-\Delta \mathbf{u} + \nabla p = \mathrm{Ra}\, T\, \hat{\mathbf{r}}$
$\operatorname{div} \mathbf{u} = 0$
$\partial_t T + \mathbf{u} \cdot \nabla T = \mathrm{Pe}^{-1} \Delta T$

Dimensionless numbers:
• Rayleigh number (Ra): determines the vigor of the convection
• Peclet number (Pe): ratio of convective to diffusive transport

The mantle convection problem is defined by these equations on a spherical shell, together with suitable initial and boundary conditions.

Finite elements with 6.5×10⁹ DoF, 10,000 time steps, run time 7 days
Mid-size cluster: 288 compute cores in 9 nodes of LSS at FAU


Stationary Flow Field


„Tera-Scale“ Applications: what is the largest system that we can solve today?
Bergen, B., Hülsemann, F., & UR (2005): Is 1.7·10¹⁰ unknowns the largest finite element system that can be solved today? Proceedings of SC'05.

And now, 13 years later?
We have 400 TByte main memory = 4×10¹⁴ Bytes
= 5 vectors, each with N = 10¹³ double precision elements
A matrix-free implementation is necessary: even with a sparse matrix format, storing a matrix of dimension N = 10¹³ is not possible.
Which algorithm? Multigrid: Cost = C·N with C „moderate“, e.g. C = 100.
Does it parallelize well on that scale? Should we worry since 𝜿 = O(N^(2/3))?
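To make these numbers concrete, here is a minimal back-of-envelope sketch (plain Python, illustrative only; the 400 TByte, N = 10¹³, and C = 100 figures are the ones quoted above):

```python
# Back-of-envelope feasibility check for a solve with 10^13 unknowns (illustrative).
BYTES_PER_DOUBLE = 8
MEMORY_BYTES = 400e12        # ~400 TByte aggregate main memory
N = 1e13                     # number of unknowns

vectors_that_fit = MEMORY_BYTES / (N * BYTES_PER_DOUBLE)
print(f"vectors of length N that fit into memory: {vectors_that_fit:.0f}")   # ~5

# even a sparse matrix (say 27 nonzeros per row, ~12 bytes per entry) does not fit
sparse_matrix_bytes = N * 27 * 12
print(f"sparse matrix storage: {sparse_matrix_bytes / 1e15:.1f} PByte")      # ~3.2 PByte

# multigrid with a 'moderate' constant C: total work is about C * N operations
C = 100
print(f"multigrid work estimate: {C * N:.1e} operations")                    # 1e+15
```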


Exploring the Limits …

Gmeiner, B., Huber, M., John, L., UR, & Wohlmuth, B. (2016). A quantitative performance study for Stokes solvers at the extreme scale. Journal of Computational Science, 17, 509-521.

Multigrid with Uzawa smoother, optimized for minimal memory consumption:
10¹³ unknowns correspond to 80 TByte for the solution vector
Juqueen has 450 TByte memory
matrix-free implementation essential
three orders of magnitude more DoFs than the state of the art

Weak scaling on the spherical shell: the initial coarse grid consists of 240 tetrahedra for the case of 5 nodes and 80 threads; the number of degrees of freedom on the coarse grid grows from 9.0·10³ to 4.1·10⁷ during the weak scaling. The relative accuracies for the coarse grid solvers (PMINRES and CG) are set to 10⁻³ and 10⁻⁴, respectively.

nodes | threads | DoFs | iter | time | time w.c.g. | time c.g. in %
5 | 80 | 2.7·10⁹ | 10 | 685.88 | 678.77 | 1.04
40 | 640 | 2.1·10¹⁰ | 10 | 703.69 | 686.24 | 2.48
320 | 5 120 | 1.2·10¹¹ | 10 | 741.86 | 709.88 | 4.31
2 560 | 40 960 | 1.7·10¹² | 9 | 720.24 | 671.63 | 6.75
20 480 | 327 680 | 1.1·10¹³ | 9 | 776.09 | 681.91 | 12.14

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry. Times are in seconds; "time w.c.g." is the time without the coarse grid portion and "time c.g. in %" its share of the total.
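As a quick consistency check, a minimal sketch (plain Python, values copied from the table above) that computes the weak-scaling efficiency relative to the 5-node run:

```python
# Weak-scaling efficiency relative to the smallest run (times in seconds, from the table).
runs = [(5, 685.88), (40, 703.69), (320, 741.86), (2560, 720.24), (20480, 776.09)]

t_base = runs[0][1]
for nodes, t in runs:
    print(f"{nodes:6d} nodes: {t:7.2f} s, weak-scaling efficiency {t_base / t:6.1%}")
```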

Building block II:

The Lagrangian View: Granular media simulations

1 250 000 spherical particles
256 processors
300 300 time steps
runtime: 48 h (including data output)
texture mapping, ray tracing

with the physics engine pe
Pöschel, T., & Schwager, T. (2005). Computational granular dynamics: models and algorithms. Springer Science & Business Media.


Shaker scenario with sharp-edged hard objects

864 000 sharp-edged particles with a diameter between 0.25 mm and 2 mm.


PE marble run - rigid objects in complex geometry

Animation by Sebastian Eibl and Christian Godenschwager


But what can we predict? What does prediction here mean?


Parallel Computation

Key features of the parallelization:
domain partitioning
distribution of data
contact detection
synchronization protocol
subdomain NBGS, accumulators and corrections
aggressive message aggregation
nearest-neighbor communication

Iglberger, K., & UR (2010). Massively parallel granular flow simulations with non-spherical particles. Computer Science - Research and Development, 25(1-2), 105-113.
Iglberger, K., & UR (2011). Large-scale rigid body simulations. Multibody System Dynamics, 25(1), 81-95.
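To illustrate the domain partitioning and nearest-neighbor communication items above, here is a minimal sketch (plain Python, illustrative only, not the pe framework's API): each particle is owned by the subdomain containing its center, and ghost copies are sent only to the ranks whose subdomains the particle overlaps.

```python
# Illustrative: regular domain partitioning with nearest-neighbor ghost exchange.
from dataclasses import dataclass
from itertools import product

@dataclass
class Particle:
    pid: int
    pos: tuple          # (x, y, z)
    radius: float

def owner_rank(pos, domain, procs):
    """Map a position to the rank of the subdomain that owns it (row-major)."""
    ijk = [min(int(pos[d] / domain[d] * procs[d]), procs[d] - 1) for d in range(3)]
    return (ijk[0] * procs[1] + ijk[1]) * procs[2] + ijk[2]

def ghost_targets(p, domain, procs):
    """Ranks of neighboring subdomains that need a ghost copy of particle p,
    i.e. all owners of positions within one radius of the particle center."""
    targets = set()
    for dx, dy, dz in product((-p.radius, 0.0, p.radius), repeat=3):
        shifted = tuple(min(max(p.pos[d] + (dx, dy, dz)[d], 0.0), domain[d] - 1e-12)
                        for d in range(3))
        targets.add(owner_rank(shifted, domain, procs))
    targets.discard(owner_rank(p.pos, domain, procs))
    return targets

# Example: a 4x4x2 process grid on a unit box.
domain, procs = (1.0, 1.0, 1.0), (4, 4, 2)
p = Particle(pid=7, pos=(0.24, 0.50, 0.10), radius=0.02)
print("owner rank:", owner_rank(p.pos, domain, procs))
print("ghost copies go to ranks:", sorted(ghost_targets(p, domain, procs)))
```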


Scaling Results
Tobias Preclik, Ulrich Rüde

Solver algorithmically not optimal for dense systems, hence cannot scale unconditionally, but it is highly efficient in many cases of practical importance.
Strong and weak scaling results for a constant number of iterations, performed on SuperMUC and Juqueen.
Largest ensembles computed: 2.8 × 10¹⁰ non-spherical particles and 1.1 × 10¹⁰ contacts (four orders of magnitude more particles and contacts than state-of-the-art implementations).
Breakup of compute times for a granular gas on the Erlangen RRZE cluster Emmy.

The time-step profiles show that a one-dimensional section of the domain partitioning mapped to a node performs considerably less intra-node communication than a two- or three-dimensional section; a performance jump therefore occurs when the last dimension of the domain partitioning reaches the number of processes per node.

[Figures: (a) weak-scaling graph on the Emmy cluster; (b) weak-scaling graph on the Juqueen supercomputer; (c) weak-scaling results on the SuperMUC supercomputer for a more dilute setup. Figure 7.3: time-step profiles for two weak-scaling executions of the granular gas on the Emmy cluster with 25³ particles per process, executed with (a) 5 × 2 × 2 = 20 processes on a single node and (b) 8 × 8 × 5 = 320 processes on 16 nodes.]

Building Block III:

Scalable Flow Simulations with the Lattice Boltzmann Method

Succi, S. (2001). The Lattice Boltzmann Equation: For Fluid Dynamics and Beyond. Oxford University Press.
Feichtinger, C., Donath, S., Köstler, H., Götz, J., & Rüde, U. (2011). WaLBerla: HPC software design for computational engineering simulations. Journal of Computational Science, 2(2), 105-112.


Adaptive Mesh Refinement and Load Balancing

Isaac, T., Burstedde, C., Wilcox, L. C., & Ghattas, O. (2015). Recursive algorithms for distributed forests of octrees. SIAM Journal on Scientific Computing, 37(5), C497-C531.
Meyerhenke, H., Monien, B., & Sauerwald, T. (2009). A new diffusion-based multilevel algorithm for computing graph partitions. Journal of Parallel and Distributed Computing, 69(9), 750-761.
Schornbaum, F., & Rüde, U. (2015). Massively parallel algorithms for the lattice Boltzmann method on non-uniform grids. arXiv:1508.07982; SISC 2016, in print.
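The combination of block refinement and load balancing can be illustrated with a minimal sketch (plain Python, illustrative only; it mirrors neither the waLBerla nor the p4est implementation): blocks are ordered along a Morton space-filling curve and the curve is cut into pieces of roughly equal work, one per process.

```python
# Illustrative space-filling-curve load balancing for an octree-like block grid.
def morton(ix, iy, iz, bits=10):
    """Interleave the bits of the block coordinates to obtain its Morton index."""
    code = 0
    for b in range(bits):
        code |= (((ix >> b) & 1) << (3 * b)
                 | ((iy >> b) & 1) << (3 * b + 1)
                 | ((iz >> b) & 1) << (3 * b + 2))
    return code

def balance(blocks, num_procs):
    """Assign blocks (dicts with coords and a work weight) to processes by
    sorting along the Morton curve and cutting it into equal-work pieces."""
    blocks = sorted(blocks, key=lambda b: morton(*b["coords"]))
    total = sum(b["work"] for b in blocks)
    assignment, acc, proc = {}, 0.0, 0
    for b in blocks:
        # advance to the next process once its share of the total work is filled
        while acc >= (proc + 1) * total / num_procs and proc < num_procs - 1:
            proc += 1
        assignment[b["coords"]] = proc
        acc += b["work"]
    return assignment

# Example: a small 4x4x1 block grid where one refined block carries 8x the work.
blocks = [{"coords": (i, j, 0), "work": 8.0 if (i, j) == (1, 1) else 1.0}
          for i in range(4) for j in range(4)]
print(balance(blocks, num_procs=4))
```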


The stream step: move PDFs into the neighboring cells

Non-local part: linear propagation to neighbors (stream step)
Local part: non-linear operator (collide step)
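A minimal sketch of the two parts just described (plain Python with NumPy; a textbook D2Q9 BGK model, not the waLBerla implementation): the collide step is purely local in each cell, while the stream step only moves data to neighboring cells.

```python
# Textbook D2Q9 lattice Boltzmann step: local BGK collision + streaming to neighbors.
import numpy as np

# D2Q9 velocity set and weights
c = np.array([(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, ux, uy):
    cu = c[:, 0, None, None] * ux + c[:, 1, None, None] * uy   # c_i . u
    usq = ux**2 + uy**2
    return rho * w[:, None, None] * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def lbm_step(f, omega=1.0):
    # local part: compute macroscopic fields and relax towards equilibrium (collide)
    rho = f.sum(axis=0)
    ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
    f = f + omega * (equilibrium(rho, ux, uy) - f)
    # non-local part: propagate each population to its neighboring cell (stream)
    for i, (cx, cy) in enumerate(c):
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f

# tiny periodic domain, uniform density at rest
f = equilibrium(np.ones((16, 16)), np.zeros((16, 16)), np.zeros((16, 16)))
f = lbm_step(f)
print(f.sum())   # mass is conserved: 16*16 = 256
```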

Performance on Coronary Arteries Geometry

Godenschwager, C., Schornbaum, F., Bauer, M., Köstler, H., & UR (2013). A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis (p. 35). ACM.

Weak scaling (color coded process assignment):
458,752 cores of JUQUEEN
over a trillion (10¹²) fluid lattice cells
cell size 1.27 µm (diameter of red blood cells: 7 µm)
2.1·10¹² cell updates per second, 0.41 PFlops

Strong scaling:
32,768 cores of SuperMUC
cell size of 0.1 mm, 2.1 million fluid cells
6000+ time steps per second
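A quick back-of-envelope check of the weak-scaling throughput quoted above (plain Python; the per-core figure is derived here, it is not stated on the slide):

```python
# Throughput per core for the JUQUEEN weak-scaling run quoted above.
cell_updates_per_second = 2.1e12
cores = 458_752
cells = 1.0e12

print(f"{cell_updates_per_second / cores / 1e6:.1f} million cell updates per second per core")
print(f"{cell_updates_per_second / cells:.1f} time steps per second over the trillion-cell domain")
```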


Flow through structure of thin crystals (filter)

Gil, A., Galache, J. P. G., Godenschwager, C., & UR (2017). Optimum configuration for accurate simulations of chaotic porous media with Lattice Boltzmann Methods considering boundary conditions, lattice spacing and domain size. Computers & Mathematics with Applications, 73(12), 2515-2528.


Multi-Physics Simulations for Particulate Flows
Parallel Coupling with waLBerla and PE

Ladd, A. J. (1994). Numerical simulations of particulate suspensions via a discretized Boltzmann equation. Part 1. Theoretical foundation. Journal of Fluid Mechanics, 271(1), 285-309.
Tenneti, S., & Subramaniam, S. (2014). Particle-resolved direct numerical simulation for gas-solid flow model development. Annual Review of Fluid Mechanics, 46, 199-230.
Bartuschat, D., Fischermeier, E., Gustavsson, K., & UR (2016). Two computational models for simulating the tumbling motion of elongated particles in fluids. Computers & Fluids, 127, 17-35.


Fluid-Structure Interaction

direct simulation of Particle Laden Flows (4-way coupling)

Götz, J., Iglberger, K., Stürmer, M., & UR (2010). Direct numerical simulation of particulate flows on 294912 processor cores. In Proceedings of Supercomputing 2010, IEEE Computer Society.
Götz, J., Iglberger, K., Feichtinger, C., Donath, S., & UR (2010). Coupling multibody dynamics and computational fluid dynamics on 8192 processor cores. Parallel Computing, 36(2), 142-151.
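The structure of such a four-way coupled time step can be sketched as follows (plain Python; the helpers are trivial placeholders standing in for the real LBM and rigid body solvers, their names are illustrative and not the waLBerla/pe API; only the coupling structure is the point):

```python
# Illustrative four-way coupling loop: fluid update, particle-to-fluid moving boundaries,
# fluid-to-particle forces, and particle-particle contacts (subcycled rigid body dynamics).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Fluid:
    f: np.ndarray                       # LBM populations (placeholder, just carried along)

@dataclass
class Particle:
    pos: np.ndarray
    vel: np.ndarray
    force: np.ndarray = field(default_factory=lambda: np.zeros(3))

def lbm_stream_collide(fluid, obstacles):       # placeholder for the LBM update
    pass

def map_particles_to_cells(fluid, particles):   # placeholder: which cells a particle covers
    return [tuple(np.floor(p.pos).astype(int)) for p in particles]

def hydrodynamic_force(fluid, particle):        # placeholder, e.g. momentum exchange method
    return np.array([0.0, 0.0, -1e-3])          # toy constant force

def rigid_body_step(particles, dt):             # placeholder for the rigid body / contact solver
    for p in particles:
        p.vel += dt * p.force
        p.pos += dt * p.vel

def coupled_step(fluid, particles, dt, substeps=10):
    obstacles = map_particles_to_cells(fluid, particles)   # particles -> moving boundaries in the fluid
    lbm_stream_collide(fluid, obstacles)                   # fluid update (stream + collide)
    for p in particles:                                    # fluid -> particle forces
        p.force = hydrodynamic_force(fluid, p)
    for _ in range(substeps):                              # subcycled rigid body dynamics with contacts
        rigid_body_step(particles, dt / substeps)
    return fluid, particles

fluid = Fluid(f=np.zeros((19, 8, 8, 8)))        # e.g. a D3Q19 population field on an 8^3 grid
particles = [Particle(pos=np.array([4.0, 4.0, 4.0]), vel=np.zeros(3))]
coupled_step(fluid, particles, dt=1.0)
print(particles[0].pos)
```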


Simulation of suspended particle transport

Rettinger, C., Godenschwager, C., Eibl, S., Preclik, T., Schruff, T., Frings, R., & UR (2017). Fully Resolved Simulations of Dune Formation in Riverbeds. In: Kunkel J., Yokota R., Balaji P., Keyes D. (eds) High Performance Computing. ISC 2017. Lecture Notes in Computer Science, vol 10266. Springer, Cham


Building Block V

Volume of Fluids Method for Free Surface Flows
joint work with Regina Ammer, Simon Bogner, Martin Bauer, Daniela Anderl, Nils Thürey, Stefan Donath, Thomas Pohl, C. Körner, A. Delgado

Körner, C., Thies, M., Hofmann, T., Thürey, N., & UR (2005). Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics, 121(1-2), 179-196.
Donath, S., Feichtinger, C., Pohl, T., Götz, J., & UR (2010). A Parallel Free Surface Lattice Boltzmann Method for Large-Scale Applications. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, 318.
Anderl, D., Bauer, M., Rauh, C., UR, & Delgado, A. (2014). Numerical simulation of adsorption and bubble interaction in protein foams using a lattice Boltzmann method. Food & Function, 5(4), 755-763.


Free Surface Flows
Volume-of-fluids like approach
Flag field: compute only in fluid
Special "free surface" conditions in interface cells
Reconstruction of curvature for surface tension
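A minimal sketch of the flag-field idea above (plain Python with NumPy; an illustrative cell classification, not waLBerla's free surface implementation): cells are gas, interface, or fluid, the LBM work is restricted to fluid and interface cells, and the fill level decides the classification.

```python
# Illustrative flag field update for a volume-of-fluids style free surface LBM.
import numpy as np

GAS, INTERFACE, FLUID = 0, 1, 2

def update_flags(fill):
    """Classify cells by their fill level (0 = empty, 1 = full). Partially filled
    cells, and full cells that border a gas cell, form the closed interface layer;
    LBM collisions are performed only where flags != GAS."""
    flags = np.full(fill.shape, INTERFACE)
    flags[fill <= 0.0] = GAS
    flags[fill >= 1.0] = FLUID
    # a full cell next to a gas cell must stay an interface cell
    gas_neighbor = np.zeros_like(fill, dtype=bool)
    for axis in (0, 1):
        for shift in (-1, 1):
            gas_neighbor |= np.roll(fill, shift, axis=axis) <= 0.0  # periodic in this toy
    flags[(flags == FLUID) & gas_neighbor] = INTERFACE
    return flags

fill = np.zeros((6, 6))
fill[:, :3] = 1.0          # left half full of liquid
fill[:, 3] = 0.4           # partially filled column next to the gas
print(update_flags(fill))
```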


Simulation for hygiene products (for Procter&Gamble)

capillary pressure, inclination, surface tension, contact angle

Towards fully resolved 3-phase systems


Additive Manufacturing: Fast Electron Beam Melting


Electron Beam Melting Process (3D printing)
EU project FastEBM: ARCAM (Gothenburg), TWI (Cambridge), FAU Erlangen

Generation of powder bed
Energy transfer by electron beam: penetration depth, heat transfer
Flow dynamics: melting, melt flow, surface tension, wetting, capillary forces, contact angles, solidification

Ammer, R., Markl, M., Ljungblad, U., Körner, C., & UR (2014). Simulating fast electron beam melting with a parallel thermal free surface lattice Boltzmann method. Computers & Mathematics with Applications, 67(2), 318-330.
Ammer, R., UR, Markl, M., Jüchter, V., & Körner, C. (2014). Validation experiments for LBM simulations of electron beam melting. International Journal of Modern Physics C.


Simulation of Electron Beam Melting

High speed camera shows melting step for manufacturing a hollow cylinder

Simulating powder bed generation using the PE framework

WaLBerla Simulation


Perspectives beyond today …
the science of „algorithmic modeling“
towards predictive science


How big PDE problems A x = b can we solve?

400 TByte (Juqueen) main memory = 4×10¹⁴ Bytes
= 5 vectors, each with 10¹³ elements à 8 Byte (double precision)
Even with a sparse matrix format, storing a matrix of dimension 10¹³ is not possible: a matrix-free implementation is necessary.
Which algorithm? One with asymptotically optimal complexity, Cost = C·N with C „moderate“: multigrid.
Does it parallelize well? Overhead?

And now assume that we know the inverse matrix, to compute x = A⁻¹b …


10¹² - these are BIG problems

computer generation | peak performance | desired problem size DoF = N | energy estimate: 1 nJoule × N² (all-to-all communication) | TerraNeo prototype
gigascale | 10⁹ FLOPS | 10⁶ | 0.278 Wh (10 min of LED light) | 0.13 Wh
terascale | 10¹² FLOPS | 10⁹ | 278 kWh (2 weeks of blow drying hair) | 0.03 kWh
petascale | 10¹⁵ FLOPS | 10¹² | 278 GWh (1 month of electricity for Berlin) | 27 kWh
exascale | 10¹⁸ FLOPS | 10¹⁵ | 278 PWh (100 years of world electricity production) | ?

At extreme scale: optimal complexity is a must! No algorithm with O(N²) complexity is usable:
no matrix-vector multiplication
no pairwise distance computation
no complex graph partitioning
no …
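The energy column follows directly from the quoted assumption of 1 nJoule per elementary data movement and N² messages in an all-to-all exchange; a minimal sketch (plain Python) reproducing the numbers:

```python
# Energy of an N^2 all-to-all data exchange at 1 nJ per message (slide assumption).
NJ = 1e-9                      # Joule per elementary communication
for N in (1e6, 1e9, 1e12, 1e15):
    joule = NJ * N**2
    kwh = joule / 3.6e6        # 1 kWh = 3.6e6 J
    print(f"N = {N:.0e}: {kwh:.3g} kWh")
```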


Towards Predictive Science


What do the PDEs look like? Equations for the ECMWF weather model

Equations of motion (ECMWF model):
East-west wind
North-south wind
Temperature
Humidity
Continuity of mass
Surface pressure

And now: predict the weather for tomorrow!


The Two (Three!) Principles of Science

Theory: descriptive modeling; mathematical models, differential equations (Newton)
Experiments: observation, prototypes; the empirical sciences
Computational Science: simulation, optimization, (quantitative) virtual reality

Computational methods open up the path to Predictive Science in the 21st century.

Computational Science and Engineering as an emerging discipline

The universal predictive power of computing marks a new era in the history of science.
Clarke's 3rd law: "Any sufficiently advanced technology is indistinguishable from magic."
The „big data“ thing can only become science with predictive models in the back.
Societal and political consequences?
Using predictive science for decision support.

Ulrich Rüde, Karen Willcox, Lois Curfman McInnes, Hans De Sterck, … Carol S. Woodward. Research and Education in Computational Science and Engineering. arXiv preprint arXiv:1610.02608, accepted to SIREV. See also SIAM News, December 2016.


Thank you for your attention!

Thürey, N., Keiser, R., Pauly, M., & Rüde, U. (2009). Detail-preserving fluid control. Graphical Models, 71(6), 221-228.
Thürey, N., & UR (2009). Stable free surface flows with the lattice Boltzmann method on adaptively coarsened grids. Computing and Visualization in Science, 12(5), 247-263.
