Lattice Boltzmann methods on the way to Exascale
Ulrich Rüde, Algo Team December 8, 2016
Outline
Motivation
Building Blocks for the Direct Simulation of Complex Flows
1. Supercomputing: scalable algorithms, efficient software
2. Solid phase: rigid body dynamics
3. Fluid phase: lattice Boltzmann method
4. Electrostatics: finite volumes
5. Fast implicit solvers: multigrid
6. Gas phase: free surface tracking, volume of fluids
Multi-physics applications: coupling, examples, perspectives
Multi-PetaFlops Supercomputers

Sunway TaihuLight (TOP500 #1): SW26010 processor, 10,649,600 cores, 260 cores (1.45 GHz) per node, 32 GiB RAM per node, 125 PFlops peak, power consumption 15.37 MW.

JUQUEEN (TOP500 #13): Blue Gene/Q architecture, 458,752 PowerPC A2 cores, 16 cores (1.6 GHz) per node, 16 GiB RAM per node, 5D torus interconnect, 5.8 PFlops peak.

SuperMUC (phase 1) (TOP500 #27): Intel Xeon architecture, 147,456 cores, 16 cores (2.7 GHz) per node, 32 GiB RAM per node, pruned tree interconnect, 3.2 PFlops peak.
Building Block II: Granular Media Simulations with the pe Physics Engine

The Lagrangian view: 1 250 000 spherical particles, 256 processors, 300 300 time steps; runtime: 48 h (including data output); visualization via texture mapping and ray tracing.

Pöschel, T., & Schwager, T. (2005). Computational granular dynamics: models and algorithms. Springer Science & Business Media.
Lagrangian Particle Representation

A single particle is described by
- its state variables (position x, orientation φ, translational and angular velocities v and ω),
- a parameterization of its shape S (e.g. geometric primitive, composite object, or mesh), and
- its inertia properties (mass m, principal moments of inertia Ixx, Iyy, and Izz).
The Newton-Euler equations of motion for rigid bodies describe the rate of change of the state variables:
$$\begin{pmatrix} \dot{x}(t) \\ \dot{\varphi}(t) \end{pmatrix} = \begin{pmatrix} v(t) \\ Q(\varphi(t))\,\omega(t) \end{pmatrix}, \qquad M(\varphi(t)) \begin{pmatrix} \dot{v}(t) \\ \dot{\omega}(t) \end{pmatrix} = \begin{pmatrix} f(s(t),t) \\ \tau(s(t),t) - \omega(t) \times I(\varphi(t))\,\omega(t) \end{pmatrix}$$
• Time integration: an integrator of order one, similar to semi-implicit Euler.
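Schematically, such an integrator can be sketched as follows (a minimal illustration in plain Python, not the pe implementation; a quaternion-based orientation update is used here as one possible realization of Q(φ)):

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                     [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def integrate(x, q, v, w, f, tau, m, I_body, dt):
    """One first-order (semi-implicit Euler style) step for a single body:
    velocities are updated from the Newton-Euler equations first, then the
    position and orientation are advanced with the *new* velocities."""
    R = quat_to_matrix(q)
    I_world = R @ np.diag(I_body) @ R.T              # I(phi) in the world frame
    v = v + dt * f / m
    w = w + dt * np.linalg.solve(I_world, tau - np.cross(w, I_world @ w))
    x = x + dt * v
    q = q + dt * 0.5 * quat_mul(np.concatenate(([0.0], w)), q)  # dq/dt = 1/2 (0,w)*q
    return x, q / np.linalg.norm(q), v, w
```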
Contact Models
Hard contacts: an alternative to the discrete element method

Hard contacts
• require impulses,
• exhibit non-differentiable but continuous trajectories,
• have contact reactions that are in general defined only implicitly,
• can have non-unique solutions,
• and can be solved numerically by methods from two classes.
Fig.: Bouncing ball with a soft and a hard contact model.
⇒ measure differential inclusions
Moreau, J., Panagiotopoulos, P. (1988). Nonsmooth mechanics and applications, vol. 302. Springer, Wien-New York.
Popa, C., Preclik, T., & UR (2014). Regularized solution of LCP problems with application to rigid body dynamics. Numerical Algorithms, 1-12. Preclik, T. & UR (2015). Ultrascale simulations of non-smooth granular dynamics; Computational Particle Mechanics, DOI: 10.1007/s40571-015-0047-6
Nonlinear Complementarity and Time Stepping

Non-penetration conditions (ξ: signed contact gap, λ: continuous contact forces, Λ: discrete contact impulses):

Signorini condition, together with its velocity- and acceleration-level forms for closed contacts:
$$0 \le \xi \perp \lambda_n \ge 0, \qquad \xi = 0:\; 0 \le \dot{\xi}^+ \perp \lambda_n \ge 0, \qquad \xi = 0,\ \dot{\xi}^+ = 0:\; 0 \le \ddot{\xi}^+ \perp \lambda_n \ge 0$$

Coulomb friction conditions:

friction cone condition
$$\|\lambda_{to}\|_2 \le \mu\,\lambda_n$$

frictional reaction opposes slip
$$\mu\,\lambda_n\, v_{to} + \|v_{to}\|_2\,\lambda_{to} = 0 \quad (v_{to} \ne 0), \qquad \mu\,\lambda_n\, \dot{v}^+_{to} + \|\dot{v}^+_{to}\|_2\,\lambda_{to} = 0 \quad (v_{to} = 0)$$

Discretization underlying the time stepping: continuous forces are replaced by discrete impulses Λ, and the Signorini and impact conditions are imposed on the post-impulse state with time step δt:
$$0 \le \xi + \delta t\, v^+_n(\Lambda) \perp \Lambda_n \ge 0, \qquad \|\Lambda_{to}\|_2 \le \mu\,\Lambda_n, \qquad \mu\,\Lambda_n\, v^+_{to}(\Lambda) + \|v^+_{to}(\Lambda)\|_2\,\Lambda_{to} = 0$$
Parallel Computation

Key features of the parallelization:
- domain partitioning and distribution of data
- contact detection
- synchronization protocol
- subdomain NBGS with accumulators and corrections
- aggressive message aggregation
- nearest-neighbor communication

Iglberger, K., & UR (2010). Massively parallel granular flow simulations with non-spherical particles. Computer Science - Research and Development, 25(1-2), 105-113.
Iglberger, K., & UR (2011). Large-scale rigid body simulations. Multibody System Dynamics, 25(1), 81-95.
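To fix ideas, here is a serial, purely illustrative sketch of a nonsmooth (projected) Gauss-Seidel sweep over the contacts, restricted to frictionless normal impulses for brevity; the actual NBGS solver also projects tangential impulses onto the friction cone and runs per subdomain with accumulator/correction exchange between neighbors:

```python
import numpy as np

def nbgs_sweeps(vel, inv_mass, contacts, impulses, dt, n_sweeps=50):
    """Projected Gauss-Seidel over frictionless contacts (illustration only).

    vel:      (N, 3) body velocities, updated in place
    inv_mass: (N,)   inverse masses
    contacts: list of (i, j, n, xi) with unit normal n from body i to j
              and signed gap xi
    impulses: per-contact accumulated normal impulses, updated in place
    """
    for _ in range(n_sweeps):
        for c, (i, j, n, xi) in enumerate(contacts):
            # relative normal velocity with all impulses applied so far
            v_rel = np.dot(n, vel[j] - vel[i])
            # enforce 0 <= xi/dt + v_rel  complementary to  impulse >= 0
            k = inv_mass[i] + inv_mass[j]            # effective inverse mass
            new_imp = max(0.0, impulses[c] - (xi / dt + v_rel) / k)
            d = new_imp - impulses[c]
            impulses[c] = new_imp
            # apply the impulse correction immediately (Gauss-Seidel style)
            vel[i] -= d * inv_mass[i] * n
            vel[j] += d * inv_mass[j] * n
```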
Shaker scenario with sharp-edged hard objects
864 000 sharp-edged particles with a diameter between 0.25 mm and 2 mm.
PE marble run - rigid objects in complex geometry
Animation by Sebastian Eibl and Christian Godenschwager.
Scaling Results

- The solver is algorithmically not optimal for dense systems and hence cannot scale unconditionally, but it is highly efficient in many cases of practical importance.
- Strong and weak scaling results for a constant number of iterations, performed on SuperMUC and JUQUEEN.
- Largest ensembles computed: 2.8 × 10^10 non-spherical particles and 1.1 × 10^10 contacts, four orders of magnitude more particles than in state-of-the-art implementations.
- Breakup of compute times for a granular gas: weak-scaling results on the RRZE cluster Emmy.

[Figures] (a) Weak-scaling graph on the Emmy cluster (average time per time step and 1000 particles, parallel efficiency); (b) weak-scaling graph on the JUQUEEN supercomputer. Time-step profiles of the granular gas executed with 5 × 2 × 2 = 20 processes on a single node and with 8 × 8 × 5 = 320 processes on 16 nodes (253 particles per process).

The T coordinate is limited by the number of processes per node, which was 64 for the above measurements. Upon creation of a three-dimensional communicator, the three dimensions of the domain partitioning are mapped in row-major order. If the number of processes in the z-dimension is less than the number of processes per node, a two- or even three-dimensional section of the domain partitioning is mapped to a single node; if it is larger or equal, only a one-dimensional section is mapped to each node, which performs considerably less intra-node communication than a two- or three-dimensional section. This matches exactly the situation for 2 048 and 4 096 nodes: for 2 048 nodes a two-dimensional section 1 × 2 × 32 of the 64 × 64 × 32 domain partitioning is mapped to each node, and for 4 096 nodes a one-dimensional section 1 × 1 × 64 of the 64 × 64 × 64 domain partitioning. To substantiate this claim, we confirmed that the performance jump occurs when the last dimension of the domain partitioning reaches the number of processes per node, also when using 16 and 32 processes per node. The scaling experiment with one-dimensional domain decompositions (20 × 1 × 1, ..., 10 240 × 1 × 1) performs best. The weak-scaling setup on SuperMUC differs from the granular gas scenario in that it is more dilute: the distance between the centers of two granular particles along each spatial dimension is 2 cm.
Building Block III:
Scalable Flow Simulations with the Lattice Boltzmann Method
Lallemand, P., & Luo, L. S. (2000). Theory of the lattice Boltzmann method: Dispersion, dissipation, isotropy, Galilean invariance, and stability. Physical Review E, 61(6), 6546.
Feichtinger, C., Donath, S., Köstler, H., Götz, J., & Rüde, U. (2011). WaLBerla: HPC software design for computational engineering simulations. Journal of Computational Science, 2(2), 105-112.
The stream step: move PDFs into neighboring cells.
Non-local part: linear propagation to the neighbors (stream step).
Local part: non-linear operator (collide step).
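For orientation, a compact single-relaxation-time (BGK) D2Q9 sketch in NumPy illustrates this split into a purely local collide step and a linear stream step; the production code uses optimized D3Q19 TRT/MRT kernels in C++, so this is an assumption-free toy model, not the waLBerla kernel:

```python
import numpy as np

# D2Q9 lattice velocities and weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def lbm_step(f, omega):
    """One BGK lattice Boltzmann step on a fully periodic grid.
    f: PDFs of shape (9, nx, ny); omega: relaxation rate."""
    # collide: purely local and non-linear
    rho = f.sum(axis=0)
    u = np.einsum('qi,qxy->ixy', c.astype(float), f) / rho
    cu = np.einsum('qi,ixy->qxy', c.astype(float), u)
    usq = (u * u).sum(axis=0)
    feq = w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    f = f - omega * (f - feq)
    # stream: linear propagation of each PDF to the neighboring cell
    for q in range(9):
        f[q] = np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
    return f
```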
Performance on Coronary Arteries Geometry

Weak scaling: 458,752 cores of JUQUEEN, over a trillion (10^12) fluid lattice cells at cell sizes of 1.27 µm (for comparison: red blood cells have a diameter of 7 µm), 2.1 × 10^12 cell updates per second, as fast as 0.41 PFlops; the geometry is shown with color-coded processor assignment.

Strong scaling: 32,768 cores of SuperMUC, cell sizes of 0.1 mm, 2.1 million fluid cells, 6000+ time steps per second.

Godenschwager, C., Schornbaum, F., Bauer, M., Köstler, H., & UR (2013). A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis (p. 35). ACM.
Single Node Performance: SuperMUC and JUQUEEN
[Figure] Single-node performance of the standard, optimized, and vectorized LBM kernels.
Pohl, T., Deserno, F., Thürey, N., UR, Lammers, P., Wellein, G., & Zeiser, T. (2004). Performance evaluation of parallel largescale lattice Boltzmann applications on three supercomputing architectures. Proceedings of the 2004 ACM/IEEE conference on Supercomputing (p. 21). IEEE Computer Society. Donath, S., Iglberger, K., Wellein, G., Zeiser, T., Nitsure, A., & UR (2008). Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems. International Journal of Computational Science and Engineering, 4(1), 3-11.
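As a rough sanity check on such numbers, LBM node performance is typically memory-bandwidth bound. A back-of-the-envelope bound for a D3Q19 double-precision kernel, assuming one read and one write of all 19 PDFs per cell update and an illustrative (not measured) sustained bandwidth, looks like this:

```python
# Back-of-the-envelope memory-bandwidth bound for a D3Q19 LBM kernel.
# Assumptions (illustrative only): double precision PDFs, 19 reads plus
# 19 writes of 8 bytes per cell update, no temporal blocking.
bytes_per_update = 19 * 2 * 8           # = 304 bytes per fluid cell update
node_bandwidth = 85e9                   # assumed ~85 GB/s sustained per node
max_mlups = node_bandwidth / bytes_per_update / 1e6
print(f"bandwidth-limited bound: {max_mlups:.0f} MLUPS per node")
# ~280 MLUPS/node under these assumptions; well-optimized kernels reach a
# sizeable fraction of this bound.
```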
Weak scaling for TRT lid-driven cavity on uniform grids
JUQUEEN: 16 processes per node, 4 threads per process. SuperMUC: 4 processes per node, 4 threads per process.
[Figure] Fluid lattice cell updates per second over the number of nodes; the largest runs sustain on the order of 10^12 cell updates per second (TLUPS).
Körner, C., Pohl, T., UR., Thürey, N., & Zeiser, T. (2006). Parallel lattice Boltzmann methods for CFD applications. In Numerical Solution of Partial Differential Equations on Parallel Computers (pp. 439-466). Springer Berlin Heidelberg.
Feichtinger, C., Habich, J., Köstler, H., UR, & Aoki, T. (2015). Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters. Parallel Computing, 46, 1-13.
Automatic Generation of Efficient LBM Code

lbmpy: equations with fields and neighbor accesses
- collision: moment-based (SRT, TRT, MRT), cumulant, (entropic)
- propagation: source/destination, EsoTwist, AABB
- LBM-specific transformations: specific common subexpression elimination, loop splitting, input/output of macroscopic values

pystencils: abstract syntax tree (kernel, loop, assign, condition, add, mul, array access, ...)
- transformations: loop splitting, loop blocking, moving constants before the loop, ...
- backends: C(++), CUDA, LLVM
- the generated functions / C/C++ code are used via Python JIT, in waLBerla, or in other C/C++ frameworks

Bauer, M., Schornbaum, F., Godenschwager, C., Markl, M., Anderl, D., Köstler, H., & Rüde, U. (2015). A Python extension for the massively parallel multiphysics simulation framework waLBerla. International Journal of Parallel, Emergent and Distributed Systems.
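The lbmpy/pystencils API is not reproduced here; the toy sketch below only illustrates the underlying idea of the pipeline, turning symbolic update equations into backend code with plain SymPy (a stand-in for the real abstract syntax tree, transformation, and backend machinery):

```python
import sympy as sp

f0, f1, omega, u = sp.symbols('f0 f1 omega u')

# symbolic "collide"-style update equations of a toy two-population model
rho = f0 + f1
feq0 = sp.Rational(1, 2) * rho * (1 + u)
feq1 = sp.Rational(1, 2) * rho * (1 - u)
updates = [f0 - omega * (f0 - feq0),
           f1 - omega * (f1 - feq1)]

# transformation pass: common subexpression elimination
subexprs, reduced = sp.cse(updates)

# backend: emit C-like code (pystencils additionally generates the loop
# nest, applies vectorization/blocking, and JIT-compiles the kernel)
for sym, expr in subexprs:
    print(f"const double {sym} = {sp.ccode(expr)};")
for i, expr in enumerate(reduced):
    print(f"f{i}_new = {sp.ccode(expr)};")
```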
Automatic Generation of Efficient LBM Code: measured performance improvements
Partitioning and Parallelization
- static block-level refinement (→ forest of octrees)
- static load balancing
- compact (KiB/MiB) binary MPI IO to and from disk
- separation of domain partitioning from simulation (optional)
- allocation of block data (→ grids)
Flow through structure of thin crystals (filter)
work with Jose Pedro Galache and Antonio Gil, CMT-Motores Termicos, Universitat Politecnica de Valencia
Parallel AMR Load Balancing
2:1 balanced grid (used for the LBM); different views on the domain partitioning:
- distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
- forest of octrees: the octrees are not explicitly stored, but implicitly defined via block IDs
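A minimal sketch of how such implicit block IDs can work (an assumption for illustration, not waLBerla's exact encoding): the ID of a block is formed by appending one octant digit (3 bits) per refinement level to its parent's ID, so parent/child relations and refinement levels follow from bit arithmetic and the tree itself never needs to be stored:

```python
def child_id(block_id, octant):
    """Append one refinement level: octant in 0..7 selects the child."""
    assert 0 <= octant < 8
    return (block_id << 3) | octant

def parent_id(block_id):
    """Drop the last octant digit to get the parent block."""
    return block_id >> 3

def level(block_id, root_id=0b1):
    """Number of refinement levels below the root marker bit."""
    lvl = 0
    while block_id > root_id:
        block_id >>= 3
        lvl += 1
    return lvl

# Example: refine the root block into octant 5, then octant 2
b = child_id(child_id(0b1, 5), 2)
print(bin(b), level(b))        # 0b1101010, level 2
```

With IDs of this kind, the forest is fully described by the set of <block ID, process rank> pairs held in the distributed graph.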
AMR and Load Balancing with waLBerla
Isaac, T., Burstedde, C., Wilcox, L. C., & Ghattas, O. (2015). Recursive algorithms for distributed forests of octrees. SIAM Journal on Scientific Computing, 37(5), C497-C531. Meyerhenke, H., Monien, B., & Sauerwald, T. (2009). A new diffusion-based multilevel algorithm for computing graph partitions. Journal of Parallel and Distributed Computing, 69(9), 750-761. Schornbaum, F., & Rüde, U. (2016). Massively Parallel Algorithms for the Lattice Boltzmann Method on NonUniform Grids. SIAM Journal on Scientific Computing, 38(2), C96-C126.
AMR Performance

Benchmark environments:
- JUQUEEN (5.0 PFLOP/s): Blue Gene/Q, 459K cores, 1 GB/core; compiler: IBM XL / IBM MPI
- SuperMUC (2.9 PFLOP/s): Intel Xeon, 147K cores, 2 GB/core; compiler: Intel XE / IBM MPI

Benchmark (LBM D3Q19 TRT), avg. blocks/process (max. blocks/process):

level | initially | after refresh | after load balance
  0   | 0.383 (1) | 0.328 (1)     | 0.328 (1)
  1   | 0.656 (1) | 0.875 (9)     | 0.875 (1)
  2   | 1.313 (2) | 3.063 (11)    | 3.063 (4)
  3   | 3.500 (4) | 3.500 (16)    | 3.500 (4)

Source: Peta-Scale Simulations with the HPC Framework waLBerla: Massively Parallel AMR for the LBM. Florian Schornbaum, FAU Erlangen-Nürnberg, April 15, 2016.
AMR Performance: grid refresh benchmark (same environments and LBM D3Q19 TRT benchmark as above). During this refresh process, all cells on the finest level are coarsened and the same amount of fine cells is created by splitting coarser cells → 72% of all cells change their size.
AMR Performance: JUQUEEN, space filling curve (Morton)
[Figure] Refresh time in seconds over the number of cores (256 to 458,752) for three weak-scaling series with 31,062 / 127,232 / 429,408 cells per core, reaching 14 / 58 / 197 billion cells at full machine size; hybrid MPI+OpenMP version with SMP (1 process ⇔ 2 cores ⇔ 8 threads).
AMR Performance: JUQUEEN, diffusion load balancing
[Figure] Same weak-scaling benchmark as above with diffusion-based load balancing: the time is almost independent of the number of processes.
Multi-Physics Simulations for Particulate Flows

Parallel coupling with waLBerla and pe.

Ladd, A. J. (1994). Numerical simulations of particulate suspensions via a discretized Boltzmann equation. Part 1. Theoretical foundation. Journal of Fluid Mechanics, 271(1), 285-309.
Tenneti, S., & Subramaniam, S. (2014). Particle-resolved direct numerical simulation for gas-solid flow model development. Annual Review of Fluid Mechanics, 46, 199-230.
Bartuschat, D., Fischermeier, E., Gustavsson, K., & UR (2016). Two computational models for simulating the tumbling motion of elongated particles in fluids. Computers & Fluids, 127, 17-35.
Fluid-Structure Interaction
Direct simulation of particle-laden flows (4-way coupling)
Götz, J., Iglberger, K., Stürmer, M., & UR (2010). Direct numerical simulation of particulate flows on 294912 processor cores. In Proceedings of Supercomputing 2010, IEEE Computer Society. Götz, J., Iglberger, K., Feichtinger, C., Donath, S., & UR (2010). Coupling multibody dynamics and computational fluid dynamics on 8192 processor cores. Parallel Computing, 36(2), 142-151.
Simulation of suspended particle transport
Building Block IV (electrostatics)
Positively and negatively charged particles in flow subjected to a transversal electric field
Direct numerical simulation of charged particles in flow
Masilamani, K., Ganguly, S., Feichtinger, C., & UR (2011). Hybrid lattice-Boltzmann and finite-difference simulation of electroosmotic flow in a microchannel. Fluid Dynamics Research, 43(2), 025501.
Bartuschat, D., Ritter, D., & UR (2012). Parallel multigrid for electrokinetic simulation in particle-fluid flows. In High Performance Computing and Simulation (HPCS), 2012 International Conference on (pp. 374-380). IEEE.
Bartuschat, D., & UR (2015). Parallel multiphysics simulations of charged particles in microfluidic flows. Journal of Computational Science, 8, 1-19.
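The electrostatic potential is solved with a geometric multigrid method. For reference, a textbook recursive V-cycle in 1D (weighted Jacobi smoother, injection restriction, linear interpolation) sketches the multigrid idea; it is not the cell-centered finite-volume solver used in waLBerla:

```python
import numpy as np

def v_cycle(u, f, h, nu=2, omega=2/3):
    """One multigrid V-cycle for -u'' = f on a uniform 1D grid with
    homogeneous Dirichlet boundaries; u and f include the boundary nodes."""
    def smooth(u, f):
        for _ in range(nu):      # weighted Jacobi sweeps
            u[1:-1] += omega * 0.5 * (u[:-2] + u[2:] + h*h*f[1:-1] - 2*u[1:-1])
        return u

    u = smooth(u, f)
    if len(u) <= 3:                        # coarsest grid: smoothing suffices
        return u
    r = np.zeros_like(u)                   # residual of the discrete operator
    r[1:-1] = f[1:-1] - (2*u[1:-1] - u[:-2] - u[2:]) / (h*h)
    rc = r[::2].copy()                     # restriction (injection, for brevity)
    ec = v_cycle(np.zeros_like(rc), rc, 2*h)
    e = np.zeros_like(u)                   # prolongate coarse-grid correction
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return smooth(u + e, f)

# usage sketch: solve -u'' = 1 on (0, 1) with u(0) = u(1) = 0
n = 129
h = 1.0 / (n - 1)
u, f = np.zeros(n), np.ones(n)
for _ in range(8):
    u = v_cycle(u, f, h)
```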
6-way coupling

One time step couples the following components (iterated per step):
- electrostatics: charge distribution → finite-volume multigrid (treat BCs, V-cycle) → electrostatic force on each particle
- hydrodynamics: velocity boundary conditions from the moving objects → LBM (treat BCs, stream-collide step) → hydrodynamic force
- lubrication correction: correction force computed from object distances
- rigid body dynamics: Newtonian mechanics with collision response → object motion, which feeds back into the boundary conditions
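Schematically, one time step of the coupled algorithm can be paraphrased as below; all function names are placeholders for illustration, not the waLBerla/pe API:

```python
def coupled_time_step(lbm, mg, pe, particles, dt):
    """One 6-way coupled time step (schematic sketch with placeholder names)."""
    # electrostatics: assemble charge distribution, solve with multigrid
    rhs = map_charges_to_grid(particles)          # charge distribution
    potential = mg.v_cycles(rhs)                  # finite-volume MG, treat BCs
    f_el = electrostatic_forces(potential, particles)

    # hydrodynamics: map moving objects, stream-collide, momentum exchange
    set_velocity_boundary_conditions(lbm, particles)
    lbm.treat_boundaries()
    lbm.stream_collide()
    f_hyd = hydrodynamic_forces(lbm, particles)

    # short-range correction for nearly touching objects
    f_lub = lubrication_correction(particles)     # based on object distances

    # rigid body dynamics: collision response and Newtonian motion
    pe.apply_forces(f_el + f_hyd + f_lub)
    pe.resolve_collisions()
    pe.integrate(dt)                              # object motion feeds back
```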
Separation experiment

240 time steps of the fully 6-way coupled simulation run in 400 s on SuperMUC; weak scaling up to 32 768 cores with 7.1 million particles.

Doubling the domain in all three dimensions approximately doubles the number of CG iterations on the coarsest grid. This corresponds to the expected behaviour that the required number of iterations scales with the diameter of the problem (Gmeiner et al. 2014), according to the growth of the condition number (Shewchuk 1994). When doubling the problem size, the CG iterations sometimes stay constant or have to be increased; this results from different shares of Neumann and Dirichlet BCs on the boundary: whenever the relative proportion of Neumann BCs increases, convergence deteriorates and more CG iterations are necessary.

The runtimes of all parts of the algorithm are shown in Fig. 13 for different problem sizes, indicating their shares of the total runtime; the diagram is based on the maximal (MG, LBM, pe) or average (others) runtimes of the different sweeps among all processes. LBM and MG take up more than 75% of the total time. The sweeps that scale perfectly (Map, SetRHS, HydrF, LubrC, ElectF) are summarized as 'Oth'. For longer simulation times the particles attracted by the bottom wall are no longer evenly distributed, possibly causing load imbalances; however, these hardly affect the overall performance. For the simulation used for the animation, the relative share of the lubrication correction is below 0.1%, and each other sweep of 'Oth' is well below 4% of the total runtime.

Overall, the coupled multiphysics algorithm achieves 83% parallel efficiency on 2048 nodes. Since most time is spent in LBM and MG, Fig. 14 displays their parallel performance for different numbers of nodes: on 2048 nodes, MG executes 121,083 MLUPS, corresponding to a parallel efficiency of 64%, and the LBM performs 95,372 MFLUPS with 91% parallel efficiency.

Figure 13: Runtimes of the charged-particle algorithm sweeps for 240 time steps on an increasing number of nodes.
Figure 14: Weak-scaling performance of the MG and LBM sweeps for 240 time steps.
Building Block V
Volume of Fluids Method for Free Surface Flows
Joint work with Regina Ammer, Simon Bogner, Martin Bauer, Daniela Anderl, Nils Thürey, Stefan Donath, Thomas Pohl, C. Körner, A. Delgado.

Körner, C., Thies, M., Hofmann, T., Thürey, N., & UR (2005). Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics, 121(1-2), 179-196.
Donath, S., Feichtinger, C., Pohl, T., Götz, J., & UR (2010). A parallel free surface lattice Boltzmann method for large-scale applications. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, 318.
Anderl, D., Bauer, M., Rauh, C., UR, & Delgado, A. (2014). Numerical simulation of adsorption and bubble interaction in protein foams using a lattice Boltzmann method. Food & Function, 5(4), 755-763.
Free Surface Flows
Volume-of-fluids-like approach:
- flag field: compute only in fluid cells
- special free-surface boundary conditions in interface cells
- reconstruction of the curvature for surface tension
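A sketch of the bookkeeping this implies (illustrative only, not the waLBerla implementation): each cell carries a flag and a fill level, and interface cells convert to fluid or gas when their fill level leaves [0, 1]:

```python
GAS, INTERFACE, FLUID = 0, 1, 2

def update_cell(flag, fill, mass_change):
    """Toy free-surface cell update: advance the fill level of an interface
    cell and convert it when it empties or fills up.
    fill = fluid mass / cell volume (0 = gas, 1 = completely filled)."""
    if flag != INTERFACE:
        return flag, fill          # LBM is computed only in fluid/interface cells
    fill += mass_change            # mass exchange with neighbors (from the PDFs)
    if fill >= 1.0:
        return FLUID, 1.0          # cell fills up -> becomes a fluid cell
    if fill <= 0.0:
        return GAS, 0.0            # cell empties -> becomes a gas cell
    return INTERFACE, fill         # stays an interface cell
```

Neighboring cells then have to adjust their flags as well so that the fluid region stays enclosed by a closed interface layer; the curvature reconstructed from the fill levels enters the free-surface boundary condition via the surface tension.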
Simulation for hygiene products (for Procter&Gamble)
Relevant effects: capillary pressure, inclination, surface tension, contact angle.
Additive Manufacturing: Fast Electron Beam Melting
Ammer, R., Markl, M., Ljungblad, U., Körner, C., & UR (2014). Simulating fast electron beam melting with a parallel thermal free surface lattice Boltzmann method. Computers & Mathematics with Applications, 67(2), 318-330.
Ammer, R., UR, Markl, M., Jüchter, V., & Körner, C. (2014). Validation experiments for LBM simulations of electron beam melting. International Journal of Modern Physics C.
Electron Beam Melting Process
3D printing, EU project FastEBM: ARCAM (Gothenburg), TWI (Cambridge), FAU Erlangen.
- generation of the powder bed
- energy transfer by the electron beam: penetration depth, heat transfer
- flow dynamics: melting, melt flow, surface tension, wetting, capillary forces, contact angles, solidification
Simulation of Electron Beam Melting
A high-speed camera shows the melting step for manufacturing a hollow cylinder.
Powder bed generation is simulated using the pe framework.
waLBerla simulation
Conclusions
Research in Computational Science is done by teams:
Harald Köstler, Florian Schornbaum, Christian Godenschwager, Sebastian Kuckuk, Kristina Pickl, Regina Ammer, Simon Bogner, Christoph Rettinger, Dominik Bartuschat, Martin Bauer, Ehsan Fattahi, Christian Kuschel
Thank you for your attention!
Thürey, N., Keiser, R., Pauly, M., & Rüde, U. (2009). Detail-preserving fluid control. Graphical Models, 71(6), 221-228.
Thürey, N., & UR (2009). Stable free surface flows with the lattice Boltzmann method on adaptively coarsened grids. Computing and Visualization in Science, 12(5), 247-263.