Feb 21, 2014 - FLOPS are cheap - MEMORY access is costly. Optimization .... Domain. Decomposition. Computational. Input/Output. Raúl de la Cruz (CASE - BSC) .... Alleviate footprint and register pressure by splitting internal loops (fission).
Stencil Computations: from Academy to Industry Raúl de la Cruz, Mauricio Hanzich and Jose María Cela {raul.delacruz,mauricio.hanzich,josem.cela}@bsc.es Barcelona Supercomputing Center, Barcelona (Spain)
Parallel Processing Conference Optimizing Stencil-based Algorithms Portland, Oregon, USA, February 18-21, 2014
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
1 / 21
FLOPS are cheap - MEMORY access is costly
FLOPS/DRAM Byte Arithmetic Intensity O(1) O(log n) O(n,n2,n3) AI = [0.33, 3.5]
Irregular codes (Stencils & SpMV)
FFT
WARIS Raúl de la Cruz (CASE - BSC)
AI = [1, ~1.6]
AI = [0.25, ~2.0]
Computation bound
Memory bound
Optimization is a MUST
Dense algebra (BLAS-like DGEMM)
ATLAS ESSL
Stencil Comp.: from Academy to Industry
PP14
2 / 21
Stencil in a Nutshell Finite Difference Method 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13:
Domain decomposition of mesh for time = 0 to timeend do Read Input Pre-processing Inject source Apply boundary conditions for all points in my domain do Stencil computation (X t ) end for Exchange overlapped points Post-processing Write Output end for
Stencil Computation for k = ℓ to Y − ℓ do for j = ℓ to X − ℓ do for i = ℓ to Z − ℓ do t−1 t = C0 ∗ Xi,j,k Xi,j,k t−1 t−1 t−1 t−1 +CZ 1 ∗ (Xi−1,j,k + Xi+1,j,k ) + .. + CZ ℓ ∗ (Xi−ℓ,j,k + Xi+ℓ,j,k ) t−1 t−1 t−1 t−1 +CX 1 ∗ (Xi,j−1,k + Xi,j+1,k ) + .. + CX ℓ ∗ (Xi,j−ℓ,k + Xi,j+ℓ,k ) t−1 t−1 t−1 t−1 +CY 1 ∗ (Xi,j,k + Xi,j,k ) + .. + CY ℓ ∗ (Xi,j,k + Xi,j,k ) −1 +1 −ℓ +ℓ
end for end for end for
Load balancing Kernel computation Intra/inter-node communication
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
3 / 21
State of the Art - The Big Picture
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
4 / 21
re ct io
n)
State of the Art - The most popular av er sin g
di
a)
Backward Update
F/B
K I (unit-stride)
b)
(tr
Forward Update
F
IxTJxK block
Rivera t Const. Boundary
time
21-
1st
0-
0
2nd
3rd
4th
cache block
1
2
3
4
5
6 7 space
8
9
Time-skewing Raúl de la Cruz (CASE - BSC)
Const. Boundary
Semi-stencil
10 11 12 13
x
Cache-oblivious Stencil Comp.: from Academy to Industry
PP14
5 / 21
Raúl de la Cruz (CASE - BSC) Stencil Comp.: from Academy to Industry 64x256x512
PP14
512x512x512
256x512x512
128x512x512
64x512x512
32x512x512
16x512x512
512x256x512
256x256x512
128x256x512
Time-skewing Time-skewing+Semi
32x256x512
16x256x512
512x128x512
256x128x512
128x128x512
64x128x512
32x128x512
16x128x512
Rivera Rivera+Semi
512x64x512
256x64x512
128x64x512
64x64x512
32x64x512
16x64x512
512x32x512
Naive Naive+Semi
256x32x512
128x32x512
64x32x512
32x32x512
16x32x512
512x16x512
256x16x512
128x16x512
20
64x16x512
32x16x512
16x16x512
Time (in seconds)
State of the Art - Crunching numbers AMD Opteron performance results for 5123 volume (timesteps=2, length=4) Cache oblivious Cache oblivious+Semi
18
16
14
12
10
8
6
4
6 / 21
The reality
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
7 / 21
The black boxes Many stages are involved in an Industrial FD scheme code Stencil computation is the core of PDE+FD based simulations (up to 80% of execution time) However, the optimization of the remaining stages are also crucial in order to boost simulations Read Input
Preprocess
Source Injection
Input/Output Boundary Counditions
Neighbor Send/Recv
Raúl de la Cruz (CASE - BSC)
Computational
Stencil Operator
Postprocess
Domain Decomposition Write Output
Stencil Comp.: from Academy to Industry
PP14
8 / 21
Ash Transport-Deposition model Advection-Diffusion-Sedimentation eq.: Concentration (C t ,C t+1 ) Wind velocity (Ux ,Uy ,Uz ) Diffusion (Kx ,Ky ,Kz ) Source (S∗ ) Air density (ρ∗ ) ∂ ∂ ∂ + (Ux C) + (Uy C) + [(Uz − Vsj )C] = ∂t ∂X ∂Y ∂Z ∂ ∂C/ρ∗ ∂C/ρ∗ ∂ ρ∗ Kx + ρ∗ Ky + ∂X ∂X ∂Y ∂Y ∂ ∂C/ρ∗ + S∗ ρ∗ Kz ∂Z ∂Z
∂C
Read Input
Meteorological data • U, V, W, rho, T • p, u*, zi, L • Interpolate data
Boundary Counditions
Neighbor Send/Recv
Preprocess meteorological data: • Compute diffusion terms (Kx, Ky, Kz) • Compute Vsettling and Vdrydeposition • Scale terms, from LON-LAT to Cartesian • Scale precipitation rates (mm s-1 to mm h-1) • Add radial wind • Compute div(u) (ensure mass-conservation law) • Precompute fields: Vsj=Uz-Vsj, [Kx,Ky,Kz]*rho
Preprocess
Boundary Conditions (P1 and P2 stages): • Zero derivative for outgoing flux • Null concentration for incoming flux
Communicate P1: • MPI_Send/Recv (extra-domains) • Shared memory (intra-domains)
Raúl de la Cruz (CASE - BSC)
Postprocess
Stencil Operator
Postprocess output: • Compute ground accumulation (cdry and cwet) • Compute mass lost at boundaries, mass at volume and deposited • Add divergence correction (mass-conservation law)
Source Injection
Source injection: • Ash particles ejected by volcano
Stencil computation (P1 and P2 stages): • Advection term: Lax-wendorff scheme 2n-order (space/time), with slope limiter (minmod) • Diffusion: central difference scheme
Write Output
Stencil Comp.: from Academy to Industry
Write output: • Scale factor, LON-LAT to Cartesian • Write wet/dry deposit load (kg/cm2), flight level (flm), thickness deposit (cm), PM05/10/20 z-cummulative concentration (gr/cm2), 3D-volumes
PP14
9 / 21
Full Waveform Inversion Invert input signal to match a reference by modifiying the input model Iterative system running through different frequencies FWI EAC Production Chain: Master Side:
Worker Side:
More Frequency Iterations? No
Manage SSDB
Yes
Using Super Shots? Yes
More Gradient Iterations? No
EAC Production Chain: Master Side:
Worker Side:
Coding enabled? No
Seismic Output?
EAC Production Chain: Master Side:
Need Data Preprocessing?
Worker Side:
More Shots in Super Shot? No
No
No
Yes
Seismic Output? No Yes
Misfit History?
More Shots in Super Shot?
No
No Yes Yes
Topography? No
Seismic Output? No
Fatal Errors? No
Yes
Yes
Need Another Test Run? No
Topography? No
Fatal Errors? No
Seismic Output?
Yes No
Misfit OK?
Yes
No
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
10 / 21
Full Waveform Inversion - Example
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
11 / 21
Exaflop challenges Full Waveform Inversion Stencil: 8th order in space, 2nd order in time, staggered grids and many params Resources: 5000 shots x 5 frequencies x 20 gradient iters x 5 stencil kernels x 5 hours/stencil (using 10 MN3 nodes) = 12500000 hours (200 million core/hours) Real synthetic case: 1601x1601x601 points → ≈ 750Gb Memory per shot Dataset: Material properties Computational parameters Q (attenuation) Total volumes
Acoustic 1 (Vp field) 3 (p1,p2,p3 fields) 4
Elastic (Velocity) 1 density (rho) 12 + 24 = 36 37
Elastic (Stress) 21 elastic coef. idem 24 x 3 = 72 129
Ensemble forecasting for Ash Transport models Combine ADD models along with forecast models online (two-way feedback) Numerical prediction method generating probabilistic future states of a dynamic system Multiple simulations are conducted with slightly different initial conditions
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
12 / 21
Divide and Conquer Alleviate footprint and register pressure by splitting internal loops (fission) By doing this, conflict (related with cache associativity) and capacity misses (related with cache size) are reduced Loop fission can be performed in Z cols, Z -X planes or in whole Z -X -Y doms Prefetchers can evict data which might be reused in next iterations ! Compute main stencil do k = 1, NY do j = 1, NX ! Compute Advection term (no slope-lim) ! Single loop performing everything do i = 1, NZ c(i,j,k,0) = c(i,j,k,1) (dtddz * (u(i+1,j,k) * c(i+1,j,k) u(i-1,j,k) * c(i-1,j,k)) dtddx * (v(i,j+1,k) * c(i,j+1,k) v(i,j-1,k) * c(i,j+1,k)) dtddy * (w(i,j,k+1) * c(i,j,k+1) w(i,j,k-1) * c(i,j,k-1)) end do end do end do
Raúl de la Cruz (CASE - BSC)
! Compute main stencil do k = 1, NY do j = 1, NX do i = 1, NZ ! First block c(i,j,k,0) = c(i,j,k,1) (dtddz * (u(i+1,j,k) * c(i+1,j,k) u(i-1,j,k) * c(i-1,j,k))) end do do i = 1, NZ ! Second block c(i,j,k,0) = c(i,j,k,0) (dtddx * (v(i,j+1,k) * c(i,j+1,k) v(i,j-1,k) * c(i,j+1,k)) dtddy * (w(i,j,k+1) * c(i,j,k+1) w(i,j,k-1) * c(i,j,k-1))) end do end do end do
Stencil Comp.: from Academy to Industry
PP14
13 / 21
Traversing algorithms effectiveness Time-blocking algorithms pose many issues in Industrial codes (execution flow very complex between time-steps)
Z
The first two dimensions (Z and X ) are what really matters
3 2
Space-blocking (a.k.a. Rivera) shows partial effectiveness and clearly depends on Cache and Z -X dimensions per thread
1
X
Ash Dispersion model: MareNostrum 3. 32KB L1 D-Cache, 256KB L2 Cache and 20MB L3 Cache Domain size
256x256x64
256x512x64
256x1024x64
256x2048x64
No blocking Best blocking Speedup
0.077 0.075 (256x128) 1.02x
0.158 0.151 (256x128) 1.05x
0.428 0.334 (256x128) 1.27x
0.873412 0.66 (256x64) 1.32x
5 4.5
Blocking parameter
Time (in seconds)
4 3.5 3 2.5 2 1.5 1 0.66
0.87
16 x 32 16x x 6 64 16x 4 12 x16 64 8 x 25 x16 64 6x x6 16 16x 4 x 6 32 32x 4 x 6 64 32x 4 12 x32 64 8 x 25 x32 64 6x x6 16 32x 4 x 6 32 64x 4 x 6 64 64x 4 12 x64 64 8 x 25 x64 64 6 x 16 x64 64 x x 32 128 64 x x 64 128 64 12 x12 x6 8x 8x 4 25 12 6 6x 8x 4 16 128 64 x x 32 256 64 x x 64 256 64 12 x25 x6 8 6 4 25 x25 x64 6x 6x 16 256 64 x x 32 512 64 x x 64 512 64 12 x51 x6 8x 2x 4 25 51 6 6 2 4 16 x51 x64 x 2 32 102 x6 x 4 4 64 102 x6 4 12 x10 4x6 8 2 4 25 x10 4x6 6x 24 4 16 102 x6 x 4 4 32 204 x6 x 8 4 64 204 x64 12 x20 8x 8x 48 64 25 20 x 6x 48 64 20 x6 48 4 x6 4
0.5
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
14 / 21
Thread & Domain Affinity - SMT Approach CORE 1 SMT Up to 4 thrds
D-Cache reuse may be improved by a wise decomposition
L1D 32KB
L1D 32KB
Reduce conflict & capacity misses due to smaller footprint
L2 512KB
L2 512KB
Thre Thre ad 3 Thre ad 2 Thre ad 3 ad 2
E
0
(T
h
rd s
01)
(T CO h R rd E s 1 23)
SMT Up to 4 thrds
R O
Threads 2-3
CORE 0
SMT enables sharing cache resources among threads
Y C
Threads 0-1
Thre Thre ad 1 Thre ad 0 Thre ad 1 ad 0
Z
# pragma omp parallel firstprivate(iniy,endy) { int tid = omp_get_thread_num()%THRCORE; int dom = omp_get_max_threads()/THRCORE; int nby = (endy - iniy) / dom; int rnb = (endy - iniy) % dom; if (rnb > dom) iniy = iniy + (++nby)*dom; else iniy = iniy + nby*dom + rnb; endy = MIN(iniy+nby, endy); /* Update maingrid */ for (k= iniy+tid; k< endy; k+= THRCORE) { for (j= inix; j< endx; j++) { for (i= iniz; i< endz; i++) {
X
Ash Dispersion model: Intel MIC, 4 cores (2 threads each). Domain 256x64x1024 Metric No affinity With affinity Reduction
L1 Misses
L2 Demand + L2 HWP Misses
TLB Misses
Time (in sec)
9504992 8201066 14%
6245341 (3492094 + 2753247) 5294566 (361439 + 4933127) 16%
4411206 346578 92%
4.88 3.96 19%
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
15 / 21
WARIS Multi-purpose framework to solve FD problems Developed in a parallel and efficient way for ES and CFD problems Structured meshes (conforming and non-conforming) Modular to ease development cycles, portability and reusability Supports Explicit, Implicit and Semi-implicit methods
Extra-domain MPI process 0 (Running in CN0)
Main
High level of parallelism
Domain
Intra-domain
MPI process 1 (Running in CN1)
Domain 0-1
Main
Domain
Comm.
I/O
Intra-domain
Asynchronous I/O Computation & I/O overlapping
Intra-domain
Domain 0-0
Extra-domain
Domain 1-0
Domain
Comm.
I/O
Comm.
I/O Device
Intra-domain Domain 1-1
Domain
I/O
Comm.
Comm.
I/O
Comm.
I/O Device
I/O Device
I/O Device
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
16 / 21
WARIS - PSK Framework PSK Framework Con g
Initialize
Finalize
Scatter
Gather
Generalized framework providing an execution flow User must create problem specialization routines
More Steps?
Red blocks are provided by the PSK Framework Pre-Processing (phase 1)
Main Processing (phase 2)
Green blocks are provided by the user problem specialization P1 routines compute data required for domain decomposition exchange
Post-Processing (phase 3)
Comm Comm H-D H-H I/O H-D
I/O H-H
P2 routines perform computation over the remaining domain
Wait Comm
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
17 / 21
Embarrassingly parallel executions Process several physical problems in parallel Statistical study, optimal parameter search or production chain
Status Control Client
Resilience Fault tolerance Postprocessing Checkpointing
Master
Command Workers
Worker0 Pre/post Process Kernel
Local FS Worker1 Pre/post Process
Status Control Server
Global FS
Kernel
Preprocess Job selection Checkpointing Workern Postprocess Local FS
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
18 / 21
Example of success: WARIS-Transport vs FALL3D WARIS-Transport explicit kernel optimizations SIMDization (SSE, AVX and MIC) Auto-tune of blocking parameter Pipeline optimizations (loop fission) Parallel I/O (active buffering strategy & two-phase collective)
Eyjafjallajökull validation testcase Input dataset: 9 GBytes of meteorological data Output dataset: 200 MBytes of simulated concentration data FALL3D requires 1h and 58 minutes with 16 processors WARIS-Transport took 17 minutes to process it (Speedup 6.8×) Eyjafjallajökull case. Domain 41x241x141. Test conducted in MareNostrum (Intel SandyBridge-EP E5-2670) Number Proc. 1 2 4 8 16 24
Explicit Kernel P1 stage P2 stage 0.0 54.6 55.1 57.5 67.4 68.9
Raúl de la Cruz (CASE - BSC)
4028.1 1972.9 971.1 448.3 296.8 121.61
I/O
Meteo data Postprocess
153.4 103.6 83.9 62.4 90.2 92.2
2599.5 1335.2 718.9 414.9 257.2 201.2
Output Preprocess 11.0 4.8 2.5 1.2 0.7 0.4
Stencil Comp.: from Academy to Industry
Total Time
Speed-up
I/O 3.9 2.8 2.9 0.2 0.3 9.5
8654.5 4350.0 2318.5 1304.9 1037.8 788.0
1.0 1.9 3.7 6.6 8.3 10.9
PP14
19 / 21
Summary & Conclusions Optimize the stencil computation is important, BUT Not all optimization techniques in academia give a leap Space-tiling is only useful for certain dataset cases and caches Consider underlying hardware when DD (SMT & shared caches) Design a proper framework for Industrial codes Include Auto-tuning techniques to your framework (Model + Search parameter) Keep a simple domain decomposition to reduce memory copies Independent threads for independent tasks Overlap Communication - Computation - I/O tasks Include a production chain for embarrasingly parallel cases
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
20 / 21
Questions
Thank you for your attention!
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
21 / 21
Modelling & Hardware Limitations AMD Opteron 8
(s tre a
m )
+mul/add balanced
yb
an
GFlops/s
dw
idt
h
4
em
or
actual 25-point peak
m
actual 13,43-point peak
1/2
h ba m ea str n-
43
no
13
+Capacity misses both
+Conflict misses semi
7
25
nd
wi
dt 1/4
+Conflict classical
7
1 1/8
mul/add imbalanced
43
13
Compulsory misses semi
2
1
Compulsory misses classical
pe
ak
25
2
actual 7-point peak
4
Operational Intensity (Flops/Byte)
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
21 / 21
Permute or not permute, that’s the question
Performance may vary depending on the first two axis sizes (Z and X ) Is it 256x2048x64 traversing computation ⇔ 256x64x2048? Permute input and output buffers Does the permutation pay off the cost and benefit?
Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
21 / 21
Auto-tuning blocking parameters Modeling + Fine Grain Search (re nement) • Helper function in WARIS framework to search the best TX parameter.
Seed
time
Prediction model (initial seed for search)
Renement
TX parameter Raúl de la Cruz (CASE - BSC)
Stencil Comp.: from Academy to Industry
PP14
21 / 21