Stencil Computations: from Academy to Industry

Stencil Computations: from Academy to Industry Raúl de la Cruz, Mauricio Hanzich and Jose María Cela {raul.delacruz,mauricio.hanzich,josem.cela}@bsc.es Barcelona Supercomputing Center, Barcelona (Spain)

Parallel Processing Conference Optimizing Stencil-based Algorithms Portland, Oregon, USA, February 18-21, 2014

Raúl de la Cruz (CASE - BSC)

Stencil Comp.: from Academy to Industry

PP14

1 / 21

FLOPS are cheap - MEMORY access is costly

FLOPS/DRAM Byte Arithmetic Intensity O(1) O(log n) O(n,n2,n3) AI = [0.33, 3.5]

Irregular codes (Stencils & SpMV)

FFT

WARIS Raúl de la Cruz (CASE - BSC)

AI = [1, ~1.6]

AI = [0.25, ~2.0]

Computation bound

Memory bound

Optimization is a MUST

Dense algebra (BLAS-like DGEMM)

ATLAS ESSL


PP14

2 / 21

Stencil in a Nutshell Finite Difference Method 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13:

Domain decomposition of mesh for time = 0 to timeend do Read Input Pre-processing Inject source Apply boundary conditions for all points in my domain do Stencil computation (X t ) end for Exchange overlapped points Post-processing Write Output end for

Stencil Computation for k = ℓ to Y − ℓ do for j = ℓ to X − ℓ do for i = ℓ to Z − ℓ do t−1 t = C0 ∗ Xi,j,k Xi,j,k t−1 t−1 t−1 t−1 +CZ 1 ∗ (Xi−1,j,k + Xi+1,j,k ) + .. + CZ ℓ ∗ (Xi−ℓ,j,k + Xi+ℓ,j,k ) t−1 t−1 t−1 t−1 +CX 1 ∗ (Xi,j−1,k + Xi,j+1,k ) + .. + CX ℓ ∗ (Xi,j−ℓ,k + Xi,j+ℓ,k ) t−1 t−1 t−1 t−1 +CY 1 ∗ (Xi,j,k + Xi,j,k ) + .. + CY ℓ ∗ (Xi,j,k + Xi,j,k ) −1 +1 −ℓ +ℓ

end for end for end for

Load balancing Kernel computation Intra/inter-node communication



PP14

3 / 21

State of the Art - The Big Picture



PP14

4 / 21

re ct io

n)

State of the Art - The most popular av er sin g

di

a)

Backward Update

F/B

K I (unit-stride)

b)

(tr

Forward Update

F

IxTJxK block

Rivera t Const. Boundary

time

21-

1st

0-

0

2nd

3rd

4th

cache block

1

2

3

4

5

6 7 space

8

9

Time-skewing Raúl de la Cruz (CASE - BSC)

Const. Boundary

Semi-stencil

10 11 12 13

x

Cache-oblivious Stencil Comp.: from Academy to Industry

PP14

5 / 21

Raúl de la Cruz (CASE - BSC) Stencil Comp.: from Academy to Industry 64x256x512

PP14

512x512x512

256x512x512

128x512x512

64x512x512

32x512x512

16x512x512

512x256x512

256x256x512

128x256x512

Time-skewing Time-skewing+Semi

32x256x512

16x256x512

512x128x512

256x128x512

128x128x512

64x128x512

32x128x512

16x128x512

Rivera Rivera+Semi

512x64x512

256x64x512

128x64x512

64x64x512

32x64x512

16x64x512

512x32x512

Naive Naive+Semi

256x32x512

128x32x512

64x32x512

32x32x512

16x32x512

512x16x512

256x16x512

128x16x512

20

64x16x512

32x16x512

16x16x512

Time (in seconds)

State of the Art - Crunching numbers AMD Opteron performance results for 5123 volume (timesteps=2, length=4) Cache oblivious Cache oblivious+Semi

18

16

14

12

10

8

6

4

6 / 21

The reality



PP14

7 / 21

The black boxes Many stages are involved in an Industrial FD scheme code Stencil computation is the core of PDE+FD based simulations (up to 80% of execution time) However, the optimization of the remaining stages are also crucial in order to boost simulations Read Input

Preprocess

Source Injection

Input/Output Boundary Counditions

Neighbor Send/Recv


Computational

Stencil Operator

Postprocess

Domain Decomposition Write Output


PP14

8 / 21

Ash Transport-Deposition model Advection-Diffusion-Sedimentation eq.: Concentration (C t ,C t+1 ) Wind velocity (Ux ,Uy ,Uz ) Diffusion (Kx ,Ky ,Kz ) Source (S∗ ) Air density (ρ∗ ) ∂ ∂ ∂ + (Ux C) + (Uy C) + [(Uz − Vsj )C] = ∂t ∂X ∂Y ∂Z ∂ ∂C/ρ∗ ∂C/ρ∗ ∂ ρ∗ Kx + ρ∗ Ky + ∂X ∂X ∂Y ∂Y ∂ ∂C/ρ∗ + S∗ ρ∗ Kz ∂Z ∂Z

∂C

Read Input

Meteorological data • U, V, W, rho, T • p, u*, zi, L • Interpolate data

Boundary Counditions

Neighbor Send/Recv

Preprocess meteorological data: • Compute diffusion terms (Kx, Ky, Kz) • Compute Vsettling and Vdrydeposition • Scale terms, from LON-LAT to Cartesian • Scale precipitation rates (mm s-1 to mm h-1) • Add radial wind • Compute div(u) (ensure mass-conservation law) • Precompute fields: Vsj=Uz-Vsj, [Kx,Ky,Kz]*rho

Preprocess

Boundary Conditions (P1 and P2 stages): • Zero derivative for outgoing flux • Null concentration for incoming flux

Communicate P1: • MPI_Send/Recv (extra-domains) • Shared memory (intra-domains)


Postprocess

Stencil Operator

Postprocess output: • Compute ground accumulation (cdry and cwet) • Compute mass lost at boundaries, mass at volume and deposited • Add divergence correction (mass-conservation law)

Source Injection

Source injection: • Ash particles ejected by volcano

Stencil computation (P1 and P2 stages): • Advection term: Lax-wendorff scheme 2n-order (space/time), with slope limiter (minmod) • Diffusion: central difference scheme

Write Output


Write output: • Scale factor, LON-LAT to Cartesian • Write wet/dry deposit load (kg/cm2), flight level (flm), thickness deposit (cm), PM05/10/20 z-cummulative concentration (gr/cm2), 3D-volumes

PP14

9 / 21

Full Waveform Inversion Invert input signal to match a reference by modifiying the input model Iterative system running through different frequencies FWI EAC Production Chain: Master Side:

Worker Side:

More Frequency Iterations? No

Manage SSDB

Yes

Using Super Shots? Yes

More Gradient Iterations? No

EAC Production Chain: Master Side:

Worker Side:

Coding enabled? No

Seismic Output?

EAC Production Chain: Master Side:

Need Data Preprocessing?

Worker Side:

More Shots in Super Shot? No

No

No

Yes

Seismic Output? No Yes

Misﬁt History?

More Shots in Super Shot?

No

No Yes Yes

Topography? No

Seismic Output? No

Fatal Errors? No

Yes

Yes

Need Another Test Run? No

Topography? No

Fatal Errors? No

Seismic Output?

Yes No

Misﬁt OK?

Yes

No



PP14

10 / 21

Full Waveform Inversion - Example



PP14

11 / 21

Exaflop challenges Full Waveform Inversion Stencil: 8th order in space, 2nd order in time, staggered grids and many params Resources: 5000 shots x 5 frequencies x 20 gradient iters x 5 stencil kernels x 5 hours/stencil (using 10 MN3 nodes) = 12500000 hours (200 million core/hours) Real synthetic case: 1601x1601x601 points → ≈ 750Gb Memory per shot Dataset: Material properties Computational parameters Q (attenuation) Total volumes

Acoustic 1 (Vp field) 3 (p1,p2,p3 fields) 4

Elastic (Velocity) 1 density (rho) 12 + 24 = 36 37

Elastic (Stress) 21 elastic coef. idem 24 x 3 = 72 129

Ensemble forecasting for Ash Transport models Combine ADD models along with forecast models online (two-way feedback) Numerical prediction method generating probabilistic future states of a dynamic system Multiple simulations are conducted with slightly different initial conditions



PP14

12 / 21

Divide and Conquer Alleviate footprint and register pressure by splitting internal loops (fission) By doing this, conflict (related with cache associativity) and capacity misses (related with cache size) are reduced Loop fission can be performed in Z cols, Z -X planes or in whole Z -X -Y doms Prefetchers can evict data which might be reused in next iterations ! Compute main stencil do k = 1, NY do j = 1, NX ! Compute Advection term (no slope-lim) ! Single loop performing everything do i = 1, NZ c(i,j,k,0) = c(i,j,k,1) (dtddz * (u(i+1,j,k) * c(i+1,j,k) u(i-1,j,k) * c(i-1,j,k)) dtddx * (v(i,j+1,k) * c(i,j+1,k) v(i,j-1,k) * c(i,j+1,k)) dtddy * (w(i,j,k+1) * c(i,j,k+1) w(i,j,k-1) * c(i,j,k-1)) end do end do end do


! Compute main stencil do k = 1, NY do j = 1, NX do i = 1, NZ ! First block c(i,j,k,0) = c(i,j,k,1) (dtddz * (u(i+1,j,k) * c(i+1,j,k) u(i-1,j,k) * c(i-1,j,k))) end do do i = 1, NZ ! Second block c(i,j,k,0) = c(i,j,k,0) (dtddx * (v(i,j+1,k) * c(i,j+1,k) v(i,j-1,k) * c(i,j+1,k)) dtddy * (w(i,j,k+1) * c(i,j,k+1) w(i,j,k-1) * c(i,j,k-1))) end do end do end do


PP14

13 / 21

Traversing algorithms effectiveness Time-blocking algorithms pose many issues in Industrial codes (execution flow very complex between time-steps)

Z

The first two dimensions (Z and X ) are what really matters

3 2

Space-blocking (a.k.a. Rivera) shows partial effectiveness and clearly depends on Cache and Z -X dimensions per thread

1

X

Ash Dispersion model: MareNostrum 3. 32KB L1 D-Cache, 256KB L2 Cache and 20MB L3 Cache Domain size

256x256x64

256x512x64

256x1024x64

256x2048x64

No blocking Best blocking Speedup

0.077 0.075 (256x128) 1.02x

0.158 0.151 (256x128) 1.05x

0.428 0.334 (256x128) 1.27x

0.873412 0.66 (256x64) 1.32x

5 4.5

Blocking parameter

Time (in seconds)

4 3.5 3 2.5 2 1.5 1 0.66

0.87

16 x 32 16x x 6 64 16x 4 12 x16 64 8 x 25 x16 64 6x x6 16 16x 4 x 6 32 32x 4 x 6 64 32x 4 12 x32 64 8 x 25 x32 64 6x x6 16 32x 4 x 6 32 64x 4 x 6 64 64x 4 12 x64 64 8 x 25 x64 64 6 x 16 x64 64 x x 32 128 64 x x 64 128 64 12 x12 x6 8x 8x 4 25 12 6 6x 8x 4 16 128 64 x x 32 256 64 x x 64 256 64 12 x25 x6 8 6 4 25 x25 x64 6x 6x 16 256 64 x x 32 512 64 x x 64 512 64 12 x51 x6 8x 2x 4 25 51 6 6 2 4 16 x51 x64 x 2 32 102 x6 x 4 4 64 102 x6 4 12 x10 4x6 8 2 4 25 x10 4x6 6x 24 4 16 102 x6 x 4 4 32 204 x6 x 8 4 64 204 x64 12 x20 8x 8x 48 64 25 20 x 6x 48 64 20 x6 48 4 x6 4

0.5



PP14

14 / 21

Thread & Domain Affinity - SMT Approach CORE 1 SMT Up to 4 thrds

D-Cache reuse may be improved by a wise decomposition

L1D 32KB

L1D 32KB

Reduce conflict & capacity misses due to smaller footprint

L2 512KB

L2 512KB

Thre Thre ad 3 Thre ad 2 Thre ad 3 ad 2

E

0

(T

h

rd s

01)

(T CO h R rd E s 1 23)

SMT Up to 4 thrds

R O

Threads 2-3

CORE 0

SMT enables sharing cache resources among threads

Y C

Threads 0-1

Thre Thre ad 1 Thre ad 0 Thre ad 1 ad 0

Z

# pragma omp parallel firstprivate(iniy,endy) { int tid = omp_get_thread_num()%THRCORE; int dom = omp_get_max_threads()/THRCORE; int nby = (endy - iniy) / dom; int rnb = (endy - iniy) % dom; if (rnb > dom) iniy = iniy + (++nby)*dom; else iniy = iniy + nby*dom + rnb; endy = MIN(iniy+nby, endy); /* Update maingrid */ for (k= iniy+tid; k< endy; k+= THRCORE) { for (j= inix; j< endx; j++) { for (i= iniz; i< endz; i++) {

X

Ash Dispersion model: Intel MIC, 4 cores (2 threads each). Domain 256x64x1024 Metric No affinity With affinity Reduction

L1 Misses

L2 Demand + L2 HWP Misses

TLB Misses

Time (in sec)

9504992 8201066 14%

6245341 (3492094 + 2753247) 5294566 (361439 + 4933127) 16%

4411206 346578 92%

4.88 3.96 19%



PP14

15 / 21

WARIS Multi-purpose framework to solve FD problems Developed in a parallel and efficient way for ES and CFD problems Structured meshes (conforming and non-conforming) Modular to ease development cycles, portability and reusability Supports Explicit, Implicit and Semi-implicit methods

Extra-domain MPI process 0 (Running in CN0)

Main

High level of parallelism

Domain

Intra-domain

MPI process 1 (Running in CN1)

Domain 0-1

Main

Domain

Comm.

I/O

Intra-domain

Asynchronous I/O Computation & I/O overlapping

Intra-domain

Domain 0-0

Extra-domain

Domain 1-0

Domain

Comm.

I/O

Comm.

I/O Device

Intra-domain Domain 1-1

Domain

I/O

Comm.

Comm.

I/O

Comm.

I/O Device

I/O Device

I/O Device



PP14

16 / 21

WARIS - PSK Framework PSK Framework Con g

Initialize

Finalize

Scatter

Gather

Generalized framework providing an execution flow User must create problem specialization routines

More Steps?

Red blocks are provided by the PSK Framework Pre-Processing (phase 1)

Main Processing (phase 2)

Green blocks are provided by the user problem specialization P1 routines compute data required for domain decomposition exchange

Post-Processing (phase 3)

Comm Comm H-D H-H I/O H-D

I/O H-H

P2 routines perform computation over the remaining domain

Wait Comm



PP14

17 / 21

Embarrassingly parallel executions Process several physical problems in parallel Statistical study, optimal parameter search or production chain

Status Control Client

Resilience Fault tolerance Postprocessing Checkpointing

Master

Command Workers

Worker0 Pre/post Process Kernel

Local FS Worker1 Pre/post Process

Status Control Server

Global FS

Kernel

Preprocess Job selection Checkpointing Workern Postprocess Local FS



PP14

18 / 21

Example of success: WARIS-Transport vs FALL3D WARIS-Transport explicit kernel optimizations SIMDization (SSE, AVX and MIC) Auto-tune of blocking parameter Pipeline optimizations (loop fission) Parallel I/O (active buffering strategy & two-phase collective)

Eyjafjallajökull validation testcase Input dataset: 9 GBytes of meteorological data Output dataset: 200 MBytes of simulated concentration data FALL3D requires 1h and 58 minutes with 16 processors WARIS-Transport took 17 minutes to process it (Speedup 6.8×) Eyjafjallajökull case. Domain 41x241x141. Test conducted in MareNostrum (Intel SandyBridge-EP E5-2670) Number Proc. 1 2 4 8 16 24

Explicit Kernel P1 stage P2 stage 0.0 54.6 55.1 57.5 67.4 68.9


4028.1 1972.9 971.1 448.3 296.8 121.61

I/O

Meteo data Postprocess

153.4 103.6 83.9 62.4 90.2 92.2

2599.5 1335.2 718.9 414.9 257.2 201.2

Output Preprocess 11.0 4.8 2.5 1.2 0.7 0.4


Total Time

Speed-up

I/O 3.9 2.8 2.9 0.2 0.3 9.5

8654.5 4350.0 2318.5 1304.9 1037.8 788.0

1.0 1.9 3.7 6.6 8.3 10.9

PP14

19 / 21

Summary & Conclusions Optimize the stencil computation is important, BUT Not all optimization techniques in academia give a leap Space-tiling is only useful for certain dataset cases and caches Consider underlying hardware when DD (SMT & shared caches) Design a proper framework for Industrial codes Include Auto-tuning techniques to your framework (Model + Search parameter) Keep a simple domain decomposition to reduce memory copies Independent threads for independent tasks Overlap Communication - Computation - I/O tasks Include a production chain for embarrasingly parallel cases



PP14

20 / 21

Questions

Thank you for your attention!



PP14

21 / 21

Modelling & Hardware Limitations AMD Opteron 8

(s tre a

m )

+mul/add balanced

yb

an

GFlops/s

dw

idt

h

4

em

or

actual 25-point peak

m

actual 13,43-point peak

1/2

h ba m ea str n-

43

no

13

+Capacity misses both

+Conflict misses semi

7

25

nd

wi

dt 1/4

+Conflict classical

7

1 1/8

mul/add imbalanced

43

13

Compulsory misses semi

2

1

Compulsory misses classical

pe

ak

25

2

actual 7-point peak

4

Operational Intensity (Flops/Byte)



PP14

21 / 21

Permute or not permute, that’s the question

Performance may vary depending on the first two axis sizes (Z and X ) Is it 256x2048x64 traversing computation ⇔ 256x64x2048? Permute input and output buffers Does the permutation pay off the cost and benefit?



PP14

21 / 21

Auto-tuning blocking parameters Modeling + Fine Grain Search (re nement) • Helper function in WARIS framework to search the best TX parameter.

Seed

time

Prediction model (initial seed for search)

Renement

TX parameter Raúl de la Cruz (CASE - BSC)


PP14

21 / 21