High-Order Discontinuous Galerkin Methods by GPU Metaprogramming
Andreas Klöckner
Courant Institute of Mathematical Sciences, New York University
February 28, 2011
Thanks
Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
Hendrik Riedmann (Stuttgart/Brown)
PyOpenCL, PyCUDA contributors
Nvidia Corporation
Outline
1 Introduction
2 GPU-DG: Challenges and Solutions
3 Results
Discontinuous Galerkin Methods
Discontinuous Galerkin Method
Let $\Omega := \bigcup_i D_k \subset \mathbb{R}^d$.

Goal: Solve a conservation law on $\Omega$:
$$u_t + \nabla \cdot F(u) = 0$$

Example: Maxwell's equations. EM field $E(x,t)$, $H(x,t)$ on $\Omega$, governed by
$$\partial_t E - \frac{1}{\varepsilon} \nabla \times H = -\frac{j}{\varepsilon}, \qquad
\partial_t H + \frac{1}{\mu} \nabla \times E = 0,$$
$$\nabla \cdot E = \frac{\rho}{\varepsilon}, \qquad \nabla \cdot H = 0.$$
Discontinuous Galerkin Method
Multiply by a test function and integrate by parts:
$$0 = \int_{D_k} u_t \varphi + [\nabla \cdot F(u)]\,\varphi \, dx
   = \int_{D_k} u_t \varphi - F(u) \cdot \nabla \varphi \, dx
     + \int_{\partial D_k} (\hat{n} \cdot F)^* \varphi \, dS_x.$$
Substitute in basis functions; introduce elementwise stiffness, mass, and surface mass matrices $S$, $M$, $M^A$:
$$\partial_t u^k = - \sum_\nu D^{\partial_\nu,k} [F(u^k)]
  + L^k \left[ \hat{n} \cdot F - (\hat{n} \cdot F)^* \right]\big|_{A \subset \partial D_k}.$$
For straight-sided simplicial elements: reduce $D^{\partial_\nu}$ and $L$ to reference matrices.
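To make the reference-matrix reduction concrete, here is a minimal numpy sketch (an illustration, not hedge's implementation; array names, shapes, and the random data are assumptions): local differentiation applies the same reference derivative matrices to every element and combines the results with per-element geometric factors via the chain rule.

    import numpy as np

    # Illustrative sizes: K elements, Np nodes per element, d = 3 dimensions.
    # Np = 35 corresponds to order-4 tetrahedra.
    K, Np, d = 1000, 35, 3
    rng = np.random.default_rng(1)

    D_ref = rng.standard_normal((d, Np, Np))  # reference matrices D^r, D^s, D^t
    u = rng.standard_normal((K, Np))          # nodal field values, one row per element
    drdx = rng.standard_normal((K, d, d))     # per-element geometric factors dr_i/dx_j

    # Reference-space derivatives: every element's nodal vector is hit by
    # the same dense Np x Np matrices, i.e. the element-local, dense part of DG.
    du_ref = np.einsum("inm,km->kin", D_ref, u)       # shape (K, d, Np)

    # Chain rule: du/dx_j = sum_i (dr_i/dx_j) du/dr_i, elementwise.
    du_xyz = np.einsum("kij,kin->kjn", drdx, du_ref)  # shape (K, d, Np)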
Decomposition of a DG operator into Subtasks
DG's execution decomposes into two (mostly) separate branches:
$u^k \to$ Flux Gather $\to$ Flux Lifting $\to \partial_t u^k$
$u^k \to F(u^k) \to$ Local Differentiation $\to \partial_t u^k$
Element-local parts of the DG operator: flux lifting and local differentiation (shown in green in the original diagram).
Outline
1 Introduction
2 GPU-DG: Challenges and Solutions
   Introduction
   Challenges
   A Different Way of Writing GPU Codes
3 Results
DG on GPUs: Possible Advantages
DG on GPUs: Why?
GPUs have a deep memory hierarchy; the majority of DG is local.
Compute bandwidth greatly exceeds memory bandwidth; DG is arithmetically intense.
GPUs favor dense data; the local parts of the DG operator are dense.
GPUs don't like scattered access; DG's cell connectivity is sparser than CG's, and more regular.
DG on the GPU: What are we trying to achieve?
Objectives:
Main: Speed. Reduce the need for compute-bound clusters.
Secondary: Generality. Be applicable to many problems.
Tertiary: Ease of use. Hide the complexity of the GPU hardware.
Setting (for now):
Specialize to straight-sided simplices.
Optimize for (but don't specialize to) tetrahedra (i.e. 3D).
Optimize for "medium" orders (3...5).
Element-Local Operations: Differentiation
[Figure: K x Np field data multiplied by local templated derivative matrices (Np x Np), then combined with per-element geometric factors.]
Element-Local Operations: Lifting
[Figure: the local templated lifting matrix (Np x Nf Nfp) applied to facial field data (K x Nf Nfp), scaled by (inverse) Jacobians. Which of these operands should go into on-chip storage?]
Best use for on-chip memory?
Basic Problem
On-chip storage is scarce... and will be for the foreseeable future.
Possible uses: the matrix (or matrices), part of a matrix, the field data, or both.
How to decide? Does it matter?
Work Partition for Element-Local Operators
Natural Work Decomposition: One Element per Block
+ Straightforward to implement
+ No granularity penalty
- Cannot fill wide SIMD: unused compute power for small to medium elements
- Data alignment: padding wastes memory
- Cannot amortize the cost of preparation steps (e.g. fetching)
Loop Slicing for element-local parts of GPU DG
Per block: K_L element-local matrix multiplications + matrix load (preparation).
Question: How should one assign work to threads? Three slicing parameters, illustrated by the sketch below:
w_s: in sequence (amortize preparation)
w_i: "inline-parallel" (exploit register space)
w_p: in parallel (across threads)
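A minimal, hypothetical sketch of what loop slicing means for generated code (this is not hedge's generator; the kernel fragment and names are invented for illustration): the slicing parameters become compile-time constants of the emitted source, so each candidate (w_s, w_i) pair yields a different kernel to benchmark.

    # Sketch: bake loop-slicing parameters into generated C-like kernel text.
    def slice_loops(w_s, w_i):
        # w_i copies of the update live "inline" in one thread's registers.
        inline = "\n".join(
            f"    acc{i} += mat_entry * field[elt_base + {i}];  /* w_i unroll */"
            for i in range(w_i))
        # w_s elements are processed sequentially, amortizing the matrix fetch.
        return (f"for (int s = 0; s < {w_s}; ++s)  /* w_s: sequential */\n"
                f"{{\n{inline}\n}}")

    print(slice_loops(w_s=3, w_i=2))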
Evaluating the Surface Flux
Idea: there is potential for data reuse when computing the flux on a shared face pair $\Gamma_a \subset \partial D_a$, $\Gamma_b \subset \partial D_b$.
Problem: for optimal bandwidth use, one should store an entire element's worth of fluxes.
→ Must break some face pairs. Partition the mesh to obtain three types of faces:
Within one partition
Straddling partitions (no reuse)
Boundary
Evaluate: walk one partition's list of faces in parallel within a work group.
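A small sketch of the face classification this implies (the data layout and names here are assumptions for illustration, not hedge's internals):

    import numpy as np

    # Hypothetical inputs: interior faces as (element, neighbor) pairs and a
    # mesh partition number for each element.
    part_of_el = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
    face_pairs = [(0, 1), (3, 4), (5, 6), (7, 8), (9, 10)]
    boundary_faces = [(2, None)]  # faces with no neighbor element

    intra, straddling = [], []
    for a, b in face_pairs:
        # Reuse is only possible if both elements' surface data live in the
        # same work group, i.e. the same mesh partition.
        (intra if part_of_el[a] == part_of_el[b] else straddling).append((a, b))

    print("within one partition:", intra)
    print("straddling partitions (no reuse):", straddling)
    print("boundary:", boundary_faces)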
Work Partition for Surface Flux Evaluation
Granularity Tradeoff
Large blocks:
+ more data reuse
- less parallelism
- less latency hiding
Block size is limited by two factors: output buffer size and face metadata size.
Optimal block size: not obvious.
More than one Granularity
Different block sizes introduced so far: differentiation, lifting, surface fluxes; plus padding.
[Figure: elements of Np values each, packed into a linear array with padding at granularities 0, 64, 128, ..., K_M Np.]

Idea
Introduce another, smaller block size that satisfies SIMD width and alignment constraints (a "microblock"), and demand that the other block sizes be multiples of this new size.
How big? Not obvious.
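As a concrete (hypothetical) illustration of microblocking: pack as many whole elements as fit into one alignment granule, then pad. The 64-float granule below is an assumed value for illustration, not a fixed constant of the method.

    # Sketch: microblock layout for element size Np, assuming a 64-float
    # alignment granularity.
    GRANULARITY = 64

    def microblock_layout(n_p):
        els_per_mb = GRANULARITY // n_p            # whole elements per microblock
        padding = GRANULARITY - els_per_mb * n_p   # wasted floats per microblock
        return els_per_mb, padding

    for n_p in [10, 20, 35, 56]:  # tetrahedral Np for orders 2..5
        els, pad = microblock_layout(n_p)
        print(f"Np={n_p}: {els} elements/microblock, {pad} floats of padding")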
DG on GPUs: Implementation Choices
Many difficult questions. Insufficient heuristics. The answers are hardware-specific and have no lasting value.
Proposed Solution: Tune automatically for the hardware at computation time; cache the tuning results.
Decrease reliance on knowledge of hardware internals.
Shift emphasis from tuning results to tuning ideas.
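A minimal sketch of such a tune-and-cache loop, under stated assumptions: build_kernel and time_kernel are hypothetical helpers standing in for "compile this variant at run time and time it on the actual device", and the cache file layout is likewise invented for illustration.

    import itertools, json, os

    CACHE_FILE = "tuning_cache.json"

    def autotune(candidates, build_kernel, time_kernel, key):
        # Reuse an earlier tuning result for this (problem, hardware) key.
        cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}
        if key in cache:
            return cache[key]
        # Otherwise: run every candidate on the actual hardware, keep the fastest.
        best = min(candidates, key=lambda p: time_kernel(build_kernel(**p)))
        cache[key] = best
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
        return best

    # Example search space: the loop-slicing parameters from earlier slides.
    candidates = [dict(w_s=ws, w_i=wi, w_p=wp)
                  for ws, wi, wp in itertools.product([1, 2, 4], repeat=3)]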
Metaprogramming Idea
In GPU scripting, GPU code does not need to be a compile-time constant.
(Key: code is data. It wants to be reasoned about at run time.)
[Figure: a feedback loop. The human supplies ideas; Python code, which is good for code generation, emits GPU code; the GPU compiler produces a GPU binary; the GPU computes a result, which feeds back into the Python code. PyCUDA and PyOpenCL close the loop; the machine does the rest.]
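For instance, here is a minimal PyCUDA round trip through that loop (a standard usage pattern of PyCUDA's documented API; the kernel itself is a toy example, and the baked-in block size stands in for the tuning parameters discussed earlier):

    import numpy as np
    import pycuda.autoinit              # creates a CUDA context
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    block_size = 128  # an ordinary Python value, baked into the source below

    mod = SourceModule("""
    __global__ void double_it(float *a)
    {
        int i = threadIdx.x + blockIdx.x * %(block_size)d;
        a[i] *= 2;
    }
    """ % {"block_size": block_size})

    a = np.random.randn(4 * block_size).astype(np.float32)
    a_doubled = a.copy()
    mod.get_function("double_it")(
        drv.InOut(a_doubled), block=(block_size, 1, 1), grid=(4, 1))
    assert np.allclose(a_doubled, 2 * a)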
Machine-generated Code
Why machine-generate code?
Automated tuning (cf. ATLAS, FFTW)
Data types
Specialize code for the given problem
Constants are faster than variables (→ register pressure)
Loop unrolling
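The last two points are easy to demonstrate with plain string generation (a toy sketch; the kernel and names are invented for illustration): the generator bakes the scale factor in as a literal and unrolls the inner loop completely.

    # Sketch: emit CUDA-C with a baked-in constant and a fully unrolled loop.
    def unrolled_scale_source(n, alpha):
        body = "\n    ".join(
            f"x[base + {i}] *= {alpha}f;" for i in range(n))
        return f"""__global__ void scale(float *x)
    {{
        int base = {n} * (threadIdx.x + blockDim.x * blockIdx.x);
        {body}
    }}"""

    print(unrolled_scale_source(n=4, alpha=0.5))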
Why do Scripting for GPUs?
GPUs are everything that scripting languages are not:
highly parallel, very architecture-sensitive, built for maximum FP/memory throughput.
→ The two complement each other. The CPU is largely restricted to control tasks (~1000/sec), and scripting is fast enough for those.
Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
PyOpenCL, PyCUDA: Vital Information
http://mathema.tician.de/software/pyopencl (or /pycuda)
Complete documentation
X Consortium License (no warranty, free for all use)
Convenient abstractions: arrays, elementwise operations, reductions; soon: scan
Requires: numpy, Python 2.4+ (Win/OS X/Linux)
Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
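A taste of the array abstraction, using standard PyOpenCL usage:

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a = cl_array.to_device(queue, np.random.randn(10**6).astype(np.float32))
    b = cl_array.to_device(queue, np.random.randn(10**6).astype(np.float32))

    # Arithmetic on array objects generates and launches kernels on the fly.
    c = (2 * a + b).get()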
Outline
1 Introduction
2 GPU-DG: Challenges and Solutions
3 Results
   GPU-DG: Performance and Generality
   Conclusions
Loop Slicing for Differentiation
[Plot: execution time (ms) vs. w_p (15-30) for local differentiation, matrix-in-shared, order 4, with microblocking; point size denotes w_i ∈ {1, 2, 4}, and w_s varies on a secondary scale.]
Nvidia GTX280 vs. a single core of an Intel Core 2 Duo E8400
[Plot: GFlops/s (0-300) vs. polynomial order N (0-10) for GPU and CPU; the GPU curve lies far above the CPU curve.]
Memory Bandwidth on a GTX 280
[Plot: global memory bandwidth (GB/s, 20-200) vs. polynomial order N (1-9) for the gather, lift, diff, and assembly kernels, compared against the peak bandwidth.]
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs
[Plot: flop rates (GFlops/s, up to 4000) vs. polynomial order N (0-10) for 16 GPUs vs. 64 CPU cores.]
Hedge DG Solver
High-level operator description: Maxwell's, Euler, Poisson, compressible Navier-Stokes, ...
One code runs...
on CPU, CUDA
on {CPU, CUDA} + MPI
in 1D, 2D, 3D
at any order
Generates C++ or CUDA code at run time. Open source (GPL3). Written in Python; uses PyCUDA.
GPU DG Showcase
Electromagnetism
Poisson
CFD
Where to from here?
PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/

GPU-DG article: AK, T. Warburton, J. Bridge, J.S. Hesthaven, "Nodal Discontinuous Galerkin Methods on Graphics Processors", J. Comput. Phys. 228 (21), 7863-7882. Also: an introduction in GPU Computing Gems, Vol. 2.

GPU RTCG: AK, N. Pinto et al., "PyCUDA: GPU Run-Time Code Generation for High-Performance Computing", submitted.
Conclusions
GPU-DG is significantly faster than CPU-DG: the method is well-suited a priori, and numerous tricks enable good performance.
GPUs and scripting work surprisingly well together: they enable run-time code generation.
Ongoing work in GPU-DG: curvilinear elements (T. Warburton), local time stepping, shock detection for nonlinear equations.
Shameless plugs: a peek at hedge's guts, tomorrow 6pm; shock detection for GPU-DG, Thursday 1pm.
Questions?
Thank you for your attention!
http://www.cims.nyu.edu/~kloeckner/
RTCG via Templates

from mako.template import Template

tpl = Template("""
    __kernel void add(
            __global ${type_name} *tgt,
            __global const ${type_name} *op1,
            __global const ${type_name} *op2)
    {
        int idx = get_local_id(0)
            + ${local_size} * ${thread_strides}
            * get_group_id(0);

        % for i in range(thread_strides):
            <% offset = i*local_size %>
            tgt[idx + ${offset}] =
                op1[idx + ${offset}]
                + op2[idx + ${offset}];
        % endfor
    }""")

rendered_tpl = tpl.render(type_name="float",
    local_size=local_size, thread_strides=thread_strides)
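The rendered source is then just a string; with PyOpenCL it can be compiled in the usual way, e.g. prg = cl.Program(ctx, rendered_tpl).build(), the same pattern that closes the AST-based example on the next slide.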
RTCG via AST Generation

from codepy.cgen import *
from codepy.cgen.opencl import \
    CLKernel, CLGlobal, CLRequiredWorkGroupSize

mod = Module([
    FunctionBody(
        CLKernel(CLRequiredWorkGroupSize((local_size,),
            FunctionDeclaration(Value("void", "twice"),
                arg_decls=[CLGlobal(Pointer(Const(POD(dtype, "tgt"))))]))),
        Block([
            Initializer(POD(numpy.int32, "idx"),
                "get_local_id(0) + %d * get_group_id(0)"
                % (local_size*thread_strides))
            ] + [
            Statement("tgt[idx+%d] *= 2" % (o*local_size))
            for o in range(thread_strides)]))])

knl = cl.Program(ctx, str(mod)).build().twice
GPU-DG in Double Precision
[Plot: GFlops/s (0-400) vs. polynomial order N (0-10) in single and double precision, with the double/single ratio (0-1) on a secondary axis.]
Outline
Automatic GPU Programming
Automating GPU Programming
GPU programming can be time-consuming, unintuitive, and error-prone.
Obvious idea: let the computer do it.
One way: smart compilers.
GPU programming requires complex tradeoffs; tradeoffs require heuristics; heuristics are fragile.
Another way: dumb enumeration.
Enumerate loop slicings. Enumerate prefetch options. Choose by running the resulting code on the actual hardware.
Loo.py Example
Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500

knl = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
    ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
    ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
    }

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status
Limited scope:
Requires input/output separation.
Kernels must be expressible using the "loopy" model (i.e., indices decompose into "output" and "reduction").
Enough for DG, LA, FD, ...
Kernel compilation limits the trial rate.
Non-goal: peak performance.
Good results currently for dense linear algebra and (some) DG subkernels.