High-Order Discontinuous Galerkin Methods by GPU Metaprogramming


Andreas Klöckner
Courant Institute of Mathematical Sciences, New York University

February 28, 2011


Thanks

Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
Hendrik Riedmann (Stuttgart/Brown)
PyOpenCL, PyCUDA contributors
Nvidia Corporation


Outline

1 Introduction
2 GPU-DG: Challenges and Solutions
3 Results


Outline

1 Introduction
  Discontinuous Galerkin Methods
2 GPU-DG: Challenges and Solutions
3 Results


Discontinuous Galerkin Method

Let $\Omega := \bigcup_k D_k \subset \mathbb{R}^d$.

Goal: Solve a conservation law on $\Omega$:
$$u_t + \nabla \cdot F(u) = 0$$

Example: Maxwell's equations. EM field $E(x,t)$, $H(x,t)$ on $\Omega$, governed by
$$\partial_t E - \frac{1}{\varepsilon} \nabla \times H = -\frac{j}{\varepsilon}, \qquad
\partial_t H + \frac{1}{\mu} \nabla \times E = 0,$$
$$\nabla \cdot E = \frac{\rho}{\varepsilon}, \qquad \nabla \cdot H = 0.$$


Discontinuous Galerkin Method

Multiply by a test function and integrate by parts:
$$0 = \int_{D_k} u_t \varphi + [\nabla \cdot F(u)]\,\varphi \, dx
   = \int_{D_k} u_t \varphi - F(u) \cdot \nabla\varphi \, dx
     + \int_{\partial D_k} (\hat n \cdot F)^* \varphi \, dS_x.$$

Substitute in basis functions; introduce elementwise stiffness, mass, and surface mass matrices $S$, $M$, $M^A$:
$$\partial_t u^k = -\sum_\nu D^{\partial_\nu,k}[F(u^k)] + L^k\big[\hat n \cdot F - (\hat n \cdot F)^*\big]\Big|_{A \subset \partial D_k}.$$

For straight-sided simplicial elements: reduce $D^{\partial_\nu}$ and $L$ to reference matrices.
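Reducing to reference matrices is what turns the element-local work into dense matrix products. A minimal NumPy sketch of local differentiation under this assumption (array names and sizes are mine, chosen for illustration, not the talk's code):

import numpy as np

# Hypothetical sizes: K elements, Np nodes per element, d space dimensions.
K, Np, d = 1000, 35, 3

# Reference differentiation matrices (one per reference coordinate) and
# constant per-element geometric factors dr/dx for straight-sided simplices.
D_ref = [np.random.rand(Np, Np) for _ in range(d)]
drdx = np.random.rand(K, d, d)      # drdx[k, ref_axis, phys_axis]
u = np.random.rand(K, Np)           # nodal field values, one row per element

def local_diff(u, phys_axis):
    # d/dx_{phys_axis} of u on every element, via the reference matrices
    result = np.zeros_like(u)
    for ref_axis in range(d):
        result += drdx[:, ref_axis, phys_axis, None] * (u @ D_ref[ref_axis].T)
    return result

ux = local_diff(u, 0)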


Decomposition of a DG operator into Subtasks

DG's execution decomposes into two (mostly) separate branches:

  u^k → Flux Gather → Flux Lifting
  u^k → F(u^k) → Local Differentiation

Both branches start from u^k and are summed to form ∂_t u^k. Flux lifting and local differentiation are the element-local parts of the DG operator.


Outline

1 Introduction
2 GPU-DG: Challenges and Solutions
  Introduction
  Challenges
  A Different Way of Writing GPU Codes
3 Results


DG on GPUs: Possible Advantages

DG on GPUs: Why?

GPUs have a deep memory hierarchy → the majority of DG is local.
Compute bandwidth ≫ memory bandwidth → DG is arithmetically intense.
GPUs favor dense data → local parts of the DG operator are dense.
GPUs don't like scattered access → DG's cell connectivity is sparser than CG's and more regular.


DG on the GPU: What are we trying to achieve?

Objectives:
  Main: Speed. Reduce the need for compute-bound clusters.
  Secondary: Generality. Be applicable to many problems.
  Tertiary: Ease of use. Hide the complexity of GPU hardware.

Setting (for now):
  Specialize to straight-sided simplices.
  Optimize for (but don't specialize to) tetrahedra (i.e. 3D).
  Optimize for "medium" order (3...5).


Element-Local Operations: Differentiation

[Diagram: each element's Np field values are multiplied by the Np × Np local templated derivative matrices (one per reference coordinate) and scaled by per-element geometric factors, for all K elements.]



Element-Local Operations: Lifting

[Diagram: each element's Nf·Nfp facial field values are multiplied by the Np × Nf·Nfp local templated lifting matrix and scaled by the (inverse) Jacobians, for all K elements. Which of these should occupy on-chip storage?]


Best use for on-chip memory?

Basic problem: On-chip storage is scarce...
...and will be for the foreseeable future.

Possible uses:
  Matrix/matrices
  Part of a matrix
  Field data
  Both

How to decide? Does it matter?
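For a sense of scale (my own back-of-the-envelope numbers, not from the slides): at order 4 in 3D, the lifting matrix alone takes roughly half of the 16 KiB of shared memory available per multiprocessor on a GT200-class GPU, so it competes directly with staging elements' field data on chip.

# Back-of-the-envelope sizes, single precision, 3D tetrahedra (assumed values):
N = 4                                    # polynomial order
Np = (N + 1) * (N + 2) * (N + 3) // 6    # volume nodes per element -> 35
Nfp = (N + 1) * (N + 2) // 2             # nodes per face           -> 15
Nf = 4                                   # faces per tetrahedron

lift_matrix_bytes = 4 * Np * Nf * Nfp    # one lifting matrix: 8400 bytes
facial_bytes_per_element = 4 * Nf * Nfp  # facial field data per element: 240 bytes

print(lift_matrix_bytes, facial_bytes_per_element)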


Work Partition for Element-Local Operators

Natural work decomposition: one element per block

+ Straightforward to implement
+ No granularity penalty

- Cannot fill wide SIMD: unused compute power for small to medium elements
- Data alignment: padding wastes memory
- Cannot amortize cost of preparation steps (e.g. fetching)


Loop Slicing for element-local parts of GPU DG

Per block: K_L element-local matrix multiplications, plus matrix load as a preparation step.

Question: How should one assign work to threads?

  w_s: elements handled in sequence by each thread (amortizes preparation)
  w_i: elements handled "inline-parallel" within each thread (exploits register space)
  w_p: elements handled in parallel across threads
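As a toy illustration of how the three granularities might surface in generated code (this is my sketch, not the talk's actual kernel generator; the real kernels also stage the matrix in shared memory, which the sequential loop helps amortize):

def make_local_mult_source(ws, wi, wp, n_p):
    # One block handles ws*wi*wp elements of n_p nodes each; launch with
    # block=(n_p, wp, 1). D is an n_p x n_p matrix, u and du are laid out
    # element-major as [element * n_p + node].
    lines = [
        "__global__ void diff(const float *D, const float *u, float *du)",
        "{",
        "  int node   = threadIdx.x;   // output node within an element",
        "  int par_el = threadIdx.y;   // which of the wp parallel elements",
        "  int base   = blockIdx.x * %d;" % (ws * wi * wp),
        "  for (int seq = 0; seq < %d; ++seq)   // ws elements in sequence" % ws,
        "  {",
    ]
    for i in range(wi):  # wi elements handled 'inline', fully unrolled
        lines += [
            "    {",
            "      int el = base + (seq * %d + par_el) * %d + %d;" % (wp, wi, i),
            "      float acc = 0.f;",
            "      for (int j = 0; j < %d; ++j)" % n_p,
            "        acc += D[node * %d + j] * u[el * %d + j];" % (n_p, n_p),
            "      du[el * %d + node] = acc;" % n_p,
            "    }",
        ]
    lines += ["  }", "}"]
    return "\n".join(lines)

print(make_local_mult_source(ws=2, wi=2, wp=4, n_p=35))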


Evaluating the Surface Flux

Idea: There is potential for data reuse when computing the flux on a face pair Γ_a, Γ_b shared by elements D_a, D_b.
Problem: One should store an entire element's worth of fluxes for optimal bandwidth use.

→ Must break some face pairs. Partition the mesh, obtaining three types of faces:
  Within one part
  Straddling parts (no reuse)
  Boundary

Evaluate: walk one partition's list of faces in parallel within a work group.
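A hypothetical sketch of the face classification this implies, given a partition assignment per element (function and variable names are mine):

def classify_faces(interior_face_pairs, boundary_faces, part_of_element):
    # interior_face_pairs: iterable of (elem_a, elem_b) sharing a face
    within, straddling = [], []
    for ea, eb in interior_face_pairs:
        if part_of_element[ea] == part_of_element[eb]:
            within.append((ea, eb))      # both elements in one part: reuse possible
        else:
            straddling.append((ea, eb))  # face pair broken across parts: no reuse
    return within, straddling, list(boundary_faces)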


Work Partition for Surface Flux Evaluation

Granularity tradeoff: large blocks give
  + More data reuse
  - Less parallelism
  - Less latency hiding

Block size is limited by two factors:
  Output buffer size
  Face metadata size

Optimal block size: not obvious



More than one Granularity

Different block sizes introduced so far: differentiation, lifting, surface fluxes.

[Diagram: elements of Np values each, packed into a padded array of K_M · Np entries with 64/128-entry alignment boundaries.]

Idea: Introduce another, smaller block size to satisfy SIMD width and alignment constraints (the "microblock"), and demand that the other block sizes be multiples of this new size.

How big? Not obvious.
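To make the padding concrete, a small sketch (my numbers; the alignment granularities are stand-ins, not measured choices): pack as many whole elements as fit into one aligned microblock and pad the rest.

def microblock_layout(n_p, mb_floats):
    # How many whole elements of n_p nodes fit into one microblock of
    # mb_floats entries, and how many entries are left over as padding.
    els_per_mb = mb_floats // n_p
    padding = mb_floats - els_per_mb * n_p
    return els_per_mb, padding

for mb in (64, 128):
    # Np = 20, 35, 56 for orders 3, 4, 5 on tetrahedra
    print(mb, [microblock_layout(n_p, mb) for n_p in (20, 35, 56)])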


DG on GPUs: Implementation Choices

Many difficult questions
Insufficient heuristics
Answers are hardware-specific and have no lasting value

Proposed solution: Tune automatically for the hardware at computation time, and cache the tuning results.
  Decrease reliance on knowledge of hardware internals
  Shift emphasis from tuning results to tuning ideas
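A minimal sketch of the tune-and-cache idea using PyCUDA (my illustration, not the talk's actual tuner; the variant structure and cache key are assumptions):

import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

def time_variant(src, kernel_name, args, block, grid, n_rep=20):
    # Compile one generated variant and time it on the actual device.
    func = SourceModule(src).get_function(kernel_name)
    func(*args, block=block, grid=grid)          # warm-up
    start, stop = drv.Event(), drv.Event()
    start.record()
    for _ in range(n_rep):
        func(*args, block=block, grid=grid)
    stop.record()
    stop.synchronize()
    return stop.time_since(start) / n_rep        # milliseconds per run

tuning_cache = {}   # e.g. keyed on (device name, operator, order, precision)

def best_variant(key, variants):
    # variants: list of dicts with keys src, kernel_name, args, block, grid
    if key not in tuning_cache:
        tuning_cache[key] = min(variants, key=lambda v: time_variant(**v))
    return tuning_cache[key]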


Metaprogramming Idea

In GPU scripting, GPU code does not need to be a compile-time constant.
(Key: code is data; it wants to be reasoned about at run time.)

Pipeline: Human → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result

Python is good for code generation; PyCUDA and PyOpenCL drive the machine side of this pipeline at run time.
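To make the pipeline concrete, a minimal, self-contained PyCUDA example (my own, not from the talk) in which the GPU source is assembled as a Python string at run time, compiled, and launched:

import numpy as np
import pycuda.autoinit                  # creates a context on the default device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

n = 400
# The kernel text is an ordinary Python string, built at run time;
# here the array length is simply baked into the source as a constant.
source = """
__global__ void double_them(float *a)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < %d)
    a[i] *= 2.0f;
}
""" % n

double_them = SourceModule(source).get_function("double_them")  # nvcc runs here

a = np.random.randn(n).astype(np.float32)
a_gpu = drv.to_device(a)
double_them(a_gpu, block=(128, 1, 1), grid=((n + 127) // 128, 1))
assert np.allclose(drv.from_device(a_gpu, a.shape, a.dtype), 2 * a)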


Machine-generated Code

Why machine-generate code?
  Automated tuning (cf. ATLAS, FFTW)
  Data types
  Specialize code for a given problem
  Constants are faster than variables (→ register pressure)
  Loop unrolling


Why do Scripting for GPUs?

GPUs are everything that scripting languages are not:
  Highly parallel
  Very architecture-sensitive
  Built for maximum FP/memory throughput

→ The two complement each other: the CPU is largely restricted to control tasks (∼1000/sec), and scripting is fast enough for that.

Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL


PyOpenCL, PyCUDA: Vital Information

http://mathema.tician.de/software/pyopencl (or /pycuda)
  Complete documentation
  X Consortium License (no warranty, free for all use)
  Convenient abstractions: arrays, elementwise operations, reductions; soon: scan
  Requires: numpy, Python 2.4+ (Win/OS X/Linux)
  Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
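For flavor, a short PyOpenCL example of the array, elementwise, and reduction abstractions mentioned above (my own snippet, following the documented dot-product reduction pattern):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
from pyopencl.reduction import ReductionKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = cl_array.to_device(queue, np.random.rand(10**6).astype(np.float32))
b = cl_array.to_device(queue, np.random.rand(10**6).astype(np.float32))

c = (2 * a + b).get()                 # elementwise arithmetic runs on the device

dot = ReductionKernel(ctx, np.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")
print(dot(a, b).get())                # reduction: dot product of a and b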


Outline

1 Introduction
2 GPU-DG: Challenges and Solutions
3 Results
  GPU-DG: Performance and Generality
  Conclusions


Loop Slicing for Differentiation

[Plot: execution time [ms] as a function of w_s and w_p for local differentiation, matrix-in-shared, order 4, with microblocking; point size denotes w_i ∈ {1, 2, 4}.]


Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400

[Plot: GFlops/s vs. polynomial order N for GPU and CPU.]


Memory Bandwidth on a GTX 280

[Plot: global memory bandwidth [GB/s] vs. polynomial order N for the gather, lift, diff, and assembly stages, compared against peak.]


Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs

[Plot: flop rates (GFlops/s) vs. polynomial order N, 16 GPUs vs. 64 CPU cores.]


Hedge DG Solver

High-level operator description: Maxwell's, Euler, Poisson, compressible Navier-Stokes, ...

One code runs...
  ...on CPU and CUDA
  ...on {CPU, CUDA} + MPI
  ...in 1D, 2D, 3D
  ...at any order

Generates C++ or CUDA code at run time
Open source (GPL3)
Written in Python, uses PyCUDA



GPU DG Showcase

Electromagnetism
Poisson
CFD


Where to from here?

PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/

GPU-DG article: AK, T. Warburton, J. Bridge, J.S. Hesthaven, "Nodal Discontinuous Galerkin Methods on Graphics Processors", J. Comp. Phys. 228 (21), 7863-7882. Also: intro in GPU Computing Gems Vol. 2.

GPU RTCG: AK, N. Pinto et al., "PyCUDA: GPU Run-Time Code Generation for High-Performance Computing", submitted.


Conclusions

GPU-DG is significantly faster than CPU-DG
  The method is well suited a priori
  Numerous tricks enable good performance

GPUs and scripting work surprisingly well together
  They enable run-time code generation

Ongoing work in GPU-DG:
  Curvilinear elements (T. Warburton)
  Local time stepping
  Shock detection for nonlinear equations

Shameless plugs:
  A peek at hedge's guts: tomorrow, 6pm
  Shock detection for GPU-DG: Thursday, 1pm


Questions?

Thank you for your attention!

http://www.cims.nyu.edu/~kloeckner/

image credits


Image Credits

Carrot: OpenClipart.org
Dart board: sxc.hu/195617
Question mark: sxc.hu/svilen001
?/! marks: sxc.hu/svilen001
Machine: flickr.com/13521837@N00
C870 GPU: Nvidia Corp.
Floppy disk: flickr.com/ethanhein


RTCG via Templates

from mako.template import Template

tpl = Template("""
    __kernel void add(
            __global ${ type_name } *tgt,
            __global const ${ type_name } *op1,
            __global const ${ type_name } *op2)
    {
      int idx = get_local_id(0)
        + ${ local_size } * ${ thread_strides }
        * get_group_id(0);

      % for i in range(thread_strides):
          <% offset = i*local_size %>
          tgt[idx + ${ offset }] = op1[idx + ${ offset }] + op2[idx + ${ offset }];
      % endfor
    }""")

rendered_tpl = tpl.render(type_name="float",
    local_size=local_size, thread_strides=thread_strides)
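For context (my addition, assuming a PyOpenCL context ctx and the local_size / thread_strides variables above), the rendered source is then built at run time like any other kernel:

import pyopencl as cl
knl = cl.Program(ctx, rendered_tpl).build().add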


RTCG via AST Generation

from codepy.cgen import *
from codepy.cgen.opencl import \
    CLKernel, CLGlobal, CLRequiredWorkGroupSize

mod = Module([
    FunctionBody(
        CLKernel(CLRequiredWorkGroupSize((local_size,),
            FunctionDeclaration(Value("void", "twice"),
                arg_decls=[CLGlobal(Pointer(Const(POD(dtype, "tgt"))))]))),
        Block([
            Initializer(POD(numpy.int32, "idx"),
                "get_local_id(0) + %d * get_group_id(0)"
                % (local_size*thread_strides))
            ]+[
            Statement("tgt[idx+%d] *= 2" % (o*local_size))
            for o in range(thread_strides)]))])

knl = cl.Program(ctx, str(mod)).build().twice


GPU-DG in Double Precision

GPU-DG: Double vs. Single Precision

[Plot: GFlops/s vs. polynomial order N in single and double precision, together with the double/single performance ratio.]


Outline

Automatic GPU Programming


Automating GPU Programming

GPU programming can be time-consuming, unintuitive and error-prone.
Obvious idea: Let the computer do it.

One way: smart compilers
  GPU programming requires complex tradeoffs
  Tradeoffs require heuristics
  Heuristics are fragile

Another way: dumb enumeration
  Enumerate loop slicings
  Enumerate prefetch options
  Choose by running the resulting code on the actual hardware


Loo.py Example

Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500

k = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
    ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
    ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
    }

→ Ideal case: finds a 160 GF/s kernel without human intervention.


Loo.py Status

Limited scope:
  Requires input/output separation
  Kernels must be expressible using the "loopy" model (i.e. indices decompose into "output" and "reduction")
  Enough for DG, LA, FD, ...

Kernel compilation limits the trial rate
Non-goal: peak performance
Good results currently for dense linear algebra and (some) DG subkernels
