High-Order Discontinuous Galerkin Methods by GPU Metaprogramming
Andreas Klöckner
Courant Institute of Mathematical Sciences, New York University
February 28, 2011
Thanks
Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
Hendrik Riedmann (Stuttgart/Brown)
PyOpenCL, PyCUDA contributors
Nvidia Corporation
Outline
1 Introduction
2 GPU-DG: Challenges and Solutions
3 Results
Discontinuous Galerkin Methods
Discontinuous Galerkin Method
Let $\Omega := \bigcup_i D_k \subset \mathbb{R}^d$.

Goal: Solve a conservation law on $\Omega$:
$$u_t + \nabla \cdot F(u) = 0$$

Example: Maxwell's equations. EM field $E(x,t)$, $H(x,t)$ on $\Omega$, governed by
$$\partial_t E - \frac{1}{\varepsilon} \nabla \times H = -\frac{j}{\varepsilon}, \qquad
\partial_t H + \frac{1}{\mu} \nabla \times E = 0,$$
$$\nabla \cdot E = \frac{\rho}{\varepsilon}, \qquad \nabla \cdot H = 0.$$
Discontinuous Galerkin Method
Multiply by a test function and integrate by parts:
$$0 = \int_{D_k} u_t \varphi + [\nabla \cdot F(u)]\,\varphi \, dx
   = \int_{D_k} u_t \varphi - F(u) \cdot \nabla \varphi \, dx
     + \int_{\partial D_k} (\hat{n} \cdot F)^* \varphi \, dS_x.$$
Substitute in basis functions; introduce elementwise stiffness, mass, and surface mass matrices $S$, $M$, $M^A$:
$$\partial_t u^k = - \sum_\nu D^{\partial_\nu,k} [F(u^k)]
  + L^k \left[ \hat{n} \cdot F - (\hat{n} \cdot F)^* \right]\big|_{A \subset \partial D_k}.$$
For straight-sided simplicial elements: reduce $D^{\partial_\nu}$ and $L$ to reference matrices.
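To make the reference-matrix reduction concrete, here is a minimal numpy sketch (an illustration, not hedge's implementation; array names, shapes, and the random data are assumptions): local differentiation applies the same reference derivative matrices to every element and combines the results with per-element geometric factors via the chain rule.

    import numpy as np

    # Illustrative sizes: K elements, Np nodes per element, d = 3 dimensions.
    # Np = 35 corresponds to order-4 tetrahedra.
    K, Np, d = 1000, 35, 3
    rng = np.random.default_rng(1)

    D_ref = rng.standard_normal((d, Np, Np))  # reference matrices D^r, D^s, D^t
    u = rng.standard_normal((K, Np))          # nodal field values, one row per element
    drdx = rng.standard_normal((K, d, d))     # per-element geometric factors dr_i/dx_j

    # Reference-space derivatives: every element's nodal vector is hit by
    # the same dense Np x Np matrices, i.e. the element-local, dense part of DG.
    du_ref = np.einsum("inm,km->kin", D_ref, u)       # shape (K, d, Np)

    # Chain rule: du/dx_j = sum_i (dr_i/dx_j) du/dr_i, elementwise.
    du_xyz = np.einsum("kij,kin->kjn", drdx, du_ref)  # shape (K, d, Np)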
Decomposition of a DG operator into Subtasks
DG's execution decomposes into two (mostly) separate branches:
$u^k \to$ Flux Gather $\to$ Flux Lifting $\to \partial_t u^k$
$u^k \to F(u^k) \to$ Local Differentiation $\to \partial_t u^k$
Element-local parts of the DG operator: flux lifting and local differentiation (shown in green in the original diagram).
Outline
1 Introduction
2 GPU-DG: Challenges and Solutions
   Introduction
   Challenges
   A Different Way of Writing GPU Codes
3 Results
DG on GPUs: Possible Advantages
DG on GPUs: Why?
GPUs have a deep memory hierarchy; the majority of DG is local.
Compute bandwidth greatly exceeds memory bandwidth; DG is arithmetically intense.
GPUs favor dense data; the local parts of the DG operator are dense.
GPUs don't like scattered access; DG's cell connectivity is sparser than CG's, and more regular.
DG on the GPU: What are we trying to achieve?
Objectives:
Main: Speed. Reduce the need for compute-bound clusters.
Secondary: Generality. Be applicable to many problems.
Tertiary: Ease of use. Hide the complexity of the GPU hardware.
Setting (for now):
Specialize to straight-sided simplices.
Optimize for (but don't specialize to) tetrahedra (i.e. 3D).
Optimize for "medium" orders (3...5).
Element-Local Operations: Differentiation
[Figure: K x Np field data multiplied by local templated derivative matrices (Np x Np), then combined with per-element geometric factors.]
Element-Local Operations: Lifting
[Figure: the local templated lifting matrix (Np x Nf Nfp) applied to facial field data (K x Nf Nfp), scaled by (inverse) Jacobians. Which of these operands should go into on-chip storage?]
Best use for on-chip memory?
Basic Problem
On-chip storage is scarce... and will be for the foreseeable future.
Possible uses: the matrix (or matrices), part of a matrix, the field data, or both.
How to decide? Does it matter?
Work Partition for Element-Local Operators
Natural Work Decomposition: One Element per Block
+ Straightforward to implement
+ No granularity penalty
- Cannot fill wide SIMD: unused compute power for small to medium elements
- Data alignment: padding wastes memory
- Cannot amortize the cost of preparation steps (e.g. fetching)
Loop Slicing for element-local parts of GPU DG
Per block: K_L element-local matrix multiplications + matrix load (preparation).
Question: How should one assign work to threads? Three slicing parameters, illustrated by the sketch below:
w_s: in sequence (amortize preparation)
w_i: "inline-parallel" (exploit register space)
w_p: in parallel (across threads)
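A minimal, hypothetical sketch of what loop slicing means for generated code (this is not hedge's generator; the kernel fragment and names are invented for illustration): the slicing parameters become compile-time constants of the emitted source, so each candidate (w_s, w_i) pair yields a different kernel to benchmark.

    # Sketch: bake loop-slicing parameters into generated C-like kernel text.
    def slice_loops(w_s, w_i):
        # w_i copies of the update live "inline" in one thread's registers.
        inline = "\n".join(
            f"    acc{i} += mat_entry * field[elt_base + {i}];  /* w_i unroll */"
            for i in range(w_i))
        # w_s elements are processed sequentially, amortizing the matrix fetch.
        return (f"for (int s = 0; s < {w_s}; ++s)  /* w_s: sequential */\n"
                f"{{\n{inline}\n}}")

    print(slice_loops(w_s=3, w_i=2))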
Evaluating the Surface Flux
Idea: there is potential for data reuse when computing the flux on a shared face pair $\Gamma_a \subset \partial D_a$, $\Gamma_b \subset \partial D_b$.
Problem: for optimal bandwidth use, one should store an entire element's worth of fluxes.
→ Must break some face pairs. Partition the mesh to obtain three types of faces:
Within one partition
Straddling partitions (no reuse)
Boundary
Evaluate: walk one partition's list of faces in parallel within a work group.
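A small sketch of the face classification this implies (the data layout and names here are assumptions for illustration, not hedge's internals):

    import numpy as np

    # Hypothetical inputs: interior faces as (element, neighbor) pairs and a
    # mesh partition number for each element.
    part_of_el = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
    face_pairs = [(0, 1), (3, 4), (5, 6), (7, 8), (9, 10)]
    boundary_faces = [(2, None)]  # faces with no neighbor element

    intra, straddling = [], []
    for a, b in face_pairs:
        # Reuse is only possible if both elements' surface data live in the
        # same work group, i.e. the same mesh partition.
        (intra if part_of_el[a] == part_of_el[b] else straddling).append((a, b))

    print("within one partition:", intra)
    print("straddling partitions (no reuse):", straddling)
    print("boundary:", boundary_faces)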
Work Partition for Surface Flux Evaluation
Granularity Tradeoff
Large blocks:
+ more data reuse
- less parallelism
- less latency hiding
Block size is limited by two factors: output buffer size and face metadata size.
Optimal block size: not obvious.
More than one Granularity
Different block sizes introduced so far: differentiation, lifting, surface fluxes; plus padding.
[Figure: elements of Np values each, packed into a linear array with padding at granularities 0, 64, 128, ..., K_M Np.]

Idea
Introduce another, smaller block size that satisfies SIMD width and alignment constraints (a "microblock"), and demand that the other block sizes be multiples of this new size.
How big? Not obvious.
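As a concrete (hypothetical) illustration of microblocking: pack as many whole elements as fit into one alignment granule, then pad. The 64-float granule below is an assumed value for illustration, not a fixed constant of the method.

    # Sketch: microblock layout for element size Np, assuming a 64-float
    # alignment granularity.
    GRANULARITY = 64

    def microblock_layout(n_p):
        els_per_mb = GRANULARITY // n_p            # whole elements per microblock
        padding = GRANULARITY - els_per_mb * n_p   # wasted floats per microblock
        return els_per_mb, padding

    for n_p in [10, 20, 35, 56]:  # tetrahedral Np for orders 2..5
        els, pad = microblock_layout(n_p)
        print(f"Np={n_p}: {els} elements/microblock, {pad} floats of padding")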
DG on GPUs: Implementation Choices
Many difficult questions. Insufficient heuristics. The answers are hardware-specific and have no lasting value.
Proposed Solution: Tune automatically for the hardware at computation time; cache the tuning results.
Decrease reliance on knowledge of hardware internals.
Shift emphasis from tuning results to tuning ideas.
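A minimal sketch of such a tune-and-cache loop, under stated assumptions: build_kernel and time_kernel are hypothetical helpers standing in for "compile this variant at run time and time it on the actual device", and the cache file layout is likewise invented for illustration.

    import itertools, json, os

    CACHE_FILE = "tuning_cache.json"

    def autotune(candidates, build_kernel, time_kernel, key):
        # Reuse an earlier tuning result for this (problem, hardware) key.
        cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}
        if key in cache:
            return cache[key]
        # Otherwise: run every candidate on the actual hardware, keep the fastest.
        best = min(candidates, key=lambda p: time_kernel(build_kernel(**p)))
        cache[key] = best
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)
        return best

    # Example search space: the loop-slicing parameters from earlier slides.
    candidates = [dict(w_s=ws, w_i=wi, w_p=wp)
                  for ws, wi, wp in itertools.product([1, 2, 4], repeat=3)]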
Metaprogramming Idea
In GPU scripting, GPU code does not need to be a compile-time constant.
(Key: code is data. It wants to be reasoned about at run time.)
[Figure: a feedback loop. The human supplies ideas; Python code, which is good for code generation, emits GPU code; the GPU compiler produces a GPU binary; the GPU computes a result, which feeds back into the Python code. PyCUDA and PyOpenCL close the loop; the machine does the rest.]
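For instance, here is a minimal PyCUDA round trip through that loop (a standard usage pattern of PyCUDA's documented API; the kernel itself is a toy example, and the baked-in block size stands in for the tuning parameters discussed earlier):

    import numpy as np
    import pycuda.autoinit              # creates a CUDA context
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    block_size = 128  # an ordinary Python value, baked into the source below

    mod = SourceModule("""
    __global__ void double_it(float *a)
    {
        int i = threadIdx.x + blockIdx.x * %(block_size)d;
        a[i] *= 2;
    }
    """ % {"block_size": block_size})

    a = np.random.randn(4 * block_size).astype(np.float32)
    a_doubled = a.copy()
    mod.get_function("double_it")(
        drv.InOut(a_doubled), block=(block_size, 1, 1), grid=(4, 1))
    assert np.allclose(a_doubled, 2 * a)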
Machine-generated Code
Why machine-generate code?
Automated tuning (cf. ATLAS, FFTW)
Data types
Specialize code for the given problem
Constants are faster than variables (→ register pressure)
Loop unrolling
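The last two points are easy to demonstrate with plain string generation (a toy sketch; the kernel and names are invented for illustration): the generator bakes the scale factor in as a literal and unrolls the inner loop completely.

    # Sketch: emit CUDA-C with a baked-in constant and a fully unrolled loop.
    def unrolled_scale_source(n, alpha):
        body = "\n    ".join(
            f"x[base + {i}] *= {alpha}f;" for i in range(n))
        return f"""__global__ void scale(float *x)
    {{
        int base = {n} * (threadIdx.x + blockDim.x * blockIdx.x);
        {body}
    }}"""

    print(unrolled_scale_source(n=4, alpha=0.5))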
Why do Scripting for GPUs?
GPUs are everything that scripting languages are not:
highly parallel, very architecture-sensitive, built for maximum FP/memory throughput.
→ The two complement each other. The CPU is largely restricted to control tasks (~1000/sec), and scripting is fast enough for those.
Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
PyOpenCL, PyCUDA: Vital Information
http://mathema.tician.de/software/pyopencl (or /pycuda)
Complete documentation
X Consortium License (no warranty, free for all use)
Convenient abstractions: arrays, elementwise operations, reductions; soon: scan
Requires: numpy, Python 2.4+ (Win/OS X/Linux)
Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
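A taste of the array abstraction, using standard PyOpenCL usage:

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a = cl_array.to_device(queue, np.random.randn(10**6).astype(np.float32))
    b = cl_array.to_device(queue, np.random.randn(10**6).astype(np.float32))

    # Arithmetic on array objects generates and launches kernels on the fly.
    c = (2 * a + b).get()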
Outline
1 Introduction
2 GPU-DG: Challenges and Solutions
3 Results
   GPU-DG: Performance and Generality
   Conclusions
Loop Slicing for Differentiation
[Plot: execution time (ms) vs. w_p (15-30) for local differentiation, matrix-in-shared, order 4, with microblocking; point size denotes w_i ∈ {1, 2, 4}, and w_s varies on a secondary scale.]
Nvidia GTX280 vs. a single core of an Intel Core 2 Duo E8400
[Plot: GFlops/s (0-300) vs. polynomial order N (0-10) for GPU and CPU; the GPU curve lies far above the CPU curve.]
Memory Bandwidth on a GTX 280
[Plot: global memory bandwidth (GB/s, 20-200) vs. polynomial order N (1-9) for the gather, lift, diff, and assembly kernels, compared against the peak bandwidth.]
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs
[Plot: flop rates (GFlops/s, up to 4000) vs. polynomial order N (0-10) for 16 GPUs vs. 64 CPU cores.]
Hedge DG Solver
High-level operator description: Maxwell's, Euler, Poisson, compressible Navier-Stokes, ...
One code runs...
on CPU, CUDA
on {CPU, CUDA} + MPI
in 1D, 2D, 3D
at any order
Generates C++ or CUDA code at run time. Open source (GPL3). Written in Python; uses PyCUDA.
GPU DG Showcase
Electromagnetism
Poisson
CFD
Where to from here?
PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/

GPU-DG article: AK, T. Warburton, J. Bridge, J.S. Hesthaven, "Nodal Discontinuous Galerkin Methods on Graphics Processors", J. Comput. Phys. 228 (21), 7863-7882. Also: an introduction in GPU Computing Gems, Vol. 2.

GPU RTCG: AK, N. Pinto et al., "PyCUDA: GPU Run-Time Code Generation for High-Performance Computing", submitted.
Conclusions
GPU-DG is significantly faster than CPU-DG: the method is well-suited a priori, and numerous tricks enable good performance.
GPUs and scripting work surprisingly well together: they enable run-time code generation.
Ongoing work in GPU-DG: curvilinear elements (T. Warburton), local time stepping, shock detection for nonlinear equations.
Shameless plugs: a peek at hedge's guts, tomorrow 6pm; shock detection for GPU-DG, Thursday 1pm.
Questions?
Thank you for your attention!
http://www.cims.nyu.edu/~kloeckner/
RTCG via Templates

from mako.template import Template

tpl = Template("""
    __kernel void add(
            __global ${type_name} *tgt,
            __global const ${type_name} *op1,
            __global const ${type_name} *op2)
    {
        int idx = get_local_id(0)
            + ${local_size} * ${thread_strides}
            * get_group_id(0);

        % for i in range(thread_strides):
            <% offset = i*local_size %>
            tgt[idx + ${offset}] =
                op1[idx + ${offset}]
                + op2[idx + ${offset}];
        % endfor
    }""")

rendered_tpl = tpl.render(type_name="float",
    local_size=local_size, thread_strides=thread_strides)
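The rendered source is then just a string; with PyOpenCL it can be compiled in the usual way, e.g. prg = cl.Program(ctx, rendered_tpl).build(), the same pattern that closes the AST-based example on the next slide.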
RTCG via AST Generation

from codepy.cgen import *
from codepy.cgen.opencl import \
    CLKernel, CLGlobal, CLRequiredWorkGroupSize

mod = Module([
    FunctionBody(
        CLKernel(CLRequiredWorkGroupSize((local_size,),
            FunctionDeclaration(Value("void", "twice"),
                arg_decls=[CLGlobal(Pointer(Const(POD(dtype, "tgt"))))]))),
        Block([
            Initializer(POD(numpy.int32, "idx"),
                "get_local_id(0) + %d * get_group_id(0)"
                % (local_size*thread_strides))
            ] + [
            Statement("tgt[idx+%d] *= 2" % (o*local_size))
            for o in range(thread_strides)]))])

knl = cl.Program(ctx, str(mod)).build().twice
GPU-DG in Double Precision
[Plot: GFlops/s (0-400) vs. polynomial order N (0-10) in single and double precision, with the double/single ratio (0-1) on a secondary axis.]
Outline
Automatic GPU Programming
Automating GPU Programming
GPU programming can be time-consuming, unintuitive, and error-prone.
Obvious idea: let the computer do it.
One way: smart compilers.
GPU programming requires complex tradeoffs; tradeoffs require heuristics; heuristics are fragile.
Another way: dumb enumeration.
Enumerate loop slicings. Enumerate prefetch options. Choose by running the resulting code on the actual hardware.
Loo.py Example
Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500

knl = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
    ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
    ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
    }

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status
Limited scope:
Requires input/output separation.
Kernels must be expressible using the "loopy" model (i.e., indices decompose into "output" and "reduction").
Enough for DG, LA, FD, ...
Kernel compilation limits the trial rate.
Non-goal: peak performance.
Good results currently for dense linear algebra and (some) DG subkernels.