General-Purpose Computation on Graphics Hardware
Welcome & Overview David Luebke NVIDIA
Introduction
• The GPU on commodity video cards has evolved into an extremely flexible and powerful processor
  – Programmability
  – Precision
  – Power
• This tutorial will address how to harness that power for general-purpose computation
Motivation: Computational Power
• GPUs are fast…
  – 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
  – NVIDIA GeForce 7800 GTX: 165 GFLOPS
  – 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
  – ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
• GPUs are getting faster, faster
  – CPUs: 1.4× annual growth
  – GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
Courtesy Kurt Akeley, Ian Buck & Tim Purcell
Motivation: Computational Power

[Figure: GFLOPS growth of GPUs vs. CPUs over time. Courtesy John Owens, Ian Buck]
An Aside: Computational Power
• Why are GPUs getting faster so fast?
  – Arithmetic intensity
    • The specialized nature of GPUs makes it easier to use additional transistors for computation rather than cache
  – Economics
    • The multi-billion-dollar video game market is a pressure cooker that drives innovation to exploit this property
    • Fierce competition!
Motivation: Flexible and Precise
• Modern GPUs are deeply programmable
  – Programmable pixel, vertex, and geometry engines
  – Solid high-level language support
• Modern GPUs support "real" precision
  – 32-bit floating point throughout the pipeline
  – High enough for many (not all) applications
Motivation: The Potential of GPGPU
• In short:
  – The power and flexibility of GPUs makes them an attractive platform for general-purpose computation
  – Example applications range from in-game physics simulation to conventional computational science
  – Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
The Problem: Difficult To Use
• GPUs designed for & driven by video games
  – Programming model unusual
  – Programming idioms tied to computer graphics
  – Programming environment tightly constrained
• Underlying architectures are:
  – Inherently data parallel
  – Rapidly evolving (even in basic feature set!)
  – Largely secret
• Can't simply "port" CPU code!
Course Goals
• A detailed introduction to general-purpose computing on graphics hardware
• We emphasize:
  – Core computational building blocks
  – Strategies and tools for programming GPUs
  – Tips & tricks, perils & pitfalls of GPU programming
• Case studies to bring it all together
Course Prerequisites
• Tutorial intended to be accessible to any savvy computer scientist
• Helpful but not required: familiarity with
  – Interactive 3D graphics APIs and graphics hardware
  – Data-parallel algorithms and programming
• Target audience
  – HPC researchers interested in GPGPU research
  – HPC developers interested in incorporating GPGPU techniques into their work
  – Attendees wishing a survey of this exciting field
Speakers
• In order of appearance:
  – David Luebke, NVIDIA
  – Mark Harris, NVIDIA
  – John Owens, University of California, Davis
  – Naga Govindaraju, Microsoft Research
  – Aaron Lefohn, Neoptica
  – Mike Houston, Stanford
  – Mark Segal, ATI
  – Ian Buck, NVIDIA
  – Matt Papakipos, PeakStream
Schedule

8:30   Introduction (Luebke): tutorial overview, GPU architecture, GPGPU programming

GPU Building Blocks
9:10   Data-Parallel Algorithms (Harris): reduce, scan, scatter/gather, sort, and search
9:30   Memory Models (Owens): GPU memory resources, CPU & Cell
9:45   Data Structures (Lefohn): static & dynamically updated data structures
10:00  Break

10:30  Sorting & Data Queries (Govindaraju): sorting networks & specializations, searching, data mining
11:00  Mathematical Primitives (Lefohn): linear algebra, finite difference & finite element methods

Languages & Programming Environments
11:30  High-Level Languages (Houston): Brook, RapidMind, Accelerator
12:00  Lunch
1:30   Debugging & Profiling (Houston): imdebug, DirectX/OpenGL shader IDEs, ShadeSmith
1:50   Direct GPU Computing (Segal): CTM, Data Parallel Virtual Machine

High Performance GPGPU
2:00   GPGPU Strategies & Tricks (Owens): GPU performance guidelines, scatter, conditionals
2:30   Performance Analysis & Arch Insights (Houston): GPUBench, architectural models for programming
3:00   Break

GPGPU In Practice
3:00   HavokFX (Harris): game physics simulation on GPUs
3:25   PeakStream Platform (Papakipos): commercial GPGPU platform, HPC case studies
3:50   GPGPU Cluster Computing (Houston): building GPU clusters; HMMer, GROMACS, Folding@Home

Conclusion
4:45   Question-and-answer session (All)
5:00   Wrap!
GPU Fundamentals: The Graphics Pipeline

[Figure: a simplified pipeline. The CPU (Application, Graphics State) feeds the GPU: Vertices (3D) → Transform & Light → Xformed, Lit Vertices (2D) → Assemble Primitives → Screenspace triangles (2D) → Rasterize → Fragments ("pre-pixels") → Shade → Final Pixels (Color, Depth). Video Memory (Textures) feeds the pipeline, and render-to-texture writes back into it.]

• A simplified traditional graphics pipeline
  – It's actually ultra-parallel!
  – Note that pipe widths vary
  – Many caches, FIFOs, and so on not shown
GPU Fundamentals: The Recent Graphics Pipeline

[Figure: the same pipeline, with Transform & Light replaced by a programmable Vertex Processor and Shade replaced by a programmable Fragment Processor.]

• Programmable vertex processor!
• Programmable pixel processor!
GPU Fundamentals: The New Graphics Pipeline

[Figure: the pipeline further adds a programmable Geometry Processor at primitive assembly.]

• Programmable primitive assembly!
• More flexible memory access!
• And much, much more!
GPU Pipeline: Transform
• Vertex processor (multiple in parallel)
  – Transform from "world space" to "image space"
  – Compute per-vertex lighting
GPU Pipeline: Rasterize
• Primitive assembly & rasterization
  – Convert vertices into primitives with area
    • Triangles, quadrilaterals, points
  – Convert the geometric rep. (vertex) to the image rep. (fragment)
    • Fragment = image fragment
      – Pixel + associated data: color, depth, stencil, etc.
  – Interpolate per-vertex quantities across pixels
GPU Pipeline: Shade
• Fragment processors (multiple in parallel)
  – Compute a color for each pixel
  – Optionally read colors from textures (images)
Introduction to GPGPU Programming David Luebke NVIDIA
Outline
• Data parallelism and stream processing
• Computational resources inventory
• CPU-GPU analogies
• Example: N-body gravitational simulation
The Importance of Data Parallelism
• GPUs are designed for graphics
  – Highly parallel tasks
• Graphics processes independent vertices & pixels
  – Temporary registers are zeroed
  – No shared or static data
  – No read-modify-write buffers
• Data-parallel processing
  – GPU architecture is ALU-heavy
    • Multiple vertex & pixel pipelines, multiple ALUs per pipe
  – Hide memory latency (with more computation)
Arithmetic Intensity
• Arithmetic intensity = ops per word transferred
  – Computation / bandwidth
• Best to have high arithmetic intensity
• Ideal GPGPU apps have
  – Large data sets
  – High parallelism
  – High independence between data elements
Stream Processing
• Streams
  – Collection of records requiring similar computation
    • Vertex positions, voxels, FEM cells, etc.
  – Provide data parallelism
• Kernels
  – Functions applied to each element in the stream
    • Transforms, PDEs, …
  – Few dependencies between stream elements
    • Encourages high arithmetic intensity
Example: Simulation Grid
• Common GPGPU computation style
  – Textures represent computational grids = streams
• Many computations map to grids
  – Matrix algebra
  – Image & volume processing
  – Physically-based simulation
  – Global illumination (ray tracing, photon mapping, radiosity)
• Non-grid streams can be mapped to grids
Stream Computation
• Grid simulation algorithm
  – Made up of steps
  – Each step updates the entire grid
  – Must complete before the next step can begin
• Grid is a stream, steps are kernels
  – Kernel applied to each stream element

[Figure: cloud simulation algorithm]
Scatter vs. Gather
• Grid communication
  – Grid cells share information
Computational Resources Inventory
• Programmable parallel processors
  – Vertex, geometry, & fragment pipelines
• Rasterizer
  – Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
  – Read-only memory interface
• Render to texture
  – Write-only memory interface
Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
  – Can change the location of the current vertex
  – Cannot read info from other vertices
  – Vertex texture fetch
    • Random access memory for vertices
    • Arguably still not gather
Fragment Processor
• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
  – RAM read (texture fetch), but no RAM write
  – Output address fixed to a specific pixel
• Typically more useful than the vertex processor
  – More fragment pipelines than vertex pipelines
  – Direct output (the fragment processor is at the end of the pipeline)
• More on scatter/gather later…
CPU-GPU Analogies
• CPU programming is familiar
  – GPU programming is graphics-centric
• Analogies can aid understanding
CPU-GPU Analogies
  CPU: Stream / Data Array  =  GPU: Texture
  CPU: Memory Read          =  GPU: Texture Sample

Kernels
  CPU: Kernel / loop body / algorithm step  =  GPU: Fragment Program
Feedback
• Each algorithm step depends on the results of previous steps
• Each time step depends on the results of the previous time step
Feedback
  CPU: Array Write (e.g., Grid[i][j] = x;)  =  GPU: Render to Texture
GPU Simulation Overview
• Analogies lead to implementation
  – Algorithm steps are fragment programs
    • Computational kernels
  – Current state is stored in textures
  – Feedback via render to texture
• One question: how do we invoke computation?
Invoking Computation
• Must invoke computation at each pixel
  – Just draw geometry!
  – The most common GPGPU invocation is a full-screen quad
• Other useful analogies
  – Rasterization = kernel invocation
  – Texture coordinates = computational domain
  – Vertex coordinates = computational range
Typical "Grid" Computation
• Initialize the "view" so that pixels map 1:1 to texels:

  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  glOrtho(0, 1, 0, 1, 0, 1);
  glViewport(0, 0, outTexResX, outTexResY);
• For each algorithm step:
  – Activate render-to-texture
  – Set up the input textures and the fragment program
  – Draw a full-screen quad (1×1); a sketch follows
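Concretely, the draw call for one pass might look like this minimal sketch in legacy OpenGL; bind_render_target and bind_inputs are hypothetical helpers standing in for the setup steps above:

  /* One GPGPU pass: with the 1:1 pixel-to-texel view set up above,
     drawing a unit quad runs the fragment program once per output texel. */
  void run_pass(void)
  {
      bind_render_target();  /* hypothetical: activate render-to-texture     */
      bind_inputs();         /* hypothetical: bind input textures + program  */
      glBegin(GL_QUADS);
      glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
      glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
      glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
      glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
      glEnd();
  }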
Example: N-Body Simulation
• Brute force: N = 8192 bodies, N² gravity computations
  – 64M force computations per frame, ~25 flops per force
  – 7.5 fps: 12.5+ GFLOPS sustained (GeForce 6800 Ultra)

Nyland, Harris, Prins, GP2 2004 poster
Computing Gravitational Forces
• Each body attracts all other bodies
  – N bodies, so N² forces
• Draw into an N×N buffer
  – Pixel (i,j) computes the force between bodies i and j
  – Very simple fragment program
    • More than 2048 bodies makes it trickier
      – Limited by max texture size…
      – "exercise for the reader"
Computing Gravitational Forces

  F(i,j) = g · Mi · Mj / r(i,j)²,   where r(i,j) = |pos(i) − pos(j)|

Force is proportional to the inverse square of the distance between bodies.
Computing Gravitational Forces

[Figure: an N×N force texture; pixel (i,j) computes force(i,j). The coordinates (i,j) in the force texture are used to find bodies i and j in the body position texture.]
Computing Gravitational Forces

  float4 force(float2 ij : WPOS,
               uniform sampler2D pos) : COLOR0
  {
      // Pos texture is 2D, not 1D, so we need to
      // convert the body index into 2D coords for the pos texture
      float4 iCoords  = getBodyCoords(ij);
      float4 iPosMass = texture2D(pos, iCoords.xy);
      float4 jPosMass = texture2D(pos, iCoords.zw);
      // xyz holds position; w holds mass
      float3 dir = iPosMass.xyz - jPosMass.xyz;
      float  r2  = dot(dir, dir);
      dir = normalize(dir);
      return float4(dir * g * iPosMass.w * jPosMass.w / r2, 0);
  }
Computing Total Force
• Have: an array of (i,j) forces (the N×N force texture)
• Need: the total force on each particle i
  – The sum of each column of the force array
• Can do all N columns in parallel
• This is called a Parallel Reduction
Parallel Reductions
• 1D parallel reduction:
  – sum N columns or rows in parallel
  – add two halves of the texture together
  – repeatedly...
  – until we're left with a single row of texels

  N×N → N×(N/2) → N×(N/4) → … → N×1

Requires log2 N steps
Update Positions and Velocities
• Now we have a 1D array of total forces
  – One per body
• Update velocity
  – u(i, t+dt) = u(i,t) + Ftotal(i) * dt
  – A simple fragment shader reads the previous velocity and force textures and creates a new velocity texture
• Update position
  – x(i, t+dt) = x(i,t) + u(i,t) * dt
  – A simple fragment shader reads the previous position and velocity textures and creates a new position texture (a sketch follows)
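A minimal CPU reference for the same per-body Euler update; the vec3 type and array names are illustrative, and unit masses are assumed:

  typedef struct { float x, y, z; } vec3;

  /* One time step: x(i,t+dt) = x(i,t) + u(i,t)*dt, then
     u(i,t+dt) = u(i,t) + Ftotal(i)*dt. Each iteration is independent,
     which is what lets the GPU run it as one fragment per body. */
  void integrate(vec3 *pos, vec3 *vel, const vec3 *force, int n, float dt)
  {
      for (int i = 0; i < n; ++i) {
          pos[i].x += vel[i].x * dt;    /* position uses previous velocity */
          pos[i].y += vel[i].y * dt;
          pos[i].z += vel[i].z * dt;
          vel[i].x += force[i].x * dt;  /* unit mass assumed               */
          vel[i].y += force[i].y * dt;
          vel[i].z += force[i].z * dt;
      }
  }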
Summary
• Presented mappings of basic computational concepts to GPUs
  – Basic concepts and terminology
  – For introductory "Hello GPGPU" sample code, see http://www.gpgpu.org/developer
• Only the beginning:
  – The rest of the course presents advanced techniques, strategies, and specific algorithms.
Data-Parallel Algorithms on GPUs Mark Harris NVIDIA Developer Technology
Outline
• Introduction
• Algorithmic complexity on GPUs
• Algorithmic building blocks
  – Gather & scatter
  – Reductions
  – Scan (parallel prefix)
  – Sort
  – Search
Data-Parallel Algorithms
• The GPU is a data-parallel processor
  – Data-parallel kernels of applications can be accelerated on the GPU
• Efficient algorithms require efficient building blocks
• This talk: data-parallel building blocks
  – Gather & scatter
  – Reduce and scan
  – Sort and search
Algorithmic Complexity on GPUs
• We will use standard "Big O" notation
  – e.g., an optimal sequential sort is O(n log n)
• The GPGPU element of parallelism is the pixel
  – Each pixel generates one output element
  – O(n) typically means n pixels processed
• In general, GPGPU O(n) usually means O(n/p) processing time
  – p is the number of "pixel processors" on the GPU
    • e.g., NVIDIA G70 has 24 pixel shader pipelines
Step vs. Work Complexity
• Important to distinguish between the two
• Work complexity: O(# pixels processed)
• Step complexity: O(# rendering passes)
Data-Parallel Building Blocks
• Gather & scatter
• Reduce
• Scan
• Sort
• Search
Scatter vs. Gather
• Gather: p = a[i]
  – Vertex or fragment programs
• Scatter: a[i] = p
  – Vertex programs only
Scatter Techniques
• Scatter is not available on most GPUs
  – Recently available on ATI CTM (see later talks)
• Problem: a[i] = p
  – Indirect write
  – Can't set the (x,y) of a fragment in the pixel shader
  – Often want to do a[i] += p
Scatter Technique 1
• Convert to gather

  for each spring:
      f = computed force
      mass_force[left]  += f
      mass_force[right] -= f
Scatter Technique 1
• Convert to gather (see the sketch below)

  for each spring:
      f[spring] = computed force
  for each mass:
      mass_force = f[left] - f[right]
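A minimal CPU sketch of the converted, gather-style kernel, assuming each mass knows the indices of its two neighboring springs (all names are illustrative):

  /* Gather version: each mass reads the forces of its two springs,
     instead of each spring writing (scattering) into two masses. */
  void gather_forces(const float *spring_force,        /* one force per spring */
                     const int *left, const int *right,/* spring ids per mass  */
                     float *mass_force, int nmasses)
  {
      for (int i = 0; i < nmasses; ++i)
          mass_force[i] = spring_force[left[i]] - spring_force[right[i]];
  }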
Scatter Technique 2
• Address sorting: sort & search
  – The shader outputs a destination address and data
  – Bitonic sort based on the address (see sorting, later)
  – Run a binary search shader over the destination buffer
    • Each fragment searches for its source data
Scatter Technique 3
• Vertex processor
  – Render points
    • Use a vertex shader to set the destination
    • Or just read back the data and re-issue
  – Vertex textures
    • Render data and addresses to a texture
    • Issue points; set each point's (x,y) in the vertex shader using the address texture
Parallel Reductions
• Given:
  – A binary associative operator ⊕ with identity I
  – An ordered set s = [a0, a1, …, an−1] of n elements
• reduce(⊕, s) returns a0 ⊕ a1 ⊕ … ⊕ an−1
• Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25
• Reductions are common in parallel algorithms
  – Common reduction operators are +, ×, min, and max
Parallel Reductions on the GPU
• 1D parallel reduction:
  – add two halves of the texture together
  – repeatedly...
  – until we're left with a single row of texels (a sketch follows)

  N → N/2 → N/4 → … → 1

O(log2 N) steps, O(N) work
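A minimal CPU reference for this halving reduction; on the GPU each loop iteration is one rendering pass that ping-pongs between two textures (the in-place array here stands in for both):

  /* Sum-reduce n elements (n a power of two) in log2(n) halving steps. */
  float reduce_sum(float *a, int n)
  {
      for (int half = n / 2; half >= 1; half /= 2)
          for (int k = 0; k < half; ++k)    /* one "pass" of `half` pixels */
              a[k] = a[k] + a[k + half];
      return a[0];
  }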
Multiple 1D Parallel Reductions
• Can run many reductions in parallel
  – Use a 2D texture and reduce one dimension

  M×N → M×(N/2) → M×(N/4) → … → M×1
2D Reductions
• Like a 1D reduction, only reduce in both directions simultaneously
  – Note: can add more than 2×2 elements per pixel
    • Trade per-pixel work for number of steps
    • Best performance depends on the specific GPU (cache, etc.)
Parallel Scan (aka Prefix Sum)
• Given:
  – A binary associative operator ⊕ with identity I
  – An ordered set s = [a0, a1, …, an−1] of n elements
• scan(⊕, s) returns [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an−1)]
• Example: scan(+, [3 1 7 0 4 1 6 3]) = [3 4 11 11 15 16 22 25]

(From Blelloch, 1990, "Prefix Sums and Their Applications")
A Naïve Parallel Scan Algorithm
• log(n) iterations over the array:

  In (T0):   3  1  7  0  4  1  6  3
  Stride 1:  3  4  8  7  4  5  7  9
  Stride 2:  3  4 11 11 12 12 11 14
  Stride 4:  3  4 11 11 15 16 22 25   (Out)

• For each stride d = 1, 2, 4, …, n/2 (log(n) passes):
  – Render a quad over elements d to n−1; fragment k computes vout = v[k] + v[k−d]
  – Note: can't read and write the same texture, so we must "ping-pong"
  – Due to ping-pong, also render a 2nd quad over the elements below d with a simple pass-through shader, vout = vin (a sketch follows)
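A minimal CPU reference for the ping-pong scan above; the two local buffers stand in for the two textures, and n is a power of two:

  #include <string.h>

  void naive_scan(const float *in, float *out, int n)
  {
      float a[1024], b[1024];           /* illustrative fixed bound >= n */
      memcpy(a, in, n * sizeof(float));
      float *src = a, *dst = b;
      for (int d = 1; d < n; d *= 2) {  /* strides 1, 2, 4, ...          */
          for (int k = 0; k < n; ++k)
              dst[k] = (k >= d) ? src[k] + src[k - d]  /* the scan quad  */
                                : src[k];              /* pass-through   */
          float *t = src; src = dst; dst = t;          /* ping-pong      */
      }
      memcpy(out, src, n * sizeof(float));
  }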
A Naïve Parallel Scan Algorithm
• Algorithm given in more detail in [Horn '05]
• Step-efficient, but not work-efficient
  – O(log n) steps, but O(n log n) adds
  – The sequential version is O(n)
  – A factor of log(n) hurts: 20× for 10^6 elements!
• Dig into the parallel algorithms literature for a better solution
  – See Blelloch 1990, "Prefix Sums and Their Applications"
Balanced Trees
• Common parallel algorithms pattern
  – Build a balanced binary tree on the input data and sweep it to and from the root
  – The tree is conceptual, not an actual data structure
• For prescan:
  – Traverse down from the leaves to the root, building partial sums at internal nodes in the tree
    • The root holds the sum of all leaves
  – Traverse back up the tree, building the scan from the partial sums
Balanced Tree Scan
1. First build sums in place up the tree
2. Use the partial sums to scan back down and generate the scan
• Note: tricky to implement using a graphics API
  – Due to interleaving of new and old results
  – Can reformulate the layout

[Figure courtesy Shubho Sengupta]
Further Improvement
• [Sengupta et al. '06] observes that the balanced tree algorithm is not step-efficient
  – It loses efficiency on steps that contain fewer pixels than the GPU has pipelines
• Hybrid work-efficient / step-efficient algorithm
  – Simply switch from the balanced tree to the naïve algorithm for smaller steps
Hybrid Work- and Step-Efficient Algorithm

[Figure courtesy Shubho Sengupta]
Parallel Sorting
• Given an unordered list of elements, produce a list ordered by key value
  – Kernel: compare and swap
• The GPU's constrained programming environment limits viable algorithms
  – Bitonic merge sort [Batcher 68]
  – Periodic balanced sorting networks [Dowd 89]
Bitonic Merge Sort Overview
• Repeatedly build bitonic lists and then sort them
  – A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing.
    • List A: (3, 4, 7, 8), monotonically increasing
    • List B: (6, 5, 2, 1), monotonically decreasing
    • List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic
Bitonic Merge Sort

[Figure: the sorting-network walkthrough on the input 3 7 4 8 6 2 1 5; each stage below is one column of compare-and-swaps.]

  Input:   3 7 4 8 6 2 1 5
           8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
           4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
  Sort the bitonic lists:
           3 7 8 4 2 6 5 1
           4x monotonic lists: (3,7) (8,4) (2,6) (5,1)
           2x bitonic lists: (3,7,8,4) (2,6,5,1)
  Sort the bitonic lists (two steps):
           3 4 8 7 5 6 2 1
           3 4 7 8 6 5 2 1
           2x monotonic lists: (3,4,7,8) (6,5,2,1)
           1x bitonic list: (3,4,7,8, 6,5,2,1)
  Sort the bitonic list (three steps):
           3 4 2 1 6 5 7 8
           2 1 3 4 6 5 7 8
           1 2 3 4 5 6 7 8   Done!
Bitonic Merge Sort Summary
• Separate rendering pass for each set of swaps (a sketch follows)
  – Step complexity: O(log² n) passes
  – Each pass performs n compare/swaps
  – Work complexity (total compare/swaps): O(n log² n)
• Limitations of the GPU cost us a factor of log n over optimal sequential sorting algorithms
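A minimal CPU reference for the bitonic sorting network; each iteration of the inner j-loop corresponds to one rendering pass of compare/swaps, and n must be a power of two:

  void bitonic_sort(float *a, int n)
  {
      for (int k = 2; k <= n; k *= 2)          /* size of bitonic lists     */
          for (int j = k / 2; j > 0; j /= 2)   /* compare distance = 1 pass */
              for (int i = 0; i < n; ++i) {
                  int p  = i ^ j;              /* index of compare partner  */
                  int up = ((i & k) == 0);     /* sort direction of block   */
                  if (p > i && (up ? a[i] > a[p] : a[i] < a[p])) {
                      float t = a[i]; a[i] = a[p]; a[p] = t;
                  }
              }
  }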
Making GPU Sorting Faster
• Draw several quads with similar computation instead of a single quad
  – Reduce decision making in the fragment program
• Push work into the vertex processor and interpolator
  – Reduce computation in the fragment program
• More than one compare/swap per sort kernel invocation
  – Reduce computational complexity
Grouping Computation

[Figure: the same sorting network, with the compare-and-swap columns grouped into quads that share the same computation.]
Implementation Details
• Specify interpolants for the smaller quads
  – 'down' or 'up' compare and swap
  – Distance to the comparison partner
• See the Kipfer & Westermann article in GPU Gems 2 and Kipfer et al., Graphics Hardware 2004, for more details
GPU Sort [Govindaraju et al. 05]
• Use blending operators for comparison
• Use texture mapping hardware to map the sorting operation
• Further improvements with GPUTeraSort [Govindaraju et al. 2006]
  – Beat the Penny Sort benchmark for 2006
Parallel Binary Search
Binary Search
• Find a specific element in an ordered list
• Implement just like the CPU algorithm
  – Assuming hardware supports long enough shaders
  – Finds the first element of a given value v
    • If v does not exist, find the next smallest element > v
• The search algorithm is sequential, but many searches can be executed in parallel
  – The number of pixels drawn determines the number of searches executed in parallel
    • 1 pixel == 1 search
Binary Search
• Search for v0 in the sorted list [v0 v0 v0 v2 v2 v2 v5 v5] (indices 0–7):

  Initialize: index 4. The search starts at the center of the sorted array; v2 >= v0, so search the left half of the sub-array.
  Step 1: index 2. v0 >= v0, so search the left half of the sub-array.
  Step 2: index 1. v0 >= v0, so search the left half of the sub-array.
  Step 3: index 0. At this point, we either have found v0 or are 1 element too far left; one last step resolves it.
  Step 4: index 0. Done!
Binary Search
• Search for v0 and v2 in parallel over the same sorted list (one current index per search):

  Initialize: 4, 4. Both searches start at the center and proceed to the left half of the array.
  Step 1: 2, 2. The search for v0 continues as before; the search for v2 overshot, so it goes back to the right.
  Step 2: 1, 3. We've found the proper v2, but are still looking for v0; both searches continue.
  Step 3: 0, 2. Now we've found the proper v0, but overshot v2; the cleanup step takes care of this.
  Step 4: 0, 3. Done! Both v0 and v2 are located properly.
Binary Search Summary
• Single rendering pass
  – Each pixel drawn performs an independent search (a sketch follows)
• O(log n) step complexity
  – PS3.0 GPUs have dynamic branching and looping
  – So these steps are inside the pixel shader, not separate rendering passes
• O(m log n) work complexity
  – m is the number of parallel searches
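A CPU sketch of the per-pixel search traced in the preceding slides: log2(n) halving steps plus the stride-1 cleanup steps. It returns the index of the first element >= v; n is a power of two, and a result of n means every element is < v:

  int gpu_style_search(const float *sorted, int n, float v)
  {
      int idx = n / 2;
      for (int stride = n / 4; stride >= 1; stride /= 2)
          idx += (sorted[idx] >= v) ? -stride : stride;
      for (int s = 0; s < 2; ++s) {            /* stride-1 cleanup steps   */
          idx += (sorted[idx] >= v) ? -1 : 1;
          if (idx < 0) idx = 0;                /* clamp at the array ends  */
          if (idx > n - 1) idx = n - 1;
      }
      if (sorted[idx] < v) idx += 1;           /* resolve final position   */
      return idx;
  }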
References
• Prefix Sums and Their Applications. Guy E. Blelloch. Technical Report CMU-CS-90-190, November 1990.
• A Toolkit for Computation on GPUs. Ian Buck and Tim Purcell. In GPU Gems, Randy Fernando, ed., 2004.
• GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management. Naga Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. In Proceedings of ACM SIGMOD 2006.
• Stream Reduction Operations for GPGPU Applications. Daniel Horn. In GPU Gems 2, Matt Pharr, ed., 2005.
• Improved GPU Sorting. Peter Kipfer. In GPU Gems 2, Matt Pharr, ed., 2005.
• A Work-Efficient Step-Efficient Prefix Sum Algorithm. Shubhabrata Sengupta, Aaron E. Lefohn, John D. Owens. In Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures.
GPU Memory Model Overview John Owens University of California, Davis
Memory Hierarchy
• CPU and GPU memory hierarchy:
  – CPU: Disk, Main Memory, Caches, Registers
  – GPU: Video Memory, Caches, Constant Registers, Temporary Registers
CPU Memory Model
• At any program point:
  – Allocate/free local or global memory
  – Random memory access
• Registers: read/write
• Local memory: read/write to stack
• Global memory: read/write to heap
• Disk: read/write to disk
Cell
• SPU memory model:
  – 128 128-bit local registers
  – 256 kB local store (6-cycle access time)
  – Explicit, asynchronous DMA access to main memory (allows communication/computation overlap)
  – No explicit I- or D-cache
  – No disk access

[Image: http://www.realworldtech.com/includes/images/articles/cell-1.gif]
GPU Memory Model
• Much more restricted memory access
  – Allocate/free memory only before computation
  – Limited memory access during computation (kernel)
• Registers: read/write
• Local memory: does not exist
• Global memory:
  – Read-only during computation
  – Write-only at the end of computation (precomputed address)
• Disk access: does not exist
GPU Memory Model
• GPUs support many types of memory objects in hardware
  – 1D, 2D, 3D grids
    • 2D is most common (framebuffer, texture)
  – 2D cube maps (6 faces of a cube)
  – Mipmapped (prefiltered) versions
  – DX10 adds arrayed datatypes
• Each native datatype has pros and cons from a general-purpose programming perspective
Traditional GPU Pipeline
• Inputs: vertex data, texture data
• Output: framebuffer

[Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture feeds the Fragment Processor.]
GPU Memory Model (DX9)
• Extending memory functionality
  – Copy from framebuffer to texture
  – Texture reads from the vertex processor (VS 3.0 GPUs)
  – Render to vertex buffer

[Figure: the same pipeline; Texture now also feeds the Vertex Processor.]
GPU Memory Model (DX10, traditional)
• More flexible memory handling
  – All programmable units can read texture
  – "Stream out" after the geometry processor

[Figure: Vertex Buffer → Vertex Processor → Geometry Processor → Rasterizer → Fragment Processor → Arrayed Frame Buffer(s); Arrayed Texture feeds the programmable stages, and Stream Out exits after the Geometry Processor.]
GPU Memory Model (DX10, new)
• DX10 provides "resources"
• Resources are flexible!

[Figure: DX10 Resources feed the Vertex Processor, Geometry Processor, Rasterizer, and Fragment Processor.]
GPU Memory API
• Each GPU memory type supports a subset of the following operations
  – CPU interface
  – GPU interface
GPU Memory API
• CPU interface:
  – Allocate
  – Free
  – Copy CPU → GPU
  – Copy GPU → CPU
  – Copy GPU → GPU
  – Bind for read-only vertex stream access
  – Bind for read-only random access
  – Bind for write-only framebuffer access
  (a sketch of a few of these in OpenGL follows)
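As a concrete, hedged example, several of these operations map roughly onto 2006-era OpenGL texture calls; width, height, and data are illustrative:

  GLuint tex;
  glGenTextures(1, &tex);                        /* allocate a handle       */
  glBindTexture(GL_TEXTURE_2D, tex);             /* bind for random access  */
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, /* allocate + copy CPU→GPU */
               width, height, 0, GL_RGBA, GL_FLOAT, data);
  glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA,       /* copy GPU→CPU            */
                GL_FLOAT, data);
  glDeleteTextures(1, &tex);                     /* free                    */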
GPU Memory API
• GPU (shader/kernel) interface:
  – Random-access read
  – Stream read
DX10 View of Memory: Resources (Buffers, Textures, Views)
• Resources
  – Encompass buffers and textures
  – Retained state is stored in resources
  – Must be bound by the API to pipeline stages before use
    • The same subresource cannot be bound for both read and write simultaneously
DX10 View of Memory: Resources (Buffers, Textures, Views)
• Buffers
  – Collection of elements
    • Few requirements on type or format (heterogeneous)
    • Elements are 1-4 components (e.g. R8G8B8A8, 8b int, 4×32b float)
  – No filtering, subresourcing, or multisampling
  – Layout effectively linear ("casting" is possible)
  – Examples: vertex buffers, index buffers, ConstantBuffers
DX10 View of Memory: Resources (Buffers, Textures, Views)
• Textures
  – Collection of texels
  – Can be filtered, subresourced, arrayed, mipmapped
  – Unlike buffers, must be declared with a texel type
    • Type impacts filtering
  – Layout is opaque: enables memory layout optimization
  – Examples: texture{1,2,3}d, mipmapped, cubemap
DX10 View of Memory: Resources (Buffers, Textures, Views)
• Views
  – "Mechanism for hardware interpretation of a resource in memory"
  – Allow structured access to subresources
  – A restricting view may increase efficiency
Big Picture: GPU Memory Model
• GPUs are a mix of:
  – Historical, fixed-function capabilities
  – Newer, flexible, programmable capabilities
• Fixed-function:
  – Known access patterns, behaviors
  – Accelerated by special-purpose hardware
• Programmable:
  – Unknown access patterns
  – Generality good
• The memory model must account for both
  – Consequence: ample special-purpose functionality
  – Consequence: restricting flexibility may improve performance
DX10 Bind Rules
• Shader resource input: anything, but can only bind views
• Shader constants: must be created as shader constants; can't be used in other views
• Depth/stencil output: not buffers or texture3D; can only bind views of other resources
• Input assembler: buffers
• StreamOut: buffers
• Render target output: anything, but can only bind views

[Figure: the rules annotate the Vertex Processor → Geometry Processor → Rasterizer → Fragment Processor pipeline.]
Example: Texture
• Texture mapping is a fundamental primitive in GPUs
• Most typical use: random access, bound for read only, 2D texture map
  – Hardware-supported caching & filtering

[Figure: Texture feeding the Fragment Processor in the standard pipeline.]
Example: Framebuffer
• Memory written by the fragment processor
• Write-only GPU memory (from the shader's point of view)
  – The framebuffer is read-modify-write for the pipeline as a whole
• Displayed to screen
• Can also store GPGPU results (not just color)
Example: Render to Texture
• Very common in both graphics & GPGPU
• Allows multipass algorithms
  – Pass 1: write data into the framebuffer
  – Pass 2: bind as texture, read from the texture
• Store up to 32 32-bit FP values per pixel
Example: Render to Vertex Array
• Enables a top-of-pipe feedback loop
• Enables dynamic creation of geometry on the GPU
Example: Stream Out to Vertex Buffer
• Enabled by the DX10 StreamOut capability
• Expected to be used for dynamic geometry
  – Recall the geometry processor produces 0-n outputs per input
• Possible graphics applications:
  – Expand point sprites
  – Extrude silhouettes
  – Extrude prisms/tets
Summary
• Rich set of hardware primitives
  – Designed for special-purpose tasks, but often useful for general-purpose ones
• Memory usage generally more restrictive than on other processors
  – Becoming more general-purpose and orthogonal
• Restricting generality allows hardware and software to cooperate for higher performance
GPU Data Structures Aaron Lefohn Neoptica
Introduction
• Previous talk: GPU memory model
• This talk: GPU data structures
  – Basic building block is the 2D array
  – (ATI's CTM supports large 1D arrays… discussed later)
• Overview
  – Dense arrays
  – Sparse arrays
  – Adaptive arrays
GPU Arrays
• Large 1D arrays
  – Current GPUs limit 1D array sizes to 2048 or 4096
  – Pack into 2D memory
  – 1D-to-2D address translation (a sketch follows)
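A minimal sketch of the translation, assuming the 1D array is packed row by row into a 2D texture of width W:

  /* 1D index -> 2D texel address in a W-wide texture, and back. */
  void addr_1d_to_2d(int i, int W, int *x, int *y) { *x = i % W; *y = i / W; }
  int  addr_2d_to_1d(int x, int y, int W)          { return y * W + x; }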
GPU Arrays
• 2D arrays
  – Trivial, native implementation
GPU Arrays
• 3D arrays
  – Problem: GPUs do not have 3D framebuffers
  – Solutions:
    1. Multiple slices per 2D buffer ("flat 3D array")
    2. Stack of 2D slices
    3. Render-to-slice of a 3D array
GPU Arrays
• The DX10 memory model helps
  – Can render to a slice of a 3D texture
• The "flat 3D array" still has advantages
  – Render the entire domain in a single parallel compute pass
  – More parallelism when writing (slower for reading)
GPU Arrays
• Higher-dimensional arrays
  – Pack into 2D buffers
  – N-D to 2D address translation
Sparse/Adaptive Data Structures
• Why?
  – Reduce memory pressure
  – Reduce computational workload
• Basic idea
  – Pack "active" data elements into GPU memory
Page-Table Sparse/Adaptive Arrays
• Dynamic sparse/adaptive N-D arrays (Lefohn et al. 2003/2006)

[Figure: Virtual Domain → Page Table → Physical Memory]
Dynamic Adaptive Data Structure
• Photon map (kNN-grid) (Purcell et al. 2003)

[Image from "Implementing Efficient Parallel Data Structures on GPUs," Lefohn et al., GPU Gems II, ch. 33, 2005]
GPU Perfect Hash Table
• Static, sparse N-D array (Lefebvre et al. 2006)

[Figure from Lefebvre, Hoppe, "Perfect Spatial Hashing," ACM SIGGRAPH, 2006]
GPU Iteration
• The GPU is good at random-access reads
  – Often do not need to define input iterators
• The GPU (mostly) performs only streaming writes
  – Every GPGPU data structure must support an output iterator compatible with GPU rasterization
    • Traverse the physical data layout
  – Examples:
    • Draw a single quad to iterate over voxels in a flattened volume
    • Draw one quad per page to iterate over a page-table-based array
GPU Iteration
• Optimizing input iterators
  – Galoppo et al. optimized input and output iterators for their matrix representation
    • Match the 2D GPU memory layout
    • Match the memory access pattern of the algorithm
  – Difficult because memory layouts are unspecified and proprietary

Galoppo, Govindaraju, Henson, Manocha, "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware," ACM/IEEE Supercomputing, 2005
GPU Data Structures
• Conclusions
  – The fundamental GPU memory primitive is a fixed-size 2D array
  – GPGPU needs a more general memory model
    • Low-level interfaces such as ATI's CTM are a step in the right direction
    • Iterator access patterns must match the storage layout
  – Building complex data structures on the GPU is still hard
    • Data-parallel algorithms (sort, scan, etc.) can be used
    • Is less parallelism more efficient?
More Information
• Overview (with code snippets)
  – Lefohn, Kniss, Owens, "Implementing Efficient Parallel Data Structures on GPUs," Chapter 33, GPU Gems II
• High-level GPU data structures
  – Lefohn, Kniss, Strzodka, Sengupta, Owens, "Glift: Generic, Efficient, Random-Access GPU Data Structures," ACM Transactions on Graphics, 2006
• GPGPU state-of-the-art report
  – Owens, Luebke, Govindaraju, Harris, Krüger, Lefohn, Purcell, "A Survey of General-Purpose Computation on Graphics Hardware," Eurographics STAR report, 2005
Sorting and Searching Naga Govindaraju
Topics
• Sorting
  – Sorting networks
• Search
  – Binary search
  – Searching quantiles
Assumptions
• Data organized into 1D arrays
• Rendering pass == screen-aligned quad
  – Not using vertex shaders
• PS 2.0 GPU
  – No data-dependent branching at the fragment level
Sorting
Sorting
• Given an unordered list of elements, produce a list ordered by key value
  – Kernel: compare and swap
• The GPU's constrained programming environment limits viable algorithms
  – Bitonic merge sort [Batcher 68]
  – Periodic balanced sorting networks [Dowd 89]
Bitonic Merge Sort Overview
• Repeatedly build bitonic lists and then sort them
  – A bitonic list is two monotonic lists concatenated together, one increasing and one decreasing.
    • List A: (3, 4, 7, 8), monotonically increasing
    • List B: (6, 5, 2, 1), monotonically decreasing
    • List AB: (3, 4, 7, 8, 6, 5, 2, 1), bitonic
Bitonic Merge Sort

[Figure: the sorting-network walkthrough on the input 3 7 4 8 6 2 1 5; each stage below is one column of compare-and-swaps.]

  Input:   3 7 4 8 6 2 1 5
           8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5)
           4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
  Sort the bitonic lists:
           3 7 8 4 2 6 5 1
           4x monotonic lists: (3,7) (8,4) (2,6) (5,1)
           2x bitonic lists: (3,7,8,4) (2,6,5,1)
  Sort the bitonic lists (two steps):
           3 4 8 7 5 6 2 1
           3 4 7 8 6 5 2 1
           2x monotonic lists: (3,4,7,8) (6,5,2,1)
           1x bitonic list: (3,4,7,8, 6,5,2,1)
  Sort the bitonic list (three steps):
           3 4 2 1 6 5 7 8
           2 1 3 4 6 5 7 8
           1 2 3 4 5 6 7 8   Done!
Bitonic Merge Sort Summary
• Separate rendering pass for each set of swaps
  – O(log² n) passes
  – Each pass performs n compare/swaps
  – Total compare/swaps: O(n log² n)
• Limitations of the GPU cost us a factor of log n over the best CPU-based sorting algorithms
Making GPU Sorting Faster
• Draw several quads with similar computation instead of a single quad
  – Reduce decision making in the fragment program
• Push work into the vertex processor and interpolator
  – Reduce computation in the fragment program
• More than one compare/swap per sort kernel invocation
  – Reduce computational complexity
Grouping Computation

[Figure: the same sorting network, with the compare-and-swap columns grouped into quads that share the same computation.]
Implementation Details
• Specify interpolants for the smaller quads
  – 'down' or 'up' compare and swap
  – Distance to the comparison partner
• See the Kipfer & Westermann article in GPU Gems 2 and Kipfer et al., Graphics Hardware 2004, for more details
GPU Sort
• Use blending operators for comparison
• Use texture mapping hardware to map the sorting operation
2D Memory Addressing
• GPUs are optimized for 2D representations
  – Map 1D arrays to 2D arrays
  – Minimum and maximum regions mapped to row-aligned or column-aligned quads
1D-2D Mapping

[Figure: MIN and MAX regions of the 1D array mapped to aligned quads in 2D.]
1D-2D Mapping
• Effectively reduces instructions per element

[Figure: the MIN region mapped to a quad.]
Sorting on GPU: Pipelining and Parallelism

[Figure: input vertices flow through texturing, caching, and 2D quad comparisons to sequential writes.]
Comparison with GPU-Based Algorithms

[Figure: performance graph] 3-6x faster than prior GPU-based algorithms!
GPU vs. High-End Multi-Core CPUs

[Figure: performance graph]
• 2-2.5x faster than Intel high-end processors
• Single GPU performance comparable to a high-end dual-core Athlon
• Optimized CPU code from Intel Corporation
GPU vs. High-End Multi-Core CPUs
• 2-2.5x faster than Intel high-end processors
• Single GPU performance comparable to a high-end dual-core Athlon
• Slashdot and Tom's Hardware Guide headlines, June 2005
N. Govindaraju, S. Larsen, J. Gray, and D. Manocha, Proc. Of ACM SuperComputing, 2006
GPU Cache Model
• Small data caches
  – Better hide the memory latency
  – Vendors do not disclose cache information, which is critical for scientific computing on GPUs
• We design a simple model
  – Determine cache parameters (block and cache sizes)
  – Improve sorting, FFT, and SGEMM performance
Cache Evictions

[Figure: cache blocks being repeatedly evicted as a sorting step strides through memory.]
Cache Issues

  Cache misses per step = 2·W·H / (h·B)

[Figure: a step of comparison distance h sweeping a W×H array, causing cache evictions.]
Analysis
• lg n possible steps in the bitonic sorting network
• Step k is performed (lg n − k + 1) times, and h = 2^(k−1)
• Data fetched from memory = 2·n·f(B), where f(B) = (B−1)·(lg n − 1) + 0.5·(lg n − lg B)²
Block Sizes on GPUs

[Figure: measured cache block sizes.]
Cache-Efficient Algorithm

[Figure: steps blocked to height h so the working set fits in cache.]
Cache Sizes on GPUs

[Figure: measured cache sizes.]
Cache-Efficient Algorithm Performance

[Figure: performance graph.]
Super-Moore's Law Growth

[Figure: sorting performance over time]
• 50 GB/s on a single GPU
• Peak performance: effectively hide memory latency with 15 GOP/s
N. Govindaraju, J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006
External Memory Sorting
• Performed on terabyte-scale databases
• Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
  – Limited main memory
  – First phase: partitions the input file into large data chunks and writes sorted chunks known as "runs"
  – Second phase: merges the runs to generate the sorted file
External Memory Sorting
• Performance mainly governed by I/O
• Salzberg analysis: given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(T·N)
Salzberg Analysis
• If N = 100 GB and T = 2 MB, then R ≈ 230 MB
• Large data sorting is inefficient on CPUs
  – R ≫ CPU cache sizes
  – Memory latency
External Memory Sorting
• External memory sorting on CPUs can have low performance due to
  – High memory latency
  – Or low I/O performance
• Our algorithm
  – Sorts large data arrays on GPUs
  – Performs I/O operations in parallel on CPUs
GPUTeraSort

[Figure: the GPUTeraSort pipeline.]
I/O Performance

[Figure: I/O performance graph] Salzberg analysis: 100 MB run size
I/O Performance
• Salzberg analysis: 100 MB run size
• Pentium IV: 25 MB run size; less work, and only 75% I/O efficient!
I/O Performance
• Salzberg analysis: 100 MB run size
• Dual 3.6 GHz Xeons: 25 MB run size; more cores, less work, but only 85% I/O efficient!
I/O Performance
• Salzberg analysis: 100 MB run size
• 7800 GT: 100 MB run size; ideal work, and 92% I/O efficient with a single CPU!
Task Parallelism

[Figure: task timeline] Performance limited by I/O and memory
Overall Performance

[Figure: performance graph] Faster and more scalable than dual Xeon processors (3.6 GHz)!
Performance/$
• 1.8x faster than the current terabyte sorter
• World's best performance/$ system
Advantages
• Exploit high memory bandwidth on GPUs
  – Higher memory performance than CPU-based algorithms
• High I/O performance due to large run sizes
Advantages
• Offload work from CPUs
  – CPU cycles well utilized for resource management
• Scalable solution for large databases
• Best performance/price solution for terabyte sorting
Searching
Types of Search
• Search for a specific element
  – Binary search
• Search for nearest element(s)
  – k-nearest-neighbor search
• Both searches require ordered data
Binary Search
• Find a specific element in an ordered list
• Implement just like the CPU algorithm
  – Assuming hardware supports long enough shaders
  – Finds the first element of a given value v
    • If v does not exist, find the next smallest element > v
• The search algorithm is sequential, but many searches can be executed in parallel
  – The number of pixels drawn determines the number of searches executed in parallel
    • 1 pixel == 1 search
Binary Search
• Search for v0 in the sorted list [v0 v0 v0 v2 v2 v2 v5 v5] (indices 0–7):

  Initialize: index 4. The search starts at the center of the sorted array; v2 >= v0, so search the left half of the sub-array.
  Step 1: index 2. v0 >= v0, so search the left half of the sub-array.
  Step 2: index 1. v0 >= v0, so search the left half of the sub-array.
  Step 3: index 0. At this point, we either have found v0 or are 1 element too far left; one last step resolves it.
  Step 4: index 0. Done!
Binary Search
• Search for v0 and v2 in parallel over the same sorted list (one current index per search):

  Initialize: 4, 4. Both searches start at the center and proceed to the left half of the array.
  Step 1: 2, 2. The search for v0 continues as before; the search for v2 overshot, so it goes back to the right.
  Step 2: 1, 3. We've found the proper v2, but are still looking for v0; both searches continue.
  Step 3: 0, 2. Now we've found the proper v0, but overshot v2; the cleanup step takes care of this.
  Step 4: 0, 3. Done! Both v0 and v2 are located properly.
Binary Search Summary
• Single rendering pass
  – Each pixel drawn performs an independent search
• O(log n) steps
Searching for Quantiles
• Given a set of values on the GPU, compute the k-th largest number
• Traditional CPU algorithms require arbitrary data writes; we need a new algorithm without
  – Data rearrangement
  – Data readback to the CPU
• Our solution: search for the k-th largest number

N. Govindaraju, B. Lloyd, W. Wang, M. Lin and D. Manocha, Proc. of ACM SIGMOD, 2004
K-th Largest Number
• Let vk denote the k-th largest number
• How do we generate a number m equal to vk?
  – Without knowing vk's value
  – Count the number of values ≥ some given value
  – Starting from the most significant bit, determine the value of one bit at a time
K-th Largest Number
• Given a set S of values:
  – c(m): the number of values ≥ m
  – vk: the k-th largest number
• We have:
  – If c(m) ≥ k, then m ≤ vk
  – If c(m) < k, then m > vk
• c(m) is computed using occlusion queries
2nd Largest in 9 Values

  Values: 0011 1011 1101 0111 0101 0001 0111 1010 0010   (v2 = 1011)

  Start from m = 0000 and fix one bit per step, most significant first.
  Each step draws a quad at depth m; an occlusion query returns c(m):
  – Draw a quad at depth 8, compute c(1000): c(m) = 3 ≥ 2, so the 1st bit is 1 (m = 1000)
  – Draw a quad at depth 12, compute c(1100): c(m) = 1 < 2, so the 2nd bit is 0 (m stays 1000)
  – Draw a quad at depth 10, compute c(1010): c(m) = 3 ≥ 2, so the 3rd bit is 1 (m = 1010)
  – Draw a quad at depth 11, compute c(1011): c(m) = 2 ≥ 2, so the 4th bit is 1 (m = 1011)
  Result: m = 1011 = v2
Our Algorithm
• Initialize m to 0
• Scan all bits from the MSB down to the LSB
• At each bit, put a 1 in the corresponding bit position of m
• If c(m) < k, make that bit 0
• Proceed to the next bit (a sketch follows)
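A CPU sketch of this bitwise search; on the GPU, count_ge would be the occlusion-query result for a quad drawn at depth m (values, n, and nbits are illustrative):

  static int count_ge(const unsigned *values, int n, unsigned m)
  {
      int c = 0;                          /* c(m): # of values >= m     */
      for (int i = 0; i < n; ++i) c += (values[i] >= m);
      return c;
  }

  unsigned kth_largest(const unsigned *values, int n, int k, int nbits)
  {
      unsigned m = 0;
      for (int b = nbits - 1; b >= 0; --b) {
          m |= 1u << b;                   /* tentatively set this bit   */
          if (count_ge(values, n, m) < k) /* too few values >= m?       */
              m &= ~(1u << b);            /* then this bit must be 0    */
      }
      return m;                           /* the k-th largest value     */
  }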
K-th Largest: Performance

[Figure: timings on an NVIDIA NV35.]
Median

[Figure: median timings] 3x performance improvement per year!
GPGPU Mathematical Primitives Aaron Lefohn Neoptica
GPGPU Non-Linear PDEs

[Figure] Strzodka, Garbe, "Real-Time Motion Estimation and Visualization on Graphics Cards," IEEE Visualization, 2004
GPGPU Direct Tridiagonal Solver

[Figure] 1000 tridiagonal linear systems of 1000 elements each. Kass, Lefohn, Owens, "Interactive Depth-of-Field," Pixar Technical Report, 2006
Overview
• Linear algebra
• Differential equations
• Performance results
• Summary
Linear Algebra on GPUs
• The basics
  – Vector-vector
  – Matrix-vector
  – Matrix-matrix
Basics: Vector-Vector Operations
• Add / subtract
  – Trivial parallel map operation
Basics: Vector-Vector Operations
• Inner product / normalize
  – A trivial implementation might render a single fragment (1 thread) and perform a serial computation
    • No parallelism
Basics: Vector-Vector Operations
• Inner product / normalize
  – Parallel reduction
Basics: Matrix-Vector Multiplication
• N inner products in parallel (a sketch follows)
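A CPU reference for the mapping: each output element is an independent inner product, which on the GPU becomes one fragment (M is row-major and N×N here for simplicity; names are illustrative):

  void matvec(const float *M, const float *x, float *y, int N)
  {
      for (int i = 0; i < N; ++i) {       /* one independent "pixel" per i */
          float sum = 0.0f;
          for (int j = 0; j < N; ++j)
              sum += M[i * N + j] * x[j];
          y[i] = sum;
      }
  }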
Basics: Matrix-Matrix Operations
• Add / transpose
  – Parallel map
Basics: Matrix-Matrix Operations
• Multiply
  – N² inner products
Basics: Matrix-Matrix Multiply
• Interesting parallelism note
  – N² inner products provide enough parallelism that it is OK to perform each one in a single pass
    • This differs from computing a single inner product, which must be parallelized for good performance
• Performance challenges
  – No writeable cache to capture reuse
  – The cache-to-register pathway is a bottleneck on many GPUs
Basics: Sparse Matrices
• N-diagonal (banded)
  – Store each diagonal as a vector
  – Special cases: 1-4-diagonal
    • Store diagonals in the quadword elements of a single vector
• Unstructured
  – The ITPACK format (padded compressed row) is attractive (comes with the Brook distribution)
    • Same number of non-zero elements in each row
    • Keeps computation SIMD
  – Other formats use more indirection and a varying amount of computation per vector element
GPU Linear Solvers
• Solve My = x for y
  – Use basic linear algebra parallel constructs
  – Data-parallel algorithms
• Examples
  – Conjugate gradient
  – Jacobi
  – Gauss-Seidel
  – Dense LU decomposition
  – Tridiagonal LU decomposition
Dense LU Decomposition
• GPU iterators (rasterization quads)
  – Match the memory access pattern with the GPU memory layout

Galoppo, Govindaraju, Henson, Manocha, "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware," ACM/IEEE Supercomputing, 2005
Tridiagonal LU Decomposition
• The Numerical Recipes algorithm is sequential
• Use a scan operation to parallelize
  – Cyclic reduction
  – O(N) computation in O(log N) passes
GPGPU Differential Equations
• Ordinary differential equation example
• Partial differential equation examples
GPGPU ODEs
• N-body particle system
  – The brute-force solution maps well to the GPU
    • "Stream all N particles past all N particles"

  foreach pi in Particles
      foreach pj in Particles
          pi += computeInteraction(pi, pj)

  – Replace the outer loop with a parallel GPU foreach
GPGPU ODEs
• O(N log N) optimized algorithms
  – More difficult to map to GPUs
    • Must build an irregular data structure each iteration (neighbor lists or hierarchical grid)
    • Varying number of interactions per particle
  – Architectural improvements are making this easier
    • Scatter
    • More efficient conditional execution
    • See Mark Harris's talk on Havok FX physics
GPGPU Partial Differential Equations
• Example GPGPU PDE applications
  – Navier-Stokes (incompressible fluids)
  – Level sets (deformable implicit surfaces)
  – Image processing
    • Registration
    • Segmentation
    • Computer vision
GPGPU Partial Differential Equations
• Explicit, finite-difference PDE solvers map well to GPUs
  – Gather a small number of local neighbors
  – Grid → texture

[Figure: step n → step n+1 stencil update. Figure from Robert Strzodka]
GPU PDEs
• Finite-difference optimizations
  – Multigrid
  – Banded sparse grids
  – Adaptive grids
Performance Results
• Matrix-matrix multiply
  – GPU: 110 GFLOPS (ATI X1900, CTM)
  – CPU: 8-10 GFLOPS (single Intel P4, 3.2 GHz)
  – Cell: > 200 GFLOPS (3.2 GHz)
Performance Results
• Dense LU decomposition (SC '05)
  – 15%-35% faster than ATLAS (partial pivot)
    • Matrix sizes > 3500²
    • NVIDIA GeForce 7800 / Intel Pentium 4 3.4 GHz
  – Up to 10x faster than LAPACK (full pivot)
    • Intel Math Kernel Library
    • Matrix sizes > 3500²
    • NVIDIA GeForce 7800 / Intel Pentium 4 3.4 GHz
Summary
• Techniques
  – Use data-parallel linear algebra algorithms
  – Redefine memory access patterns for the GPU
    • Contiguous output domain
    • Avoid scatter
    • Leverage the 2D memory layout
  – Minimize indirections
Summary
• Challenges
  – No writeable cache / local store
    • Hard to beat block-based decomposition
  – Must combine multiple operations before reading data back to the CPU
• Iterative solvers work very well!
GPGPU Math Libraries
• LU-GPU (dense LU decomposition): http://gamma.cs.unc.edu/LUGPULIB/
• Linear algebra framework: http://wwwcg.in.tum.de/Research/Publications/LinAlg
• GPUFFTW: http://gamma.cs.unc.edu/GPUFFTW/
• GPU FFT: http://sourceforge.net/projects/gpufft/
• PeakStream: http://www.peakstreaminc.com/
• RapidMind: http://www.rapidmind.com
References
• Kass, Lefohn, Owens, "Interactive Depth of Field" (cyclic reduction, direct tridiagonal linear solver), Pixar Technical Report, 2006.
• Galoppo, Govindaraju, Henson, Manocha, "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware," ACM/IEEE Supercomputing, 2005.
• Jiang, Snir, "Automatic Tuning Matrix Multiplication on Graphics Hardware," Parallel Architecture and Compilation Techniques (PACT), 2005.
• Fatahalian, Sugerman, Hanrahan, "Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication," ACM/EG Graphics Hardware, 2004.
• Lefohn, Kniss, Hansen, Whitaker, "Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware," IEEE Transactions on Visualization and Computer Graphics, 2004.
• Strzodka, Garbe, "Real-Time Motion Estimation and Visualization on Graphics Cards," IEEE Visualization, 2004.
• Harris, Baxter, Scheuermann, Lastra, "Simulation of Cloud Dynamics on Graphics Hardware," ACM/EG Graphics Hardware, 2003.
• Krüger, Westermann, "Linear Algebra Operators for GPU Implementation of Numerical Algorithms," ACM SIGGRAPH, 2003.
• Bolz, Farmer, Grinspun, Schröder, "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," ACM SIGGRAPH, 2003.
• Hillesland, Molinov, Grzeszczuk, "Nonlinear Optimization Framework for Image-Based Modeling on Programmable Graphics Hardware," ACM SIGGRAPH, 2003.
• Harris, Coombe, Scheuermann, Lastra, "Physically-Based Visual Simulation on Graphics Hardware," ACM/EG Graphics Hardware, 2002.
• Rumpf, Strzodka, "Using Graphics Cards for Quantized FEM Computations," IASTED Visualization, Imaging and Image Processing, 2001.
High Level Languages for GPUs Mike Houston Stanford University
High-Level Shading Languages
• Cg, HLSL, & OpenGL Shading Language
  – Cg: http://www.nvidia.com/cg
  – HLSL: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/highlevellanguageshaders.asp
  – OpenGL Shading Language: http://www.3dlabs.com/support/developer/ogl2/whitepapers/index.html
Compilers: CGC & FXC
• HLSL and Cg are syntactically almost identical
  – Exception: Cg 1.3 allows shader "interfaces" and unsized arrays
• Command-line compilers
  – Microsoft's FXC.exe
    • Compiles to DirectX vertex and pixel shader assembly only
    • fxc /Tps_3_0 myshader.hlsl
  – NVIDIA's CGC.exe
    • Compiles to everything
    • cgc -profile ps_3_0 myshader.cg
  – Can generate very different assembly!
    • The driver will recompile the code
  – Compliance may vary
Babelshader (http://graphics.stanford.edu/~danielrh/babelshader.html)
• Converts between DirectX pixel shaders and OpenGL shaders
• Allows OpenGL programs to use DirectX HLSL compilers to compile programs into ARB or fp30 assembly

[Figure: example conversion between PS2.0 and ARB]

• Enables fair benchmarking competition between the HLSL compiler and the Cg compiler on the same platform with the same demo and driver
GPGPU Languages
• Why do we want them?
  – Make programming GPUs easier!
    • Don't need to know OpenGL, DirectX, or ATI/NV extensions
    • Simplify common operations
    • Focus on the algorithm, not on the implementation
• Accelerator: Microsoft Research, http://research.microsoft.com/research/downloads/
• Brook: Stanford University, http://brook.sourceforge.net and http://graphics.stanford.edu/projects/brookgpu
• CTM: ATI Technologies
• PeakStream: http://www.peakstreaminc.com
• RapidMind: commercial follow-on to Sh, http://www.rapidmind.net
Microsoft Research Accelerator Project
• GPGPU programming using data parallelism
• Presents a data-parallel library to the programmer
  – Simple, high-level set of operations
• The library just-in-time compiles to GPU pixel shaders or CPU code
  – Runs on top of the product version of .NET
Data-Parallel Array Library
• Explicit conversions between data-parallel arrays and normal arrays
• Functional: each operation produces a new data-parallel array
• Eliminate certain operations on arrays to make them data-parallel
  – No aliasing, pointer arithmetic, or individual element access
Data-Parallel Array Types

[Figure: on the CPU side, normal arrays (Array1..ArrayN) and data-parallel arrays (DPArray1..DPArrayN) with library calls; on the GPU side, textures (txtr1..txtrN) and pixel shaders, connected through the API/driver/hardware.]
Explicit Conversion

[Figure: the same diagram] Explicit conversion between data-parallel arrays and normal arrays triggers GPU execution.
GPU
DPArray1[ … ]
DPArrayN[ … ]
API/Driver/ Hardware
Array1[ … ]
Functional style: each operation produces a new data-parallel array
txtr1[ … ] pix_shdrs()
…
txtrN[ … ]
ArrayN[ … ]
274
137
Types of Operations

[Figure: the same diagram] Operations are restricted to allow data-parallel programming: no pointer arithmetic, no individual element access/update.
Operations
• Array creation
• Element-wise arithmetic operations: +, *, -, etc.
• Element-wise boolean operations: and, or, >, <, etc.
• Type conversions: integer to float, etc.
• Reductions/scans: sum, product, max, etc.
• Transformations: expand, pad, shift, gather, scatter, etc.
• Basic linear algebra: inner product, outer product
Example: 2-D Convolution

  float[,] Blur(float[,] array, float[] kernel)
  {
      using (DFPA parallelArray = new DFPA(array))
      {
          FPA resultX = new FPA(0.0f, parallelArray.Shape);
          for (int i = 0; i < kernel.Length; i++) {
              // Convolve in X direction.
              resultX += parallelArray.Shift(0, i) * kernel[i];
          }
          FPA resultY = new FPA(0.0f, parallelArray.Shape);
          for (int i = 0; i < kernel.Length; i++) {
              // Convolve in Y direction.
              resultY += resultX.Shift(i, 0) * kernel[i];
          }
          using (DFPA result = resultY.Eval())
          {
              float[,] resultArray;
              result.ToArray(out resultArray);
              return resultArray;
          }
      }
  }
Just-in-Time Compiler

[Figure: the compilation pipeline from data-parallel operations to pixel shaders.]
Availability and More Information
• A binary version of Accelerator is available for download
  – http://research.microsoft.com/downloads
• Available for non-commercial use
  – Meant to support research community use
  – Licensing for commercial use is possible
• Includes documentation and a few samples
• Runs on Microsoft .NET and most GPUs shipping since 2002
• More information:
  – ASPLOS 2006: "Accelerator: Using Data-Parallelism to Program GPUs for General-Purpose Uses," David Tarditi, Sidd Puri, Jose Oglesby
  – http://research.microsoft.com/act
Brook: General-Purpose Streaming Language
• Stream programming model
  – GPU = streaming coprocessor
• C with stream extensions
• Cross-platform
  – ATI & NVIDIA
  – OpenGL, DirectX, CTM
  – Windows & Linux
Streams
• Collection of records requiring similar computation
  – Particle positions, voxels, FEM cells, …

  Ray r;
  float3 velocityfield;

• Similar to arrays, but…
  – Index operations disallowed: position[i]
  – Read/write stream operators:

  streamRead (r, r_ptr);
  streamWrite (velocityfield, v_ptr);
Kernels
• Functions applied to streams
  – Similar to a for_all construct
  – No dependencies between stream elements

  kernel void foo (float a, float b, out float result) {
      result = a + b;
  }

  float a;
  float b;
  float c;
  foo(a, b, c);

  for (i=0; i