General-Purpose Computation on Graphics Hardware

Welcome & Overview David Luebke NVIDIA

1

Introduction • The GPU on commodity video cards has evolved into an extremely flexible and powerful processor – Programmability – Precision – Power

• This tutorial will address how to harness that power for general-purpose computation

3

Motivation: Computational Power • GPUs are fast…
– 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
– NVIDIA GeForce 7800 GTX: 165 GFLOPS
– 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
– ATI Radeon X850 XT Platinum Edition: 37.8 GB/s

• GPUs are getting faster, faster
– CPUs: 1.4× annual growth
– GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth

Courtesy Kurt Akeley, Ian Buck & Tim Purcell

4

2

Motivation: Computational Power

Courtesy John Owens, Ian Buck

5

An Aside: Computational Power • Why are GPUs getting faster so fast?
– Arithmetic intensity
• The specialized nature of GPUs makes it easier to use additional transistors for computation, not cache

– Economics • Multi-billion dollar video game market is a pressure cooker that drives innovation to exploit this property • Fierce competition!

6

3

Motivation: Flexible and Precise • Modern GPUs are deeply programmable – Programmable pixel, vertex, and geometry engines – Solid high-level language support

• Modern GPUs support “real” precision – 32 bit floating point throughout the pipeline – High enough for many (not all) applications

7

Motivation: The Potential of GPGPU • In short: – The power and flexibility of GPUs makes them an attractive platform for general-purpose computation – Example applications range from in-game physics simulation to conventional computational science – Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

8

4

The Problem: Difficult To Use • GPUs designed for & driven by video games – Programming model unusual – Programming idioms tied to computer graphics – Programming environment tightly constrained

• Underlying architectures are: – Inherently data parallel – Rapidly evolving (even in basic feature set!) – Largely secret

• Can’t simply “port” CPU code! 9

Course goals • A detailed introduction to general-purpose computing on graphics hardware • We emphasize: – Core computational building blocks – Strategies and tools for programming GPUs – Tips & tricks, perils & pitfalls of GPU programming

• Case studies to bring it all together

10

5

Course Prerequisites • Tutorial intended to be accessible to any savvy computer scientist • Helpful but not required: familiarity with – Interactive 3D graphics APIs and graphics hardware – Data-parallel algorithms and programming

• Target audience – HPC researchers interested in GPGPU research – HPC developers interested in incorporating GPGPU techniques into their work – Attendees wishing a survey of this exciting field 11

Speakers • In order of appearance:
– David Luebke, NVIDIA
– Mark Harris, NVIDIA
– John Owens, University of California, Davis
– Naga Govindaraju, Microsoft Research
– Aaron Lefohn, Neoptica
– Mike Houston, Stanford
– Mark Segal, ATI
– Ian Buck, NVIDIA
– Matt Papakipos, PeakStream
12

6

Schedule
8:30 Introduction (Luebke) - Tutorial overview, GPU architecture, GPGPU programming
GPU Building Blocks
9:10 Data-Parallel Algorithms (Harris) - Reduce, scan, scatter/gather, sort, and search
9:30 Memory Models (Owens) - GPU memory resources, CPU & Cell
9:45 Data Structures (Lefohn) - Static & dynamically updated data structures
10:00 Break
13

Schedule
10:30 Sorting & Data Queries (Govindaraju) - Sorting networks & specializations, searching, data mining
11:00 Mathematical Primitives (Lefohn) - Linear algebra, finite difference & finite element methods
Languages & Programming Environments
11:30 High-Level Languages (Houston) - Brook, RapidMind, Accelerator
12:00 Lunch
14

7

Schedule
1:30 Debugging & Profiling (Houston) - imdebug, DirectX/OpenGL shader IDEs, ShadeSmith
1:50 Direct GPU Computing (Segal) - CTM, Data Parallel Virtual Machine
High Performance GPGPU
2:00 GPGPU Strategies & Tricks (Owens) - GPU performance guidelines, scatter, conditionals
2:30 Performance Analysis & Arch Insights (Houston) - GPUBench, architectural models for programming
3:00 Break
15

Schedule
GPGPU In Practice
3:00 HavokFX (Harris) - Game Physics Simulation on GPUs
3:25 PeakStream Platform (Papakipos) - Commercial GPGPU platform, HPC case studies
3:50 GPGPU Cluster Computing (Houston) - Building GPU clusters; HMMer, GROMACS, Folding@Home
Conclusion
4:45 Question-and-answer session (All)
5:00 Wrap!

8

GPU Fundamentals: The Graphics Pipeline
[Figure: the Application (CPU) sends Graphics State and Vertices (3D) to the GPU; Transform & Light → Xformed, Lit Vertices (2D) → Assemble Primitives → Screenspace triangles (2D) → Rasterize → Fragments (pre-pixels) → Shade → Final Pixels (Color, Depth); render-to-texture feeds results back into Video Memory (Textures)]
• A simplified traditional graphics pipeline
– It’s actually ultra-parallel!
– Note that pipe widths vary
– Many caches, FIFOs, and so on not shown
17

GPU Fundamentals: The Recent Graphics Pipeline
[Figure: the same pipeline, with a programmable Vertex Processor in place of Transform & Light and a programmable Fragment Processor in place of Shade]
• Programmable vertex processor!
• Programmable pixel processor!
18

9

GPU Fundamentals: The New Graphics Pipeline
[Figure: pipeline with Vertex Processor, a programmable Geometry Processor at primitive assembly, Rasterizer, and Fragment Processor]
• Programmable primitive assembly!
• More flexible memory access!
• And much, much more!
19

GPU Pipeline: Transform • Vertex processor (multiple in parallel) – Transform from “world space” to “image space” – Compute per-vertex lighting

20

10

GPU Pipeline: Rasterize • Primitive Assembly & Rasterization – Convert vertices into primitives with area • Triangles, quadrilaterals, points

– Convert geometric rep. (vertex) to image rep. (fragment) • Fragment = image fragment – Pixel + associated data: color, depth, stencil, etc.

– Interpolate per-vertex quantities across pixels

21

GPU Pipeline: Shade • Fragment processors (multiple in parallel) – Compute a color for each pixel – Optionally read colors from textures (images)

22

11

Introduction to GPGPU Programming David Luebke NVIDIA

Outline
• Data Parallelism and Stream Processing
• Computational Resources Inventory
• CPU-GPU Analogies
• Example: N-body gravitational simulation

24

12

The Importance of Data Parallelism • GPUs are designed for graphics – Highly parallel tasks

• Graphics processes independent vertices & pixels – Temporary registers are zeroed – No shared or static data – No read-modify-write buffers

• Data-parallel processing – GPU architecture is ALU-heavy • Multiple vertex & pixel pipelines, multiple ALUs per pipe

– Hide memory latency (with more computation)

25

Arithmetic Intensity • Arithmetic intensity – ops per word transferred – Computation / bandwidth

• Best to have high arithmetic intensity • Ideal GPGPU apps have – Large data sets – High parallelism – High independence between data elements

26

13

Stream Processing • Streams – Collection of records requiring similar computation • Vertex positions, Voxels, FEM cells, etc.

– Provide data parallelism

• Kernels – Functions applied to each element in stream • transforms, PDE, …

– Few dependencies between stream elements • Encourage high Arithmetic Intensity

27

Example: Simulation Grid • Common GPGPU computation style – Textures represent computational grids = streams

• Many computations map to grids
– Matrix algebra
– Image & Volume processing
– Physically-based simulation
– Global Illumination • ray tracing, photon mapping, radiosity

• Non-grid streams can be mapped to grids

28

14

Stream Computation • Grid Simulation algorithm – Made up of steps – Each step updates entire grid – Must complete before next step can begin

• Grid is a stream, steps are kernels – Kernel applied to each stream element

Cloud simulation algorithm

29

Scatter vs. Gather • Grid communication – Grid cells share information

30

15

Computational Resources Inventory • Programmable parallel processors – Vertex, Geometry, & Fragment pipelines

• Rasterizer – Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants

• Texture unit – Read-only memory interface

• Render to texture – Write-only memory interface

31

Vertex Processor • Fully programmable (SIMD / MIMD) • Processes 4-vectors (RGBA / XYZW) • Capable of scatter but not gather – Can change the location of current vertex – Cannot read info from other vertices – Vertex Texture Fetch • Random access memory for vertices • Arguably still not gather

32

16

Fragment Processor
• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
– RAM read (texture fetch), but no RAM write
– Output address fixed to a specific pixel

• Typically more useful than vertex processor – More fragment pipelines than vertex pipelines – Direct output (fragment processor is at end of pipeline)

• More on scatter/gather later… 33

CPU-GPU Analogies • CPU programming is familiar – GPU programming is graphics-centric

• Analogies can aid understanding

34

17

CPU-GPU Analogies (CPU = GPU)
• Stream / Data Array = Texture
• Memory Read = Texture Sample
35

Kernels (CPU = GPU)
• Kernel / loop body / algorithm step = Fragment Program
36

18

Feedback • Each algorithm step depends on the results of previous steps • Each time step depends on the results of the previous time step

37

Feedback (CPU = GPU)
• Array Write (Grid[i][j] = x;) = Render to Texture
38

19

GPU Simulation Overview • Analogies lead to implementation – Algorithm steps are fragment programs • Computational kernels

– Current state is stored in textures – Feedback via render to texture

• One question: how do we invoke computation?

39

Invoking Computation • Must invoke computation at each pixel – Just draw geometry! – Most common GPGPU invocation is a full-screen quad

• Other Useful Analogies – Rasterization = Kernel Invocation – Texture Coordinates = Computational Domain – Vertex Coordinates = Computational Range

40

20

Typical “Grid” Computation
• Initialize “view” (so that pixels:texels::1:1)

  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  glOrtho(0, 1, 0, 1, 0, 1);
  glViewport(0, 0, outTexResX, outTexResY);

• For each algorithm step: – Activate render-to-texture – Setup input textures, fragment program – Draw a full-screen quad (1x1)
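As a concrete companion to the steps above, here is a minimal sketch of the “draw a full-screen quad” invocation in fixed-function OpenGL; it assumes the render target, input textures, and fragment program have already been bound as described:

  /* One GPGPU pass: rasterizing this 1x1 quad runs the bound
   * fragment program once per output texel. */
  glBegin(GL_QUADS);
  glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
  glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
  glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
  glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
  glEnd();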

41

Example: N-Body Simulation
• Brute force: N = 8192 bodies
• N² gravity computations, 64M force computations / frame
• ~25 flops per force, 7.5 fps
• 12.5+ GFLOPS sustained
– GeForce 6800 Ultra

Nyland, Harris, Prins, GP2 2004 poster

42

21

Computing Gravitational Forces • Each body attracts all other bodies – N bodies, so N² forces

• Draw into an NxN buffer – Pixel (i,j) computes force between bodies i and j – Very simple fragment program • More than 2048 bodies makes it trickier – Limited by max texture size… – “exercise for the reader”

43

Computing Gravitational Forces

F(i,j) = g · Mi · Mj / r(i,j)², where r(i,j) = |pos(i) - pos(j)|

Force is proportional to the inverse square of the distance between bodies 44

22

Computing Gravitational Forces
[Figure: an N×N force texture; pixel (i,j) computes force(i,j) between bodies i and j, with coordinates (i,j) used to look up both bodies in the N-element body position texture]
F(i,j) = g · Mi · Mj / r(i,j)², r(i,j) = |pos(i) - pos(j)|
Coordinates (i,j) in the force texture are used to find bodies i and j in the body position texture
45

Computing Gravitational Forces

  float4 force(float2 ij : WPOS,
               uniform sampler2D pos) : COLOR0
  {
    // Pos texture is 2D, not 1D, so we need to
    // convert body index into 2D coords for pos tex
    float4 iCoords  = getBodyCoords(ij);
    float4 iPosMass = texture2D(pos, iCoords.xy);
    float4 jPosMass = texture2D(pos, iCoords.zw);
    float3 dir = iPosMass.xyz - jPosMass.xyz;
    float  r2  = dot(dir, dir);
    dir = normalize(dir);
    return dir * g * iPosMass.w * jPosMass.w / r2;
  }

46

23

Computing Total Force
• Have: array of (i,j) forces [the N×N force texture]
• Need: total force on each particle i
47

Computing Total Force
• Have: array of (i,j) forces
• Need: total force on each particle i
– Sum of each column of the force array
48

24

Computing Total Force
• Have: array of (i,j) forces
• Need: total force on each particle i
– Sum of each column of the force array
• Can do all N columns in parallel
• This is called a Parallel Reduction
49

Parallel Reductions
• 1D parallel reduction:
– sum N columns or rows in parallel
– add two halves of texture together
[N×N texture]
50

25

Parallel Reductions
• 1D parallel reduction:
– sum N columns or rows in parallel
– add two halves of texture together
– repeatedly...
[N×(N/2) texture]
51

Parallel Reductions
• 1D parallel reduction:
– sum N columns or rows in parallel
– add two halves of texture together
– repeatedly...
[N×(N/4) texture]
52

26

Parallel Reductions
• 1D parallel reduction:
– sum N columns or rows in parallel
– add two halves of texture together
– repeatedly...
– until we’re left with a single row of texels [N×1]
• Requires log₂N steps
53

Update Positions and Velocities • Now we have a 1-D array of total forces – One per body

• Update Velocity – u(i,t+dt) = u(i,t) + Ftotal(i) * dt – Simple fragment shader reads previous velocity and force textures, creates new velocity texture

• Update Position – x(i, t+dt) = x(i,t) + u(i,t) * dt – Simple fragment shader reads previous position and velocity textures, creates new position texture 54
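A minimal sketch of what such a velocity-update fragment shader could look like, in the same Cg style as the force shader above; the texture names, the choice to store mass in the velocity texture’s w component, and the simple Euler step are assumptions for illustration:

  float4 updateVelocity(float2 coords : TEXCOORD0,
                        uniform sampler2D vel,    // previous velocity texture
                        uniform sampler2D force,  // reduced total-force texture
                        uniform float dt) : COLOR0
  {
    float4 v = texture2D(vel, coords);
    float4 f = texture2D(force, coords);
    // a = F/m (mass assumed in v.w); forward Euler step
    return float4(v.xyz + (f.xyz / v.w) * dt, v.w);
  }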

27

Summary • Presented mappings of basic computational concepts to GPUs – Basic concepts and terminology – For introductory “Hello GPGPU” sample code, see http://www.gpgpu.org/developer

• Only the beginning: – Rest of course presents advanced techniques, strategies, and specific algorithms.

55

Data-Parallel Algorithms on GPUs Mark Harris NVIDIA Developer Technology

28

Outline
• Introduction
• Algorithmic complexity on GPUs
• Algorithmic Building Blocks
– Gather & Scatter
– Reductions
– Scan (parallel prefix)
– Sort
– Search

57

Data-Parallel Algorithms • The GPU is a data-parallel processor – Data-parallel kernels of applications can be accelerated on the GPU

• Efficient algorithms require efficient building blocks • This talk: data-parallel building blocks – Gather & Scatter – Reduce and Scan – Sort and Search

58

29

Algorithmic Complexity on GPUs • We will use standard “Big O” notation – e.g., optimal sequential sort is O(n log n)

• GPGPU element of parallelism is the pixel – Each pixel generates one output element – O(n) typically means n pixels processed

• In general, GPGPU O(n) usually means O(n/p) processing time – p is the number of “pixel processors” on the GPU • e.g. NVIDIA G70 has 24 pixel shader pipelines

59

Step vs. Work Complexity • Important to distinguish between the two • Work Complexity: O(# pixels processed) • Step Complexity: O(# rendering passes)

60

30

Data-Parallel Building Blocks
• Gather & Scatter
• Reduce
• Scan
• Sort
• Search

61

Scatter vs. Gather • Gather: p = a[i] – Vertex or Fragment programs

• Scatter: a[i] = p – Vertex programs only

62

31

Scatter Techniques • Scatter not available on most GPUs – Recently available on ATI CTM (see later talks)

• Problem: a[i] = p – Indirect write – Can’t set the x,y of fragment in pixel shader – Often want to do a[i] += p

63

Scatter Technique 1
• Convert to Gather

  for each spring:
      f = computed force
      mass_force[left]  += f;
      mass_force[right] -= f;

64

32

Scatter Technique 1
• Convert to Gather

  for each spring:
      f = computed force
  for each mass:
      mass_force = f[left] - f[right]
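Read as a CPU sketch, the converted two-pass version might look like this for a 1D mass-spring chain; the chain layout (mass i sits between springs i-1 and i) and the boundary handling are assumptions:

  void gather_forces(const float *spring_f, float *mass_force, int n_masses)
  {
      for (int i = 0; i < n_masses; i++) {
          /* each mass gathers from the springs on either side;
             no scattered writes are needed */
          float left  = (i > 0)            ? spring_f[i - 1] : 0.0f;
          float right = (i < n_masses - 1) ? spring_f[i]     : 0.0f;
          mass_force[i] = left - right;
      }
  }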

65

Scatter Technique 2 • Address Sorting – Sort & Search • Shader outputs destination address and data • Bitonic Sort based on address (see sorting, later) • Run binary search shader over destination buffer – Each fragment searches for source data

66

33

Scatter Technique 3 • Vertex Processor – Render Points • Use vertex shader to set destination • Or just read back the data and re-issue

– Vertex Textures • Render data and address to texture • Issue points, set point x,y in vertex shader using address texture

67

Parallel Reductions • Given: – Binary associative operator ⊕ with identity I – Ordered set s = [a0, a1, …, an-1] of n elements

• reduce(⊕, s) returns a0 ⊕ a1 ⊕ … ⊕ an-1 • Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25 • Reductions common in parallel algorithms – Common reduction operators are +, ×, min and max 68

34

Parallel Reductions on the GPU
• 1D parallel reduction:
– add two halves of texture together
– repeatedly...
– until we’re left with a single row of texels
[N → N/2 → N/4 → … → 1]
• O(log₂N) steps, O(N) work (a shader sketch of one pass follows below)
69
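One reduction pass could be written as a fragment program along these lines; this is a sketch assuming unnormalized texture-rectangle addressing and a row-oriented reduction, with names invented for illustration:

  float4 reducePass(float2 coords : TEXCOORD0,
                    uniform samplerRECT tex,
                    uniform float halfWidth) : COLOR0
  {
    // each output texel sums its own texel and the one halfWidth
    // columns to the right; rendering a halfWidth-wide quad and
    // ping-ponging repeatedly shrinks the row to a single texel
    return texRECT(tex, coords) +
           texRECT(tex, coords + float2(halfWidth, 0));
  }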

Multiple 1D Parallel Reductions
• Can run many reductions in parallel
– Use 2D texture and reduce one dimension
[M×N → M×(N/2) → M×(N/4) → … → M×1]
70

35

2D reductions • Like 1D reduction, only reduce in both directions simultaneously

– Note: can add more than 2x2 elements per pixel • Trade per-pixel work for # steps • Best perf depends on specific GPU (cache, etc.) 71

Parallel Scan (aka prefix sum) • Given: – Binary associative operator ⊕ with identity I – Ordered set s = [a0, a1, …, an-1] of n elements

• scan(⊕, s) returns [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)]

• Example: scan(+, [3 1 7 0 4 1 6 3]) = [3 4 11 11 15 16 22 25] (From Blelloch, 1990, “Prefix Sums and Their Applications”) 72
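For reference, the sequential scan that the parallel versions are measured against is a single O(n) loop; a minimal C sketch:

  /* inclusive prefix sum: out[i] = in[0] + ... + in[i] */
  void scan_inclusive(const float *in, float *out, int n)
  {
      float sum = 0.0f;
      for (int i = 0; i < n; i++) {
          sum += in[i];
          out[i] = sum;
      }
  }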

36

A Naïve Parallel Scan Algorithm
log(n) iterations:

  In:        3  1  7  0  4  1  6  3
  Stride 1:  3  4  8  7  4  5  7  9
  Stride 2:  3  4 11 11 12 12 11 14
  Stride 4:  3  4 11 11 15 16 22 25   (Out)

For i from 1 to log(n)-1:
• Render a quad from 2^i to n. Fragment k computes vout = v[k] + v[k-2^i].
• Due to ping-pong, render a 2nd quad from 2^(i-1) to 2^i with a simple pass-through shader vout = vin.
Note: can’t read and write the same texture, so must “ping-pong”
73

A Naïve Parallel Scan Algorithm • Algorithm given in more detail in [Horn ‘05] • Step efficient, but not work-efficient – O(log n) steps, but O(n log n) adds – Sequential version is O(n) – A factor of log(n) hurts: 20x for 10^6 elements!

• Dig into parallel algorithms literature for a better solution – See Blelloch 1990, “Prefix Sums and Their Applications” 74

37

Balanced Trees • Common parallel algorithms pattern – Build a balanced binary tree on the input data and sweep it to and from the root – Tree is conceptual, not an actual data structure

• For prescan: – Traverse down from leaves to root building partial sums at internal nodes in the tree • Root holds sum of all leaves

– Traverse back up the tree building the scan from the partial sums

75

Balanced Tree Scan
1. First build sums in place up the tree
2. Use partial sums to scan back down and generate the scan
• Note: tricky to implement using graphics API
– Due to interleaving of new and old results
– Can reformulate layout
Figure courtesy Shubho Sengupta

76

38

Further Improvement • [Sengupta et al. ’06] observes that balanced tree algorithm is not step-efficient – Loses efficiency on steps that contain fewer pixels than the GPU has pipelines

• Hybrid work-efficient / step-efficient algorithm – Simply switch from balanced tree to naïve algorithm for smaller steps

77

Hybrid work- and step-efficient algo

Figure courtesy Shubho Sengupta

78

39

Parallel Sorting • Given an unordered list of elements, produce list ordered by key value – Kernel: compare and swap

• GPUs’ constrained programming environment limits viable algorithms – Bitonic merge sort [Batcher 68] – Periodic balanced sorting networks [Dowd 89]

79

Bitonic Merge Sort Overview • Repeatedly build bitonic lists and then sort them – Bitonic list is two monotonic lists concatenated together, one increasing and one decreasing. • List A: (3, 4, 7, 8) monotonically increasing • List B: (6, 5, 2, 1) monotonically decreasing • List AB: (3, 4, 7, 8, 6, 5, 2, 1) bitonic

80

40

Bitonic Merge Sort
[Figure sequence (slides 81-93): the compare-and-swap network applied, one rendering pass per stage, to the list 3 7 4 8 6 2 1 5]
• 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5); 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
• Sort the bitonic lists → 4x monotonic lists: (3,7) (8,4) (2,6) (5,1); 2x bitonic lists: (3,7,8,4) (2,6,5,1)
• Sort the bitonic lists → 2x monotonic lists: (3,4,7,8) (6,5,2,1); 1x bitonic list: (3,4,7,8, 6,5,2,1)
• Sort the bitonic list → 1 2 3 4 5 6 7 8. Done!

Bitonic Merge Sort Summary
• Separate rendering pass for each set of swaps
– Step Complexity: O(log²n) passes
– Each pass performs n compare/swaps
– Work Complexity (total compare/swaps): O(n log²n)
• Limitations of the GPU cost us a factor of log n over optimal sequential sorting algorithms

94
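For readers who want the network in code form, here is a compact CPU sketch of bitonic merge sort; each iteration of the (k, j) loops corresponds to one GPU rendering pass, and the innermost loop is what the fragments perform in parallel:

  void bitonic_sort(float *a, int n)   /* n must be a power of two */
  {
      for (int k = 2; k <= n; k <<= 1) {          /* bitonic list size */
          for (int j = k >> 1; j > 0; j >>= 1) {  /* one pass per (k, j) */
              for (int i = 0; i < n; i++) {       /* parallel on the GPU */
                  int p  = i ^ j;                 /* compare/swap partner */
                  int up = ((i & k) == 0);        /* direction of sublist */
                  if (p > i && (up ? a[i] > a[p] : a[i] < a[p])) {
                      float t = a[i]; a[i] = a[p]; a[p] = t;
                  }
              }
          }
      }
  }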

47

Making GPU Sorting Faster • Draw several quads with similar computation instead of single quad – Reduce decision making in fragment program

• Push work into vertex processor and interpolator – Reduce computation in fragment program

• More than one compare/swap per sort kernel invocation – Reduce computational complexity

95

Grouping Computation
[Figure: the same sorting network, with compare-and-swap operations that share the same computation grouped into larger quads]
96

48

Implementation Details • Specify interpolants for smaller quads – ‘down’ or ‘up’ compare and swap – distance to comparison partner

• See Kipfer & Westermann article in GPU Gems 2 and Kipfer et al. Graphics Hardware 04 for more details

97

GPU Sort [Govindaraju et al. 05] • Use blending operators for comparison • Use texture mapping hw to map sorting op. • Further improvements with GPUTeraSort – [Govindaraju et al. 2006]

– Beat the PennySort benchmark for 2006

98

49

Parallel Binary Search

99

Binary Search • Find a specific element in an ordered list • Implement just like CPU algorithm – Assuming hardware supports long enough shaders – Finds the first element of a given value v • If v does not exist, find next smallest element > v

• Search algorithm is sequential, but many searches can be executed in parallel – Number of pixels drawn determines number of searches executed in parallel • 1 pixel == 1 search 100

50

Binary Search (worked example, slides 101-110)
Sorted list: v0 v0 v0 v2 v2 v2 v5 v5 (indices 0-7)

Search for v0:
• Initialize: 4 - search starts at the center of the sorted array; v2 >= v0, so search the left half of the sub-array
• Step 1: 2 - v0 >= v0, so search the left half of the sub-array
• Step 2: 1 - v0 >= v0, so search the left half of the sub-array
• Step 3: 0 - at this point, we have either found v0 or are 1 element too far left; one last step to resolve
• Step 4: 0 - done!

Search for v0 and v2 in parallel:
• Initialize: 4, 4 - both searches proceed to the left half of the array
• Step 1: 2, 2 - the search for v0 continues as before; the search for v2 overshot, so go back to the right
• Step 2: 1, 3 - we have found the proper v2, but are still looking for v0; both searches continue
• Step 3: 0, 2 - now we have found the proper v0, but overshot v2; the cleanup step takes care of this
• Step 4: 0, 3 - done! Both v0 and v2 are located properly

55

Binary Search Summary • Single rendering pass – Each pixel drawn performs independent search

• O(log n) step complexity – PS3.0 GPUs have dynamic branching and looping – So these steps are inside the pixel shader, not separate rendering passes

• O(m log n) work complexity – m is the number of parallel searches

111
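The per-pixel search is just the classic CPU loop; a reference C sketch of the “find the first element >= v” variant described above (the walkthrough’s final cleanup step resolves the off-by-one):

  int lower_bound(const float *key, int n, float v)
  {
      int lo = 0, hi = n;              /* answer lies in [lo, hi] */
      while (lo < hi) {                /* O(log n) steps */
          int mid = (lo + hi) / 2;
          if (key[mid] >= v) hi = mid;      /* v is at mid or left of it */
          else               lo = mid + 1;  /* v is right of mid */
      }
      return lo;    /* first index whose key >= v */
  }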

References
• Prefix Sums and Their Applications. Guy E. Blelloch. Technical Report CMU-CS-90-190, November 1990.
• A Toolkit for Computation on GPUs. Ian Buck and Tim Purcell. In GPU Gems, Randy Fernando, ed., 2004.
• GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management. Naga Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. In Proceedings of ACM SIGMOD 2006.
• Stream Reduction Operations for GPGPU Applications. Daniel Horn. In GPU Gems 2, Matt Pharr, ed., 2005.
• Improved GPU Sorting. Peter Kipfer. In GPU Gems 2, Matt Pharr, ed., 2005.
• A Work-Efficient Step-Efficient Prefix Sum Algorithm. Shubhabrata Sengupta, Aaron E. Lefohn, John D. Owens. In Proceedings of the 2006 Workshop on Edge Computing Using New Commodity Architectures.
112

56

GPU Memory Model Overview John Owens University of California, Davis

Memory Hierarchy
• CPU and GPU Memory Hierarchy
[Figure: CPU side - Disk, CPU Main Memory, CPU Caches, CPU Registers; GPU side - GPU Video Memory, GPU Caches, GPU Constant Registers, GPU Temporary Registers]
114

57

CPU Memory Model • At any program point – Allocate/free local or global memory – Random memory access • Registers – Read/write

• Local memory – Read/write to stack

• Global memory – Read/write to heap

• Disk – Read/write to disk

115

Cell • SPU memory model: • 128 128b local registers • 256 kB local store – 6 cycles access time

• Explicit, asynchronous DMA access to main memory – Allows comm/comp overlap

• No explicit I or D cache • No disk access

http://www.realworldtech.com/includes/images/articles/cell-1.gif

116

58

GPU Memory Model • Much more restricted memory access – Allocate/free memory only before computation – Limited memory access during computation (kernel) • Registers – Read/write

• Local memory – Does not exist

• Global memory – Read-only during computation – Write-only at end of computation (precomputed address)

• Disk access – Does not exist 117

GPU Memory Model • GPUs support many types of memory objects in hardware – 1D, 2D, 3D grids • 2D is most common (framebuffer, texture)

– 2D cube maps (6 faces of a cube) – Mipmapped (prefiltered) versions – DX10 adds arrayed datatypes

• Each native datatype has pros and cons from a general-purpose programming perspective

118

59

Traditional GPU Pipeline
• Inputs: Vertex data, Texture data
• Output: Framebuffer
[Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s), with Texture feeding the Fragment Processor]
119

GPU Memory Model (DX9)
• Extending memory functionality
– Copy from framebuffer to texture
– Texture reads from vertex processor (VS 3.0 GPUs)
– Render to vertex buffer
[Figure: the pipeline with Texture readable by both Vertex and Fragment Processors, plus framebuffer-to-texture and render-to-vertex-buffer paths]
120

60

GPU Memory Model (DX10, traditional)
• More flexible memory handling
– All programmable units can read texture
– “Stream out” after geometry processor
[Figure: Vertex Buffer → Vertex Processor → Geometry Processor → Rasterizer → Fragment Processor → Arrayed Frame Buffer(s); Arrayed Texture readable by all programmable stages; Stream Out path after the geometry processor]
121

GPU Memory Model (DX10, new)
• DX10 provides “resources”
• Resources are flexible!
[Figure: DX10 Resources feeding the Vertex Processor, Geometry Processor, Rasterizer, and Fragment Processor]
122

61

GPU Memory API • Each GPU memory type supports subset of the following operations – CPU interface – GPU interface

123

GPU Memory API
• CPU interface
– Allocate
– Free
– Copy CPU → GPU
– Copy GPU → CPU
– Copy GPU → GPU
– Bind for read-only vertex stream access
– Bind for read-only random access
– Bind for write-only framebuffer access

124

62

GPU Memory API • GPU (shader/kernel) interface – Random-access read – Stream read

125

DX10 View of Memory [Resources: Buffers | Textures | Views]

• Resources – Encompass buffers and textures – Retained state is stored in resources – Must be bound by API to pipeline stages before called • Same subresource cannot be bound for both read and write simultaneously

126

63

DX10 View of Memory [Resources: Buffers | Textures | Views]

• Buffers – Collection of elements • Few requirements on type or format (heterogeneous) • Elements are 1-4 components (e.g. R8G8B8A8, 8b int, 4x32b float)

– No filtering, subresourcing, multisampling – Layout effectively linear (“casting” is possible) – Examples: vertex buffers, index buffers, ConstantBuffers 127

DX10 View of Memory [Resources: Buffers | Textures | Views]

• Textures – Collection of texels – Can be filtered, subresourced, arrayed, mipmapped – Unlike buffers, must be declared with texel type • Type impacts filtering

– Layout is opaque - enables memory layout optimization – Examples: texture{1,2,3}d, mipmapped, cubemap 128

64

DX10 View of Memory [Resources: Buffers | Textures | Views]

• Views – “mechanism for hardware interpretation of a resource in memory” – Allows structured access of subresources – Restricting view may increase efficiency 129

Big Picture: GPU Memory Model • GPUs are a mix of: – Historical, fixed-function capabilities – Newer, flexible, programmable capabilities

• Fixed-function: – Known access patterns, behaviors – Accelerated by special-purpose hardware

• Programmable: – Unknown access patterns – Generality good

• Memory model must account for both – Consequence: Ample special-purpose functionality – Consequence: Restricting flexibility may improve performance 130

65

DX10 Bind Rules
[Figure: the pipeline annotated with bind rules]
• Shader Resource Input: anything, but can only bind views
• Shader Constants: must be created as shader constant; can’t use in other views
• Depth/Stencil Output: not buffers/texture3D; can only bind views of other resources
• Input Assembler: buffers
• StreamOut: buffers
• Render Target Output: anything, but can only bind views
131

Example: Texture
• Texture mapping fundamental primitive in GPUs
• Most typical use: random access, bound for read only, 2D texture map
– Hardware-supported caching & filtering
[Figure: Texture feeding the Fragment Processor in the pipeline]
132

66

Example: Framebuffer
• Memory written by fragment processor
• Write-only GPU memory (from shader’s point of view)
– FB is read-modify-write by the pipeline as a whole
• Displayed to screen
• Can also store GPGPU results (not just color)
[Figure: the pipeline ending in the Frame Buffer(s)]
133

Example: Render to Texture
• Very common in both graphics & GPGPU
• Allows multipass algorithms
– Pass 1: Write data into framebuffer
– Pass 2: Bind as texture, read from texture
• Store up to 32 32b FP values/pixel
[Figure: Frame Buffer(s) feeding back into Texture]
134

67

Example: Render to Vertex Array
• Enables top-of-pipe feedback loop
• Enables dynamic creation of geometry on GPU
[Figure: Frame Buffer(s) feeding back into the Vertex Buffer]
135

Example: Stream Out to Vertex Buffer
• Enabled by DX10 StreamOut capability
• Expected to be used for dynamic geometry
– Recall geometry processor produces 0-n outputs per input
• Possible graphics applications:
– Expand point sprites
– Extrude silhouettes
– Extrude prisms/tets
[Figure: Stream Out path from the Geometry Processor back to the Vertex Buffer]
136

68

Summary • Rich set of hardware primitives – Designed for special purpose tasks, but often useful for general purpose ones

• Memory usage generally more restrictive than other processors – Becoming more general-purpose and orthogonal

• Restricting generality allows hw/sw to cooperate for higher performance

137

GPU Data Structures Aaron Lefohn Neoptica

69

Introduction • Previous talk: GPU memory model • This talk: GPU data structures – Basic building block is 2D array – (ATI’s CTM supports large 1D arrays…discussed later)

• Overview – Dense arrays – Sparse arrays – Adaptive arrays

139

GPU Arrays • Large 1D Arrays – Current GPUs limit 1D array sizes to 2048 or 4096 – Pack into 2D memory – 1D-to-2D address translation

140
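The 1D-to-2D address translation is a one-liner in a shader; a sketch in the Cg style used elsewhere in these notes, assuming the 1D array is packed row-major into a texture of the given width:

  // map 1D index i to 2D texel coordinates (texel centers at +0.5)
  float2 addr1Dto2D(float i, float width)
  {
    return float2(fmod(i, width) + 0.5, floor(i / width) + 0.5);
  }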

70

GPU Arrays • 2D Arrays – Trivial, native implementation

141

GPU Arrays
• 3D Arrays
– Problem: GPUs do not have 3D frame buffers
– Solutions:
1. Multiple slices per 2D buffer (“Flat 3D array”)
2. Stack of 2D slices
3. Render-to-slice of 3D array

142

71

GPU Arrays • DX 10 memory model helps – Can render to slice of 3D texture

• “Flat 3D array” still has advantages – Render entire domain in single parallel compute pass – More parallelism when writing (slower for reading)

143

GPU Arrays • Higher Dimensional Arrays – Pack into 2D buffers – N-D to 2D address translation

144

72

Sparse/Adaptive Data Structures • Why? – Reduce memory pressure – Reduce computational workload

• Basic Idea – Pack “active” data elements into GPU memory

145

Page Table Sparse/Adaptive Arrays
• Dynamic sparse/adaptive N-D array (Lefohn et al. 2003/2006)
[Figure: Virtual Domain → Page Table → Physical Memory]
146

73

Dynamic Adaptive Data Structure • Photon map (kNN-grid)

(Purcell et al. 2003)

Image from “Implementing Efficient Parallel Data Structures on GPUs,” Lefohn et al., GPU Gems II, ch. 33, 2005 147

GPU Perfect Hash Table • Static, sparse N-D array

(Lefebvre et al. 2006)

Figure from Lefebvre, Hoppe, “Perfect Spatial Hashing,” ACM SIGGRAPH, 2006 148

74

GPU Iteration • GPU good at random-access read – Often do not need to define input iterators

• GPU (mostly) performs only streaming writes – Every GPGPU data structure must support an output iterator compatible with GPU rasterization • Traverse physical data layout • Example – Draw single quad to iterate over voxels in flattened volume – Draw one quad per page to iterate over page-table-based array 149

GPU Iteration • Optimizing input iterators – Galoppo et al. optimized input and output iterators for their matrix representation • Match 2D GPU memory layout • Match memory access pattern of algorithm

– Difficult because memory layouts are unspecified and proprietary

Galoppo, Govindaraju, Henson, Manocha, “LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware,” ACM/IEEE Supercomputing, 2005 150

75

GPU Data Structures • Conclusions – Fundamental GPU memory primitive is a fixed-size 2D array – GPGPU needs more general memory model • Low-level interfaces such as ATI’s CTM are step in the right direction • Iterator access patterns must match storage layout

– Building complex data structures on GPU still hard • Data-parallel algorithms (sort, scan, etc.) can be used • Is less parallelism more efficient? 151

More Information • Overview (with code snippets) – Lefohn, Kniss, Owens, “Implementing Efficient Parallel Data Structures on GPUs,” Chapter 33, GPU Gems II

• High-level GPU data structures – Lefohn, Kniss, Strzodka, Sengupta, Owens, “Glift: Generic, Efficient, Random-Access GPU Data Structures,” ACM Transactions on Graphics, 2006

• GPGPU State-of-the-art report – Owens, Luebke, Govindaraju, Harris, Krüger, Lefohn, Purcell, “A Survey of General-Purpose Computation on Graphics Hardware,” Eurographics STAR report, 2005 152

76

Sorting and Searching Naga Govindaraju

Topics • Sorting – Sorting networks

• Search – Binary search – Searching quantiles

154

77

Assumptions • Data organized into 1D arrays • Rendering pass == screen aligned quad – Not using vertex shaders

• PS 2.0 GPU – No data dependent branching at fragment level

155

Sorting

156

78

Sorting • Given an unordered list of elements, produce list ordered by key value – Kernel: compare and swap

• GPUs’ constrained programming environment limits viable algorithms – Bitonic merge sort [Batcher 68] – Periodic balanced sorting networks [Dowd 89]

157

Bitonic Merge Sort Overview • Repeatedly build bitonic lists and then sort them – Bitonic list is two monotonic lists concatenated together, one increasing and one decreasing. • List A: (3, 4, 7, 8) monotonically increasing • List B: (6, 5, 2, 1) monotonically decreasing • List AB: (3, 4, 7, 8, 6, 5, 2, 1) bitonic

158

79

Bitonic Merge Sort
[Figure sequence (slides 159-171): the compare-and-swap network applied, one rendering pass per stage, to the list 3 7 4 8 6 2 1 5]
• 8x monotonic lists: (3) (7) (4) (8) (6) (2) (1) (5); 4x bitonic lists: (3,7) (4,8) (6,2) (1,5)
• Sort the bitonic lists → 4x monotonic lists: (3,7) (8,4) (2,6) (5,1); 2x bitonic lists: (3,7,8,4) (2,6,5,1)
• Sort the bitonic lists → 2x monotonic lists: (3,4,7,8) (6,5,2,1); 1x bitonic list: (3,4,7,8, 6,5,2,1)
• Sort the bitonic list → 1 2 3 4 5 6 7 8. Done!

Bitonic Merge Sort Summary
• Separate rendering pass for each set of swaps
– O(log²n) passes
– Each pass performs n compare/swaps
– Total compare/swaps: O(n log²n)
• Limitations of the GPU cost us a factor of log n over the best CPU-based sorting algorithms

172

86

Making GPU Sorting Faster • Draw several quads with similar computation instead of single quad – Reduce decision making in fragment program

• Push work into vertex processor and interpolator – Reduce computation in fragment program

• More than one compare/swap per sort kernel invocation – Reduce computational complexity

173

Grouping Computation
[Figure: the sorting network with compare-and-swap operations that share the same computation grouped into larger quads]
174

87

Implementation Details • Specify interpolants for smaller quads – ‘down’ or ‘up’ compare and swap – distance to comparison partner

• See Kipfer & Westermann article in GPU Gems 2 and Kipfer et al. Graphics Hardware 04 for more details

175

GPU Sort • Use blending operators for comparison • Use texture mapping hw to map sorting op.

176

88

2D Memory Addressing • GPUs optimized for 2D representations – Map 1D arrays to 2D arrays – Minimum and maximum regions mapped to row-aligned or column-aligned quads

177

1D – 2D Mapping
[Figure: minimum and maximum regions of the 1D array mapped to row-aligned or column-aligned quads (MIN, MAX)]
178

89

1D – 2D Mapping
[Figure: row-aligned quads covering the MIN regions]
• Effectively reduce instructions per element
179

Sorting on GPU: Pipelining and Parallelism
[Figure: Input Vertices → Texturing, Caching, and 2D Quad Comparisons → Sequential Writes]
180

90

Comparison with GPU-Based Algorithms
[Performance graph]
• 3-6x faster than prior GPU-based algorithms!
181

GPU vs. High-End Multi-Core CPUs
[Performance graph]
• 2-2.5x faster than Intel high-end processors
• Single GPU performance comparable to a high-end dual-core Athlon
• Optimized CPU code from Intel Corporation
182

91

GPU vs. High-End Multi-Core CPUs
• 2-2.5x faster than Intel high-end processors
• Single GPU performance comparable to a high-end dual-core Athlon
• Slashdot and Tom’s Hardware Guide headlines, June 2005
183

N. Govindaraju, S. Larsen, J. Gray, and D. Manocha, Proc. of ACM SuperComputing, 2006

GPU Cache Model
• Small data caches
– Help hide the memory latency
– Vendors do not disclose cache information, which is critical for scientific computing on GPUs
• We design a simple model
– Determine cache parameters (block and cache sizes)
– Improve sorting, FFT and SGEMM performance

184

92

Cache Evictions
[Figure: the bitonic network’s access pattern causing repeated cache evictions]
185

Cache Issues
[Figure: a sorting step of height h over a W×H texture with cache block size B]
Cache misses per step = 2WH / (hB)
186

93

Analysis
• lg n possible steps in the bitonic sorting network
• Step k is performed (lg n - k + 1) times, with h = 2^(k-1)
• Data fetched from memory = 2n·f(B), where f(B) = (B-1)(lg n - 1) + 0.5·(lg n - lg B)²

187

Block Sizes on GPUs

188

94

Cache-Efficient Algorithm
[Figure: blocking the network so that each step of height h fits in the cache]
189

Cache Sizes on GPUs

190

95

Cache-Efficient Algorithm Performance

191

Super-Moore’s Law Growth

50 GB/s on a single GPU

Peak Performance: Effectively hide memory latency with 15 GOP/s

192

96

N. Govindaraju, J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006

External Memory Sorting

• Performed on terabyte-scale databases
• Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
– Limited main memory
– First phase: partitions the input file into large data chunks and writes sorted chunks known as “Runs”
– Second phase: merges the “Runs” to generate the sorted file 193

External Memory Sorting • Performance mainly governed by I/O Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

194

97

Salzberg Analysis • If N=100GB, T=2MB, then R ≈ 230MB • Large data sorting is inefficient on CPUs – R » CPU cache sizes – memory latency

195

External memory sorting • External memory sorting on CPUs can have low performance due to – High memory latency – Or low I/O performance

• Our algorithm – Sorts large data arrays on GPUs – Perform I/O operations in parallel on CPUs

196

98

GPUTeraSort

197

I/O Performance Salzberg Analysis: 100 MB Run Size

198

99

I/O Performance Salzberg Analysis: 100 MB Run Size

Pentium IV: 25MB Run Size Less work and only 75% IO efficient!

199

I/O Performance Salzberg Analysis: 100 MB Run Size

Dual 3.6 GHz Xeons: 25MB Run size More cores, less work but only 85% IO efficient!

200

100

I/O Performance Salzberg Analysis: 100 MB Run Size

7800 GT: 100MB run size Ideal work, and 92% IO efficient with single CPU!

201

Task Parallelism

Performance limited by IO and memory

202

101

Overall Performance

Faster and more scalable than Dual Xeon processors (3.6 GHz)! 203

Performance/$

1.8x faster than current Terabyte sorter

World’s best performance/$ system

204

102

Advantages • Exploit high memory bandwidth on GPUs – Higher memory performance than CPU-based algorithms

• High I/O performance due to large run sizes

205

Advantages • Offload work from CPUs – CPU cycles well-utilized for resource management

• Scalable solution for large databases • Best performance/price solution for terabyte sorting 206

103

Searching

207

Types of Search • Search for specific element – Binary search

• Search for nearest element(s) – k-nearest neighbor search

• Both searches require ordered data

208

104

Binary Search • Find a specific element in an ordered list • Implement just like CPU algorithm – Assuming hardware supports long enough shaders – Finds the first element of a given value v • If v does not exist, find next smallest element > v

• Search algorithm is sequential, but many searches can be executed in parallel – Number of pixels drawn determines number of searches executed in parallel • 1 pixel == 1 search 209

Binary Search (worked example, slides 210-219)
Sorted list: v0 v0 v0 v2 v2 v2 v5 v5 (indices 0-7)

Search for v0:
• Initialize: 4 - search starts at the center of the sorted array; v2 >= v0, so search the left half of the sub-array
• Step 1: 2 - v0 >= v0, so search the left half of the sub-array
• Step 2: 1 - v0 >= v0, so search the left half of the sub-array
• Step 3: 0 - at this point, we have either found v0 or are 1 element too far left; one last step to resolve
• Step 4: 0 - done!

Search for v0 and v2 in parallel:
• Initialize: 4, 4 - both searches proceed to the left half of the array
• Step 1: 2, 2 - the search for v0 continues as before; the search for v2 overshot, so go back to the right
• Step 2: 1, 3 - we have found the proper v2, but are still looking for v0; both searches continue
• Step 3: 0, 2 - now we have found the proper v0, but overshot v2; the cleanup step takes care of this
• Step 4: 0, 3 - done! Both v0 and v2 are located properly

Binary Search Summary • Single rendering pass – Each pixel drawn performs independent search

• O(log n) steps

220

110

Searching for Quantiles
• Given a set of values on the GPU, compute the Kth-largest number
• Traditional CPU algorithms require arbitrary data writes; we need a new algorithm without
– Data rearrangement
– Data readback to CPU

• Our solution – search for the Kth-largest number

N. Govindaraju, B. Lloyd, W. Wang, M. Lin and D. Manocha, Proc. Of ACM SIGMOD, 2004 221

K-th Largest Number
• Let vk denote the k-th largest number
• How do we generate a number m equal to vk?
– Without knowing vk’s value
– Count the number of values ≥ some given value
– Starting from the most significant bit, determine the value of one bit at a time

222

111

K-th Largest Number
• Given a set S of values
– c(m): number of values ≥ m
– vk: the k-th largest number

• We have – If c(m) ≥ k, then m ≤ vk – If c(m) < k, then m > vk

• c(m) computed using occlusion queries

223

2nd Largest in 9 Values (worked example, slides 224-232)
Values: 0011 1011 1101 0111 0101 0001 0111 1010 0010; v2 = 1011
• Start with m = 0000
• Draw a quad at depth 8, compute c(1000): c(m) = 3 ≥ 2, so 1st bit = 1 (m = 1000)
• Draw a quad at depth 12, compute c(1100): c(m) = 1 < 2, so 2nd bit = 0 (m stays 1000)
• Draw a quad at depth 10, compute c(1010): c(m) = 3 ≥ 2, so 3rd bit = 1 (m = 1010)
• Draw a quad at depth 11, compute c(1011): c(m) = 2 ≥ 2, so 4th bit = 1 (m = 1011 = v2)

116

Our algorithm
• Initialize m to 0
• Start with the MSB and scan all bits till the LSB
• At each bit, put 1 in the corresponding bit position of m
• If c(m) < k, make that bit 0
• Proceed to the next bit

233
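A CPU sketch of this bit-by-bit search, for concreteness; on the GPU, the count c(m) comes from an occlusion query over a quad drawn at the corresponding depth, as in the worked example above:

  unsigned kth_largest(const unsigned *vals, int n, int k, int nbits)
  {
      unsigned m = 0;
      for (int b = nbits - 1; b >= 0; b--) {
          unsigned trial = m | (1u << b);   /* tentatively set this bit */
          int c = 0;                        /* c(trial): #values >= trial */
          for (int i = 0; i < n; i++)
              if (vals[i] >= trial) c++;
          if (c >= k) m = trial;            /* keep the bit iff c(m) >= k */
      }
      return m;                             /* m == vk */
  }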

Kth-Largest
[Performance graph, NV35]

234

117

Median

3x performance improvement per year! 235

GPGPU Mathematical Primitives Aaron Lefohn Neoptica

118

GPGPU Non-Linear PDEs

– Strzodka, Garbe, “Real-Time Motion Estimation and Visualization on Graphics Cards,” IEEE Visualization, 2004 237

GPGPU Direct Tridiagonal Solver

– 1000 tridiagonal linear systems of 1000 elements each – Kass, Lefohn, Owens, “Interactive Depth-of-Field,” Pixar Technical Report, 2006 238

119

Overview
• Linear Algebra
• Differential Equations
• Performance Results
• Summary

239

Linear Algebra on GPUs • The basics – Vector-vector – Matrix-vector – Matrix-matrix

240

120

Basics: Vector-Vector Operations • Add / subtract – Trivial parallel map operation

+

241

Basics: Vector-Vector Operations • Inner product / normalize – Trivial implementation might render a single fragment (1 thread) and perform serial computation • No parallelism

242

121

Basics: Vector-Vector Operations • Inner product / normalize – Parallel reduction

243

Basics: Matrix-Vector Multiplication
• N inner products in parallel (×N); a fragment-program sketch follows below
244
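As a sketch, each fragment can compute one of those inner products; this assumes a PS3.0-class GPU that supports looping in the fragment program, with matrix and vector stored one element per texel (names invented for illustration):

  float matVec(float2 coords : TEXCOORD0,   // coords.y selects the row
               uniform samplerRECT M,       // N×N matrix texture
               uniform samplerRECT x,       // N×1 vector texture
               uniform float N) : COLOR0
  {
    float sum = 0.0;
    for (float j = 0.5; j < N; j += 1.0)    // walk along row coords.y
      sum += texRECT(M, float2(j, coords.y)).x * texRECT(x, float2(j, 0.5)).x;
    return sum;                             // one element of y = Mx
  }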

122

Basics: Matrix-Matrix Operations • Add / transpose – Parallel map

245

Basics: Matrix-Matrix Operations
• Multiply
– N² inner products (×N²)
246

123

Basics: Matrix-Matrix Multiply • Interesting parallelism note – N2 inner products provide enough parallelism that it is OK to perform each one in a single pass • This differs from computing a single inner product, which must be parallelized for good performance

• Performance challenges – No writeable cache to capture reuse – Cache-to-register pathway is bottleneck on many GPUs 247

Basics: Sparse Matrices • N-Diagonal (Banded) – Store each diagonal as vector – Special cases: 1-4-Diagonal • Store diagonals in quadword elements of single vector

• Unstructured – ITPACK format (padded compressed row) is attractive (comes with Brook distribution) • Same number of non-zero elements in each row • Keeps computation SIMD

– Other formats use more indirection and a varying amount of computation per vector element 248

124

GPU Linear Solvers
• Solve My = x for y
– Use basic linear algebra parallel constructs
– Data-parallel algorithms
• Examples (a Jacobi sketch follows after this list)
– Conjugate gradient
– Jacobi
– Gauss-Seidel
– Dense LU-decomposition
– Tridiagonal LU-decomposition
249
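For instance, one Jacobi iteration for a 2D Poisson problem is a natural fragment program: render a full-screen quad per iteration and ping-pong between two solution textures. A minimal sketch, where the 5-point stencil, unit grid spacing, and texture names are assumptions:

  float jacobiStep(float2 ij : TEXCOORD0,
                   uniform samplerRECT x,    // current solution estimate
                   uniform samplerRECT b) : COLOR0
  {
    float left  = texRECT(x, ij + float2(-1,  0)).x;
    float right = texRECT(x, ij + float2( 1,  0)).x;
    float down  = texRECT(x, ij + float2( 0, -1)).x;
    float up    = texRECT(x, ij + float2( 0,  1)).x;
    // x_new = (sum of neighbors - b) / 4 for the 5-point Laplacian
    return (left + right + down + up - texRECT(b, ij).x) * 0.25;
  }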

Dense LU-Decomposition • GPU iterators (rasterization quads) – Match memory access pattern with GPU memory layout

Galoppo, Govindaraju, Henson, Manocha, “LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware,” ACM/IEEE Supercomputing, 2005 250

125

Tridiagonal LU-Decomposition
• Numerical Recipes algorithm is sequential
• Use scan operation to parallelize
– Cyclic reduction
– O(N) computation in O(log N) passes

251

GPGPU Differential Equations • Ordinary differential equation example • Partial differential equation examples

252

126

GPGPU ODEs
• N-body particle system
– Brute force solution maps well to GPU
• “Stream all N particles past all N particles”

  foreach pi in Particles
      foreach pj in Particles
          pi += computeInteraction(pi, pj)

• Replace outer loop with parallel GPU foreach

253

GPGPU ODEs • O(N log N) optimized algorithms – More difficult to map to GPUs • Must build irregular data structure each iteration – Neighbor lists or hierarchical grid

• Varying number of interactions per particle

– Architectural improvements making this easier • Scatter • More efficient conditional execution • See Mark Harris’s talk on Havok FX physics

254

127

GPGPU Partial Differential Equations • Example GPGPU PDE Applications – Navier-Stokes (incompressible fluids) – Level sets (deformable implicit surfaces) – Image processing • Registration • Segmentation • Computer vision

255

GPGPU Partial Differential Equations
• Explicit, finite difference PDE solvers map well to GPUs
– Gather small number of local neighbors
– Grid → texture
[Figure: stencil update from step n to step n+1 (figure from Robert Strzodka)]
256

128

GPU PDEs • Finite difference optimizations – Multigrid – Banded sparse grids – Adaptive grids

257

Performance Results
• Matrix-matrix multiply
– GPU: 110 GFLOPS (ATI X1900-series, CTM)
– CPU: 8-10 GFLOPS (single Intel P4 3.2 GHz)
– Cell: > 200 GFLOPS (3.2 GHz)

258

129

Performance Results
• Dense LU-Decomposition (SC ’05)
– 15%-35% faster than ATLAS (partial pivot)
• Matrix sizes > 3500²
• NVIDIA GeForce 7800 / Intel Pentium 4 3.4 GHz
– Up to 10x faster than LAPACK (full pivot)
• Intel Math Kernel Lib
• Matrix sizes > 3500²
• NVIDIA GeForce 7800 / Intel Pentium 4 3.4 GHz 259

Summary • Techniques – Use data-parallel linear algebra algorithms – Redefine memory access patterns for GPU • Contiguous output domain • Avoid scatter • Leverage 2D memory layout

– Minimize indirections

260

130

Summary • Challenges – No writeable cache / local store • Hard to beat block-based decomposition

– Must combine multiple operations before reading data back to CPU

• Iterative solvers work very well!

261

GPGPU Math Libraries – LU-GPU (dense LU-decomposition) • http://gamma.cs.unc.edu/LUGPULIB/

– Linear algebra framework • http://wwwcg.in.tum.de/Research/Publications/LinAlg

– GPUFFTW • http://gamma.cs.unc.edu/GPUFFTW/

– GPU FFT • http://sourceforge.net/projects/gpufft/

– PeakStream • http://www.peakstreaminc.com/

– RapidMind • http://www.rapidmind.com

262

131

References
• Kass, Lefohn, Owens, “Interactive Depth of Field” (cyclic reduction, direct tridiagonal linear solver), Pixar Technical Report, 2006
• Galoppo, Govindaraju, Henson, Manocha, “LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware,” ACM/IEEE Supercomputing, 2005
• Jiang, Snir, “Automatic Tuning Matrix Multiplication on Graphics Hardware,” Parallel Architecture and Compilation Techniques (PACT), 2005
• Fatahalian, Sugerman, Hanrahan, “Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication,” ACM/EG Graphics Hardware, 2004
• Lefohn, Kniss, Hansen, Whitaker, “Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware,” IEEE Transactions on Visualization and Computer Graphics, 2004
• Strzodka, Garbe, “Real-Time Motion Estimation and Visualization on Graphics Cards,” IEEE Visualization, 2004
• Harris, Baxter, Scheuermann, Lastra, “Simulation of Cloud Dynamics on Graphics Hardware,” ACM/EG Graphics Hardware, 2003
263

References
• Krüger, Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms,” ACM SIGGRAPH, 2003
• Bolz, Farmer, Grinspun, Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid,” ACM SIGGRAPH, 2003
• Hillesland, Molinov, Grzeszczuk, “Nonlinear Optimization Framework for Image-Based Modeling on Programmable Graphics Hardware,” ACM SIGGRAPH, 2003
• Harris, Coombe, Scheuermann, Lastra, “Physically-Based Visual Simulation on Graphics Hardware,” ACM/EG Graphics Hardware, 2002
• Rumpf, Strzodka, “Using Graphics Cards for Quantized FEM Computations,” IASTED Visualization, Imaging and Image Processing, 2001
264

132

High Level Languages for GPUs Mike Houston Stanford University

High Level Shading Languages
• Cg, HLSL, & OpenGL Shading Language
– Cg: http://www.nvidia.com/cg
– HLSL: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/highlevellanguageshaders.asp
– OpenGL Shading Language: http://www.3dlabs.com/support/developer/ogl2/whitepapers/index.html
266

133

Compilers: CGC & FXC • HLSL and Cg are syntactically almost identical – Exception: Cg 1.3 allows shader “interfaces”, unsized arrays

• Command line compilers – Microsoft’s FXC.exe • Compiles to DirectX vertex and pixel shader assembly only • fxc /Tps_3_0 myshader.hlsl

– NVIDIA’s CGC.exe • Compiles to everything • cgc -profile ps_3_0 myshader.cg

– Can generate very different assembly! • Driver will recompile code

– Compliance may vary 267

Babelshader http://graphics.stanford.edu/~danielrh/babelshader.html

• Converts between DirectX pixel shaders and OpenGL shaders • Allows OpenGL programs to use DirectX HLSL compilers to compile programs into ARB or fp30 assembly. Example Conversion Between Ps2.0 and ARB

• Enables fair benchmarking competition between the HLSL compiler and the Cg compiler on the same platform with the same demo and driver. 268

134

GPGPU Languages
• Why do we want them?
– Make programming GPUs easier!
• Don’t need to know OpenGL, DirectX, or ATI/NV extensions
• Simplify common operations
• Focus on the algorithm, not on the implementation
• Accelerator (Microsoft Research): http://research.microsoft.com/research/downloads/
• Brook (Stanford University): http://brook.sourceforge.net, http://graphics.stanford.edu/projects/brookgpu
• CTM (ATI Technologies)
• PeakStream: http://www.peakstreaminc.com
• RapidMind (commercial follow-on to Sh): http://www.rapidmind.net
269

Microsoft Research Accelerator Project • GPGPU programming using dataparallelism • Presents a data-parallel library to the programmer. – Simple, high-level set of operations

• Library just-in-time compiles to GPU pixel shaders or CPU code. – Runs on top of product version of .NET

270

135

Data-parallel array library • Explicit conversions between dataparallel arrays and normal arrays • Functional: each operation produces a new data-parallel array. • Eliminate certain operations on arrays to make them data-parallel – No aliasing, pointer arithmetic, individual element access

271

Data-parallel array types
[Figure: CPU-side data-parallel arrays DPArray1[…]..DPArrayN[…] and library_calls() map through the API/driver/hardware to GPU textures txtr1[…]..txtrN[…] and pixel shaders; normal arrays Array1[…]..ArrayN[…] stay on the CPU]
272

136

Explicit conversion
[Same figure]
• Explicit conversion between data-parallel arrays and normal arrays triggers GPU execution
273

Functional style
[Same figure]
• Functional style: each operation produces a new data-parallel array
274

137

Types of operations
[Same figure]
• Restrict operations to allow data-parallel programming: no pointer arithmetic, no individual element access/update
275

Operations • Array creation • Element-wise arithmetic operations: +, *, -, etc. • Element-wise boolean operations: and, or, >, < etc. • Type conversions: integer to float, etc. • Reductions/scans: sum, product, max, etc. • Transformations: expand, pad, shift, gather, scatter, etc. • Basic linear algebra: inner product, outer product.

276

138

Example: 2-D convolution

  float[,] Blur(float[,] array, float[] kernel) {
      using (DFPA parallelArray = new DFPA(array)) {
          FPA resultX = new FPA(0.0f, parallelArray.Shape);
          for (int i = 0; i < kernel.Length; i++) {
              // Convolve in X direction.
              resultX += parallelArray.Shift(0, i) * kernel[i];
          }
          FPA resultY = new FPA(0.0f, parallelArray.Shape);
          for (int i = 0; i < kernel.Length; i++) {
              // Convolve in Y direction.
              resultY += resultX.Shift(i, 0) * kernel[i];
          }
          using (DFPA result = resultY.Eval()) {
              float[,] resultArray;
              result.ToArray(out resultArray);
              return resultArray;
          }
      }
  }
277

Just-in-time compiler

278

139

Availability and more information • Binary version of Accelerator available for download – http://research.microsoft.com/downloads

• Available for non-commercial use – Meant to support research community use. – Licensing for commercial use possible.

• Includes documentation and a few samples • Runs on Microsoft.NET, most GPUs shipping since 2002. • More information: – ASPLOS 2006 “Accelerator: using data-parallelism to program GPUs for general-purpose uses”, David Tarditi, Sidd Puri, Jose Oglesby – http://research.microsoft.com/act

279

Brook: General Purpose Streaming Language

• Stream programming model – GPU = streaming coprocessor

• C with stream extensions • Cross platform – ATI & NVIDIA – OpenGL, DirectX, CTM – Windows & Linux

280

140

Streams
• Collection of records requiring similar computation
– particle positions, voxels, FEM cells, …

  Ray r;
  float3 velocityfield;

• Similar to arrays, but…
– index operations disallowed: position[i]
– read/write stream operators:

  streamRead(r, r_ptr);
  streamWrite(velocityfield, v_ptr);
281

Kernels
• Functions applied to streams
– similar to for_all construct
– no dependencies between stream elements

  kernel void foo (float a<>, float b<>, out float result<>) {
      result = a + b;
  }

  float a<n>;
  float b<n>;
  float c<n>;
  foo(a, b, c);

  // equivalent C loop:
  for (i = 0; i < n; i++)
      c[i] = a[i] + b[i];
