Basics of CUDA Programming
Weijun Xiao
Department of Electrical and Computer Engineering
University of Minnesota
Outline
• What's GPU computing?
• CUDA programming model
• Basic memory management
• Basic kernels and execution
• CPU and GPU coordination
• CUDA debugging and profiling
• Conclusions
What is a GPU?
• Graphics Processing Unit
[Diagram: logical representation of visual information → GPU → output signal]
Performance Gap between GPUs and CPUs
[Figure: peak performance over time, GPUs vs. CPUs]
GPU = Fast Parallel Machine
• GPU speed is increasing at a faster pace than Moore's Law.
• This is a consequence of the data-parallel, streaming nature of the GPU.
• The gaming market stimulates the development of GPUs.
• GPUs are cheap! Put enough together and you can get a supercomputer.

So can we use the GPU for general-purpose computing?
Sure, Thousands of Applications
• Large matrix/vector operations (BLAS)
• Protein folding (molecular dynamics)
• FFT (signal processing)
• VMD (Visual Molecular Dynamics)
• Speech recognition (hidden Markov models, neural nets)
• Databases
• Sort/search
• Storage
• MRI
• …
Why Are We Interested in GPUs?
• High-performance computing
• High parallelism
• Low cost
• GPUs are programmable
• GPGPU
Growth and Development of GPU
• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Before CUDA, programmed through graphics APIs
• GPU in every PC and workstation – massive volume and potential impact
[Figure: GFLOPS over time for NVIDIA GPUs vs. Intel CPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

GeForce 8800
16 highly threaded multiprocessors, 128 cores, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Architecture diagram: Host → Input Assembler → Thread Execution Manager; an array of multiprocessors, each with texture units, a parallel data cache, and load/store units, all connected to global memory]
Tesla 2050
14 multiprocessors, 448 cores, 1.03 TFLOPS (single precision) / 515 GFLOPS (double precision), 3 GB GDDR5 DRAM with ECC, 144 GB/s memory bandwidth, PCIe 2.0 x16 (8 GB/s bandwidth to CPU)
GPU Languages
• Assembly
• Cg (NVIDIA) - C for Graphics
• GLSL (OpenGL) - OpenGL Shading Language
• HLSL (Microsoft) - High-Level Shading Language
• Brook C/C++ (AMD)
• CUDA (NVIDIA)
• OpenCL
How Did GPGPU Work before CUDA?
• Follow the graphics pipeline
• Pretend to be graphics
• Take advantage of the massive parallelism of the GPU
• Disguise data as textures or geometry
• Disguise the algorithm as render passes
• Fool the graphics pipeline into doing computation
CUDA Programming Model
• Compute Unified Device Architecture
• Simple and general-purpose programming model
• Standalone driver to load computation programs onto the GPU
• Graphics-free API
• Data sharing with OpenGL buffer objects
• Easy to use, with a low learning curve
CUDA – C with no shader limitations!
• Integrated host + device C application program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);
...
Serial Code (host)
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
...
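As a rough sketch (not from the slides) of that structure, a host program alternates plain C with kernel launches; the kernel name, sizes, and arguments below are invented for illustration:

__global__ void scaleKernel(float *data, float s)
{
    data[threadIdx.x] *= s;                               // highly parallel part runs on the device
}

int main()
{
    float *d_data = 0;
    cudaMalloc((void**)&d_data, 256 * sizeof(float));     // serial host code

    scaleKernel<<<1, 256>>>(d_data, 2.0f);                // parallel kernel on the device

    cudaDeviceSynchronize();                              // serial host code resumes after the kernel
    cudaFree(d_data);
    return 0;
}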
CUDA Devices and Threads
• A compute device
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
  – Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
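As a quick illustration (not part of the slides), a minimal host program can query these device properties through the CUDA runtime API; this sketch simply prints a few fields for each installed device:

#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // A coprocessor with its own DRAM and many parallel threads
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Global memory:         %zu bytes\n", prop.totalGlobalMem);
        printf("  Multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}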
Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

Code sketch from the slide:

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
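The slide code above is schematic (N, M, i, j, and result are never defined). For a complete, compilable illustration of the same declspecs and intrinsics, here is a made-up sketch that reverses a small array inside one block using shared memory and __syncthreads(); the kernel name reverseInBlock and the size N are invented for this example:

#include <stdio.h>

#define N 64

__global__ void reverseInBlock(int *d_data)
{
    __shared__ int tile[N];              // per-block shared memory
    int t = threadIdx.x;                 // thread ID within the block
    tile[t] = d_data[t];                 // stage the data in shared memory
    __syncthreads();                     // wait until all threads have written
    d_data[t] = tile[N - 1 - t];         // write back in reverse order
}

int main()
{
    int h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = i;

    int *d_data = 0;
    cudaMalloc((void**)&d_data, N * sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, N>>>(d_data);    // one block of N threads

    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %d (expect %d)\n", h_data[0], N - 1);

    cudaFree(d_data);
    return 0;
}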
Compiling a CUDA Program
• Parallel Thread eXecution (PTX)
  – Virtual machine and ISA
  – Programming model
  – Execution resources and state

[Compilation flow: C/C++ CUDA application → NVCC → CPU code + PTX code (virtual) → PTX-to-target compiler (physical) → G80 … GPU target code]

Example CUDA source:
float4 me = gx[gtid];
me.x += me.y * me.z;

Corresponding PTX:
ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32           $f1, $f5, $f3, $f1;
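Not from the slides, but for reference: nvcc drives this flow from the command line. Assuming a source file named kernel.cu (the file name is invented here), the intermediate PTX can be emitted for inspection, or the program compiled straight to a physical target:

nvcc -ptx kernel.cu -o kernel.ptx
nvcc -arch=sm_20 kernel.cu -o app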
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

[Figure: threads with threadID 0 1 2 3 4 5 6 7, each executing:]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
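A minimal sketch (not from the slides) of how that per-thread snippet could look as a real kernel for a single block; func here is just a placeholder device function, and the kernel name apply is invented:

__device__ float func(float x)
{
    return x * x;                        // stand-in for any per-element computation
}

__global__ void apply(const float *input, float *output)
{
    int threadID = threadIdx.x;          // thread's ID within this (single) block
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

// Launch example matching the slide's 8 thread IDs: one block of 8 threads
// apply<<<1, 8>>>(d_input, d_output);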
Thread Blocks: Scalable Cooperation
• Divide the monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  – Threads in different blocks cannot cooperate

[Figure: Thread Block 0, Thread Block 1, …, Thread Block N - 1; each block contains threads with threadID 0 to 7, all executing the same per-thread code:]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
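In the figure, threadID is local to each block; to address one global array across many blocks, each thread typically combines its block ID and thread ID. A small sketch under that assumption (kernel name applyAll and the squaring operation are invented for illustration):

__global__ void applyAll(const float *input, float *output, int n)
{
    // Global index: block number times block size, plus position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                         // guard in case n is not a multiple of the block size
        float x = input[i];
        output[i] = x * x;               // stand-in for func(x)
    }
}

// Launch with enough blocks to cover n elements, e.g. 256 threads per block:
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// applyAll<<<blocks, threads>>>(d_input, d_output, n);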
Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …

[Figure 3.2: An example of CUDA thread organization. The host launches Kernel 1 on Grid 1, a 2D arrangement of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its 3D arrangement of threads. Courtesy: NVIDIA]
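As an illustration of 2D addressing for image processing (not from the slides, and with invented names): assuming a row-major image of width × height floats, 2D block and thread IDs map directly onto pixel coordinates.

__global__ void brighten(float *image, int width, int height, float gain)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height) {
        image[y * width + x] *= gain;                // row-major pixel addressing
    }
}

// Launch with a 2D grid of 2D blocks, e.g. 16x16 threads per block:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_image, width, height, 1.1f);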
CUDA Memory Model
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access

[Figure: a grid of blocks; each block has its own shared memory and per-thread registers for Thread (0, 0) and Thread (1, 0); the blocks and the host all access global memory]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
Basic Memory Management
Memory Spaces
• CPU and GPU have separate memory spaces
  – Data is moved across the PCIe bus
  – Use functions to allocate/set/copy memory on the GPU
    • Very similar to the corresponding C functions
• Pointers are just addresses
  – Can't tell from the pointer value whether the address is on the CPU or GPU
  – Must exercise care when dereferencing:
    • Dereferencing a CPU pointer on the GPU will likely crash
    • Same for vice versa
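To make the pointer point concrete, a small sketch with invented names: h_buf lives in CPU memory and may be dereferenced on the host, while d_buf lives in GPU memory and may only be passed to CUDA calls or kernels.

float *h_buf = (float*)malloc(100 * sizeof(float));   // CPU (host) memory
float *d_buf = 0;
cudaMalloc((void**)&d_buf, 100 * sizeof(float));      // GPU (device) memory

h_buf[0] = 1.0f;                                      // fine: host pointer dereferenced on the host
// d_buf[0] = 1.0f;                                   // WRONG: device pointer dereferenced on the CPU

// Correct way to move data between the two spaces:
cudaMemcpy(d_buf, h_buf, 100 * sizeof(float), cudaMemcpyHostToDevice);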
GPU Memory Allocation / Release
• Host (CPU) manages device (GPU) memory:
  – cudaMalloc(void **pointer, size_t nbytes)
  – cudaMemset(void *pointer, int value, size_t count)
  – cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
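The slides omit error handling, but each of these calls returns a cudaError_t. A more cautious version of the same sequence (a fragment to place inside main, not slide material) might check each status with cudaGetErrorString:

int nbytes = 1024 * sizeof(int);
int *d_a = 0;

cudaError_t err = cudaMalloc((void**)&d_a, nbytes);
if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}
err = cudaMemset(d_a, 0, nbytes);
if (err != cudaSuccess) {
    printf("cudaMemset failed: %s\n", cudaGetErrorString(err));
}
cudaFree(d_a);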
Data Copies
• cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
  – Returns after the copy is complete
  – Blocks the CPU thread until all bytes have been copied
  – Doesn't start copying until previous CUDA calls complete
• enum cudaMemcpyKind
  – cudaMemcpyHostToDevice
  – cudaMemcpyDeviceToHost
  – cudaMemcpyDeviceToDevice
• Non-blocking memcopies are provided
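For illustration (not from the slides), a typical round trip, assuming d_a and h_a were each allocated with nbytes as on the previous slide; the non-blocking variant cudaMemcpyAsync takes an extra stream argument and returns before the copy finishes:

// Host -> device, then device -> host; both calls block the CPU thread
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

// Non-blocking variant (page-locked host memory is needed for a truly asynchronous copy):
// cudaStream_t stream;
// cudaStreamCreate(&stream);
// cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, stream);
// cudaStreamSynchronize(stream);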
Code Walkthrough 1
• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values
Code Walkthrough 1
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers
Code Walkthrough 1
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }
Code Walkthrough 1
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
Code Walkthrough 1
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i]);
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
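Assuming the file is saved as walkthrough1.cu (the name is just an example), it can be built and run with nvcc:

nvcc walkthrough1.cu -o walkthrough1
./walkthrough1

If the CUDA calls succeed, the program should print sixteen zeros, since the GPU buffer was cleared with cudaMemset before being copied back to the host.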