Illinois UPCRC Summer School 2010
The OpenCL Programming Model
Part 2: Case Studies
Wen-mei Hwu and John Stone, with special contributions from Deepthi Nandakumar
© Wen-mei W. Hwu and John Stone, Urbana, July 22, 2010
OpenCL Data Parallel Model
• Parallel work is submitted to devices by launching kernels
• Kernels run over a global dimension index range (NDRange), broken up into "work groups" and "work items"
• Work items executing within the same work group can synchronize with each other with barriers or memory fences
• Work items in different work groups cannot synchronize with each other, except by launching a new kernel
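As an illustration of these terms (a minimal sketch, not taken from the slides; the kernel name and arguments are made up), the following OpenCL C kernel reads each work item's IDs and uses a barrier so that all work items in one work group synchronize before local memory is reused:

__kernel void scale_rows(__global float *data, __local float *tile)
{
    size_t gid   = get_global_id(0);   // position within the whole NDRange
    size_t lid   = get_local_id(0);    // position within this work group
    size_t group = get_group_id(0);    // which work group this item belongs to

    // tile is a __local buffer whose size is supplied by the host via clSetKernelArg
    tile[lid] = data[gid];             // stage one element into local memory
    barrier(CLK_LOCAL_MEM_FENCE);      // every item in this work group waits here

    // After the barrier, any item in the group can safely read tile[]
    data[gid] = tile[lid] * (float)(group + 1);
}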
Transparent Scalability
• Hardware is free to assign work groups to any processor at any time
  – A kernel scales across any number of parallel processors
[Figure: the same NDRange of work groups (Group 0 through Group 7) scheduled on two devices; a smaller device executes two groups at a time while a larger device executes four at a time. Each group can execute in any order relative to other groups.]
OpenCL NDRange Configuration
[Figure: a two-dimensional NDRange of size Global Size(0) x Global Size(1), divided into work groups identified by a Group ID; each work group spans Local Size(0) x Local Size(1) and contains individual work items.]
Mapping Data Parallelism Models: OpenCL to CUDA

OpenCL Parallelism Concept    CUDA Equivalent
kernel                        kernel
host program                  host program
NDRange (index space)         grid
work item                     thread
work group                    block
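As a small, hedged illustration of this mapping (not from the slides), the OpenCL index built-ins line up with CUDA's index variables as shown in this throwaway kernel:

__kernel void index_demo(__global int *out)
{
    // CUDA: blockIdx.x * blockDim.x + threadIdx.x
    size_t gx = get_group_id(0) * get_local_size(0) + get_local_id(0);
    // OpenCL provides the same value directly (assuming no global work offset)
    size_t gx_direct = get_global_id(0);
    // CUDA: gridDim.x * blockDim.x corresponds to OpenCL's get_global_size(0)
    out[gx_direct] = (int)(gx == gx_direct);   // always writes 1
}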
A Simple Running Example: Matrix Multiplication
• A simple matrix multiplication example that illustrates the basic features of memory and thread management in OpenCL programs:
  – Private register usage
  – Work item ID usage
  – Memory data transfer API between host and device
  – Assume square matrices for simplicity
Programming Model: Square Matrix Multiplication Example
• P = M * N, each of size WIDTH x WIDTH
• Without tiling:
  – One work item calculates one element of P
  – M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH; a row of M and a column of N combine to produce one element of P.]
Memory Layout of a Matrix in C
A 4 x 4 matrix M with elements M0,0 through M3,3 is stored in row-major order as a flat array:
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3
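As a worked example of this layout (not on the original slide): element M(row, col) of a Width x Width matrix sits at offset row * Width + col in the flat array, so in the 4 x 4 case above M2,1 is at offset 2 * 4 + 1 = 9. All of the index arithmetic in the code that follows relies on this row-major convention.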
Step 1: Matrix Multiplication, A Simple Host Version in C

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
Step 2: Input Matrix Data Transfer (Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    cl_mem Md, Nd, Pd;

    // 1. Allocate device buffers and transfer M and N to the device
    Md = clCreateBuffer(clctxt, CL_MEM_READ_WRITE, size, NULL, NULL);
    Nd = clCreateBuffer(clctxt, CL_MEM_READ_WRITE, size, NULL, &ciErrNum);
    clEnqueueWriteBuffer(clcmdque, Md, CL_FALSE, 0, size, (const void *)M, 0, 0, NULL);
    clEnqueueWriteBuffer(clcmdque, Nd, CL_FALSE, 0, size, (const void *)N, 0, 0, NULL);

    // Allocate P on the device
    Pd = clCreateBuffer(clctxt, CL_MEM_READ_WRITE, size, NULL, NULL);
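The handles clctxt (context) and clcmdque (command queue) are assumed to exist already; the slides do not show how they are created. A minimal, hedged sketch of that one-time setup for a single GPU device (variable names illustrative, error handling omitted) might be:

cl_int ciErrNum;
cl_platform_id platform;
cl_device_id device;
cl_context clctxt;
cl_command_queue clcmdque;

clGetPlatformIDs(1, &platform, NULL);                              // first available platform
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);    // first GPU on that platform
clctxt = clCreateContext(NULL, 1, &device, NULL, NULL, &ciErrNum); // context for that device
clcmdque = clCreateCommandQueue(clctxt, device, 0, &ciErrNum);     // in-order command queue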
Step 3: Output Matrix Data Transfer (Host-side Code)

    // 2. Kernel invocation code - to be shown later

    // 3. Read P from the device
    clEnqueueReadBuffer(clcmdque, Pd, CL_FALSE, 0, size, (void *)P, 0, 0, &ReadDone);

    // Free device matrices
    clReleaseMemObject(Md);
    clReleaseMemObject(Nd);
    clReleaseMemObject(Pd);
}
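Because the read above is enqueued with CL_FALSE (non-blocking), the host must not use P until the transfer completes. Two hedged alternatives, using only standard OpenCL calls:

// Option 1: wait on the event returned by the non-blocking read
clWaitForEvents(1, &ReadDone);

// Option 2: make the read itself blocking (CL_TRUE), so no event is needed
clEnqueueReadBuffer(clcmdque, Pd, CL_TRUE, 0, size, (void *)P, 0, NULL, NULL);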
Matrix Multiplication Using Multiple Work Groups
• Break up Pd into tiles
• Each work group calculates one tile
  – Each work item calculates one element
  – Set the work group size to the tile size
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH; Pd is divided into TILE_WIDTH x TILE_WIDTH sub-matrices (Pdsub). Work group (by, bx) computes one tile, and work item (ty, tx) within it computes one element of that tile.]
A Very Small Example
[Figure: a 4 x 4 result matrix P0,0 through P3,3 with TILE_WIDTH = 2, divided among four work groups: Group(0,0), Group(0,1), Group(1,0), and Group(1,1), each covering a 2 x 2 tile.]
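As a worked index example (not on the original slide): with TILE_WIDTH = 2 the global index space is 4 x 4, so the work item with global ID (3, 2) has Col = 3 and Row = 2 and therefore computes P2,3; it belongs to work group (1, 1) and has local ID (1, 0), since 3 = 1 * 2 + 1 and 2 = 1 * 2 + 0.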
A Very Small Example: Multiplication
[Figure: for the 4 x 4 result Pd0,0 through Pd3,3, work item (0,0) computes Pd0,0 from row 0 of Md (Md0,0 Md0,1 Md0,2 Md0,3) and column 0 of Nd (Nd0,0 Nd1,0 Nd2,0 Nd3,0); each element of Pd is produced by one work item.]
OpenCL Matrix Multiplication Kernel

__kernel void MatrixMulKernel(__global float* Md, __global float* Nd,
                              __global float* Pd, int Width)
{
    // Calculate the row index of the Pd element and M
    int Row = get_global_id(1);
    // Calculate the column index of Pd and N
    int Col = get_global_id(0);

    float Pvalue = 0;
    // Each work item computes one element of the output sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}
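The kernel above must be compiled at run time before it can be launched; the slides skip this step, so the following is a minimal sketch (source string and variable names are placeholders, error checks omitted):

const char *src = "...";   // the MatrixMulKernel source string, e.g. loaded from a file
cl_program clpgm = clCreateProgramWithSource(clctxt, 1, &src, NULL, &ciErrNum);
clBuildProgram(clpgm, 0, NULL, NULL, NULL, NULL);            // compile for all devices in the context
cl_kernel clkern = clCreateKernel(clpgm, "MatrixMulKernel", &ciErrNum);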
Revised Step 5: Kernel Invocation (Host-side Code)

// Set up the execution configuration
size_t cl_DimBlock[2], cl_DimGrid[2];
cl_DimBlock[0] = TILE_WIDTH;
cl_DimBlock[1] = TILE_WIDTH;
cl_DimGrid[0] = Width;   // global size is given in work items, not work groups
cl_DimGrid[1] = Width;

// Argument order must match the kernel signature (Md, Nd, Pd, Width)
clSetKernelArg(clkern, 0, sizeof(cl_mem), (void *)(&Md));
clSetKernelArg(clkern, 1, sizeof(cl_mem), (void *)(&Nd));
clSetKernelArg(clkern, 2, sizeof(cl_mem), (void *)(&Pd));
clSetKernelArg(clkern, 3, sizeof(int), (void *)(&Width));

// Launch the device kernel
clEnqueueNDRangeKernel(clcmdque, clkern, 2, NULL, cl_DimGrid, cl_DimBlock,
                       0, NULL, &DeviceDone);
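Note that in OpenCL 1.x the global work size must be a multiple of the local work size in each dimension. A hedged sketch of handling a Width that is not a multiple of TILE_WIDTH, and of waiting for the launch to finish before reading P back:

// Round each global dimension up to a multiple of the work group size
size_t rounded = ((Width + TILE_WIDTH - 1) / TILE_WIDTH) * TILE_WIDTH;
cl_DimGrid[0] = rounded;
cl_DimGrid[1] = rounded;
// The kernel must then guard against out-of-range indices,
// e.g. if (Row < Width && Col < Width) { ... }

// Wait for the kernel to finish before enqueueing the read of Pd
clWaitForEvents(1, &DeviceDone);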
A Real Application Example
Electrostatic Potential Maps
• Electrostatic potentials are evaluated on a 3-D lattice: the potential at lattice point j is the sum over all atoms i of charge[i] / rij (with physical constants folded into the charges), as made explicit on the next slide
• Applications include:
  – Ion placement for structure building
  – Time-averaged potentials for simulation
  – Visualization and analysis
[Figure: electrostatic potential map of isoleucine tRNA synthetase.]
Direct Coulomb Summation
• Each lattice point accumulates the electrostatic potential contribution from all atoms:
    potential[j] += charge[i] / rij
[Figure: lattice point j being evaluated; rij is the distance from lattice[j] to atom[i].]
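To make the data layout concrete, here is a hedged sequential C sketch of this summation for one z-slice of the lattice; the struct, function, and parameter names are illustrative and not taken from the slides:

#include <math.h>

typedef struct { float x, y, z, q; } Atom;

/* Accumulate every atom's potential contribution at every lattice point of
   the slice z = zplane; energygrid has numx * numy entries. */
void dcs_slice(float *energygrid, int numx, int numy, float gridspacing,
               float zplane, const Atom *atoms, int numatoms)
{
    for (int j = 0; j < numy; j++) {
        float y = gridspacing * j;
        for (int i = 0; i < numx; i++) {
            float x = gridspacing * i;
            float energy = 0.0f;
            for (int n = 0; n < numatoms; n++) {
                float dx = x - atoms[n].x;
                float dy = y - atoms[n].y;
                float dz = zplane - atoms[n].z;
                /* potential[j] += charge[i] / rij, as on the slide */
                energy += atoms[n].q / sqrtf(dx * dx + dy * dy + dz * dz);
            }
            energygrid[j * numx + i] += energy;
        }
    }
}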
Data Parallel Direct Coulomb Summation Algorithm
• Work is decomposed into tens of thousands of independent calculations
  – multiplexed onto all of the processing units on the target device (hundreds, in the case of modern GPUs)
• Single-precision FP arithmetic is adequate for the intended application
• Numerical accuracy can be improved by compensated summation (see the sketch below), spatially ordered summation groupings, or accumulation of the potential in double precision
• This is the starting point for more sophisticated linear-time algorithms such as multilevel summation
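As one example of the compensated summation mentioned above (a generic sketch, not code from the slides), Kahan summation carries a correction term alongside the running sum:

/* Compensated (Kahan) summation: the running compensation c captures the
   low-order bits lost when small contributions are added to a large sum. */
float kahan_sum(const float *vals, int n)
{
    float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; i++) {
        float y = vals[i] - c;
        float t = sum + y;
        c = (t - sum) - y;   /* recover the rounding error of this addition */
        sum = t;
    }
    return sum;
}

This only helps if the compiler is not allowed to reassociate floating-point operations (e.g., it is defeated by aggressive fast-math optimization).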
DCS Data Parallel Decomposition (unrolled, coalesced)
• Unrolling increases the computational tile size
• Work groups: 64-256 work items
• Work items compute up to 8 potentials each, skipping by the memory coalescing width
[Figure: the lattice is covered by a grid of work groups; tiles that extend past the lattice edge are padding waste.]
Direct Coulomb Summation in OpenCL
• The host holds the atomic coordinates and charges
• The NDRange contains all work items, decomposed into work groups of 64-256 work items; lattice padding fills out partial tiles
• Work items compute up to 8 potentials each, skipping by the coalesced memory width
[Figure: atom data flows from the host into GPU constant memory; each multiprocessor has a parallel data cache and texture unit, all backed by global memory.]
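The atom coordinates and charges shown flowing into constant memory could be staged with an ordinary read-only buffer bound to a kernel parameter declared __constant; the following is a hedged sketch, with the buffer name, argument index, and kernel name all illustrative rather than taken from the slides:

// Host side: atoms is a cl_float4 array of length numatoms packing (x, y, z, q) per atom
cl_mem atombuf = clCreateBuffer(clctxt, CL_MEM_READ_ONLY,
                                numatoms * sizeof(cl_float4), NULL, &ciErrNum);
clEnqueueWriteBuffer(clcmdque, atombuf, CL_TRUE, 0,
                     numatoms * sizeof(cl_float4), atoms, 0, NULL, NULL);
clSetKernelArg(clenergykern, 4, sizeof(cl_mem), &atombuf);  // use whichever index the kernel defines

// Kernel side: the matching parameter lives in the constant address space, e.g.
//   __kernel void clenergy(..., __constant float4 *atominfo) { ... }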
Direct Coulomb Summation Kernel Setup

OpenCL:
__kernel void clenergy(...) {
    unsigned int xindex = (get_global_id(0) - get_local_id(0)) * UNROLLX
                          + get_local_id(0);
    unsigned int yindex = get_global_id(1);
    unsigned int outaddr = get_global_size(0) * UNROLLX * yindex + xindex;

CUDA:
__global__ void cuenergy(...) {
    unsigned int xindex = blockIdx.x * blockDim.x * UNROLLX + threadIdx.x;
    unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int outaddr = gridDim.x * blockDim.x * UNROLLX * yindex + xindex;

Note that get_global_id(0) - get_local_id(0) is the first global index of the work group, so the OpenCL xindex expression is the direct counterpart of CUDA's blockIdx.x * blockDim.x * UNROLLX + threadIdx.x.
DCS Inner Loop (CUDA)
…for (atomid=0; atomid
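The CUDA loop body is cut off above. As a rough, hedged sketch of the same idea in OpenCL C, without the 8-way unrolling the earlier slides describe and with all names (clenergy_simple, atominfo, numatoms, gridspacing, zplane, energygrid) chosen for illustration rather than taken from the authors' code, the inner loop over atoms might look like:

__kernel void clenergy_simple(int numatoms, float gridspacing, float zplane,
                              __global float *energygrid,
                              __constant float4 *atominfo)
{
    int xindex = get_global_id(0);
    int yindex = get_global_id(1);
    int outaddr = get_global_size(0) * yindex + xindex;

    float coorx = gridspacing * xindex;
    float coory = gridspacing * yindex;
    float energyval = 0.0f;

    // Accumulate the contribution of every atom at this lattice point:
    // potential += charge / distance (constants folded into the charge)
    for (int atomid = 0; atomid < numatoms; atomid++) {
        float dx = coorx - atominfo[atomid].x;
        float dy = coory - atominfo[atomid].y;
        float dz = zplane - atominfo[atomid].z;
        energyval += atominfo[atomid].w * rsqrt(dx * dx + dy * dy + dz * dz);
    }
    energygrid[outaddr] += energyval;
}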