
CUDA - Utk

Discussion on NVIDIA's Compute Unified Device Architecture (CUDA)
Stan Tomov, 02/05/2014
Specifications and graphs taken from CUDA Programming Guide Version 1.0
(references on: http://www.nvidia.com/object/cuda_get.html)

CUDA

Libraries: CUBLAS, CUFFT, ... ; LAPACK can easily be built on top of CUBLAS

C-like API

Programming model
A highly multithreaded coprocessor
* Thread block: a batch of threads with fast shared memory that executes a kernel
* Grid of thread blocks: blocks of the same dimension, grouped together to execute the same kernel; this reduces thread cooperation (threads in different blocks cannot cooperate)

    // set the grid and thread-block configuration
    dim3 dimBlock(3, 5);
    dim3 dimGrid(2, 3);
    // launch the device computation
    MatVec<<<dimGrid, dimBlock>>>( . . . );

    __global__ void MatVec( . . . )
    {
        // Block index
        int bx = blockIdx.x;
        int by = blockIdx.y;
        // Thread index
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        ...
    }
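A minimal, self-contained version of such a kernel might look like the following sketch; the names (`matVecKernel`, `matVec`, `n`) and the choice of 256 threads per block are illustrative assumptions, not taken from the slides:

```cuda
#include <cuda_runtime.h>

// Sketch: y = A*x for an n x n row-major matrix A, one thread per row.
__global__ void matVecKernel(const float *A, const float *x, float *y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // global row index
    if (row < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row * n + j] * x[j];
        y[row] = sum;
    }
}

// Host-side launch: enough 256-thread blocks to cover all n rows.
void matVec(const float *dA, const float *dx, float *dy, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    matVecKernel<<<blocks, threads>>>(dA, dx, dy, n);
}
```

The `if (row < n)` guard is needed because the grid is rounded up to a whole number of blocks, so some threads may fall past the last row.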

GPUs & Challenges
* Programming is 'easier' with NVIDIA's Compute Unified Device Architecture (CUDA)

Quadro FX 5600
Installed (at ICL) on a 4 x Dual-Core AMD Opteron Processor 265 (1800 MHz, 1024 KB cache)

[Figure: memory hierarchy - 16 multiprocessors of 8 processors each, 16 KB shared memory per multiprocessor, 1.5 GB device memory]

Some numbers:
  processors: 128 (total)
  registers: 8192 / block
  warp size: 32
  max threads / block: 512
  max performance: 346 GFlop/s
  memory bandwidth: 76.8 GB/s
  bandwidth to CPU: 8 GB/s
  shared memory: 16 KB among the 8 processors on a multiprocessor

GPUs & Challenges
* Programming is 'easier' with NVIDIA's Compute Unified Device Architecture (CUDA)

[Figure: blocked matrix-matrix product C = A * B, staged through the 16 KB shared memory of each 8-processor multiprocessor; the matrices reside in the 1.5 GB device memory]

1. Get data into shared memory
2. Compute

For DLA the CI (computational intensity) is about 32 (on IBM Cell about 64)
* not enough to get close to peak (346 GFlop/s)
* CUBLAS sgemm is about 120 GFlop/s
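The two steps above (stage tiles in shared memory, then compute on them) can be sketched as a simple tiled kernel. The tile size and names are illustrative assumptions, not the actual CUBLAS code, and `n` is assumed to be a multiple of the tile size:

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative; a 16x16 float tile easily fits in 16 KB shared memory

// C = A * B for n x n row-major matrices, n assumed a multiple of TILE.
__global__ void tiledGemm(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];  // step 1: tiles staged in shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();              // wait until the tile is fully loaded

        for (int k = 0; k < TILE; ++k)  // step 2: compute on the fast tiles
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();              // done with the tile before overwriting it
    }
    C[row * n + col] = sum;
}
```

Each element brought into shared memory is reused TILE times, which is exactly how blocking raises the computational intensity above what a naive kernel achieves.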

GPUs & Challenges
* Programming is 'easier' with NVIDIA's Compute Unified Device Architecture (CUDA)

[Figure: blocked product C = A * B^T staged through the 16 KB shared memory of each 8-processor multiprocessor; the matrices reside in the 1.5 GB device memory]

* Small red rectangles (to overlap communication & computation) are of size 32 x 4 and are read by 32 x 2 threads

2008. Volkov and Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. SC08.
http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
2010. Nath, Tomov, and Dongarra. An Improved MAGMA GEMM for Fermi Graphics Processing Units. IJHPCA.
http://www.netlib.org/lapack/lawnspdf/lawn227.pdf

Discussion
* Dense Linear Algebra
  - Matrix-matrix product
  - LAPACK with CUDA
* Sparse Linear Algebra
  - Sparse matrix-vector product
* Projects using CUDA
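As a concrete starting point for the sparse part of the discussion, a basic sparse matrix-vector product can be sketched as below. The CSR storage format, the one-thread-per-row mapping, and all names are illustrative assumptions, not taken from the slides:

```cuda
#include <cuda_runtime.h>

// y = A*x with A stored in CSR format: rowPtr has n+1 entries,
// colInd and val have nnz entries each. One thread per row; unoptimized.
__global__ void spmvCsr(const int *rowPtr, const int *colInd,
                        const float *val, const float *x, float *y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colInd[j]];  // gather: irregular access to x
        y[row] = sum;
    }
}
```

Unlike the dense blocked product, there is essentially no data reuse here, and the gathered accesses to x are irregular, which is why SpMV typically runs far below both the compute peak and the 76.8 GB/s memory bandwidth.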