GPU Computing with CUDA Lecture 10 ... - Boston University
GPU Computing with CUDA
Lecture 10 - Applications - N body problem

Christopher Cooper
Boston University
August, 2011
UTFSM, Valparaíso, Chile

Outline of lecture
‣ What is an N body problem?
‣ What kind of problems can be solved with an N body approach?
‣ Fast calculation algorithms
‣ N-body problem on GPU

Introduction
‣ The N body problem calculates the interaction between N bodies

$$\mathbf{x} = \mathbf{x}_i - \mathbf{x}_j$$

$$|x|_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$

‣ Classic example: Gravitational Potential

$$\phi_i = \sum_{j=0}^{N} \frac{m_j}{|x|_{ij}}$$

$$\nabla \phi = g$$
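
The two formulas above can be checked on a few bodies with a direct double loop. The following is a minimal CPU sketch in plain Python, not code from the lecture. It assumes units with G = 1 (the slide carries no constant) and it skips the j = i term, which the slide's sum does not exclude but which would divide by zero. The function names and the sample data are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance |x|_ij between two 3D points given as (x, y, z)."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2)

def direct_potential(positions, masses):
    """Return phi_i = sum_j m_j / |x|_ij for every body i (assumes G = 1).

    Assumption: the j == i term is skipped, since |x|_ii = 0.
    """
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):          # one target body per outer iteration
        for j in range(n):      # every source body: N * N pair visits in total
            if j != i:
                phi[i] += masses[j] / distance(positions[i], positions[j])
    return phi

# Hypothetical sample data: three bodies.
positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
masses = [1.0, 2.0, 4.0]
phi = direct_potential(positions, masses)
```

A reference of this kind is slow, but it gives values to compare against when a GPU version is written.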
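
Introduction
‣ Useful for anything with Green’s function!

$$\int_\Omega G \nabla^2 \phi \, d\Omega = \int_\Omega G f \, d\Omega$$

$$\int_\Omega \phi \nabla^2 G \, d\Omega + \int_\Gamma n \cdot [G \nabla \phi - (\nabla G) \phi] \, d\Gamma = \int_\Omega G f \, d\Omega$$

$$\phi = \int_\Omega G f \, d\Omega$$

‣ Poisson $\nabla^2 \phi = f$
  - Astrophysics
  - Fluid mechanics
  - Electrostatics
‣ Helmholtz $\nabla^2 \phi + k^2 \phi = f$
  - Acoustics
  - Electromagnetics
‣ Poisson-Boltzmann $\nabla \cdot (\epsilon \nabla \phi) + k^2 \phi = f$
  - Geophysics
  - Biophysics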

Introduction
‣ Radial Basis Function interpolation

$$u(x) = \sum_{i=0}^{N} \alpha_i \, \phi(|x - x_i|)$$
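
Evaluating the interpolant has the same shape as the potential sum: each evaluation point visits every center. A minimal sketch in plain Python follows. The Gaussian basis function is an assumption made only for illustration, since the slide does not fix a choice of φ, and the function names are hypothetical.

```python
import math

def gaussian(r):
    """Hypothetical basis function phi(r); the slide does not specify one."""
    return math.exp(-r * r)

def rbf_evaluate(x, centers, alphas, basis=gaussian):
    """Evaluate u(x) = sum_i alpha_i * phi(|x - x_i|) at one 3D point x."""
    return sum(a * basis(math.dist(x, c)) for a, c in zip(alphas, centers))

# Hypothetical data: two centers with weights 2 and 3.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
alphas = [2.0, 3.0]
u_origin = rbf_evaluate((0.0, 0.0, 0.0), centers, alphas)
```

Evaluating at M points with N centers costs M × N basis evaluations, which is why the N body machinery applies.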

Order of N body problems
‣ Direct calculation is $O(N^2)$

$$\Phi_i = \sum_{j=0}^{N} \frac{m_j}{|x|_{ij}}$$

Fast algorithms
‣ Multipole expansions

Fast algorithms
[Figure: annotations "Independent of i" and "Independent of j"]
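
The slide's figure did not survive extraction, so the following is only a generic sketch of why separating the i and j dependence helps, not a reconstruction of the slide. It assumes the kernel $G$ can be approximated by a sum of $p$ separable terms. The symbols $A_k$, $B_k$ and $p$ are hypothetical and do not appear in the original.

$$G(\mathbf{x}_i, \mathbf{x}_j) \approx \sum_{k=1}^{p} A_k(\mathbf{x}_i) \, B_k(\mathbf{x}_j)$$

$$\Phi_i = \sum_{j=0}^{N} m_j \, G(\mathbf{x}_i, \mathbf{x}_j) \approx \sum_{k=1}^{p} A_k(\mathbf{x}_i) \left[ \sum_{j=0}^{N} m_j \, B_k(\mathbf{x}_j) \right]$$

The bracketed sums are independent of i, so they are computed once for all targets at a cost of $O(pN)$. Each $A_k(\mathbf{x}_i)$ is independent of j, so every $\Phi_i$ then costs $O(p)$, and the total is $O(pN)$ instead of $O(N^2)$.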

Fast algorithms
‣ Tree Code
  - Barnes and Hut
‣ Fast Multipole Methods (FMM)
  - Greengard and Rokhlin
[Figure: panels "Tree code" and "FMM"]

Fast algorithms
‣ Flow of calculation

N body on GPU
‣ Tomorrow’s lab: implement $O(N^2)$ N body calculation on GPU
‣ Nyland L., Harris M., Prins J. “Fast N-Body Simulation with CUDA”. GPU Gems 3, Chapter 31.
‣ Lots of operations per load
‣ Much faster, but still $O(N^2)$!

N body on GPU
‣ Look at the problem as a matrix vector product

$$\Phi_i = \sum_{j=0}^{N} \frac{m_j}{|x|_{ij}}$$

$$\Phi_i = \sum_{j=0}^{N} \frac{m_j}{(|x|_{ij} + \epsilon^2)}$$

‣ Each thread will compute one row (not one element)
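
The "one thread per row" mapping can be emulated on the CPU to make the work division concrete. This is a sequential Python sketch, not CUDA code. It follows the slide's softened form $m_j / (|x|_{ij} + \epsilon^2)$ as written; Plummer softening, $m_j / \sqrt{|x|_{ij}^2 + \epsilon^2}$, is a common alternative. The function names are hypothetical.

```python
import math

def row_sum(i, positions, masses, eps):
    """Work of one GPU thread: sum row i of the interaction matrix.

    Uses the slide's softened form m_j / (|x|_ij + eps^2). With eps > 0 the
    j == i term is finite (m_i / eps^2), so the loop needs no branch.
    """
    total = 0.0
    for j in range(len(positions)):
        total += masses[j] / (math.dist(positions[i], positions[j]) + eps * eps)
    return total

def softened_potential(positions, masses, eps):
    """Emulate the launch: on the GPU each i would run as a separate thread."""
    return [row_sum(i, positions, masses, eps) for i in range(len(positions))]

# Hypothetical data: two bodies one unit apart.
phi_soft = softened_potential([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 2.0], 0.5)
```

Giving each thread a whole row keeps the accumulation in a private variable, so no two threads write to the same output element.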

N body on GPU
‣ Data reuse: Tiling
  - Load tiles to shared memory for data reuse

N body on GPU
‣ Kernel
  - Allocate shared memory
  - Start loop on blocks
  - Load to shared memory
  - Start loop on elements
  - Do sum
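
The kernel outline can be followed step by step in a sequential emulation. The sketch below is plain Python, not the lecture's kernel. A list slice stands in for shared memory, the loop over i stands in for the threads running in lockstep, and `block_size` and the function name are hypothetical. It reuses the slide's softened form of the potential.

```python
import math

def tiled_potential(positions, masses, eps, block_size):
    """Sequential emulation of the tiled kernel outline (not CUDA code)."""
    n = len(positions)
    phi = [0.0] * n
    # Loop on blocks: each pass handles one tile of source bodies.
    for tile_start in range(0, n, block_size):
        tile_end = min(tile_start + block_size, n)
        # "Load to shared memory": one copy of the tile, reused by every target.
        tile_pos = positions[tile_start:tile_end]
        tile_mass = masses[tile_start:tile_end]
        for i in range(n):                         # each i stands in for a thread
            acc = 0.0
            for p, m in zip(tile_pos, tile_mass):  # loop on elements of the tile
                acc += m / (math.dist(positions[i], p) + eps * eps)  # do sum
            phi[i] += acc
    return phi
```

Each tile is read from the full arrays once and then used by every target, which is the data reuse the tiling slide refers to. In a real kernel the threads of a block would load the tile cooperatively and synchronize before and after using it.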

N body on GPU - Optimization
‣ Loop unrolling
  - Avoid unnecessary operations
  - #pragma unroll 32
‣ Separate last loop
  - The number of elements might not be a multiple of the block size
  - Separate the last loop so that warps do not perform unnecessary calculations
‣ Vary block size
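
The "separate last loop" idea is easiest to see as index arithmetic. This Python sketch is an illustration only: Python does not unroll loops, so it shows just how the full tiles and the leftover elements are split. The names `split_tiles`, `row_sum_split` and `weights` are hypothetical, and `weights` stands for the already computed terms of one row.

```python
def split_tiles(n, block_size):
    """Return (full_tiles, remainder).

    full_tiles lists the start offset of every complete tile; remainder is the
    (start, end) range of leftover elements, empty when n divides evenly.
    """
    n_full = n // block_size
    full_tiles = [t * block_size for t in range(n_full)]
    remainder = (n_full * block_size, n)
    return full_tiles, remainder

def row_sum_split(weights, block_size):
    """Sum one row: fixed-length loops over full tiles, then the remainder."""
    full_tiles, (rem_start, rem_end) = split_tiles(len(weights), block_size)
    total = 0.0
    for start in full_tiles:
        # The trip count is always block_size, with no bounds check inside.
        # A loop of this shape is the kind a compiler can unroll.
        for k in range(block_size):
            total += weights[start + k]
    # Separate last loop: only the leftover elements, handled once.
    for k in range(rem_start, rem_end):
        total += weights[k]
    return total
```

For example, `split_tiles(10, 4)` gives two full tiles starting at 0 and 4, plus a remainder covering elements 8 and 9.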

N body on GPU - Performance metric
‣ Compute-bound problem
  - Performance in FLOP/s
  - Count the number of floating point operations and divide by the kernel execution time
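
The metric is a single division once the operations are counted. The helper below is a hypothetical sketch in Python. It assumes an all-pairs kernel that evaluates N × N interactions, and it leaves the number of floating point operations per interaction as an input, because that count depends on how the inner loop is written.

```python
def gflops(n_bodies, flops_per_interaction, seconds):
    """Return the performance of an all-pairs kernel in GFLOP/s.

    flops_per_interaction is an assumption supplied by the caller, obtained by
    counting the floating point operations in their own inner loop.
    """
    total_flops = n_bodies * n_bodies * flops_per_interaction
    return total_flops / seconds / 1e9

# Hypothetical numbers: 1000 bodies, 20 operations per pair, 1 ms kernel time.
rate = gflops(1000, 20, 0.001)
```

Only the kernel execution time goes in the denominator, so transfers between host and device should be timed separately.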