Alternative Rendering Pipelines Using NVIDIA CUDA
Andrei Tatarinov, Alexander Kharlamov

Outline
- CUDA overview
- Ray tracing
- REYES pipeline
- Future ideas
CUDA Overview
Compute Unified Device Architecture (CUDA)
- A parallel computing architecture that allows easy access to the GPU from an application
- A back-end for different APIs: C/C++, OpenCL, Fortran, DirectX Compute, ...
[Figure: hardware diagram. A Texture Processing Cluster groups several Streaming Multiprocessors (SMs) around a shared texture unit; each SM contains scalar processors (SPs), special function units (SFUs), a double-precision unit, a register file, shared memory, and instruction and constant caches.]
Threads and Blocks
- One block is executed on one Streaming Multiprocessor
- Threads within a block can cooperate through shared memory and __syncthreads()
Multiprocessor Occupancy
Registers (r.) and threads: 8192 registers per Streaming Multiprocessor on the 8800 GTX.
- 128 r. per thread: way too many registers
- r. ≤ 40: 6 active warps
- r. ≤ 32: 8 active warps
- r. ≤ 24: 10 active warps
- r. ≤ 20: 12 active warps
- r. ≤ 16: 16 active warps
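The occupancy numbers above follow from simple arithmetic: each resident warp needs 32 times the per-thread register count, so the register file bounds how many warps an SM can hold. A small Python sketch of that calculation, assuming the G80 (8800 GTX) figures of 8192 registers per SM and a 24-warp hardware cap:

```python
REGS_PER_SM = 8192   # G80 (GeForce 8800 GTX) register file per SM
WARP_SIZE = 32
MAX_WARPS = 24       # hardware cap on resident warps per SM on G80

def active_warps(regs_per_thread):
    # Each resident warp consumes WARP_SIZE * regs_per_thread registers;
    # the register file limits the warp count, up to the hardware cap.
    return min(REGS_PER_SM // (WARP_SIZE * regs_per_thread), MAX_WARPS)
```

With these assumptions, `active_warps(40)` gives 6 and `active_warps(16)` gives 16, matching the table above.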
Use cases
Ray tracing
- A natural rendering pipeline
- An important tool for determining visibility
Research goals
- Investigate rendering pipelines
- Collaborative research with Moscow State University
Path of a ray
[Figure: flowchart of the ray's path through three kernels.]
- Tree traversal kernel: selects K non-empty leaves along the ray, then the next leaf as needed
- Primitive intersection kernel (ray-triangle intersect): selects the next primitive in the leaf; when an intersection is found, reports the primitive ID
- Material & light (shading) kernel: the shaded cluster is sampled, the light equation is computed, and shadow rays and secondary rays are generated and fed back to traversal
Path of a ray
- Unknown number of rays
- Ray workload and memory access are highly irregular
- Register and bandwidth pressure is high
Kd-tree
[Figure: example kd-tree; each leaf is labeled with its left/right path from the root: LLLL, LLLR, LLR, LRLL, LRLRL, LRLRR, LRR, RL, RRL, RRR.]
Kd-tree
[Figure: the three cases (A, B, C) of the ray's split-plane distance t* relative to the active interval [tmin, tmax]: the plane lies beyond tmax (only one child is entered), inside the interval (both children are entered, near child first), or before tmin (only one child is entered).]

Registers: 13 minimum
- Ray: 6
- t, tmin, tmax: 3
- node: 2
- tid, stack_top: 2
19 registers is a practical number. The stack lives in local memory.
Kd-tree: tree traversal
Walking the example tree; the stack holds far children still to be visited:

  Step  Current node  Stack
  1     (root)        (empty)
  2     L             R
  3     LL            R
  4     LLL           LLR, R
  5     LLLR          LLR, R
  6     LLR           R        (an intersection is found here: we could stop!)
  7     R             (empty)
  8     RL            RR
  9     RR            (empty)
  10    RRR           (empty)

Result: LLR, RRR
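The walkthrough above can be re-enacted with a short stack-based traversal loop. In this Python sketch the split-plane tests are replaced by a precomputed table saying which children the example ray enters, and leaf hits are likewise hard-coded; both tables are assumptions made up to reproduce the walkthrough (the real kernel derives them from the tmin/tmax cases and from ray-triangle tests):

```python
# Ordered list of children the example ray enters at each inner node
# (near child first). Missing children are the ones the ray skips.
ENTERS = {
    "root": ["L", "R"],
    "L":    ["LL"],            # ray misses the LR side
    "LL":   ["LLL", "LLR"],
    "LLL":  ["LLLR"],          # ray misses LLLL
    "R":    ["RL", "RR"],
    "RR":   ["RRR"],           # ray misses RRL
}
# Whether the ray hits a primitive inside each visited leaf.
HIT = {"LLLR": False, "LLR": True, "RL": False, "RRR": True}

def traverse(root="root"):
    stack, hits = [root], []
    while stack:
        node = stack.pop()
        if node in ENTERS:                 # inner node
            near, *far = ENTERS[node]
            stack.extend(reversed(far))    # far children wait on the stack
            stack.append(near)             # near child is visited next
        elif HIT[node]:                    # leaf with an intersection
            hits.append(node)
    return hits
```

For illustration the loop visits every entered leaf, so it returns both intersected leaves, LLR and RRR; a primary-ray kernel could stop at the first hit, as the slide notes.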
Tree traversal
- Different rays may run for different amounts of time; one slow thread can stall a whole block
- Each thread needs a buffer to store all possible leaves; in the worst case, a ray intersects every leaf of the tree
Tree traversal
- Different rays may run for different amounts of time. Solution: persistent threads
- Each thread needs a buffer to store all possible leaves. Solution: screen tiling
Persistent threads
- Launch as many threads as the hardware can keep resident (depends on the architecture and the kernel's resource requirements)
- Keep all threads busy: create a pool of rays from which threads fetch work while traversing the tree
Regular execution
Disadvantage: the GPU waits until all threads of a block finish before launching a new block.
[Figure: timeline of warps 0-3; Block 1 cannot start until the longest-running warp of Block 0 completes.]
Persistent threads execution
Advantage: the workload is balanced between warps.
[Figure: timeline of warps 0-3 within a single persistent Block 0; warps keep fetching new work instead of idling.]
Screen Tiling
- Split the screen into multiple tiles and render the tiles separately
- Tiles of 128x128 or 256x256 work well; 128x128 is still 16K threads!
- Allows easy multi-GPU performance scaling and control over memory
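The tile arithmetic is straightforward; here is a small Python sketch for a hypothetical 1920x1080 frame (the frame resolution is an assumption, only the tile sizes come from the slide). Partial tiles at the screen edges are rounded up:

```python
import math

def tile_grid(width, height, tile=256):
    # Number of tiles needed to cover the screen, rounding up so the
    # partial tiles along the right and bottom edges are included.
    return math.ceil(width / tile), math.ceil(height / tile)

# One thread per pixel inside a tile:
threads_per_tile = 128 * 128   # 16384, the "16K threads" from the slide
```

With 256x256 tiles, a 1920x1080 frame needs an 8x5 grid of tiles, and each tile is an independent unit of work that can be handed to a different GPU.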
Tree traversal
- The screen is split into tiles (256x256)
- Reserve space for a fixed number of non-empty leaves per ray
Ray-triangle intersection
- Computational complexity: >30 MADs
- Register pressure: >23 registers per thread
  - 6 r. per ray
  - 9 r. per triangle
  - 3 r. for the intersection result (t, u, v)
  - 1 r. for the triangle count
  - 1 r. for the loop index
  - 1 r. for the thread ID (tid)
  - 2 r. for min_t and min_id
[Figure: ray-triangle intersection geometry; ray origin and direction D against triangle vertices v0, v1, v2, with barycentric coordinates (u, v) and distance t.]
Ray-triangle kernel
- Each thread is mapped to a ray; each ray operates on its own triangle
- A block of threads shares a packet of triangles

Ray-triangle intersection
- Each thread is mapped to a ray
- Triangles are fetched through a texture
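The slides do not show the intersection test itself; a common choice that produces exactly the (t, u, v) result named above is the Möller-Trumbore algorithm, shown here as a host-side Python sketch (in the kernel, one thread would run one such test per ray-triangle pair). Whether this is the test the authors used is an assumption:

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(orig, d, v0, v1, v2, eps=1e-8):
    """Moller-Trumbore: returns (t, u, v) for a hit, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None                 # ray parallel to the triangle plane
    inv_det = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det        # hit point = orig + t * d
    return (t, u, v)
```

The per-thread loop would keep the smallest t seen so far in min_t along with its triangle index in min_id, which accounts for the last two registers in the budget above.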