Scheduling the Graphics Pipeline - Beyond Programmable Shading

Beyond Programmable Shading 2010 Keeping Many Cores Busy: Scheduling the Graphics Pipeline Jonathan Ragan-Kelley, MIT CSAIL 29 July 2010

Beyond Programmable Shading 2010

This talk How to think about scheduling GPU-style pipelines Four constraints which drive scheduling decisions Examples of these concepts in real GPU designs Goals Know why GPUs, APIs impose the constraints they do. Develop intuition for what they can do well. Understand key patterns for building your own pipelines.


First, a definition

Scheduling [n.]:


First, a definition

Scheduling [n.]: Assigning computations and data to resources in space and time.


The workload: Direct3D IA VS PA HS Tess DS PA GS Rast PS Blend Beyond Programmable Shading 2010

The workload: Direct3D IA VS PA HS Tess DS PA GS Rast PS Blend Beyond Programmable Shading 2010

The workload: Direct3D IA VS

data flow

PA HS Tess DS PA GS Rast PS Blend Beyond Programmable Shading 2010


data flow



data flow


Logical pipeline Fixed-function stage Programmable stage

The machine: a modern GPU Shader Core

Shader Core

Shader Core

Shader Core

Shader Core

Shader Core

Tex

Shader Core

Shader Core

Tex

Tex Tex

Primitive Input Assembler Assembler Rasterizer

Output Blend

Logical pipeline Fixed-function stage Programmable stage Physical processor Fixed-function logic

Task Distributor


Programmable core Fixed-function control

Scheduling a draw call as a series of tasks Shader Core

Shader Core

Shader Core

Primitive Input Rasterizer Assembler Assembler

time

Shader Core


Output Blend


Shader Core

Shader Core

Shader Core


time

IA


Output Blend


Shader Core

Shader Core

Shader Core

Primitive Input Rasterizer Assembler Assembler IA

time

VS


Output Blend


Shader Core

Shader Core

Shader Core


time

VS PA


Output Blend


Shader Core

Shader Core

Shader Core


time

VS PA Rast


Output Blend


Shader Core

Shader Core

Shader Core


VS

time

PA Rast PS

PS

PS


Output Blend


Shader Core

Shader Core

Shader Core


Output Blend

IA VS

time

PA Rast PS

PS

PS Blend Blend Blend


An efficient schedule keeps hardware busy Shader Core

time

VS VS

Shader Core PS VS

PS PS

PS

VS PS

PS


Shader Core

Shader Core

PS

VS

IA

PA

PS

PS

IA

PA

VS

IA

PA

PS

PS

IA

PS

PS

VS

VS

IA

PS

PS

IA

PA PA PA


Rast Rast Rast

Output Blend Blend Blend

Blend

Rast Rast

Blend

Rast

Blend

Choosing which tasks to run when (and where) Resource constraints Tasks can only execute when there are sufficient resources for their computation and their data.

Coherence Control coherence is essential to shader core efficiency. Data coherence is essential to memory and communication efficiency.

Load balance Irregularity in execution time create bubbles in the pipeline schedule.

Ordering Graphics APIs define strict ordering semantics, which restrict possible schedules. Beyond Programmable Shading 2010

Resource constraints limit scheduling options

time

Shader Core

PS

Shader Core

PS

Shader Core

PS

Shader Core


VS PS


Output Blend


time

Shader Core

PS

Shader Core

PS

Shader Core

PS

Shader Core

VS


IA

PS


Output Blend


time

Shader Core

PS

Shader Core

PS

Shader Core

??? PS

Shader Core

VS


IA

PS


Output Blend


time

Shader Core

PS

Shader Core

PS

Shader Core

??? PS

Shader Core

VS PS


IA ???


Output Blend


time

Shader Core

PS

Shader Core

PS

Shader Core

??? PS

Shader Core

VS PS


IA ??? Deadlock


Output Blend


time

Shader Core

PS

Shader Core

PS

Shader Core

??? PS

Shader Core

VS PS


IA ??? Deadlock

Key concept: Preallocation of resources helps guarantee forward progress. Beyond Programmable Shading 2010

Output Blend

Coherence is a balancing act Intrinsic tension between: Horizontal (control, fetch) coherence and Vertical (producer-consumer) locality. Locality and Load Balance.


Graphics workloads are irregular Rasterizer






…!


SuperExpensiveShader( ) TrivialShader( )


…!



…!

But: Shaders are optimized for regular, self-similar work. Imbalanced work creates bubbles in the task schedule.




…!

But: Shaders are optimized for regular, self-similar work. Imbalanced work creates bubbles in the task schedule. Solution: Dynamically generating and aggregating tasks isolates irregularity and recaptures coherence. Redistributing tasks restores load balance. Beyond Programmable Shading 2010

Redistribution after irregular amplification Shader Core

Shader Core

Shader Core

Shader Core


Output Blend

IA VS

time

PA Rast PS

PS




Shader Core

Shader Core

Shader Core


Output Blend

IA VS

time

PA Rast PS

PS




Shader Core

Shader Core

Shader Core


Output Blend

IA VS

time

PA Rast PS

PS

PS Blend

Key concept: Managing irregularity by dynamically generating, aggregating, and redistributing tasks Beyond Programmable Shading 2010

Blend Blend

Ordering

Rule: All framebuffer updates must appear as though all triangles were drawn in strict sequential order


Ordering

Rule: All framebuffer updates must appear as though all triangles were drawn in strict sequential order Key concept: Carefully structuring task redistribution to maintain API ordering. Beyond Programmable Shading 2010

Building a real pipeline


Static tile scheduling The simplest thing that could possibly work. Vertex

Pixel

Pixel

Pixel

Pixel

Multiple cores: 1 front-end n back-end Beyond Programmable Shading 2010

Exemplar: ARM Mali 400

Static tile scheduling Pixel

Pixel Vertex Pixel

Pixel




Pixel Vertex Pixel

Pixel




Pixel Vertex Pixel

Pixel




Pixel Vertex Pixel

Pixel




Pixel Vertex Pixel

Pixel



Static tile scheduling Locality

Pixel

captured within tiles

Resource constraints

Pixel

static = simple Pixel

Ordering single front-end, sequential processing within each tile

Pixel




The problem: load imbalance

Pixel

only one task creation point.

Pixel

no dynamic task redistribution.

Pixel





Pixel


Pixel


Pixel



Static tile scheduling Pixel !!!


Pixel idle…


Pixel idle…


Pixel idle…



Sort-last fragment shading Pixel

Pixel Vertex

Rasterizer Pixel

Pixel


Exemplars: NVIDIA G80, ATI RV770


Vertex

Pixel

Redistribution restores fragment load balance.

Pixel

But how can we maintain order?

Rasterizer

Pixel




Pixel Vertex

Rasterizer Pixel

Pixel


Preallocate outputs in FIFO order Exemplars: NVIDIA G80, ATI RV770


Pixel Vertex

Rasterizer Pixel

Pixel


Complete shading asynchronously Exemplars: NVIDIA G80, ATI RV770


Pixel Vertex

Blend fragments in FIFO order Output Blend

Rasterizer Pixel

Pixel



Unified shaders Solve load balance by time-multiplexing different stages onto shared processors according to load Shader Core

Shader Core

Shader Core

Shader Core

Tex

Shader Core

Shader Core

Tex

Shader Core

Shader Core

Tex

Tex

Primitive Input Assembler Assembler Rasterizer

Output Blend

Task Distributor



Unified Shaders: time-multiplexing cores Shader Core

Shader Core

Shader Core

Shader Core


Output Blend

IA VS

time

PA Rast PS

PS




Unified Shaders: time-multiplexing cores Shader Core

Shader Core

Shader Core

Shader Core


Output Blend

IA VS

time

PA Rast PS

PS




Prioritizing the logical pipeline IA VS PA Rast PS Blend Beyond Programmable Shading 2010

Prioritizing the logical pipeline 5

VS

4

PA

3

Rast

2

PS

1

Blend

0

priority

IA



VS

4

PA

3

Rast

2

PS

1

Blend

0

priority

IA



VS

4

PA

3

Rast

2

PS

1

Blend

0

priority

fixed-size queue storage

IA


Scheduling the pipeline Shader Core

IA

Shader Core

VS

time

PA Rast PS Blend Beyond Programmable Shading 2010


Shader Core

IA

VS

VS

VS

time



Shader Core

IA

VS

VS

VS

time


time


Shader Core

IA

VS

VS

VS

PS

PS

High priority, but stalled on output

PA Rast PS Blend


Lower priority, but ready to run

time


Shader Core

IA

VS

VS

VS

PS

PS

…

PA Rast PS Blend


Scheduling the pipeline

Queue sizes and backpressure provide a natural knob for balancing horizontal batch coherence and producer-consumer locality.


Summary


Key concepts Think of scheduling the pipeline as mapping tasks onto cores. Preallocate resources before launching a task. Preallocation helps ensure forward progress and prevent deadlock.

Graphics is irregular. Dynamically generating, aggregating and redistributing tasks at irregular amplification points regains coherence and load balance.

Order matters. Carefully structure task redistribution to maintain ordering.


Questions for the future Can we relax the strict ordering requirements? Can you build a generic scheduler for application-defined pipelines? What application-specific information would a generic scheduler need to work well? Beyond Programmable Shading 2010

Starting points to learn more The next step: parallel primitive processing Eldridge et al. Pomegranate: A Fully Scalable Graphics Architecture. SIGGRAPH 2000. Tim Purcell. Fast Tessellated Rendering on Fermi GF100. Hot3D, HPG 2010.

Scheduling cyclic graphs, in software, on current GPUs Parker et al. OptiX: A General Purpose Ray Tracing Engine. SIGGRAPH 2010. Details of the ARM Mali design Tom Olson. Mali-400 MP: A Scalable GPU for Mobile Devices. Hot3D, HPG 2010. Beyond Programmable Shading 2010

Thank you Special thanks: Tim Purcell, Steve Molnar, Henry Moreton, Steve Parker, Austin Robison - NVIDIA Jeremy Sugerman - Stanford Mike Houston - AMD Mike Doggett - Lund University Tom Olson - ARM Beyond Programmable Shading 2010

Some Lessons


Why don’t we have dynamic resource allocation? e.g. recursion, malloc() in shaders Static preallocation of resources guarantees forward progress. Tasks which outgrow available resources can stall, causing deadlock.


Geometry Shaders are slow because they allow dynamic amplification in shaders. Pick your poison: Always stream through DRAM. exemplar: ATI R600 Smooth falloff for large amplification, but very slow for small amplification (DRAM latency). Scale down parallelism to fit. exemplar: NVIDIA G80 Fast for small amplification, poor shader throughput (no parallelism) for large amplification. Beyond Programmable Shading 2010

Why isn’t rasterization programmable? (Yes, it is computationally intensive.)

It is highly irregular. It must generate and aggregate regular output. It must integrate with an order-preserving task redistribution mechanism.


Scheduling the Graphics Pipeline - Beyond Programmable Shading

Scheduling the Graphics Pipeline - Beyond Programmable Shading

Suggest Documents

OpenCL - Beyond Programmable Shading

GPU Architecture - Beyond Programmable Shading

Keeping Many Cores Busy: Scheduling the Graphics Pipeline Beyond ...

Compute Shader -Beyond Programmable Shading: Fundamentals

Interactive Global Illumination - Beyond Programmable Shading

id Tech 5 Challenges - Beyond Programmable Shading

AVC Using Programmable Graphics

Beyond the Visualization Pipeline: The

Hybrid Computational Voxelization Using the Graphics Pipeline

Interactive Graphics Applications with OpenGL Shading ... - Decom

Hybrid Computational Voxelization Using the Graphics Pipeline

Programmable Packet Scheduling - Stanford University

Real-Time Programmable Volume Rendering - Computer Graphics ...

Ray Casting on Programmable Graphics Hardware - Stanford ...

Object-Space Interference Detection on Programmable Graphics ...

Programmable Graphics Processing Units for Urban Landscape ...

Parallel Genetic Algorithms on Programmable Graphics Hardware

Game Graphics Programming - Cap 5 Programmable Shaders.pdf ...

STANDARDS PIPELINE Imaging and Graphics Business Team ...

An image-based shading pipeline for 2D animation - Visgraf - IMPA

GPU Architecture: Implications & Trends - Beyond Programmable ...

Current Generation Parallelism In Games - Beyond Programmable ...

scheduling of a multiproduct pipeline system - Scielo.br

Computation on GPUs: From a Programmable Pipeline to ... - SCI Utah