Symmetric Key Cryptography on Modern Graphics Hardware
Jason C. Yang Graphics Products Group Advanced Micro Devices, Inc. 04/23/08 | Session Code: C1-3
Introduction
Jason Yang and James Goodman, “Symmetric Key Cryptography on Modern Graphics Hardware”, Asiacrypt 2007.
Motivation
• Digital Rights Management
• Advanced Access Content System (AACS)
  – Blu-ray / HD-DVD
• Key Searching
April 23, 2008 2
Goals for Today’s Talk
• Introduce graphics hardware (GPUs) as a high performance compute device
• Describe symmetric key implementations on GPUs
• Discuss future directions
Outline
Why Graphics Hardware (GPUs)?
GPU Programming Model
Block-Based AES
  • GPU Programming Example
Key Searching
  • Bitsliced DES and AES
Conclusions
Why GPUs?
What is a GPU?
Hardware originally designed specifically for graphics and multimedia
  • Geometry
  • Images
  • Think “pixel processing”
Massively parallel
Lots of memory bandwidth
GPU Form Factors
• Notebook
• Desktop
• Home Media Server
• Game console
• Home Cinema
• HDTV
• Handset
• UMPC
• Digital STB/PVR
• OCUR
Why Use GPUs Instead of CPUs?
~1-2 orders of magnitude better in several key metrics vs. current CPUs
  • Memory bandwidth
  • GFLOPS per Watt
  • GFLOPS per $
Designed to handle processing massive amounts of data efficiently
  • Supports 1000’s of concurrent threads
  • Massive memory bandwidth
  • Memory latency hiding
Potential for significant speedup for data parallel problems
  • Basic: 5-10x
  • Tuned: 20-100x or even more
[Figure: bar chart of the GPU-to-CPU ratio RGPU/RCPU (y-axis 0 to 80) across several Intel and AMD CPUs, with series for Memory BW, GFLOP/W, and GFLOP/$, each vs. 2950XT, 2950XT2, and 2900XT]
GPU vs. CPU: Quick Comparison
                 GPU (Radeon HD 3870)                      CPU (Barcelona)
# Processors     64+                                       4
ALU area         ~40% of die                               ~5% of die
Memory System    Max bandwidth (10x)                       Min latency (0.1x)
Memory Access    Complex (tiling + arithmetic in memory)   Simple LD/ST
Cache            Small cache                               Large cache (10x)
FP Compliance    Partial IEEE DP/SP                        Full IEEE 754 DP/SP
GPU vs. CPU: Design Points
                   GPU                              CPU
Program Style      Few instructions, lots of data   Lots of instructions, little data
Control Flow       SIMD, hardware threading         Out of order execution, branch prediction
Access Patterns    Little reuse                     Reuse + locality
Program Model      Data parallel                    Task parallel
Synchronization    Very simple sync                 Complex sync
ISA                Proprietary                      Standardized
Legacy Support     Not necessarily                  Backwards compatible
Functional Deltas  Large and frequent               Small and infrequent
Typical CPU Operation
[Figure: timeline of alternating Fetch and ALU phases on a single CPU unit]
• One iteration at a time
• Single CPU unit
• Cannot reach 100% utilization
• Hard to prefetch data
• Multi-core does not help
• Cluster does not help
• Limited number of outstanding fetches
• Wait for memory, gaps prevent peak performance
• Gap size varies dynamically
• Hard to tolerate latency
GPU Threads (Lower Clock – Different Scale)
[Figure: timelines of many threads with overlapped Fetch and ALU phases, 100% ALU utilization]
• Lots of threads
• Fetch unit + ALU unit
• Fast thread switch
• In-order finish
• Overlapped fetch and ALU
• Many outstanding fetches
• ALU units reach 100% utilization
• Hardware sync for final output
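The contrast between the CPU and GPU timelines above can be made concrete with a deliberately simplified model. The sketch below is not from the talk: it assumes every thread repeats a fixed number of ALU cycles followed by a fixed memory wait, that fetches from different threads overlap freely, and that thread switches are free. The function names and cycle counts are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/* Simplified, hypothetical model (not from the talk): each thread does
 * alu_cycles of arithmetic, then waits fetch_cycles for memory.
 * With `threads` threads interleaved on one ALU, the ALU is busy for
 * min(1, threads * alu_cycles / (alu_cycles + fetch_cycles)) of the time. */
double alu_utilization(int threads, int alu_cycles, int fetch_cycles)
{
    assert(threads > 0 && alu_cycles > 0 && fetch_cycles >= 0);
    double busy = (double)threads * alu_cycles;
    double period = (double)alu_cycles + fetch_cycles;
    return busy >= period ? 1.0 : busy / period;
}

/* Smallest thread count that keeps the ALU fully busy in this model:
 * ceil((alu_cycles + fetch_cycles) / alu_cycles), in integer arithmetic. */
int threads_to_hide_latency(int alu_cycles, int fetch_cycles)
{
    assert(alu_cycles > 0 && fetch_cycles >= 0);
    return (alu_cycles + fetch_cycles + alu_cycles - 1) / alu_cycles;
}
```

With the illustrative values of 10 ALU cycles and 90 wait cycles, a single thread keeps the ALU busy about 10% of the time, while 10 interleaved threads are enough to reach full utilization in this model. Real hardware has limits on outstanding fetches and registers that this sketch ignores.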
GPU Internals: ATI Radeon HD 2900 XT
>100 GB/s memory bandwidth
  • 512-bit DDR3/4 interface
Targeted for handling thousands of simultaneous lightweight threads
Instruction cache and constant cache for unlimited program size
Scalar ALU implementation with 320 (64x5) independent stream processors
  • 256 (64x4) basic units (FMAC, ADD/SUB, SIN, etc.)
  • 64 enhanced transcendental units (adds COS, LOG, EXP, RSQ, etc.)
  • Support for INT/UINT in all units (ADD/SUB, AND, XOR, NOT, OR, etc.)
GPU Programming Model
General Purpose GPU (GPGPU) Computing
Not a new idea: graphics APIs (e.g., OpenGL) have been used for general purpose computation for years
  • VERY difficult to use due to constraints of graphics APIs and the graphics programming model itself
  • High overhead of graphics APIs yielded generally poor performance
Key enabler today is that GPU developers are investing effort to improve usability and GPGPU performance
  • Introduction of high level C-like programming languages that access GPUs’ features (e.g., AMD’s Brook+ and Nvidia’s CUDA)
  • Support for simpler programming model
  • Exposure of proprietary internal GPU features (e.g., ISA and IL specs)
  • Creation of toolsets for emulation, performance tuning, and debugging
  • Explicit architectural support for GPGPU (e.g., shared memory)
  • Industry standardization efforts
GPUs are also starting to more closely resemble CPUs
  • Native integer and DP support
  • Advanced control flow features for branching/looping
Understanding GPU Parallelization
Matrix addition
  • C(i,j) = A(i,j) + B(i,j)
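Matrix addition parallelizes well because each C(i,j) depends only on the matching elements of A and B. The plain C sketch below illustrates that independence; it is not the talk's code. The names add_element and matrix_add and the row-major layout are assumptions. The driver visits elements in reverse order to show that the result does not depend on visiting order; on a GPU, each call to add_element would correspond to one lightweight thread.

```c
#include <assert.h>
#include <stddef.h>

/* Computes one output element. It reads only the same (i, j) position
 * of A and B, so no element depends on any other.
 * Row-major storage is an assumption made for this sketch. */
static void add_element(const float *A, const float *B, float *C,
                        size_t cols, size_t i, size_t j)
{
    size_t idx = i * cols + j;
    C[idx] = A[idx] + B[idx];
}

/* Serial stand-in for a data-parallel launch over a rows x cols domain.
 * Elements are visited last to first on purpose: any order (or all at
 * once, one thread per element) produces the same C. */
void matrix_add(const float *A, const float *B, float *C,
                size_t rows, size_t cols)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t n = rows * cols; n-- > 0; )
        add_element(A, B, C, cols, n / cols, n % cols);
}
```

For example, calling matrix_add on two 2x3 matrices fills all six entries of C no matter how the loop walks the index space.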
Some Pseudo C-Code
void sum(float A[], float B[], float C[]) { for(int i=0; i