Symmetric Key Cryptography on Modern Graphics Hardware - AMD

Jason C. Yang
Graphics Products Group
Advanced Micro Devices, Inc.
04/23/08 | Session Code: C1-3

Introduction
Jason Yang and James Goodman, “Symmetric Key Cryptography on Modern Graphics Hardware”, Asiacrypt 2007.

Motivation
• Digital Rights Management
• Advanced Access Content System (AACS)
  – Blu-Ray / HD-DVD
• Key Searching

April 23, 2008 2

Goals for Today’s Talk
Introduce graphics hardware (GPUs) as a high performance compute device
Describe symmetric key implementations on GPUs
Discuss future directions

Outline
Why Graphics Hardware (GPUs)?
GPU Programming Model
Block-Based AES
• GPU Programming Example
Key Searching
• Bitsliced DES and AES
Conclusions

Why GPUs?

What is a GPU?
Hardware originally designed specifically for graphics and multimedia
• Geometry
• Images
• Think “pixel processing”
Massively parallel
Lots of memory bandwidth

GPU Form Factors
Notebook, Desktop, Home Media Server, Game console, Home Cinema, HDTV, Handset, UMPC, Digital STB/PVR, OCUR

Why Use GPUs Instead of CPUs?
~1-2 orders of magnitude better in several key metrics vs. current CPUs
• Memory bandwidth
• GFLOPS per Watt
• GFLOPS per $
Designed to handle processing massive amounts of data efficiently
• Supports 1000’s of concurrent threads
• Massive memory bandwidth
• Memory latency hiding
Potential for significant speedup for data parallel problems
• Basic: 5-10x
• Tuned: 20-100x or even more

[Figure: bar chart of the GPU-to-CPU ratio (RGPU/RCPU, axis 0 to 80) for Memory BW, GFLOP/W, and GFLOP/$, with the 2900XT, 2950XT, and 2950XT2 each compared against a set of Intel and AMD CPUs.]

GPU vs. CPU: Quick Comparison

                 GPU (Radeon HD 3870)                      CPU (Barcelona)
# Processors     64+                                       4
ALU area         ~40% of die                               ~5% of die
Memory System    Max bandwidth (10x)                       Min latency (0.1x)
Memory Access    Complex (tiling + arithmetic in memory)   Simple LD/ST
Cache            Small cache                               Large cache (10x)
FP Compliance    Partial IEEE DP/SP                        Full IEEE 754 DP/SP

GPU vs. CPU: Design Points

                   GPU                               CPU
Program Style      Few instructions, lots of data    Lots of instructions, little data
Control Flow       SIMD, hardware threading          Out of order execution, branch prediction
Access Patterns    Little reuse                      Reuse + locality
Program Model      Data parallel                     Task parallel
Synchronization    Very simple sync                  Complex sync
ISA                Proprietary                       Standardized
Legacy Support     Not necessarily                   Backwards compatible
Functional Deltas  Large and frequent                Small and infrequent

Typical CPU Operation
One iteration at a time
Single CPU unit
Cannot reach 100%
Hard to prefetch data
Multi-core does not help
Cluster does not help
Limited number of outstanding fetches
[Figure: timeline with Fetch and ALU rows]
Wait for memory, gaps prevent peak performance
Gap size varies dynamically
Hard to tolerate latency

GPU THREADS (Lower Clock – Different Scale)
Lots of threads
Fetch unit + ALU unit
Fast thread switch
In-order finish
[Figure: timeline with Fetch and ALU rows for many threads, 100% ALU utilization]
Overlapped fetch and ALU
Many outstanding fetches
ALU units reach 100% utilization
Hardware sync for final output
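
The contrast between the CPU and GPU timelines can be made concrete with a back-of-the-envelope model. This is only a sketch under simplifying assumptions that are not from the talk: every thread alternates a fixed ALU phase with a fixed memory fetch, thread switches are free, and any number of fetches may be outstanding. The function names and cycle counts are hypothetical.

```c
/* Simplified latency-hiding model (an illustration, not the talk's own). */

/* Fraction of time the ALU is busy when n_threads are interleaved.
 * Each thread does alu_cycles of work, then waits fetch_cycles for memory. */
double alu_utilization(unsigned n_threads, double alu_cycles, double fetch_cycles)
{
    if (n_threads == 0 || alu_cycles <= 0.0)
        return 0.0;
    double u = n_threads * alu_cycles / (alu_cycles + fetch_cycles);
    return u > 1.0 ? 1.0 : u;   /* the ALU cannot be more than 100% busy */
}

/* Smallest thread count that keeps the ALU fully busy in this model:
 * ceil((alu_cycles + fetch_cycles) / alu_cycles). */
unsigned threads_needed(unsigned alu_cycles, unsigned fetch_cycles)
{
    if (alu_cycles == 0)
        return 0;
    unsigned total = alu_cycles + fetch_cycles;
    return (total + alu_cycles - 1) / alu_cycles;
}
```

With a 10-cycle ALU phase and a 90-cycle fetch, a single thread keeps the ALU busy 10% of the time, and 10 interleaved threads are enough to reach full utilization under this model.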

GPU Internals: ATI Radeon HD 2900 XT
>100 GB/s memory bandwidth
• 512b DDR3/4 interface
Targeted for handling thousands of simultaneous lightweight threads
Instruction cache and constant cache for unlimited program size
Scalar ALU implementation with 320 (64x5) independent stream processors
• 256 (64x4) basic units (FMAC, ADD/SUB, SIN, etc.)
• 64 enhanced transcendental units (adds COS, LOG, EXP, RSQ, etc.)
• Support for INT/UINT in all units (ADD/SUB, AND, XOR, NOT, OR, etc.)
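
The bandwidth figure follows from the bus width and the memory transfer rate. The helper below is a hypothetical illustration of that arithmetic. The slide gives only the 512-bit width and the ">100 GB/s" total, so the 1.6 GT/s transfer rate in the usage note is an assumed value chosen for illustration, not a number from the talk.

```c
/* Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).
 * bus_width_bits: width of the memory interface, e.g. 512.
 * transfers_per_s: data transfers per second on each pin (assumed input). */
double peak_bandwidth_gb_per_s(unsigned bus_width_bits, double transfers_per_s)
{
    double bytes_per_transfer = bus_width_bits / 8.0;
    return bytes_per_transfer * transfers_per_s / 1e9;
}
```

A 512-bit interface moves 64 bytes per transfer, so an assumed 1.6e9 transfers per second gives 102.4 GB/s, consistent with the ">100 GB/s" on the slide.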

GPU Programming Model

General Purpose GPU (GPGPU) Computing
Not a new idea: graphics APIs (e.g., OpenGL) have been used for general purpose computation for years
• VERY difficult to use due to constraints of graphics APIs and the graphics programming model itself
• High overhead of graphics APIs yielded generally poor performance
Key enabler today is that GPU developers are investing effort to improve usability and GPGPU performance
• Introduction of high level C-like programming languages that access GPUs’ features (e.g., AMD’s Brook+ and Nvidia’s CUDA)
• Support for simpler programming model
• Exposure of proprietary internal GPU features (e.g., ISA and IL specs)
• Creation of toolsets for emulation, performance tuning, and debugging
• Explicit architectural support for GPGPU (e.g., shared memory)
• Industry standardization efforts
GPUs are also starting to more closely resemble CPUs
• Native integer and DP support
• Advanced control flow features for branching/looping

Understanding GPU Parallelization
Matrix addition
• C(i,j) = A(i,j) + B(i,j)

Some Pseudo C-Code

void sum(float A[], float B[], float C[]) { for(int i=0; i
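
As a minimal sketch of the matrix addition C(i,j) = A(i,j) + B(i,j), the plain C below contrasts a serial loop with a data parallel formulation in which the loop body becomes a per-element kernel. The function names, the explicit element count n, and the flattened row-major storage are assumptions made for this illustration. The outer loop in sum_data_parallel only stands in for the hardware launching one thread per element; it is not GPU code.

```c
#include <stddef.h>

/* CPU style: a single thread walks every element in order. */
void sum_serial(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

/* GPU style: the loop body becomes a kernel that handles exactly one
 * element. No element depends on any other, so all n invocations could
 * run concurrently. */
static void sum_kernel(const float *A, const float *B, float *C, size_t idx)
{
    C[idx] = A[idx] + B[idx];
}

/* Stand-in for a GPU launch: conceptually one thread per element.
 * Here the "threads" simply run one after another on the CPU. */
void sum_data_parallel(const float *A, const float *B, float *C, size_t n)
{
    for (size_t idx = 0; idx < n; idx++)
        sum_kernel(A, B, C, idx);
}
```

Both functions produce the same result. The difference is that the kernel form exposes the independence of the elements, which is what lets a GPU spread the work over thousands of threads.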
