Lecture 1: Computer Architecture - Zettaflops.org

Advanced Architectures & Execution Models: How New Architectures May Help Give Silicon Some Temporary New Life and Pave the Way for New Technologies
Peter M. Kogge, McCourtney Prof. of CS & Engr., Concurrent Prof. of EE, Assoc. Dean for Research, University of Notre Dame; IBM Fellow (ret.)


Workshop on Extreme Computing, LASCI, Oct. 12, 2004

Slide 1

Why Is Today's Supercomputing Hard in Silicon: Little's Tyranny
• ILP: getting tougher & tougher to increase
  – Must be extracted from the program
  – Must be supported in H/W
• Little's Law: Concurrency / Latency = Throughput
  – Latency: getting worse fast!!!! (the Memory Wall)
  – Throughput: much less than peak, and degrading rapidly
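To see what Little's Law demands, here is a minimal illustration in C with representative numbers that are assumptions, not figures from the talk: sustaining 10^12 memory-bound operations per second against a 100 ns average memory latency requires roughly 100,000 references in flight at all times.

    /* Little's Law: concurrency = throughput x latency.
       Numbers below are illustrative assumptions, not from the slides. */
    #include <stdio.h>

    int main(void) {
        double throughput  = 1e12;    /* sustained operations per second */
        double latency_s   = 100e-9;  /* average memory latency: 100 ns  */
        double concurrency = throughput * latency_s;
        printf("operations that must be in flight: %.0f\n", concurrency);
        return 0;                     /* prints 100000 */
    }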


Slide 2

Why Is Zettaflops Even Harder?
• Silicon density: the sheer space taken up implies large distances & loooooong latencies
• Silicon mindset:
  – Processing logic "over here"
  – Memory "over there"
  – And we add acres of high-heat-producing stuff to bridge the gap
• Thesis: how far can we go with a mindset change?


Slide 3

This Talk: Climbing the Wall a "Different Way"
• Enabling concepts implementable in silicon:
  – Processing In Memory
    • Lowering the wall, in both bandwidth & latency
  – Relentless multi-threading with light weight threads
    • To change the number of times we must climb it
    • To reduce the state we need to keep behind
• Finding architectures & execution models that support both
• With emphasis on "Highly Scalable" systems


Slide 4

"Processing-In-Memory"
• High density memory on the same chip with high speed logic
• Very fast access from logic to memory
• Very high bandwidth
• ISA/microarchitecture designed to utilize the high bandwidth
• Tile the chip with "memory + logic" nodes
[Diagram: a memory/logic node, and its tiling across a chip. The memory array (1 "full word" per row, 1 column per full-word bit) feeds row decode logic, sense amplifiers/latches, and column multiplexing into a "wide word" interface driven by an address. The processing logic side holds wide ALUs, a wide register file, a permutation network, thread state package, performance monitor, broadcast bus, global address translation, and parcel decode and assembly; stand-alone memory units round out the tile. Incoming and outgoing parcels travel over the interconnect, where Parcel = object address + method_name + parameters.]


Slide 5

A Short History of PIM @ ND
[Figure: a timeline of Notre Dame PIM work, starting from our IBM origins (a 16b CPU, 64 KB SRAM, DRAM I/F, the ASAP CPU core, 1394 and PCI memory interfaces):
• EXECUBE - 1st DRAM MIMD PIM; 8-way hypercube
• RTAIS
• EXESPHERE
• PIM Fast
• PIM Lite - 1st mobile, multithreaded PIM; demo of all lessons learned
• PIM Macros - fabbed in advanced technologies; explore key layout issues
• HTMT - world's first trans-petaflop design; PIM-enhanced memory system
• DIVA - multi-PIM memory module; model potential PIM programs
• Cascade (a Cray Inc. HPCS partner) - SRAM PIM and DRAM PIM alongside a conventional CPU + cache on a conventional motherboard; "coming soon to a computer near you"
Other tags in the figure: FPGA ASAP prototype / early multi-threading; 9 PIMs + FPGA interconnect / place PIM in PC memory space; PIM for spacecraft / parcels & scalability; 2-level PIM for petaflop / ultra-scalability.]

Slide 6

Acknowledgements
• My personal work with PIM dates to the late 1980s
• But also! The ND/CalTech/Cray collaboration is now a decade old!
[Photo: Architecture Working Group, 1st Workshop on Petaflops Computing, Pasadena, CA, Feb. 22-24, 1994]

Slide 7

Topics
• How We Spend Today's Silicon
• The Silicon Roadmap – A Different Way
• PIM as an Alternate Technology
• PIM-Enabled Architectures
• Matching ISA & Execution Models
• Some Examples

Slide 8

How We Spend Our Silicon Today: Or “The State of State”


Slide 9

How Are We Using Our Silicon? Compare a CPU to a DP FPU
[Chart: die area (sq. mm, 0 to 450) vs. year, 1970-2005, for the CPU die area and the equivalent DP FPU area; a crossover is marked, and the gap grows to 66 to 1. Is this state?]

Slide 10

CPU State vs. Time
[Chart: estimated CPU state (k bits, log scale from 0.01 to 100,000) vs. year, 1970-2000, growing at roughly a 1.5X compound growth rate per year. Series: total state, machine, supervisor, user, transient, latency-enhancing, and access-enhancing state.]

Slide 11

So We Expect State & Transistor Count to Be Related
[Chart: number of transistors (x1000, log scale to 1,000,000) vs. total state (x1000, log scale to 100,000); the two track each other closely. And they are!]

Slide 12

The Silicon Roadmap: Or “How are we Using an Average Square of Silicon?”


Slide 13

The Perfect "Knee Curves": No "Overhead" of Any Kind; the "Knee" = 50% Logic & 50% Memory
[Chart: GF/sq.cm (log scale, 1.00E-02 to 1.00E+05) vs. GB/sq.cm (log scale, 0.01 to 10.00); one knee curve per year from 2003 to 2018, moving outward with time.]

Slide 14

Adding In "Lines of Constant Performance"
[Chart: the same GF/sq.cm vs. GB/sq.cm knee curves, overlaid with lines of constant byte-to-flop ratio at 0.001 GB/GF, 0.5 GB/GF, and 1 GB/GF.]
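How those lines are drawn (a sketch of my reading of the chart, with illustrative numbers): for a fixed byte-to-flop ratio R in GB/GF, a point with D GB/sq.cm of memory density can support at most D / R GF/sq.cm, so each ratio appears as a straight line on the log-log axes.

    /* Constant GB/GF ratios as lines on the GF/sq.cm vs. GB/sq.cm plot.
       The ratios match the chart labels; the sample density is arbitrary. */
    #include <stdio.h>

    int main(void) {
        double density_gb = 1.0;                 /* GB per sq.cm (example) */
        double ratios[]   = {0.001, 0.5, 1.0};   /* GB per GF              */
        for (int i = 0; i < 3; i++)
            printf("%.3f GB/GF -> %.0f GF/sq.cm at %.1f GB/sq.cm\n",
                   ratios[i], density_gb / ratios[i], density_gb);
        return 0;
    }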

Slide 15

What If We Include Basic Overhead?
[Chart: GF/sq.cm vs. GB/sq.cm, comparing the perfect 2003 and 2018 knee curves with curves that include basic overhead, against constant-ratio lines at 0.001, 0.01, 0.5, and 1 GB/GF.]

Slide 16

What If We Look at Today's Separate-Chip Systems?
[Chart: GF/sq.cm vs. GB/sq.cm, showing the perfect and minimum-overhead knee curves for 2003 and 2018 against constant-ratio lines at 0.001, 0.01, 0.5, and 1 GB/GF, for comparison with today's separate-chip systems.]

Slide 17

How Many of Today's Chips Make Up a "Zetta"?
[Chart: chip count (log scale, 1.00E+09 to 1.00E+13) vs. year, 2002-2020; one curve for the chips needed for a ZettaByte (1 ZB) of memory and one for a ZettaFlop (1 ZF) of peak performance.]
And this does not include "routing"!
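For scale, a back-of-the-envelope version of the chart's starting point, using assumed 2004-class chip figures (roughly 10 GF of peak per processor die and roughly 1 GB per memory die; neither number is taken from the chart):

    #include <stdio.h>

    int main(void) {
        double zetta          = 1e21;
        double gf_per_chip    = 10e9;  /* ~10 Gflop/s peak per die (assumed) */
        double bytes_per_chip = 1e9;   /* ~1 GB DRAM per die (assumed)       */
        printf("chips for 1 ZF peak: %.1e\n", zetta / gf_per_chip);    /* ~1e11 */
        printf("chips for 1 ZB:      %.1e\n", zetta / bytes_per_chip); /* ~1e12 */
        return 0;
    }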

Slide 18

How Big Is a 1 ZB System?
[Chart: area of a 1 ZB system (sq. m, log scale from 1.00E+03 to 1.00E+12) vs. year, 2003-2018, for curves labeled 0.001DC, 0.01DC, 0.1DC, 0.5DC, and 1DC, with reference lines for a football field, one square mile, and the area of Manhattan Island.]

Slide 19

How Does Chip I/O Bandwidth Relate to Performance?
[Chart: ratio of GBps per die to GFps per die (log scale, 10.00 to 1000.00) vs. year, 2000-2020.]

Slide 20

Problems
• Complexity & area: infeasible
• Flop numbers assume perfect utilization
• But latencies are huge
  – Diameter of Manhattan = 28 microseconds
• And efficiencies will plummet
  – At 0.1% efficiency we need the area of Rhode Island just for the microprocessor chips
    • Whose diameter is 240 microseconds
• And we don't have enough pins!
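One way to arrive at the 28-microsecond figure (a sketch under stated assumptions: treat Manhattan's ~59 km^2 as a circle and propagate signals at the speed of light; the talk does not spell out its calculation):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double PI   = 3.141592653589793;
        double area_m2    = 59.1e6;                    /* Manhattan, ~59.1 km^2 */
        double diameter_m = sqrt(4.0 * area_m2 / PI);  /* ~8.7 km               */
        double c          = 3.0e8;                     /* speed of light, m/s   */
        printf("one-way light time across the island: %.0f us\n",
               diameter_m / c * 1e6);                  /* ~29 us */
        return 0;
    }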

Slide 21

PIM as an Alternative Technology


Slide 22

PIM Objective
• Move processing logic onto the dense DRAM die
• Obtain decreased latency & increased bandwidth for local references
  – Without needing pins
• AND simplify the logic down to a simple core
  – Thus allowing many more "processors"
  – With the off-chip pins used for true remote references

Slide 23

Classical DRAM
• Memory mats: ~1 Mbit each
• Row decoders
• Primary sense amps
• Secondary sense amps & "page" multiplexing
• Timing, BIST, interface
• Kerf
45% of the die is non-storage
[Charts: (left) density (Mbits/sq.mm, log scale from 0.0001 to 1000) vs. year, 1970-2020, for chip and cell density, historical data and SIA production/introduction projections; (right) % chip overhead (0.00 to 1.00) vs. year, 1970-2020, historical and SIA production/introduction.]

Slide 24

Embedded DRAM Macros Today (Almost)
[Diagram: some maximum number of memory blocks, each a 512 x 2048 = 1 Mbit MAT with row decode and primary sense amps, feeding a base block with secondary sense, mux, BIST, and timing; given an address, the macro delivers "wide word" data, 100's of bits per access.]
[Chart: cycle time estimates from IBM for embedded RAM, 1998-2005 (log scale, 1 to 100), for SRAM, DRAM, and estimated SRAM.]

Slide 25

PIM Chip MicroArchitectural Spectrum
[Diagram: a spectrum of PIM chip organizations, all built from DRAM macros (16 Mbit, with 512B lines) plus logic (memory/logic tiles repeated 8 times, victim caches, a simple MicroSPARC II core, a memory coherence controller, serial interconnect, and CPUs with L1/L2):
• SIMD: Linden DAAM
• Single chip computer: Mitsubishi M32R/D (CPU + caches + multiplier + memory I/F)
• Complete SMP node: proposed SUN part
• Chip-level SMP: POWER4
• Tiled & scalable: BLUE GENE, EXECUBE]

Slide 26

The PIM "Bandwidth Bump"
[Chart: bandwidth (GB/s, log scale to 1000) vs. reachable memory (bytes, 1.00E+00 to 1.00E+09), for an UltraIII CPU chip (complex register file, L1, L2, off-chip memory), a single PIM node (simple 3-port register file, local chip memory), and a 32-node PIM chip. Between 1 B and 1 GB, the area under the curve is 4.3x the UltraIII for 1 PIM node and 137x for 1 PIM chip. The small-footprint region is where classical, temporally intensive codes hold the performance advantage; the large-footprint region is where PIM's spatially intensive access wins, even for 1 node.]

Slide 27

PIM-Based Architectures: System & Chip Level


Slide 28

PIM System Design Space: Historical Evolution
• Variant One: Accelerator (historical)
• Variant Two: Smart Memory
  – Attach to an existing SMP (using an existing memory bus interface)
  – PIM-enhanced memories, accessible as plain memory if you wish
  – Value: enhancing the performance of the status quo
• Variant Three: Heterogeneous Collaborative
  – PIMs become "independent" & communicate as peers
  – Non-PIM nodes "see" PIMs as equals
  – Value: enhanced concurrency and generality over Variant Two
• Variant Four: Uniform Fabric ("All PIM")
  – A PIM "fabric" with fully distributed control and emergent behavior
  – Extra system I/O connectivity required
  – Value: simplicity and economy over Variant Three
• Option for any of the above: Extended Storage
  – Each PIM supports separate dumb memory chips

Slide 29

TERASYS SIMD PIM (circa 1993)
• Memory part for the CRAY-3
• "Looked like" SRAM memory, with an extra command port
• 128K SRAM bits (2k x 64)
• 64 one-bit ALUs
• SIMD ISA
• Fabbed by National
• Also built into a workstation with 64K processors
• 5-48X a Y-MP on 9 NSA benchmarks

Slide 30

EXECUBE: An Early MIMD PIM (1st Silicon 1993)
• First DRAM-based multiprocessor on a chip
• Designed from the outset for "glueless," one-part-type scalability
• On-chip bandwidth: 6.2 GB/s; utilization modes > 4 GB/s
• Includes "high bandwidth" features in the ISA
[Diagram: 8 compute nodes on ONE chip, each a CPU + cache surrounded by memory macros; EXECUBE forms a 3D binary hypercube, SIMD/MIMD on a chip.]

Slide 31

RTAIS: The First ASAP (circa 1993)
• Application: "Linda in Memory"
• Designed from the outset to perform wide ops "at the sense amps"
• More than SIMD: a flexible mix of VLIW
• "Object oriented," multi-threaded memory interface
• Result: 1 card 60X faster than a state-of-the-art R3000 card
[Diagram: a controller and shared memory on a memory bus driving RAM-0 through RAM-31, each with its own 8-bit ALU, joined by an inter-ALU exchange.]

Slide 32

Mitsubishi M32R/D
• 32-bit fixed point CPU + 2 MB DRAM
• "Memory-like" interface: 16-bit data bus, 24-bit address bus, plus two 1-bit I/Os
• Utilizes the wide-word I/F from the DRAM macro for cache lines
[Diagram: four DRAM macros surrounding a CPU with caches, a multiplier, and a memory I/F.]

Slide 33

DIVA: Smart DIMMs for Irregular Data Structures
• The host (a conventional motherboard: µP + cache + TLB) issues parcels
  – Generalized "loads & stores"
  – Treat memory as an active, object-oriented store
• DIVA functions:
  – Prefix operators
  – Dereferencing & pointer chasing
  – Compiled methods
  – Multi-threaded
  – May generate parcels
• Each PIM node: 1 CPU + 2 MB, MIPS + "wide word"
[Diagram: the host reaches the smart DIMMs through address maps and an interconnect; each DIMM holds multiple ASAP-style nodes, each with memory, a local program CPU, and a memory stack.]

Slide 34

Micron Yukon
• 0.15 µm eDRAM / 0.18 µm logic process
• 128 Mbits DRAM (16 MBytes embedded DRAM)
  – 2048 data bits per access
• 256 8-bit integer processors
  – Configurable in multiple topologies
• On-chip programmable controller
• Operates like an SDRAM
[Diagram: a remote host talks to the chip over an SDRAM-like interface (HMI); FIFOs, a task dispatch unit, and FIFO synchronisation feed the 256 processing elements and their register files, with an M16 PE sequencer and a DRAM control unit in front of the 16 MBytes of embedded DRAM.]

Slide 35

Berkeley VIRAM
• System architecture: single-chip media processing
• ISA: MIPS core + vectors + DSP ops
• 13 MB DRAM in 8 banks
• Includes floating point
• 2 Watts @ 200 MHz, 1.6 GFlops
[Diagram: a MIPS core plus 4 "vector lanes" next to the DRAM banks.]

Slide 36

The HTMT Architecture & PIM Functions
[Diagram: the HTMT storage hierarchy (holographic 3D memory, an I/O farm, DRAM PIM, an optical switch, SRAM PIM, and the RSFQ nodes), with the PIM functions annotated at each level: compress/decompress, ECC/redundancy, spectral transforms, routing, data structure initializations, "in the memory" operations, RSFQ thread management, context percolation, scatter/gather indexing, pointer chasing, push/pull closures, and synchronization activities. The PIMs are in charge.]
New technologies:
• Rapid Single Flux Quantum (RSFQ) devices for 100 GHz CPU nodes
• WDM all-optical network for petabit/sec bisection bandwidth
• Holographic 3D crystals for petabytes of on-line RAM
• PIM for active memories to manage latency

Slide 37

PIM Lite
• "Looks like memory" at its interfaces
• ISA: 16-bit, multithreaded/SIMD
  – "Thread" = IP/FP pair
  – "Registers" = wide words in frames
• Designed for multiple nodes per chip
  – 1 node's logic area ~ 10.3 KB of SRAM (comparable to a MIPS R3000)
• TSMC 0.18u; 1-node first-pass success; 3.2 million transistors (4-node)
[Die plot/diagram: a 4.5 mm x 2.9 mm node containing the thread pool/queue, instruction memory (4 Kbytes), frame memory (1 K), ALU & permute net, data memory (2 Kbytes), write-back logic, and the memory interconnect network; parcels come in and go out via the chip data bus.]

Slide 38

Next: An "All-PIM" Supercomputer
[Diagram: PIM chips grouped onto "PIM DIMMs," PIM DIMMs grouped into "PIM clusters," and the clusters (including "host" and I/O PIM clusters) joined by an interconnection network.]

Slide 39

What Might an "All-PIM" Zetta Target Look Like?
[Chart: area of a 1 ZB PIM system (sq. m, log scale from 1.00E+07 to 1.00E+11) vs. year, 2003-2018, for a no-logic configuration and for 1, 2, 10, 100, and 1000 ZF of peak performance, against the area of Manhattan Island. Takeaway: "I can add 1000X peak performance at 3X area!"]

Slide 40

Matching ISA & Execution Models


Slide 41

Guiding Principles: Memory-Centric Processor Architecture
• Focus on memory, not logic
  – The memory holds thread frames, instructions, and global data
  – Make "CPUs" anonymous
• Maximize performance, keeping cost in check
  – Make the processor cost like a memory, not vice-versa!
• "How low can you go"
  – Minimize storage for machine state (registers, buffers, etc.)
  – Don't overprovision functional units
• "Wider is better"
  – Swap thread state in a single bus cycle
  – Wide-word, SIMD data operations

Slide 42

Working Definitions
Light weight state: refers to the computational thread
• Separated from the rest of the system state
• Reduced in size to ~ a cache line
Serious multi-threading:
• Very cheap thread creation and state access
• Huge numbers of concurrent threads
• Support for cheap, low-level synchronization
• Permits opportunities for significant latency hiding
• ISA-level knowledge of "locality"

Slide 43

How to Use Light Weight Threads
• Approaches to solving Little's Law problems
  – Reduce the # of latency-causing events
  – Reduce the total number of bit/chip crossings per operation
  – Or, reduce the effective latency
• Solutions
  – Replace 2-way latency by a 1-way "command"
  – Let thread processing occur "at the memory"
  – Increase the number of memory access points for more bandwidth
• Light Weight Thread: the minimal state needed to initiate limited processing in memory (sketched below)
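A minimal sketch of what that "minimal state" might look like, assuming a 64-byte cache line; the field names and layout are illustrative, not taken from the talk:

    #include <stdint.h>

    /* A light weight thread whose whole state fits in one cache line,
       so it can be swapped, or shipped in a parcel, in a single
       wide-word transfer. */
    typedef struct lwt_state {
        uint64_t ip;       /* instruction pointer                        */
        uint64_t fp;       /* frame pointer into in-memory frame storage */
        uint64_t regs[6];  /* a few working registers / payload slots    */
    } lwt_state_t;         /* 8 x 8 bytes = 64 bytes = one cache line    */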


Slide 44

Parcels: The Architectural Glue
• Parcel = Parallel Communication Element
• The basic unit of communication between nodes
  – At the same level as a "dumb memory reference"
• Contents: an extension of an "Active Message"
  – Destination: an object in application space
  – Method: the function to perform on that object
  – Parameters: the values to use in the computation
Threads in transit!!!
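A sketch of a parcel's contents as just described; the sizes and field names are assumptions for illustration only:

    #include <stdint.h>

    typedef struct parcel {
        uint64_t dest_object;  /* global address of the target object        */
        uint16_t method;       /* method to perform on that object           */
        uint16_t n_params;     /* number of parameter words that follow      */
        uint64_t params[4];    /* small fixed payload keeps the parcel light */
    } parcel_t;                /* comparable in size to a dumb memory access */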


Slide 45

Types of Parcels
• Memory access: a "dumb" read/write to memory
• AMO: a simple prefix op applied to a memory location
• Remote thread invocation: start a new thread at the node holding a designated address
  – Simple case: booting the original node run-time
  – More interesting: "slices" of program code
• Remote object method invocation: invoke a method against an object at the object's home
• Traveling thread = continuation: move the entire execution state to the next datum
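As an example of the AMO category, an operation handled entirely "at the memory" might look like the following sketch; the function name and calling convention are assumptions:

    #include <stdint.h>

    /* Runs on the PIM node that owns *word: apply the prefix op locally
       and return only the old value, so a round-trip read-modify-write
       collapses into one incoming parcel (plus an optional reply). */
    uint64_t amo_fetch_add(uint64_t *word, uint64_t addend) {
        uint64_t old = *word;
        *word = old + addend;
        return old;
    }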


Slide 46

Software Design Space Evolution
• Hardware "fetch_and_op"
• Generic libraries only
• App-specific subprogram compilation
  – Explicit method invocation
  – Expanded storage management
• Explicit support for classical multi-threading
• Inter-thread communication
  – Message passing
  – Shared memory synchronization
  – Atomic multi-word transactions
• Pro-active data migration/percolation
• Expanded multi-threading
  – Extraction of very short threads, new AMOs
  – Thread migration
• OS, run-time, and I/O management "in the memory"

Slide 47

The HTMT Percolation Execution Model
[Diagram: data structures live in DRAM; "contexts" are gathered into SRAM, and their working data percolates up into CRAM "contexts" next to the SPELL processors through the VORTEX and CNET networks; results are scattered back down the hierarchy. All of the orange arrows (gather data, working data, scatter, results) are parcels.]

Slide 48

More Exotic Parcel Functions: The "Background" Ones
• Garbage collection: reclaim memory
• Load balancing: check for over/under load & suggest migration of activities
• Introspection: look for livelock, deadlock, faults, or pending failures
• Key attributes of all of these:
  – Independent of the application programs
  – Inherent memory orientation
  – Significant possibility of mobility

Slide 49

Examples:
• In-Memory Multi-Threading
• Traveling Threads
• Very Light Weight Thread Extraction

Slide 50

N-Body Simulation
• Simulate the motion of N bodies under mutual attractive/repulsive forces: O(N^2)
• Barnes-Hut method
  – Clusters of bodies are approximated by a single body when dense and far away
  – Subdivide the region into cells, represented using a quad/octree
• Highly parallel
  – 90+ percent of the workload can be parallelized
Outline:
  initialize
  for each body: calculate net force
  synchronize
  update positions
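For concreteness, a direct-summation sketch of the per-body force calculation (this is the O(N^2) baseline; Barnes-Hut replaces the inner loop with a walk over tree-cell summaries):

    #include <math.h>

    typedef struct { double x, y, z, mass; } body_t;

    /* Accumulate the net gravitational force on body i into f[3]. */
    void net_force(const body_t *b, int n, int i, double G, double f[3]) {
        f[0] = f[1] = f[2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + 1e-12;   /* softened */
            double s  = G * b[i].mass * b[j].mass / (r2 * sqrt(r2));
            f[0] += s * dx;  f[1] += s * dy;  f[2] += s * dz;
        }
    }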


Slide 51

Heavyweight N-Body
• Each node in the cluster has a partition of the tree in memory, distributed spatially
• Needs the "Locally Essential Tree" (L.E.T.) in each cache
• Network traffic on every cache miss
• Sterling, Salmon, et al.: Gordon Bell Prize, 1997 (performance & cost-performance)

Slide 52

Multithreaded In-PIM N-Body
• Thread per body: accumulate the net force while traversing the tree in parallel, with in-memory processing
• Very low individual thread state (net force, next pointer, body coordinates, body mass); a sketch follows below
• Network traffic only when a thread leaves its partition, so lower traffic overall; replicate the top of the tree to reduce the bottleneck
Configuration studied:
• 16 64-MB PIM chips
• Each with multiple nodes
• Appearing as memory to the host
• Host at 2X the clock rate of the PIM
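A sketch of that per-body thread state, with the fields the slide names; types and layout are illustrative assumptions:

    #include <stdint.h>

    typedef struct body_thread {
        double   force[3];   /* net force accumulated so far           */
        uint64_t next_cell;  /* pointer to the next tree cell to visit */
        double   pos[3];     /* body coordinates                       */
        double   mass;       /* body mass                              */
    } body_thread_t;         /* 64 bytes: small enough to travel in a parcel */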


Slide 53

N-Body on PIM: Cost vs. Performance
[Chart: relative increase in speedup and in cost (1 to 5.5) vs. nodes per PIM chip (1 to 5), for 1-FPU and 4-way-SIMD node variants.]
Assumptions:
• Cost basis: the prior system
• 15% serial on the host
• 85% highly threadable
• 40% of that 85% is short-vector, 4-way SIMD work
Conclusions:
• Multi-threading & latency reduction buy the most
• Short FP vectors may not be worth it

Slide 54

Example: Traveling Thread Vector Gather
• Given a base, stride, and count, read a strided vector into a compact vector
• Classical CPU-centric approach:
  – Issue "waves" of multiple, ideally K, loads
  – If stride < block size, return a cache line
  – Else return a single double word
• LWT thread-based approach (sketched below):
  – The source issues K "gathering" threads, one to each PIM memory macro
  – Each thread reads local values into its payload
  – It dispatches its payload onward each time the payload fills
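A sketch of one such gathering thread running inside a PIM memory macro; the names and the send_parcel() helper are illustrative assumptions, not an API from the talk:

    #include <stddef.h>
    #include <stdint.h>

    #define Q 8   /* payload size in double words */

    void send_parcel(uint64_t dest, const double *payload, size_t n); /* assumed */

    /* Walk the locally resident strided elements, filling a Q-element
       payload and shipping it home each time it fills. */
    void gather_thread(const double *local_base, size_t stride_dw,
                       size_t local_count, uint64_t dest) {
        double payload[Q];
        size_t filled = 0;
        for (size_t i = 0; i < local_count; i++) {
            payload[filled++] = local_base[i * stride_dw];  /* local DRAM read */
            if (filled == Q) {
                send_parcel(dest, payload, Q);              /* one parcel out  */
                filled = 0;
            }
        }
        if (filled) send_parcel(dest, payload, filled);     /* remainder */
    }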


Slide 55

Vector Gather via LWT: Traffic Estimates
[Charts: transactions (log scale, 10 to 100,000) and data transferred (bytes, log scale, 1,000 to 10,000,000) vs. stride (bytes, 1 to 10,000), comparing the classical approach with traveling threads of payload size Q = 2, 3, 4, 6, and 8 double words.]
• 4X+ reduction in transactions
• 25%+ reduction in bytes transferred

Slide 56

Vector Add via Traveling Threads
[Diagram: the X, Y, and Z vectors are spread across memory macros.
• Type 1 threads stride across the X partitions and spawn Type 2 threads
• Each Type 2 thread accumulates Q X's in its payload
• Each Type 3 thread fetches the Q matching Y's, adds them to the X's, saves the sums in its payload, and stores them into the Q Z's
• Each thread strides through Q elements at a time
Transaction reduction factor: 1.66X (Q=1), 10X (Q=6), up to 50X (Q=30).]
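One transaction counting that reproduces those factors (an assumption on my part; the slide does not spell it out): roughly 5 transactions per element on the classical path (load request and reply for X and for Y, plus the store of Z) versus roughly 3 parcels per group of Q elements on the traveling-thread path (one per thread type), giving 5Q/3.

    #include <stdio.h>

    int main(void) {
        int q_values[] = {1, 6, 30};
        for (int i = 0; i < 3; i++) {
            int q = q_values[i];
            double classical = 5.0 * q;  /* ~5 transactions per element (assumed)    */
            double traveling = 3.0;      /* ~3 parcels per Q-element group (assumed) */
            printf("Q=%2d: reduction ~ %.2fX\n", q, classical / traveling);
        }
        return 0;   /* ~1.67X, 10.00X, 50.00X, in line with the slide */
    }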

Slide 57

Trace-Based Thread Extraction & Simulation
• Applied to large-scale Sandia applications over summer 2003
• P: the desired number of concurrent threads
[Figure: the analysis flow, from basic application data, through detailed thread characteristics, to overall concurrency.]

Slide 58

Summary


Slide 59

Summary
• When it comes to silicon: it's the memory, stupid!
• State bloat consumes huge amounts of silicon
  – That does no useful work!
  – And all due to the focus on "named" processing logic
• With today's architectures, we cannot support the needed bandwidth between processors & memory
• PIM (close logic & memory) attacks all of these problems
• But it's still not enough for Zetta
• But the ideas may migrate!

Slide 60

A Plea to Architects and Language/Compiler Developers!
Relentlessly attack state bloat by reconsidering the underlying execution model, starting with multi-threading of mobile, light weight states, as enabled by PIM technology.

Slide 61

The Future
[Two contrasting images: "Will we design like this? Or this?" Regardless of technology!]

Slide 62

PIMs Now In Mass Production
• 3D multi-chip module
• The ultimate in embedded logic
• Off-shore production
• Available in 2 device types
• Biscuit-based substrate
• Amorphous doping for a single-flavor device type
• Single-layer interconnect doubles as passivation

Slide 63