Advanced Architectures & Execution Models: How New Architectures May Help Give Silicon Some Temporary New Life and Pave the Way for New Technologies
Peter M. Kogge
McCourtney Prof. of CS & Engr, Concurrent Prof. of EE
Assoc. Dean for Research, University of Notre Dame
IBM Fellow (ret.)
Workshop on Extreme Computing, LASCI, Oct. 12, 2004
Slide 1
Why Is Today's Supercomputing Hard In Silicon: Little's Tyranny
• ILP: getting tougher & tougher to increase
  – Must extract from program
  – Must support in H/W
• Little's Law: Throughput = Concurrency / Latency
• Latency: getting worse fast!!!! (The Memory Wall)
• Throughput: much less than peak and degrading rapidly
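As a reminder of the arithmetic behind this, a tiny sketch applying Little's Law; the latency and throughput numbers are illustrative assumptions, not figures from the talk:

```c
/* Illustrative only: Little's Law applied to memory traffic.
 * Required concurrency = target throughput x latency. */
#include <stdio.h>

int main(void) {
    double latency_ns    = 100.0;   /* assumed round-trip memory latency   */
    double target_ops_ns = 4.0;     /* assumed target: 4 memory ops per ns */

    /* Outstanding (concurrent) operations needed to sustain that rate */
    double concurrency = target_ops_ns * latency_ns;

    printf("Need ~%.0f operations in flight to hide %.0f ns of latency\n",
           concurrency, latency_ns);
    return 0;
}
```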
Slide 2
Why Is Zettaflops Even Harder?
• Silicon density: sheer space taken up implies large distances & loooooong latencies
• Silicon mindset:
  – Processing logic "over here"
  – Memory "over there"
  – And we add acres of high-heat-producing stuff to bridge the gap
• Thesis: how far can we go with a mindset change?
Slide 3
This Talk: Climbing the Wall a "Different Way"
• Enabling concepts implementable in silicon:
  – Processing In Memory
    • Lowering the wall, both bandwidth & latency
  – Relentless multi-threading with light weight threads
    • To change the number of times we must climb it
    • To reduce the state we need to keep behind
• Finding architectures & execution models that support both
• With emphasis on "Highly Scalable" systems
Slide 4
"Processing-In-Memory"
• High density memory on the same chip with high speed logic
• Very fast access from logic to memory
• Very high bandwidth
• ISA/microarchitecture designed to utilize the high bandwidth
• Tile with "memory+logic" nodes
[Diagram: a memory/logic node. The memory array (one "full word" per row, one column per full-word bit) feeds row decode logic, sense amplifiers/latches, and column multiplexing into a "wide word" interface. The processing logic holds a wide register file, wide ALUs, a permutation network, thread state package, global address translation, parcel decode and assembly, a performance monitor, and a broadcast bus; stand-alone memory units may attach. Such nodes tile a chip and connect via an interconnect carrying incoming and outgoing parcels.]
Parcel = Object Address + Method_name + Parameters
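The parcel format above (object address + method name + parameters) maps naturally onto a small record; a minimal sketch, where the field names and sizes are assumptions for illustration only:

```c
/* Sketch of a parcel as described on this slide:
 * destination object address + method name + parameters. */
#include <stdint.h>

#define PARCEL_MAX_PARAMS 8

typedef struct {
    uint64_t object_addr;               /* global address of the target object */
    uint16_t method_id;                 /* which method to run at that object  */
    uint16_t n_params;                  /* how many parameter words follow     */
    uint64_t params[PARCEL_MAX_PARAMS]; /* arguments carried with the parcel   */
} parcel_t;
```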
Slide 5
A Short History of PIM @ ND
Our IBM origins: an EXECUBE-style node (16b CPU + 64 KB SRAM) and an ASAP CPU core with DRAM, 1394, PCI, and memory interfaces.
• EXECUBE: 1st DRAM MIMD PIM; 8-way hypercube
• RTAIS: the first ASAP
• EXESPHERE: FPGA ASAP prototype; early multi-threading
• PIM Fast: 9 PIMs + FPGA interconnect; place PIM in PC memory space
• PIM for spacecraft: parcels & scalability
• HTMT: world's first trans-petaflop design; PIM-enhanced memory system
• DIVA: multi-PIM memory module; model potential PIM programs
• PIM Macros: fabbed in advanced technologies; explore key layout issues
• PIM Lite: 1st mobile, multithreaded PIM; demo of all lessons learned
• Cascade (a Cray Inc. HPCS partner): 2-level PIM for petaflop; ultra-scalability
Coming soon to a computer near you: SRAM PIM and DRAM PIM parts alongside the CPU and cache on a conventional motherboard.
Slide 6
Acknowledgements
• My personal work with PIM dates to the late 1980s
• But also: the ND/CalTech/Cray collaboration is now a decade old!
[Photo: Architecture Working Group, 1st Workshop on Petaflops Computing, Pasadena, CA, Feb. 22-24, 1994]
Slide 7
Topics
• How We Spend Today's Silicon
• The Silicon Roadmap – A Different Way
• PIM as an Alternate Technology
• PIM-Enabled Architectures
• Matching ISA & Execution Models
• Some Examples
Slide 8
How We Spend Our Silicon Today: Or “The State of State”
Slide 9
How Are We Using Our Silicon? Compare CPU to a DP FPU
[Chart: die area (sq. mm), 1970-2005, for CPU die area vs. an equivalent DP FPU. The curves cross over, and the ratio grows to 66 to 1: is this state?]
Slide 10
CPU State vs Time
[Chart: estimated CPU state (K bits) vs. time, 1970-2000, growing at roughly a 1.5X compound rate per year. Series: total state, machine, supervisor, user, transient, latency enhancing, access enhancing.]
Slide 11
So We Expect State & Transistor Count to be Related
[Chart: # of transistors (x1000) vs. total state (x1000), log-log, 1970s through 2000s. And they are!]
Slide 12
The Silicon Roadmap: Or “How are we Using an Average Square of Silicon?”
Slide 13
The Perfect "Knee Curves"
No "overhead" of any kind. "Knee": 50% logic & 50% memory.
[Chart: GF/sq.cm vs. GB/sq.cm for 2003 and 2018 technology; each curve bends at the knee where half the silicon is logic and half is memory.]
Slide 14
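One way to see where such a knee curve comes from is to sweep the logic/memory split of a square centimeter under assumed densities; the constants below are placeholders, not the roadmap data behind the 2003/2018 curves:

```c
/* Sketch of how a "knee curve" arises: split a square cm of silicon
 * between logic and memory and report GF/sq.cm vs GB/sq.cm. */
#include <stdio.h>

int main(void) {
    double gf_per_cm2_logic = 50.0;   /* assumed: all-logic peak, GF/sq.cm     */
    double gb_per_cm2_mem   = 0.25;   /* assumed: all-memory density, GB/sq.cm */

    for (int i = 1; i <= 19; i++) {
        double f  = i * 0.05;                    /* fraction of area that is logic */
        double gf = f * gf_per_cm2_logic;        /* flops from the logic area      */
        double gb = (1.0 - f) * gb_per_cm2_mem;  /* bytes from the memory area     */
        printf("logic=%.2f  GF/cm^2=%6.2f  GB/cm^2=%5.3f  GB/GF=%.4f\n",
               f, gf, gb, gb / gf);
    }
    /* The "knee" cited on the slide is the 50/50 split (f = 0.5). */
    return 0;
}
```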
Adding In "Lines of Constant Performance"
[Chart: the same GF/sq.cm vs. GB/sq.cm knee curves with overlaid lines of constant bytes-per-flop ratio at 0.001, 0.5, and 1 GB/GF.]
Slide 15
What if We Include Basic Overhead?
[Chart: GF/sq.cm vs. GB/sq.cm, with the perfect 2003 and perfect 2018 knee curves, lines of constant 0.001, 0.01, 0.5, and 1 GB/GF, and the degraded curves once basic overhead is included.]
Slide 16
What If We Look At Today's Separate Chip Systems?
[Chart: as before, now adding the "minimum overhead" 2003 and 2018 curves for systems built from today's separate processor and memory chips, against the perfect 2003/2018 knees and the 0.001, 0.01, 0.5, and 1 GB/GF lines.]
Slide 17
How Many of Today's Chips Make Up a "Zetta"?
[Chart: chip count, 1E+09 to 1E+13, vs. year 2002-2020, for a zettabyte (1 ZB) of memory chips and a zettaflop (1 ZF) of processor chips.]
And this does not include "routing"!
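The curves are essentially division; a back-of-envelope sketch with placeholder per-chip numbers (not the roadmap values used for the plot):

```c
/* Divide the zetta targets by assumed per-chip capabilities. */
#include <stdio.h>

int main(void) {
    double target_flops = 1.0e21;    /* 1 ZF */
    double target_bytes = 1.0e21;    /* 1 ZB */

    double flops_per_chip = 10.0e9;  /* assumed: 10 GF per processor chip */
    double bytes_per_chip = 1.0e9;   /* assumed: 1 GB per memory chip     */

    printf("Chips for 1 ZF: %.2e\n", target_flops / flops_per_chip);
    printf("Chips for 1 ZB: %.2e\n", target_bytes / bytes_per_chip);
    return 0;
}
```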
Slide 18
How Big is a 1 ZB System?
[Chart: area of a 1 ZB system (sq. m), 2003-2018, for 0.001DC, 0.01DC, 0.1DC, 0.5DC, and 1DC, with reference lines for a football field, one square mile, and Manhattan Island.]
Slide 19
How Does Chip I/O Bandwidth Relate to Performance?
[Chart: ratio of GBps/die to GFps/die (log scale, 10 to 1000), 2000-2020.]
Slide 20
Problems
• Complexity & area are infeasible
• Flop numbers assume perfect utilization
• But latencies are huge (see the light-speed sketch below)
  – Diameter of Manhattan = 28 microseconds
• And efficiencies will plummet
  – At 0.1% efficiency we need the area of Rhode Island just for microprocessor chips
    • Whose diameter is 240 microseconds
• And we don't have enough pins!
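For reference, the latency bullets rest on a simple speed-of-light bound; a minimal sketch with a placeholder system diameter, not the slide's Manhattan or Rhode Island figures:

```c
/* One-way signal time can never beat diameter / c. */
#include <stdio.h>

int main(void) {
    double diameter_km = 10.0;   /* assumed physical span of the machine    */
    double c_km_per_us = 0.3;    /* light travels ~0.3 km per microsecond   */

    printf("One-way light-speed latency across %.0f km: >= %.1f us\n",
           diameter_km, diameter_km / c_km_per_us);
    return 0;
}
```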
Slide 21
PIM as an Alternative Technology
Slide 22
PIM Objective
• Move processing logic onto the dense DRAM die
• Obtain decreased latency & increased bandwidth for local references
  – Without needing pins
• AND simplify the logic down to a simple core
• Thus allowing many more "processors"
• With off-chip pins used only for true remote references
Slide 23
Classical DRAM
• Memory mats: ~1 Mbit each
• Row decoders
• Primary sense amps
• Secondary sense amps & "page" multiplexing
• Timing, BIST, interface
• Kerf
45% of the die is non-storage.
[Charts: density (Mbits/sq.mm, log scale) vs. year 1970-2020 for chip and cell, historical plus SIA production and introduction projections; and % chip overhead vs. year 1970-2020, historical plus SIA production and introduction.]
Slide 24
Embedded DRAM Macros Today (Almost)
• Memory block: 512 x 2048 = 1 Mbit (MAT + row decode + primary sense amps)
• Some maximum # of memory blocks per macro
• Base block: 2nd sense, mux, BIST, timing
• Address in; "wide word" data out: 100's of bits
[Chart: cycle time estimates from IBM for embedded RAM, 1998-2005, for SRAM, DRAM, and estimated SRAM.]
Slide 25
PIM Chip MicroArchitectural Spectrum
[Diagram: five points on the spectrum of PIM chip microarchitectures.]
• SIMD: Linden DAAM (memory blocks interleaved with logic, eight times across a DRAM, with word drivers & row decoder)
• Single chip computer: Mitsubishi M32R/D (CPU, caches, multiplier, and memory I/F beside a 16 Mbit DRAM macro)
• Complete SMP node, a proposed SUN part (simple MicroSPARC II cores, 16 Mbit DRAM macros x16, 512B lines, a 512B victim cache, memory coherence controller, serial interconnect)
• Chip-level SMP: POWER4 (CPUs with L1s sharing an L2)
• Tiled & scalable: BLUE GENE, EXECUBE
Slide 26
The PIM "Bandwidth Bump"
[Chart: bandwidth (GB/s, 1-1000) vs. reachable memory (bytes, 1E+00-1E+09) for an UltraSPARC III CPU chip (register file, L1, L2, off-chip memory), a single PIM node (simple 3-port register file, local chip memory), and a 32-node PIM chip.]
• Between 1 B and 1 GB, area under the curve: 1 PIM node = 4.3x UltraSPARC III; 1 PIM chip = 137x UltraSPARC III
• Small reachable memory: region of classical, temporally intensive performance advantage
• Large reachable memory: region of PIM, spatially intensive performance advantage (1 node)
Slide 27
PIM-Based Architectures: System & Chip Level
Slide 28
PIM System Design Space: Historical Evolution
• Variant One: Accelerator (historical)
• Variant Two: Smart Memory
  – Attach to an existing SMP (using an existing memory bus interface)
  – PIM-enhanced memories, accessible as plain memory if you wish
  – Value: enhancing performance of the status quo
• Variant Three: Heterogeneous Collaborative
  – PIMs become "independent" & communicate as peers
  – Non-PIM nodes "see" PIMs as equals
  – Value: enhanced concurrency and generality over variant two
• Variant Four: Uniform Fabric ("All PIM")
  – PIM "fabric" with fully distributed control and emergent behavior
  – Extra system I/O connectivity required
  – Value: simplicity and economy over variant three
• Option for any of the above: Extended Storage
  – Each PIM supports separate dumb memory chips
Slide 29
TERASYS SIMD PIM (circa 1993)
• Memory part for the CRAY-3
• "Looked like" SRAM memory, with an extra command port
• 128K SRAM bits (2K x 64)
• 64 one-bit ALUs
• SIMD ISA
• Fabbed by National
• Also built into a workstation with 64K processors
• 5-48X a Y-MP on 9 NSA benchmarks
Slide 30
EXECUBE: An Early MIMD PIM (1st Silicon 1993)
• First DRAM-based multiprocessor on a chip
• Designed from the onset for "glueless" one-part-type scalability
• On-chip bandwidth: 6.2 GB/s; utilization in some modes > 4 GB/s
• 8 compute nodes (CPU + cache + memory) on ONE chip
• 3D binary hypercube, SIMD/MIMD on a chip
• "High bandwidth" features included in the ISA
Slide 31
RTAIS: The First ASAP (circa 1993)
• Application: "Linda in Memory"
• Designed from the onset to perform wide ops "at the sense amps"
• More than SIMD: a flexible mix of VLIW
• "Object oriented," multi-threaded memory interface
• Result: 1 card 60X faster than a state-of-the-art R3000 card
[Diagram: a controller and shared memory on a memory bus feeding 32 RAM banks (RAM-0 through RAM-31), each paired with an 8-bit ALU and linked by an inter-ALU exchange.]
Slide 32
Mitsubishi M32R/D
• 32-bit fixed point CPU + 2 MB DRAM
• "Memory-like" interface: 24-bit address bus, 16-bit data bus, plus two 1-bit I/Os
• Utilizes the wide-word I/F from the DRAM macros for cache lines
[Diagram: four DRAM banks surrounding the CPU with caches, multiplier, and memory I/F.]
Slide 33
DIVA: Smart DIMMs for Irregular Data Structures
• The host µP (with cache, TLB, address map) on a conventional motherboard issues parcels
  – Generalized "loads & stores"
  – Treat memory as an active, object-oriented store
• Each smart DIMM node: 1 CPU + 2 MB, multi-threaded, MIPS + "wide word," with its own address map, local program CPU, and memory stacks; it may generate parcels
• DIVA functions:
  – Prefix operators
  – Dereferencing & pointer chasing
  – Compiled methods
Slide 34
Micron Yukon
• 0.15 µm eDRAM / 0.18 µm logic process
• 128 Mbits DRAM (16 MBytes embedded DRAM)
  – 2048 data bits per access
• 256 8-bit integer processing elements
  – Configurable in multiple topologies
• On-chip programmable controller
• Operates like an SDRAM
[Diagram: a remote host talks through an SDRAM-like interface (HMI) and FIFOs to a task dispatch unit, synchronisation logic, register files, the 256 PEs with an M16 PE sequencer, and a DRAM control unit in front of the embedded DRAM.]
Slide 35
Berkeley VIRAM
• System architecture: single chip media processing
• ISA: MIPS core + vectors + DSP ops; 4 "vector lanes"
• 13 MB DRAM in 8 banks
• Includes floating point
• 2 Watts @ 200 MHz, 1.6 GFlops
Slide 36
The HTMT Architecture & PIM Functions
• I/O farm: compress/decompress; ECC/redundancy
• 3D (holographic) memory: compress/decompress; spectral transforms
• Optical switch: compress/decompress; routing
• DRAM PIM: data structure initializations; "in the memory" operations
• SRAM PIM: RSFQ thread management; context percolation; scatter/gather indexing; pointer chasing; push/pull closures; synchronization activities
• RSFQ nodes (100 GHz CPUs)
New technologies:
• Rapid Single Flux Quantum (RSFQ) devices for 100 GHz CPU nodes
• WDM all-optical network for petabit/sec bisection bandwidth
• Holographic 3D crystals for petabytes of on-line RAM
• PIM for active memories to manage latency
PIMs in charge!
Slide 37
PIM Lite
• "Looks like memory" at its interfaces
• ISA: 16-bit multithreaded/SIMD
  – "Thread" = IP/FP pair
  – "Registers" = wide words in frames
• Designed for multiple nodes per chip
• 1 node's logic area ~ 10.3 KB of SRAM (comparable to a MIPS R3000)
• TSMC 0.18u; 1-node first-pass success; 3.2 million transistors (4-node)
[Die diagram, 4.5 mm x 2.9 mm: thread/instruction queue memory (4 KBytes), frame memory (1 K), ALU & permute net, data memory (2 KBytes), write-back logic, and a memory interconnect network; parcels come in and go out via the chip data bus.]
Slide 38
Next: An "All-PIM" Supercomputer
[Diagram: PIM chips grouped onto "PIM DIMMs," PIM DIMMs grouped into "PIM clusters," and PIM clusters (plus a "host" cluster and an I/O cluster) tied together by an interconnection network.]
Slide 39
What Might an "All-PIM" Zetta Target Look Like?
[Chart: area of a 1 ZB PIM system (sq. m), 2003-2018, for no logic and for 1, 2, 10, 100, and 1000 ZF of peak, against the area of Manhattan Island.]
I can add 1000X peak performance at 3X area!
Slide 40
Matching ISA & Execution Models
Slide 41
Guiding Principles: Memory-Centric Processor Architecture
• Focus on memory, not logic
  – Memory holds thread frames, instructions, and global data
  – Make "CPUs" anonymous
• Maximize performance, keeping cost in check
  – Make the processor cost like a memory, not vice-versa!
• "How low can you go"
  – Minimize storage for machine state (registers, buffers, etc.)
  – Don't overprovision functional units
• "Wider is better"
  – Swap thread state in a single bus cycle
  – Wide-word, SIMD data operations
Slide 42
Working Definitions
Light Weight State: refers to a computational thread
• Separated from the rest of system state
• Reduced in size to ~ a cache line
Serious Multi-threading:
• Very cheap thread creation and state access
• Huge numbers of concurrent threads
• Support for cheap, low-level synchronization
• Permits opportunities for significant latency hiding
• ISA-level knowledge of "locality"
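As a rough illustration of "reduced in size to ~ a cache line," here is a minimal sketch of such a light-weight thread state, borrowing the IP/FP-pair view from the PIM Lite slide; the extra fields and sizes are assumptions:

```c
/* Sketch of a light-weight thread state: the whole thing fits in far
 * less than a cache line, so it can be swapped or shipped cheaply. */
#include <stdint.h>

typedef struct {
    uint64_t ip;      /* instruction pointer for the thread                 */
    uint64_t fp;      /* frame pointer: wide word holding its "registers"   */
    uint64_t status;  /* assumed: ready/blocked bits plus locality hints    */
    uint64_t link;    /* assumed: next thread in the local ready pool       */
} lwt_state_t;        /* 32 bytes: well under a typical cache line          */
```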
Slide 43
How to Use Light Weight Threads
• Approaches to solving Little's Law problems:
  – Reduce the # of latency-causing events
  – Reduce the total number of bit/chip crossings per operation
  – Or reduce the effective latency
• Solutions:
  – Replace 2-way latency by a 1-way "command"
  – Let thread processing occur "at the memory"
  – Increase the number of memory access points for more bandwidth
• Light Weight Thread: minimal state needed to initiate limited processing in memory
Slide 44
Parcels: The Architectural Glue
• Parcel = Parallel Communication Element
• Basic unit of communication between nodes
  – At the same level as a "dumb memory reference"
• Contents: an extension of an "Active Message"
  – Destination: an object in application space
  – Method: the function to perform on that object
  – Parameters: values to use in the computation
Parcels are threads in transit!!!
Slide 45
Types of Parcels
• Memory access: "dumb" read/write to memory
• AMO: a simple prefix op applied to a memory location
• Remote thread invocation: start a new thread at the node holding a designated address
  – Simple case: booting the original node run-time
  – More interesting: "slices" of program code
• Remote object method invocation: invoke a method against an object at the object's home
• Traveling thread = continuation: move the entire execution state to the next datum
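To make the taxonomy concrete, here is an illustrative (not authoritative) dispatch loop a node might run over these parcel kinds; the type names, struct layout, and handlers are assumptions, and only the categories come from the slide:

```c
/* Sketch of per-node parcel handling at the node that owns the target address. */
#include <stdint.h>

typedef enum { P_READ, P_WRITE, P_AMO_ADD, P_THREAD, P_METHOD, P_CONTINUATION } parcel_kind;

typedef struct {
    parcel_kind kind;
    uint64_t    addr;     /* target location or object in local memory */
    uint64_t    operand;  /* datum, AMO operand, or argument           */
    void      (*code)(uint64_t addr, uint64_t arg);  /* thread/method body */
} parcel_t;

static uint64_t *local_word(uint64_t addr) { return (uint64_t *)(uintptr_t)addr; }

void handle_parcel(const parcel_t *p) {
    switch (p->kind) {
    case P_READ:         /* would emit a reply parcel; omitted in this sketch */
        break;
    case P_WRITE:        *local_word(p->addr) = p->operand;  break;
    case P_AMO_ADD:      *local_word(p->addr) += p->operand; break;  /* simple prefix op */
    case P_THREAD:       /* start a new thread at the owning node   */
    case P_METHOD:       /* invoke a method at the object's home    */
    case P_CONTINUATION: /* resume a traveling thread's state here  */
        p->code(p->addr, p->operand);
        break;
    }
}
```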
Slide 46
Software Design Space Evolution
• Hardware "fetch_and_op"
• Generic libraries only
• App-specific subprogram compilation
  – Explicit method invocation
  – Expanded storage management
• Explicit support for classical multi-threading
• Inter-thread communication
  – Message passing
  – Shared memory synchronization
  – Atomic multi-word transactions
• Pro-active data migration/percolation
• Expanded multi-threading
  – Extraction of very short threads, new AMOs
  – Thread migration
• OS, run-time, I/O management "in the memory"
Slide 47
The HTMT Percolation Execution Model
[Diagram: data structures in DRAM are gathered into "contexts" in SRAM and results are scattered back; contexts percolate up as working data in CRAM next to the SPELL processors (via VORTEX and CNET), and results flow back down. All of these arrows are parcels.]
Slide 48
More Exotic Parcel Functions: The "Background" Ones
• Garbage collection: reclaim memory
• Load balancing: check for over/under load & suggest migration of activities
• Introspection: look for livelock, deadlock, faults, or pending failures
• Key attributes of all of these:
  – Independent of application programs
  – Inherent memory orientation
  – Significant possibility of mobility
Slide 49
Examples:
• In-Memory Multi-Threading
• Traveling Threads
• Very Light Weight Thread Extraction
Slide 50
N-Body Simulation
• Simulate the motion of N bodies under mutual attractive/repulsive forces: O(N^2)
• Barnes-Hut method
  – Clusters of bodies approximated by a single body when dense and far away
  – Subdivide the region into cells, represented using a quad/octree
• Highly parallel: 90+ percent of the workload can be parallelized

  initialize
  for each body: calculate net force
  synchronize
  update positions
  for each body: calculate net force
  ...
Slide 51
Heavyweight N-Body
• Each node in the cluster has a partition of the tree in memory, distributed spatially
• Needs a "Locally Essential Tree" (L.E.T.) in each cache
• Network traffic on every cache miss
• Sterling, Salmon, et al.: Gordon Bell Prize, 1997 (performance & cost-performance)
Slide 52
Multithreaded In-PIM N-Body
• One thread per body, accumulating its net force, traversing the tree in parallel, with in-memory processing
• Very low individual thread state (net force, next pointer, body coordinates, body mass)
• Network traffic only when a thread leaves a partition, so lower traffic overall; replicate the top of the tree to reduce the bottleneck
• 16 64-MB PIM chips, each with multiple nodes, appearing as memory to a host running at 2X the PIM clock rate
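A minimal sketch of what such a per-body thread might carry and do, assuming a conventional Barnes-Hut octree cell layout and opening test; only the listed thread contents and the travel-on-partition-crossing idea come from the slides:

```c
#include <math.h>
#include <stddef.h>

/* Octree cell: a leaf body or an internal cell summarized by total mass
 * and center of mass. Layout is an assumption for illustration. */
typedef struct cell {
    double mass, x, y, z;     /* total mass and center of mass */
    double size;              /* cell edge length              */
    struct cell *child[8];    /* all NULL children => leaf     */
} cell_t;

/* The whole state a per-body thread carries, per the bullet above:
 * accumulated force, body position/mass, and where to go next. */
typedef struct {
    double fx, fy, fz;
    double bx, by, bz, bmass;
    const cell_t *next;
} body_thread_t;

/* Classic Barnes-Hut opening test and force accumulation. In the PIM
 * version this runs on whichever node owns `c`; when the walk crosses a
 * partition boundary, the body_thread_t itself moves as a parcel.
 * (Self-interaction checks are omitted in this sketch.) */
static void accumulate(body_thread_t *t, const cell_t *c, double theta) {
    double dx = c->x - t->bx, dy = c->y - t->by, dz = c->z - t->bz;
    double r  = sqrt(dx*dx + dy*dy + dz*dz) + 1e-12;
    int is_leaf = (c->child[0] == NULL);
    if (is_leaf || c->size / r < theta) {
        double f = c->mass * t->bmass / (r * r * r);  /* G folded into units */
        t->fx += f * dx;  t->fy += f * dy;  t->fz += f * dz;
    } else {
        for (int i = 0; i < 8; i++)
            if (c->child[i]) accumulate(t, c->child[i], theta);
    }
}
```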
Slide 53
N-Body on PIM: Cost vs. Performance
[Chart: relative increase in speedup and cost vs. nodes per PIM chip (1-5), for 1-FPU and 4-way SIMD configurations.]
• Cost basis: the prior system
• 15% serial on the host; 85% highly threadable; 40% of that 85% is short-vector, 4-way SIMD
Conclusions:
• MT & latency reduction buy the most
• Short FP vectors may not be worth it
Slide 54
Example: Traveling Thread Vector Gather
• Given Base, Stride, Count, read a strided vector into a compact vector
• Classical CPU-centric approach:
  – Issue "waves" of multiple (ideally K) loads
  – If stride < block size, return a cache line
  – Else return a single double word
• LWT thread-based approach (sketched below):
  – The source issues K "gathering" threads, one to each PIM memory macro
  – Each thread reads local values into its payload
  – It dispatches the payload whenever it fills, and continues
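A sketch of one such gathering thread, under assumed primitives for ownership testing and payload dispatch (locally_owned and send_payload are hypothetical helpers, and Q is the payload size discussed on the next slide):

```c
/* One "gathering" thread: walk the strided elements that live in this
 * memory macro, fill a payload, ship it home whenever it is full. */
#include <stdint.h>
#include <stddef.h>

#define Q 8   /* payload size in double words, as in the Q=2..8 curves */

typedef struct {
    uint64_t dest;    /* where compacted results go */
    int      count;   /* valid entries in data[]    */
    double   data[Q];
} payload_t;

extern int  locally_owned(const double *addr);   /* assumed primitive */
extern void send_payload(const payload_t *p);    /* assumed primitive */

void gather_thread(const double *base, size_t stride_elems, size_t count,
                   uint64_t dest)
{
    payload_t p = { .dest = dest, .count = 0 };
    for (size_t i = 0; i < count; i++) {
        const double *elem = base + i * stride_elems;
        if (!locally_owned(elem))
            continue;                    /* another macro's thread handles it */
        p.data[p.count++] = *elem;
        if (p.count == Q) {              /* payload full: dispatch, keep going */
            send_payload(&p);
            p.count = 0;
        }
    }
    if (p.count > 0) send_payload(&p);   /* flush the final partial payload */
}
```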
Slide 55
Vector Gather via LWT: Traffic Estimates
[Charts: transactions and data transferred (bytes) vs. stride (bytes, 1-10000), comparing the classical approach with traveling threads carrying payloads of Q = 2, 3, 4, 6, and 8 double words.]
• Q = payload size in DW
• 4X+ reduction in transactions
• 25%+ reduction in bytes transferred
Slide 56
Vector Add via Traveling Threads
[Diagram: X, Y, and Z vectors spread across memory nodes.]
• Type 1 threads spawn type 2s
• Type 2 threads accumulate Q X's in their payload, striding through Q elements
• Type 3 threads fetch the Q matching Y's, add them to the X's, save the sums in the payload, and store them into Q Z's
• Transaction reduction factor: 1.66X (Q=1), 10X (Q=6), up to 50X (Q=30)
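A comment-level sketch of the three thread types for Z = X + Y, with a hypothetical travel_to primitive standing in for parcel-based thread migration; the division of work follows the slide, everything else is an assumption:

```c
#include <stddef.h>

#define Q 6                     /* payload size in elements (the Q=6 curve) */

typedef struct {
    size_t base_index;          /* first element this payload covers */
    double x[Q];                /* Q X values carried by the thread  */
} add_payload_t;

/* Assumed primitive: move this thread (and its payload) to the node that
 * owns the given address, where `fn` continues executing. */
extern void travel_to(const void *addr, void (*fn)(add_payload_t *),
                      add_payload_t *p);

extern double *X, *Y, *Z;       /* distributed vectors; blocks live on different nodes */

/* Type 3: runs where the matching Y (and Z) block lives. */
static void type3_add_store(add_payload_t *p) {
    for (int i = 0; i < Q; i++)
        Z[p->base_index + i] = p->x[i] + Y[p->base_index + i];
}

/* Type 2: runs where a block of X lives; load Q X's, then travel on. */
static void type2_gather_x(add_payload_t *p) {
    for (int i = 0; i < Q; i++)
        p->x[i] = X[p->base_index + i];
    travel_to(&Y[p->base_index], type3_add_store, p);
}

/* Type 1: spawn a type-2 thread for each Q-element chunk of the vector. */
void type1_spawn(size_t n, add_payload_t *pool) {
    for (size_t chunk = 0; chunk * Q < n; chunk++) {
        pool[chunk].base_index = chunk * Q;
        travel_to(&X[chunk * Q], type2_gather_x, &pool[chunk]);
    }
}
```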
Slide 57
Trace-Based Thread Extraction & Simulation
• Applied to large-scale Sandia applications over summer 2003
• P: desired # of concurrent threads
[Charts: the analysis flow, from basic application data, through detailed thread characteristics, to overall concurrency.]
Slide 58
Summary
Slide 59
Summary
• When it comes to silicon: it's the memory, stupid!
• State bloat consumes huge amounts of silicon
  – That does no useful work!
  – And it is all due to the focus on "named" processing logic
• With today's architectures, we cannot support the needed bandwidth between processors & memory
• PIM (placing logic close to memory) attacks all of these problems
• But it's still not enough for Zetta
• Yet the ideas may migrate!
Slide 60
A Plea to Architects and Language/Compiler Developers!
Relentlessly attack state bloat
by reconsidering the underlying execution model,
starting with multi-threading
of mobile, light weight states,
as enabled by PIM technology.
Slide 61
The Future
Will We Design Like This?
Or This?
Regardless of technology!
Slide 62
PIMs Now In Mass Production
• 3D multi-chip module
• Ultimate in embedded logic
• Off-shore production
• Available in 2 device types
• Biscuit-based substrate
• Amorphous doping for a single-flavor device type
• Single-layer interconnect doubles as passivation
Slide 63