Interconnect-Centric Computing William J. Dally Computer Systems Laboratory Stanford University HPCA Keynote February 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study: the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
INs: Connect Processors in Clusters
IBM Blue Gene
and on chip
MIT RAW
Connect Processors to Memories in Systems
Cray Black Widow
and on chip
Texas TRIPS
provide the fabric for network Switches and Routers
Avici TSR
and connect I/O Devices
Brocade Switch
Group History: Routing Chips & Interconnection Networks
• Mars Router, Torus Routing Chip, Network Design Frame, Reliable Router
• Basis for Intel, Cray/SGI, Mercury, Avici network chips
MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)
Group History: Parallel Computer Systems
• J-Machine (MDP) led to Cray T3D/T3E
• M-Machine (MAP) – fast messaging, scalable processing nodes, scalable memory architecture
• Imagine – basis for SPI
MDP Chip
J-Machine
Cray T3D
MAP Chip
Imagine Chip
Interconnection Networks are THE Central Component of Modern Computer Systems
• Processors are a commodity
  – Performance no longer scaling (ILP mined out)
  – Future growth is through CMPs - connected by INs
• Memory is a commodity
  – Memory system performance determined by interconnect
• I/O systems are largely interconnect
• Embedded systems built using SoCs
  – Standard components
  – Connected by on-chip INs (OCINs)
Technology Trends…
[Figure: bandwidth per router node (Gb/s), 0.1–10,000 on a log scale, vs. year, 1985–2010. Points include the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, and the BlackWidow YARC.]
High-Radix Router
[Figure: low-radix router (small number of fat ports) vs. high-radix router (large number of skinny ports).]
Low-Radix vs. High-Radix Router
[Figure: a 16-input, 16-output network (I0–I15 to O0–O15) built two ways.]

           Low-Radix      High-Radix
Latency:   4 hops         2 hops
Cost:      96 channels    32 channels
Latency
  = H tr + L/b
  = 2 tr logk(N) + 2kL/B
where k = radix, B = total router bandwidth, N = # of nodes, L = message size
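The latency model above can be sketched numerically. A minimal sketch, assuming H = 2 logk(N) hops and per-port bandwidth b = B/(2k) as in the formula; the parameter values below (router delay, bandwidth, message size) are illustrative assumptions, not measured numbers.

```python
from math import log

def network_latency(k, N, B, t_r, L):
    """Latency model from the slide: T = 2*t_r*log_k(N) + 2*k*L/B.

    k   : router radix (ports per router)
    N   : number of network nodes
    B   : total bandwidth per router (Gb/s)
    t_r : per-hop router delay (ns)
    L   : message length (bits)
    """
    header = 2 * t_r * log(N, k)    # H * t_r with H = 2*log_k(N) hops
    serialization = 2 * k * L / B   # L / b with per-port bandwidth b = B/(2k)
    return header + serialization

# Illustrative (assumed) numbers: 1024 nodes, 1 Tb/s routers, 10 ns hop
# delay, 1000-bit messages. Raising the radix cuts the header term but
# narrows each port, growing the serialization term.
print(network_latency(16, 1024, 1000, 10, 1000))  # header 50 ns + serialization 32 ns
```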
Latency vs. Radix
[Figure: latency (nsec) vs. radix (0–250) for 2003 and 2010 technology. Increasing radix decreases header latency but increases serialization latency; the optimum is radix ~40 in 2003 technology and ~128 in 2010 technology.]
Determining Optimal Radix
Latency = Header Latency + Serialization Latency
        = H tr + L/b
        = 2 tr logk(N) + 2kL/B
where k = radix, B = total router bandwidth, N = # of nodes, L = message size
Optimal radix: k log2(k) = (B tr log N) / L = Aspect Ratio
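The optimal-radix condition can be solved numerically. A minimal sketch using bisection, which applies because k·log2(k) is strictly increasing for k > 1:

```python
from math import log2

def optimal_radix(aspect_ratio):
    """Solve k * log2(k) = A for k by bisection.

    A = (B * t_r * log N) / L is the aspect ratio from the slide.
    """
    lo, hi = 2.0, 1e6
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * log2(mid) < aspect_ratio:
            lo = mid
        else:
            hi = mid
    return lo

# An aspect ratio of 896 = 128 * log2(128) corresponds to radix 128,
# roughly the 2010-technology optimum quoted above.
print(round(optimal_radix(896)))  # 128
```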
Higher Aspect Ratio, Higher Optimal Radix
[Figure: optimal radix (k), 1–1000 on a log scale, vs. aspect ratio, 10–10,000 on a log scale, with points for 1991, 1996, 2003, and 2010 technology.]
High-Radix Topology
• Use high radix, k, to get low hop count
  – H = logk(N)
• Provide good performance on both benign and adversarial traffic patterns
  – Rules out butterfly networks - no path diversity
  – Clos networks work well
    • H = 2 logk(N) - with short circuit
  – Cayley graphs have nice properties but are hard to route
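A quick check of the hop counts quoted above; taking the ceiling for non-integer depths is a modeling assumption of this sketch:

```python
from math import ceil, log

def butterfly_hops(k, N):
    """H = log_k(N): router hops in a radix-k butterfly reaching N nodes."""
    return ceil(log(N) / log(k))

def clos_hops(k, N):
    """H = 2*log_k(N): worst case for a folded Clos - up to a common
    ancestor and back down; "short-circuit" paths turn around earlier."""
    return 2 * butterfly_hops(k, N)

# Radix-64 routers spanning 32K nodes: 3 butterfly hops, 6 Clos hops worst case.
print(butterfly_hops(64, 32768), clos_hops(64, 32768))
```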
Example radix-64 Clos Network
[Figure: rank 2 switches Y32, Y33, …, Y63 above rank 1 switches Y0, Y1, …, Y31, connecting endpoints BW0, BW1, …, BW1023.]
Flattened Butterfly Topology
Packaging the Flattened Butterfly
Packaging the Flattened Butterfly (2)
Cost
Routing in High-Radix Networks
• Adaptive routing avoids transient load imbalance
• Global adaptive routing balances load for adversarial traffic
  – Cost/perf of a butterfly on benign traffic and at low loads
  – Cost/perf of a Clos on adversarial traffic
A Clos can statically load balance traffic using oblivious routing
[Figure: the radix-64 Clos network - rank 2 switches Y32…Y63, rank 1 switches Y0…Y31, endpoints BW0…BW1023.]
Transient Imbalance
With Adaptive Routing
Latency for UR traffic
Flattened Butterfly Topology
[Figure: eight nodes, numbered 0 through 7, connected in a flattened butterfly.]
Flattened Butterfly Topology
[Figure: the same eight-node network.]
What if node 0 sends all of its traffic to node 1?
Flattened Butterfly Topology
[Figure: the same eight-node network.]
What if node 0 sends all of its traffic to node 1? How much traffic should we route over alternate paths?
Simpler Case - Ring of 8 Nodes
Send traffic from node 2 to node 5.
• Model: treat the channels as a network of independent M/D/1 queues
• Total load λ = x1 + x2, where x1 takes the minimal path and x2 the non-minimal path
  – Min path delay = Dm(x1)
  – Non-min path delay = Dnm(x2)
• Routing remains minimal as long as Dm'(λ) ≤ Dnm'(0)
• Afterwards, route a fraction, x2, non-minimally such that Dm'(x1) = Dnm'(x2)
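The split can be computed directly. A sketch under stated assumptions: h independent M/D/1 queues of service rate μ give path delay D(x) = h(1/μ + x/(2μ(μ−x))), hence marginal delay D'(x) = h/(2(μ−x)²); the hop counts (3 minimal, 5 non-minimal for 2→5 on an 8-node ring) and unit service rate are assumptions of this sketch.

```python
def marginal_delay(x, hops, mu=1.0):
    """d/dx of path delay over `hops` independent M/D/1 queues:
    D(x) = hops*(1/mu + x/(2*mu*(mu-x)))  =>  D'(x) = hops/(2*(mu-x)**2)."""
    return hops / (2 * (mu - x) ** 2)

def nonminimal_fraction(lam, hops_min=3, hops_nonmin=5, mu=1.0):
    """Load x2 routed non-minimally so that Dm'(x1) = Dnm'(x2), x1 = lam - x2.
    Stays fully minimal (x2 = 0) while Dm'(lam) <= Dnm'(0)."""
    if marginal_delay(lam, hops_min, mu) <= marginal_delay(0.0, hops_nonmin, mu):
        return 0.0
    lo, hi = 0.0, lam
    for _ in range(60):  # bisection: Dm' falls and Dnm' rises as x2 grows
        x2 = (lo + hi) / 2
        if marginal_delay(lam - x2, hops_min, mu) > marginal_delay(x2, hops_nonmin, mu):
            lo = x2
        else:
            hi = x2
    return x2

print(nonminimal_fraction(0.2))  # below the threshold: stay minimal (0.0)
print(nonminimal_fraction(0.5))  # part of the load spills onto the long way round
```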
Traffic divides to balance delay - load balanced at saturation
[Figure: accepted throughput vs. offered load (fraction of capacity), both 0–0.6, showing model curves for overall, minimal, and non-minimal traffic.]
Channel-Queue Routing
• Estimate delay per hop by local queue length Qi
• Overall latency estimated by Li ~ Qi Hi
• Route each packet on the route with the lowest estimated Li
• Works extremely well in practice
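A minimal sketch of the selection rule; the route representation here (per-route hop count paired with the locally observed queue length) is an assumption of this sketch, not an actual router data structure:

```python
def choose_route(routes):
    """routes: list of (hop_count, local_queue_len) candidates for one packet.
    Channel-queue routing estimates latency as L_i ~ Q_i * H_i and picks
    the candidate with the lowest estimate."""
    return min(range(len(routes)), key=lambda i: routes[i][0] * routes[i][1])

# A 2-hop minimal route behind a 12-deep queue (estimate 24) loses to a
# 4-hop detour whose local queue holds only 3 packets (estimate 12).
print(choose_route([(2, 12), (4, 3)]))  # 1
```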
Performance on UR Traffic
Performance on WC Traffic
Allocator Design Matters
Putting it all together: The Cray BlackWidow Network
In collaboration with Steve Scott and Dennis Abts (Cray Inc.)
Cray Black Widow
• Shared-memory vector parallel computer
• Up to 32K nodes
• Vector processor per node
• Shared memory across nodes
Black Widow Topology
• Up to 32K nodes in a 3-level folded Clos
• Each node has four 18.75 Gb/s channels, one to each of 4 network slices
YARC: Yet Another Router Chip
• 64 ports
• Each port is 18.75 Gb/s (3 x 6.25 Gb/s links)
• Table-driven routing
• Fault tolerance
  – CRC with link-level retry
  – Graceful degradation of links: 3 bits -> 2 bits -> 1 bit -> OTS
YARC Microarchitecture
• Regular 8x8 array of tiles
  – Easy to lay out chip
• No global arbitration
  – All decisions local
• Simple routing
• Hierarchical organization
  – Input buffers
  – Row buffers
  – Column buffers
A Closer Look at a Tile
• No global arbitration
• Non-blocking with an 8x internal speedup in subswitch
• Simple routing
  – Small 8-entry routing table per tile
  – High routing throughput for small packets
YARC Implementation
• Implemented in a 90nm CMOS standard-cell ASIC technology
• 192 SerDes on the chip (64 ports x 3 bits per port)
• 6.25 Gbaud data rate
• Estimated power: 80 W (idle), 87 W (peak)
• 17mm x 17mm die
Much of the future is on-chip (CMP, SoC, Operand)
[Figure: chips projected for the 2006, 2007.5, 2009, 2010.5, 2012, 2013.5, and 2015 technology generations.]
On-Chip Networks are Fundamentally Different
• Different cost model
  – Wires plentiful, no pin constraints
  – Buffers expensive (consume die area)
  – Slow signal propagation
• Different usage patterns - particularly for SoCs
  – Significant isochronous traffic
  – Hard RT constraints
• Different design problems
  – Floorplans
  – Energy-efficient transmission circuits
NSF Workshop Identified 3 Critical Issues
• Power
  – OCINs will have 10x the required power with current approaches
  – Circuit and architecture innovations can close this gap
• Latency
  – OCIN latency currently not competitive with buses and dedicated wiring
  – Novel flow-control strategies required
• Tool integration
  – OCINs need to be integrated with standard tool flows to enable widespread use
The Road Ahead
• INs become an even more dominant system component
  – Number of processors goes up, cost of processors decreases
  – Communication dominates performance and cost
  – From hand-held media UI devices to huge data centers
• Technology drives topology in new directions
  – On-chip, short-reach electrical (10m), optical
  – Expect radix to continue to increase
  – Hybrid topologies to match each packaging level
• Latency will approach that of dedicated wiring
  – Better flow control and router architecture
  – Optimized circuits
• Adaptivity will optimize performance
  – Balance load, route around defects, tolerate variation, tune power to load
Summary
• Interconnection Networks (INs) are THE central component of modern computing systems
• High-radix topologies have evolved to exploit packaging/signaling technology
  – Including hybrid optical/electrical
  – Flattened butterfly
• Global adaptive routing balances load and enables advanced topologies
  – Eliminates transient load imbalance
  – Uses local queues to estimate global congestion
• Cray Black Widow - an example high-radix network
• On-chip INs
  – Very different constraints
  – Three "gaps" identified: power, latency, tools
• The road ahead
  – Lots of room for improvement; INs are in their infancy
Some very good books
Backup
Virtual Channel Router Architecture
[Figure: k input ports (Input 1 … Input k), each with v virtual-channel buffers (VC 1 … VC v), feeding a crossbar switch to k output ports (Output 1 … Output k); control comprises routing computation, a VC allocator, and a switch allocator.]
Baseline Performance Evaluation
[Figure: latency (cycles), 0–50, vs. offered load, 0–1, for the low-radix network.]
Baseline Performance Evaluation
[Figure: latency (cycles), 0–50, vs. offered load, 0–1, for the low-radix network and the high-radix baseline; annotation: "Low radix better".]