Interconnect-Centric Computing William J. Dally Computer Systems Laboratory Stanford University HPCA Keynote February 12, 2007
Outline
• Interconnection Networks (INs) are THE central component of modern computer systems
• Topology driven to high-radix by packaging technology
• Global adaptive routing balances load - and enables efficient topologies
• Case study: the Cray Black Widow
• On-Chip Interconnection Networks (OCINs) face unique challenges
• The road ahead…
INs: Connect Processors in Clusters
IBM Blue Gene
and on chip
MIT RAW
Connect Processors to Memories in Systems
Cray Black Widow
and on chip
Texas TRIPS
provide the fabric for network Switches and Routers
Avici TSR
and connect I/O Devices
Brocade Switch
Group History: Routing Chips & Interconnection Networks
• Mars Router, Torus Routing Chip, Network Design Frame, Reliable Router
• Basis for Intel, Cray/SGI, Mercury, Avici network chips
MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), Reliable Router (1994)
Group History: Parallel Computer Systems
• J-Machine (MDP) led to Cray T3D/T3E
• M-Machine (MAP) – fast messaging, scalable processing nodes, scalable memory architecture
• Imagine – basis for SPI
MDP Chip
J-Machine
Cray T3D
MAP Chip
Imagine Chip
Interconnection Networks are THE Central Component of Modern Computer Systems
• Processors are a commodity
  – Performance no longer scaling (ILP mined out)
  – Future growth is through CMPs - connected by INs
• Memory is a commodity
  – Memory system performance determined by interconnect
• I/O systems are largely interconnect
• Embedded systems built using SoCs
  – Standard components
  – Connected by on-chip INs (OCINs)
Technology Trends…
[Figure: bandwidth per router node (Gb/s), 0.1–10,000 on a log scale, vs. year, 1985–2010. Points include the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, and the BlackWidow YARC.]
High-Radix Router
[Figure: low-radix router (small number of fat ports) vs. high-radix router (large number of skinny ports).]
Low-Radix vs. High-Radix Router
[Figure: a 16-input, 16-output network (I0–I15 to O0–O15) built two ways.]

           Low-Radix      High-Radix
Latency:   4 hops         2 hops
Cost:      96 channels    32 channels
Latency
  = H tr + L/b
  = 2 tr logk(N) + 2kL/B
where k = radix, B = total router bandwidth, N = # of nodes, L = message size
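The latency model above can be sketched numerically. A minimal sketch, assuming H = 2 logk(N) hops and per-port bandwidth b = B/(2k) as in the formula; the parameter values below (router delay, bandwidth, message size) are illustrative assumptions, not measured numbers.

```python
from math import log

def network_latency(k, N, B, t_r, L):
    """Latency model from the slide: T = 2*t_r*log_k(N) + 2*k*L/B.

    k   : router radix (ports per router)
    N   : number of network nodes
    B   : total bandwidth per router (Gb/s)
    t_r : per-hop router delay (ns)
    L   : message length (bits)
    """
    header = 2 * t_r * log(N, k)    # H * t_r with H = 2*log_k(N) hops
    serialization = 2 * k * L / B   # L / b with per-port bandwidth b = B/(2k)
    return header + serialization

# Illustrative (assumed) numbers: 1024 nodes, 1 Tb/s routers, 10 ns hop
# delay, 1000-bit messages. Raising the radix cuts the header term but
# narrows each port, growing the serialization term.
print(network_latency(16, 1024, 1000, 10, 1000))  # header 50 ns + serialization 32 ns
```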
Latency vs. Radix
[Figure: latency (nsec) vs. radix (0–250) for 2003 and 2010 technology. Increasing radix decreases header latency but increases serialization latency; the optimum is radix ~40 in 2003 technology and ~128 in 2010 technology.]
Determining Optimal Radix
Latency = Header Latency + Serialization Latency
        = H tr + L/b
        = 2 tr logk(N) + 2kL/B
where k = radix, B = total router bandwidth, N = # of nodes, L = message size
Optimal radix: k log2(k) = (B tr log N) / L = Aspect Ratio
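The optimal-radix condition can be solved numerically. A minimal sketch using bisection, which applies because k·log2(k) is strictly increasing for k > 1:

```python
from math import log2

def optimal_radix(aspect_ratio):
    """Solve k * log2(k) = A for k by bisection.

    A = (B * t_r * log N) / L is the aspect ratio from the slide.
    """
    lo, hi = 2.0, 1e6
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * log2(mid) < aspect_ratio:
            lo = mid
        else:
            hi = mid
    return lo

# An aspect ratio of 896 = 128 * log2(128) corresponds to radix 128,
# roughly the 2010-technology optimum quoted above.
print(round(optimal_radix(896)))  # 128
```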
Higher Aspect Ratio, Higher Optimal Radix
[Figure: optimal radix (k), 1–1000 on a log scale, vs. aspect ratio, 10–10,000 on a log scale, with points for 1991, 1996, 2003, and 2010 technology.]
High-Radix Topology
• Use high radix, k, to get low hop count
  – H = logk(N)
• Provide good performance on both benign and adversarial traffic patterns
  – Rules out butterfly networks - no path diversity
  – Clos networks work well
    • H = 2 logk(N) - with short circuit
  – Cayley graphs have nice properties but are hard to route
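A quick check of the hop counts quoted above; taking the ceiling for non-integer depths is a modeling assumption of this sketch:

```python
from math import ceil, log

def butterfly_hops(k, N):
    """H = log_k(N): router hops in a radix-k butterfly reaching N nodes."""
    return ceil(log(N) / log(k))

def clos_hops(k, N):
    """H = 2*log_k(N): worst case for a folded Clos - up to a common
    ancestor and back down; "short-circuit" paths turn around earlier."""
    return 2 * butterfly_hops(k, N)

# Radix-64 routers spanning 32K nodes: 3 butterfly hops, 6 Clos hops worst case.
print(butterfly_hops(64, 32768), clos_hops(64, 32768))
```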
Example radix-64 Clos Network
[Figure: rank 2 switches Y32, Y33, …, Y63 above rank 1 switches Y0, Y1, …, Y31, connecting endpoints BW0, BW1, …, BW1023.]
Flattened Butterfly Topology
Packaging the Flattened Butterfly
Packaging the Flattened Butterfly (2)
Cost
Routing in High-Radix Networks
• Adaptive routing avoids transient load imbalance
• Global adaptive routing balances load for adversarial traffic
  – Cost/perf of a butterfly on benign traffic and at low loads
  – Cost/perf of a Clos on adversarial traffic
A Clos can statically load balance traffic using oblivious routing
[Figure: the radix-64 Clos network - rank 2 switches Y32…Y63, rank 1 switches Y0…Y31, endpoints BW0…BW1023.]
Transient Imbalance
With Adaptive Routing
Latency for UR traffic
Flattened Butterfly Topology
[Figure: eight nodes, numbered 0 through 7, connected in a flattened butterfly.]
Flattened Butterfly Topology
[Figure: the same eight-node network.]
What if node 0 sends all of its traffic to node 1?
Flattened Butterfly Topology
[Figure: the same eight-node network.]
What if node 0 sends all of its traffic to node 1? How much traffic should we route over alternate paths?
Simpler Case - Ring of 8 Nodes
Send traffic from node 2 to node 5.
• Model: treat the channels as a network of independent M/D/1 queues
• Total load λ = x1 + x2, where x1 takes the minimal path and x2 the non-minimal path
  – Min path delay = Dm(x1)
  – Non-min path delay = Dnm(x2)
• Routing remains minimal as long as Dm'(λ) ≤ Dnm'(0)
• Afterwards, route a fraction, x2, non-minimally such that Dm'(x1) = Dnm'(x2)
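The split can be computed directly. A sketch under stated assumptions: h independent M/D/1 queues of service rate μ give path delay D(x) = h(1/μ + x/(2μ(μ−x))), hence marginal delay D'(x) = h/(2(μ−x)²); the hop counts (3 minimal, 5 non-minimal for 2→5 on an 8-node ring) and unit service rate are assumptions of this sketch.

```python
def marginal_delay(x, hops, mu=1.0):
    """d/dx of path delay over `hops` independent M/D/1 queues:
    D(x) = hops*(1/mu + x/(2*mu*(mu-x)))  =>  D'(x) = hops/(2*(mu-x)**2)."""
    return hops / (2 * (mu - x) ** 2)

def nonminimal_fraction(lam, hops_min=3, hops_nonmin=5, mu=1.0):
    """Load x2 routed non-minimally so that Dm'(x1) = Dnm'(x2), x1 = lam - x2.
    Stays fully minimal (x2 = 0) while Dm'(lam) <= Dnm'(0)."""
    if marginal_delay(lam, hops_min, mu) <= marginal_delay(0.0, hops_nonmin, mu):
        return 0.0
    lo, hi = 0.0, lam
    for _ in range(60):  # bisection: Dm' falls and Dnm' rises as x2 grows
        x2 = (lo + hi) / 2
        if marginal_delay(lam - x2, hops_min, mu) > marginal_delay(x2, hops_nonmin, mu):
            lo = x2
        else:
            hi = x2
    return x2

print(nonminimal_fraction(0.2))  # below the threshold: stay minimal (0.0)
print(nonminimal_fraction(0.5))  # part of the load spills onto the long way round
```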
Traffic divides to balance delay - load balanced at saturation
[Figure: accepted throughput vs. offered load (fraction of capacity), both 0–0.6, showing model curves for overall, minimal, and non-minimal traffic.]
Channel-Queue Routing
• Estimate delay per hop by local queue length Qi
• Overall latency estimated by Li ~ Qi Hi
• Route each packet on the route with the lowest estimated Li
• Works extremely well in practice
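A minimal sketch of the selection rule; the route representation here (per-route hop count paired with the locally observed queue length) is an assumption of this sketch, not an actual router data structure:

```python
def choose_route(routes):
    """routes: list of (hop_count, local_queue_len) candidates for one packet.
    Channel-queue routing estimates latency as L_i ~ Q_i * H_i and picks
    the candidate with the lowest estimate."""
    return min(range(len(routes)), key=lambda i: routes[i][0] * routes[i][1])

# A 2-hop minimal route behind a 12-deep queue (estimate 24) loses to a
# 4-hop detour whose local queue holds only 3 packets (estimate 12).
print(choose_route([(2, 12), (4, 3)]))  # 1
```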
Performance on UR Traffic
Performance on WC Traffic
Allocator Design Matters
Putting it all together: The Cray BlackWidow Network
In collaboration with Steve Scott and Dennis Abts (Cray Inc.)
Cray Black Widow
• Shared-memory vector parallel computer
• Up to 32K nodes
• Vector processor per node
• Shared memory across nodes
Black Widow Topology
• Up to 32K nodes in a 3-level folded Clos
• Each node has four 18.75 Gb/s channels, one to each of 4 network slices
YARC: Yet Another Router Chip
• 64 ports
• Each port is 18.75 Gb/s (3 x 6.25 Gb/s links)
• Table-driven routing
• Fault tolerance
  – CRC with link-level retry
  – Graceful degradation of links: 3 bits -> 2 bits -> 1 bit -> OTS
YARC Microarchitecture
• Regular 8x8 array of tiles
  – Easy to lay out chip
• No global arbitration
  – All decisions local
• Simple routing
• Hierarchical organization
  – Input buffers
  – Row buffers
  – Column buffers
A Closer Look at a Tile
• No global arbitration
• Non-blocking with an 8x internal speedup in subswitch
• Simple routing
  – Small 8-entry routing table per tile
  – High routing throughput for small packets
YARC Implementation
• Implemented in a 90nm CMOS standard-cell ASIC technology
• 192 SerDes on the chip (64 ports x 3 bits per port)
• 6.25 Gbaud data rate
• Estimated power: 80 W (idle), 87 W (peak)
• 17mm x 17mm die
Much of the future is on-chip (CMP, SoC, Operand)
[Figure: chips projected for the 2006, 2007.5, 2009, 2010.5, 2012, 2013.5, and 2015 technology generations.]
On-Chip Networks are Fundamentally Different
• Different cost model
  – Wires plentiful, no pin constraints
  – Buffers expensive (consume die area)
  – Slow signal propagation
• Different usage patterns - particularly for SoCs
  – Significant isochronous traffic
  – Hard RT constraints
• Different design problems
  – Floorplans
  – Energy-efficient transmission circuits
NSF Workshop Identified 3 Critical Issues
• Power
  – OCINs will have 10x the required power with current approaches
  – Circuit and architecture innovations can close this gap
• Latency
  – OCIN latency currently not competitive with buses and dedicated wiring
  – Novel flow-control strategies required
• Tool integration
  – OCINs need to be integrated with standard tool flows to enable widespread use
The Road Ahead
• INs become an even more dominant system component
  – Number of processors goes up, cost of processors decreases
  – Communication dominates performance and cost
  – From hand-held media UI devices to huge data centers
• Technology drives topology in new directions
  – On-chip, short-reach electrical (10m), optical
  – Expect radix to continue to increase
  – Hybrid topologies to match each packaging level
• Latency will approach that of dedicated wiring
  – Better flow control and router architecture
  – Optimized circuits
• Adaptivity will optimize performance
  – Balance load, route around defects, tolerate variation, tune power to load
Summary
• Interconnection Networks (INs) are THE central component of modern computing systems
• High-radix topologies have evolved to exploit packaging/signaling technology
  – Including hybrid optical/electrical
  – Flattened butterfly
• Global adaptive routing balances load and enables advanced topologies
  – Eliminates transient load imbalance
  – Uses local queues to estimate global congestion
• Cray Black Widow - an example high-radix network
• On-chip INs
  – Very different constraints
  – Three "gaps" identified: power, latency, tools
• The road ahead
  – Lots of room for improvement; INs are in their infancy
Some very good books
Backup
Virtual Channel Router Architecture
[Figure: k input ports (Input 1 … Input k), each with v virtual-channel buffers (VC 1 … VC v), feeding a crossbar switch to k output ports (Output 1 … Output k); control comprises routing computation, a VC allocator, and a switch allocator.]
Baseline Performance Evaluation
[Figure: latency (cycles), 0–50, vs. offered load, 0–1, for the low-radix network.]
Baseline Performance Evaluation
[Figure: latency (cycles), 0–50, vs. offered load, 0–1, for the low-radix network and the high-radix baseline; annotation: "Low radix better".]