Scalable Network Designs for

• Scalable Network Design
• Some Measurements
• How the Network Can Help

Benoît “tsuna” Sigoure
Member of the Yak Shaving Staff
[email protected]
@tsunanet

A Brief History of Network Software

[Timeline, 1970–2010:]
• 1970s–1990s: custom monolithic embedded OS
• ~2000: modified BSD, QNX, or Linux kernel base
• ~2010: “The switch is just a Linux box” — EOS = Linux, driving a custom ASIC

Inside a Modern Switch

[Photo: Arista 7050Q-64 internals]
• 2 PSUs, 4 fans
• x86 CPU, DDR3 RAM
• USB flash, SATA SSD
• Network chip(s)

It’s All About The Software

• Command API
• Zero Touch Replacement
• VmTracer
• CloudVision
• Email
• Wireshark
• Fast Failover
• DANZ
• Fabric Monitoring
• OpenTSDB
• #!/bin/bash
• $ sudo yum install
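Because EOS is Linux, anything you can script or install on a Linux box runs on the switch itself. A minimal sketch — dropping into a shell is a real EOS feature, but the package name and the reachable yum repo are illustrative assumptions:

  # From the EOS CLI, drop into an ordinary shell:
  switch# bash

  # Then use plain Linux tooling. Package name and repo are illustrative:
  $ sudo yum install -y tcollector    # e.g. ship stats to OpenTSDB
  $ cat /proc/net/dev                 # the same kernel counters as any Linux server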

Typical Design: Keep It Simple

[Diagram: Rack-1 with Server 1, Server 2, Server 3 … Server N on one Arista 7050T-64 — 48 RJ45 ports, 4 QSFP+ ports, console port, USB port, mgmt port, air vents]

• Start small: single rack, single switch
• Scale: 64 nodes max
• 10Gbps network, line-rate

Typical Design: Keep It Simple

[Diagram: Rack-1 and Rack-2, Server 1 … Server N in each, the two switches joined by 40Gbps uplinks]

• Start small: 2 racks, 2 switches, no core
• Scale: 96 nodes max with 4 uplinks
• Oversubscription ratio 1:3
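Where 1:3 comes from, worked from the figures above: each switch serves 48 servers × 10 Gbps = 480 Gbps of edge capacity, against 4 × 40 Gbps = 160 Gbps of uplink leaving the rack; 480 / 160 = 3.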

Typical Design: Keep It Simple

[Diagram: Rack-1, Rack-2, and Rack-3, Server 1 … Server N in each, the three switches meshed directly with 40Gbps uplinks]

• 3 racks, 3 switches, no core
• Scale: 144 nodes max with 4 uplinks
• Oversubscription ratio 1:6 — each switch now splits its 4 uplinks between two peers, so only 2×40 Gbps connects any pair of racks

Typical Design: Keep It Simple

[Diagram: Rack-1, Rack-2, and Rack-3, Server 1 … Server N in each, every switch connected to one core switch by 4x40Gbps uplinks]

• 3 racks, 4 switches
• Scale: 144 nodes max with 4 uplinks/rack
• Oversubscription ratio 1:3

Typical Design: Keep It Simple

[Diagram: Rack-1 through Rack-4, Server 1 … Server N in each, every switch connected to the core by 4x40Gbps uplinks]

• 4 racks, 5 switches
• Scale: 192 nodes max
• Oversubscription ratio 1:3

Typical Design: Keep It Simple

[Diagram: Rack-1 through Rack-5 …, Server 1 … Server N in each, every switch connected to two core switches by 2x40Gbps uplinks each]

• 5 to 7 racks, 7 to 9 switches
• Scale: 240 to 336 nodes
• Oversubscription ratio 1:3

Typical Design: Keep It Simple

[Diagram: Rack-1 through Rack-56, Server 1 … Server N in each, every switch connected to 8 core switches by 8x10Gbps uplinks]

• 56 racks, 8 core switches
• Scale: 3136 nodes
• Oversubscription ratio 1:3

Typical Design: Keep It Simple

[Diagram: Rack-1 through Rack-1152, Server 1 … Server N in each, every switch connected to 16 core modular switches by 16x10Gbps uplinks]

• 1152 racks, 16 core modular switches
• Scale: 55296 nodes
• Oversubscription ratio still 1:3
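Worked from the figures above: each rack has 48 servers × 10 Gbps = 480 Gbps of edge against 16 × 10 Gbps = 160 Gbps of uplink, so 480 / 160 = 3 keeps the ratio at 1:3, and 1152 racks × 48 servers = 55,296 nodes.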

Layer 2 vs Layer 3

• Layer 2: “data link layer” — Ethernet “bridge”
• Layer 3: “network layer” — IP “router”

[Diagram — Layer 2: hosts A B C D and E F G H behind two bridges; every bridge must learn every MAC (← A, B, C, D / E, F, G, H →), and the redundant uplinks are blocked by STP unless you resort to vendor magic. Layer 3: the same hosts behind routers, each holding a default route “0.0.0.0/0 via core1 via core2” and balancing across both cores.]

Layer 2

Pros:
• Plug’n’play

Cons:
• Everybody needs to be aware of everybody — watch your MAC & ARP table sizes
• No visibility into individual hops
• Vendor magic or spanning tree
• If you break STP and cause a loop, game over
• Broadcast packets hit everybody

Layer 3

Pros:
• Horizontally scalable
• Deterministic IP assignment scheme
• 100% based on open standards & protocols
• Can debug with ping / traceroute
• Degrades better under fault or misconfig

Cons:
• Need to think up your IP assignment scheme ahead of time
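The “0.0.0.0/0 via core1 via core2” route above is plain ECMP, which any Linux box can express with iproute2. A minimal sketch of roughly what each rack switch programs — the next-hop addresses are illustrative:

  # Install a default route that load-balances across both cores:
  ip route add default \
      nexthop via 10.4.5.254 weight 1 \
      nexthop via 10.4.6.254 weight 1

  # Every hop is visible, so debugging stays boring:
  traceroute 10.4.6.2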

1G vs 10G

• 1Gbps is today what 100Mbps was yesterday
• New generation servers come with 10G LAN on board
• At 1Gbps, the network is one of the first bottlenecks
• A single HBase client can very easily push over 4Gbps to HBase

[Chart: TeraGen duration in seconds, 1Gbps vs 10Gbps — 3.5x faster at 10Gbps]

Jumbo Frames

[Photo: “Elephant Framed!” by Dhinakaran Gajavarathan]

[Chart: millions of packets per TeraGen, MTU=1500 vs MTU=9212 — a 3x reduction with jumbo frames]

Jumbo Frames

[Charts: MTU=1500 vs MTU=9212 — context switches (millions), interrupts (millions), and soft IRQ CPU time (thousand jiffies), each dropping by roughly 8–15% with jumbo frames]
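Enabling jumbo frames is a one-liner per hop, but every device on the path must agree. A minimal sketch for the server side — interface name, peer address, and the 9000-byte value are illustrative, and the switch ports must be raised to at least the same MTU:

  # Raise the interface MTU (9000 is a common server-side jumbo value):
  ip link set dev eth0 mtu 9000

  # Verify the path end to end: 8972 bytes of payload + 28 bytes of
  # IP/ICMP headers = 9000, and -M do forbids fragmentation.
  ping -M do -s 8972 10.4.6.2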

Deep Buffers

[Photo: “«Hardware» buffers” by Roger Ferrer Ibáñez]

[Chart: packet buffer per 10G port (MB) — Arista 7500E: 128, Arista 7500: 48, industry average: 1.5]

Deep Buffers

[Diagram: two groups of 16 hosts × 10G behind a 1MB-buffer switch, uplinked at 4x10G (1:4) and 3x10G (1:5.33), versus a 48MB-buffer switch at 1:5.33]

[Chart: packets dropped per TeraGen — hundreds of thousands with 1MB of buffer at 1:4 and 1:5.33; zero with 48MB of buffer at 1:5.33]

Machine Crashes


How Can a Modern Network Help?

• Detect machine failures
• Redirect traffic to the switch
• Have the switch’s kernel send TCP resets
• Immediately kills all existing and new flows

[Diagram: two racks — gateways 10.4.5.254 and 10.4.6.254, hosts 10.4.5.1–3 and 10.4.6.1–3; EOS = Linux, and the switch takes over the failed host 10.4.5.2]


Fast Server Failover

• Switch learns & tracks IP ↔ port mapping
• Port down → take over IP and MAC addresses
• Kicks in as soon as hardware notifies software of the port going down, within a few milliseconds
• Port back up → resume normal operation
• Ideal to prevent RegionServers stalling on the WAL

[Diagram: gateways 10.4.5.254 and 10.4.6.254, hosts 10.4.5.1–3 and 10.4.6.1–3; EOS = Linux, the switch holding 10.4.5.2 while that server’s port is down]
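Mechanically, the takeover amounts to the switch claiming the dead server’s address and letting its own TCP stack refuse the traffic. A conceptual sketch in plain Linux terms — the interface name is illustrative, and this stands in for what EOS automates:

  # Claim the failed server’s IP on the relevant VLAN interface:
  ip addr add 10.4.5.2/24 dev vlan4

  # Gratuitous ARP so neighbors and the gateway update their caches:
  arping -c 3 -U -I vlan4 10.4.5.2

  # From here, no process on the switch listens on the server’s ports,
  # so the kernel answers every incoming TCP segment with a RST —
  # existing and new flows die immediately instead of timing out.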

Conclusion

• Start thinking of the network as yet another service that runs on Linux boxes
• Build layer 3 networks based on standards
• Use jumbo frames
• Deep buffers prevent packet loss on oversubscribed uplinks
• The network sits in the middle of everything; it can help. Expect to see more in that area.

Thank You

We’re hiring in SF, Santa Clara, Vancouver, and Bangalore

@tsunanet

Benoît “tsuna” Sigoure
Member of the Yak Shaving Staff
[email protected]