Scalable Network Designs
• Scalable Network Design
• Some Measurements
• How the Network Can Help
Benoît “tsuna” Sigoure, Member of the Yak Shaving Staff
[email protected]
@tsunanet
A Brief History of Network Software
• 1970s–1980s: custom monolithic embedded OS
• 1990s–2000s: modified BSD, QNX, or Linux kernel base, driving custom ASICs
• 2010s: “The switch is just a Linux box”, EOS = Linux
Inside a Modern Switch (Arista 7050Q-64)
• 2 PSUs, 4 fans
• x86 CPU with DDR3 memory
• USB flash and SATA SSD storage
• Network chip(s)
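Because EOS presents all of this as a normal Linux system, the same components can be inspected from the switch’s bash shell with ordinary tools. A small sketch; exact device names and mount points vary per platform:

#!/bin/bash
# Look at the switch hardware the same way you would on any server.
grep 'model name' /proc/cpuinfo | head -1    # the x86 CPU
free -m                                      # the DDR3 memory
df -h                                        # the USB flash / SATA SSD (flash is typically under /mnt/flash)
lspci | grep -i ether                        # the network chip(s)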
It’s All About The Software
• Command API, CloudVision, VmTracer
• Zero Touch Replacement, Fast Failover
• DANZ, Fabric Monitoring
• Email, Wireshark, OpenTSDB
• Plain old #!/bin/bash and $ sudo yum install
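Since the switch exposes a normal bash shell and yum, feeding metrics straight into OpenTSDB from the switch itself is just a small script. A minimal sketch, assuming an OpenTSDB instance reachable at opentsdb.example.com (a made-up hostname) and that the front-panel interfaces show up in the kernel as et<N>:

#!/bin/bash
# Hypothetical collector: push per-interface TX byte counters into OpenTSDB
# every 15 seconds using its plain-text "put" protocol on port 4242.
TSD=opentsdb.example.com        # assumption, not from the slides
while sleep 15; do
  for iface in /sys/class/net/et*; do
    echo "put net.bytes.tx $(date +%s) $(cat "$iface/statistics/tx_bytes") iface=$(basename "$iface") host=$(hostname)"
  done
done | nc "$TSD" 4242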
Typical Design: Keep It Simple
• Start small: single rack, single switch
• Scale: 64 nodes max
• 10Gbps network, line-rate
[Diagram: Server 1..N in Rack-1 connected to a 7050T-64 (48 RJ45 ports, 4 QSFP+ ports, console port, USB port, management port, air vents)]
Typical Design: Keep It Simple
• Start small: 2 racks, 2 switches, no core, 40Gbps uplinks between them
• Scale: 96 nodes max with 4 uplinks
• Oversubscription ratio 1:3
[Diagram: Server 1..N in Rack-1 and Rack-2, one top-of-rack switch each, joined by the 40Gbps uplinks]
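The 1:3 ratio is just downlink capacity divided by uplink capacity per switch. A quick worked check, assuming the 48 x 10G server ports and 4 x 40G uplinks of the switches shown earlier:

\[
\text{oversubscription} = \frac{48 \times 10\ \text{Gbps}}{4 \times 40\ \text{Gbps}} = \frac{480}{160} = 3 \quad\Rightarrow\quad 1\!:\!3
\]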
Typical Design: Keep It Simple
• 3 racks, 3 switches, no core
• Scale: 144 nodes max with 4 uplinks
• Oversubscription ratio 1:6
[Diagram: Server 1..N in Rack-1, Rack-2, and Rack-3, the three top-of-rack switches meshed directly together]
Typical Design: Keep It Simple
• 3 racks, 4 switches, 4x40Gbps uplinks
• Scale: 144 nodes max with 4 uplinks/rack
• Oversubscription ratio 1:3
[Diagram: Server 1..N in Rack-1 through Rack-3, each top-of-rack switch uplinked to a core switch]
Typical Design: Keep It Simple
• 4 racks, 5 switches, 4x40Gbps uplinks
• Scale: 192 nodes max
• Oversubscription ratio 1:3
[Diagram: Server 1..N in Rack-1 through Rack-4, each top-of-rack switch uplinked to the core]
Typical Design: Keep It Simple
• 5 to 7 racks, 7 to 9 switches, 2x40Gbps uplinks
• Scale: 240 to 336 nodes
• Oversubscription ratio 1:3
[Diagram: Server 1..N in Rack-1 through Rack-5 and beyond, each top-of-rack switch dual-homed to two core switches]
Typical Design: Keep It Simple
• 56 racks, 8 core switches, 8x10Gbps uplinks per rack
• Scale: 3136 nodes
• Oversubscription ratio 1:3
[Diagram: Server 1..N in Rack-1 through Rack-56, every top-of-rack switch uplinked to the 8 cores]
Typical Design: Keep It Simple
• 1152 racks, 16 core modular switches, 16x10Gbps uplinks per rack
• Scale: 55296 nodes
• Oversubscription ratio still 1:3
[Diagram: Server 1..N in Rack-1 through Rack-1152, every top-of-rack switch uplinked to the 16 cores]
Layer 2 vs Layer 3
• Layer 2: “data link layer”, Ethernet “bridge”
• Layer 3: “network layer”, IP “router”
[Diagram, Layer 2: hosts A, B, C, D and E, F, G, H behind two switches; every switch has to learn every host’s MAC address, and the redundant uplinks are blocked by STP unless you rely on vendor magic]
[Diagram, Layer 3: the same hosts; each switch simply routes 0.0.0.0/0 via core1 and via core2]
Layer 2
Pros:
• Plug’n’play
Cons:
• Everybody needs to be aware of everybody: watch your MAC & ARP table sizes
• No visibility into individual hops
• Vendor magic or spanning tree
• If you break STP and cause a loop, game over
• Broadcast packets hit everybody
Layer 3
Pros:
• Horizontally scalable
• Deterministic IP assignment scheme
• 100% based on open standards & protocols
• Can debug with ping / traceroute
• Degrades better under fault or misconfiguration
Cons:
• Need to think up your IP assignment scheme ahead of time
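The “0.0.0.0/0 via core1 via core2” in the diagram is nothing more than an equal-cost multipath default route, and because each switch is a Linux box the same idea can be written down with iproute2. A sketch only; the core addresses and interface names are made up for the example:

#!/bin/bash
# Default route split across two core switches (addresses and interfaces are illustrative).
ip route add default \
    nexthop via 10.4.0.1 dev et49 \
    nexthop via 10.4.0.2 dev et50
# Plain IP routing means every hop stays visible:
traceroute 10.4.6.3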
1G vs 10G
• 1Gbps is today what 100Mbps was yesterday
• New generation servers come with 10G LAN on board
• At 1Gbps, the network is one of the first bottlenecks
• A single HBase client can very easily push over 4Gbps to HBase
[Chart: TeraGen wall-clock time in seconds, 1Gbps vs 10Gbps; roughly a 3.5x difference in favor of 10Gbps]
Jumbo Frames
[Photo: “Elephant Framed!” by Dhinakaran Gajavarathan]
Jumbo Frames
[Chart: millions of packets per TeraGen, MTU=1500 vs MTU=9212; roughly a 3x reduction in packet count with jumbo frames]
Jumbo Frames
[Charts: per TeraGen, MTU=1500 vs MTU=9212; context switches (millions), interrupts (millions), and soft IRQ CPU time (thousand jiffies) all drop, by roughly 8% to 15%]
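On the server side, jumbo frames are a one-line change per interface. A sketch assuming an interface named eth0 and a path that actually supports the MTU of 9212 used in the measurements above; many NICs and switches cap lower, so verify first:

#!/bin/bash
sudo ip link set dev eth0 mtu 9212          # interface name is an assumption
# Check that jumbo frames really make it end to end:
# -M do forbids fragmentation, -s 9184 = 9212 minus 28 bytes of IP + ICMP headers.
ping -M do -s 9184 -c 3 10.4.6.3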
Deep Buffers
[Photo: “Hardware buffers” by Roger Ferrer Ibáñez]
[Chart: packet buffer per 10G port (MB); Arista 7500E: 128MB, Arista 7500: 48MB, industry average: 1.5MB]
Deep Buffers
[Chart: packets dropped per TeraGen; two racks of 16 hosts x 10G each, uplinked at 4x10G (1:4) or 3x10G (1:5.33). With 1MB of buffer, packets are dropped at both 1:4 and 1:5.33; with 48MB of buffer at 1:5.33, zero packets are dropped]
Machine Crashes
How Can a Modern Network Help?
• Detect machine failures
• Redirect traffic to the switch
• Have the switch’s kernel send TCP resets
• Immediately kills all existing and new flows
[Diagram: two racks, 10.4.5.0/24 and 10.4.6.0/24, with gateways 10.4.5.254 and 10.4.6.254 on the switches; when 10.4.5.2 dies, the switch (EOS = Linux) answers for 10.4.5.2]
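Because the switch’s kernel is a stock Linux kernel, “send TCP resets” maps onto completely standard tooling. A sketch of the idea only, not Arista’s implementation, reusing 10.4.5.2 from the diagram as the dead server:

#!/bin/bash
# Once traffic for the dead server is redirected to the switch, reset every
# flow instead of letting clients hang until their TCP timeouts fire.
iptables -A INPUT   -d 10.4.5.2 -p tcp -j REJECT --reject-with tcp-reset
iptables -A FORWARD -d 10.4.5.2 -p tcp -j REJECT --reject-with tcp-reset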
Fast Server Failover
• Switch learns & tracks IP ↔ port mapping
• Port down → take over IP and MAC addresses
• Kicks in as soon as hardware notifies software of the port going down, within a few milliseconds
• Port back up → resume normal operation
• Ideal to prevent RegionServers stalling on the WAL
[Diagram: the switch (EOS = Linux) answering for 10.4.5.2; racks 10.4.5.0/24 and 10.4.6.0/24 with gateways 10.4.5.254 and 10.4.6.254]
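The takeover step itself, claiming the failed server’s address and telling the rest of the rack about it, also has a plain-Linux equivalent. A sketch under the assumption of a VLAN interface named vlan5; the real product does this inside EOS, not with these exact commands:

#!/bin/bash
# Port to 10.4.5.2 went down: claim its address on the switch.
ip addr add 10.4.5.2/24 dev vlan5       # "vlan5" is a made-up interface name
arping -U -I vlan5 -c 3 10.4.5.2        # gratuitous ARP so neighbors repoint their caches
# When the port comes back up, release it again:
# ip addr del 10.4.5.2/24 dev vlan5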
Conclusion
• Start thinking of the network as yet another service that runs on Linux boxes
• Build layer 3 networks based on standards
• Use jumbo frames
• Deep buffers prevent packet loss on oversubscribed uplinks
• The network sits in the middle of everything; it can help. Expect to see more in that area.
Thank You
We’re hiring in SF, Santa Clara, Vancouver, and Bangalore
@tsunanet
Benoît “tsuna” Sigoure Member of the Yak Shaving Staff
[email protected]