Low Latency RPC in RAMCloud
Mendel Rosenblum, Stanford University
(with Mario Flajslik, Aravind Narayanan, and the RAMCloud Team)
April 12, 2011
Outline
● Theoretical minimum RPC times
  § What do the physicists tell us?
● Latency measurements of memcached
  § Latency hasn't been a focus in either hardware or software
● Where is all the time going?
  § It's not all software
● RAMCloud RPC
  § Current status and research directions
Datacenter Latency Lower Bounds
● Propagation speed is around 2/3 of the speed of light (c):

  Cable          Speed   Delay (ns/m)
  Twinax         0.65c   5.10
  Fiber          0.66c   5.05
  Twisted pair   0.59c   5.65

  Estimate: 5 ns/meter
● Assume 40 machines/rack and 2.6 m² of floor space per rack (the arithmetic is reproduced in the sketch below):

  # of machines   Floor space      Round-trip latency
  1,000           8 m by 8 m         110 ns
  10,000          25 m by 25 m       360 ns
  100,000         80 m by 80 m      1130 ns
  Max             100 m by 100 m    1400 ns
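The table's figures follow from simple geometry. Below is a minimal sketch of that arithmetic, assuming a square floor plan and a round trip of twice the floor diagonal at 5 ns/m; those geometric assumptions are mine, only the 40 machines/rack, 2.6 m²/rack, and 5 ns/m figures come from the slide.

```c
/* Sketch of the latency lower-bound arithmetic, assuming a square
 * floor plan and a round trip of twice the floor diagonal at 5 ns/m. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double ns_per_m = 5.0;            /* ~2/3 of c                */
    const double machines_per_rack = 40.0;
    const double m2_per_rack = 2.6;
    const double machines[] = {1000, 10000, 100000};

    for (int i = 0; i < 3; i++) {
        double racks = machines[i] / machines_per_rack;
        double side  = sqrt(racks * m2_per_rack);           /* floor side, m */
        double rtt   = 2.0 * side * sqrt(2.0) * ns_per_m;   /* 2 x diagonal  */
        printf("%7.0f machines: %5.1f m x %5.1f m, %4.0f ns round trip\n",
               machines[i], side, side, rtt);
    }
    return 0;   /* prints roughly 110 ns, 360 ns, and 1130 ns */
}
```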
Memcached – Get Request
● Latency sources: total ~400 µs (summed in the sketch below)
  § Network switch and wire time: 200 µs
  § Network interface cards (NICs): 128 µs
  § Linux kernel networking stack: 60 µs
  § memcached server code: 30 µs
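As a quick sanity check, a minimal sketch that just sums the slide's four components to the quoted ~400 µs total; the per-component figures are the slide's, nothing here is new data.

```c
/* Sum the slide's four latency components for a memcached get;
 * they total ~418 us, i.e. the ~400 us quoted on the slide. */
#include <stdio.h>

int main(void) {
    const struct { const char *part; double us; } budget[] = {
        { "network switches and wire",   200.0 },
        { "network interface cards",     128.0 },
        { "Linux kernel network stack",   60.0 },
        { "memcached server code",        30.0 },
    };
    double total = 0.0;
    for (int i = 0; i < 4; i++) {
        printf("%-28s %6.1f us\n", budget[i].part, budget[i].us);
        total += budget[i].us;
    }
    printf("%-28s %6.1f us\n", "total", total);
    return 0;
}
```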
Cause: Data center networking
[Diagram: client and server connected through five switches]
● Network latency: 150 – 300 µs
  § 10 – 30 µs per switch, 5 switches each way (worked out in the sketch below)
  § Can do better
● Need cut-through routing (no buffering)
  § Arista switch: 0.6 µs per switch
  § Infiniband switch: 0.1 µs per switch
  § Cray routers: 20 – 40 ns per switch
● Solution: Stanford Experimental Data Center Lab
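To make the arithmetic concrete, a small sketch that multiplies the per-switch figures above by the ten switch traversals in a round trip (5 switches each way); all per-switch latencies are the slide's.

```c
/* Ten switch traversals per round trip (5 switches each way),
 * using the per-switch latencies quoted on the slide. */
#include <stdio.h>

int main(void) {
    const int traversals = 2 * 5;
    printf("store-and-forward (10-30 us each): %3.0f - %3.0f us\n",
           traversals * 10.0, traversals * 30.0);
    printf("Arista cut-through (0.6 us each):  %6.1f us\n", traversals * 0.6);
    printf("Infiniband (0.1 us each):          %6.1f us\n", traversals * 0.1);
    printf("Cray routers (20-40 ns each):      %6.2f - %4.2f us\n",
           traversals * 0.020, traversals * 0.040);
    return 0;
}
```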
Cause: NIC hardware
● Most NICs are designed and configured for throughput, not latency – 128 µs
  § If switches add 100s of microseconds, what does it matter?
● Example: Linux driver configuration for an Intel NIC
  § 32 µs delay for interrupt coalescing/throttling
  § Reduces CPU overheads, avoids receiver livelock
● Solution:
  § Reconfigure the NIC's delay settings (see the sketch below)
  § Get better NICs
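As one example of "reconfigure the delay settings", here is a hedged sketch of turning off receive interrupt coalescing through Linux's ethtool ioctl interface, roughly what `ethtool -C eth2 rx-usecs 0` does from the command line. The interface name is hypothetical, and whether a given driver honors a value of 0 depends on the NIC.

```c
/* Sketch: disable rx interrupt coalescing via the ethtool ioctl,
 * the programmatic equivalent of `ethtool -C eth2 rx-usecs 0`.
 * "eth2" is a hypothetical interface name. */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(void) {
    struct ethtool_coalesce ec;
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ec;

    memset(&ec, 0, sizeof(ec));
    ec.cmd = ETHTOOL_GCOALESCE;                 /* read current settings */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("GCOALESCE"); return 1; }

    ec.cmd = ETHTOOL_SCOALESCE;                 /* write them back ...    */
    ec.rx_coalesce_usecs = 0;                   /* ... with no rx delay   */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SCOALESCE"); return 1; }

    printf("rx interrupt coalescing disabled on eth2\n");
    return 0;
}
```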
Cause: Kernel software
● Kernel networking stack: 15 µs x 4 = 60 µs
● General-purpose solution: sockets, protocol layers
  § Many instructions and cache misses
  § Intermediate copies
  § System call and context switch overheads
● Solution:
  § Bypass the kernel networking stack
  § User-level NIC access and specialized packet processing software (sketched below)
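A purely illustrative sketch of the user-level receive path: the application polls a memory-mapped descriptor ring directly instead of taking an interrupt and crossing the kernel. The ring layout and field names here are invented for illustration; every user-level NIC interface defines its own.

```c
/* Hypothetical user-level receive loop: poll a memory-mapped RX
 * descriptor ring; no interrupts, system calls, or copies on the
 * critical path.  Descriptor layout is invented for illustration. */
#include <stddef.h>
#include <stdint.h>

struct rx_desc {
    volatile uint32_t ready;       /* set by the NIC when a packet lands */
    uint32_t          length;      /* bytes placed in buf                */
    uint8_t           buf[2048];   /* packet contents                    */
};

/* Spin on the ring and hand each packet straight to the RPC layer. */
void poll_rx_ring(struct rx_desc *ring, size_t ndesc,
                  void (*dispatch)(const uint8_t *pkt, uint32_t len)) {
    size_t next = 0;
    for (;;) {
        struct rx_desc *d = &ring[next];
        while (!d->ready)
            ;                        /* busy-wait: trade CPU for latency */
        dispatch(d->buf, d->length); /* specialized packet processing    */
        d->ready = 0;                /* hand the descriptor back to NIC  */
        next = (next + 1) % ndesc;
    }
}
```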
Low latency NIC experimentation
● Building a NIC with NetFPGA
● 10G NetFPGA:
  § Xilinx Virtex-5
  § 4 x 10G Ethernet ports
  § 4 NetLogic AEL2005 PHYs
  § x8 PCIe
One-way 10G Ethernet latency
[Figure: "10G Ethernet (3 m twinax cable)" – one-way latency (ns) vs. message size (B), 0 – 1600 B]
● RX + TX for 3 m twinax cable: 872 ns + 0.8 ns/B (evaluated in the sketch below)
● RX + TX for 8 m fiber cable: 908 ns + 0.8 ns/B
● Cable delay: 15.3 ns
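The measurements fit a simple linear model: a fixed per-packet cost plus a per-byte cost. The sketch below just evaluates the two fits from the slide at a few message sizes.

```c
/* Evaluate the slide's linear fits for one-way 10G Ethernet latency. */
#include <stdio.h>

static double oneway_ns(double fixed_ns, double per_byte_ns, double bytes) {
    return fixed_ns + per_byte_ns * bytes;
}

int main(void) {
    const double sizes[] = {64, 256, 1024, 1500};
    for (int i = 0; i < 4; i++)
        printf("%4.0f B: 3 m twinax %6.0f ns, 8 m fiber %6.0f ns\n",
               sizes[i],
               oneway_ns(872.0, 0.8, sizes[i]),    /* 3 m twinax fit */
               oneway_ns(908.0, 0.8, sizes[i]));   /* 8 m fiber fit  */
    return 0;   /* e.g. 64 B over twinax: ~923 ns */
}
```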
10G breakdown
● TX path (64 B): AXIS 36 ns → MAC 96 ns → XAUI 19 ns → GTX 51 ns → PHY 135 ns (cumulative: 36, 132, 151, 202, 337 ns)
● Cable: 15 ns (337 ns → 352 ns)
● RX path (64 B): PHY 135 ns → GTX 244 ns → XAUI 25 ns → MAC 89 ns → AXIS_buf 80 ns (cumulative: 487, 731, 756, 845, 925 ns; summed in the sketch below)
● AXIS clock: 200 MHz
● MAC, XAUI and GTX clock: 156.25 MHz
● PHY is the AEL2005 by NetLogic
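A quick cross-check: summing the per-stage numbers above gives 925 ns, consistent with the 872 ns + 0.8 ns/B fit evaluated at 64 B (~923 ns).

```c
/* Sum the per-stage latencies of the 64 B breakdown above. */
#include <stdio.h>

int main(void) {
    const double tx[] = {36, 96, 19, 51, 135};   /* AXIS, MAC, XAUI, GTX, PHY     */
    const double rx[] = {135, 244, 25, 89, 80};  /* PHY, GTX, XAUI, MAC, AXIS_buf */
    const double cable = 15;                     /* 3 m twinax                    */
    double total = cable;
    for (int i = 0; i < 5; i++)
        total += tx[i] + rx[i];
    printf("64 B one-way total: %.0f ns\n", total);   /* 925 ns */
    return 0;
}
```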
PCIe read latency
[Diagram: 10G NetFPGA attached to an i7-2600K CPU over x8 PCIe; RAM on the CPU, PCH (SATA, USB, PCIe) over DMI]
● x8 PCIe 1.1
  § Bandwidth: 16 Gbps
  § Wire time: 0.5 ns/B
[Figure: PCIe read latency (ns) vs. read size (B), 0 – 70 B]
● DMA read from FPGA: 755 ns + 1.5 ns/B (evaluated in the sketch below)
● Completely idle CPU: additional 800 ns
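Same kind of linear model as on the Ethernet side: the sketch evaluates the slide's 755 ns + 1.5 ns/B DMA-read fit at a few sizes and, for comparison, the 0.5 ns/B raw wire time of x8 PCIe 1.1.

```c
/* Evaluate the slide's PCIe DMA-read fit and compare the per-byte
 * cost with the raw 0.5 ns/B wire time of x8 PCIe 1.1 (16 Gbps). */
#include <stdio.h>

int main(void) {
    const double sizes[] = {4, 16, 64};
    for (int i = 0; i < 3; i++)
        printf("%3.0f B read: %5.0f ns total (%4.1f ns of raw wire time)\n",
               sizes[i], 755.0 + 1.5 * sizes[i], 0.5 * sizes[i]);
    return 0;   /* e.g. 64 B: ~851 ns, of which only 32 ns is wire time */
}
```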
PCIe breakdown
● TX path (12 B): PCIe core 36 ns, PHY 64 ns
● Host (TX + RX): 204 ns
● RX path (16 B): PHY 209 ns, PCIe core 173 ns
● Total read time: 764 ns
● PHY is Xilinx RocketIO, operating at 125 MHz
● PCIe core is a Xilinx core, operating at 250 MHz
● Host part of read time: 204 ns
RAMCloud prototype NIC selection
● Minimum request/reply message exchange test (sketched below)
● Intel 82599 10GigE NIC
  § Best case (hacked driver, no switch): 9.5 µs
● Mellanox MT26428 (user level + one switch)
  § Infiniband: 3.25 µs
  § 10GigE (Arista): 4.5 µs
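The numbers above come from a minimal request/reply exchange. Below is a hedged sketch of that kind of test over plain UDP sockets; the real measurements used a hacked Intel driver and Mellanox user-level access, so this socket version only shows the shape of the benchmark. The server address and port are hypothetical, and an echo server is assumed on the other side.

```c
/* Sketch of a minimal request/reply round-trip measurement over UDP.
 * Assumes a hypothetical echo server at 192.168.1.2:9000. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(9000);                      /* hypothetical port    */
    inet_pton(AF_INET, "192.168.1.2", &srv.sin_addr);  /* hypothetical address */

    char req[16] = "ping", resp[16];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    sendto(s, req, sizeof(req), 0, (struct sockaddr *)&srv, sizeof(srv));
    recvfrom(s, resp, sizeof(resp), 0, NULL, NULL);    /* blocks for the reply */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("round trip: %.2f us\n", us);
    close(s);
    return 0;
}
```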
Research in low latency RPC
● With user-level NIC access, most software overheads disappear
  § Even with optimized RPC, the overhead of a read is less than 5 microseconds
● Can we handle some of the RAMCloud RPC in the NIC?
  § Implementing in NetFPGA
Conclusion
● Low latency RPC faces challenges on both the hardware and software sides
  § We are confident about the software side
  § Need help on the networking and computing platform