Low Latency RPC in RAMCloud

Mendel Rosenblum, Stanford University
(with Mario Flajslik, Aravind Narayanan, and the RAMCloud Team)

Stanford Computer Forum, April 12, 2011

Outline

●  Theoretical minimum RPC times
   §  What do the physicists tell us?

●  Latency measurements of memcached
   §  Latency hasn't been a focus in either hardware or software

●  Where is all the time going?
   §  It's not all software

●  RAMCloud RPC
   §  Current status and research directions


Datacenter Latency Lower Bounds

●  Propagation speed is around 2/3 the speed of light (c):

   Cable          Speed    Delay (ns/m)
   Twinax         0.65c    5.10
   Fiber          0.66c    5.05
   Twisted pair   0.59c    5.65

●  Estimate: 5 ns/meter

●  Assume 40 machines/rack and 2.6 m²/rack:

   # of Machines   Floor space      Round-trip latency
   1,000           8 m x 8 m        110 ns
   10,000          25 m x 25 m      360 ns
   100,000         80 m x 80 m      1130 ns
   Max             100 m x 100 m    1400 ns
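A quick sanity check of these numbers, assuming the worst-case one-way path is roughly the floor diagonal (my back-of-the-envelope reconstruction, not stated on the slide):

```latex
t_{\mathrm{rt}} \;\approx\; 2\sqrt{L^{2}+W^{2}} \times 5\ \mathrm{ns/m},
\qquad \text{e.g.}\ 2\sqrt{80^{2}+80^{2}}\ \mathrm{m} \times 5\ \mathrm{ns/m}
\;\approx\; 2 \times 113\ \mathrm{m} \times 5\ \mathrm{ns/m} \;\approx\; 1130\ \mathrm{ns}
```

The other rows work out the same way: 8 m by 8 m gives about 113 ns and 25 m by 25 m about 354 ns, matching the table.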


Memcached – Get Request

●  Latency sources: total ~400 µs
   §  Network switch and wire time: 200 µs
   §  Network interface cards (NICs): 128 µs
   §  Linux kernel networking stack: 60 µs
   §  memcached server code: 30 µs


Cause: Data center networking

[Figure: client and server connected through a multi-level topology of five switches each way]

●  Network latency: 150 – 300 µs
   §  10 – 30 µs per switch, 5 switches each way
   §  Can do better
●  Need cut-through routing (no buffering; see the note below)
   §  Arista switch: 0.6 µs per switch
   §  Infiniband switch: 0.1 µs per switch
   §  Cray routers: 20 – 40 ns per switch
●  Solution: Stanford Experimental Data Center Lab
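One way to see why cut-through matters (my arithmetic, not on the slide): a store-and-forward switch must receive an entire frame before forwarding it, so at 10 Gb/s even a single full-size 1500 B frame costs

```latex
1500\ \mathrm{B} \times 0.8\ \mathrm{ns/B} = 1.2\ \mathrm{\mu s}
```

per hop in serialization alone, before any queueing behind other traffic; a cut-through switch starts forwarding as soon as the header has been parsed.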


Cause: NIC hardware

●  Most NICs are designed and configured for throughput, not latency – 128 µs
   §  If switches add 100s of microseconds, what does it matter?
●  Example: Linux driver configuration for an Intel NIC
   §  32 µs delay for interrupt coalescing/throttling
   §  Reduces CPU overhead, avoids receiver livelock
●  Solution:
   §  Reconfigure the NIC's delay settings (see the sketch below)
   §  Get better NICs
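As a concrete illustration of "reconfigure the NIC's delay settings", the sketch below reads and then zeroes the RX/TX interrupt-coalescing delays through Linux's ethtool ioctl interface. The interface name "eth0" is an assumption, not every driver accepts a value of 0, and the equivalent shell command is `ethtool -C eth0 rx-usecs 0 tx-usecs 0`.

```cpp
// Sketch: read, then zero out, the RX/TX interrupt-coalescing delays of a
// Linux NIC via the ethtool ioctl interface. Interface name "eth0" is an
// assumption; error handling is minimal.
#include <cstdio>
#include <cstring>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct ethtool_coalesce coal;
    std::memset(&coal, 0, sizeof(coal));
    coal.cmd = ETHTOOL_GCOALESCE;                 // read current settings

    struct ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_data = reinterpret_cast<char*>(&coal);

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_GCOALESCE"); return 1; }
    std::printf("rx-usecs was %u, tx-usecs was %u\n",
                coal.rx_coalesce_usecs, coal.tx_coalesce_usecs);

    coal.cmd = ETHTOOL_SCOALESCE;                 // write: interrupt immediately
    coal.rx_coalesce_usecs = 0;
    coal.tx_coalesce_usecs = 0;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_SCOALESCE"); return 1; }

    close(fd);
    return 0;
}
```

This trades some CPU efficiency for latency: with coalescing off, every received packet raises an interrupt immediately instead of waiting up to the 32 µs window mentioned above.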


Cause: Kernel software

●  Kernel networking stack: 15 µs x 4 = 60 µs
●  General-purpose solution: sockets, protocol layers
   §  Many instructions and cache misses
   §  Intermediate copies
   §  System call and context switch overheads
●  Solution:
   §  Bypass the kernel networking stack
   §  User-level NIC access, specialized packet-processing software (see the sketch below)


Low latency NIC experimentation

●  Building a NIC with NetFPGA
●  10G NetFPGA:
   §  Xilinx Virtex-5
   §  4 x 10G Ethernet ports
   §  4 NetLogic AEL2005 PHYs
   §  x8 PCIe


One-way 10G Ethernet

[Figure: one-way latency (ns) vs. message size (B) for 10G Ethernet over a 3 m twinax cable]

●  RX + TX for 3 m twinax cable: 872 ns + 0.8 ns/B
●  RX + TX for 8 m fiber cable: 908 ns + 0.8 ns/B
●  Cable delay: 15.3 ns
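The 0.8 ns/B slope in both fits is simply 10 Gb/s serialization (my arithmetic, not on the slide):

```latex
\frac{8\ \mathrm{bits/B}}{10\ \mathrm{Gbit/s}} = 0.8\ \mathrm{ns/B}
```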

10G breakdown

●  TX path (64 B): AXIS 36 ns → MAC 96 ns → XAUI 19 ns → GTX 51 ns → PHY 135 ns → cable 15 ns
   (cumulative: 0, 36, 132, 151, 202, 337, 352 ns)
●  RX path (64 B): PHY 135 ns → GTX 244 ns → XAUI 25 ns → MAC 89 ns → AXIS_buf 80 ns
   (cumulative: 352, 487, 731, 756, 845, 925 ns)
●  AXIS clock: 200 MHz
●  MAC, XAUI and GTX clock: 156.25 MHz
●  PHY is AEL2005 by NetLogic
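As a cross-check (my arithmetic), the 925 ns end-to-end total for a 64 B frame matches the linear fit from the previous slide:

```latex
872\ \mathrm{ns} + 0.8\ \mathrm{ns/B} \times 64\ \mathrm{B} \approx 923\ \mathrm{ns}
```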

PCIe

[Figure: NetFPGA (10G ports) attached over x8 PCIe to an i7-2600K CPU, with RAM on the CPU and the PCH (SATA, USB, PCIe) attached over DMI]

●  x8 PCIe 1.1
   §  bandwidth: 16 Gbps
   §  wire time: 0.5 ns/B

PCIe read latency

[Figure: DMA read latency (ns) vs. transfer size (B)]

●  DMA read from FPGA: 755 ns + 1.5 ns/B
●  Completely idle CPU: additional 800 ns

PCIe breakdown

●  TX path (12 B): PCIe core 36 ns → PHY 64 ns
●  Host (TX + RX): 204 ns
●  RX path (16 B): PHY 209 ns → PCIe core 173 ns
   (timeline marks across the whole read: 0, 114, 178, 382, 591, 764 ns)
●  PHY is Xilinx RocketIO, operating at 125 MHz
●  PCIe core is a Xilinx core operating at 250 MHz
●  Host part of read time: 204 ns

RAMCloud prototype NIC selection

●  Minimum request/reply message exchange test (a sketch of such a test follows)
●  Intel 82599 10GigE NIC
   §  Best case (hacked driver, no switch): 9.5 µs
●  Mellanox MT26428 (user level + one switch)
   §  Infiniband: 3.25 µs
   §  10GigE (Arista): 4.5 µs
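For reference, a minimal kernel-socket version of such a request/reply exchange test might look like the sketch below. The port, default address, 64 B message size and iteration count are assumptions, and error handling is omitted. This version deliberately goes through the ordinary kernel socket path, i.e. exactly the stack the user-level numbers above avoid.

```cpp
// Minimal UDP request/reply ("ping-pong") exchange test: run one instance as
// "server" (echoes every datagram) and one as a client that timestamps
// round trips and reports the average.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
    const uint16_t port = 9000;            // assumed port
    const int iterations = 100000;
    char buf[64];                          // 64 B messages, as on the slides
    std::memset(buf, 0, sizeof(buf));

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    if (argc > 1 && std::strcmp(argv[1], "server") == 0) {
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
        sockaddr_in peer{};
        socklen_t len = sizeof(peer);
        for (;;) {                         // echo every request back
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                 reinterpret_cast<sockaddr*>(&peer), &len);
            if (n > 0)
                sendto(fd, buf, n, 0, reinterpret_cast<sockaddr*>(&peer), len);
        }
    }

    // Client: send a request, block for the reply, repeat, report the average.
    inet_pton(AF_INET, argc > 1 ? argv[1] : "127.0.0.1", &addr.sin_addr);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++) {
        sendto(fd, buf, sizeof(buf), 0,
               reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
        recvfrom(fd, buf, sizeof(buf), 0, nullptr, nullptr);
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("average round trip: %.1f us\n", ns / 1000.0 / iterations);
    close(fd);
    return 0;
}
```

The 3.25 µs and 4.5 µs figures above replace this socket path with user-level access to the Mellanox NIC.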


Research in low latency RPC

●  With user-level NIC access, most software overheads disappear
   §  Even with optimized RPC, the overhead of a read is less than 5 microseconds
●  Can we handle some of the RAMCloud RPC in the NIC?
   §  Implementing in NetFPGA


Conclusion

●  Low latency RPC faces challenges on both the hardware and the software side
   §  We are confident about the software side
   §  Need help on the networking and computing platform
