Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA
Outline • Introduction • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 2 ICPP - 2011
Introduction • Tremendous increase in interest in interactive websites (social networking, e-commerce, etc.) • Dynamic data is stored in databases for future retrieval and analysis • Database lookups are expensive • Memcached – a distributed memory caching layer, implemented using traditional BSD sockets • The socket interface provides portability, but entails additional processing and multiple message copies • High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE) – Low latency, high bandwidth, low CPU overhead
• Used by many machines in the Top500 list (http://www.top500.org) 3 ICPP - 2011
Outline • Introduction • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 4 ICPP - 2011
Memcached Overview
(Architecture figure: Internet clients reach proxy servers acting as Memcached clients, which connect over a system area network to the Memcached servers and the database servers)
• Memcached provides scalable distributed caching • Spare memory in data-center servers can be aggregated to speed up lookups • Essentially a distributed in-memory key-value store • Keys can be arbitrary character strings, typically MD5 sums or hashes • Typically used to cache database queries, results of API calls, or webpage rendering elements • Scalable model, but typical usage is very network intensive – performance is directly tied to that of the underlying networking technology 5 ICPP - 2011
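To make the key-value usage above concrete, a minimal sketch of a client storing and fetching a cached database result with the libmemcached C API (the client library used later in the evaluation) might look as follows. The server address, key, and value are illustrative placeholders, and this is not the paper's code.

```c
/* Minimal sketch of Memcached's key-value usage via the libmemcached C API.
 * Server address and key/value contents are illustrative placeholders. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* assumed server location */

    const char *key   = "user:42:profile";            /* e.g. an MD5/hashed DB key */
    const char *value = "{\"name\":\"alice\"}";       /* e.g. a cached query result */

    /* Store the value with no expiration and no flags. */
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          value, strlen(value), 0, 0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    /* Retrieve it back; on a miss the application would fall through to the DB. */
    size_t len; uint32_t flags;
    char *cached = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (cached) {
        printf("cache hit: %.*s\n", (int)len, cached);
        free(cached);
    }
    memcached_free(memc);
    return 0;
}
```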
Outline • Introduction • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 6 ICPP - 2011
Modern High Performance Interconnects
(Figure: protocol stacks compared side by side – the application sits on either the sockets interface or IB Verbs; sockets run over kernel-space TCP/IP on 1/10 GigE adapters, over hardware-offloaded TCP/IP on 10 GigE-TOE adapters, or over IPoIB and SDP on InfiniBand adapters; IB Verbs use RDMA in user space on InfiniBand adapters – each path with its corresponding Ethernet, 10 GigE, or InfiniBand switch)
7 ICPP - 2011
Problem Statement • High-performance RDMA-capable interconnects have emerged in the scientific computing domain • Applications using Memcached still rely on sockets • Performance of Memcached is critical to most of its deployments • Can Memcached be re-designed from the ground up to utilize RDMA-capable networks?
8 ICPP - 2011
A New Approach using Unified Communication Runtime (UCR)
(Figure: current approach – Application over Sockets over 1/10 GigE networks; our approach – Application over UCR over RDMA-capable networks (IB, 10GE, iWARP, RoCE, ...))
• Sockets not designed for high performance – Stream semantics often mismatch for upper layers (Memcached, Hadoop) – Multiple copies can be involved
9 ICPP - 2011
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 10 ICPP - 2011
Unified Communication Runtime (UCR) • Initially proposed to unify the communication runtimes of different parallel programming models – J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH (PGAS'10)
• Design of UCR evolved from the MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/) • UCR provides interfaces for Active Messages as well as one-sided put/get operations • Enhanced APIs to support cloud computing applications • Several enhancements in UCR – endpoint-based design, revamped active-message API, fault tolerance, and synchronization with timeouts
• Communication is based on endpoints, analogous to sockets 11 ICPP - 2011
Active Messaging in UCR • Active messages have proven very powerful in many environments • GASNet Project (UC Berkeley), MPI design using LAPI (IBM), etc.
• We introduce Active Messages into the data-center domain • An Active Message consists of two parts – header and data • When the message arrives at the target, the header handler is run • The header handler identifies the destination buffer for the data • Data is placed into the destination buffer • A completion handler is run afterwards (optional) • Special flags indicate local and remote completion (optional) 12 ICPP - 2011
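The slides do not give the UCR active-message API itself; the sketch below is a hypothetical, in-process illustration of the flow just described – a header handler returns the destination buffer, the payload is placed there, and an optional completion handler runs and sets a completion flag. All names, structures, and the in-memory "send" are invented for illustration (a real runtime would move the payload with RDMA).

```c
/* Hypothetical in-process sketch of the active-message flow described above.
 * Names and structures are invented; this is not the actual UCR API. */
#include <stdio.h>
#include <string.h>

#define SLOT_SIZE 256

static char dest_slots[4][SLOT_SIZE];   /* pre-registered destination buffers */
static int  remote_compl_flag = 0;      /* set when the target side is done   */

/* Header handler: runs at the target when the header arrives and
 * returns the destination buffer for the payload. */
static void *header_handler(int slot_id, size_t data_len)
{
    (void)data_len;
    return dest_slots[slot_id];
}

/* Completion handler: runs at the target after the data has been placed. */
static void completion_handler(int slot_id, size_t data_len)
{
    printf("target: %zu bytes delivered into slot %d\n", data_len, slot_id);
    remote_compl_flag = 1;
}

/* "Send" an active message: header first, then the payload into the buffer
 * chosen by the header handler (a real runtime would use RDMA here). */
static void am_send(int slot_id, const void *data, size_t len)
{
    void *dst = header_handler(slot_id, len);   /* 1. header handler picks buffer */
    memcpy(dst, data, len);                     /* 2. payload placed into buffer  */
    completion_handler(slot_id, len);           /* 3. optional completion handler */
}

int main(void)
{
    const char msg[] = "set key=foo value=bar";
    am_send(1, msg, sizeof msg);
    printf("origin: remote completion flag = %d\n", remote_compl_flag);
    printf("slot 1 now holds: %s\n", dest_slots[1]);
    return 0;
}
```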
Active Messaging in UCR (contd.)
(Figure: message flow between Origin and Target for General Active Message Functionality – the header is sent first, the header handler runs at the target, the data follows via RDMA, the completion handler runs, and the local/remote completion flags are set – and for Optimized Short Active Message Functionality – the header and data are sent together and the header handler copies the data out before the completion handler runs)
13 ICPP - 2011
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 14 ICPP - 2011
Memcached Design using UCR
(Figure: Sockets clients and RDMA clients connect to the Memcached server; a master thread hands each client to a sockets worker thread or a verbs worker thread, and all worker threads operate on the shared data – memory slabs, items, ...)
• Server and client perform a negotiation protocol – The master thread assigns clients to the appropriate worker thread • Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread • All other Memcached data structures are shared among RDMA and Sockets worker threads 15 ICPP - 2011
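A hypothetical sketch of the dispatch step described above: after the negotiation protocol reveals a client's transport, the master thread binds it to a sockets or verbs worker. All names and the round-robin policy are invented for illustration; the real server integrates this decision with memcached's libevent-driven worker threads and the UCR runtime.

```c
/* Hypothetical sketch of binding clients to sockets or verbs worker threads.
 * Names and the round-robin policy are invented; not the actual design. */
#include <stdio.h>

typedef enum { TRANSPORT_SOCKETS, TRANSPORT_VERBS } transport_t;

typedef struct {
    int         client_id;
    transport_t transport;   /* learned during the negotiation protocol */
} client_t;

#define NUM_SOCKETS_WORKERS 4
#define NUM_VERBS_WORKERS   4

static int next_sockets = 0;
static int next_verbs   = 0;

/* Returns the index of the worker the client is now bound to.  From this
 * point on the client talks only to that worker; the slabs, items, and hash
 * table remain shared by all workers. */
static int assign_client(const client_t *c)
{
    if (c->transport == TRANSPORT_VERBS) {
        int w = next_verbs;
        next_verbs = (next_verbs + 1) % NUM_VERBS_WORKERS;
        printf("client %d bound to verbs worker %d\n", c->client_id, w);
        return w;
    }
    int w = next_sockets;
    next_sockets = (next_sockets + 1) % NUM_SOCKETS_WORKERS;
    printf("client %d bound to sockets worker %d\n", c->client_id, w);
    return w;
}

int main(void)
{
    client_t rdma_client    = { 1, TRANSPORT_VERBS   };
    client_t sockets_client = { 2, TRANSPORT_SOCKETS };
    assign_client(&rdma_client);
    assign_client(&sockets_client);
    return 0;
}
```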
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 16 ICPP - 2011
Experimental Setup • Used two clusters – Intel Clovertown • Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk • Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
– Intel Westmere • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk • Network: 1GigE, IPoIB, and IB (QDR)
• Memcached software – Memcached Server: 1.4.5 – Memcached Client (libmemcached): 0.45
17 ICPP - 2011
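The slides do not show the benchmark source; the following is a sketch, under the assumption of a simple timed-loop measurement with the libmemcached C API listed above, of how average Get latency at one value size might be obtained. The hostname, iteration count, and value size are placeholders, and this is not the authors' benchmark code.

```c
/* Sketch of a simple Get-latency microbenchmark with the libmemcached C API.
 * Host, value size, and iteration count are illustrative placeholders. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    const size_t value_size = 4096;      /* e.g. the 4 KB data point */
    const int    iters      = 10000;
    const char  *key        = "bench-key";

    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "memcached-server", 11211);  /* assumed host */

    char *value = malloc(value_size);
    memset(value, 'x', value_size);
    memcached_set(memc, key, strlen(key), value, value_size, 0, 0);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        size_t len; uint32_t flags; memcached_return_t rc;
        char *got = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        free(got);                       /* libmemcached returns a malloc'd copy */
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average Get latency: %.2f us\n", usec / iters);

    free(value);
    memcached_free(memc);
    return 0;
}
```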
Memcached Get Latency (Small Message)
(Figure: Get latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 4 bytes – DDR: 6 us; QDR: 5 us – 4K bytes – DDR: 20 us; QDR: 12 us • Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster • Almost a factor of seven improvement over IPoIB for 4 KB on the QDR cluster 18
ICPP - 2011
Memcached Get Latency (Large Message)
(Figure: Get latency in us vs. message size from 4 KB to 512 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 8K bytes – DDR: 17 us; QDR: 13 us – 512K bytes – DDR: 362 us; QDR: 94 us • Almost a factor of three improvement over 10GE (TOE) for 512 KB on the DDR cluster • Almost a factor of four improvement over IPoIB for 512 KB on the QDR cluster 19
ICPP - 2011
Memcached Set Latency (Small Message)
(Figure: Set latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Set latency – 4 bytes – DDR: 7 us; QDR: 5 us – 4K bytes – DDR: 15 us; QDR: 13 us • Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster • Almost a factor of six improvement over IPoIB for 4 KB on the QDR cluster 20 ICPP - 2011
Memcached Set Latency (Large Message)
(Figure: Set latency in us vs. message size from 4 KB to 512 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Set latency – 8K bytes – DDR: 18 us; QDR: 15 us – 512K bytes – DDR: 375 us; QDR: 185 us • Almost a factor of two improvement over 10GE (TOE) for 512 KB on the DDR cluster • Almost a factor of three improvement over IPoIB for 512 KB on the QDR cluster 21
ICPP - 2011
Memcached Latency (10% Set, 90% Get)
(Figure: latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 4 bytes – DDR: 7 us; QDR: 5 us – 4K bytes – DDR: 15 us; QDR: 12 us • Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster ICPP - 2011
22
Memcached Latency (50% Set, 50% Get)
(Figure: latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 4 bytes – DDR: 7 us; QDR: 5 us – 4K bytes – DDR: 15 us; QDR: 12 us • Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster ICPP - 2011
23
Memcached Get TPS (4 byte)
(Figure: thousands of Get transactions per second at 4 bytes for 8 and 16 clients, comparing SDP, IPoIB, 10G-TOE, and OSU Design; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get transactions per second for 4 bytes – On IB DDR about 600K/s for 16 clients – On IB QDR 1.9M/s for 16 clients • Almost a factor of six improvement over 10GE (TOE) • Significant improvement with native IB QDR compared to SDP and IPoIB ICPP - 2011
24
Memcached Get TPS (4KB)
(Figure: Get transactions per second at 4 KB for 8 and 16 clients, comparing SDP, IPoIB, 10G-TOE, and OSU Design; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get transactions per second for 4K bytes – On IB DDR about 330K/s for 16 clients – On IB QDR 842K/s for 16 clients • Almost a factor of four improvement over 10GE (TOE) • Significant improvement with native IB QDR ICPP - 2011
25
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 26 ICPP - 2011
Conclusion & Future Work • Described a novel design of Memcached for RDMA-capable networks • Provided a detailed performance comparison of our design against unmodified Memcached using sockets over RDMA-capable and 10GE networks • Observed significant performance improvement with the proposed design • Factor of four improvement in Memcached Get latency (4K bytes) • Factor of six improvement in Memcached Get transactions/s (4 bytes) • We plan to improve UCR by taking into account the many features in the OpenFabrics API and the Unreliable Datagram transport, designing iWARP and RoCE versions of UCR, and thereby scaling Memcached • We are working on enhancing the Hadoop/HBase designs for RDMA-capable networks 27 ICPP - 2011
Thank You! {jose, subramon, luom, zhanjmin, huangjia, rahmanmd, islamn, ouyangx, wangh, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH Web Page http://mvapich.cse.ohio-state.edu/ ICPP - 2011
28