Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA
Outline • Introduction • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 2 ICPP - 2011
Introduction • Tremendous increase in interest in interactive websites (social networking, e-commerce, etc.) • Dynamic data is stored in databases for future retrieval and analysis • Database lookups are expensive • Memcached – a distributed memory caching layer, implemented using traditional BSD sockets • The socket interface provides portability, but entails additional processing and multiple message copies • High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE) – Low latency, high bandwidth, low CPU overhead
• Used by many machines in the Top500 list (http://www.top500.org) 3 ICPP - 2011
Outline • Introduction • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 4 ICPP - 2011
Memcached Overview
(Architecture figure: Internet clients reach proxy servers acting as Memcached clients, which connect over a system area network to the Memcached servers and the database servers)
• Memcached provides scalable distributed caching • Spare memory in data-center servers can be aggregated to speed up lookups • Essentially a distributed in-memory key-value store • Keys can be arbitrary character strings, typically MD5 sums or hashes • Typically used to cache database queries, results of API calls, or webpage rendering elements • Scalable model, but typical usage is very network intensive – performance is directly tied to that of the underlying networking technology 5 ICPP - 2011
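To make the key-value usage above concrete, a minimal sketch of a client storing and fetching a cached database result with the libmemcached C API (the client library used later in the evaluation) might look as follows. The server address, key, and value are illustrative placeholders, and this is not the paper's code.

```c
/* Minimal sketch of Memcached's key-value usage via the libmemcached C API.
 * Server address and key/value contents are illustrative placeholders. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* assumed server location */

    const char *key   = "user:42:profile";            /* e.g. an MD5/hashed DB key */
    const char *value = "{\"name\":\"alice\"}";       /* e.g. a cached query result */

    /* Store the value with no expiration and no flags. */
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          value, strlen(value), 0, 0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    /* Retrieve it back; on a miss the application would fall through to the DB. */
    size_t len; uint32_t flags;
    char *cached = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (cached) {
        printf("cache hit: %.*s\n", (int)len, cached);
        free(cached);
    }
    memcached_free(memc);
    return 0;
}
```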
Outline • Introduction • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 6 ICPP - 2011
Modern High Performance Interconnects
(Figure: protocol stacks compared side by side – the application sits on either the sockets interface or IB Verbs; sockets run over kernel-space TCP/IP on 1/10 GigE adapters, over hardware-offloaded TCP/IP on 10 GigE-TOE adapters, or over IPoIB and SDP on InfiniBand adapters; IB Verbs use RDMA in user space on InfiniBand adapters – each path with its corresponding Ethernet, 10 GigE, or InfiniBand switch)
7 ICPP - 2011
Problem Statement • High-performance RDMA-capable interconnects have emerged in the scientific computing domain • Applications using Memcached still rely on sockets • Performance of Memcached is critical to most of its deployments • Can Memcached be re-designed from the ground up to utilize RDMA-capable networks?
8 ICPP - 2011
A New Approach using Unified Communication Runtime (UCR)
(Figure: current approach – Application over Sockets over 1/10 GigE networks; our approach – Application over UCR over RDMA-capable networks (IB, 10GE, iWARP, RoCE, ...))
• Sockets not designed for high performance – Stream semantics often mismatch for upper layers (Memcached, Hadoop) – Multiple copies can be involved
9 ICPP - 2011
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 10 ICPP - 2011
Unified Communication Runtime (UCR) • Initially proposed to unify the communication runtimes of different parallel programming models – J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH (PGAS'10)
• Design of UCR evolved from the MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/) • UCR provides interfaces for Active Messages as well as one-sided put/get operations • Enhanced APIs to support cloud computing applications • Several enhancements in UCR – endpoint-based design, revamped active-message API, fault tolerance, and synchronization with timeouts
• Communication is based on endpoints, analogous to sockets 11 ICPP - 2011
Active Messaging in UCR • Active messages have proven very powerful in many environments • GASNet Project (UC Berkeley), MPI design using LAPI (IBM), etc.
• We introduce Active Messages into the data-center domain • An Active Message consists of two parts – header and data • When the message arrives at the target, the header handler is run • The header handler identifies the destination buffer for the data • Data is placed into the destination buffer • A completion handler is run afterwards (optional) • Special flags indicate local and remote completion (optional) 12 ICPP - 2011
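The slides do not give the UCR active-message API itself; the sketch below is a hypothetical, in-process illustration of the flow just described – a header handler returns the destination buffer, the payload is placed there, and an optional completion handler runs and sets a completion flag. All names, structures, and the in-memory "send" are invented for illustration (a real runtime would move the payload with RDMA).

```c
/* Hypothetical in-process sketch of the active-message flow described above.
 * Names and structures are invented; this is not the actual UCR API. */
#include <stdio.h>
#include <string.h>

#define SLOT_SIZE 256

static char dest_slots[4][SLOT_SIZE];   /* pre-registered destination buffers */
static int  remote_compl_flag = 0;      /* set when the target side is done   */

/* Header handler: runs at the target when the header arrives and
 * returns the destination buffer for the payload. */
static void *header_handler(int slot_id, size_t data_len)
{
    (void)data_len;
    return dest_slots[slot_id];
}

/* Completion handler: runs at the target after the data has been placed. */
static void completion_handler(int slot_id, size_t data_len)
{
    printf("target: %zu bytes delivered into slot %d\n", data_len, slot_id);
    remote_compl_flag = 1;
}

/* "Send" an active message: header first, then the payload into the buffer
 * chosen by the header handler (a real runtime would use RDMA here). */
static void am_send(int slot_id, const void *data, size_t len)
{
    void *dst = header_handler(slot_id, len);   /* 1. header handler picks buffer */
    memcpy(dst, data, len);                     /* 2. payload placed into buffer  */
    completion_handler(slot_id, len);           /* 3. optional completion handler */
}

int main(void)
{
    const char msg[] = "set key=foo value=bar";
    am_send(1, msg, sizeof msg);
    printf("origin: remote completion flag = %d\n", remote_compl_flag);
    printf("slot 1 now holds: %s\n", dest_slots[1]);
    return 0;
}
```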
Active Messaging in UCR (contd.)
(Figure: message flow between Origin and Target for General Active Message Functionality – the header is sent first, the header handler runs at the target, the data follows via RDMA, the completion handler runs, and the local/remote completion flags are set – and for Optimized Short Active Message Functionality – the header and data are sent together and the header handler copies the data out before the completion handler runs)
13 ICPP - 2011
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 14 ICPP - 2011
Memcached Design using UCR
(Figure: Sockets clients and RDMA clients connect to the Memcached server; a master thread hands each client to a sockets worker thread or a verbs worker thread, and all worker threads operate on the shared data – memory slabs, items, ...)
• Server and client perform a negotiation protocol – The master thread assigns clients to the appropriate worker thread • Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread • All other Memcached data structures are shared among RDMA and Sockets worker threads 15 ICPP - 2011
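A hypothetical sketch of the dispatch step described above: after the negotiation protocol reveals a client's transport, the master thread binds it to a sockets or verbs worker. All names and the round-robin policy are invented for illustration; the real server integrates this decision with memcached's libevent-driven worker threads and the UCR runtime.

```c
/* Hypothetical sketch of binding clients to sockets or verbs worker threads.
 * Names and the round-robin policy are invented; not the actual design. */
#include <stdio.h>

typedef enum { TRANSPORT_SOCKETS, TRANSPORT_VERBS } transport_t;

typedef struct {
    int         client_id;
    transport_t transport;   /* learned during the negotiation protocol */
} client_t;

#define NUM_SOCKETS_WORKERS 4
#define NUM_VERBS_WORKERS   4

static int next_sockets = 0;
static int next_verbs   = 0;

/* Returns the index of the worker the client is now bound to.  From this
 * point on the client talks only to that worker; the slabs, items, and hash
 * table remain shared by all workers. */
static int assign_client(const client_t *c)
{
    if (c->transport == TRANSPORT_VERBS) {
        int w = next_verbs;
        next_verbs = (next_verbs + 1) % NUM_VERBS_WORKERS;
        printf("client %d bound to verbs worker %d\n", c->client_id, w);
        return w;
    }
    int w = next_sockets;
    next_sockets = (next_sockets + 1) % NUM_SOCKETS_WORKERS;
    printf("client %d bound to sockets worker %d\n", c->client_id, w);
    return w;
}

int main(void)
{
    client_t rdma_client    = { 1, TRANSPORT_VERBS   };
    client_t sockets_client = { 2, TRANSPORT_SOCKETS };
    assign_client(&rdma_client);
    assign_client(&sockets_client);
    return 0;
}
```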
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 16 ICPP - 2011
Experimental Setup • Used two clusters – Intel Clovertown • Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk • Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
– Intel Westmere • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk • Network: 1GigE, IPoIB, and IB (QDR)
• Memcached software – Memcached Server: 1.4.5 – Memcached Client (libmemcached): 0.45
17 ICPP - 2011
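The slides do not show the benchmark source; the following is a sketch, under the assumption of a simple timed-loop measurement with the libmemcached C API listed above, of how average Get latency at one value size might be obtained. The hostname, iteration count, and value size are placeholders, and this is not the authors' benchmark code.

```c
/* Sketch of a simple Get-latency microbenchmark with the libmemcached C API.
 * Host, value size, and iteration count are illustrative placeholders. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(void)
{
    const size_t value_size = 4096;      /* e.g. the 4 KB data point */
    const int    iters      = 10000;
    const char  *key        = "bench-key";

    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "memcached-server", 11211);  /* assumed host */

    char *value = malloc(value_size);
    memset(value, 'x', value_size);
    memcached_set(memc, key, strlen(key), value, value_size, 0, 0);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        size_t len; uint32_t flags; memcached_return_t rc;
        char *got = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
        free(got);                       /* libmemcached returns a malloc'd copy */
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average Get latency: %.2f us\n", usec / iters);

    free(value);
    memcached_free(memc);
    return 0;
}
```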
Memcached Get Latency (Small Message)
(Figure: Get latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 4 bytes – DDR: 6 us; QDR: 5 us – 4K bytes – DDR: 20 us; QDR: 12 us • Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster • Almost a factor of seven improvement over IPoIB for 4 KB on the QDR cluster 18
ICPP - 2011
Memcached Get Latency (Large Message)
(Figure: Get latency in us vs. message size from 4 KB to 512 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 8K bytes – DDR: 17 us; QDR: 13 us – 512K bytes – DDR: 362 us; QDR: 94 us • Almost a factor of three improvement over 10GE (TOE) for 512 KB on the DDR cluster • Almost a factor of four improvement over IPoIB for 512 KB on the QDR cluster 19
ICPP - 2011
Memcached Set Latency (Small Message)
(Figure: Set latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Set latency – 4 bytes – DDR: 7 us; QDR: 5 us – 4K bytes – DDR: 15 us; QDR: 13 us • Almost a factor of four improvement over 10GE (TOE) for 4 KB on the DDR cluster • Almost a factor of six improvement over IPoIB for 4 KB on the QDR cluster 20 ICPP - 2011
Memcached Set Latency (Large Message)
(Figure: Set latency in us vs. message size from 4 KB to 512 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Set latency – 8K bytes – DDR: 18 us; QDR: 15 us – 512K bytes – DDR: 375 us; QDR: 185 us • Almost a factor of two improvement over 10GE (TOE) for 512 KB on the DDR cluster • Almost a factor of three improvement over IPoIB for 512 KB on the QDR cluster 21
ICPP - 2011
Memcached Latency (10% Set, 90% Get)
(Figure: latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 4 bytes – DDR: 7 us; QDR: 5 us – 4K bytes – DDR: 15 us; QDR: 12 us • Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster ICPP - 2011
22
Memcached Latency (50% Set, 50% Get)
(Figure: latency in us vs. message size from 1 byte to 4 KB for SDP, IPoIB, OSU Design, 1G, and 10G-TOE; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get latency – 4 bytes – DDR: 7 us; QDR: 5 us – 4K bytes – DDR: 15 us; QDR: 12 us • Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster ICPP - 2011
23
Memcached Get TPS (4 byte)
(Figure: thousands of Get transactions per second at 4 bytes for 8 and 16 clients, comparing SDP, IPoIB, 10G-TOE, and OSU Design; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get transactions per second for 4 bytes – On IB DDR about 600K/s for 16 clients – On IB QDR 1.9M/s for 16 clients • Almost a factor of six improvement over 10GE (TOE) • Significant improvement with native IB QDR compared to SDP and IPoIB ICPP - 2011
24
Memcached Get TPS (4KB)
(Figure: Get transactions per second at 4 KB for 8 and 16 clients, comparing SDP, IPoIB, 10G-TOE, and OSU Design; left panel: Intel Clovertown Cluster (IB: DDR), right panel: Intel Westmere Cluster (IB: QDR))
• Memcached Get transactions per second for 4K bytes – On IB DDR about 330K/s for 16 clients – On IB QDR 842K/s for 16 clients • Almost a factor of four improvement over 10GE (TOE) • Significant improvement with native IB QDR ICPP - 2011
25
Outline • Introduction & Motivation • Overview of Memcached • Modern High Performance Interconnects • Unified Communication Runtime (UCR) • Memcached Design using UCR • Performance Evaluation • Conclusion & Future Work 26 ICPP - 2011
Conclusion & Future Work • Described a novel design of Memcached for RDMA-capable networks • Provided a detailed performance comparison of our design against unmodified Memcached using sockets over RDMA-capable and 10GE networks • Observed significant performance improvement with the proposed design • Factor of four improvement in Memcached Get latency (4K bytes) • Factor of six improvement in Memcached Get transactions/s (4 bytes) • We plan to improve UCR by taking into account the many features in the OpenFabrics API and the Unreliable Datagram transport, designing iWARP and RoCE versions of UCR, and thereby scaling Memcached • We are working on enhancing the Hadoop/HBase designs for RDMA-capable networks 27 ICPP - 2011
Thank You! {jose, subramon, luom, zhanjmin, huangjia, rahmanmd, islamn, ouyangx, wangh, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/ MVAPICH Web Page http://mvapich.cse.ohio-state.edu/ ICPP - 2011
28