Memcached Design on High Performance RDMA Capable Interconnects

Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi-ur-Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan Sur & D. K. Panda

Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, USA

Outline
•  Introduction
•  Overview of Memcached
•  Modern High Performance Interconnects
•  Unified Communication Runtime (UCR)
•  Memcached Design using UCR
•  Performance Evaluation
•  Conclusion & Future Work

Introduction
•  Tremendous increase in interest in interactive web sites (social networking, e-commerce, etc.)
•  Dynamic data is stored in databases for future retrieval and analysis
•  Database lookups are expensive
•  Memcached – a distributed memory caching layer, implemented using traditional BSD sockets
•  The socket interface provides portability, but entails additional processing and multiple message copies
•  High-Performance Computing (HPC) has adopted advanced interconnects (e.g. InfiniBand, 10 Gigabit Ethernet/iWARP, RoCE)
   –  Low latency, high bandwidth, low CPU overhead
   –  Used by many machines in the Top500 list (http://www.top500.org)

Outline
•  Introduction
•  Overview of Memcached
•  Modern High Performance Interconnects
•  Unified Communication Runtime (UCR)
•  Memcached Design using UCR
•  Performance Evaluation
•  Conclusion & Future Work

Memcached Overview

[Figure: Memcached deployment – Internet clients reach proxy servers (Memcached clients), which connect over a system area network to Memcached servers and database servers]

•  Memcached provides scalable distributed caching
•  Spare memory in data-center servers can be aggregated to speed up lookups
•  Basically a key-value distributed memory store (a minimal client-side example follows below)
•  Keys can be any character strings, typically MD5 sums or hashes
•  Typically used to cache database queries, results of API calls, or webpage rendering elements
•  Scalable model, but typical usage is very network intensive – performance is directly related to that of the underlying networking technology
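For readers less familiar with the client side, a minimal cache-aside lookup with the libmemcached C client (the client library used later in the evaluation) might look like the sketch below. The server address, key, value, and expiration time are placeholders for illustration, not values from the paper.

```c
/* Minimal cache-aside sketch with libmemcached (assumed server at 127.0.0.1:11211).
 * Compile with: gcc cache_aside.c -lmemcached */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
    memcached_server_push(memc, servers);

    const char *key = "user:42:profile";   /* e.g. a hash of the database query */
    size_t value_len;
    uint32_t flags;

    /* Try the cache first */
    char *value = memcached_get(memc, key, strlen(key), &value_len, &flags, &rc);
    if (rc != MEMCACHED_SUCCESS) {
        /* Cache miss: fetch from the database (placeholder) and populate the cache */
        const char *db_value = "row fetched from the database";
        memcached_set(memc, key, strlen(key), db_value, strlen(db_value),
                      (time_t)300 /* expire in 5 minutes */, 0);
    } else {
        printf("cache hit: %.*s\n", (int)value_len, value);
        free(value);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}
```

Every Get that misses falls back to the database, which is exactly the lookup cost the caching layer is meant to avoid.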

Outline
•  Introduction
•  Overview of Memcached
•  Modern High Performance Interconnects
•  Unified Communication Runtime (UCR)
•  Memcached Design using UCR
•  Performance Evaluation
•  Conclusion & Future Work

Modern High Performance Interconnects

[Figure: Protocol stack comparison. Rows: Application, Application Interface (Sockets or IB Verbs), Protocol Implementation (kernel-space TCP/IP, hardware-offloaded TCP/IP, IPoIB, or user-space SDP/RDMA), Network Adapter, Network Switch. Columns: 1/10 GigE (Ethernet adapter and switch), 10 GigE-TOE (10 GigE adapter and switch), IPoIB (InfiniBand adapter and switch), SDP (InfiniBand adapter and switch), IB Verbs with RDMA (InfiniBand adapter and switch)]
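To make the user-space IB Verbs path concrete: instead of socket()/connect(), a verbs application talks to the adapter through the OpenFabrics libibverbs library. The fragment below is a minimal sketch (not from the paper) that only enumerates and opens a device; queue pairs, memory registration, and RDMA operations are omitted.

```c
/* Minimal sketch: open an InfiniBand device through the verbs API (libibverbs).
 * Compile with: gcc open_ib.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device; real code would pick by name or port state */
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) {
        fprintf(stderr, "failed to open %s\n", ibv_get_device_name(dev_list[0]));
        ibv_free_device_list(dev_list);
        return 1;
    }
    printf("opened %s\n", ibv_get_device_name(dev_list[0]));

    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```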

Problem Statement
•  High-performance RDMA-capable interconnects have emerged in the scientific computation domain
•  Applications using Memcached still rely on sockets
•  Performance of Memcached is critical to most of its deployments
•  Can Memcached be re-designed from the ground up to utilize RDMA-capable networks?

A New Approach using Unified Communication Runtime (UCR)

[Figure: Current approach – Application → Sockets → 1/10 GigE network. Our approach – Application → UCR → IB Verbs → RDMA-capable networks (IB, 10GE, iWARP, RoCE, ...)]

•  Sockets not designed for high performance
   –  Stream semantics are often a mismatch for upper layers (Memcached, Hadoop)
   –  Multiple copies can be involved

Outline
•  Introduction & Motivation
•  Overview of Memcached
•  Modern High Performance Interconnects
•  Unified Communication Runtime (UCR)
•  Memcached Design using UCR
•  Performance Evaluation
•  Conclusion & Future Work

Unified Communication Runtime (UCR)
•  Initially proposed to unify communication runtimes of different parallel programming models
   –  J. Jose, M. Luo, S. Sur and D. K. Panda, "Unifying UPC and MPI Runtimes: Experience with MVAPICH" (PGAS '10)
•  Design of UCR evolved from the MVAPICH/MVAPICH2 software stacks (http://mvapich.cse.ohio-state.edu/)
•  UCR provides interfaces for Active Messages as well as one-sided put/get operations
•  Enhanced APIs to support cloud computing applications
•  Several enhancements in UCR: endpoint-based design, revamped active-message API, fault tolerance, and synchronization with timeouts
•  Communication is based on endpoints, analogous to sockets

Active Messaging in UCR
•  Active messages have proven to be very powerful in many environments
   –  GASNet project (UC Berkeley), MPI design using LAPI (IBM), etc.
•  We introduce active messages into the data-center domain
•  An active message consists of two parts – header and data
•  When the message arrives at the target, the header handler is run (see the sketch below)
•  The header handler identifies the destination buffer for the data
•  Data is put into the destination buffer
•  The completion handler is run afterwards (optional)
•  Special flags indicate local & remote completions (optional)
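Purely as an illustration of this flow (and not the UCR API, which the slides do not reproduce), the self-contained sketch below simulates delivery of one active message: a header handler selects the destination buffer, the payload is copied into it in place of the real RDMA transfer, and an optional completion handler runs afterwards. All names and structures are invented for the example.

```c
/* Self-contained active-message sketch: header handler picks the buffer,
 * data is placed there (standing in for the RDMA transfer), then an
 * optional completion handler runs. Illustrative only, not the UCR API. */
#include <stdio.h>
#include <string.h>

typedef struct {
    int    dest_id;     /* which buffer the data belongs in */
    size_t data_len;
} am_header_t;

static char buffers[4][256];   /* destination buffers at the "target" */

/* Header handler: runs when the message arrives, returns the destination buffer */
static void *header_handler(const am_header_t *hdr)
{
    return buffers[hdr->dest_id % 4];
}

/* Optional completion handler: runs after the data is in place */
static void completion_handler(const am_header_t *hdr, void *dest)
{
    printf("delivered %zu bytes into buffer %d: %s\n",
           hdr->data_len, hdr->dest_id, (char *)dest);
}

/* Simulated delivery of one active message at the target */
static void deliver_active_message(const am_header_t *hdr, const void *data)
{
    void *dest = header_handler(hdr);     /* 1. pick destination buffer      */
    memcpy(dest, data, hdr->data_len);    /* 2. data placed (RDMA stand-in)  */
    completion_handler(hdr, dest);        /* 3. optional completion hook     */
}

int main(void)
{
    const char payload[] = "value for key user:42";
    am_header_t hdr = { .dest_id = 1, .data_len = sizeof(payload) };
    deliver_active_message(&hdr, payload);
    return 0;
}
```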

Active Messaging in UCR (contd.)

[Figure: Two message-flow diagrams between origin and target. General active message functionality: the origin sends the header; the target's header handler runs; the data is transferred via RDMA; LComplFlag is set at the origin; the target's completion handler runs and sets RComplFlag; ComplFlag is set at the origin. Optimized short active message functionality: the origin sends header + data together and sets LComplFlag; the target's header handler copies the data; the completion handler runs and sets RComplFlag; ComplFlag is set at the origin.]

Outline
•  Introduction & Motivation
•  Overview of Memcached
•  Modern High Performance Interconnects
•  Unified Communication Runtime (UCR)
•  Memcached Design using UCR
•  Performance Evaluation
•  Conclusion & Future Work

Memcached Design using UCR

[Figure: Memcached server architecture – a master thread dispatches sockets clients to sockets worker threads and RDMA clients to verbs worker threads; shared data (memory slabs, items, ...) is accessed by all worker threads]

•  Server and client perform a negotiation protocol
   –  The master thread assigns clients to the appropriate worker thread (see the dispatch sketch below)
•  Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
•  All other Memcached data structures are shared among RDMA and sockets worker threads
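As an illustrative sketch only (the paper's server code is not shown in the slides), the dispatch idea can be summarized as follows: after negotiation, the master logic binds each client to either a sockets worker or a verbs worker, while the slabs/items remain shared between both kinds of workers. All types and names below are placeholders.

```c
/* Placeholder sketch of the master-thread dispatch idea; not the paper's code. */
#include <stdio.h>

typedef enum { CLIENT_SOCKETS, CLIENT_RDMA } client_kind_t;

typedef struct {
    int           id;
    client_kind_t kind;   /* learned during the negotiation protocol */
} client_t;

typedef struct {
    const char *name;
    int         bound_clients;
} worker_t;

/* Master-thread logic: once assigned, the client stays bound to this worker */
static worker_t *assign_worker(client_t *c, worker_t *sock_w, worker_t *verbs_w)
{
    worker_t *w = (c->kind == CLIENT_RDMA) ? verbs_w : sock_w;
    w->bound_clients++;
    printf("client %d bound to %s worker\n", c->id, w->name);
    return w;
}

int main(void)
{
    worker_t sockets_worker = { "sockets", 0 };
    worker_t verbs_worker   = { "verbs",   0 };

    client_t clients[] = { {1, CLIENT_SOCKETS}, {2, CLIENT_RDMA}, {3, CLIENT_RDMA} };
    for (int i = 0; i < 3; i++)
        assign_worker(&clients[i], &sockets_worker, &verbs_worker);

    /* Memcached's slabs/items would be shared by both worker types. */
    return 0;
}
```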

Outline
•  Introduction & Motivation
•  Overview of Memcached
•  Modern High Performance Interconnects
•  Unified Communication Runtime (UCR)
•  Memcached Design using UCR
•  Performance Evaluation
•  Conclusion & Future Work

Experimental Setup
•  Used two clusters
   –  Intel Clovertown
      •  Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk
      •  Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
   –  Intel Westmere
      •  Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk
      •  Network: 1GigE, IPoIB, and IB (QDR)
•  Memcached software
   –  Memcached server: 1.4.5
   –  Memcached client (libmemcached): 0.45 (a simple client-side timing loop is sketched below)
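As a rough indication of how client-side Get latency of the kind reported on the next slides can be measured, a libmemcached timing loop might look like the sketch below. The server address, message sizes, and iteration count are placeholders; this is not the benchmark code used in the paper.

```c
/* Rough client-side Get latency loop with libmemcached (placeholder server,
 * sizes, and iteration count). Compile with: gcc get_latency.c -lmemcached */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
    memcached_server_push(memc, servers);

    const size_t sizes[] = { 4, 64, 1024, 4096 };
    const int iters = 1000;
    char *value = malloc(4096);
    memset(value, 'x', 4096);

    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        /* Store a value of the desired size, then time repeated Gets */
        memcached_set(memc, "bench", 5, value, sizes[s], 0, 0);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            size_t len; uint32_t flags;
            char *v = memcached_get(memc, "bench", 5, &len, &flags, &rc);
            free(v);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / 1e3 / iters;
        printf("%zu bytes: %.1f us per Get\n", sizes[s], us);
    }

    free(value);
    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}
```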

Memcached Get Latency (Small Messages)

[Figure: Get latency, Time (us) vs. Message Size (1 B – 4 KB), for SDP, IPoIB, OSU Design, 1G, and 10G-TOE on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]

•  Memcached Get latency
   –  4 bytes – DDR: 6 us; QDR: 5 us
   –  4K bytes – DDR: 20 us; QDR: 12 us
•  Almost a factor of four improvement over 10GE (TOE) for 4KB on the DDR cluster
•  Almost a factor of seven improvement over IPoIB for 4KB on the QDR cluster

Memcached Get Latency (Large Messages)

[Figure: Get latency, Time (us) vs. Message Size (4 KB – 512 KB), for SDP, IPoIB, OSU Design, 1G, and 10G-TOE on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]

•  Memcached Get latency
   –  8K bytes – DDR: 17 us; QDR: 13 us
   –  512K bytes – DDR: 362 us; QDR: 94 us
•  Almost a factor of three improvement over 10GE (TOE) for 512KB on the DDR cluster
•  Almost a factor of four improvement over IPoIB for 512KB on the QDR cluster

   

Memcached Set Latency (Small Messages)

[Figure: Set latency, Time (us) vs. Message Size (1 B – 4 KB), for SDP, IPoIB, OSU Design, 1G, and 10G-TOE on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]

•  Memcached Set latency
   –  4 bytes – DDR: 7 us; QDR: 5 us
   –  4K bytes – DDR: 15 us; QDR: 13 us
•  Almost a factor of four improvement over 10GE (TOE) for 4KB on the DDR cluster
•  Almost a factor of six improvement over IPoIB for 4KB on the QDR cluster

Memcached Set Latency (Large Messages)

[Figure: Set latency, Time (us) vs. Message Size (4 KB – 512 KB), for SDP, IPoIB, OSU Design, 1G, and 10G-TOE on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]

•  Memcached Set latency
   –  8K bytes – DDR: 18 us; QDR: 15 us
   –  512K bytes – DDR: 375 us; QDR: 185 us
•  Almost a factor of two improvement over 10GE (TOE) for 512KB on the DDR cluster
•  Almost a factor of three improvement over IPoIB for 512KB on the QDR cluster

Memcached Latency (10% Set, 90% Get)

[Figure: Latency, Time (us) vs. Message Size (1 B – 4 KB), for SDP, IPoIB, OSU Design, 1G, and 10G on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]

•  Memcached Get latency
   –  4 bytes – DDR: 7 us; QDR: 5 us
   –  4K bytes – DDR: 15 us; QDR: 12 us
•  Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster

Memcached Latency (50% Set, 50% Get)

[Figure: Latency, Time (us) vs. Message Size (1 B – 4 KB), for SDP, IPoIB, OSU Design, 1G, and 10G-TOE on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]

•  Memcached Get latency
   –  4 bytes – DDR: 7 us; QDR: 5 us
   –  4K bytes – DDR: 15 us; QDR: 12 us
•  Almost a factor of four improvement over 10GE (TOE) for 4K bytes on the DDR cluster

Memcached Get TPS (4 bytes)

[Figure: Thousands of Get transactions per second (TPS) for 8 and 16 clients – SDP, 10G-TOE, and OSU Design on the Intel Clovertown cluster (IB: DDR); SDP, IPoIB, and OSU Design on the Intel Westmere cluster (IB: QDR)]

•  Memcached Get transactions per second for 4 bytes
   –  On IB DDR: about 600K/s for 16 clients
   –  On IB QDR: 1.9M/s for 16 clients
•  Almost a factor of six improvement over 10GE (TOE)
•  Significant improvement with native IB QDR compared to SDP and IPoIB

Memcached Get TPS (4 KB)

[Figure: Get transactions per second (TPS) for 8 and 16 clients – SDP, 10G-TOE, and OSU Design on the Intel Clovertown cluster (IB: DDR); SDP, IPoIB, and OSU Design on the Intel Westmere cluster (IB: QDR)]

•  Memcached Get transactions per second for 4K bytes
   –  On IB DDR: about 330K/s for 16 clients
   –  On IB QDR: 842K/s for 16 clients
•  Almost a factor of four improvement over 10GE (TOE)
•  Significant improvement with native IB QDR

Outline
•  Introduction & Motivation
•  Overview of Memcached
•  Modern High Performance Interconnects
•  Unified Communication Runtime (UCR)
•  Memcached Design using UCR
•  Performance Evaluation
•  Conclusion & Future Work

Conclusion & Future Work
•  Described a novel design of Memcached for RDMA-capable networks
•  Provided a detailed performance comparison of our design against unmodified Memcached using sockets over RDMA and 10GE networks
•  Observed significant performance improvement with the proposed design
   –  Factor of four improvement in Memcached Get latency (4K bytes)
   –  Factor of six improvement in Memcached Get transactions/s (4 bytes)
•  We plan to improve UCR by taking into account the many features in the OpenFabrics API and the Unreliable Datagram transport, and by designing iWARP and RoCE versions of UCR, thereby scaling Memcached
•  We are working on enhancing the Hadoop/HBase designs for RDMA-capable networks

Thank You!
{jose, subramon, luom, zhanjmin, huangjia, rahmanmd, islamn, ouyangx, wangh, surs, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/
