Faster Interconnect using RDMA -- Quick Overview on Concepts Bhavin Thaker https://www.linkedin.com/in/bhavinthaker/
Agenda
1. What is RDMA?
2. What are Infiniband Hardware Components?
3. What are Infiniband Capabilities?
4. What are RDMA-Eth: RoCE, iWARP, SRP, iSER?
5. What are OFED and the Verbs API?
6. What is the RDMA Programming Model?
7. What is the TCO (Total Cost of Ownership)?
References
1. Training slide-decks from the Open Fabrics Alliance (OFA) and Mellanox; images.google.com
2. Infiniband Specification
3. OFED/OFA Top500: https://www.openfabrics.org/images/docs/PR/OFA_Top500.pdf
4. Attaining High-Performance Communications: A Vertical Approach – Ada Gavrilovska
5. IBM Redbook: Implementing Infiniband on IBM System p
6. http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf
7. http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00593119/c00593119.pdf
8. http://www.violin-memory.com/wp-content/uploads/Violin-Datasheet6000.pdf?d=1
9. Windows Interop blog: http://blogs.technet.com/b/josebda/archive/2012/05/11/windows-server-2012-beta-with-smb-3-0-demo-at-interop-shows-smb-direct-at-5-8-gbytes-sec-over-mellanox-connectx-3-network-adapters.aspx
10. Mellanox critique: http://seekingalpha.com/article/925211-mellanox-on-the-road-to-technological-obsolescence
Does anybody use Infiniband?
• Mellanox
  – IB cards in Violin Memory, EMC Isilon, Oracle Exadata, PureStorage, NetApp DAFS
  – Mellanox IB cards rebranded by IBM, HP, etc.
  – Mellanox RoCE support with Arista switches
  – Voltaire IB (switches) bought by Mellanox
  – Customers: Chevron, Viacom, JPMorgan, Comcast, Airbus, Amadeus, Verizon
  – Nov 13, 2012: "Readers' Choice: Best HPC Interconnect Product" – HPCwire
  – Oracle owns 10% of Mellanox stock
• Intel (has big plans!)
  – Bought QLogic IB division, Cray interconnect division, Fulcrum high-end Ethernet
  – $300 Infiniband NIC vs $700 Mellanox IB NIC
• AMD bought SeaMicro (Freedom: 3D mesh/torus interconnect)
• NASA: largest IB cluster in the world: 40,000 nodes with 5,000 switches
• Standard RDMA stack: OFED on Windows, Linux, AIX, Solaris
• Internal competitive advantage – so not discussed much
OFED Interconnects Emergence
• Share of OFED interconnects in the TOP500 supercomputers increased from almost 2% in 2004 to about 42% in 2010
Windows Interop: 1 DVD/sec
Requirement: Fast Interconnect (Mnemonic: T L C)
1) Throughput (High) => FAT PIPE
2) Latency (Low) => SHORTER PIPE
3) CPU usage (Low) => LESS RESOURCES TO SEND DATA
Fast Interconnect: T L C Analogy
1) Throughput (High) => FAT PIPE
2) Latency (Low) => SHORTER PIPE
3) CPU usage (Low) => LESS RESOURCES TO SEND DATA
Terminology: Overview
– Networking measured in bits/sec; storage in Bytes/sec (bps/8 = Bps)
– Tradeoff: optimize for Throughput OR Latency
– Throughput ≈ 80% of raw Bandwidth (8b/10b encoding)
  • aka Max Data Rate < Signaling Rate (worked example below)
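As a worked example of the signaling-vs-data-rate gap, using the 4x QDR signaling rate quoted on the hardware overview slide later in the deck:

\[
\text{Data rate} = \text{Signaling rate} \times \tfrac{8}{10}
  = 40\ \text{Gbps} \times 0.8 = 32\ \text{Gbps} \approx 4\ \text{GB/s}
\]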
RDMA technology: Overview
– Goal: Overlap computation and communication
– CPU savings: 3 fundamental components of RDMA:
  • ZC: Zero-Copy
  • OS-bypass
  • Protocol Offload
– Microsoft:
  • 6 GBytes/sec (1 DVD/s: 1 link)
  • 110 GBytes/sec
  • 1 PByte scan/sec
Infiniband: Capability Overview - 1
– Scalable, time-proof architecture, better than Ethernet
– Protocol offloading: implements in hardware what TCP does in software
– Reliable message-passing: max hardware MTU = 4 KB, software message size up to 2 GB!
– Switched fabric (similar to PCI Express), not a shared-bus architecture
– Serial, channel-based
– End-to-end QoS via Virtual Lanes; Service Levels carried in packets:
  • Max 16 VLs per port (VL 0-15); VL 15 reserved for management
  • Switch holds an SL-to-VL mapping table
  • Avoids HOL (head-of-line) blocking – the "right turn in traffic" idea
– Credit-based flow control
– Congestion avoidance
– Routing between subnets via IPv6 addressing
Infiniband: Capability Overview - 2
– Transport layer implemented in hardware
Infiniband: Hardware Overview
– IB (Infinite Bandwidth): an I/O interconnect architecture
– HCA <-> Switch <-> TCA
– HCA: GUID (128-bit), LID (16-bit)
  • compare NIC: MAC (48-bit), IP (32-bit)
– Link widths (see the link-rate math below):
  • 1x: 4 wires (2 pairs: send/receive), 4x: 16 wires, 12x: 48 wires
  • 1x QDR: 10 Gbps, 4x QDR: 40 Gbps, 12x QDR: 120 Gbps
  • Scalable for future needs
– Distances:
  • Eth: copper twisted pair: 100 m
  • IB: copper twisted pair: 17 m
  • IB: optical fiber: 1 km
  • IB range-extender: 10 km to 80 km
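A quick check of the link-width arithmetic above, with the 8b/10b factor from the Terminology slide:

\[
\text{Signaling rate} = \text{lanes} \times \text{per-lane rate}:\quad
12\text{x QDR} = 12 \times 10\ \text{Gbps} = 120\ \text{Gbps}
\]
\[
\text{Data rate} = 120\ \text{Gbps} \times \tfrac{8}{10} = 96\ \text{Gbps} = 12\ \text{GB/s}
\]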
3 Types of CA (Channel Adapters)
RDMA Ecosystems
• With 3 types of CA (Channel Adapters):
  1. Native Infiniband [CA = HCA, TCA]
  2. RoCE [CA = NIC]
  3. iWARP (over TCP) [CA = RNIC]
RDMA: Storage Networks
• SRP vs iSER
  – SRP: SCSI RDMA Protocol: NOT an official standard
  – iSER: iSCSI Extensions for RDMA: an official IETF standard
    • More management functionality (e.g. target discovery infrastructure)
    • Principle: iSCSI is "assisted" by a Datamover protocol (iSER)
    • DA: Datamover Architecture; DI: Datamover Interface
• SRP is easier to implement than iSER
• SRP initiators: Linux, Windows, VMware, Solaris, Linux on Power (IBM)
Infiniband: Network Elements
1. CA: HCA or TCA; GUID (128-bit), LID (16-bit)
2. Switch: ports; generates management traffic (MADs)
3. Router: subnet-to-subnet routing using IPv6
SM: Subnet Manager:
• >= 1 SM per subnet: 1 master SM, n standby SMs (for HA)
• SM can run inside a switch (licensed)
• What does the SM do? (a sketch of reading the SM-assigned LID follows below)
  – Grants LIDs (analogous to DHCP granting IPs)
  – Discovers the network topology
  – Sets up forwarding tables in the switches
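A minimal sketch, assuming a host with libibverbs and at least one active IB port, of how an application can read the LID that the Subnet Manager assigned; the choice of the first device and port 1, and the build command, are illustrative assumptions:

```c
/* lid_query.c: print the SM-assigned LID of port 1 of the first RDMA device.
 * Build (assumption): gcc lid_query.c -libverbs -o lid_query
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);   /* enumerate RDMA devices */
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);     /* open the first device */
    if (!ctx) {
        fprintf(stderr, "cannot open %s\n", ibv_get_device_name(devs[0]));
        return 1;
    }

    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr) == 0)                 /* port numbers start at 1 */
        printf("device %s port 1: state=%d LID=0x%x\n",
               ibv_get_device_name(devs[0]), pattr.state, pattr.lid);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```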
Cable/Port Efficiency
• Types: QSFP (IB, Eth), SFP+ (Eth), RJ45/CAT6 (Eth)
• Recommendation: use QSFP/SFP+ cables instead of CAT6/RJ45 cables
• Justification:
  – Lower OPEX, driven by per-port power requirements
  – Lower latency, for better performance
• OPEX example (4 W per port; generalized below):
  4 W x 24 h x 365 days = 35,040 Wh ≈ 35 kWh per year; 35 kWh x $0.22/kWh ≈ $8/year
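The same calculation written as a formula; the 4 W per-port power delta and the $0.22/kWh electricity rate are the slide's figures and will vary by deployment:

\[
\text{OPEX per port per year} = P \times 8760\ \text{h} \times \text{price}
 = 0.004\ \text{kW} \times 8760\ \text{h} \times \$0.22/\text{kWh} \approx \$7.7
\]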
RDMA-capable Switches
• IB switches: inherently RDMA-capable
• 10GbE switches: need DCB support for RDMA capability (RoCE)
  – DCB: Data Center Bridging (Ethernet)
  – DCB support requires:
    1. 802.1Qbb (PFC: Priority Flow Control)
    2. 802.1Qaz (ETS: Enhanced Transmission Selection)
    3. 802.1Qau (QCN: Congestion Notification)
    4. DCBX (ability to exchange DCB configuration information)
RDMA Communication Model
Channel I/O: Message-Passing Paradigm
• An I/O Channel is a conduit between applications for efficient communication
• The applications are in disjoint physical address spaces
• The OS establishes the Channel, but the OS is NOT itself part of the Channel
• Drawback: a slight burden on the application programmer to use the new Verbs API
• Contrast with TCP/Sockets: stream-oriented and synchronous
• Channel I/O uses RDMA underneath
Typical Operations
1. Register memory region (MR) – receive side
2. DMA-map the buffer – receive side
3. Advertise the buffer address and key to the peer – receive side
4. Issue an RDMA WRITE to the advertised buffer – sender side
5. Wait for a completion event on the completion queue – sender side
(A hedged libibverbs sketch of the sender-side steps follows below.)
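A minimal sketch of the sender-side data path (steps 4 and 5) using libibverbs calls; it assumes a connected RC queue pair `qp`, a completion queue `cq`, a locally registered buffer `buf`/`mr`, and that the peer's advertised `remote_addr`/`rkey` were obtained out of band (e.g. over a TCP socket). All variable names are illustrative.

```c
/* Sketch: post one RDMA WRITE and wait for its completion.
 * Assumes qp, cq, mr, buf, len, remote_addr, rkey were set up earlier
 * (QP connected, buffer registered with ibv_reg_mr()). */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_mr *mr, char *buf, size_t len,
                               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* local source buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,         /* local key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write; no receiver CPU involved */
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a CQE when done */
    wr.wr.rdma.remote_addr = remote_addr;         /* advertised by the peer (step 3) */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))          /* step 4: hand the WQE to the HCA */
        return -1;

    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)    /* step 5: busy-poll the CQ */
        ;                                         /* (could instead block on a completion channel) */

    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```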
Infiniband: Verbs API Introduction
– Infiniband Verbs API (descended from the U-Net API at Cornell, mid-1990s)
  • Verbs = an abstract interface specification to the HCA (~actions)
  • Performed on objects
  • Queue Pair (aka Work Queue) and Completion Queue: WQE, CQE
  • Message-oriented
  • Memory semantics: one-sided operations
  • Pinned, pre-registered Memory Region, identified by an Id
(A resource-setup sketch follows below.)
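A hedged sketch of the object setup the slide describes (protection domain, completion queue, queue pair, registered memory region); queue depths and buffer size are arbitrary illustrative values, and connection establishment/QP state transitions are omitted:

```c
/* Sketch: create the basic verbs objects on an already-opened device context. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_qp *setup_resources(struct ibv_context *ctx,
                               struct ibv_pd **pd_out, struct ibv_cq **cq_out,
                               struct ibv_mr **mr_out, size_t buf_size)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);  /* 64-entry CQ */
    void *buf = malloc(buf_size);

    /* Pin and register the buffer; the returned MR carries lkey/rkey. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, buf_size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,          /* reliable connected transport */
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);   /* the Work Queue pair */

    *pd_out = pd; *cq_out = cq; *mr_out = mr;
    return qp;      /* error handling omitted in this sketch */
}
```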
RDMA: OFED Implementation
– OFED: OpenFabrics Enterprise Distribution
– From the OpenFabrics Alliance (OFA), formerly OpenIB
– Common RDMA Verbs API for Infiniband and 10GigE (RoCE, iWARP)
  • Earlier VIA offered no hardware interoperability
– User-space and kernel-space APIs
– RDMA CM: RDMA Connection Manager (connection sketch below)
– Open-source, BSD-licensed, in the Linux kernel
– Available on Linux and Windows; also available on AIX and Solaris
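A minimal sketch of active-side (client) connection setup with the RDMA CM (librdmacm), using the library's synchronous mode; the destination address and timeout values are illustrative, and the verbs resources from the earlier setup sketch would normally be created once the route is resolved:

```c
/* Sketch: client-side connection establishment with librdmacm. */
#include <rdma/rdma_cma.h>

int connect_to_peer(struct sockaddr *dst_addr)   /* e.g. filled in by getaddrinfo() */
{
    struct rdma_cm_id *id;

    /* NULL event channel => the calls below block until each step completes. */
    if (rdma_create_id(NULL, &id, NULL, RDMA_PS_TCP))
        return -1;

    /* Resolve the destination to an RDMA device and route (timeouts in ms). */
    if (rdma_resolve_addr(id, NULL, dst_addr, 2000) ||
        rdma_resolve_route(id, 2000))
        return -1;

    /* ... create PD/CQ/QP on id->verbs here (see the setup sketch above) ... */

    struct rdma_conn_param param = { .retry_count = 7 };
    return rdma_connect(id, &param) ? -1 : 0;     /* blocks until connected or error */
}
```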
RDMA: OFED Verbs APIs
RDMA (IB/Ethernet) TCO: 2-Node Cluster

| Component    | Infiniband                                   | Ethernet (RDMA)                              | Ethernet (non-RDMA)                  |
|--------------|----------------------------------------------|----------------------------------------------|--------------------------------------|
| Interconnect | ConnectX-2 VPI (40G IB & 10G Eth), $650 x 4  | ConnectX-2 VPI (40G IB & 10G Eth), $650 x 4  | Intel 10G NIC, $650 x 4              |
| Cables       | SFP+ cable, $150 x 8                         | SFP+ cable, $150 x 8                         | SFP+ cable, $150 x 8                 |
| Switches     | 16-port Infiniband switch, $4,000 x 2        | 16-port Ethernet RDMA switch, $5,000 x 2     | 16-port Brocade switch, $4,500 x 2   |
| TOTAL        | $11,800                                      | $13,800                                      | $12,800                              |
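A quick check of the column totals, using only the prices in the table above:

\[
\text{IB: } 4(650) + 8(150) + 2(4000) = 2600 + 1200 + 8000 = \$11{,}800
\]
\[
\text{Eth (RDMA): } 2600 + 1200 + 2(5000) = \$13{,}800, \qquad
\text{Eth (non-RDMA): } 2600 + 1200 + 2(4500) = \$12{,}800
\]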
RDMA Technology: Summary
– Goal: Overlap computation and communication
– CPU savings: 3 fundamental components of RDMA:
  • ZC: Zero-Copy
  • OS-bypass
  • Protocol Offload
– 6 GBytes/sec (1 DVD/s: 1 link)
Thanks. Bhavin Thaker https://www.linkedin.com/in/bhavinthaker/