Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet
A Tutorial at CCGrid '11

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Sayantan Sur
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~surs
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
Current and Next Generation Applications and Computing Systems
• Diverse range of applications
  – Processing and dataset characteristics vary
• Growth of High Performance Computing
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
• Different kinds of systems
  – Clusters, Grid, Cloud, Datacenters, …
Cluster Computing Environment
[Figure: a compute cluster (frontend and compute nodes) connected over a LAN to a storage cluster consisting of a meta-data manager and I/O server nodes holding data]
Trends for Computing Clusters in the Top 500 List (http://www.top500.org)

  Nov. 1996:  0/500 (0%)      Nov. 2001:  43/500 (8.6%)    Nov. 2006: 361/500 (72.2%)
  Jun. 1997:  1/500 (0.2%)    Jun. 2002:  80/500 (16%)     Jun. 2007: 373/500 (74.6%)
  Nov. 1997:  1/500 (0.2%)    Nov. 2002:  93/500 (18.6%)   Nov. 2007: 406/500 (81.2%)
  Jun. 1998:  1/500 (0.2%)    Jun. 2003: 149/500 (29.8%)   Jun. 2008: 400/500 (80.0%)
  Nov. 1998:  2/500 (0.4%)    Nov. 2003: 208/500 (41.6%)   Nov. 2008: 410/500 (82.0%)
  Jun. 1999:  6/500 (1.2%)    Jun. 2004: 291/500 (58.2%)   Jun. 2009: 410/500 (82.0%)
  Nov. 1999:  7/500 (1.4%)    Nov. 2004: 294/500 (58.8%)   Nov. 2009: 417/500 (83.4%)
  Jun. 2000: 11/500 (2.2%)    Jun. 2005: 304/500 (60.8%)   Jun. 2010: 424/500 (84.8%)
  Nov. 2000: 28/500 (5.6%)    Nov. 2005: 360/500 (72.0%)   Nov. 2010: 415/500 (83%)
  Jun. 2001: 33/500 (6.6%)    Jun. 2006: 364/500 (72.8%)   Jun. 2011: To be announced
Grid Computing Environment
[Figure: multiple compute clusters, each with a frontend, compute nodes and a storage cluster (meta-data manager and I/O server nodes), interconnected over LANs and a WAN]
Multi-Tier Datacenters and Enterprise Computing
[Figure: an enterprise multi-tier datacenter with routers/servers (Tier 1), application servers (Tier 2) and database servers (Tier 3) connected through switches]
Integrated High-End Computing Environments
[Figure: a compute cluster and storage cluster connected over a LAN/WAN to an enterprise multi-tier datacenter for visualization and mining, with routers/servers (Tier 1), application servers (Tier 2) and database servers (Tier 3) behind switches]
Cloud Computing Environments
[Figure: physical machines hosting virtual machines (VMs) with local storage, connected over a LAN to a virtual file system built from meta-data and I/O servers]
Hadoop Architecture
• Underlying Hadoop Distributed File System (HDFS)
• Fault tolerance by replicating data blocks
• NameNode: stores information on data blocks
• DataNodes: store blocks and host MapReduce computation
• JobTracker: tracks jobs and detects failures
• The model scales, but there is a high amount of communication during intermediate phases
Memcached Architecture
[Figure: Internet-facing web-frontend servers connected over a system area network to memcached servers and database servers]
• Distributed caching layer
  – Allows spare memory from multiple nodes to be aggregated
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive
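To make the usage model above concrete, the following is a minimal sketch of a memcached client using the libmemcached C API; the server address, key and value are placeholders, and error handling is reduced to the essentials (exact type names may differ slightly across libmemcached versions).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(void)
{
    /* Connect to a single memcached server; real deployments add many
     * servers and let the client hash keys across them. */
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);

    /* Cache the result of an (assumed) expensive database query. */
    const char *key = "user:42:profile";
    const char *value = "{\"name\":\"alice\"}";
    memcached_set(memc, key, strlen(key), value, strlen(value),
                  (time_t)300 /* expire after 5 minutes */, (uint32_t)0);

    /* Later lookups are served from the distributed cache instead of
     * going back to the database tier. */
    size_t value_len;
    uint32_t flags;
    memcached_return_t rc;
    char *cached = memcached_get(memc, key, strlen(key),
                                 &value_len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS) {
        printf("cache hit: %.*s\n", (int)value_len, cached);
        free(cached);
    }

    memcached_free(memc);
    return 0;
}
```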
Networking and I/O Requirements
• Good System Area Networks with excellent performance (low latency, high bandwidth and low CPU utilization) for inter-processor communication (IPC) and I/O
• Good Storage Area Networks with high-performance I/O
• Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
• Quality of Service (QoS) for interactive applications
• RAS (Reliability, Availability, and Serviceability)
• All of the above at low cost
Major Components in Computing Systems
• Hardware components
  – Processing cores and memory subsystem
  – I/O bus or links
  – Network adapters/switches
• Software components
  – Communication stack
• Bottlenecks can artificially limit the network performance the user perceives
[Figure: a node with multi-core processors and memory connected over an I/O bus to a network adapter and network switch, annotated with processing, I/O interface and network bottlenecks]
Processing Bottlenecks in Traditional Protocols
• Ex: TCP/IP, UDP/IP
• Generic architecture for all networks
• Host processor handles almost all aspects of communication
  – Data buffering (copies on sender and receiver)
  – Data integrity (checksum)
  – Routing aspects (IP routing)
• Signaling between different layers
  – Hardware interrupt on packet arrival or transmission
  – Software signals between different layers to handle protocol processing in different priority levels
[Figure: host-based protocol processing on the cores creates a processing bottleneck before data reaches the network adapter and switch]
Bottlenecks in Traditional I/O Interfaces and Networks
• Traditionally relied on bus-based technologies (last-mile bottleneck)
  – E.g., PCI, PCI-X
  – One bit per wire
  – Performance increase through:
    • Increasing clock speed
    • Increasing bus width
  – Not scalable:
    • Cross talk between bits
    • Skew between wires
    • Signal integrity makes it difficult to increase bus width significantly, especially for high clock speeds

  PCI     1990                       33 MHz/32 bit: 1.05 Gbps (shared bidirectional)
  PCI-X   1998 (v1.0), 2003 (v2.0)   133 MHz/64 bit: 8.5 Gbps; 266-533 MHz/64 bit: 17 Gbps (shared bidirectional)
Bottlenecks on Traditional Networks
• Network speeds saturated at around 1 Gbps
  – Features provided were limited
  – Commodity networks were not considered scalable enough for very large-scale systems
[Figure: with host and I/O bottlenecks addressed, the network adapter and switch become the bottleneck]

  Ethernet (1979 -)           10 Mbit/sec
  Fast Ethernet (1993 -)      100 Mbit/sec
  Gigabit Ethernet (1995 -)   1000 Mbit/sec
  ATM (1995 -)                155/622/1024 Mbit/sec
  Myrinet (1993 -)            1 Gbit/sec
  Fibre Channel (1994 -)      1 Gbit/sec
Motivation for InfiniBand and High-speed Ethernet
• Industry networking standards
• InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks
• InfiniBand aimed at all three bottlenecks (protocol processing, I/O bus, and network speed)
• Ethernet aimed at directly handling the network speed bottleneck, relying on complementary technologies to alleviate the protocol processing and I/O bus bottlenecks
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
IB Trade Association
• The IB Trade Association was formed by seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: to design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
• Many other industry partners participated in the effort to define the IB architecture specification
• The IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000
  – The latest version, 1.2.1, was released in January 2008
• http://www.infinibandta.org
High-speed Ethernet Consortium (10GE/40GE/100GE)
• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
• Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
• http://www.ethernetalliance.org
• 40-Gbps (servers) and 100-Gbps Ethernet (backbones, switches, routers): IEEE 802.3 WG
• Energy-efficient and power-conscious protocols
  – On-the-fly link speed reduction for under-utilized links
Tackling Communication Bottlenecks with IB and HSE
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
Network Bottleneck Alleviation: InfiniBand ("Infinite Bandwidth") and High-speed Ethernet (10/40/100 GE)
• Bit-serial differential signaling
  – Independent pairs of wires to transmit independent data (called a lane)
  – Scalable to any number of lanes
  – Easy to increase the clock speed of lanes (since each lane consists only of a pair of wires)
• Theoretically, no perceived limit on the bandwidth
Network Speed Acceleration with IB and HSE

  Ethernet (1979 -)              10 Mbit/sec
  Fast Ethernet (1993 -)         100 Mbit/sec
  Gigabit Ethernet (1995 -)      1000 Mbit/sec
  ATM (1995 -)                   155/622/1024 Mbit/sec
  Myrinet (1993 -)               1 Gbit/sec
  Fibre Channel (1994 -)         1 Gbit/sec
  InfiniBand (2001 -)            2 Gbit/sec (1X SDR)
  10-Gigabit Ethernet (2001 -)   10 Gbit/sec
  InfiniBand (2003 -)            8 Gbit/sec (4X SDR)
  InfiniBand (2005 -)            16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
  InfiniBand (2007 -)            32 Gbit/sec (4X QDR)
  40-Gigabit Ethernet (2010 -)   40 Gbit/sec
  InfiniBand (2011 -)            56 Gbit/sec (4X FDR)
  InfiniBand (2012 -)            100 Gbit/sec (4X EDR)

A 20x increase in the last 9 years.
InfiniBand Link Speed Standardization Roadmap

Per-lane and rounded per-link bandwidth (Gb/s), each direction:

  Lanes per direction | 4G-IB-DDR | 8G-IB-QDR | 14G-IB-FDR (14.025) | 26G-IB-EDR (25.78125)
  12                  | 48+48     | 96+96     | 168+168             | 300+300
  8                   | 32+32     | 64+64     | 112+112             | 200+200
  4                   | 16+16     | 32+32     | 56+56               | 100+100
  1                   | 4+4       | 8+8       | 14+14               | 25+25

SDR = Single Data Rate (not shown), DDR = Double Data Rate, QDR = Quad Data Rate, FDR = Fourteen Data Rate, EDR = Enhanced Data Rate, HDR = High Data Rate, NDR = Next Data Rate

[Figure: roadmap chart of bandwidth per direction (Gbps) vs. year (2005-2015) for x1, x4, x8 and x12 links, from 16G/32G/48G-IB DDR and 8G/32G/96G-IB QDR through 14G/56G/112G/168G-IB FDR and 25G/100G/200G/300G-IB EDR, with HDR and NDR beyond]
Tackling Communication Bottlenecks with IB and HSE
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
Capabilities of High-Performance Networks
• Intelligent network interface cards
• Support entire protocol processing completely in hardware (hardware protocol offload engines)
• Provide a rich communication interface to applications
  – User-level communication capability
  – Gets rid of intermediate data buffering requirements
• No software signaling between communication layers
  – All layers are implemented on a dedicated hardware unit, not on a shared host CPU
Previous High-Performance Network Stacks
• Fast Messages (FM)
  – Developed by UIUC
• Myricom GM
  – Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance communication requirements
  – Hardware-offloaded protocol stack
  – Support for fast and secure user-level access to the protocol stack
• Virtual Interface Architecture (VIA)
  – Standardized by Intel, Compaq, Microsoft
  – Precursor to IB
IB Hardware Acceleration
• Some IB models have multiple hardware accelerators
  – E.g., Mellanox IB adapters
• Protocol offload engines
  – Completely implement ISO/OSI layers 2-4 (link layer, network layer and transport layer) in hardware
• Additional hardware-supported features are also present
  – RDMA, multicast, QoS, fault tolerance, and many more
Ethernet Hardware Acceleration
• Interrupt coalescing
  – Improves throughput, but degrades latency
• Jumbo frames
  – No latency impact; incompatible with existing switches
• Hardware checksum engines
  – Checksums performed in hardware are significantly faster
  – Shown to have minimal benefit independently
• Segmentation offload engines (a.k.a. virtual MTU)
  – The host processor "thinks" that the adapter supports large jumbo frames, but the adapter splits them into regular-sized (1500-byte) frames
  – Supported by most HSE products because of its backward compatibility; considered "regular" Ethernet
  – Heavily used in the "server-on-steroids" model
    • High-performance servers connected to regular clients
TOE and iWARP Accelerators
• TCP Offload Engines (TOE)
  – Hardware acceleration for the entire TCP/IP stack
  – Initially patented by Tehuti Networks
  – Strictly refers to the IC on the network adapter that implements TCP/IP
  – In practice, usually refers to the entire network adapter
• Internet Wide-Area RDMA Protocol (iWARP)
  – Standardized by IETF and the RDMA Consortium
  – Supports acceleration features (like IB) for Ethernet
• http://www.ietf.org & http://www.rdmaconsortium.org
Converged (Enhanced) Ethernet (CEE or CE)
• Also known as "Datacenter Ethernet" or "Lossless Ethernet"
  – Combines a number of optional Ethernet standards into one umbrella as mandatory requirements
• Sample enhancements include:
  – Priority-based Flow Control: link-level flow control for each Class of Service (CoS)
  – Enhanced Transmission Selection (ETS): bandwidth assignment to each CoS
  – Datacenter Bridging Exchange protocol (DCBX): negotiation of congestion notification and priority classes
  – End-to-end congestion notification: per-flow congestion control to supplement per-link flow control
Tackling Communication Bottlenecks with IB and HSE
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
Interplay with I/O Technologies
• InfiniBand was initially intended to replace I/O bus technologies with networking-like technology
  – That is, bit-serial differential signaling
  – With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now
• Both IB and HSE today come as network adapters that plug into existing I/O technologies
Trends in I/O Interfaces with Servers
• Recent trends in I/O interfaces show that they are nearly matching network speeds head to head (though they still lag a little bit)

  PCI                                1990                                                  33 MHz/32 bit: 1.05 Gbps (shared bidirectional)
  PCI-X                              1998 (v1.0), 2003 (v2.0)                              133 MHz/64 bit: 8.5 Gbps; 266-533 MHz/64 bit: 17 Gbps (shared bidirectional)
  AMD HyperTransport (HT)            2001 (v1.0), 2004 (v2.0), 2006 (v3.0), 2008 (v3.1)    102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1) (32 lanes)
  PCI-Express (PCIe) by Intel        2003 (Gen1), 2007 (Gen2), 2009 (Gen3 standard)        Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps); Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps); Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps)
  Intel QuickPath Interconnect (QPI) 2009                                                  153.6-204.8 Gbps (20 lanes)
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
IB, HSE and their Convergence
• InfiniBand
  – Architecture and Basic Hardware Components
  – Communication Model and Semantics
  – Novel Features
  – Subnet Management and Services
• High-speed Ethernet Family
  – Internet Wide-Area RDMA Protocol (iWARP)
  – Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
  – Virtual Protocol Interconnect (VPI)
  – (InfiniBand) RDMA over Ethernet (RoE)
  – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
Comparing InfiniBand with the Traditional Networking Stack

  Layer             Traditional Ethernet                   InfiniBand
  Application       HTTP, FTP, MPI, File Systems           MPI, PGAS, File Systems
  Interface         Sockets interface                      OpenFabrics Verbs
  Transport         TCP, UDP                               RC (reliable), UD (unreliable)
  Network           Routing; DNS management tools          Routing; OpenSM (management tool)
  Link              Flow-control and error detection       Flow-control, error detection
  Physical          Copper, optical or wireless            Copper or optical
IB Overview
• InfiniBand
  – Architecture and Basic Hardware Components
  – Communication Model and Semantics
    • Communication model
    • Memory registration and protection
    • Channel and memory semantics
  – Novel Features
    • Hardware protocol offload
      – Link, network and transport layer features
  – Subnet Management and Services
Components: Channel Adapters
• Used by processing and I/O units to connect to the fabric
• Consume and generate IB packets
• Programmable DMA engines with protection features
• May have multiple ports
  – Independent buffering channeled through Virtual Lanes
• Host Channel Adapters (HCAs)
[Figure: a channel adapter containing QPs, a DMA engine, memory, transport/SMA/MTP units, and multiple ports, each port with its own set of Virtual Lanes]
Components: Switches and Routers
• Relay packets from one link to another
• Switches: intra-subnet
• Routers: inter-subnet
• May support multicast
[Figure: a switch relays packets between ports (packet relay over Virtual Lanes); a router additionally performs GRH-based packet relay between subnets]
Components: Links & Repeaters
• Network links
  – Copper, optical, or printed-circuit wiring on a backplane
  – Not directly addressable
• Traditional adapters are built for copper cabling
  – Restricted by cable length (signal integrity)
  – For example, QDR copper cables are restricted to 7 m
• Intel Connects: optical cables with copper-to-optical conversion hubs (acquired by Emcore)
  – Up to 100 m length
  – 550 picoseconds copper-to-optical conversion latency
  – Available from other vendors (Luxtera)
(Courtesy Intel)
• Repeaters (Vol. 2 of the InfiniBand specification)
IB Overview
• InfiniBand
  – Architecture and Basic Hardware Components
  – Communication Model and Semantics
    • Communication model
    • Memory registration and protection
    • Channel and memory semantics
  – Novel Features
    • Hardware protocol offload
      – Link, network and transport layer features
  – Subnet Management and Services
IB Communication Model
[Figure: basic InfiniBand communication semantics]
Queue Pair Model
• Each QP has two queues
  – Send Queue (SQ)
  – Receive Queue (RQ)
  – Work requests are queued to the QP (WQEs, pronounced "wookies")
• Each QP is linked to a Completion Queue (CQ)
  – Gives notification of operation completion from QPs
  – Completed WQEs are placed in the CQ with additional information (CQEs, pronounced "cookies")
[Figure: an InfiniBand device with a QP (send and receive queues holding WQEs) and an associated CQ holding CQEs]
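To make the QP/CQ relationship concrete, here is a minimal libibverbs sketch that creates a completion queue and a reliably connected (RC) queue pair tied to it; device open, protection-domain allocation, connection establishment and error handling are assumed to happen elsewhere, and the queue depths are arbitrary.

```c
#include <infiniband/verbs.h>

/* Returns an RC queue pair whose send and receive completions are
 * both reported through the completion queue returned in *cq_out. */
struct ibv_qp *create_rc_qp(struct ibv_context *ctx,
                            struct ibv_pd *pd,
                            struct ibv_cq **cq_out)
{
    /* One CQ can serve both the send and receive sides of a QP. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128 /* CQE capacity */,
                                      NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 64,   /* max outstanding send WQEs    */
            .max_recv_wr  = 64,   /* max outstanding receive WQEs */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,    /* Reliable Connection transport */
    };

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        ibv_destroy_cq(cq);
        return NULL;
    }
    *cq_out = cq;
    return qp;
}
```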
Memory Registration
Before we do any communication, all memory used for communication must be registered:
1. Registration request
   – The process sends the virtual address and length to the kernel
2. The kernel handles the virtual-to-physical mapping and pins the region into physical memory
   – A process cannot map memory that it does not own (security!)
3. The HCA caches the virtual-to-physical mapping and issues a handle
   – Includes an l_key and r_key
4. The handle is returned to the application
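As a sketch of what registration looks like through the OpenFabrics verbs API, the fragment below registers a buffer with ibv_reg_mr and exposes the resulting lkey/rkey; the protection domain is assumed to exist already, and the access flags shown are just one common combination.

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (4 * 1024 * 1024)   /* 4 MB communication buffer */

/* Registers a buffer with the HCA and returns the memory region whose
 * keys later operations must present (lkey locally, rkey to peers). */
struct ibv_mr *register_buffer(struct ibv_pd *pd, void **buf_out)
{
    void *buf = malloc(BUF_SIZE);
    if (!buf)
        return NULL;

    /* The kernel pins the pages and the HCA caches the
     * virtual-to-physical mapping for this region. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }

    /* mr->lkey authorizes local access; mr->rkey is what a remote
     * peer must supply to RDMA into or out of this buffer. */
    *buf_out = buf;
    return mr;
}
```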
Memory Protection
For security, keys are required for all operations that touch buffers:
• To send or receive data, the l_key must be provided to the HCA
  – The HCA verifies access to the local memory
• For RDMA, the initiator must have the r_key for the remote virtual address
  – Possibly exchanged with a send/recv
  – The r_key is not encrypted in IB
[Figure: the process passes the l_key to the HCA through the kernel; the r_key is needed for RDMA operations]
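Applications typically exchange the registered buffer's address and r_key out of band before issuing RDMA operations; a minimal sketch over an already-connected TCP socket is shown below. The struct and function names are illustrative (not from any standard API), and endianness/partial-write handling is omitted.

```c
#include <stdint.h>
#include <unistd.h>
#include <infiniband/verbs.h>

/* Ad-hoc wire format for advertising a registered buffer to the peer. */
struct rdma_dest {
    uint64_t addr;   /* virtual address of the registered buffer */
    uint32_t rkey;   /* remote key the peer must present */
};

/* Send our buffer's address and rkey over an existing TCP socket;
 * the peer plugs these values into its RDMA work requests. */
int advertise_buffer(int sockfd, void *buf, struct ibv_mr *mr)
{
    struct rdma_dest d = {
        .addr = (uint64_t)(uintptr_t)buf,
        .rkey = mr->rkey,
    };
    return write(sockfd, &d, sizeof(d)) == (ssize_t)sizeof(d) ? 0 : -1;
}
```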
Communication in the Channel Semantics (Send/Receive Model)
• The processor is involved only to:
  1. Post a receive WQE
  2. Post a send WQE
  3. Pull completed CQEs from the CQ
• The send WQE contains information about the send buffer (multiple non-contiguous segments)
• The receive WQE contains information about the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data
[Figure: sender and receiver processors post WQEs to their QPs; the InfiniBand devices move the memory segments directly and exchange a hardware ACK]
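The sketch below, again using libibverbs, shows the three steps listed above: the receiver posts a receive WQE, the sender posts a send WQE, and each side polls its CQ for the completion. QP connection setup is assumed to have been done elsewhere, and buffers are assumed to be registered as shown earlier.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Receiver side: post a receive WQE describing where an incoming
 * message may be placed. Must happen before the sender's send. */
int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);
}

/* Sender side: post a send WQE describing the outgoing buffer. */
int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = (uint32_t)len, .lkey = mr->lkey
    };
    struct ibv_send_wr wr = {
        .wr_id      = 2, .sg_list = &sge, .num_sge = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,   /* generate a CQE on completion */
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}

/* Either side: busy-poll the CQ until one completion arrives. */
int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```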
Communication in the Memory Semantics (RDMA Model)
• The initiator processor is involved only to:
  1. Post a send WQE
  2. Pull the completed CQE from the send CQ
• No involvement from the target processor
• The send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
[Figure: the initiator's InfiniBand device accesses the target's memory segment directly; the target processor is not involved beyond the hardware ACK]
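A minimal RDMA-write sketch in the same style follows: the initiator supplies both its local buffer (with lkey) and the target's virtual address and rkey, which are assumed to have been exchanged earlier (for example with a send/recv, as noted on the memory-protection slide).

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* One-sided RDMA write: copy 'len' bytes from the local registered
 * buffer into the remote buffer at (remote_addr, rkey). The remote
 * CPU is not involved; only the initiator sees a completion. */
int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 3,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,   /* IBV_WR_RDMA_READ for reads */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma = {
            .remote_addr = remote_addr,    /* target virtual address */
            .rkey        = rkey,           /* key advertised by the target */
        },
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}
```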
Communication in the Memory Semantics (Atomics)
• The initiator processor is involved only to:
  1. Post a send WQE
  2. Pull the completed CQE from the send CQ
• No involvement from the target processor
• The send WQE contains information about the send buffer (single 64-bit segment) and the receive buffer (single 64-bit segment)
• IB supports compare-and-swap and fetch-and-add atomic operations
[Figure: the initiator's InfiniBand device performs the atomic operation on the destination memory segment and returns the original value to the source segment]
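The fetch-and-add case can be posted as sketched below; the remote 64-bit location and rkey are assumed to have been exchanged beforehand, and the pre-operation value is deposited into the local 8-byte buffer when the completion is reported.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Atomically add 'add_val' to the 64-bit word at (remote_addr, rkey);
 * the old value is written into 'old_val_buf' (registered, 8 bytes,
 * 8-byte aligned). Compare-and-swap instead uses opcode
 * IBV_WR_ATOMIC_CMP_AND_SWP with .compare_add and .swap. */
int post_fetch_add(struct ibv_qp *qp, struct ibv_mr *local_mr,
                   uint64_t *old_val_buf, uint64_t add_val,
                   uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)old_val_buf,
        .length = sizeof(uint64_t),
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 4,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.atomic = {
            .remote_addr = remote_addr,   /* must be 8-byte aligned */
            .compare_add = add_val,       /* value to add */
            .rkey        = rkey,
        },
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);
}
```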
IB Overview
• InfiniBand
  – Architecture and Basic Hardware Components
  – Communication Model and Semantics
    • Communication model
    • Memory registration and protection
    • Channel and memory semantics
  – Novel Features
    • Hardware protocol offload
      – Link, network and transport layer features
  – Subnet Management and Services
Hardware Protocol Offload
[Figure: the IB link, network and transport layers; complete hardware implementations exist]
Link/Network Layer Capabilities
• Buffering and Flow Control
• Virtual Lanes, Service Levels and QoS
• Switching and Multicast
Buffering and Flow Control
• IB provides three levels of communication throttling/control mechanisms
  – Link-level flow control (link layer feature)
  – Message-level flow control (transport layer feature, discussed later)
  – Congestion control (part of the link layer features)
• IB provides absolute credit-based flow control
  – The receiver guarantees that enough space is allotted for N blocks of data
  – Occasional update of available credits by the receiver
• Credits relate only to the total amount of data being sent, not to the number of messages
  – One 1 MB message is equivalent to 1024 1 KB messages (except for rounding off at message boundaries)
Virtual Lanes
• Multiple virtual links within the same physical link
  – Between 2 and 16
• Separate buffers and flow control
  – Avoids head-of-line blocking
• VL15: reserved for management
• Each port supports one or more data VLs
Service Levels and QoS
• Service Level (SL):
  – Packets may operate at one of 16 different SLs
  – Meaning not defined by IB
• SL-to-VL mapping:
  – The SL determines which VL on the next link is to be used
  – Each port (switches, routers, end nodes) has an SL-to-VL mapping table configured by the subnet manager
• Partitions:
  – Fabric administration (through the Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows
Traffic Segregation Benefits
[Figure: one InfiniBand fabric whose Virtual Lanes carry IPC/load-balancing/web-cache/ASP traffic, IP network traffic (routers, switches, VPNs, DSLAMs) and storage area network traffic (RAID, NAS, backup) for the same servers]
(Courtesy: Mellanox Technologies)
• InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link
• This provides the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks
Switching (Layer-2 Routing) and Multicast
• Each port has one or more associated LIDs (Local Identifiers)
  – Switches look up which port to forward a packet to based on its destination LID (DLID)
  – This information is maintained at the switch
• For multicast packets, the switch needs to maintain multiple output ports to forward the packet to
  – The packet is replicated on each appropriate output port
  – Ensures at-most-once delivery and loop-free forwarding
  – There is an interface for a group management protocol
    • Create, join/leave, prune, delete group
Switch Complex
• The basic unit of switching is a crossbar
  – Current InfiniBand products use either 24-port (DDR) or 36-port (QDR) crossbars
• Switches available in the market are typically collections of crossbars within a single cabinet
• Do not confuse "non-blocking switches" with "crossbars"
  – Crossbars provide all-to-all connectivity to all connected nodes
    • For any random node-pair selection, all communication is non-blocking
  – Non-blocking switches provide a fat tree of many crossbars
    • For any random node-pair selection, there exists a switch configuration such that communication is non-blocking
    • If the communication pattern changes, the same switch configuration might no longer provide fully non-blocking communication
IB Switching/Routing: An Example
[Figure: an example IB switch block diagram (Mellanox 144-port) built from spine and leaf crossbar blocks; a forwarding table maps DLID 2 to output port 1 and DLID 4 to output port 4]
• Switching: IB supports Virtual Cut-Through (VCT)
• Routing: unspecified by the IB spec
  – Up*/Down* and Shift are popular routing engines supported by OFED
• Someone has to set up the forwarding tables and give every port an LID
  – The "Subnet Manager" does this work
• Different routing algorithms give different paths
• Fat-tree is a popular topology for IB clusters
  – Different over-subscription ratios may be used
• Other topologies are also being used
  – 3D torus (Sandia Red Sky) and SGI Altix (hypercube)
More on Multipathing
• Similar to basic switching, except…
  – …the sender can utilize multiple LIDs associated with the same destination port
    • Packets sent to one DLID take a fixed path
    • Different packets can be sent using different DLIDs
    • Each DLID can have a different path (the switch can be configured differently for each DLID)
• Can cause out-of-order arrival of packets
  – IB uses a simplistic approach:
    • If packets in one connection arrive out of order, they are dropped
  – It is easier to use different DLIDs for different connections
    • This is what most high-level libraries using IB do!
IB Multicast Example
Hardware Protocol Offload
[Figure: the IB link, network and transport layers; complete hardware implementations exist]
IB Transport Services

  Service Type            Connection Oriented   Acknowledged   Transport
  Reliable Connection     Yes                   Yes            IBA
  Unreliable Connection   Yes                   No             IBA
  Reliable Datagram       No                    Yes            IBA
  Unreliable Datagram     No                    No             IBA
  Raw Datagram            No                    No             Raw

• Each transport service can have zero or more QPs associated with it
  – E.g., you can have four QPs based on RC and one QP based on UD
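In the verbs API these services map onto QP types chosen at creation time; the tiny helper below (the function is mine, purely illustrative) shows the enum values involved. Reliable Datagram is defined by the IB specification but is generally not exposed through libibverbs or implemented by common adapters.

```c
#include <infiniband/verbs.h>

/* Pick the libibverbs QP type for the desired transport service.
 * RC = reliable connected, UC = unreliable connected,
 * UD = unreliable datagram. */
enum ibv_qp_type qp_type_for_service(int reliable, int connected)
{
    if (connected)
        return reliable ? IBV_QPT_RC : IBV_QPT_UC;
    return IBV_QPT_UD;   /* unreliable datagram */
}
```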
Trade-offs in Different Transport Types
• Scalability (M processes per node, N nodes), in QPs per HCA:
  – Reliable Connection: M²N;  Reliable Datagram: M;  eXtended Reliable Connection: MN;  Unreliable Connection: M²N;  Unreliable Datagram: M;  Raw Datagram: 1
• Reliability:
  – Corrupt data detected: yes, for all transport types
  – Data delivery guarantee: data delivered exactly once for the reliable transports; no guarantees for the unreliable ones
  – Data order guarantees: per connection for the connected transports; one source to multiple destinations for Reliable Datagram; unordered with duplicate data detected for Unreliable Datagram; none for Raw Datagram
  – Data loss detected: yes for the reliable transports; no for the unreliable ones
  – Error recovery: for the reliable transports, errors (retransmissions, alternate path, etc.) are handled by the transport layer and the client is only involved in handling fatal errors (broken links, protection violations, etc.); for Unreliable Connection, packets with errors and sequence errors are reported to the responder; none for Unreliable Datagram and Raw Datagram
Transport Layer Capabilities
• Data Segmentation
• Transaction Ordering
• Message-level Flow Control
• Static Rate Control and Auto-negotiation
Data Segmentation
• The IB transport layer provides a message-level communication granularity, not byte-level (unlike TCP)
• The application can hand over a large message
  – The network adapter segments it into MTU-sized packets
  – A single notification is raised when the entire message is transmitted or received (not per packet)
• Reduced host overhead to send/receive messages
  – Depends on the number of messages, not the number of bytes
Transaction Ordering
• IB follows strong transaction ordering for RC
• The sender network adapter transmits messages in the order in which the WQEs were posted
• Each QP utilizes a single LID
  – All WQEs posted on the same QP take the same path
  – All packets are received by the receiver in the same order
  – All receive WQEs are completed in the order in which they were posted
Message-level Flow Control
• Also called end-to-end flow control
  – Does not depend on the number of network hops
• Separate from link-level flow control
  – Link-level flow control depends only on the number of bytes being transmitted, not the number of messages
  – Message-level flow control depends only on the number of messages transferred, not the number of bytes
• If 5 receive WQEs are posted, the sender can send 5 messages (can post 5 send WQEs)
  – If the sent messages are larger than the posted receive buffers, flow control cannot handle it
Static Rate Control and Auto-Negotiation
• IB allows link rates to be statically changed
  – On a 4X link, we can set data to be sent at the 1X rate
  – For heterogeneous links, the rate can be set to the lowest link rate
  – Useful for low-priority traffic
• Auto-negotiation is also available
  – E.g., if you connect a 4X adapter to a 1X switch, data is automatically sent at the 1X rate
• Only fixed settings are available
  – Cannot set the rate requirement to 3.16 Gbps, for example
IB Overview
• InfiniBand
  – Architecture and Basic Hardware Components
  – Communication Model and Semantics
    • Communication model
    • Memory registration and protection
    • Channel and memory semantics
  – Novel Features
    • Hardware protocol offload
      – Link, network and transport layer features
  – Subnet Management and Services
Concepts in IB Management
• Agents
  – Processes or hardware units running on each adapter, switch, router (everything on the network)
  – Provide the capability to query and set parameters
• Managers
  – Make high-level decisions and implement them on the network fabric using the agents
• Messaging schemes
  – Used for interactions between the manager and agents (or between agents)
• Messages
Subnet Manager
[Figure: the subnet manager discovers the fabric, activates inactive links, and handles multicast setup and multicast join requests from compute nodes through the switches]
IB, HSE and their Convergence
• InfiniBand
  – Architecture and Basic Hardware Components
  – Communication Model and Semantics
  – Novel Features
  – Subnet Management and Services
• High-speed Ethernet Family
  – Internet Wide-Area RDMA Protocol (iWARP)
  – Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
  – Virtual Protocol Interconnect (VPI)
  – (InfiniBand) RDMA over Ethernet (RoE)
  – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
HSE Overview
• High-speed Ethernet Family
  – Internet Wide-Area RDMA Protocol (iWARP)
    • Architecture and Components
    • Features
      – Out-of-order data placement
      – Dynamic and fine-grained data rate control
      – Multipathing using VLANs
    • Existing implementations of HSE/iWARP
  – Alternate vendor-specific stacks
    • MX over Ethernet (for Myricom 10GE adapters)
    • Datagram Bypass Layer (for Myricom 10GE adapters)
    • Solarflare OpenOnload (for Solarflare 10GE adapters)
IB and HSE RDMA Models: Commonalities and Differences

  Feature                 IB                          iWARP/HSE
  Hardware Acceleration   Supported                   Supported
  RDMA                    Supported                   Supported
  Atomic Operations       Supported                   Not supported
  Multicast               Supported                   Supported
  Congestion Control      Supported                   Supported
  Data Placement          Ordered                     Out-of-order
  Data Rate Control       Static and coarse-grained   Dynamic and fine-grained
  QoS                     Prioritization              Prioritization and fixed-bandwidth QoS
  Multipathing            Using DLIDs                 Using VLANs
iWARP Architecture and Components
• RDMA Protocol (RDMAP)
  – Feature-rich interface
  – Security management
• Remote Direct Data Placement (RDDP)
  – Data placement and delivery
  – Multi-stream semantics
  – Connection management
• Marker PDU Aligned (MPA)
  – Middle-box fragmentation
  – Data integrity (CRC)
[Figure: iWARP offload engines on the network adapter (e.g., 10GigE) implement RDMAP, RDDP and MPA over SCTP/TCP and IP, below the application or library in user space]
(Courtesy iWARP Specification)
HSE Overview
• High-speed Ethernet Family
  – Internet Wide-Area RDMA Protocol (iWARP)
    • Architecture and Components
    • Features
      – Out-of-order data placement
      – Dynamic and fine-grained data rate control
    • Existing implementations of HSE/iWARP
  – Alternate vendor-specific stacks
    • MX over Ethernet (for Myricom 10GE adapters)
    • Datagram Bypass Layer (for Myricom 10GE adapters)
    • Solarflare OpenOnload (for Solarflare 10GE adapters)
Decoupled Data Placement and Data Delivery
• Place data as it arrives, whether in or out of order
• If data is out of order, place it at the appropriate offset
• Issues from the application's perspective:
  – The second half of the message having been placed does not mean that the first half has arrived as well
  – If one message has been placed, it does not mean that the previous messages have been placed
• Issues from the protocol stack's perspective:
  – The receiver network stack has to understand each frame of data
    • If the frame is unchanged during transmission, this is easy!
  – The MPA protocol layer adds appropriate information at regular intervals to allow the receiver to identify fragmented frames
Dynamic and Fine-grained Rate Control
• Part of the Ethernet standard, not iWARP
  – Network vendors use a separate interface to support it
• Dynamic bandwidth allocation to flows based on the interval between two packets in a flow
  – E.g., one stall for every packet sent on a 10 Gbps network corresponds to a bandwidth allocation of 5 Gbps
  – Complicated because of TCP windowing behavior
• Important for high-latency/high-bandwidth networks
  – Large windows exposed on the receiver side
  – Receiver overflow controlled through rate control
Prioritization and Fixed Bandwidth QoS
• Can allow for simple prioritization:
  – E.g., connection 1 performs better than connection 2
  – 8 classes provided (a connection can be in any class)
    • Similar to SLs in InfiniBand
  – Two priority classes for high-priority traffic
    • E.g., management traffic or your favorite application
• Or can allow for specific bandwidth requests:
  – E.g., can request 3.62 Gbps of bandwidth
  – Packet pacing and stalls are used to achieve this
• Query functionality to find out the "remaining bandwidth"
HSE Overview
• High-speed Ethernet Family
  – Internet Wide-Area RDMA Protocol (iWARP)
    • Architecture and Components
    • Features
      – Out-of-order data placement
      – Dynamic and fine-grained data rate control
    • Existing implementations of HSE/iWARP
  – Alternate vendor-specific stacks
    • MX over Ethernet (for Myricom 10GE adapters)
    • Datagram Bypass Layer (for Myricom 10GE adapters)
    • Solarflare OpenOnload (for Solarflare 10GE adapters)
Software iWARP-based Compatibility
[Figure: a distributed cluster environment where a regular Ethernet cluster, a TOE cluster and an iWARP cluster are connected over a wide-area network]
• Regular Ethernet adapters and TOEs are fully compatible
• Compatibility with iWARP is required
• Software iWARP emulates the functionality of iWARP on the host
  – Fully compatible with hardware iWARP
  – Internally utilizes the host TCP/IP stack
Different iWARP Implementations
[Figure: protocol stacks compared side by side, from software iWARP layered on the host TCP/IP stack with regular Ethernet adapters, through user-level and kernel-level iWARP over high-performance sockets and TCP offload engines (with a TCP stack modified for MPA), to fully offloaded iWARP/TCP/IP on iWARP-compliant adapters; implementations from OSU, OSC, IBM, ANL, Chelsio and NetEffect (Intel)]
HSE Overview
• High-speed Ethernet Family
  – Internet Wide-Area RDMA Protocol (iWARP)
    • Architecture and Components
    • Features
      – Out-of-order data placement
      – Dynamic and fine-grained data rate control
      – Multipathing using VLANs
    • Existing implementations of HSE/iWARP
  – Alternate vendor-specific stacks
    • MX over Ethernet (for Myricom 10GE adapters)
    • Datagram Bypass Layer (for Myricom 10GE adapters)
    • Solarflare OpenOnload (for Solarflare 10GE adapters)
Myrinet Express (MX)
• Proprietary communication layer developed by Myricom for their Myrinet adapters
  – Third-generation communication layer (after FM and GM)
  – Supports Myrinet-2000 and the newer Myri-10G adapters
• Low-level "MPI-like" messaging layer
  – Almost one-to-one match with MPI semantics (including connectionless model, implicit memory registration and tag matching)
  – Later versions added more advanced communication methods such as RDMA to support other programming models, e.g., ARMCI (the low-level runtime for the Global Arrays PGAS library)
• Open-MX
  – New open-source implementation of the MX interface for non-Myrinet adapters, from INRIA, France
Datagram Bypass Layer (DBL)
• Another proprietary communication layer developed by Myricom
  – Compatible with regular UDP sockets ("embrace and extend")
  – The idea is to bypass the kernel stack and give UDP applications direct access to the network adapter
    • High performance and low jitter
• Primary motivation: financial market applications (e.g., stock trading)
  – Applications prefer unreliable communication
  – Timeliness is more important than reliability
• This stack is covered by NDA; more details can be requested from Myricom
Solarflare Communications: OpenOnload Stack
• The HPC networking stack provides many performance benefits, but has limitations for certain types of scenarios, especially where applications tend to fork(), exec() and need asynchronous advancement (per application)
• Solarflare approach:
  – Network hardware provides a user-safe interface to route packets directly to apps based on flow information in headers
  – Protocol processing can happen in both kernel and user space
  – Protocol state is shared between the app and the kernel using shared memory
[Figure: the typical commodity networking stack vs. the typical HPC networking stack vs. the Solarflare approach to the networking stack]
Courtesy Solarflare Communications (www.openonload.org/openonload-google-talk.pdf)
IB, HSE and their Convergence
• InfiniBand
  – Architecture and Basic Hardware Components
  – Communication Model and Semantics
  – Novel Features
  – Subnet Management and Services
• High-speed Ethernet Family
  – Internet Wide-Area RDMA Protocol (iWARP)
  – Alternate vendor-specific protocol stacks
• InfiniBand/Ethernet Convergence Technologies
  – Virtual Protocol Interconnect (VPI)
  – (InfiniBand) RDMA over Ethernet (RoE)
  – (InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
Virtual Protocol Interconnect (VPI)
• Single network firmware to support both IB and Ethernet
[Figure: applications use either IB verbs (over the IB transport, network and link layers) or sockets (over TCP/IP, with TCP/IP support in hardware) on a single adapter with an IB port and an Ethernet port]
• Autosensing of the layer-2 protocol
  – Can be configured to automatically work with either IB or Ethernet networks
• Multi-port adapters can use one port on IB and another on Ethernet
• Multiple use modes:
  – Datacenters with IB inside the cluster and Ethernet outside
  – Clusters with an IB data network and Ethernet management
(InfiniBand) RDMA over Ethernet (IBoE or RoE)
• Native convergence of the IB network and transport layers with the Ethernet link layer
  – IB packets encapsulated in Ethernet frames
  – The IB network layer already uses IPv6 frames
[Figure: application and IB verbs over the IB transport and network layers, running on an Ethernet link layer in hardware]
• Pros:
  – Works natively in Ethernet environments (the entire Ethernet management ecosystem is available)
  – Has all the benefits of IB verbs
• Cons:
  – Network bandwidth might be limited by the Ethernet switches: 10GE switches are available, 40GE is yet to arrive, while 32 Gbps IB is available
  – Some IB native link-layer features are optional in (regular) Ethernet
• Approved by the OFA board to be included in OFED
(InfiniBand) RDMA over Converged (Enhanced) Ethernet (RoCE)
• Very similar to IB over Ethernet
  – Often used interchangeably with IBoE
  – Can be used to explicitly specify that the link layer is Converged (Enhanced) Ethernet (CE)
[Figure: application and IB verbs over the IB transport and network layers, running on a Converged (Enhanced) Ethernet link layer in hardware]
• Pros:
  – Works natively in Ethernet environments (the entire Ethernet management ecosystem is available)
  – Has all the benefits of IB verbs
  – CE is very similar to the link layer of native IB, so there are no missing features
• Cons:
  – Network bandwidth might be limited by the Ethernet switches: 10GE switches are available, 40GE is yet to arrive, while 32 Gbps IB is available
IB and HSE: Feature Comparison

  Feature                  IB                  iWARP/HSE       RoE                 RoCE
  Hardware Acceleration    Yes                 Yes             Yes                 Yes
  RDMA                     Yes                 Yes             Yes                 Yes
  Congestion Control       Yes                 Optional        Optional            Yes
  Multipathing             Yes                 Yes             Yes                 Yes
  Atomic Operations        Yes                 No              Yes                 Yes
  Multicast                Optional            No              Optional            Optional
  Data Placement           Ordered             Out-of-order    Ordered             Ordered
  Prioritization           Optional            Optional        Optional            Yes
  Fixed BW QoS (ETS)       No                  Optional        Optional            Yes
  Ethernet Compatibility   No                  Yes             Yes                 Yes
  TCP/IP Compatibility     Yes (using IPoIB)   Yes             Yes (using IPoIB)   Yes (using IPoIB)
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
IB Hardware Products
• Many IB vendors: Mellanox (+Voltaire) and QLogic
  – Aligned with many server vendors: Intel, IBM, SUN, Dell
  – And many integrators: Appro, Advanced Clustering, Microway
• Broadly two kinds of adapters
  – Offloading (Mellanox) and onloading (QLogic)
• Adapters with different interfaces:
  – Dual-port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT
• MemFree adapters
  – No memory on the HCA; uses system memory (through PCIe)
  – Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)
• Different speeds
  – SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)
  – Some 12X SDR adapters exist as well (24 Gbps each way)
• The new ConnectX-2 adapter from Mellanox supports offload for collectives (Barrier, Broadcast, etc.)
Tyan Thunder S2935 Board
[Figure: Tyan Thunder S2935 motherboard with LAN-on-motherboard (LOM) InfiniBand]
(Courtesy Tyan)
Similar boards from Supermicro with LOM features are also available.
IB Hardware Products (contd.)
• Customized adapters to work with IB switches
  – Cray XD1 (formerly by OctigaBay), Cray CX1
• Switches:
  – 4X SDR and DDR (8-288 ports); 12X SDR (small sizes)
  – 3456-port "Magnum" switch from SUN used at TACC
    • 72-port "nano magnum"
  – 36-port Mellanox InfiniScale IV QDR switch silicon in 2008
    • Up to 648-port QDR switches by Mellanox and SUN
    • Some internal ports are 96 Gbps (12X QDR)
  – IB switch silicon from QLogic introduced at SC '08
    • Up to 846-port QDR switch by QLogic
  – New FDR (56 Gbps) switch silicon (Bridge-X) announced by Mellanox in May '11
• Switch routers with gateways
  – IB-to-FC; IB-to-IP
10G, 40G and 100G Ethernet Products
• 10GE adapters: Intel, Myricom, Mellanox (ConnectX)
• 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel)
• 40GE adapters: Mellanox ConnectX2-EN 40G
• 10GE switches
  – Fulcrum Microsystems
    • Low-latency switch based on 24-port silicon
    • FM4000 switch with IP routing and TCP/UDP support
  – Fujitsu, Myricom (512 ports), Force10, Cisco, Arista (formerly Arastra)
• 40GE and 100GE switches
  – Nortel Networks
    • 10GE downlinks with 40GE and 100GE uplinks
  – Broadcom announced a 40GE switch in early 2010
Products Providing IB and HSE Convergence
• Mellanox ConnectX adapter
  – Supports IB and HSE convergence
  – Ports can be configured to support IB or HSE
• Support for VPI and RoCE
  – 8 Gbps (SDR), 16 Gbps (DDR) and 32 Gbps (QDR) rates available for IB
  – 10GE rate available for RoCE
  – 40GE rate for RoCE is expected to be available in the near future
Software Convergence with OpenFabrics
• Open-source organization (formerly OpenIB)
  – www.openfabrics.org
• Incorporates both IB and iWARP in a unified manner
  – Support for Linux and Windows
  – Design of a complete stack with "best of breed" components
    • Gen1
    • Gen2 (current focus)
• Users can download the entire stack and run it
  – The latest release is OFED 1.5.3
  – OFED 1.6 is underway
OpenFabrics Stack with Unified Verbs Interface
[Figure: the verbs interface (libibverbs) sits above vendor user-level libraries (Mellanox libmthca, QLogic libipathverbs, IBM libehca, Chelsio libcxgb3) and the corresponding kernel drivers (ib_mthca, ib_ipath, ib_ehca, ib_cxgb3) for Mellanox, QLogic, IBM and Chelsio adapters]
OpenFabrics on Convergent IB/HSE
• For IBoE and RoCE, the upper-level stacks remain completely unchanged
[Figure: the verbs interface (libibverbs) over the ConnectX user-level library (libmlx4) and kernel driver (ib_mlx4) for ConnectX adapters]
• Within the hardware:
  – The transport and network layers remain completely unchanged
  – Both IB and Ethernet (or CEE) link layers are supported on the network adapter
• Note: the OpenFabrics stack is not valid for the Ethernet path in VPI
  – That still uses sockets and TCP/IP
OpenFabrics Software Stack
[Figure: the OpenFabrics software stack, from applications and access methods (diagnostic tools, Open SM, IP-based application access, sockets-based access, various MPIs, block storage access, clustered DB access, access to file systems) through user-space APIs (user-level MAD API, UDAPL, SDP library, user-level verbs/API for InfiniBand and iWARP R-NICs with kernel bypass) and kernel-space upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), the mid-layer (connection manager abstraction, SA/MAD clients, SMA, connection managers, kernel verbs/API, kernel bypass), down to hardware-specific drivers and InfiniBand HCAs / iWARP R-NICs]

Key:
  SA     Subnet Administrator              IPoIB   IP over InfiniBand
  MAD    Management Datagram               SDP     Sockets Direct Protocol
  SMA    Subnet Manager Agent              SRP     SCSI RDMA Protocol (Initiator)
  PMA    Performance Manager Agent         iSER    iSCSI RDMA Protocol (Initiator)
  HCA    Host Channel Adapter              RDS     Reliable Datagram Service
  R-NIC  RDMA NIC                          UDAPL   User Direct Access Programming Lib
InfiniBand in the Top500
[Figure: the percentage share of InfiniBand in the Top500 is steadily increasing]
InfiniBand in the Top500 (Nov. 2010)
[Figure: pie charts of interconnect family share by performance and by number of systems (Gigabit Ethernet, InfiniBand, Proprietary, Myrinet, Quadrics, Mixed, NUMAlink, SP Switch, Cray Interconnect, Fat Tree, Custom); InfiniBand and Gigabit Ethernet account for the largest shares, each in the 21-45% range depending on the metric]
InfiniBand System Efficiency in the Top500 List
[Figure: computer cluster efficiency comparison; efficiency (%) vs. Top500 rank for IB-CPU, IB-GPU/Cell, GigE, 10GigE, IBM BlueGene and Cray systems]
Large-scale InfiniBand Installations
• 214 IB clusters (42.8%) in the Nov '10 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):
  – 120,640 cores (Nebulae) in China (3rd)
  – 73,278 cores (Tsubame 2.0) in Japan (4th)
  – 138,368 cores (Tera-100) in France (6th)
  – 122,400 cores (RoadRunner) at LANL (7th)
  – 81,920 cores (Pleiades) at NASA Ames (11th)
  – 42,440 cores (Red Sky) at Sandia (14th)
  – 62,976 cores (Ranger) at TACC (15th)
  – 35,360 cores (Lomonosov) in Russia (17th)
  – 15,120 cores (Loewe) in Germany (22nd)
  – 26,304 cores (Juropa) in Germany (23rd)
  – 26,232 cores (TachyonII) in South Korea (24th)
  – 23,040 cores (Jade) at GENCI (27th)
  – 33,120 cores (Mole-8.5) in China (28th)
• More are getting installed!
HSE Scientific Computing Installations
• HSE compute systems with rankings in the Nov 2010 Top500 list
  – 8,856-core installation at Purdue with ConnectX-EN 10GigE (#126)
  – 7,944-core installation at Purdue with 10GigE Chelsio/iWARP (#147)
  – 6,828-core installation in Germany (#166)
  – 6,144-core installation in Germany (#214)
  – 6,144-core installation in Germany (#215)
  – 7,040-core installation at the Amazon EC2 Cluster (#231)
  – 4,000-core installation at HHMI (#349)
• Other smaller clusters
  – 640-core installation at the University of Heidelberg, Germany
  – 512-core installation at Sandia National Laboratories (SNL) with Chelsio/iWARP and a Woven Systems switch
  – 256-core installation at Argonne National Lab with Myri-10G
• Integrated systems
  – BG/P uses 10GE for I/O (ranks 9, 13, 16, and 34 in the Top 50)
Other HSE Installations
• HSE has most of its popularity in enterprise computing and other non-scientific markets, including wide-area networking
• Example enterprise computing domains
  – Enterprise datacenters (HP, Intel)
  – Animation firms (e.g., Universal Studios ("The Hulk"), 20th Century Fox ("Avatar"), and many new movies using 10GE)
  – Amazon's HPC cloud offering uses 10GE internally
  – Heavily used in financial markets (users are typically undisclosed)
• Many network-attached storage devices come integrated with 10GE network adapters
• ESnet to install a $62M 100GE infrastructure for the US DOE
Presentation Overview
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations
• Sample Case Studies and Performance Numbers
• Conclusions and Final Q&A
Modern Interconnects and Protocols
[Figure: protocol/adapter combinations compared: 1/10 GigE sockets over the kernel TCP/IP stack and Ethernet driver, 10 GigE-TOE with TCP/IP offloaded to the adapter, IPoIB (TCP/IP over an InfiniBand adapter), SDP (sockets with RDMA offload over InfiniBand), and native IB verbs with RDMA offload, each with the corresponding Ethernet, 10 GigE or InfiniBand switch]
Case Studies
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)
• InfiniBand in WAN and GridFTP
• Cloud Computing: Hadoop and Memcached
Low-level Latency Measurements
[Figure: verbs-level latency (µs) vs. message size, for small and large messages, with native IB, VPI-IB, VPI-Eth and RoCE]
ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.
RoCE has a slight overhead compared to native IB because it operates at a slower clock rate (required to support only a 10 Gbps link for Ethernet, compared to a 32 Gbps link for IB).
Low-level Uni-directional Bandwidth Measurements
[Figure: verbs-level uni-directional bandwidth (MBps) vs. message size for native IB, VPI-IB, VPI-Eth and RoCE]
ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.
Case Studies
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)
• InfiniBand in WAN and GridFTP
• Cloud Computing: Hadoop and Memcached
MVAPICH/MVAPICH2 Software
• High-performance MPI library for IB and HSE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
  – Used by more than 1,550 organizations in 60 countries
  – More than 66,000 downloads from the OSU site directly
  – Empowering many Top500 clusters
    • 11th-ranked 81,920-core cluster (Pleiades) at NASA
    • 15th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, HSE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
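The latency and bandwidth numbers on the following slides come from ping-pong style microbenchmarks (such as the OSU micro-benchmarks); the sketch below is a simplified version of such a latency test, not the exact benchmark code, and the message size and iteration count are arbitrary. Run with two processes, e.g. `mpirun -np 2 ./latency`.

```c
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE 1        /* bytes per message (small-message latency) */
#define ITERS    10000

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    /* Rank 0 and rank 1 bounce a message back and forth. */
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("avg one-way latency: %.2f us\n",
               elapsed / ITERS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```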
One-way Latency: MPI over IB
[Figure: small- and large-message one-way latency (µs) vs. message size for MVAPICH over Qlogic-DDR, Qlogic-QDR-PCIe2, ConnectX-DDR and ConnectX-QDR-PCIe2; small-message latencies range from 1.54 to 2.17 µs across these configurations]
All numbers taken on 2.4 GHz quad-core (Nehalem) Intel with an IB switch.
Bandwidth: MPI over IB
[Figure: uni-directional and bi-directional bandwidth (million bytes/sec) vs. message size for MVAPICH over Qlogic-DDR, Qlogic-QDR-PCIe2, ConnectX-DDR and ConnectX-QDR-PCIe2; peak uni-directional bandwidths range from 1553.2 to 3023.7 MB/s and peak bi-directional bandwidths from 2990.1 to 5835.7 MB/s across these configurations]
All numbers taken on 2.4 GHz quad-core (Nehalem) Intel with an IB switch.
One-way Latency: MPI over iWARP
[Figure: one-way latency (µs) vs. message size for Chelsio and Intel-NetEffect adapters using TCP/IP and iWARP; small-message latencies of 6.25-6.88 µs with iWARP versus 15.47-25.43 µs with TCP/IP]
2.4 GHz quad-core Intel (Clovertown) with a 10GE (Fulcrum) switch.
Bandwidth: MPI over iWARP
[Figure: uni-directional and bi-directional bandwidth (million bytes/sec) vs. message size for Chelsio and Intel-NetEffect adapters using TCP/IP and iWARP; peak uni-directional bandwidths range from 373.3 to 1245.0 MB/s and peak bi-directional bandwidths from 647.1 to 2260.8 MB/s across these configurations]
2.33 GHz quad-core Intel (Clovertown) with a 10GE (Fulcrum) switch.
Convergent Technologies: MPI Latency
[Figure: small- and large-message MPI latency (µs) vs. message size for native IB, VPI-IB, VPI-Eth and RoCE; small-message latencies of 1.8, 2.2 and 3.6 µs for the verbs-based paths versus 37.65 µs for VPI-Eth]
ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.
Convergent Technologies: MPI Uni- and Bi-directional Bandwidth
[Figure: uni- and bi-directional bandwidth (MBps) vs. message size for native IB, VPI-IB, VPI-Eth and RoCE; peak uni-directional bandwidths of 404.1, 1085.5, 1115.7 and 1518.1 MBps and peak bi-directional bandwidths of 2114.9, 2825.6 and 2880.4 MBps are observed across these configurations]
ConnectX-DDR: 2.4 GHz quad-core (Nehalem) Intel with IB and 10GE switches.
Case Studies
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)
• InfiniBand in WAN and GridFTP
• Cloud Computing: Hadoop and Memcached
IPoIB vs. SDP Architectural Models
[Figure: traditional model vs. possible SDP model; in the traditional model a sockets application goes through the TCP/IP sockets provider, TCP/IP transport driver and IPoIB driver to the InfiniBand CA, while in the SDP model the sockets API maps onto the Sockets Direct Protocol, which uses kernel bypass and RDMA semantics on the InfiniBand CA]
(Source: InfiniBand Trade Association)
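The appeal of both IPoIB and SDP is that unmodified sockets code keeps working. A plain TCP client such as the minimal sketch below (hypothetical host and port) runs over IPoIB as-is, since the IB interface simply carries an IP address; OFED's SDP support is typically enabled transparently, e.g. by preloading the SDP library (LD_PRELOAD=libsdp.so), so the application itself still does not change.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Minimal TCP client: the same code runs over Ethernet, over IPoIB,
 * or over SDP when the SDP preload library redirects the socket calls. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(5001) };
    inet_pton(AF_INET, "192.168.1.10", &addr.sin_addr); /* placeholder */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    const char *msg = "hello over IPoIB or SDP\n";
    send(fd, msg, strlen(msg), 0);
    close(fd);
    return 0;
}
```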
SDP vs. IPoIB (IB QDR)
[Figure: bandwidth, bi-directional bandwidth (MBps) and latency (µs) vs. message size for IPoIB-RC, IPoIB-UD and SDP]
SDP enables high bandwidth (up to 15 Gbps) and low latency (6.6 µs).
Case Studies
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)
• InfiniBand in WAN and GridFTP
• Cloud Computing: Hadoop and Memcached
IB on the WAN
• Option 1: Layer-1 optical networks
  – The IB standard specifies link, network and transport layers
  – Can use any layer-1 (though the standard says copper and optical)
• Option 2: Link-layer conversion techniques
  – InfiniBand-to-Ethernet conversion at the link layer: switches available from multiple companies (e.g., Obsidian)
    • Technically, it's not conversion; it's just tunneling (L2TP)
  – InfiniBand's network layer is IPv6 compliant
UltraScience Net: Experimental Research Network Testbed
Since 2005
This and the following IB WAN slides are courtesy of Dr. Nagi Rao (ORNL)
Features
• End-to-end guaranteed bandwidth channels
• Dynamic, in-advance reservation and provisioning of fractional/full lambdas
• Secure control plane for signaling
• Peering with ESnet, National Science Foundation CHEETAH, and other networks
CCGrid '11
127
IB-WAN Connectivity with Obsidian Switches
• Supports SONET OC-192 or 10GE LAN-PHY/WAN-PHY
• Idea is to make remote storage “appear” local
• IB-WAN switch does frame conversion
  – IB standard allows per-hop credit-based flow control
  – IB-WAN switch uses large internal buffers to allow enough credits to fill the wire
[Diagram: host (CPUs, system controller, system memory, HCA) attached to a local IB switch and storage; an IB-WAN switch bridges the IB fabric onto wide-area SONET/10GE]
CCGrid '11
128
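A rough way to see why such large buffers are needed (illustrative numbers, not from the slide): the advertised credits must cover the bandwidth-delay product, BDP = bandwidth × RTT. At the IB 4x SDR data rate of 8 Gb/s (1 GB/s) and a 100 ms wide-area round-trip time, BDP ≈ 1 GB/s × 0.1 s ≈ 100 MB, so the WAN switch must be able to advertise on the order of 100 MB of receive buffering to keep the link full.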
InfiniBand over SONET: Obsidian Longbows (RDMA throughput measurements over USN)
[Diagram: dual-socket quad-core Linux hosts with IB 4x HCAs connect through Obsidian Longbow IB/S units at ORNL to the USN OC-192 backbone: ORNL CDCI – 700 miles – Chicago CDCI – 3300 miles – Seattle CDCI – 4300 miles – Sunnyvale CDCI]
Hosts: dual-socket quad-core 2 GHz AMD Opteron, 4 GB memory, 8-lane PCI-Express slot, dual-port Voltaire 4x SDR HCA
• IB 4x: 8 Gbps (full speed); host-to-host through local switch: 7.5 Gbps
• ORNL loop – 0.2 mile: 7.48 Gbps
• ORNL–Chicago loop – 1400 miles: 7.47 Gbps
• ORNL–Chicago–Seattle loop – 6600 miles: 7.37 Gbps
• ORNL–Chicago–Seattle–Sunnyvale loop – 8600 miles: 7.34 Gbps
CCGrid '11
129
IB over 10GE LAN-PHY and WAN-PHY
[Diagram: same USN testbed as the previous slide (ORNL – 700 miles – Chicago – 3300 miles – Seattle – 4300 miles – Sunnyvale), with Chicago and Sunnyvale E300 units converting between OC-192 and 10GE WAN-PHY]
• ORNL loop – 0.2 mile: 7.5 Gbps
• ORNL–Chicago loop – 1400 miles: 7.49 Gbps
• ORNL–Chicago–Seattle loop – 6600 miles: 7.39 Gbps
• ORNL–Chicago–Seattle–Sunnyvale loop – 8600 miles: 7.36 Gbps
CCGrid '11
130
MPI over IB-WAN: Obsidian Routers
[Setup: Cluster A and Cluster B connected through Obsidian WAN routers over a WAN link with variable, emulated delay]
Emulated delay vs. distance:
  Delay (us)   Distance (km)
      10             2
     100            20
    1000           200
   10000          2000
[Charts: MPI bidirectional bandwidth for the above delays; impact of encryption on message rate (delay 0 ms)]
• Hardware encryption has no impact on performance for a small number of communicating streams
S. Narravula, H. Subramoni, P. Lai, R. Noronha and D. K. Panda, Performance of HPC Middleware over InfiniBand WAN, Int'l Conference on Parallel Processing (ICPP '08), September 2008.
CCGrid '11
131
Communication Options in Grid
[Diagram: GridFTP and high-performance computing applications run over the Globus XIO framework, which can use IB verbs, IPoIB, RoCE or TCP/IP underneath, across Obsidian routers or a 10 GigE network]
• Multiple options exist to perform data transfer on the Grid
• The Globus-XIO framework currently does not support IB natively
• We create the Globus-XIO ADTS driver and add native IB support to GridFTP
CCGrid '11
132
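To make the driver-stack idea concrete, here is a minimal sketch of how a Globus XIO client builds a stack, loads a transport driver by name and opens a handle, following the pattern of the published Globus XIO examples. Loading the shipped "tcp" driver is shown; the name under which the ADTS driver would register ("adts") is our assumption, and error handling is elided.

/* Minimal Globus XIO client sketch (error checking omitted).
 * Assumption: driver names follow the usual XIO convention; "tcp" ships
 * with Globus, and an ADTS-style driver would be loaded the same way. */
#include <stdio.h>
#include "globus_xio.h"

int main(int argc, char **argv)
{
    globus_xio_driver_t  driver;
    globus_xio_stack_t   stack;
    globus_xio_handle_t  handle;
    globus_size_t        nbytes;
    char                 buf[] = "hello over Globus XIO";

    if (argc < 2) { fprintf(stderr, "usage: %s <host:port>\n", argv[0]); return 1; }

    globus_module_activate(GLOBUS_XIO_MODULE);

    /* Load a transport driver by name; swap "tcp" for the native driver */
    globus_xio_driver_load("tcp", &driver);

    /* Build a driver stack and push the transport driver onto it */
    globus_xio_stack_init(&stack, NULL);
    globus_xio_stack_push_driver(stack, driver);

    /* Create a handle over the stack, open it to the remote side, write */
    globus_xio_handle_create(&handle, stack);
    globus_xio_open(handle, argv[1], NULL);
    globus_xio_write(handle, (globus_byte_t *)buf, sizeof(buf), sizeof(buf),
                     &nbytes, NULL);
    globus_xio_close(handle, NULL);

    globus_xio_driver_unload(driver);
    globus_module_deactivate(GLOBUS_XIO_MODULE);
    return 0;
}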
Globus-XIO Framework with ADTS Driver
[Diagram: user-level applications call the Globus XIO interface, which dispatches to drivers #1..#n. The Globus-XIO ADTS driver contains data connection management, persistent session management, and buffer & file management (backed by the file system), plus a data transport interface providing a zero-copy channel, flow control and memory registration over modern WAN interconnects (InfiniBand/RoCE and 10GigE/iWARP)]
CCGrid '11
133
Performance of Memory-Based Data Transfer
[Charts: bandwidth (MBps) vs. network delay (0, 10, 100, 1000 us) for the ADTS driver and the UDT driver, with network buffer sizes of 2 MB, 8 MB, 32 MB and 64 MB]
• Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files
• The ADTS-based implementation is able to saturate the link bandwidth
• Best performance for ADTS is obtained with a network buffer of size 32 MB
CCGrid '11
134
Performance of Disk-Based Data Transfer
[Charts: bandwidth (MBps) vs. network delay (0, 10, 100, 1000 us) in synchronous and asynchronous modes for ADTS with 8 MB, 16 MB, 32 MB and 64 MB buffers, and for IPoIB with 64 MB buffers]
• Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files
• Predictable as well as better performance when disk-I/O threads assist the network thread (asynchronous mode)
• Best performance for ADTS is obtained with a circular buffer with individual buffers of size 64 MB
CCGrid '11
135
Application-Level Performance
[Chart: bandwidth (MBps) of the FTP get operation with ADTS vs. IPoIB for the CCSM and Ultra-Viz target applications]
• Application performance for the FTP get operation for disk-based transfers
• Community Climate System Model (CCSM)
  – Part of the Earth System Grid project; transfers 160 TB of total data in chunks of 256 MB; network latency: 30 ms
• Ultra-Scale Visualization (Ultra-Viz)
  – Transfers files of size 2.6 GB; network latency: 80 ms
• The ADTS driver outperforms the UDT driver using IPoIB by more than 100%
H. Subramoni, P. Lai, R. Kettimuthu and D. K. Panda, High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand, Int'l Symposium on Cluster Computing and the Grid (CCGrid), May 2010.
CCGrid '11
136
Case Studies • Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • InfiniBand in WAN and Grid-FTP • Cloud Computing: Hadoop and Memcached
CCGrid '11
137
A New Approach towards OFA in Cloud
[Diagram: three software stacks. Current cloud software design: Application → Sockets → 1/10 GigE network. Current approach towards OFA in cloud: Application → Sockets → Verbs / hardware offload → 10 GigE or InfiniBand. Our approach: Application → Accelerated sockets plus a native verbs interface → 10 GigE or InfiniBand]
• Sockets are not designed for high performance
  – Stream semantics often mismatch the upper layers (Memcached, Hadoop)
  – Zero-copy is not available for non-blocking sockets (Memcached)
• Significant consolidation in cloud system software
  – Hadoop and Memcached are the developer-facing APIs, not sockets
  – Improving Hadoop and Memcached will benefit many applications immediately!
CCGrid '11
138
Memcached Design Using Verbs
[Diagram: sockets clients and RDMA clients first contact the master thread (step 1), which hands each client off to a sockets worker thread or a verbs worker thread (step 2); all worker threads operate on the shared data (memory, slabs, items)]
• Server and client perform a negotiation protocol
  – The master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is “bound” to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
CCGrid '11
139
Memcached Get Latency
[Charts: get latency (us) vs. message size (1 byte–4 KB) for 1G, 10G-TOE, IPoIB, SDP and the OSU verbs-based design, on an Intel Clovertown cluster (IB: DDR) and an Intel Westmere cluster (IB: QDR)]
• Memcached get latency
  – 4 bytes – DDR: 6 us; QDR: 5 us
  – 4 KB – DDR: 20 us; QDR: 12 us
• Almost a factor of four improvement over 10GE (TOE) for 4 KB
• We are in the process of evaluating iWARP on 10GE
CCGrid '11
140
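For reference, below is a minimal sketch of the kind of set/get timing loop behind such latency numbers, written against the standard libmemcached client API. The server address, key and iteration count are illustrative assumptions, and this is not the OSU-modified client.

/* Minimal memcached get-latency sketch using libmemcached. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <libmemcached/memcached.h>

int main(void)
{
    const int iters = 10000;          /* timing-loop length (assumption) */
    const char value[] = "abcd";      /* 4-byte value, as in the slide */
    const size_t value_len = 4;
    memcached_return_t rc;

    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "localhost", 11211);   /* assumed server location */

    /* Store once, then time repeated gets of the same key */
    rc = memcached_set(memc, "key", 3, value, value_len, (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS) { fprintf(stderr, "set failed\n"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        size_t len;
        uint32_t flags;
        char *v = memcached_get(memc, "key", 3, &len, &flags, &rc);
        free(v);                       /* libmemcached returns a malloc'ed copy */
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average get latency: %.2f us\n", usec / iters);

    memcached_free(memc);
    return 0;
}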
Memcached Get TPS
[Charts: thousands of get transactions per second (TPS) for 8 and 16 clients over SDP, IPoIB, 10G-TOE and the OSU design, on an Intel Clovertown cluster (IB: DDR) and an Intel Westmere cluster (IB: QDR)]
• Memcached get transactions per second for 4 bytes
  – On IB DDR: about 600K/s for 16 clients
  – On IB QDR: 1.9M/s for 16 clients
• Almost a factor of six improvement over 10GE (TOE)
• We are in the process of evaluating iWARP on 10GE
CCGrid '11
141
Hadoop: Java Communication Benchmark
[Charts: ping-pong bandwidth (MB/s) vs. message size (1 byte–4 MB) for the C and Java versions of the benchmark over 1GE, 10GE-TOE, IPoIB, SDP and SDP-NIO]
• Sockets-level ping-pong bandwidth test
• Java performance depends on usage of NIO (allocateDirect)
• C and Java versions of the benchmark have similar performance
• HDFS does not use direct-allocated blocks or NIO on the DataNode
S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, "Can High-Performance Interconnects Benefit Hadoop Distributed File System?", MASVDC '10 in conjunction with MICRO 2010, Atlanta, GA.
CCGrid '11
142
Hadoop: DFS IO Write Performance
[Chart: average write throughput (MB/sec) vs. file size (1–10 GB) on four DataNodes, using HDD and SSD, over 1GE, IPoIB, SDP and 10GE-TOE]
• DFS IO is included in Hadoop and measures sequential access throughput
• Two map tasks, each writing to a file of increasing size (1–10 GB)
• Significant improvement with IPoIB, SDP and 10GigE
• With SSD, the performance improvement is almost seven or eight fold!
• SSD benefits are not seen without using a high-performance interconnect!
  – In line with the comment in the Google keynote about I/O performance
CCGrid '11
143
Hadoop: RandomWriter Performance
[Chart: execution time (sec) for two and four DataNodes, using HDD and SSD, over 1GE, IPoIB, SDP and 10GE-TOE]
• Each map generates 1 GB of random binary data and writes it to HDFS
• SSD improves execution time by 50% with 1GigE for two DataNodes
• For four DataNodes, SSD benefits are observed only with a high-performance interconnect
• IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes
CCGrid '11
144
Hadoop Sort Benchmark
[Chart: execution time (sec) for two and four DataNodes, using HDD and SSD, over 1GE, IPoIB, SDP and 10GE-TOE]
• Sort: baseline benchmark for Hadoop
• Sort phase: I/O bound; reduce phase: communication bound
• SSD improves performance by 28% using 1GigE with two DataNodes
• Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE
CCGrid '11
145
Presentation Overview • Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installations • Sample Case Studies and Performance Numbers • Conclusions and Final Q&A
CCGrid '11
146
Concluding Remarks • Presented network architectures & trends for Clusters, Grid, Multi-tier Datacenters and Cloud Computing Systems
• Presented background and details of IB and HSE – Highlighted the main features of IB and HSE and their convergence – Gave an overview of IB and HSE hardware/software products
– Discussed sample performance numbers in designing various high-end systems with IB and HSE
• IB and HSE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions CCGrid '11
147
Funding Acknowledgments Funding Support by
Equipment Support by
CCGrid '11
148
Personnel Acknowledgments
Current Students
– N. Dandapanthula (M.S.)
– R. Darbha (M.S.)
– V. Dhanraj (M.S.)
– J. Huang (Ph.D.)
– J. Jose (Ph.D.)
– K. Kandalla (Ph.D.)
– M. Luo (Ph.D.)
– V. Meshram (M.S.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
– A. Singh (Ph.D.)
– H. Subramoni (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– P. Lai (Ph.D.)
– J. Liu (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– G. Santhanaraman (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Current Research Scientist
– S. Sur
Current Post-Docs
– H. Wang
– J. Vienne
– X. Besseron
Past Post-Docs
– E. Mancini
– S. Marcarelli
– H.-W. Jin
Current Programmers
– M. Arnold
– D. Bureddy
– J. Perkins
CCGrid '11
149
Web Pointers
http://www.cse.ohio-state.edu/~panda http://www.cse.ohio-state.edu/~surs http://nowlab.cse.ohio-state.edu
MVAPICH Web Page http://mvapich.cse.ohio-state.edu
[email protected] [email protected]
CCGrid '11
150