RI2N/DRV: Multi-link Ethernet for High-Bandwidth and Fault-Tolerant Network on PC Clusters

Shin'ichi Miura, Toshihiro Hanawa, Taiga Yonemoto, Taisuke Boku, and Mitsuhisa Sato
University of Tsukuba

The 9th Workshop on Communication Architecture for Clusters (CAC2009), 25 May 2009
Background
Network performance is one of the most important issues for PC clusters.
- Several SANs (System Area Networks) have been developed for PC clusters, but their cost/performance ratio is not good.
- Ethernet's cost/performance ratio is very good, but two big problems remain:
  - The sustained performance of Ethernet is lower than that of a SAN.
  - The robustness of Ethernet is relatively low compared with SAN components.

Our solution: utilize multiple Ethernet links.
- In the normal state: distribute packets over multiple links to obtain wider bandwidth.
- In a faulty state: transmit packets over only the healthy links to continue communication.
(A minimal sketch of this link-selection policy follows below.)
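The policy can be pictured with a small C sketch: packets go out round-robin over the available links, and links marked faulty are simply skipped. All names here (struct link, pick_link, rr_index) are illustrative assumptions, not RI2N's actual code.

```c
#include <stddef.h>

struct link {
    int up;                 /* nonzero while the link is healthy */
    /* ... per-NIC transmit state would live here ... */
};

static size_t rr_index;     /* position of the round-robin cursor */

/* Pick the next link in round-robin order, skipping faulty links.
 * Returns NULL only when every link is down. */
static struct link *pick_link(struct link *links, size_t nlinks)
{
    for (size_t tried = 0; tried < nlinks; tried++) {
        struct link *l = &links[rr_index];
        rr_index = (rr_index + 1) % nlinks;
        if (l->up)
            return l;       /* normal state: stripes across all links;
                               faulty state: degrades to the survivors */
    }
    return NULL;
}
```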
RI2N
RI2N: Redundant Interconnection with Inexpensive Network

RI2N/UDP (T. Okamoto et al., CAC2007)
- User-level implementation on a multithread library and UDP sockets

RI2N/DRV (this work)
- System-level implementation
- Retransmission and congestion control rely on upper-layer protocols such as TCP/IP
- High system transparency
- Low communication cost

[Figure: example of a cluster connected with RI2N — each node has two NICs, each attached to a separate network switch.]
Related Works
Hardware implementation
- IEEE 802.3ad (LACP)
  - A special switch is required, and multiplexing of the switch itself is not considered, so the switch remains a single point of failure.

Software implementation
- Linux Channel Bonding (balance-rr mode)
  - Similar to RI2N, but it suffers from the packet disordering problem, which leads to low throughput, an increased number of ACK packets from the receiver side, heavy CPU load, and traffic congestion.
- Transport-level solutions for the packet disordering problem
  - TCP: remodeling TCP might influence the user application.
  - SCTP: low program portability.
Linux Channel Bonding
Packet transmitter
- LCB distributes packets over multiple physical network links in a round-robin manner.

Packet receiver
- LCB does not take care of the receive operation: each NIC transfers its packets directly to the upper-layer protocol such as IP.
- With interrupt coalescing or NAPI in the Linux device driver, multiple packets are pushed up all at once to save on interrupt handling.
- As a result, packets arrive at TCP out of order, and a large number of ACKs is sent back to the sender side.

[Figure: LCB packet flow. The sender stripes packets 0,2,4,6 onto Real NIC 0 and 1,3,5,7 onto Real NIC 1; on the receiver, interrupt coalescing/NAPI hands them to IP/TCP as 0,2,4,6,1,3,5,7, i.e. disordered.]
RI2N/DRV
Packet transmitter
- Round-robin distribution, the same as the transmitter of LCB.

Packet receiver
- Each NIC transfers its packets to RI2N/DRV instead of directly to IP.
- Received packets are saved in the RI2N buffer and reordered there.
- Packets are then transferred to the upper-layer protocol in the correct order.
- Because TCP sees the correct order, the receiver does not generate a large number of ACK packets.
- This reordering mechanism improves throughput.

[Figure: RI2N/DRV packet flow. Packets 0,2,4,6 from Real NIC 0 and 1,3,5,7 from Real NIC 1 are buffered and reordered by RI2N/DRV, then delivered to IP/TCP as 0,1,2,...,7.]
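The buffering step can be pictured with a small C sketch. The window size, the array-based buffer, and the names (ri2n_receive, deliver_to_ip) are illustrative assumptions, not the actual driver code; a real implementation would also need a timeout to flush the window when a packet is lost.

```c
#include <stdint.h>
#include <stddef.h>

#define SLOTS 64u                         /* assumed reorder window */

struct packet { uint32_t seq; /* ... payload ... */ };

static struct packet *reorder_buf[SLOTS]; /* indexed by seq % SLOTS */
static uint32_t next_seq;                 /* next seq to hand upward */

void deliver_to_ip(struct packet *p);     /* hypothetical upcall */

/* Called for every frame received from any of the physical NICs. */
void ri2n_receive(struct packet *p)
{
    reorder_buf[p->seq % SLOTS] = p;      /* park the packet */

    /* Flush while the next expected packet is present, so that
     * IP/TCP always sees an in-order stream and ACKs stay normal. */
    while (reorder_buf[next_seq % SLOTS] != NULL &&
           reorder_buf[next_seq % SLOTS]->seq == next_seq) {
        struct packet *q = reorder_buf[next_seq % SLOTS];
        reorder_buf[next_seq % SLOTS] = NULL;
        deliver_to_ip(q);
        next_seq++;
    }
}
```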
Implementation of RI2N
RI2N creates a pseudo network device to transmit and receive packets.
- It is compatible with an ordinary Ethernet device: the user application can use TCP/IP and UDP/IP communication without any special modification.

RI2N extends the Ethernet packet for packet reordering.
- An "RI2N header" carrying the sequence number is inserted between the Ethernet header and the Ethernet payload.
- The Ethernet Type field is set to 0x1044 to mark RI2N frames, and the original type (0x0800 for IPv4) is carried inside the RI2N header.

Receive operation
- RI2N receives the Ethernet packet instead of the upper-layer protocol such as IP (via the OS RX queue or NAPI), and transfers the packet upward only after reordering.
- Each packet therefore requires two software interrupts: one at layer 2 for RI2N's reordering and one at layer 3 for IPv4.
- This causes large latency for short messages.

[Figure: frame format. Ethernet header (destination address, source address, Type 0x1044) | RI2N header (sequence number, original Type 0x0800) | Ethernet payload.]
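The layout might be declared as in the following C sketch. The field widths (e.g. a 32-bit sequence number), the field order inside the RI2N header, and all names are assumptions made for illustration; only the EtherType values 0x1044 and 0x0800 come from the slide.

```c
#include <stdint.h>

#define ETH_ALEN        6
#define ETHERTYPE_RI2N  0x1044   /* marks an RI2N-encapsulated frame */
#define ETHERTYPE_IPV4  0x0800   /* original type, restored on receive */

struct ri2n_frame {
    /* ordinary Ethernet header */
    uint8_t  dst[ETH_ALEN];      /* destination MAC address */
    uint8_t  src[ETH_ALEN];      /* source MAC address */
    uint16_t ether_type;         /* ETHERTYPE_RI2N (0x1044) */
    /* RI2N header, inserted before the original payload */
    uint32_t seq;                /* sequence number used for reordering */
    uint16_t orig_type;          /* original EtherType, e.g. 0x0800 */
    /* the untouched Ethernet payload (e.g. IPv4 datagram) follows */
    uint8_t  payload[];
} __attribute__((packed));
```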
Basic Performance Evaluation

Benchmarks:
- Amount of ACK packets
- Latency
- Throughput on MPI (ping-pong)
- Two-way communication

Networks:
- Single: single Gigabit Ethernet
- LCB: Linux Channel Bonding with 2 GbE links
- RI2N: RI2N/DRV with 2 GbE links

Environment (Node A and Node B, each with two NICs connected through two separate switches):
- CPU: Intel Xeon 3110, 3.0 GHz, dual core
- Memory: DDR2/800, 4096 MB
- OS: Linux kernel 2.6.27.7
- NIC: Intel PRO/1000PT dual port, MTU = 1500
- NIC driver: e1000e 0.3.3.3-k
- MPI: OpenMPI 1.2.4
(A sketch of the ping-pong measurement loop follows below.)
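The MPI ping-pong throughput test is typically structured like this sketch (illustrative only, not the authors' benchmark code; the message size and iteration count are arbitrary):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int size  = 1 << 20;            /* 1 MB messages */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                  /* ping, then wait for pong */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {           /* echo everything back */
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                        /* two transfers per iteration */
        printf("throughput: %.1f MB/s\n",
               2.0 * iters * size / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```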
Amount of ACK Packets

By packet reordering in RI2N, unnecessary ACK packets are reduced.

Number of ACK packets:
- Single: 370
- LCB: 741
- RI2N: 376

- LCB produces about twice as many ACKs as Single; the cause of this gap is packet disordering.
- RI2N stays almost at the level of Single, because its reordering removes the disordering before TCP sees the stream.
Latency over Normal TCP/IP

- For short messages, the latency of RI2N is larger than that of the other environments.
- The reason is that the receive operation of RI2N requires two software interrupts per packet for reordering.

[Figure: round-trip time (usec, 0-350) versus message size (0-9000 bytes) for single, LCB, and RI2N; single shows the lowest latency, and RI2N sits above the others for short messages.]
Throughput on MPI (Ping-Pong)

- Most typical applications require high communication bandwidth.
- In part of the message-size range, RI2N's throughput is better than LCB's; in another part it is lower.

[Figure: throughput (MB/sec, up to about 250) versus message size (bytes) for single, LCB, and RI2N.]
Two-way Communication (TCP/IP)

The performance of both LCB and RI2N is better than Single.

Throughput:
- Single: 117 MB/sec
- LCB: 159 MB/sec
- RI2N: 214 MB/sec (55 MB/sec higher than LCB)

Why LCB falls behind:
- Packet disordering frequently occurs, so the receiver transmits a large number of ACK packets.
- These ACK packets disturb the stream flowing in the opposite direction.
Bonnie++ Benchmarks on NFS

Bonnie++ benchmark settings:
- File size: 2.0 GB
- NFS block size: 32 KB
- Tests: sequential write (put_block), sequential read (get_block), and sequential rewrite (rewrite)

Setup:
- NFS server and NFS client are connected through the two switches, as in the previous experiments.
- A tmpfs memory disk was prepared on the NFS server; its direct I/O performance is about 1.4 GB/sec, so the disk is not the bottleneck.
- Dual-link Gigabit Ethernet can provide more than 200 MB/sec of data transfer performance.
Bonnie++ Benchmarks on NFS (Results)

Throughput:
- RI2N achieves the highest performance, gaining close to twice the throughput of Single.
- LCB's throughput is almost equal to Single's, even though it has twice the raw bandwidth.
- The cause of this difference between LCB and RI2N is two-way communication performance: NFS traffic flows in both directions, and RI2N can use the two links more efficiently than LCB.

CPU usage:
- RI2N's CPU utilization is high, yet it still provides lower CPU usage than LCB.

[Figure: throughput (MB/sec, up to 250) and CPU usage (%, up to 100) of put_block, get_block, and rewrite under single, LCB, and RI2N.]
NAS Parallel Benchmarks

NAS Parallel Benchmarks 3.3
- Class: B
- Number of processes: 4 and 8
- Benchmarks: 5 kernels (EP, FT, IS, MG, CG) and 3 applications (LU, BT, SP)

Environment (8 nodes, numbered 0-7, each connected to the two switches):
- CPU: Intel Xeon 5110, 1.6 GHz, dual core
- Memory: DDR2/800, 4092 MB
- OS: Linux 2.6.27.7
- NIC: Intel PRO/1000PT dual port, MTU = 1500
- NIC driver: e1000e 0.3.3.3-k
- MPI: OpenMPI 1.2.4
NAS Parallel Benchmarks (Results)

- The performance of RI2N is improved by the higher bandwidth.
- When the message size decreases, the relative performance becomes worse.
- The performance of LCB has not improved: it is worse than the Single environment even though the available bandwidth has increased.

[Figure: relative performance (normalized, 0-2.5) of EP, FT, IS, MG, CG, LU, BT, and SP for single, LCB, and RI2N on 4 and 8 nodes.]
Conclusion

RI2N:
- Provides high bandwidth and fault tolerance with multiple Ethernet links.
- Is implemented as a pseudo network interface on Linux, so a user can use the system transparently and existing applications can use it without recompilation.

Prototype implementation:
- The performance is moderately good; however, further improvement is still necessary.

RI2N can be applied not only to inter-node communication such as MPI but also to any UNIX network service that requires a high-speed, reliable network.
Thank you for your kind attention