RI2N/DRV: Multi-link Ethernet for High-Bandwidth and Fault-Tolerant Network on PC Clusters

Shin'ichi Miura, Toshihiro Hanawa, Taiga Yonemoto, Taisuke Boku, and Mitsuhisa Sato
University of Tsukuba

The 9th Workshop on Communication Architecture for Clusters (CAC2009)

25 May, 2009

University of Tsukuba

Background

- Network performance is one of the most important problems for PC clusters.
- Several SANs (System Area Networks) have been developed for PC clusters, but their cost/performance ratio is not good.
- Although the cost/performance ratio of Ethernet is very good, two big problems exist:
  - The sustained performance of Ethernet is lower than that of a SAN.
  - The robustness of Ethernet is relatively low compared with SAN components.

Our solution: utilize multiple Ethernet links.

- In the normal state: distribute packets over multiple links to obtain wider bandwidth.
- In a faulty state: transmit packets only on the healthy links to continue communication.
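The two states above can be sketched together: round-robin distribution in the normal state, with faulty links skipped so communication continues. This is an illustrative model, not the RI2N source; the link names and the fault flag are assumptions.

```python
from itertools import cycle

class MultiLinkSender:
    """Toy model: round-robin over links, skipping links marked faulty."""

    def __init__(self, links):
        self.links = links                  # e.g. two GbE links
        self.up = {l: True for l in links}  # link health flags (assumed API)
        self._rr = cycle(links)

    def next_link(self):
        # Try each link at most once per call; skip faulty ones.
        for _ in range(len(self.links)):
            link = next(self._rr)
            if self.up[link]:
                return link
        raise RuntimeError("all links are down")

    def mark_faulty(self, link):
        self.up[link] = False

sender = MultiLinkSender(["eth0", "eth1"])
order = [sender.next_link() for _ in range(4)]       # normal state: alternates
sender.mark_faulty("eth1")                           # simulate a link failure
order_after = [sender.next_link() for _ in range(2)]  # faulty state: eth0 only
```

In the normal state the sender alternates between the two links; after `eth1` fails, all traffic falls back to `eth0` and communication continues.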


RI2N

- RI2N: Redundant Interconnection with Inexpensive Network.
- RI2N/UDP (T. Okamoto et al., CAC2007): a user-level implementation built on a multithread library and UDP sockets.
- RI2N/DRV: a system-level implementation.
  - High system transparency and low communication cost.
  - Retransmission and congestion control rely on an upper-layer protocol such as TCP/IP.

[Figure: example of a cluster connected with RI2N; each node has two NICs, connected through two network switches.]


Related Works

- Hardware implementation: IEEE 802.3ad (LACP)
  - A special switch is required, and multiplexing of the switch itself is not considered: the switch remains a single point of failure.
- Software implementation: Linux Channel Bonding (balance-rr mode)
  - Similar to RI2N, but suffers from the packet disordering problem: the receiver sends an increased number of ACK packets, causing traffic congestion, heavy CPU load, and low throughput.
- TCP-level solutions to the packet disordering problem
  - Remodeling TCP might influence user applications.
- SCTP
  - Heavy CPU load and low program portability.


Linux Channel Bonding

- Packet transmitter: LCB distributes packets to the multiple physical network links in a round-robin manner.
- Packet receiver: LCB takes no part in the receive operation; each NIC transfers its packets directly to the upper-layer protocol such as IP.
- With interrupt coalescing or NAPI in the Linux device driver, multiple packets are pushed up all at once to reduce interrupt-handling overhead, so the packets from the different NICs reach IP out of order.
- The disordered packets cause a large number of ACKs to be sent back to the sender side.

[Figure: packets 0-7 are split round-robin onto two real NICs (0 2 4 6 and 1 3 5 7) by LCB; on the receiver, each NIC delivers its batch separately, so IP, TCP, and the user application see the packets disordered.]
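The disordering mechanism described above can be reproduced with a small simulation. This is an assumed model, not kernel code: the sender round-robins sequence numbers across two NICs, and interrupt coalescing makes each NIC deliver its whole queued batch at once.

```python
def round_robin_tx(packets, n_links):
    """Distribute packets over n_links queues in round-robin order."""
    queues = [[] for _ in range(n_links)]
    for i, p in enumerate(packets):
        queues[i % n_links].append(p)
    return queues

def coalesced_rx(queues):
    """Model interrupt coalescing: each interrupt drains one NIC's
    whole queue at once, so batches are concatenated, not interleaved."""
    received = []
    for q in queues:
        received.extend(q)
    return received

sent = list(range(8))                 # packets 0..7 sent in order
nic_queues = round_robin_tx(sent, 2)  # NIC0 gets 0 2 4 6, NIC1 gets 1 3 5 7
got = coalesced_rx(nic_queues)        # IP sees 0 2 4 6 1 3 5 7
```

Even though the sender emitted the packets in order, the upper layer observes the 0 2 4 6 1 3 5 7 pattern shown in the figure.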

RI2N

- Packet transmitter: round-robin, the same as the LCB transmitter.
- Packet receiver (RI2N/DRV):
  - Each NIC transfers its packets to RI2N.
  - Packets are saved in RI2N's buffer and reordered.
  - Packets are transferred to the upper-layer protocol in the correct order.
- Because packets are delivered in the correct order, a large number of ACK packets is not generated.
- This reordering mechanism improves throughput.

[Figure: packets 0 2 4 6 and 1 3 5 7 from the two real NICs are buffered and reordered by RI2N/DRV, so IP, TCP, and the user application receive 0-7 in the correct order.]
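The buffering-and-reordering step can be sketched as a sequence-number reorder buffer: out-of-order packets are held until the next expected packet arrives, then released in order. This is a simplified sketch of the idea, with details (buffer structure, timeouts) assumed.

```python
class ReorderBuffer:
    """Hold out-of-order packets; release runs of in-order packets."""

    def __init__(self):
        self.expected = 0   # next sequence number to deliver
        self.pending = {}   # seq -> payload, waiting for a gap to fill

    def receive(self, seq, payload):
        """Buffer a packet; return payloads now deliverable in order."""
        self.pending[seq] = payload
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered

buf = ReorderBuffer()
out = []
# Interleaved arrival from two NICs: 0 2 4 6 then 1 3 5 7
for seq in [0, 2, 4, 6, 1, 3, 5, 7]:
    out.extend(buf.receive(seq, f"pkt{seq}"))
# The upper layer sees pkt0..pkt7 in the correct order
```

Packets 2, 4, and 6 wait in the buffer until packet 1 arrives, after which each newly filled gap releases the buffered run behind it.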

Implementation of RI2N

- RI2N creates a pseudo network device to transmit and receive packets.
  - It is compatible with an ordinary Ethernet device: user applications can use TCP/IP and UDP/IP communication without special modification.
- RI2N extends the Ethernet packet for packet reordering.
  - An "RI2N header" carrying the RI2N sequence number and the original type (0x0800 for IPv4) is inserted between the Ethernet header and the Ethernet payload, and the Ethernet type field is set to 0x1044.
- Receive operation:
  - RI2N receives Ethernet (layer-2) packets instead of an upper-layer protocol such as IP, and transfers them upward after reordering via the OS RX queue or NAPI.
  - The receive operation requires 2 software interrupts per packet for reordering, which causes large latency for short messages.
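The framing described above can be sketched in bytes: an RI2N header (the custom EtherType 0x1044, a sequence number, and the original EtherType) inserted between the Ethernet header and payload. The exact field widths are assumptions for illustration, not taken from the RI2N source.

```python
import struct

ETH_TYPE_RI2N = 0x1044  # custom type identifying RI2N frames (from the slide)
ETH_TYPE_IPV4 = 0x0800  # original type, preserved inside the RI2N header

def encapsulate(dst, src, seq, payload):
    """Build: eth header (type 0x1044) | seq + original type | payload."""
    eth_hdr = dst + src + struct.pack("!H", ETH_TYPE_RI2N)
    ri2n_hdr = struct.pack("!IH", seq, ETH_TYPE_IPV4)  # widths assumed
    return eth_hdr + ri2n_hdr + payload

def decapsulate(frame):
    """Strip the RI2N header; recover sequence number and original type."""
    (etype,) = struct.unpack("!H", frame[12:14])
    assert etype == ETH_TYPE_RI2N
    seq, orig_type = struct.unpack("!IH", frame[14:20])
    return seq, orig_type, frame[20:]

frame = encapsulate(b"\xff" * 6, b"\xaa" * 6, 42, b"ip-packet")
seq, orig_type, payload = decapsulate(frame)
```

On receive, the sequence number drives the reorder buffer, and the preserved original type lets RI2N hand the payload to the right upper-layer protocol.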

Basic Performance Evaluation

Benchmarks:
- Amount of ACK packets
- Latency
- Throughput on MPI
- Two-way communication

Networks:
- Single: single Gigabit Ethernet
- LCB: Linux Channel Bonding with 2 GbE
- RI2N: RI2N/DRV with 2 GbE

Node configuration (Node A and Node B):
- CPU: Intel Xeon 3110, 3.0 GHz dual core
- Memory: DDR2/800, 4096 MB
- OS: Linux kernel 2.6.27.7
- NIC: Intel PRO/1000PT dual port, MTU=1500
- NIC driver: e1000e 0.3.3.3-k
- MPI: OpenMPI 1.2.4

[Figure: Node A and Node B, each with two NICs, connected through two switches.]

Amount of ACK Packets

Packet reordering in RI2N reduces unnecessary ACK packets.

Number of ACK packets:
- Single: 370
- LCB: 741
- RI2N: 376

- LCB produces twice the number of ACKs compared with Single; the cause of this gap is packet disordering.
- RI2N is almost the same as Single: by delivering packets in the correct order, RI2N avoids the ACK packets caused by disordering.
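The roughly 2x ACK count for LCB is consistent with standard TCP receiver behavior: an in-order stream can use delayed ACKs (about one ACK per two segments), while every out-of-order segment and every filled gap triggers an immediate ACK. The following is a toy model of that behavior (a simplification of real TCP, for illustration only):

```python
def count_acks(arrivals):
    """Toy TCP receiver: delayed ACKs for in-order data, immediate
    ACKs for out-of-order segments and for segments that fill a gap."""
    expected, acks, delayed = 0, 0, 0
    ooo = set()  # buffered out-of-order sequence numbers
    for seq in arrivals:
        if seq == expected:
            expected += 1
            filled_gap = False
            while expected in ooo:   # consume any buffered run behind us
                ooo.remove(expected)
                expected += 1
                filled_gap = True
            if filled_gap:
                acks += 1            # immediate ACK after filling a hole
                delayed = 0
            else:
                delayed += 1
                if delayed == 2:     # delayed ACK: one per two in-order segs
                    acks += 1
                    delayed = 0
        else:
            ooo.add(seq)
            acks += 1                # immediate duplicate ACK
            delayed = 0
    if delayed:
        acks += 1                    # delayed-ACK timer eventually fires
    return acks

in_order = count_acks(list(range(8)))             # ordered: few ACKs
disordered = count_acks([0, 2, 4, 6, 1, 3, 5, 7])  # LCB-style arrival
```

In this model the ordered stream needs 4 ACKs while the disordered one needs 7, close to the 2x gap measured between Single (370) and LCB (741).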

Latency over Normal TCP/IP

[Figure: round-trip time (usec, 0-350) versus message size (0-9000 bytes) for single, LCB, and RI2N.]

- Single shows the lowest latency.
- The latency of RI2N is larger than that of the other environments for short messages, because the receive operation of RI2N requires 2 software interrupts per packet for reordering.

Throughput on MPI (Ping-Pong)

Most typical applications require high communication bandwidth.

[Figure: throughput (MB/sec, 0-250) versus message size for single, LCB, and RI2N.]

- For short messages, RI2N's throughput is lower than LCB's.
- For large messages, RI2N's throughput is better than LCB's.

Two-way Communication (TCP/IP)

The performance of both LCB and RI2N is better than Single.

Throughput:
- Single: 117 MB/sec
- LCB: 159 MB/sec
- RI2N: 214 MB/sec

The difference between LCB and RI2N is 55 MB/sec: in LCB, packet disordering frequently occurs, so the receiver transmits a large number of ACK packets, and these ACKs disturb the stream flowing in the opposite direction.

[Figure: nodes 0 and 1 communicating in both directions through two switches.]

Bonnie++ Benchmarks on NFS

Bonnie++ benchmarks:
- Sequential Write (put_block)
- Sequential Read (get_block)
- Sequential Rewrite (rewrite)

Settings:
- File size: 2.0 GB
- NFS block size: 32 KB
- A tmpfs memory disk was prepared on the NFS server; its direct I/O performance is about 1.4 GB/sec.

Dual-link Gigabit Ethernet provides more than 200 MB/sec of data transfer performance.

[Figure: NFS server (node 0) and NFS client (node 1) connected through two switches.]

Bonnie++ Benchmarks on NFS (Results)

[Figure: throughput (MB/sec, 0-250) and CPU usage (%, 0-100) for put_block, get_block, and rewrite under single, LCB, and RI2N.]

- RI2N achieves the highest performance, while LCB's throughput is almost equal to Single's.
- The cause of the performance gain of LCB and RI2N over Single is two-way communication performance, and RI2N can exploit it more efficiently than LCB.
- In the CPU usage chart, RI2N shows high CPU utilization for some operations, but lower CPU usage than LCB for others.

NAS Parallel Benchmarks

- NAS Parallel Benchmarks 3.3
- CLASS: B
- Number of processes: 4 and 8
- Benchmarks:
  - 5 kernels: EP, FT, IS, MG, and CG
  - 3 applications: LU, BT, and SP

Node configuration:
- CPU: Intel Xeon 5110, 1.6 GHz dual core
- Memory: DDR2/800, 4092 MB
- OS: Linux 2.6.27.7
- NIC: Intel PRO/1000PT dual port, MTU=1500
- NIC driver: e1000e 0.3.3.3-k
- MPI: OpenMPI 1.2.4

[Figure: eight nodes (0-7) connected through two switches.]

NAS Parallel Benchmarks (Results)

[Figure: relative performance (0-2.5) of single, LCB, and RI2N at 4 and 8 nodes for EP, FT, IS, MG, CG, LU, BT, and SP.]

- The performance of RI2N is improved by the higher bandwidth.
- When the message size decreases, the relative performance of RI2N becomes worse, and in some cases the performance is not improved.
- The performance of LCB is worse than the single environment even though the bandwidth has increased.

Conclusion

RI2N:
- Provides high bandwidth and fault tolerance with multiple Ethernet links.
- Is implemented as a pseudo network interface of Linux.
- Can be used transparently: an existing system can use RI2N without recompilation.

Prototype implementation:
- The performance is moderately good, but it is still necessary to improve it.
- RI2N can be applied not only to inter-node communication such as MPI but also to any UNIX network service that requires a high-speed, reliable network.

Thank you for your kind attention
