RI2N/DRV: Multi-link Ethernet for High-Bandwidth and Fault-Tolerant Network on PC Clusters

Shin'ichi Miura, Toshihiro Hanawa, Taiga Yonemoto, Taisuke Boku, and Mitsuhisa Sato
University of Tsukuba

The 9th Workshop on Communication Architecture for Clusters (CAC2009), 25 May 2009
Background
Network performance is one of the most important issues for PC clusters.
- Several SANs (System Area Networks) have been developed for PC clusters, but their cost/performance ratio is not good.
- Ethernet's cost/performance ratio is very good, but two big problems remain:
  - The sustained performance of Ethernet is lower than that of a SAN.
  - The robustness of Ethernet is relatively low compared with SAN components.

Our solution: utilize multiple Ethernet links.
- In the normal state: distribute packets over multiple links to obtain wider bandwidth.
- In a faulty state: transmit packets over only the healthy links to continue communication.
(A minimal sketch of this link-selection policy follows below.)
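The policy can be pictured with a small C sketch: packets go out round-robin over the available links, and links marked faulty are simply skipped. All names here (struct link, pick_link, rr_index) are illustrative assumptions, not RI2N's actual code.

```c
#include <stddef.h>

struct link {
    int up;                 /* nonzero while the link is healthy */
    /* ... per-NIC transmit state would live here ... */
};

static size_t rr_index;     /* position of the round-robin cursor */

/* Pick the next link in round-robin order, skipping faulty links.
 * Returns NULL only when every link is down. */
static struct link *pick_link(struct link *links, size_t nlinks)
{
    for (size_t tried = 0; tried < nlinks; tried++) {
        struct link *l = &links[rr_index];
        rr_index = (rr_index + 1) % nlinks;
        if (l->up)
            return l;       /* normal state: stripes across all links;
                               faulty state: degrades to the survivors */
    }
    return NULL;
}
```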
RI2N
RI2N: Redundant Interconnection with Inexpensive Network

RI2N/UDP (T. Okamoto et al., CAC2007)
- User-level implementation on a multithread library and UDP sockets

RI2N/DRV (this work)
- System-level implementation
- Retransmission and congestion control rely on upper-layer protocols such as TCP/IP
- High system transparency
- Low communication cost

[Figure: example of a cluster connected with RI2N — each node has two NICs, each attached to a separate network switch.]
Related Works
Hardware implementation
- IEEE 802.3ad (LACP)
  - A special switch is required, and multiplexing of the switch itself is not considered, so the switch remains a single point of failure.

Software implementation
- Linux Channel Bonding (balance-rr mode)
  - Similar to RI2N, but it suffers from the packet disordering problem, which leads to low throughput, an increased number of ACK packets from the receiver side, heavy CPU load, and traffic congestion.
- Transport-level solutions for the packet disordering problem
  - TCP: remodeling TCP might influence the user application.
  - SCTP: low program portability.
Linux Channel Bonding
Packet transmitter
- LCB distributes packets over multiple physical network links in a round-robin manner.

Packet receiver
- LCB does not take care of the receive operation: each NIC transfers its packets directly to the upper-layer protocol such as IP.
- With interrupt coalescing or NAPI in the Linux device driver, multiple packets are pushed up all at once to save on interrupt handling.
- As a result, packets arrive at TCP out of order, and a large number of ACKs is sent back to the sender side.

[Figure: LCB packet flow. The sender stripes packets 0,2,4,6 onto Real NIC 0 and 1,3,5,7 onto Real NIC 1; on the receiver, interrupt coalescing/NAPI hands them to IP/TCP as 0,2,4,6,1,3,5,7, i.e. disordered.]
RI2N/DRV
Packet transmitter
- Round-robin distribution, the same as the transmitter of LCB.

Packet receiver
- Each NIC transfers its packets to RI2N/DRV instead of directly to IP.
- Received packets are saved in the RI2N buffer and reordered there.
- Packets are then transferred to the upper-layer protocol in the correct order.
- Because TCP sees the correct order, the receiver does not generate a large number of ACK packets.
- This reordering mechanism improves throughput.

[Figure: RI2N/DRV packet flow. Packets 0,2,4,6 from Real NIC 0 and 1,3,5,7 from Real NIC 1 are buffered and reordered by RI2N/DRV, then delivered to IP/TCP as 0,1,2,...,7.]
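The buffering step can be pictured with a small C sketch. The window size, the array-based buffer, and the names (ri2n_receive, deliver_to_ip) are illustrative assumptions, not the actual driver code; a real implementation would also need a timeout to flush the window when a packet is lost.

```c
#include <stdint.h>
#include <stddef.h>

#define SLOTS 64u                         /* assumed reorder window */

struct packet { uint32_t seq; /* ... payload ... */ };

static struct packet *reorder_buf[SLOTS]; /* indexed by seq % SLOTS */
static uint32_t next_seq;                 /* next seq to hand upward */

void deliver_to_ip(struct packet *p);     /* hypothetical upcall */

/* Called for every frame received from any of the physical NICs. */
void ri2n_receive(struct packet *p)
{
    reorder_buf[p->seq % SLOTS] = p;      /* park the packet */

    /* Flush while the next expected packet is present, so that
     * IP/TCP always sees an in-order stream and ACKs stay normal. */
    while (reorder_buf[next_seq % SLOTS] != NULL &&
           reorder_buf[next_seq % SLOTS]->seq == next_seq) {
        struct packet *q = reorder_buf[next_seq % SLOTS];
        reorder_buf[next_seq % SLOTS] = NULL;
        deliver_to_ip(q);
        next_seq++;
    }
}
```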
Implementation of RI2N
RI2N creates a pseudo network device to transmit and receive packets.
- It is compatible with an ordinary Ethernet device: the user application can use TCP/IP and UDP/IP communication without any special modification.

RI2N extends the Ethernet packet for packet reordering.
- An "RI2N header" carrying the sequence number is inserted between the Ethernet header and the Ethernet payload.
- The Ethernet Type field is set to 0x1044 to mark RI2N frames, and the original type (0x0800 for IPv4) is carried inside the RI2N header.

Receive operation
- RI2N receives the Ethernet packet instead of the upper-layer protocol such as IP (via the OS RX queue or NAPI), and transfers the packet upward only after reordering.
- Each packet therefore requires two software interrupts: one at layer 2 for RI2N's reordering and one at layer 3 for IPv4.
- This causes large latency for short messages.

[Figure: frame format. Ethernet header (destination address, source address, Type 0x1044) | RI2N header (sequence number, original Type 0x0800) | Ethernet payload.]
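The layout might be declared as in the following C sketch. The field widths (e.g. a 32-bit sequence number), the field order inside the RI2N header, and all names are assumptions made for illustration; only the EtherType values 0x1044 and 0x0800 come from the slide.

```c
#include <stdint.h>

#define ETH_ALEN        6
#define ETHERTYPE_RI2N  0x1044   /* marks an RI2N-encapsulated frame */
#define ETHERTYPE_IPV4  0x0800   /* original type, restored on receive */

struct ri2n_frame {
    /* ordinary Ethernet header */
    uint8_t  dst[ETH_ALEN];      /* destination MAC address */
    uint8_t  src[ETH_ALEN];      /* source MAC address */
    uint16_t ether_type;         /* ETHERTYPE_RI2N (0x1044) */
    /* RI2N header, inserted before the original payload */
    uint32_t seq;                /* sequence number used for reordering */
    uint16_t orig_type;          /* original EtherType, e.g. 0x0800 */
    /* the untouched Ethernet payload (e.g. IPv4 datagram) follows */
    uint8_t  payload[];
} __attribute__((packed));
```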
Basic Performance Evaluation

Benchmarks:
- Amount of ACK packets
- Latency
- Throughput on MPI (ping-pong)
- Two-way communication

Networks:
- Single: single Gigabit Ethernet
- LCB: Linux Channel Bonding with 2 GbE links
- RI2N: RI2N/DRV with 2 GbE links

Environment (Node A and Node B, each with two NICs connected through two separate switches):
- CPU: Intel Xeon 3110, 3.0 GHz, dual core
- Memory: DDR2/800, 4096 MB
- OS: Linux kernel 2.6.27.7
- NIC: Intel PRO/1000PT dual port, MTU = 1500
- NIC driver: e1000e 0.3.3.3-k
- MPI: OpenMPI 1.2.4
(A sketch of the ping-pong measurement loop follows below.)
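The MPI ping-pong throughput test is typically structured like this sketch (illustrative only, not the authors' benchmark code; the message size and iteration count are arbitrary):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int size  = 1 << 20;            /* 1 MB messages */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                  /* ping, then wait for pong */
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {           /* echo everything back */
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                        /* two transfers per iteration */
        printf("throughput: %.1f MB/s\n",
               2.0 * iters * size / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```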
Amount of ACK Packets

By packet reordering in RI2N, unnecessary ACK packets are reduced.

Number of ACK packets:
- Single: 370
- LCB: 741
- RI2N: 376

- LCB produces about twice as many ACKs as Single; the cause of this gap is packet disordering.
- RI2N stays almost at the level of Single, because its reordering removes the disordering before TCP sees the stream.
Latency over Normal TCP/IP

- For short messages, the latency of RI2N is larger than that of the other environments.
- The reason is that the receive operation of RI2N requires two software interrupts per packet for reordering.

[Figure: round-trip time (usec, 0-350) versus message size (0-9000 bytes) for single, LCB, and RI2N; single shows the lowest latency, and RI2N sits above the others for short messages.]
Throughput on MPI (Ping-Pong)

- Most typical applications require high communication bandwidth.
- In part of the message-size range, RI2N's throughput is better than LCB's; in another part it is lower.

[Figure: throughput (MB/sec, up to about 250) versus message size (bytes) for single, LCB, and RI2N.]
Two-way Communication (TCP/IP)

The performance of both LCB and RI2N is better than Single.

Throughput:
- Single: 117 MB/sec
- LCB: 159 MB/sec
- RI2N: 214 MB/sec (55 MB/sec higher than LCB)

Why LCB falls behind:
- Packet disordering frequently occurs, so the receiver transmits a large number of ACK packets.
- These ACK packets disturb the stream flowing in the opposite direction.
Bonnie++ Benchmarks on NFS

Bonnie++ benchmark settings:
- File size: 2.0 GB
- NFS block size: 32 KB
- Tests: sequential write (put_block), sequential read (get_block), and sequential rewrite (rewrite)

Setup:
- NFS server and NFS client are connected through the two switches, as in the previous experiments.
- A tmpfs memory disk was prepared on the NFS server; its direct I/O performance is about 1.4 GB/sec, so the disk is not the bottleneck.
- Dual-link Gigabit Ethernet can provide more than 200 MB/sec of data transfer performance.
Bonnie++ Benchmarks on NFS (Results)

Throughput:
- RI2N achieves the highest performance, gaining close to twice the throughput of Single.
- LCB's throughput is almost equal to Single's, even though it has twice the raw bandwidth.
- The cause of this difference between LCB and RI2N is two-way communication performance: NFS traffic flows in both directions, and RI2N can use the two links more efficiently than LCB.

CPU usage:
- RI2N's CPU utilization is high, yet it still provides lower CPU usage than LCB.

[Figure: throughput (MB/sec, up to 250) and CPU usage (%, up to 100) of put_block, get_block, and rewrite under single, LCB, and RI2N.]
NAS Parallel Benchmarks

NAS Parallel Benchmarks 3.3
- Class: B
- Number of processes: 4 and 8
- Benchmarks: 5 kernels (EP, FT, IS, MG, CG) and 3 applications (LU, BT, SP)

Environment (8 nodes, numbered 0-7, each connected to the two switches):
- CPU: Intel Xeon 5110, 1.6 GHz, dual core
- Memory: DDR2/800, 4092 MB
- OS: Linux 2.6.27.7
- NIC: Intel PRO/1000PT dual port, MTU = 1500
- NIC driver: e1000e 0.3.3.3-k
- MPI: OpenMPI 1.2.4
NAS Parallel Benchmarks (Results)

- The performance of RI2N is improved by the higher bandwidth.
- When the message size decreases, the relative performance becomes worse.
- The performance of LCB has not improved: it is worse than the Single environment even though the available bandwidth has increased.

[Figure: relative performance (normalized, 0-2.5) of EP, FT, IS, MG, CG, LU, BT, and SP for single, LCB, and RI2N on 4 and 8 nodes.]
Conclusion

RI2N:
- Provides high bandwidth and fault tolerance with multiple Ethernet links.
- Is implemented as a pseudo network interface on Linux, so a user can use the system transparently and existing applications can use it without recompilation.

Prototype implementation:
- The performance is moderately good; however, further improvement is still necessary.

RI2N can be applied not only to inter-node communication such as MPI but also to any UNIX network service that requires a high-speed, reliable network.
Thank you for your kind attention