A Comparison of Network Adapter Based Technologies for Workstation Clustering

Haakon Bryhni and Knut Omang

Department of Informatics, University of Oslo, Box 1080 Blindern, N-0316 Oslo, Norway

{bryhni, knutog}@ifi.uio.no

Abstract

We present a comparative performance evaluation of three interconnect technologies used for workstation clustering. We compare standard Ethernet, ATM (Asynchronous Transfer Mode) and SCI (Scalable Coherent Interface). The analysis is done using a set of throughput and latency microbenchmarks. When the standard operating system interface is used to pass messages over the interconnect, significant time is spent in the operating system, giving unnecessarily high latency for short messages. Using SCI, we show an alternate implementation of a message passing protocol that allows queuing of messages, in which some of the operating system overhead factors are reduced or removed, and we observe a significant improvement in latency. Our current SCI and ATM hardware achieve roughly comparable peak point-to-point throughput for large messages, but ATM outperforms this implementation of SCI for medium-sized messages when large TCP windows are used. However, SCI offers an order of magnitude lower latency than ATM for message passing and shows more promising results with respect to cumulative throughput of the interconnect.

1 Introduction

ATM interfaces are gaining popularity in backbone and high-end workstation applications. The switched architecture and dedicated bandwidth are important features in scalable high-speed networks, but the message passing nature of ATM gives a latency penalty. Latency is an important parameter in clustering applications, and we present a detailed study of both throughput and latency using current ATM technology (155 Mbit/s). SCI (IEEE Standard 1596 [7]) defines a standard for high-speed interconnection intended to extend the system bus in multiprocessor architectures. The SCI interconnect can be used for workstation clustering and even local area network applications, as long as the distance between hosts is short enough to adhere to the strict timing requirements of SCI. The current SCI hardware is characterized by very low latencies for short messages and high bandwidth for large messages. Our microbenchmarks measure throughput with varying message sizes, low-level latency for minimal messages, and user-level latency including queuing and dequeuing of requests. It is important to note that TCP over ATM or Ethernet offers a full transport protocol service, while SCI, for instance, leaves buffer-level management to the user-level application.

The cluster testbed used in this work consists of 4 SparcStation 20 (SS20) workstations, each with a 75 MHz SuperSparc processor, 1 MB second-level cache and 64 MB RAM. The cluster nodes are equipped with Dolphin Sbus-1/SCI [2] and Fore Systems ASX200 ATM [3] Sbus adapter cards, running Solaris version 2.4 and ForeThought version 3.0.2. We use the GNU C compiler version 2.7.2, and all benchmarks are compiled with -O3 optimization. The prototype is connected to the department's local area network on a 10 Mbit/s Ethernet segment. ATM link speed is 155 Mbit/s over multimode fiber. The SCI interconnect uses a ring topology with point-to-point links and a link speed of 125 Mbytes/s (SCI protocol overhead not included). Measurements for the ATM case were collected using Solaris TCP/IP over AAL5 and ATM between two of the workstations, directly connected by multimode fiber. The current Sbus-1/SCI interface cards use a parallel copper interconnect and provide a subset of the SCI protocol that offers two communication methods to the user level: 1) support for message passing through the read and write system calls, offering aligned DMA transfers; this interface is denoted the raw read/write interface in the rest of the paper; 2) support for setup and use of user-level shared memory segments.

2 Point-to-point throughput measurements

Peak throughput (regardless of message size) is interesting for applications that intend to transfer large amounts of data, while throughput for different message sizes is interesting for applications that have a specific pattern of transfers (for instance to implement the TCP/IP protocol). We have implemented a microbenchmark that uses read and write system calls to send user data buffers of different sizes, and records the elapsed time executing the tests as well as CPU usage at the reader and writer. We ran this test between two SS20s using all three network technologies. Throughput measurements are shown in figure 1. All TCP/IP measurements were done with the TCP_NODELAY option of the TCP protocol, thus disabling Nagle's algorithm and minimizing latency. All tests were repeated a large number of times to minimize errors introduced by operating system intervention.

We measure Ethernet throughput to be about 1 MByte/s, which is very close to the link capacity of 10 Mbit/s. For small messages below 512 bytes, Ethernet outperforms both ATM and SCI in terms of throughput! Using the default ATM configuration, we measured almost 6 MByte/s throughput. For user buffer sizes of 1K, ATM performs better than SCI. However, if the TCP window size is increased above the default value, ATM outperforms SCI over a long range of message sizes, as discussed in the next section. In the SCI case we used both supported techniques, DMA and programmed I/O. SCI DMA can only be used for 64-byte-aligned data in multiples of 64 bytes, so all measurements are taken using such multiples. In the current (prototype) version of the SCI interface driver, all transfers where the data is unaligned or the message size is not a multiple of 64 bytes are copied using the less efficient 1-byte SCI transactions. Since the cost of sending a single byte equals the cost of sending up to 8 bytes in one transaction, we have estimated the behaviour if the driver were using 8-byte transactions instead. A future optimized driver will of course use the most efficient transfer technique at all times.
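To make the benchmark structure concrete, the sketch below shows a sender-side loop of the kind described above for the TCP case. It is only an illustration under our own assumptions (function names and constants are ours): socket setup, the reader side and the CPU-usage accounting are omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Write 'count' user buffers of 'size' bytes to the connected socket
 * 'fd' and report the resulting throughput in MByte/s. */
static void measure(int fd, size_t size, int count)
{
    char *buf = malloc(size);
    int one = 1, i;
    double t0, t1;

    memset(buf, 0xaa, size);
    /* Disable Nagle's algorithm, as in the measurements above. */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    t0 = now_sec();
    for (i = 0; i < count; i++) {
        char *p = buf;
        size_t left = size;
        while (left > 0) {                  /* write() may be partial */
            ssize_t n = write(fd, p, left);
            if (n <= 0) { perror("write"); exit(1); }
            p += n;
            left -= n;
        }
    }
    t1 = now_sec();
    printf("%8lu bytes: %.2f MByte/s\n", (unsigned long)size,
           (double)size * count / ((t1 - t0) * 1e6));
    free(buf);
}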

[Figure 1 plots throughput (MByte/s) against user buffer size. Left panel (64 bytes to 128K, logarithmic x scale): Ethernet/TCP/IP, ATM/TCP/IP (default), SCI (DMA transfers), SCI with 1-byte stores, and SCI with 8-byte stores (estimated). Right panel (4 to 64 Kbytes): ATM with 32K TCP window, ATM with 16K TCP window, and ATM default.]

Figure 1: Average throughput for Ethernet, ATM and SCI between two SS20s (logarithmic x scale to the left; the rightmost figure shows the effect of changing the TCP/IP window size for ATM).

Experience with TCP/IP over ATM under SunOS 4.1.3 and reported work [4] suggested that the TCP window size is an important parameter with regard to performance. Running a set of experiments varying the TCP send buffer (SO_SNDBUF) and receive buffer (SO_RCVBUF), we found significant performance improvements. The send and receive buffers were set to the same size on each side of the connection. Medium TCP window sizes (16 and 32 Kbytes) gave a constant throughput gain (approximately 130% for 16K and 170% for 32K), with peak throughputs for large user buffer sizes of 9 MB/s and 11 MB/s respectively. The default ATM throughput, where the TCP window size is not set explicitly, is plotted in the same graph for comparison.
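Such an experiment amounts to setting the socket buffer sizes on both ends before the connection is established. The helper below is a small sketch using standard socket options; the function name is ours.

#include <sys/socket.h>

/* Set both TCP buffer sizes to 'bytes' (e.g. 16*1024 or 32*1024),
 * which bounds the offered TCP window.  Must be called on both sides
 * before connect()/accept().  Returns 0 on success. */
static int set_tcp_window(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}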

3 Cumulative throughput

To investigate the cumulative bandwidth of the interconnect we use a small microbenchmark with two threads in each application: a writer thread that writes to one node and a reader thread that reads from another, such that all the processors together make up a ring. The ATM interconnect has a theoretical bandwidth of 155 Mbit/s = 19.4 Mbytes/s per link, while the SCI interconnect has a theoretical bandwidth of 1 Gbit/s = 125 Mbytes/s. We run this program on 2 and 4 SCI nodes and (since we presently only have 2 ATM cards) between two ATM nodes using a back-to-back connection. The results are presented in figure 2. For comparison, the point-to-point throughput measured in section 2 is about 12 Mbytes/s for both SCI and ATM. For SCI we get a peak cumulative throughput of 22 Mbytes/s for 2 nodes and more than 42 Mbytes/s for 4 nodes. For ATM the peak cumulative throughput for 2 nodes is around 14 Mbytes/s. Consequently, this benchmark exhibits an almost linear throughput increase for SCI from 2 to 4 nodes, but only a small increase for TCP/IP over ATM. SCI performance seems unaffected by the physical placement of the nodes in the ring.
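The per-node thread structure of this benchmark can be sketched as follows, assuming POSIX threads; the transfer loops themselves are the read/write loops of section 2 and are omitted, and all names are ours.

#include <pthread.h>
#include <stddef.h>

/* One connection to the downstream neighbour (written to) and one from
 * the upstream neighbour (read from); together the nodes form a ring. */
struct endpoint {
    int fd;          /* connected socket or SCI descriptor */
    size_t size;     /* user buffer size                   */
    int count;       /* number of buffers to transfer      */
};

static void *writer_thread(void *arg)
{
    /* send loop on ((struct endpoint *)arg)->fd, as in section 2 */
    (void)arg;
    return NULL;
}

static void *reader_thread(void *arg)
{
    /* matching receive loop on ((struct endpoint *)arg)->fd */
    (void)arg;
    return NULL;
}

/* Each node runs both threads concurrently and waits for completion. */
static void run_node(struct endpoint *downstream, struct endpoint *upstream)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer_thread, downstream);
    pthread_create(&r, NULL, reader_thread, upstream);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
}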

[Figure 2 plots cumulative throughput (MByte/s) against user buffer size (64 bytes to 128K) for SCI with 2 nodes, SCI with 4 nodes, ATM with 2 nodes (default), and ATM with 2 nodes (32K TCP buffer).]

Figure 2: Average cumulative throughput for 2 and 4 nodes.

One of the possible reasons for the drop in the ATM case could be that the CPU is saturated. We have not been able to collect the CPU time spent in this program properly, since timing of thread-based programs is undocumented in Solaris 2.4, but previous timings of both ATM/TCP/IP and SCI in the point-to-point case show no big difference in CPU time spent. We hope to be able to run this benchmark on a 4-node ATM cluster to see whether it is the ATM interconnect, the node, or its interface that is the real bottleneck in the ATM case. For the SCI case it is quite clear that the current bottleneck is the interfaces to the Sbus and not the ring itself.

4 Latency measurements

We have measured pure average latency for minimal synchronous messages without the overhead of any protocol (ping-pong tests), as well as the latency of a minimal message passing protocol for the different interconnects. The SCI interfaces have hardware support for shared memory, that is, special regions of local memory may be set up for direct hardware access from a remote computer through SCI. For the ATM and Ethernet cases, ordinary message passing with minimal (4 byte) messages over TCP/IP is used. The results are displayed in table 1.

Ethernet/TCP   ATM/TCP   ATM/SSAM   SCI
461.46         279.83    23         3.84

Table 1: Latency (in µs) of one-way remote store.

The latency of a one-way remote store measured over Ethernet and ATM is more than an order of magnitude larger than over SCI; however, recent work on Active Messages (SSAM, SparcStation Active Messages [8]) has demonstrated techniques for bypassing the traditional protocol stack.
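The SCI figure in table 1 is obtained from plain stores into a remotely mapped memory segment. One common way to measure such a number is a ping-pong over two flag words, halving the round-trip time; the sketch below assumes the shared segments have already been set up with the vendor's mapping API (not shown), and the names are ours.

#include <sys/time.h>

/* 'my_flag' is a word in local memory that the remote node stores into;
 * 'remote_flag' is the corresponding word in the remote node's memory,
 * mapped into our address space so a plain store becomes a remote store.
 * The peer runs the mirrored loop: it waits for each value and echoes it. */
static volatile int *my_flag;
static volatile int *remote_flag;

/* Returns the average one-way latency in microseconds. */
static double pingpong(int iterations)
{
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 1; i <= iterations; i++) {
        *remote_flag = i;            /* remote store to the other node */
        while (*my_flag != i)        /* spin until the echo arrives    */
            ;
    }
    gettimeofday(&t1, NULL);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_usec - t0.tv_usec)) / (2.0 * iterations);
}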

To complement the one-way remote store measurements, we have implemented a simple higher level protocol for communicating variable-sized messages through a buffer in shared memory, without any intervention by the operating system. Instead, synchronization is done using a read pointer, a write pointer and a ring buffer structure. Consistency is ensured by requiring one buffer structure for each unidirectional message channel, and by making sure that the read head is only updated by the reader, while the write head is only updated by the writer. Since the remote load operation is the most time consuming, the communication buffer is placed in memory local to the reader, exploiting locality since the memory system is I/O coherent. In addition, the write and read pointers are placed locally to where they are read. This microbenchmark measures the round-trip latency of 4-byte messages sent 1) using the interrupt and operating system based raw read/write interface and 2) using our library of functions for message passing in user-level shared memory.

Ethernet    ATM       SCI           SCI
TCP/IP      TCP/IP    read/write    shared mem.
461.46      279.83    185.5         17.6

Table 2: Message passing latency (in µs) using high-level protocols compared to SCI.

The numbers indicate, for minimal messages, that the latency of the shared memory solution is an order of magnitude lower than in the raw read/write case, and certainly also lower than TCP over any of the other interconnect technologies. The problem in a multitasking environment is to find an efficient way of avoiding too much busy waiting, which wastes CPU cycles and increases interconnect traffic. The numbers for shared memory are from an implementation in which all waiting code actively spins on a shared pointer written by the remote process. The read pointer is located in the local memory of the writer and the write pointer in the local memory of the reader. Reducing the CPU waste is more complicated. We have experimented with a passive delay after some initial accesses, but this method has so far only led to both sides waiting for the other.
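The following is a minimal sketch of such a single-writer/single-reader ring buffer. It is our illustration rather than the actual library code: the shared mappings are assumed to be in place (the data buffer and write head in the reader's memory, the read head in the writer's memory), each slot carries a fixed maximum payload for brevity, and all structure and function names are ours.

#include <string.h>

#define RING_SLOTS 64
#define SLOT_SIZE  256   /* maximum message size in this sketch */

/* Placed in the reader's local memory and mapped by the writer, so the
 * writer's stores into it are remote stores while the reader's loads
 * are local.  The write head is updated only by the writer. */
struct ring {
    volatile unsigned write_head;
    char slots[RING_SLOTS][SLOT_SIZE];
};

/* Writer-side state.  The read head lives in the writer's local memory
 * and is updated remotely by the reader, so the writer never performs
 * a remote load; 'write_head' is the writer's local shadow copy. */
struct writer {
    struct ring *ring;              /* remote mapping of the reader's ring */
    volatile unsigned *read_head;   /* local memory, written by the reader */
    unsigned write_head;
};

static int ring_send(struct writer *w, const void *msg, unsigned len)
{
    if (len > SLOT_SIZE)
        return -1;
    if (w->write_head - *w->read_head == RING_SLOTS)
        return -1;                                   /* full: caller retries */
    memcpy(w->ring->slots[w->write_head % RING_SLOTS], msg, len);
    /* On real hardware a store barrier/flush may be needed here so the
     * data is visible before the head is published. */
    w->write_head++;
    w->ring->write_head = w->write_head;             /* publish the slot */
    return 0;
}

/* Reader-side state.  The ring is local; 'remote_read_head' maps to the
 * read head word in the writer's memory, 'read_head' is a local shadow. */
struct reader {
    struct ring *ring;
    volatile unsigned *remote_read_head;
    unsigned read_head;
};

static int ring_recv(struct reader *r, void *msg, unsigned maxlen)
{
    unsigned rd = r->read_head;
    if (rd == r->ring->write_head)                   /* local load */
        return -1;                                   /* empty: caller retries */
    memcpy(msg, r->ring->slots[rd % RING_SLOTS],
           maxlen < SLOT_SIZE ? maxlen : SLOT_SIZE);
    r->read_head = rd + 1;
    *r->remote_read_head = r->read_head;             /* remote store: free slot */
    return 0;
}

Because each side loads only words in its own memory and uses remote stores to publish progress, the expensive remote load is avoided, which is exactly the placement argument made above.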

5 Conclusion

Traditional Ethernet is still competitive in terms of throughput and CPU utilization for transfers of medium-sized packets, due to many years of tuning and optimization. The latency for round-trip measurements, however, is an order of magnitude higher than for more dedicated clustering technologies like SCI. For small packets, where SCI benefits from its hardware support for shared memory, SCI is faster even compared with novel techniques like Active Messages over ATM. The throughput of TCP/IP over ATM is closely related to parameters such as the TCP window size, and improves with increasing packet size. The throughput results using TCP/IP for single point-to-point transfers over standard Ethernet are still quite competitive with more recent interconnect technologies for message sizes below 512 bytes. For message sizes beyond 1 Kbyte, ATM and SCI totally outperform Ethernet, but due to system overhead we never approach the limits of the physical links. For SCI, the maximum achieved throughput for point-to-point connections is 14 Mbytes/s, while in an ordinary input/output situation the maximum throughput for large messages drops to 12 Mbytes/s. The throughput of the SCI interconnect also increases with increasing message size. ATM peaks at 12 Mbytes/s already for medium-sized messages (above 8K). Furthermore, SCI exhibits an almost linear increase in cumulative throughput, while this is not the case for ATM.

Round-trip timing shows that the TCP/IP protocol stack adds dramatically to the latency of small message transfers. Since latency is a critical parameter for workstation clustering, more efficient protocols such as Active Messages must be employed over ATM to minimize latency. The limitation in our system lies primarily in the design of the DMA machine in the Sbus-1/SCI adapter, since both the system bus and the SCI interconnect can sustain far higher throughput. We measure SCI latency for individual load/store operations of 4-5 µs, and 196 µs for a standard interrupt based message passing protocol. We have presented an alternate implementation of a message passing protocol that achieves a latency of 17 µs by doing message passing entirely in user space.

The SCI experiments in this paper were collected using the first version of Dolphin ICS's Sbus-1/SCI adapter. Currently, the next version of the interface is undergoing hardware alpha tests and will improve SCI performance both in terms of throughput and latency. As we have seen, high-performance interconnect technologies enable interconnection of workstations into clusters that compete with traditional SMP performance in terms of remote memory latency and memory bandwidth, at least when price/performance is taken into consideration. The distributed approach to supercomputing leverages the rapid development of desktop systems, and enables more powerful computers using high-speed, low-latency interconnects like SCI and ATM.

References

[1] H. Bugge and P. O. Husøy. Dedicated Clustering: A Case Study. In Proceedings of the Fourth International Workshop on SCI-based High-Performance Low-Cost Computing, Oct. 1995.
[2] Dolphin Interconnect Solutions. SBus-to-SCI Adapter User's Guide, v. 1.8, 1995.
[3] FORE Systems. ForeRunner SBA-100/200 ATM SBus Adapter User's Manual, 1994.
[4] M. K., K. E., and K. Ø. TCP/IP behavior in a high-speed local ATM network environment. In Proceedings of the 19th IEEE Conference on Local Computer Networks, Minneapolis, Oct. 1994.
[5] E. Klovning and H. Bryhni. Design and performance evaluation of a multiprotocol SCI LAN. Technical Report TF R 45/94, Telenor Research, Kjeller, 1994.
[6] K. Omang. Preliminary Performance Results from SALMON, a Multiprocessing Environment based on Workstations Connected by SCI. Available at http://www.ifi.uio.no/~sci/papers.html.
[7] IEEE Standard for Scalable Coherent Interface (SCI), IEEE Std 1596, Aug. 1993.
[8] T. von Eicken, A. Basu, and V. Buch. Low-Latency Communication over ATM Networks Using Active Messages. IEEE Micro, 15:46-53, Feb. 1995.