
Exploiting Fast Ethernet Performance in Multiplatform Cluster Environment

Sándor Juhász
Budapest University of Technology and Economics
Department of Automation and Applied Informatics
1111 Budapest, Goldmann György tér 3, Hungary
+36-1-4631648
[email protected]

Hassan Charaf
Budapest University of Technology and Economics
Department of Automation and Applied Informatics
1111 Budapest, Goldmann György tér 3, Hungary
+36-1-4633969
[email protected]

ABSTRACT
As the communication subsystem largely determines the overall performance and the characteristics of cluster systems, it must face diverging demands such as bandwidth, latency, quality of service and cost. In this paper we investigate the performance and improvement possibilities of a portable TCP/IP based communication subsystem that aims to integrate heterogeneous nodes. The cluster is built up from standard PCs connected by a low-cost network, where nodes may have different processor speeds and memory sizes, and may even run different operating systems. We present and compare application level end-to-end latencies measured under different conditions, varying the number of simultaneous connections, the number of processing threads and the type of operating system. Our experiments show that message latencies are overwhelmingly dominated by software overheads, which can be hidden or eliminated by different methods, so PC clusters can take good advantage of the bandwidth of a Fast Ethernet connection even with smaller message sizes. Finally, based on the results, we draw attention to a domain of inaccuracy of the standard communication models in PC cluster environments, and we suggest a new formula to describe the latency of concurrent message channels over the same medium.

Categories and Subject Descriptors
C.2.4 [Distributed and parallel systems]: Special software requirements, design issues, and real-time and safety-critical systems.

General Terms
Measurement, Performance.

Keywords
Communication performance, Cluster of workstations, Parallel TCP/IP channels, Performance modeling.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’04, March 14-17, 2004, Nicosia, Cyprus Copyright 2004 ACM 1-58113-812-1/03/04...$5.00

1. INTRODUCTION

As the processing speed of regular workstations has been gradually increasing, and interconnection networks of low cost and high bandwidth have become widespread, clusters have become serious competitors of traditional supercomputers. This trend is clearly visible in the list of the fastest supercomputers of the world [12], where most of the entries are clusters of scalar or vector processor computers.

Two major obstacles related to the communication subsystem keep clusters from dominating the high performance computing market [11]. First, proprietary systems designed especially for a given hardware will always provide better performance than a generic interconnection, but of course at a higher cost. This prolongs the survival of massively parallel supercomputers in the domain of communication intensive tasks. The second difficulty is the semantic gap in programming paradigms, which hinders the direct porting of parallel algorithms between different parallel systems (clusters of workstations, SMPs, NUMA and massively parallel architectures). The relatively high latencies in clusters force the usage of message based programming and call for coarser computation granularity. As a result, different algorithms are usable in shared memory and in message passing based systems, so algorithms must be tuned to every single environment type (e.g. in clusters of SMPs optimal algorithms must be aware of, and make use of, the two different levels of separation [1]).

As communication plays a significant role in the overall performance, it must be characterized, measured and modeled. Section 2 introduces the main performance metrics, compares the most widespread interconnection possibilities (Fast Ethernet, Gigabit Ethernet, SCI, Myrinet, ATM), and presents the widely used linear communication model with its most important improvements, advantages and drawbacks. Section 3 outlines the importance of application level end-to-end performance evaluation, introduces a platform independent implementation of raw message passing, describes some methods of measurement and performance improvement, and finally presents performance curves obtained with different numbers of connections and processing threads in Windows and Linux environments. Section 4 summarizes our work; we draw conclusions showing an important domain of inaccuracy of the most widespread communication performance models, and in the end suggest some improvement possibilities for the current models.


2. COMMUNICATION PERFORMANCE

Fast Ethernet, Gigabit Ethernet, SCI, ATM, and Myrinet are the five most popular interconnection technologies used for building cluster systems. Fast Ethernet is a cheap LAN technology delivering a bandwidth of 100 Mbit/s, while maintaining the original Ethernet transmission protocol (CSMA/CD). TCP/IP is the most widely used protocol for Fast Ethernet, although other protocols, such as VIA (Virtual Interface Architecture), may also be used to enhance the performance further [8]. Table 1 summarizes the most important hardware properties of the different interconnects [3].

Table 1. Summary of interconnect specifications

                         Fast Ethernet   Gigabit Ethernet   SCI        ATM        Myrinet
network structure        bus             bus                switched   switched   switched
one-way latency          20 µs           20 µs              5 µs       120 µs     5 µs
accessible bandwidth     100 Mbps        1 Gbps             4 Gbps     155 Mbps   1.2 Gbps
cable length per link    200 m           200 m              10 m       100 m      10 m

The most important communication performance metrics are latency and bandwidth. Bandwidth characterizes the maximum number of items that can be transferred during a unit of time, and latency is the total time a message takes to travel from its source to its destination. For the hardware medium these two metrics are well defined and strongly related, but at the end-to-end application level they become completely different and independent of each other because of the software overhead involved. To ensure reliable data transfer, protocol stack implementations like TCP/IP usually require data to be copied several times among the layers, and the communicating nodes exchange several protocol-related messages during the transmission. The number of protocol layers traversed, the data copies, context switches, timers and the scheduling policy of the operating systems directly contribute to this software overhead [8]. In the case of indirect connections the routing algorithm and the number of message hops also have a high performance impact.

The latency function, attributing time delays to different message sizes, proved to be the most useful way to describe communication performance. This function can be measured directly for the hardware medium, for different levels of the message passing subsystem, or at application level between the communicating tasks. To avoid clock synchronization problems, the different values are usually determined with a ping-pong benchmark measuring the time a message travels forth and back through the underlying layers. The latency values given in Table 1 are the minimal amount of time needed for sending a zero-length message, while the bandwidth values characterize the maximal transfer speed of larger data pieces. The most generally used model to describe delays supposes a linear connection between the transfer time and the message length [2][6][10]:

    t_c(n) = t_0 + n t_d = t_0 + n / b_{eff}    (1)

where t_c is the communication time, n is the message length, t_0 is the setup time, and t_d is the time necessary to transfer a unit of data, which is the reciprocal of the effective bandwidth b_{eff}. The linear model is very popular because it is simple and very accurate in most cases. The length related coefficient includes the time of traveling through the network, and also that of message copies through the buffers. The constant part covers the basic network latency, the setup time of the message passing system, and the operating system related overheads of both the sender and the receiver side [4]. If the message transfer touches intermediate nodes (hierarchical buses, mesh, hypercube etc. layouts, network switches), (1) can be rewritten to reflect the new situation [13]:

    t_c(n, h) = t_0 + h t_1 + n t_d = t_0 + h t_1 + n / b_{eff}    (2)

where h is the number of hops during the message transfer, and t_1 is the time delay associated with one hop. Another source of inaccuracy is the shared use of the same communication medium (e.g. TCP/IP). The competition of different message transfers for the same bandwidth is modeled by dividing the effective bandwidth among the competing transfers over the same medium:

    t_c(n, s) = t_0 + n s t_d = t_0 + n s / b_{eff}    (3)

where s is the number of simultaneous transfers. Of course (2) and (3) can also be combined, although it is then much more difficult to interpret the concept of competition.

Previous works [5][9] showed that the overall performance of a parallel program depends much more on the software overhead (operating system scheduling, communication pattern of the application, inefficiencies in the implementation of the communication subsystem) than on any other physical characteristic of the hardware medium. Parallel software performance is particularly insensitive to the physical bandwidth, except for applications where a large number of large messages are exchanged. That is why the linear communication model can give misleading results in clusters with commodity interconnects. As a multilevel protocol stack is used in clusters, where each layer adds its envelope to the message packets, the effective bandwidth also varies with the message size (e.g. Figure 1/d, ideal maximal TCP bandwidth). The effects of software overheads are more significant for small message sizes, and at the same time there is no real competition for bandwidth at the hardware medium level, so there is no point in dividing the bandwidth for small messages. Although both problems are related to the domain of small messages, it is important to note that small messages are frequently used in cluster environments to distribute control and synchronization information. As will be seen from the experimental results, the blurred concept of small messages can cover an important domain of message sizes, depending on the amount of overheads.
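To make models (1)-(3) concrete, the following sketch evaluates them for a range of message sizes. All parameter values (t_0 = 100 µs, t_1 = 10 µs, b_eff = 100 Mbit/s) are illustrative assumptions chosen for demonstration, not measured data from this paper:

```python
# Illustrative evaluation of the latency models (1)-(3); all parameter
# values below are assumptions for demonstration, not measured data.

T0 = 100e-6             # setup time t0 [s] (assumed)
T1 = 10e-6              # per-hop delay t1 [s] (assumed)
B_EFF = 100e6 / 8       # effective bandwidth b_eff [bytes/s], Fast Ethernet

def t_linear(n):
    """Equation (1): t_c(n) = t0 + n / b_eff."""
    return T0 + n / B_EFF

def t_hops(n, h):
    """Equation (2): t_c(n, h) = t0 + h*t1 + n / b_eff."""
    return T0 + h * T1 + n / B_EFF

def t_shared(n, s):
    """Equation (3): t_c(n, s) = t0 + n*s / b_eff."""
    return T0 + n * s / B_EFF

for n in (64, 1024, 65536, 1048576):
    print(f"{n:>8} B: (1) {t_linear(n) * 1e3:8.4f} ms   "
          f"(2, h=2) {t_hops(n, 2) * 1e3:8.4f} ms   "
          f"(3, s=4) {t_shared(n, 4) * 1e3:8.4f} ms")
```

Running it shows the issue discussed above: for a 64-byte message, model (3) multiplies a transfer term that is already negligible next to t_0, so dividing the bandwidth among competing channels barely changes the predicted delay even though the measured small-message behavior differs substantially.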

3. PERFORMANCE MEASUREMENTS

Latency values can be measured at different levels: the closer a layer is to the application level, the more it tells about the real performance of a parallel application. Developers of communication subsystems usually give the raw performance data of their subsystem; e.g. Chiola and Ciaccio in [5] report the minimal latency and maximal bandwidth data summarized in Table 2. The measurements were performed between two Pentium II 300 MHz PCs equipped with 3COM Fast Ethernet adapters running the Linux operating system.


Table 2. Performance data of communication subsystems

                        MPICH (an MPI impl.)        U-Net                       GAMMA
NIC access model        from kernel using           user level using            active ports from kernel
                        TCP/IP stack                U-Net protocol              using GAMMA protocol
min. one-way latency    159 µs                      30 µs                       14 µs
accessible bandwidth    96.8 Mbps                   96.8 Mbps                   97.6 Mbps

The latency and bandwidth numbers, which are often used for "ranking" messaging systems, are hardly related to the real performance delivered by the messaging subsystem to the applications [5]. This difference is mainly due to the scheduling policy of the underlying operating system, and partially to the tendency of pushing the overhead (initialization, message preparation and packing) to the application level. For realistic parallel performance modeling, message latencies must be measured at the application level, directly in the communicating tasks.

Another important property of message passing systems is portability. It is extremely difficult to build a system that has high performance and is portable at the same time. MPI, for example, defines a standard interface for communication between parallel tasks, which ensures the portability of the sources, but unfortunately the performance is not ported with the applications. Many MPI implementations are known to have weak points in their collective communication primitives, forcing programmers to avoid certain routines and to implement them at the application level [5]. This practice makes the usefulness of the complexity introduced with the collective communication primitives arguable. Other messaging systems such as U-Net or GAMMA prefer performance to portability. Using direct programming of the NIC (Network Interface Card) they obtain better results (Table 2), but to achieve any portability the whole system must be rewritten for every single network card type and for each operating system.

Our research group launched a cluster project named Pyramid [7] in 2001, aiming to unite the computing resources of student laboratories during their idle periods. The test cluster is used to experiment with parallel algorithms and to build performance models for execution time prediction. In the academic environment, where the computer laboratories providing the potential "raw material" include heterogeneous computers and operating systems, portability is a primary issue. That is why standard TCP/IP communication provided by the host operating system was chosen as the communication protocol, while platform independence is achieved by using wxWindows [14], a free operating system abstraction layer.

3.1 Basic Performance

To allow communication and computation to overlap, message sending in Pyramid is fully asynchronous from the point of view of the tasks: the send primitive only initiates the transfer and then returns, and after that it is up to the node manager to take care of the message sending. When using TCP/IP for message passing, the most obvious solution is to open a connection to the destination node as soon as a new sending request appears, transfer the message, and close the connection to free the associated system resources.
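As an illustration of this scheme (and not of the actual Pyramid code, which is built on wxWindows), the following sketch shows a send primitive that merely enqueues the message, and a node-manager thread that performs the connect-transfer-close cycle; all names and the length-prefixed framing are assumptions:

```python
# Sketch of an asynchronous send: the caller only enqueues the message;
# a node-manager thread opens a TCP connection per message, transfers
# it, and closes the socket again. Illustrative, not the Pyramid code.
import queue
import socket
import struct
import threading

_outbox = queue.Queue()

def send(dest_host, dest_port, payload):
    """Send primitive: initiate the transfer, then return immediately."""
    _outbox.put((dest_host, dest_port, payload))

def _node_manager():
    """Performs the actual transfers, one rebuilt connection per message."""
    while True:
        host, port, payload = _outbox.get()
        with socket.create_connection((host, port)) as sock:
            # length-prefixed framing so the receiver knows the message size
            sock.sendall(struct.pack("!I", len(payload)) + payload)

threading.Thread(target=_node_manager, daemon=True).start()
```

The "open socket" variant measured below changes only the manager loop: connections are cached per destination instead of being rebuilt and torn down for every message.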

[Figure 1. Comparing delay (a, c) and throughput (b, d) curves of rebuilt and continuous TCP connections under Windows and Linux, using linear (a, b) and logarithmic (c, d) scales. The panels plot transfer time [ms] and throughput [Mbit/s] against message size; panel d also shows the ideal TCP throughput.]

In the tests we used a pair of Pentium IV 2.26 GHz PCs equipped with Intel 82801DB PRO/100 VE network adapters connected through a 3Com SuperStack 4226T switch; the results of the ping-pong benchmark are reported as "basic transfer times" in Figure 1. The message delays measured on the Windows XP and Red Hat Linux (kernel 2.4.20) operating systems are presented twice, using linear and logarithmic scales. All measurements were made at the user task (application) level, and the results are derived from the average time of 50 roundtrips (the messages traveled 100 times through the network). The bandwidth (throughput) values are directly calculated by dividing the message length by the transfer time.
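A minimal ping-pong benchmark of the kind described above can be sketched as follows. The port number and the length-prefixed framing are assumptions, and this is an illustration of the measurement method rather than the actual Pyramid test program:

```python
# Minimal TCP ping-pong benchmark sketch: measures the average roundtrip
# time of a message and derives one-way delay and throughput from it.
import socket
import struct
import sys
import time

PORT = 5500            # assumed port number, not from the paper
ROUNDTRIPS = 50        # 50 roundtrips: the message crosses the wire 100 times

def recv_exact(sock, count):
    """Read exactly count bytes (TCP may deliver data in pieces)."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def echo_server():
    """Echo each length-prefixed message back unchanged."""
    with socket.create_server(("", PORT)) as srv:
        while True:
            conn, _ = srv.accept()
            with conn:
                try:
                    while True:
                        (size,) = struct.unpack("!I", recv_exact(conn, 4))
                        conn.sendall(struct.pack("!I", size) + recv_exact(conn, size))
                except ConnectionError:
                    pass

def ping(host, size):
    """Average one-way latency and throughput for a given message size."""
    msg = struct.pack("!I", size) + b"x" * size
    with socket.create_connection((host, PORT)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(ROUNDTRIPS):
            sock.sendall(msg)
            (echoed,) = struct.unpack("!I", recv_exact(sock, 4))
            recv_exact(sock, echoed)
        one_way = (time.perf_counter() - start) / (2 * ROUNDTRIPS)
    print(f"{size:>8} B: {one_way * 1e3:8.3f} ms one-way, "
          f"{size * 8 / one_way / 1e6:6.1f} Mbit/s")

if __name__ == "__main__":
    if sys.argv[1] == "server":
        echo_server()
    else:
        for size in (64, 1024, 65536, 1048576):
            ping(sys.argv[1], size)
```

One instance runs as the echo server, the other as the client; dividing the message length by the measured one-way time yields the throughput curves of Figure 1.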

(having as low as 190 µs delay at application level for message sizes smaller than 64 bytes). The Linux performance came much closer to that of the Windows implementation. The message time differences, being integer multiples of 5 ms, suggest that the Linux scheduler is responsible for the higher latencies. As we build entirely on operating system services, better latencies could be achieved by kernel tuning, but that would be inappropriate here, as we are trying to achieve portable messaging performance without external intervention.

3.3 Parallel Connections

Using multiple sockets and threads at the same time can further reduce the average scheduling and other software overheads. On each socket a completely independent ping-pong benchmark is run in parallel. Using more than one socket also allows taking advantage of the full duplex property of the communication channel. The throughput results of measurements S1-S25, using various numbers of sockets and threads (between 1 and 16), are summarized in Figure 2. When using multiple parallel sockets the throughput curves show no significant difference, although the Linux curves are smoother, and fewer irregularities are produced by the scheduling system. It is also important to note that the performance is practically independent of the number of processing threads in both cases, implying that a simple implementation using one thread is enough to provide full capacity. As far as the number of parallel connections is concerned, there is a significant performance gain in both operating systems for small messages.
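The parallel variant of the measurement can be sketched as below: each thread runs an independent ping-pong loop on its own socket, and the aggregate throughput is computed over the wall time of the slowest channel. Socket counts and message sizes are illustrative, and it assumes an echo server that serves each connection in its own thread (the sequential server sketched earlier handles one socket at a time):

```python
# Sketch of the parallel-connection measurement: s independent ping-pong
# loops run concurrently, each on its own TCP socket and thread.
# Parameters are illustrative; this is not the actual test program.
import socket
import threading
import time

def pingpong_worker(host, port, size, roundtrips, elapsed, idx):
    """One independent ping-pong loop against an echo server."""
    msg = b"x" * size
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(roundtrips):
            sock.sendall(msg)
            got = 0
            while got < size:               # the echo may arrive in pieces
                got += len(sock.recv(size - got))
        elapsed[idx] = time.perf_counter() - start

def parallel_benchmark(host, port, size, sockets=4, roundtrips=50):
    """Aggregate throughput of several concurrent ping-pong channels."""
    elapsed = [0.0] * sockets
    threads = [threading.Thread(target=pingpong_worker,
                                args=(host, port, size, roundtrips, elapsed, i))
               for i in range(sockets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Bits moved in both directions by all channels, over the wall time of
    # the slowest channel; blocking socket I/O releases the GIL, so Python
    # threads are adequate for this I/O-bound measurement.
    total_bits = sockets * roundtrips * 2 * size * 8
    print(f"{sockets} sockets: "
          f"{total_bits / max(elapsed) / 1e6:.1f} Mbit/s aggregate")
```

Comparing the aggregate figure against the single-socket result reproduces the kind of comparison summarized in Figure 2.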
