Porting MPICH ADI on GAMMA with Flow Control

Giovanni Chiola and Giuseppe Ciaccio
DISI, Università di Genova
via Dodecaneso 35, 16146 Genova, Italy
{chiola,ciaccio}@disi.unige.it

Abstract

The Genoa Active Message MAchine (GAMMA) is an experimental prototype of a light-weight communication system based on the Active Ports paradigm and designed for efficient implementation over low-cost Fast Ethernet interconnects. The original prototype implementation started in 1996 and obtained best performance by removing traditional communication protocols and implementing zero-copy send and receive. The price to pay for good performance was a Datagram QoS, with error detection and notification but no error recovery. Technology evolved, however, and currently all low-cost off-the-shelf NICs are based on Descriptor-based DMA (DBDMA) transfers. The DBDMA mechanism does not allow true zero-copy on message receive, and this implies a slight decrease in throughput for short messages. However, DBDMA allows an effective implementation of credit-based flow control, which introduces reliability with almost no additional impact on performance. The reliable version of GAMMA allows a very efficient porting of MPICH at the ADI level, with an unprecedented throughput curve on Fast Ethernet.
1. Introduction

Cluster computing currently represents the most cost-effective way of entering the high-performance computing niche. A Beowulf-type [18] cluster based on Intel microprocessors and Fast Ethernet interconnection can currently deliver about 250-300 MFLOPS per node on floating-point matrix computations, at a total cost of about US$ 1,000 per node, and can scale up to a few hundred nodes using off-the-shelf Fast Ethernet switching technology. From the application programming point of view, such low-cost processing power is easily exploited by means of appropriate, high-level, standard Application Programming Interfaces (APIs) such as, e.g., the Message Passing Interface (MPI). Several open-source implementations of MPI are readily available, such as, e.g., MPICH [4], which run on Beowulf-type
clusters on top of standard TCP Unix sockets. The Linux Operating System (OS) is known to provide one of the most efficient implementations of the TCP/IP protocol stack. However, the one-way latency of Linux TCP sockets heavily depends on the specific network driver being used, which in turn depends on which Network Interface Card (NIC) is adopted. Therefore, one may wonder why one should bother with a specialized messaging system for cluster computing instead of just adopting standard TCP/IP communication on good NICs, as done in most Beowulf-type clusters. We shall start by motivating the reasons for adopting a specialized low-level communication layer instead of an industry-standard one. Let us consider a very basic cluster configuration comprising two Pentium II 350 MHz PCs, each running Linux 2.0.36 and equipped with a DEC DE500-BA Fast Ethernet NIC. Suppose we connect the two PCs by a repeater hub (half-duplex mode) and run a simple ping-pong test to evaluate one-way latency and asymptotic bandwidth at the TCP socket level, after disabling the Nagle "piggybacking" algorithm. Under such experimental conditions one measures a one-way latency of about 58 µs and a peak throughput of 11.3 MByte/s, on average. The half-power point, that is, the message size at which half of the peak throughput is delivered [16], is found at around 720 bytes. The maximum throughput achieved by single-packet (1460 bytes long) messages is about 7.5 MByte/s. The complete throughput profile is depicted in Figure 4, curve 4. Recalling that the maximum theoretical bandwidth of Fast Ethernet is 12.5 MByte/s and assuming a lower bound for hardware latency of 7 µs, this means that with Linux sockets/TCP/IP on Fast Ethernet:
- latency is almost one order of magnitude worse than the hardware latency;
- efficiency in the range of short (single-packet) messages does not exceed 60%;
- efficiency is good (91%) only with very long data streams.
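For concreteness, the efficiency figures above are simply the measured throughput normalized to the 12.5 MByte/s theoretical maximum of Fast Ethernet:

```latex
\eta_{\mathrm{short}} = \frac{7.5\ \mathrm{MByte/s}}{12.5\ \mathrm{MByte/s}} = 60\%,
\qquad
\eta_{\mathrm{stream}} = \frac{11.3\ \mathrm{MByte/s}}{12.5\ \mathrm{MByte/s}} \approx 91\%.
```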
Hence, while Linux TCP sockets appear to be very good for traditional networking, they do not appear to be well suited for cluster computing, especially if we assume the CPU clock cycle as the time measurement unit instead of the usual microsecond. The scenario becomes even less favourable when considering a typical implementation of MPI for commodity clusters like MPICH [12, 4], whose throughput profile is reported in Figure 4, curve 5. An additional performance degradation is clearly shown, due to stacking the complex multi-layer MPICH communication system over TCP sockets. The latency at MPI level becomes as large as 131 µs, while the communication throughput does not reach 10 MByte/s, corresponding to a maximum efficiency of 83%. This inefficiency can only worsen with faster hardware devices, such as, e.g., Gigabit Ethernet.

The issue of efficiently supporting inter-process communication in a cluster of PCs connected by inexpensive 100base-T commodity technology has been addressed in several research projects [18, 21, 15, 19, 17, 11]. Good performance was usually obtained by removing layered communication protocols, often at the expense of the Quality of Service (QoS) offered to the API. Kernel-level implementations of reliable protocols such as the ones described in [19] never attained the performance of crude interfaces offering Datagram QoS, such as the former prototype of the Genoa Active Message MAchine (GAMMA) [10, 6]. However, with the evolution of networking technology towards more "intelligent" though not yet programmable low-cost NICs, it has become feasible to design and implement efficient low-level communication protocols with higher QoS through the adoption of suitable, low-overhead flow control mechanisms. Indeed, the latest prototypes of GAMMA running on more modern 100base-T NICs could provide virtually reliable yet high-performance cluster-wide communication by enhancing the original communication layer with a simple credit-based flow control mechanism.

GAMMA is a commodity cluster based on Personal Computers (PCs) running the Linux OS and 100 Mbit/s Fast Ethernet off-the-shelf technology. The Linux kernel has been enhanced with a communication layer implemented as a small set of additional light-weight system calls and a custom NIC driver. Most of the communication layer is thus embedded in the Linux kernel, the remaining part being placed in a user-level programming library. The adoption of an Active Message-like communication abstraction [20] called Active Ports [7] allowed a zero-copy optimistic protocol, with no need for either kernel-level or application-level temporary storage for incoming as well as outgoing messages. Multi-user protected access to the communication abstraction is granted. The GAMMA device driver is capable of managing both GAMMA and IP communication in the same 100base-T network, without incurring
additional overhead for IP.

The former prototype of GAMMA was developed for 3Com 3c595 and 3c905 NICs, yielding unbeaten user-to-user one-way communication performance (13 µs latency and 12.2 MByte/s asymptotic bandwidth, corresponding to 98% efficiency) even on a 100base-T cluster of obsolete Pentium 133 PCs. However, the former prototype of GAMMA offered little more than Datagram QoS, detecting communication errors (packet losses and corrupted packets) but then simply raising an error condition to the user process, leaving to the user (application as well as library writer) the task of programming suitable recovery policies, if needed. Indeed, the very low latency delivered by GAMMA could potentially allow the efficient implementation of a wide range of error recovery as well as explicit acknowledgement protocols. On the other hand, if a good switch and good class 5 UTP wiring are adopted for the LAN, packet corruption with subsequent CRC receiver failure almost never occurs. Assuming a bit error rate of less than 10^-11 (which, according to our experiments, appears to be quite realistic if standard wiring is adopted), an error-inducing frame loss is expected every 10^7 full-size frames (i.e., allowing for error-free data movements larger than 15 GBytes). Hence, the only significant cause of frame loss in the original GAMMA prototype was receiver overrun in case of a "slow" receiver, due to the absence of flow control.

The implementation of a low-overhead flow control mechanism for GAMMA appeared to be unfeasible with 3Com 3c595 and 3c905 NICs, due to the relatively small size of the NIC's on-board receive queue (a few KBytes) in which incoming frames are temporarily stored while waiting for CPU service. Indeed, with such a limited buffering capacity of the NIC, the only way of preventing overrun is to run a stop-and-wait flow control protocol with explicit packet acknowledgements, with an expected and unacceptable 13 µs additional latency per frame. Such a consideration justified the original choice in the GAMMA design to leave the reliability issue up to the user's responsibility, so that a "performance-oriented user" might have the chance of balancing the transfer rates among senders and receivers instead of using an inefficient flow control protocol.

One practical problem that has prevented the distribution of the former GAMMA prototype in the form of a Beta release was the discontinuation of commercial support and availability of the 3Com 3c595 and 3c905 NICs. Newer NICs distributed by 3Com interact with the host memory and CPU by a new data transfer mechanism called Descriptor-based DMA (DBDMA), found also in other products (such as the Digital 2114x chipsets, the Intel EtherExpress, and most Gigabit Ethernet NICs, just to cite a few). The distinctive feature of DBDMA is that the NIC's queues are no longer on-board: rather than using a few KBytes of expensive on-
board RAM, the NIC is able to autonomously operate on two potentially large regions of cheap host memory, acting as the send and receive queues, which are allocated by the OS kernel and made known to the NIC at the time of opening the device.

The exploitation of the DBDMA mode required a substantial re-design of GAMMA. In particular, the "true zero-copy" message receive that was implemented in the original GAMMA prototype is almost impossible to implement with similar performance results in the case of DBDMA NICs. We have thus developed a new version of GAMMA, currently available for DEC DE500 NICs, that implements a "one copy receive" protocol. We showed that, using current PC technology with 100 MHz memory bus clock and 350 MHz CPU clock rates, the impact of this additional copy on latency and throughput is relatively small [9]. The availability with DBDMA NICs of a much larger receive queue allocated in host RAM allowed the implementation of a low-overhead flow control protocol based on credits [9]. The DBDMA version of GAMMA with flow control (which guarantees a very low probability of message loss under any load condition) shows no significant performance penalty compared to the DBDMA version of GAMMA without flow control. This result allows the use of flow control for increased reliability and, in our opinion, compensates for the (currently very modest, and shrinking as technology evolves) loss in absolute performance with respect to the non-DBDMA original prototype.

A reliable version of GAMMA with flow control that avoids receiver overrun enables the use of GAMMA as a robust, high-performance messaging system over Fast Ethernet. In particular, the implementation of an MPI interface can take advantage of the extremely high efficiency of the GAMMA protocol to provide a high-performance, standard Application Programming Interface (API) for commodity clusters based on Fast Ethernet interconnects. The main goal in implementing MPI atop GAMMA must be, of course, keeping the additional software overhead implied by this "stacking" operation to a minimum.

Most existing efficient implementations of MPI for clusters run on expensive LAN interconnects like Myrinet (e.g., [13, 1]). Far fewer similar projects have been attempted for low-cost interconnects. A partial attempt, concerning only the optimization of collective routines in the case of Ethernet, is documented in [5]. To the best of our knowledge, the only attempt at providing an efficient implementation of MPI for Fast Ethernet, besides ours, is currently being pursued by the M-VIA project [3]. However, no information is available yet on the M-VIA porting of MPI. In our attempt to provide a really fast implementation of MPI for Fast Ethernet, we started from the public-domain implementation MPICH [4], and gradually modified its intermediate layer, called the Abstract Device Interface (ADI), in
order to re-implement it on GAMMA. For the sake of efficiency we adopted the following optimization guidelines:
- avoid stacking over intermediate layers, in order to save on header processing overhead and improve code locality;
- avoid temporary buffering of incoming messages whenever possible, by starting an ADI thread immediately upon message arrival;
- avoid dynamic allocation of memory for temporary storage;
- use a suitable programming style to take advantage of the Pentium CPU branch predictor unit and cache hierarchy.

With MPICH, all the MPI calls are implemented in terms of ADI functions. Therefore, porting the ADI layer to GAMMA means running the whole MPICH atop GAMMA. We already have a complete and working prototype of MPI/GAMMA. We are still testing it with real-world MPI applications. However, the current prototype passes all the tests included in the MPICH distribution, and successfully runs the test of the BLACS (Basic Linear Algebra Communication Subroutines) library [2].
2. DBDMA versus CPU-driven DMA

Figure 1 provides a comparison between throughput curves measured with slightly different system configurations for a Linux PC cluster based on 3Com 3c905 Fast Ethernet NICs. It is apparent that using the same hardware and OS but different device drivers leads to a substantial performance difference. The reason is very simple: the 3c905 NIC can be programmed and driven in two different ways. Older drivers used the CPU-driven DMA mode, according to which the host CPU itself starts a DMA operation of the NIC in order to move packets between the NIC's on-board queues and the host RAM. After a DMA transfer has been started, the CPU has to wait for it to end before starting a subsequent DMA operation. This leads to a "store-and-forward" behavior with tight synchronization between the CPU and the NIC, which inevitably results in low throughput and high overhead, as clearly shown in Figure 1, curve 3. However, the latest Linux drivers for the 3Com 3c905 NIC use the DBDMA mode. When driven in DBDMA mode, the NIC itself starts send and receive DMA transfers between host memory and the NIC by simply scanning two precomputed and static circular lists called rings, one for send and one for receive, both stored in host memory. Each entry of a ring is called a DMA descriptor. A DMA descriptor in the send ring contains a pointer (actually a
physical address) to the host memory region containing an outgoing packet. A DMA descriptor in the receive ring points to a memory region where an incoming packet can be stored. DMA descriptors are allocated by the OS, which is also responsible for forming the two rings out of them. The head of each ring is made known to the NIC by the OS at the time of opening the device. Both rings can be very large, the only limit being the total amount of host memory the OS devotes to packet storage.

[Figure 1. Linux 2.0.29 TCP sockets: "ping-pong" throughput with various NIC drivers (Nagle disabled). Curve 1: ideal Fast Ethernet throughput (latency 7 µs); curve 2: TCP, 3c905, PII 300, recent driver (DBDMA); curve 3: TCP, 3c905, PII 300, older driver (CPU-driven DMA). Axes: throughput (MByte/s) vs. message size (bytes).]

Since a DBDMA NIC is more autonomous during communication, a greater degree of parallelism is exploited in the communication path, according to a producer/consumer behavior. For message transmission, while the NIC "consumes" DMA descriptors from the transmit ring, performing the host-to-NIC data transfers specified in the descriptors, the CPU runs the TCP/IP protocol and "produces" the necessary DMA descriptors for subsequent data transfers, filling them with pointers to the data buffers containing the packets to be transmitted. The reverse occurs for message reception. This leads to a much better throughput, as clearly shown in Figure 1, curve 2. Such an advantage certainly justifies the widespread adoption of DBDMA NICs and their substitution for CPU-driven DMA NICs in the market of commodity components.

The theoretical question that remained open was whether light-weight cluster protocols such as GAMMA can benefit from DBDMA NICs or not. Thinking in terms of pure latency and bandwidth, the answer would obviously be "no." On the one hand, the GAMMA prototype already provides very high bandwidth even with CPU-driven DMA transfers. On the other hand, CPU-driven DMA transfers on the receiver side allow "true zero-copy" protocol implementations, reducing latency as compared to DBDMA: the demultiplexing of incoming messages to different destinations at different receiver processes requires CPU activity, and calling the CPU only after the incoming messages have been buffered in memory forces a subsequent copy to deliver the messages to their correct final destinations. In practice, even if one thinks that this might not be a good choice in terms of latency, the adoption of DBDMA NICs is forced by the non-availability of CPU-driven DMA NICs. In any case the porting of GAMMA to DBDMA NICs was mandatory, and the challenge was that of reducing the expected latency increase to a minimum.
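To make the producer/consumer interplay more concrete, the following C fragment sketches a DBDMA-style descriptor ring as a driver might see it; the field names, the ownership flag, and the helper function are our own illustrative assumptions, not the actual descriptor layout of the 3c905B or DEC 2114x chips.

```c
#include <stdint.h>

#define RING_SIZE 1024   /* e.g., 1024 entries; with full-size Ethernet buffers this is
                            roughly the 1.5 MByte of kernel memory mentioned in the next section */

/* Hypothetical DMA descriptor: one entry of the send or receive ring in host memory. */
struct dma_descriptor {
    volatile uint32_t status;      /* ownership/completion flag, also written by the NIC */
    uint32_t          length;      /* buffer length (or length of the received frame) */
    uint32_t          buffer_phys; /* physical address of the packet buffer in host RAM */
    uint32_t          next_phys;   /* physical address of the next descriptor (circular list) */
};

#define DESC_OWNED_BY_NIC 0x80000000u

/* Producer side (host CPU/driver): post one outgoing packet and hand the descriptor
 * to the NIC, which autonomously scans the ring and performs the DMA transfer. */
static void post_tx(struct dma_descriptor *tx_ring, unsigned *tail,
                    uint32_t buf_phys, uint32_t len)
{
    struct dma_descriptor *d = &tx_ring[*tail];
    d->buffer_phys = buf_phys;
    d->length      = len;
    d->status      = DESC_OWNED_BY_NIC;        /* ownership passes to the NIC */
    *tail = (*tail + 1) % RING_SIZE;
    /* a real driver would also write a NIC register here to signal the new descriptor */
}
```

The receive ring works symmetrically: the driver keeps it populated with empty buffers, and the NIC fills them as frames arrive, without any intervention of the host CPU until demultiplexing time.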
3. The GAMMA flow control implementation

In the redesign of the GAMMA prototype for DBDMA NICs we also attempted to include additional features that could solve part of the open problems inherent in the CPU-driven DMA version of the GAMMA driver. The main benefit of the DBDMA architecture that GAMMA exploits is the availability of substantially larger receive queues, although located in host memory and indirectly accessed by scanning the receive ring. A good way of exploiting substantially larger receive queues is the introduction of a credit-based flow control protocol.

It is well known that TCP uses a sliding-window flow control algorithm with "go-back-N" packet retransmission upon timeout. The flow control mechanism of TCP is an end-to-end protocol that avoids packet overflow at the receiver side. However, it cannot prevent overflow from occurring in a LAN switch in case of network congestion. Only a data-link level flow control mechanism, such as, e.g., the one defined in the 802.3x standard extension, can prevent packet loss within the network in case of congestion. A switch that does not offer 802.3x flow control can indeed discard frames in case of congestion. When this occurs, eventually the retransmission mechanism of TCP on the sender hosts is triggered and starts re-sending many more packets than needed, increasing network traffic and therefore making the LAN even more congested. Clearly, LANs require better flow control algorithms or at least more sophisticated retransmission policies, which should take into account the fact that the only source of packet loss in a modern LAN is overrun at a receiver NIC or at a switch, and that this often occurs according to well-known patterns (packet losses due to switch buffer overflow often happen in sequences called "clusters", allowing a selective retransmission algorithm to perform far better than "go-back-N"). In the case of GAMMA we started designing an appropriate flow control protocol from scratch, taking into account the almost zero error rate exhibited by 100base-T
star topologies adopting standard class 5 UTP cabling. The idea of associating an acknowledgement packet with each data packet was therefore discarded. A credit-based protocol was instead chosen for flow control purposes, similar to the one already adopted in FM2 [14].

Let us assume a 100-way GAMMA cluster, where each NIC may access a receive queue counting 1024 entries (1024 is a good choice, providing storage for 1024 full-size incoming Ethernet packets with an acceptable memory occupancy of 1.5 MByte in kernel space). After setting the size of the NIC receive ring to 1024 and initializing the ring itself, the GAMMA device driver reserves most of the available receive storage for the individual potential sender PCs in the cluster. With a 100-way cluster, each node may expect to receive messages from 99 different PCs. Reserving eight buffers for each individual sender, 792 out of the 1024 available buffers would be consumed; therefore a credit size of eight for each communication pair is feasible. The GAMMA driver running on each PC is then initialized with a credit of eight frames towards each possible receiver, so that up to eight consecutive frames can be safely sent towards a given arbitrary destination without requiring any acknowledgement. Once the credit towards a given PC is exhausted, the sender must refrain from sending further frames towards that PC until it receives an acknowledgement packet from it. This acknowledgement restores the whole initial credit, allowing the transmission to continue. In our example, an additional 99 buffers in the receive queue should be reserved for incoming acknowledgements in order to guarantee absence of deadlock. Out of the 1024 buffers available in the receive queue, 133 would not be reserved for GAMMA communications, and could thus be used for other protocols (such as, e.g., TCP/IP) possibly sharing the same connection. Clearly, if L is the latency of the original messaging system without flow control and C is the credit size, the latency increase due to credit refill amounts to L/C (one credit refill is required every C frames), which vanishes when C gets large enough. As we will see later, the performance penalty with a typical credit size of 32 is negligible.

The new GAMMA prototype based on the DBDMA NIC operating mode was developed at the Department of Computer Science of the University of Rome as part of two Master's theses. One version was developed for Intel EtherExpress cards and another for NICs based on the Digital 21140 chipset. A similar effort with 3Com 3c905B NICs is in progress. We ported the DEC 21140 GAMMA prototype to newer chipsets of the same family (21143) and added the flow control protocol outlined above.

For message send, the mechanism implemented by the new GAMMA driver is similar to the one implemented by the older ones. An explicit gamma_send(out_port,
data_pointer, length) function is invoked by the sender process, specifying a memory address and an integer length that describe the data to be transmitted through a specified GAMMA output port. The send routine traps into the GAMMA device driver, which is responsible for fragmenting the message into packets of appropriate maximum size and inserts the pointers to the message fragments into the transmit ring. The NIC is explicitly notified by the CPU about each insertion in the transmit ring, and therefore starts consuming DMA descriptors as soon as the CPU has produced the first of them. In case of more than one packet to be transmitted, the NIC itself pipelines the physical transmissions with the scanning of the transmit ring.

For message receive, however, the adoption of the DBDMA technology implied a different software architecture as compared to previous GAMMA prototypes. In the CPU-driven DMA version of GAMMA, incoming frames used to be kept in the NIC's on-board receive queue until the CPU responded to the NIC's interrupt request. The CPU then looked at the content of the frame header by accessing it in programmed I/O mode in order to distinguish GAMMA frames from TCP/IP frames and do the appropriate demultiplexing. Eventually, the CPU programmed the DMA Bus Mastering transfer from the NIC's receive queue to the final destination in RAM for the received data. Upon completion of the DMA transfer, the CPU ran the corresponding Receiver Handler in case of GAMMA messages. In the DBDMA version, the receive queue is allocated in host memory rather than on-board. When the NIC starts receiving a frame, it immediately fetches the DMA descriptor located at the head of the receive ring, then starts filling the host memory buffer pointed to by the descriptor with the incoming packet. Upon complete packet receipt, the NIC notifies the presence of a new packet by setting a bit in an on-board status register and possibly interrupting the CPU. The former mechanism allows the CPU to explicitly poll the NIC in order to avoid the interrupt overhead. When the CPU reads the frame header and is ready to perform the demultiplexing, the message is already in RAM, unfortunately at the wrong memory address. A memory-to-memory copy is therefore needed (and performed by the CPU) in order to deliver a GAMMA message into the user buffer specified by the receiver process.

The two prototypes, with and without flow control, are now running on our cluster based on Linux kernel 2.0.36. A port to the newer 2.2.6 kernel is scheduled for the near future.
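As an illustration, the following C fragment sketches the sender-side credit accounting just described. The gamma_send() primitive and its (out_port, data_pointer, length) arguments are taken from the text above, but the types, return values, and the gamma_wait_credit_refill() helper are hypothetical simplifications of ours; in the real prototype this bookkeeping is performed per destination node inside the kernel-level driver.

```c
#include <stddef.h>

#define INITIAL_CREDIT 32   /* e.g., the credit size of 32 used in the experiments of Section 5 */

/* Hypothetical per-destination credit state. */
struct credit_state {
    int credits;            /* frames still allowed before an acknowledgement is needed */
};

/* Assumed prototypes (simplified signatures). */
int gamma_send(int out_port, void *data_pointer, size_t length);
int gamma_wait_credit_refill(int dest_node);   /* hypothetical: blocks until the ack arrives */

/* Send one frame towards 'dest_node' through 'out_port', respecting the credit. */
static int send_with_flow_control(struct credit_state *cs, int dest_node,
                                  int out_port, void *frame, size_t len)
{
    if (cs->credits == 0) {
        /* Credit exhausted: wait for the receiver's acknowledgement frame,
         * which restores the whole initial credit. */
        gamma_wait_credit_refill(dest_node);
        cs->credits = INITIAL_CREDIT;
    }
    cs->credits--;
    return gamma_send(out_port, frame, len);
}
```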
4. The GAMMA ADI Design

In ADI jargon, an arriving message which finds a matching pending receive at the receiver side is called an
expected message. In principle, expected messages do not need any temporary storage, since they can be delivered directly to the user-level destination data structure provided by the pending MPI receive operation which matches them. Otherwise the message is called unexpected, and needs to be temporarily stored inside the ADI layer at the receiver side, waiting for a matching receive to be posted. This distinction appears more concrete when looking at the ADI architecture, where we can find two main distinct queues, namely the queue of "posted and not yet matched" pending receives, and the queue of "unexpected" messages waiting for a matching receive to be executed.

The original ADI inside MPICH uses three protocols for message delivery, namely the "short" and the "eager" protocols, which are essentially the same (the "short" protocol is an optimization of the "eager" one in the case of very short messages), and the "rendez-vous" protocol. With the short and eager protocols, a message is sent regardless of any knowledge about the state of the receiver. At the receiver side, a best effort is made to store the arriving message, be it expected or not, including dynamic allocation of temporary storage. As a matter of fact, under some circumstances the receiver side may fail to host an unexpected message due to insufficient memory, thus violating the MPI semantics. On the other hand, the rendez-vous protocol forces a synchronization between sender and receiver before transmitting data. This ensures the availability of sufficient receiver resources. Rendez-vous is achieved at the expense of latency, which makes this protocol suitable for long messages only. Indeed, ADI defines a threshold on the message size to select the rendez-vous protocol instead of the eager/short ones. This way, buffering of long messages is avoided on the receiver side. Obviously, the rendez-vous protocol is also used with short messages whenever an MPI synchronous send routine is called.

In order to make best use of the GAMMA programming interface while providing an MPI interface to the user, we have substantially changed the original ADI layer atop GAMMA. The resulting messaging system will be called GADI (GAMMA ADI) in the sequel. In GADI only two protocols are used for message delivery, namely the eager and the rendez-vous protocols. The hardware/software architecture of the GADI implementation is depicted in Figure 2.

[Figure 2. GADI architecture sketch: sender side (MPI_Send -> GADI -> GAMMA driver -> DBDMA output ring -> DBDMA NIC) and receiver side (DBDMA NIC -> DBDMA input ring in kernel memory -> GAMMA driver interrupt -> GADI receiver handler -> posted/unexpected queues and temporary buffers in user memory -> MPI_Recv into user memory), the two NICs being connected through a hub/switch.]

The sender part of GADI is structured as one thread, invoked by the upper MPI send routines. Zero-copy send is implemented as in GAMMA, by inserting the address of the data payload directly into the DBDMA transmit ring. The receiver part of GADI is structured as two concurrent threads, one invoked by the upper MPI receive routines, and one invoked by the underlying GAMMA messaging system as a "receiver handler" upon message arrival. The interaction between the GADI receiver thread and the GADI
handler thread is implemented by means of two shared queues, which are accessed in mutual exclusion. The very possibility of running a GADI thread as a GAMMA receiver handler for incoming messages makes it possible to inspect the queue of pending receives "on the fly" upon each message arrival. This allows GADI to deliver the payload directly to the final destination specified by a matching receive, if any, thus avoiding any useless temporary copy of incoming messages. Only one copy is performed on message receipt if the matching receive was already posted. Data are instead stored in temporary user-process buffers if no matching receive was posted before running the receiver handler. As a consequence, there are a number of cases, with contiguous MPI data types, in which point-to-point GADI communication is minimal-copy, namely:
- an MPI "standard" send is invoked after the matching receive was invoked, which implies that the incoming message is expected;
- an MPI "synchronous" send is invoked, which implies that the message is certainly expected thanks to the use of the rendez-vous protocol;
- an MPI "ready" send is invoked, which implies that the matching receive was already invoked and that the incoming message is therefore expected;
- the message is longer than a given threshold, currently set at 30000 bytes; this forces GADI to use the rendez-vous protocol.

In the remaining case, that is, whenever an unexpected message arrives that is shorter than 30000 bytes and was sent in MPI "standard" mode, one temporary copy of the message is made by the GADI receiver thread, using GADI temporary buffers. However, in order to minimize overhead, our design choice was to use fixed preallocated storage for unexpected messages. Each receiver provides N bytes of total preallocated room per sender at the GADI level. In order to prevent this resource from running low in the case of too many unexpected message arrivals, a credit-based flow control has been implemented inside GADI. This consists of initially giving each sender a credit for transmitting N bytes to each destination when using the GADI eager protocol. GADI is then forced to switch to the rendez-vous protocol at the sender side when its credit towards the required destination is exhausted. As a result, the sender will synchronize with the destination before proceeding. The destination is then allowed to refill the sender's credit during the synchronization phase.

For non-contiguous MPI data types as well as with buffered send/receives, two additional temporary copies of messages are carried out by GADI. This is an unavoidable consequence of the MPI semantics when using MPI "buffered" mode. For non-contiguous MPI data types, temporary copies could only be avoided if the underlying messaging system (GAMMA in this case) provided optimized, minimal-copy support for transmitting from/to non-contiguous memory regions (so-called "gather/scatter" operations, not to be confused with the well-known homonymous collective routines).
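The following C sketch summarizes the receive path of GADI as described in this section. The queue descriptors, the matching function, and the storage helpers are hypothetical placeholders of ours; the point is only to show how the receiver handler either delivers the payload directly into the buffer of an already posted receive (one copy) or stores it in the fixed preallocated per-sender area for unexpected messages. In the real implementation the two queues are shared with the GADI receiver thread and accessed in mutual exclusion.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical record of a posted (pending) receive. */
struct posted_recv {
    void  *user_buf;   /* final user-level destination buffer */
    size_t max_len;
    int    src, tag;
};

/* Assumed helpers (placeholders for the real GADI queue management). */
struct posted_recv *match_posted_receive(int src, int tag);
char *reserve_unexpected_slot(int src, size_t len);            /* fixed preallocated storage */
void enqueue_unexpected(int src, int tag, char *slot, size_t len);

/* GADI receiver handler, run by GAMMA upon each message arrival;
 * the payload already sits in host RAM thanks to the DBDMA receive ring. */
void gadi_receiver_handler(int src, int tag, const void *payload, size_t len)
{
    struct posted_recv *pr = match_posted_receive(src, tag);
    if (pr != NULL) {
        /* Expected message: one copy, directly into the final user buffer. */
        memcpy(pr->user_buf, payload, len < pr->max_len ? len : pr->max_len);
    } else {
        /* Unexpected message: copy into the preallocated per-sender storage and
         * queue it, waiting for a matching receive to be posted. */
        char *slot = reserve_unexpected_slot(src, len);
        memcpy(slot, payload, len);
        enqueue_unexpected(src, tag, slot, len);
    }
}
```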
5. Communication Performance

Latency and throughput measurements were taken using a "ping-pong" experiment at the user-process to user-process level. Both the GAMMA programming interface and the MPICH point-to-point communications have been measured, the latter both in the original TCP version and in our prototype version based on GAMMA. Our experiments take into account all of the following activities, which in our opinion should never be neglected when measuring communication performance at user level:

- the message copy from a user-level contiguous data structure to the communication layer at the sender side;
- the message copy from the communication layer to the user-level, application-provided contiguous data structure devoted to storing the message at the receiver side;
- the notification of message arrival to the receiver process.

5.1. Reliable GAMMA with flow control

It is interesting to observe the diminishing impact of flow control on throughput and latency for increasing window size [9]. While a credit size of four frames still substantially affects the throughput curve and latency, a credit size of eight already has limited impact on both. With a credit size of a few tens, flow control hardly reduces performance in a noticeable way.

Figure 3 reports throughput curves from different prototypes of GAMMA on different cluster configurations, together with the curve of theoretical 100base-T throughput based on an optimistic estimation of the hardware latency (7 µs).

[Figure 3. Throughput curves of the new GAMMA driver on DE500 NICs, with and without flow control (FC). Curve 1: ideal Fast Ethernet (lat. 7 µs); curve 2: GAMMA, P 133, 3c905, no FC (lat. 13 µs); curve 3: GAMMA, PII 350, DE500, FC (lat. 14 µs); curve 4: GAMMA, P 133, DE500, no FC (lat. 17.6 µs). Axes: throughput (MByte/s) vs. message size (bytes).]

The comparison between the former prototype of GAMMA for 3c905 NICs working in CPU-driven DMA mode on a cluster of outdated Pentium 133 MHz PCs (curve 2) and the current prototype of GAMMA for DEC DE500 NICs working in DBDMA mode on the same Pentium 133 MHz cluster without flow control (curve 4) shows a clear performance degradation with the latter. Part of this degradation, especially for the short-message throughput, is due to the temporary copy of incoming packets in kernel space introduced by the DBDMA mode. However, we believe that most of the latency degradation is due to the missing exploitation of the "early receive" NIC feature [10] in the current version of GAMMA.

When upgrading to faster CPUs and current PC technology, that is, Pentium II 350 MHz PCs with a 100 MHz memory bus, the former prototype of GAMMA obtains little benefit (curve 2 actually does not change if a Pentium II 350 MHz is used instead of a Pentium 133 MHz), because the old CPU-driven DMA is an I/O-bound operation mode. On the other hand, the current DBDMA-based prototype of GAMMA shows much better performance, even including flow control with a credit size of 32 (curve 3): indeed, with its very low latency of 14 µs and asymptotic bandwidth of 12.1 MByte/s (not reached in the figure because of the short horizontal scale), it becomes almost equivalent to the original prototype of GAMMA. Notice the reduced difference in the throughput curves between the old GAMMA, working in CPU-driven DMA mode with a zero-copy protocol and no flow control, and the new GAMMA, working in DBDMA mode with one single copy on receive and flow control, when current PC technology is used. The comparison clearly shows the diminishing overhead of the infamous memory-to-memory copy on receive, due to the improvement in the memory architecture of modern PCs.

Finally, we would like to stress the absolute value of the performance results we obtained. GAMMA latency remains substantially lower than the latency reported for similar research projects, including U-Net [21] and M-VIA [3] on NICs using the same DEC chipset, even though our new GAMMA prototype now includes flow control (while
neither U-Net nor M-VIA do). Bandwidth, as usual, is very close to the complete saturation of the channel.

[Figure 4. Ping-pong performance evaluation of MPI/GAMMA with Pentium II 350 MHz PCs, a Fast Ethernet repeater hub, and Linux. Performance of Linux TCP, MPICH atop TCP, and GAMMA flow-controlled communication on the same platform is reported as a reference. Curve 1: ideal Fast Ethernet (latency 7 µs); curve 2: gamma_send_fast_flowctl() (latency 14 µs); curve 3: MPI/GAMMA (latency 17.7 µs); curve 4: Linux 2.0.36 TCP sockets (latency 58 µs); curve 5: MPICH/P4/TCP (latency 131 µs). Axes: throughput (MByte/s) vs. message size (bytes).]

5.2. MPICH on GAMMA ADI

Our "ping-pong" tests of MPI/GAMMA on a pair of Pentium II 350 MHz PCs connected by a Fast Ethernet hub show a one-way latency of 17.7 µs, including about 1 µs of hardware latency from the repeater hub and cables. The asymptotic bandwidth is 12.1 MByte/s. The half-power point is reached at 256 bytes. The overall throughput profile for message sizes up to 192000 bytes is depicted in Figure 4, curve 3. When compared to the throughput curve of GAMMA flow-controlled communication, the throughput curve of MPI/GAMMA indeed reflects a low-overhead implementation of MPI, certainly the best-performing one currently available for Fast Ethernet clusters, thanks to the efficient architecture of GADI. The substantial improvement of MPI/GAMMA over MPICH running atop Linux TCP sockets on the same hardware platform is clearly apparent.

Of course, the "ping-pong" program forces an implicit synchronization between sender and receiver which allows minimal-copy communications most of the time during the test. On the other hand, for messages longer than 30000 bytes the communication path is minimal-copy all the time. Therefore the depicted throughput curve of MPI/GAMMA virtually corresponds to the optimal case. More extensive benchmarking based on the use of real-life MPI applications is currently under way.
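For reference, a ping-pong measurement of the kind used throughout this section can be written with a handful of standard MPI calls. The sketch below is our own minimal version (message size, iteration count, and output format are arbitrary), not the actual benchmark code used to produce the figures.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank, i, size = 1024;            /* message size in bytes (arbitrary) */
    char *buf;
    double t0, t1, one_way;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        one_way = (t1 - t0) / (2.0 * REPS);          /* one-way time per message */
        printf("size %d bytes: one-way %.1f usec, throughput %.2f MByte/s\n",
               size, one_way * 1e6, size / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```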
6. Conclusions

We have implemented a port of GAMMA, an Active Ports based messaging system for low-cost PC clusters, to a state-of-the-art off-the-shelf NIC that adopts the DBDMA transfer mode from/to system memory. The port required a redesign of the receive mechanism, which is now forced to introduce one memory-to-memory copy on message receive, whereas the original GAMMA driver was able to deliver messages to arbitrary destinations in user memory in true zero-copy mode. Performance results show that, also thanks to the current trend of improvement in the CPU and memory architecture of PCs, such a change in the receive mechanism has a vanishing impact on latency as well as communication throughput.

On the other hand, DBDMA-based NICs offer a substantially larger receive queue allocated in host memory rather than on-board. This allowed us to implement a credit-based flow control protocol, with appropriately large credits allocated to all potential communication pairs. Our performance measurements show that, using such a protocol even with a reasonably small credit size, the impact of flow control on latency and throughput can be negligible. The practical interest of the latter result is enormous. With the adoption of flow control our GAMMA prototype becomes reliable up to hardware faults, avoiding all sorts of message losses due to receiver overrun under any load condition. The only remaining source of message loss in GAMMA with flow control is due to hardware errors and subsequent CRC failures. However, such errors almost never occur if the LAN is properly cabled and switched (assuming a bit error rate of less than 10^-11, an error-inducing frame loss is expected every 10^7 frames), whereas packet corruptions due to collisions in a shared LAN can always be recovered by the NIC driver upon detection of transmission failures due to excessive collisions.

Using the reliable version of GAMMA with flow control we were able to implement an MPI interface (at the MPICH ADI level) on top of GAMMA that offers unprecedented performance on the cheapest Fast Ethernet technology. Thanks to the GAMMA Active Ports implementation, the GAMMA ADI implementation is two-threaded, allowing for on-the-fly inspection of the posted receive queue and minimal copy on receive. As a side effect of the multi-threaded implementation of the ADI level, our porting of MPICH should be thread-safe (this is an unverified claim, for the moment, as we could not test our porting on Linux SMP yet).
Porting the ADI layer to GAMMA greatly speeds up point-to-point MPI communications, but it is not an equally satisfactory answer for collective calls. Speeding up MPI collective calls requires providing collective services at the ADI level, possibly mapped onto efficient collective routines at a lower level. Indeed, GAMMA already provides efficient support for some collective patterns, by taking best advantage of the Ethernet hardware broadcast service [8]. Once the GAMMA library has been extended with a wider set of collective routines, and after modifying the ADI layer specification in order to expose GAMMA collective routines to the upper MPI level, we may expect a performance gain for all the MPI collective routines far beyond the already impressive improvement exhibited by MPI/GAMMA with point-to-point communications.
References

[1] MPI-BIP: An Implementation of MPI over Myrinet, http://lhpca.univ-lyon1.fr/mpibip.html.
[2] BLACS homepage, http://www.netlib.org/blacs/, 1998.
[3] M-VIA Home Page, http://www.nersc.gov/research/FTG/via/, 1998.
[4] MPICH - A Portable MPI Implementation, http://www.mcs.anl.gov/mpi/mpich/, 1998.
[5] J. Bruck, D. Dolev, C. Ho, M. Rosu, and R. Strong. Efficient Message Passing Interface (MPI) for Parallel Computing on Clusters of Workstations. Journal of Parallel and Distributed Computing, 40(1):19–34, Jan. 1997.
[6] G. Chiola and G. Ciaccio. GAMMA home page, http://www.disi.unige.it/project/gamma/.
[7] G. Chiola and G. Ciaccio. Active Ports: A Performance-oriented Operating System Support to Fast LAN Communications. In Proc. Euro-Par'98, number 1470 in Lecture Notes in Computer Science, pages 620–624, Southampton, UK, Sept. 1998. Springer.
[8] G. Chiola and G. Ciaccio. Fast Barrier Synchronization on Shared Fast Ethernet. In Proc. of the 2nd International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'98), number 1362 in Lecture Notes in Computer Science, pages 132–143. Springer, Feb. 1998.
[9] G. Chiola, G. Ciaccio, L. Mancini, and P. Rotondo. GAMMA on DEC 2114x with Efficient Flow Control. In Proc. 1999 International Conference on Parallel and Distributed Processing, Techniques and Applications (PDPTA'99), Las Vegas, Nevada, June 1999.
[10] G. Ciaccio. Optimal Communication Performance on Fast Ethernet with GAMMA. In Proc. Workshop PC-NOW, IPPS/SPDP'98, number 1388 in Lecture Notes in Computer Science, pages 534–548, Orlando, Florida, Apr. 1998. Springer.
[11] S. Donaldson, J. M. D. Hill, and D. B. Skillicorn. BSP Clusters: High Performance, Reliable and Very Low Cost. Technical Report PRG-TR-5-98, Oxford University Computing Laboratory, Programming Research Group, Oxford, UK, 1998.
[12] W. Gropp and E. Lusk. User's Guide for MPICH, a Portable Implementation of MPI. Technical Report MCS-TM-ANL-96/6, Argonne National Lab., University of Chicago, 1996.
[13] M. Lauria and A. Chien. MPI-FM: High Performance MPI on Workstation Clusters. Journal of Parallel and Distributed Computing, 40(1):4–18, Jan. 1997.
[14] M. Lauria, S. Pakin, and A. Chien. Efficient Layering for High Speed Communication: Fast Messages 2.x. In Proc. of the Seventh IEEE Int'l Symp. on High Performance Distributed Computing (HPDC-7), Chicago, Illinois, July 1998.
[15] P. Marenzoni, G. Rimassa, M. Vignali, M. Bertozzi, G. Conte, and P. Rossi. An Operating System Support to Low-Overhead Communications in NOW Clusters. In Proc. of the 1st International Workshop on Communication and Architectural Support for Network-Based Parallel Computing (CANPC'97), number 1199 in Lecture Notes in Computer Science, pages 130–143. Springer, Feb. 1997.
[16] S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. In Proc. Supercomputing '95, San Diego, California, 1995.
[17] R. D. Russel and P. J. Hatcher. Efficient Kernel Support for Reliable Communication. In Proc. 1998 ACM Symp. on Applied Computing (SAC'98), Atlanta, Georgia, Feb. 1998.
[18] T. Sterling, D. Becker, D. Savarese, J. Dorband, U. Ranawake, and C. Packer. BEOWULF: A Parallel Workstation for Scientific Computation. In Proc. 24th Int. Conf. on Parallel Processing, Oconomowoc, Wisconsin, Aug. 1995.
[19] M. Verma and T. Chiueh. Pupa: A Low-Latency Communication System for Fast Ethernet. Technical report, Computer Science Department, State University of New York at Stony Brook, Apr. 1998.
[20] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In Proc. of the 19th Annual Int'l Symp. on Computer Architecture (ISCA'92), Gold Coast, Australia, May 1992.
[21] M. Welsh, A. Basu, and T. von Eicken. Low-latency Communication over Fast Ethernet. In Proc. Euro-Par'96, Lyon, France, Aug. 1996.