Modeling of a high speed network to maximize throughput performance: the experience of BIP over Myrinet

Bernard Tourancheau (LHPC & INRIA ReMaP, LIGIM bât710, UCB-Lyon, 69622 Villeurbanne, France), Loïc Prylli and Roland Westrelin (LHPC & INRIA ReMaP, LIP, ENS-Lyon, 69364 Lyon, France)
[email protected] [email protected]

September 26, 1997

Abstract
High speed networks now provide incredible performance. Software evolution is slow and the old protocol stacks are no longer adequate for this kind of communication speed. When throughput increases, the latency should decrease as much in order to keep the system balanced. With the current network technology, the main bottleneck is most of the time the software that makes the interface between the hardware and the user. We designed and implemented new transmission protocols, targeted first at parallel computing, that squeeze the most out of the high speed Myrinet network, giving all the speed to the applications. A precise modeling of the high speed hardware was necessary to analytically determine the optimal pipeline parameters. This modeling and design is presented here, as well as experimental results that achieve a real Gigabit/s throughput and less than 5 µs latency, with half of the best performance ($n_{1/2}$) reached for 4 KByte messages, on a cluster of PC workstations with this network hardware. Moreover, our networking results compare favorably with those of expensive parallel computers or ATM LANs.
1 Introduction

Multimedia applications as well as parallel computing and databases are asking for low latency, high bandwidth networks. This kind of performance implies a new design of the protocols in order to avoid software latency and memory copies. Indeed, the recent relative evolution of computer subsystems has created new problems: five years ago, with parallel computing over 10 Mbits/s Ethernet or even a 100 Mbits/s FDDI ring, it was easy to saturate the network. The memory bandwidth or the IO bandwidth of typical workstations was an order of magnitude higher than that of the physical network, so the interface between the user and the hardware was not much of a problem. Nowadays, relatively inexpensive network technology such as Myrinet provides over 1 Gbits/s for LANs, and although workstations have also increased in performance, the gap between the network throughput and the other internal throughputs (memory and IO busses) has been considerably reduced. So it is time to use host resources carefully. Our experiences with ATM LAN networks have shown two problems: first, even when the wire is able to provide 155 Mbits/s, a poor design of the ATM board drivers can prevent the use of more than half of the hardware bandwidth. Second, the latency on typical workstations is counted in hundreds of microseconds [Pry96, PT97b], which is unbearable in such a context (a 500 µs latency is equivalent to the transfer of 10 KBytes of data at the speed of 155 Mbits/s).

This work was supported by EUREKA contract EUROTOPS, LHPC (Matra MSI, CNRS, ENS-Lyon, INRIA, Région Rhône-Alpes), INRIA Rhône-Alpes project REMAP, CNRS PICS program, CEE KIT contract, INRIA-NSF grant.
Our software research work was driven by the idea that we wanted to exploit the full potential of the network in the applications. In the real world, what counts is not what the hardware can theoretically support (ATM 155 Mbits/s, Myrinet 1.28 Gbits/s) but what performance is available at the user/developer level (and which marketing will not advertise). Our research shows that the power of high speed networks can be exploited by carefully shortening the path of data from application to application, optimizing the pipeline effects and avoiding all possible overhead. This was necessary for both latency and bandwidth improvement.
2 Our PCs - Myrinet LAN platform

The Myricom LAN target was chosen for its performance over the Gbits/s (OC-24 = 1.28 Gbits/s actually), its affordability (the interface boards are around 1K USD and the 8x8 switch is around 1.5K USD) and its software openness (all the software and specifications are freely available to customers). We are currently running LINUX 2.1 on PentiumPro200 and Pentium133 PCs with Myrinet PCI boards. The network boards are plugged into the PCs and connected together with a Myricom SAN and LAN 8 by 8 ports switch (see Figure 1).

[Figure 1 components: processor, cache, main memory, PCI bus controller, Myrinet network card (LANai processor, SRAM), Myrinet switch.]
Figure 1: Architecture of the Myrinet network on PCs

[Naq97] gives the results of our experiments on this platform with different software packages, namely Fast Messages, Active Messages, Sockets on IP, PVM on IP, MPICH on IP and MyriApi, obtained at the beginning of 1997. The latency measurements show the influence of the processor performance when the protocol is run not in firmware on the network board but on the PCs. The throughput curves show the limitation of all these interfaces compared to the capabilities of the hardware.
3 Overview of BIP

Our first objective was to implement BIP (Basic Interface for Parallelism), an interface for network communication targeted towards message-passing parallel computing. The idea in BIP was to provide protocols with low level functionality that could be interfaced with specially designed critical parallel applications. Then, by adding another protocol layer, BIP was easily interfaced with classical protocol stacks like IP or APIs like the well established MPI [SOHL+95] and PVM [GBD+94] (see Figure 2). Our main goal was to build it as a library interface, accessible from applications, that implements a high speed protocol with as few accesses to the system kernel as possible (for other works in this area, see Section 8).

The BIP interface provides several functions to get parameters or set up constants, and the send and receive blocking or non-blocking communication primitives. At the implementation level, the send and receive provide a loose rendez-vous semantics: for large messages, the send is guaranteed to complete only if a corresponding receive has been posted, and the "loose" comes from the fact that for small messages, the intermediate buffering and queuing in the protocol may cause the send to complete even before the receive is posted. Note that these semantics in fact respect the semantics of the standard send and receive of MPI, but there is no flow control to avoid overflowing the queues.

BIP messages can be routed through multiple Myrinet switches, which provides, from the OSI point of view, services that are part of the level 3 functionality (in fact the Myrinet design implies that routing is done at the physical layer): we provide for each message the path that it should follow through the switches. This path is currently fixed at initialization for each pair of nodes.

BIP messages can be tagged for identification. The other arguments of the send are the data and the logical number of the destination; the receive does not specify a particular source but can check on a tag, and its other arguments are a buffer where to receive a message and the maximal length it can receive in this buffer. It is up to the application or upper protocol layers to provide more functionality if needed. The rationale here is that BIP is intended as an intermediate layer for other higher level protocols as soon as complex functionality is needed, but the services provided are strongly oriented towards the demand of basic parallel applications.

The send and the receive are also available with non-blocking semantics, where one can either test or wait for the completion of the asynchronous send and receive calls. With the current version, at any one time a process may only have one receive per tag and one send posted, and not more. Send and receive operations are completely independent, so they can be intermixed in any manner. The non-blocking primitives allow overlapping of computation and communication when appropriate, but there must be no more than one send and one receive operation per tag pending.

Here is a summary of the characteristics of BIP messages; a more complete presentation is given in the BIP manual [Pry97]:
- The BIP scope of an application consists of n processes numbered logically from 0 to n-1 at start time.
- BIP relies on send and receive primitives that have a loose rendez-vous semantics.
- All BIP communications are as reliable as the hardware: there is no re-emission mechanism, but errors are detected and reported to the application or upper protocol layers for abortion or treatment.
- BIP ensures in-order delivery.
- For simplicity, the transmitted data must be contiguous and aligned on 4-byte word boundaries.
- BIP does not put any limit on the size of the transmitted messages.
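To make the interface concrete, the sketch below writes out the kind of C prototypes such a layer exposes, paraphrasing the semantics just described. The function and type names are illustrative only, not the actual BIP prototypes (see the BIP manual [Pry97] for those); the comments restate the constraints given above.

```c
/* Hypothetical BIP-like interface sketch: names and signatures are
 * illustrative, NOT the real BIP API.  Messages are arrays of 32-bit
 * words, matching the 4-byte alignment constraint, and nodes are
 * designated by their logical rank 0..n-1. */
#ifndef BIP_SKETCH_H
#define BIP_SKETCH_H

typedef int bip_request;   /* handle for a pending non-blocking operation */

/* Blocking primitives, loose rendez-vous semantics: for large messages
 * the send completes only once a matching receive is posted; small
 * messages may be buffered and complete earlier (no flow control). */
void bip_send(int dest_rank, int tag, const unsigned *data, int length_words);
int  bip_recv(int tag, unsigned *buffer, int max_length_words); /* returns received length */

/* Non-blocking variants: at most one send and one receive pending per
 * tag; completion is checked with test or wait. */
bip_request bip_isend(int dest_rank, int tag, const unsigned *data, int length_words);
bip_request bip_irecv(int tag, unsigned *buffer, int max_length_words);
int  bip_test(bip_request r);   /* non-zero when the operation has completed */
void bip_wait(bip_request r);

#endif /* BIP_SKETCH_H */
```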
4 Design choices for a zero main-memory copy

The software design was in particular guided by the high speed network platform we use: Myricom [Myr] Myrinet [Myr95]. However, the general ideas are applicable to any network hardware architecture that provides the same functionality: a processor and memory on the network board to run our BIP firmware, and DMA engines to do the pipelined transfers.
[Figure 2 stack, top to bottom: Application; PVM, TCP/UDP, MPI, IP; IP-BIP, MPI-BIP; BIP.]
Figure 2: Description of the protocol stack from our application performance point of view: the application can access the BIP level directly (basic message passing interface with semantic constraints) or use the other functionality levels. Notice that this can change depending on the implementations that are done; for instance, one can imagine porting the BLACS, or TCP, or PVM directly on top of BIP in order to keep a very low latency.

The important point here has already been pointed out by others [PKC97, BBVvE95]: for performance reasons, we cannot afford using the OS and a heavy protocol stack as an intermediary to access the network hardware. The BIP message-passing library directly manages the hardware and the BIP firmware to implement the message-passing API.

Five years ago, on parallel machines or on clusters of workstations with 10 Mbits/s Ethernet, the memory bus was one or several orders of magnitude faster than the network. There was not much concern about the way data was moved from memory to the network. Copies of communication buffers into a memory space suited to the communication protocol, packet disassembly and reassembly: all these kinds of operations did not have an important impact on the final performance. The current situation is quite different; the network bandwidth is often comparable with the memory bandwidth. For instance, with our configuration the memory bus throughput is 180 MBytes/s when reading memory, a memory copy runs at about 80 MBytes/s, and the network bandwidth is 160 MBytes/s. So the choice between doing a memory copy and putting user data directly on the network can impact performance as much as the old-fashioned store-and-forward strategy slows down routing compared to circuit-switched or wormhole strategies. In our case, each memory copy is exactly like having a store-and-forward step on the network. BIP is designed for zero memory copy transfers, and the data transfer is fully pipelined in order to increase the throughput. We describe this design in the following; more details about other parts of the implementation can be found in [PT98, PT97a].
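As a back-of-the-envelope illustration of that last claim, using the figures above and assuming the extra copy is not overlapped with the transmission, a single intermediate copy at 80 MBytes/s in front of a 160 MBytes/s link caps the end-to-end rate at

$$B_{\mathrm{eff}} = \frac{1}{\frac{1}{80} + \frac{1}{160}} \approx 53\ \mathrm{MBytes/s},$$

i.e. the copy costs as much as an additional store-and-forward hop running at half the wire speed.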
5 Modeling of the hardware platform

In order to optimize the pipelined exchanges, a complexity study was conducted to find the analytical expression of the optimal block transfer. From these results, the loss from the optimum was computed for each message length interval (with power-of-two bounds for efficiency reasons). An adaptive strategy for the message splitting was then implemented in the firmware using a tabulation of the results. This study can be conducted for any network of the same kind, with an adjustment of the parameters.
5.1 Pipeline study
The sending of a message along the data path is done in four steps in the BIP software:
- from main memory to the network interface memory on the sending machine,
- from the network interface memory to the wire on the sending machine,
- from the wire to the network interface memory on the receiving machine,
- from the network interface memory to the main memory on the receiving machine.

In order to decrease the initialization time and create a pipeline effect, big messages are split into packets of equal size to pipeline the 4 previously described steps. The host processors are only involved at the transfer initialization, to give the boards the locations where to take and, respectively, store the target messages. After that, all the transmission is managed by our protocol implementation on the boards. Exchanges between the main memory and the network board are done using the network board DMA on the PCI bus. The maximum throughput of this exchange was evaluated with a Myricom test program and is around 128 MBytes/s on our platform. Exchanges on the Myrinet wires are clocked at 160 MBytes/s, but the real speed between Myrinet boards is limited by the DMA between the board memory and the wire, which runs at 132 MBytes/s. A message transfer thus follows a 3-phase path with 4 DMAs, as described in Figure 3. Note that a message contiguous in user memory will not be contiguous in physical memory. Moreover, the alignment has no reason to be the same on the sending and receiving sides, which means the splitting into packets is not the same at both sides: a packet at the receiving side may be composed of two fractions of consecutive packets. Our algorithm arranges things so that the flow of bytes on the wire is continuous, and alignment has a negligible impact on performance as soon as the number of packets is big enough. This is generally the case because the packet size depends on the whole message length.
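The following minimal sketch only illustrates the splitting step just described (it is host-side arithmetic for illustration, not the BIP firmware, which runs on the LANai): a message of L bytes is cut into equal packets of P bytes plus a possibly smaller last packet, each packet then going through the 3-phase, 4-DMA path of Figure 3.

```c
/* Sketch of the message splitting used by the pipeline: ceil(L/P) - 1
 * full packets of P bytes followed by a last packet of L' <= P bytes.
 * L and P below are arbitrary example values. */
#include <stdio.h>

int main(void)
{
    unsigned long L = 100000;                  /* message length in bytes (example) */
    unsigned long P = 8192;                    /* packet size in bytes (example)    */
    unsigned long npackets = (L + P - 1) / P;  /* ceil(L/P) */
    unsigned long last     = L - (npackets - 1) * P;   /* L' */

    for (unsigned long k = 0; k < npackets; k++) {
        unsigned long offset = k * P;
        unsigned long size   = (k == npackets - 1) ? last : P;
        printf("packet %lu: offset %lu, size %lu\n", k, offset, size);
    }
    return 0;
}
```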
[Figure 3: main memory -> PCI bus (phase 1) -> interface card -> network (phase 2) -> interface card -> PCI bus (phase 3) -> main memory, from the sending machine to the receiving machine.]
Figure 3: Data path of a message transfer.
5.2 Data transfer model
Communication modeling is done with the classical (for parallel computing) affine model for each phase of the transfer:
$$T(L) = \alpha + \beta L$$
for a message of size L, with $\alpha$ representing the startup time, i.e. all the initialization times (software and hardware), and $1/\beta$ the throughput of the channel.
In the following, the pairs $(\alpha_2, \beta_2)$ and $(\alpha_3, \beta_3)$ are respectively the parameters of the DMA exchanges on the PCI bus and on the Myrinet network. $\alpha_1$ is another startup time, necessary for the modeling, that will be described in the following. Using the peak performance values, notice that:
$$\beta_2 > \beta_3 \qquad (1)$$
Then, in the next sections, we distinguish two cases for the data transfer model T(L) of a message of size L.
5.2.1 Messages that are smaller than a packet
The message is sent in one chunk and the transfer time, knowing (1), is:
$$T(L) = \alpha_1 + \alpha_2 + \beta_2 L + \alpha_3 + \beta_3 L + \alpha_2 + \beta_2 L$$
$$T(L) = \alpha_1 + 2\alpha_2 + \alpha_3 + (2\beta_2 + \beta_3)\,L \qquad (2)$$
5.2.2 Messages bigger than the packet size
The 3 transfer phases are independent and thus a pipeline can be set up that allows 3 different packets to be on the data path at the same time. Let P be the packet size; there are $\lceil L/P \rceil - 1$ packets of size P and one of size $L' = L - (\lceil L/P \rceil - 1)\,P$. The pipelined transfer is illustrated in Figure 4.
Figure 4: Transfer of a message of size L = 3P

The startup $\alpha_1$ represents the initializations that happen while the first packet is going through the data path. They are all gathered at time 0 in order to simplify the representation. With our software, there is no overlap within a packet transfer, i.e. phase 2 of packet n cannot begin before phase 1 of packet n and phase 3 of packet n-1 are finished. Thus the following conditions must be verified:
1. $\alpha_2 + \beta_2 P > \beta_3 P$
2. $\alpha_2 + \beta_2 P + \alpha_3 + \beta_3 P > \beta_3 P + \alpha_2 + \beta_2 P$

Condition 1 is verified from equation (1). Condition 2 implies that $\alpha_3 > 0$, which is always true. Our modeling is then coherent with the behavior of our software and the hardware parameters. We can then set up the following equation for the transfer time of a message of size L. If L is a multiple of P:
$$T(L,P) = \alpha_1 + \frac{L}{P}\,(\alpha_2 + \alpha_3 + \beta_2 P) + \beta_3 P + \alpha_2 + \beta_2 P$$
$$T(L,P) = L\left(\frac{\alpha_2 + \alpha_3}{P} + \beta_2\right) + \alpha_1 + \alpha_2 + (\beta_2 + \beta_3)\,P \qquad (3)$$

For any L, the last packet, of size $L'$, can be smaller; we consider that our software splits off the equal-size packets first and thus that the packet of size $L'$ is the last one, as described in Figure 5.
Figure 5: Splitting of a message into packets

In the transfer of the last two packets, described in Figure 6, the timings can be slightly different.
Figure 6: Transfer of the last two packets of a message of size $L = nP + L'$

Let $T' = \max(\beta_3 P,\ \alpha_2 + \beta_2 L')$. The transfer time can then be expressed with the following equation, taking into account that the last packet is smaller:
$$T(L,P) = \alpha_1 + \left(\left\lceil\frac{L}{P}\right\rceil - 1\right)(\alpha_2 + \alpha_3 + \beta_2 P) + \max\!\left(\beta_3 P + \alpha_2 + \beta_2 P,\ T' + \alpha_3 + \beta_3 L'\right) + \alpha_2 + \beta_2 L' \qquad (4)$$
For the sake of clarity, we will continue our complexity study in the case where L is a multiple of P. This anyway provides a very tight upper bound for the general case. The experimental measurements of the startup times and throughputs gave us good precision on the following terms, which we then express in equation (3):
$$\alpha = \alpha_2 + \alpha_3, \qquad \beta = \beta_2, \qquad \alpha_0 = \alpha_1 + \alpha_2, \qquad \beta_0 = \beta_3 + \beta_2$$
Equation (3) becomes:
$$T(L,P) = L\left(\frac{\alpha}{P} + \beta\right) + \alpha_0 + \beta_0\,P \qquad (5)$$
We use this equation to deduce analytically various limits on the data transfer performance in the following.
5.2.3 Optimal packet size computation
We differentiate equation (5) with respect to P to obtain the optimal packet size as a function of L:
$$P_{opt} = \sqrt{\frac{\alpha L}{\beta_0}} \qquad (6)$$
Notice that for this equation to hold, $L > P_{opt}$ must be verified, i.e. $L > \frac{\alpha}{\beta_0}$.
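As a concrete illustration of how equation (6) feeds the adaptive splitting strategy described at the beginning of Section 5, the short program below tabulates one packet size per power-of-two message-length interval, together with the throughput predicted by equation (5). The $\beta$ values are taken from the DMA throughputs quoted in Section 5.1 (128 and 132 MBytes/s); the startup values $\alpha_1$, $\alpha_2$, $\alpha_3$ are placeholders, not the measured ones, and the program is a sketch, not the BIP firmware.

```c
/* Sketch: tabulate P_opt = sqrt(alpha * L / beta0) (equation (6)) per
 * power-of-two message-length interval.  Parameter values are partly
 * placeholders, see the comments.  Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Per-phase parameters: beta values follow the DMA throughputs of
     * Section 5.1 (128 MBytes/s on the PCI bus, 132 MBytes/s on the wire);
     * the startup times alpha1..alpha3 are placeholders, not measured. */
    double beta2 = 1.0 / 128.0, beta3 = 1.0 / 132.0;  /* microseconds per byte */
    double alpha1 = 2.0, alpha2 = 1.0, alpha3 = 1.0;  /* microseconds, placeholders */

    /* Aggregated parameters of equation (5). */
    double alpha  = alpha2 + alpha3;
    double beta   = beta2;
    double alpha0 = alpha1 + alpha2;
    double beta0  = beta2 + beta3;

    /* One table entry per power-of-two interval: the packet size is
     * frozen for all message lengths falling in that interval. */
    for (double L = 4096; L <= 4.0 * 1024 * 1024; L *= 2) {
        double p_opt = sqrt(alpha * L / beta0);                              /* equation (6) */
        double t     = L * (alpha / p_opt + beta) + alpha0 + beta0 * p_opt;  /* equation (5) */
        printf("L = %8.0f bytes  P_opt = %6.0f bytes  predicted BW = %6.1f MBytes/s\n",
               L, p_opt, L / t);
    }
    return 0;
}
```

This mirrors the tabulation mentioned in Section 5: the optimum is computed once per interval, so that at run time choosing the packet size amounts to a simple table lookup.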
5.2.4 Computation of the maximum throughput
For a given message size L, the maximum throughput of the platform can be deduced from (6):
$$BW_{max}(L) = \frac{L}{T(L, P_{opt})} = \frac{L}{2\sqrt{\alpha\beta_0 L} + \alpha_0 + \beta L} \qquad (7)$$
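For the reader's convenience, the denominator of (7) follows from substituting the optimal packet size (6) into the transfer time (5):
$$T(L, P_{opt}) = \beta L + \frac{\alpha L}{P_{opt}} + \beta_0 P_{opt} + \alpha_0 = \beta L + \sqrt{\alpha\beta_0 L} + \sqrt{\alpha\beta_0 L} + \alpha_0 = \beta L + 2\sqrt{\alpha\beta_0 L} + \alpha_0.$$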
5.2.5 Computation of minimum message size
In order to reach a given throughput S, the minimum message size necessary can be computed from (7):
$$\frac{L}{T(L, P_{opt})} \geq S$$
which gives
$$L \geq \left(\frac{\sqrt{\alpha\beta_0} + \sqrt{\alpha\beta_0 + \alpha_0\left(\frac{1}{S} - \beta\right)}}{\frac{1}{S} - \beta}\right)^{2}$$
with $S < \frac{1}{\beta}$, the asymptotic bandwidth.
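A sketch of the algebra behind this bound, assuming $S < 1/\beta$ so that the leading coefficient below is positive: from (7), $\frac{L}{2\sqrt{\alpha\beta_0 L} + \alpha_0 + \beta L} \geq S$ is equivalent to
$$(1 - \beta S)\,L - 2S\sqrt{\alpha\beta_0}\,\sqrt{L} - S\alpha_0 \geq 0,$$
a quadratic inequality in $\sqrt{L}$ whose positive root gives
$$\sqrt{L} \ \geq\ \frac{\sqrt{\alpha\beta_0} + \sqrt{\alpha\beta_0 + \alpha_0\left(\frac{1}{S} - \beta\right)}}{\frac{1}{S} - \beta},$$
and squaring yields the bound above.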