Implementing a Low Cost, Low Latency Parallel Platform

G. Chiola, G. Ciaccio

DISI, Universita di Genova, 16146 Genoa, Italy
{chiola,ciaccio}@disi.unige.it

Abstract

The cost of high-performance parallel platforms prevents parallel processing techniques from spreading in present-day applications. Networks of Workstations (NOW) exploiting off-the-shelf communication hardware, high-end PCs and standard communication software provide much cheaper but poorly performing parallel platforms. In our NOW prototype, called GAMMA (Genoa Active Message MAchine), every node is a PC running a Linux operating system kernel enhanced with efficient communication mechanisms based on the Active Message paradigm. Active Messages provide a virtualization of the network interface close enough to the raw hardware to guarantee good performance. The preliminary performance measures obtained with GAMMA show how competitive such a cheap NOW can be.

1 Introduction

Historically, Local Area Network (LAN) device drivers in the Operating System (OS) kernel of a workstation have never been optimized like other devices whose performance is critical for user applications (such as disk drivers, memory management units, co-processors, etc.). At the beginning, LANs were used for sporadic interprocessor communications with little impact on application performance, such as email, ftp and telnet. Gradually the importance of LANs in computer system architecture increased with the introduction of several "network services" such as the Network File System (NFS), to the point that nowadays workstations connected to a LAN are actually used as if the "network" itself were a real distributed system. During this gradual transition we observed a curious phenomenon of "sedimentation" of several layers of protocols and software that, one after the other, became so-called "de-facto standards." The main motivation for this phenomenon was the real need for standardization of components, which allows inter-operability of devices in a network of heterogeneous systems.


It was this need for inter-operability of non-homogeneous components that made the use of several layers of very inefficient protocols mandatory, and that prevented the optimization of OS kernel support for the use of LANs in high-performance inter-processor communication.

On the other hand, in the last few years "massively parallel platforms" have been introduced by several companies. The current trend is to use standard commercial processors as computation nodes. These parallel platforms provide very high performance inter-processor communication facilities, at a substantially higher cost than a traditional Network Of Workstations (NOW) using the same processing nodes. In these machines the high communication performance is obtained by a very careful implementation of the OS kernel support for inter-processor communication, as well as by the use of high performance hardware devices.

A priori it appears very difficult to establish whether the performance difference between a NOW equipped with only commodity hardware and software and, say, a Connection Machine CM-5 with processors connected by a sophisticated Fat Tree network [7], as experienced by an application requesting message exchange in an MPI [11] environment, is caused more by the difference in communication hardware technology or by the level of software optimization. What we claim (and we think the results presented in this paper are very strong evidence supporting our claim) is that the difference is mainly due to inefficiency in LAN software. We believe that with a proper OS kernel organization even a Fast Ethernet based NOW could deliver acceptable performance for a significant class of parallel processing applications.

Nowadays individual workstations in a NOW are very cheap and deliver performance comparable to what only a few years ago was achieved only by super-computers. Standard LAN technology is also very cheap. The idea of assembling a high-performance hardware platform for parallel processing out of a standard NOW configuration is thus becoming extremely appealing. The only obstacle to this approach is the poor performance of LAN message exchanges as currently implemented in NOW environments.

Portability and hardware-independence are, of course, extremely important even in parallel processing applications, so we are not proposing the elimination of all de-facto standards that allow portability and code re-use. We are, however, proposing to limit the use of a de-facto standard (such as MPI) to the user application level only. All lower system programming levels should, in our opinion, be "standard free" and implemented from scratch. The use of communication hardware should be as direct and simple as possible, in order to bring the raw performance of hardware devices as close to the application level as possible.

We cannot claim that this approach is original. Indeed, the same idea led a research group from the University of California at Berkeley to propose so-called Active Messages [12] as a very basic form of inter-process communication in a parallel programming environment. A commercial implementation of an Active Message layer can be found, for instance, in Thinking Machines' CMAML library for the CM-5 platform [1]. The use of Active Messages as an efficient software organization to support high performance communication is now consolidating, and several research projects are currently being carried out on this topic [7,8].

Our original contribution to the Active Message approach is the use of this very same concept in a standard, extremely cheap NOW environment. Our hardware platform currently is a network of Intel Pentium 133 MHz PCs connected by a 100 Mb/s Fast Ethernet repeater hub. Our software platform is version 2.0.0 of the Linux operating system, a freely available Unix clone, enhanced with our own implementation of Active Messages. Our result consists in showing that this extremely cheap, standard platform can beat very expensive parallel platforms such as the CM-5, the SP-2, the T3D without shared memory, etc., in terms of message latency (while of course it cannot compete in terms of communication bandwidth, due to the obvious difference in communication hardware complexity and cost). Yet, several parallel processing applications exist that would benefit from the availability of a truly cheap platform offering moderate bandwidth, low latency communications.

The rest of the paper is organized as follows. We first present our low cost platform and then briefly explain how our Active Message communication mechanisms are supported by an extension of the Linux kernel. Then we present a set of preliminary performance measures that show the cost-effectiveness of our platform. We conclude by outlining the future directions of our research.

2 The GAMMA project

The Genoa Active Message MAchine (GAMMA) project started in December 1995, when the Department of Informatics and Computer Science (DISI) of the University of Genoa decided to renew its teaching laboratories by buying some 40 personal computers based on the Intel Pentium processor, to be used as low cost Unix workstations in a LAN environment. The obvious idea was to try to use the same hardware as a kind of parallel platform, in order to obtain a "parallel processing lab" for free.

At the time we knew about two other projects aimed at producing a parallel platform out of very similar hardware (homogeneous processors running the Linux OS): the Beowulf project developed at NASA [9], and the PARMA2 project [10] carried out by colleagues at the University of Parma, with whom we exchanged ideas, experiences and preliminary results.

The difference between GAMMA and the two above-mentioned projects is that we were willing to extend the functionality of the OS kernel with new, performance-oriented support for inter-process communication, while the other projects focused on limited changes to the device drivers and/or the communication protocols. A more radical approach has been followed by Cornell University's U-Net project, which has recently also produced a PC/Fast Ethernet based prototype [13], though failing to achieve the same results that we obtained in terms of message latency.

2.1 The hardware and software architecture

The hardware platform for our NOW experiment is composed of a set of 12 workstations connected by means of two independent LANs: one 10 Mb/s Ethernet running standard communication protocols and supporting the usual network services in the Unix environment, and one isolated 100 Mb/s Fast Ethernet devoted to the implementation of fast inter-processor communications. All 12 processing nodes have exactly the same hardware configuration:

- Intel Pentium 133 MHz CPU;
- PCI motherboard with 256 KB of 15 ns pipelined burst secondary cache and a PCI Intel Triton chipset;
- 32 MB of 60 ns RAM;
- 3COM 3C595-TX Fast Etherlink 10/100BASE-T PCI network adapter.

The Fast Ethernet LAN consists of a 3COM LinkBuilder FMS 100 repeater hub with 12 RJ-45 ports, to which each Fast Etherlink adapter is connected by a UTP cable. At the time the project started, in Italy the cost of each node could be roughly estimated at around K$ 3, including a high quality 17" monitor, a 1 GB EIDE disk and accelerated graphics. The cost has already dropped below K$ 2 per node in the year since then. Today a network of 200 MHz P6 processors can be bought for the same price.

Each workstation runs a modified version of the Linux OS kernel, version 2.0.0. Each process running on a node has independent access to the local file system as well as to the shared file system located on file servers (accessible by means of the separate 10 Mb/s Ethernet LAN).

An inter-process communication is accomplished by the sender process invoking an Active Message "send" system call. The goal of this system call is to copy a portion of memory content from the user memory space of the sender process on one node to the user memory space of the receiver process on another node.

On the sender side this is accomplished by the CPU writing the memory content directly into the network adapter, with no intermediate buffering in kernel space. The overhead is very limited thanks to the low abstraction level of our communication protocol. Writing into the network adapter results in sending one or more Ethernet frames across the LAN, which are received by the network adapter on the receiver side. An interrupt handler on the receiving node is then launched, which makes the CPU copy the content of the adapter's input queue to the memory space of the receiver process. Upon completion of the memory-to-memory data transfer, a user defined receiver handler is called on the receiver node. In general, the role of a receiver handler is to integrate the incoming message into the data structures of the receiver process and to prepare a buffer in user memory space for the next message to be received. Receiver handlers may be used, for instance, to implement communication protocols if needed at the application level.

It is worth pointing out that in the Active Message approach each process belonging to a running parallel application has at least two distinct threads, namely:

- the main process thread, which performs the computations and sends messages; this thread is subject to the usual scheduling policy of the OS kernel;
- one or more receiver threads, each one corresponding to a receiver handler and devoted to integrating incoming messages into the main thread and preparing the next message reception; receiver threads are executed right away upon message reception under the control of the device driver's interrupt handler, without involving the OS kernel scheduler.

The receiver threads may cooperate with the main process thread by sharing all the global data structures of the program. This multi-threading mechanism avoids the descheduling of the receiver process that is inherent to blocking message reception as usually implemented in Unix "sockets", thus substantially reducing latency. Active Message exchanges can be completed without involving the OS kernel scheduler, hence reducing context switching overhead as well.
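To make the receiver handler model concrete, the following fragment sketches what such a handler might look like at the application level. It is our own illustration, not GAMMA code: the zero-argument handler signature, the variable names and the buffer size are assumptions; only the general structure (a handler that marks message arrival and leaves the data in a pre-registered user-space buffer) follows the description above.

#define MSG_MAX 1516                  /* at most one full Ethernet payload (assumed) */

static volatile int msg_arrived = 0;  /* set by the handler, polled by the main thread */
static char recv_buf[MSG_MAX];        /* user-space buffer registered with the port    */

/* Receiver handler: run by the driver's interrupt handler right after the
 * incoming message has been copied into recv_buf. It only integrates the
 * message into the program's data structures; here it just raises a flag. */
static void my_handler(void)
{
    msg_arrived = 1;
    /* a real handler would also re-arm the port with a (possibly fresh)
     * receive buffer for the next message */
}

/* Error handler: invoked on the (rare) detection of frame loss or corruption. */
static void my_err_handler(void)
{
    /* user-level recovery, or abort the application */
}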

2.2 GAMMA and MPI

MPI [11] is a well-known library for communication based on the message passing paradigm. Various implementations of MPI have been developed [5,2] for many parallel computers (IBM SP-1 and SP-2, Cray T3D, Meiko CS-2, NCUBE-2, to mention only a few), as well as for a vast number of conventional UNIX workstations. In the latter case a socket-based implementation of MPI has been developed, which offers an emulation of a parallel platform that can be promptly installed on various hardware configurations, ranging from a single UNIX workstation to a network of heterogeneous workstations.

One of our long-term goals is to implement the MPI library on GAMMA. The main motivation for doing so is to provide users with a parallel programming interface whose abstraction level is high enough to allow easy parallel programming. However, efficient exploitation of hardware capabilities is also of concern, as in all effective approaches to parallel processing. From the viewpoint of ease of porting MPI, any operating system presenting a socket-like communication interface is a good candidate for supporting the socket-based implementation of MPI with limited porting effort. On the other hand, this very consideration leads to a "sedimentation" of software layers, with a corresponding lack of efficient exploitation of hardware capabilities, especially in terms of raw communication performance.

On the basis of this consideration we preferred to develop an efficient low-level communication layer first. We intend to implement the most critical functions of MPI from scratch on top of this low-level communication layer. This will indeed require a greater MPI porting effort, but this way we may expect the distance between raw hardware performance and parallel application performance to be significantly reduced. User applications will therefore be offered a standard MPI environment that is either much faster (compared to present implementations on NOWs) or much cheaper (compared to present "real parallel platforms"). Following this approach, we believe that a NOW architecture like ours may well compete in cost/performance terms with many other parallel machines.

2.3 The GAMMA system calls

Capitalizing on previous experience with the use of the Active Message layer CMAML provided by the CM-5 platform [4], we wondered about the feasibility of a direct implementation of a similar set of Active Message primitives in the Linux kernel. We defined a set of communication primitives as system calls to be added to the standard ones provided by Linux. We then started to implement a subset of these system calls and a custom device driver, in order to obtain a rough prototype with which to start measuring the actual performance benefits of our approach.

The way these low-level primitives and their data structures are embedded into the original Linux kernel is conservative: no Linux kernel functions or data structures were modified, and no scheduling or memory management policies were changed. Moreover, most of the new code does not rely on Linux data structures at all. This guarantees compatibility with future versions of the Linux kernel, even in case of drastic changes to its own data structures.

In order to discuss the data structures and the primitives of our system, it is worth briefly pointing out the computational model we had in mind during the design phase, as well as the way a parallel application is launched on GAMMA. In what follows, the words "workstation" and "PC" are used as synonyms.

A physical GAMMA is the set of M workstations connected to the fast LAN. Each workstation is addressed by the Ethernet address of the corresponding fast network card. A virtual GAMMA is a set of N computation nodes, each one corresponding to a distinct workstation in the physical GAMMA. The nodes are numbered from zero onwards, and this numbering provides the necessary addressing of nodes at the application level in the most straightforward way. At the kernel level each of these numbers is mapped onto the Ethernet address of the corresponding PC. A parallel program may be thought of as a finite collection of N processes running in parallel, each on a distinct node of a virtual machine.

GAMMA supports parallel multitasking, i.e., more than one virtual GAMMA may be spawned at the same time on the same physical GAMMA, each one running its own parallel program on behalf of potentially different users. As a consequence, any workstation may be shared by more than one parallel program at a given time. Each virtual GAMMA and the corresponding running parallel program is identified by an identification number which is unique in the platform. We call such an identification number a parallel PID. Processes belonging to the same running parallel application are characterized by the parallel PID of the application itself. The parallel PID is used to distinguish among processes belonging to different parallel applications at the level of each individual workstation. A maximum of 256 different parallel PIDs may be active on a physical GAMMA.

Each process has 255 communication ports numbered from one onwards; port number zero is reserved for kernel-level fast communications, like those occurring to create a virtual GAMMA and launch a parallel program. Each port is bidirectional, which means that data can be both sent and received through it. Each port can be mapped to another port of another process, belonging either to the same parallel program or to a different one, thus forming an outgoing communication channel. However, there is no explicit connection protocol, and the correctness and determinism of the communication topology induced by the port-to-port mapping is up to the programmer.

A parallel program, namely the set of N processes sharing a common virtual GAMMA and parallel PID, is launched in the following way:

- The user working on a given PC simply types the name of the executable file, launching it as a sequential executable.

- At a given point of execution, the gamma_init(N) system call for starting the parallel session is encountered and invoked. This system call, whose interface will be described later on, first generates a virtual GAMMA by assigning a unique parallel PID to it and labelling it with the name of the invoking program. Then it registers the invoking process on it, chooses N-1 PCs in addition to the local one, and sends them a kernel-level communication asking them to join the new virtual GAMMA. Finally it launches N-1 copies of the program itself on the chosen PCs via the standard Remote Shell service, as if they too were independent, sequential executables. For this to work, the user must have remote shell rights on the other PCs, and the executable file must be accessible from each PC under the same path-name, possibly through the standard NFS service. The local PC becomes node number 0 (master) within the new virtual GAMMA.
- The other N-1 running copies of the program launched by the first one eventually encounter and invoke the gamma_init() system call as well. However, in this case a virtual GAMMA labelled with the name of the invoking program already exists and a parallel PID is already assigned to the process, so no further virtual GAMMA is created. Instead, the calling process registers itself on the existing virtual GAMMA. When all N-1 "slave nodes" have registered, the parallel computation may start.

In order not to have to modify any pre-existing kernel data structures (e.g., process descriptors), parallel PIDs are kept distinct from the Linux PIDs. This requires a "table of parallel processes" at kernel level on each workstation, in which each process belonging to a parallel program is registered, allowing the conversion of parallel PIDs to standard PIDs locally on each workstation.

Each process in a parallel program has an associated table of ports. For each port of the process there is an entry in the table of ports storing the pointer to the receiver handler to be run upon receiving a message through that port. Moreover, if the port is mapped to another one, the entry stores the node number, the parallel PID and the port number characterizing the receiver end. This way the port number fully defines the destination of messages sent through that port, as well as the actions performed by the process to handle incoming messages from that port.

When a transmission is to take place, data are copied from user space to the network adapter directly, rather than being buffered in kernel space. This improves performance and poses no protection problems, as writing to the adapter still requires supervisor privilege and as such can only be accomplished through a system call. In the case of programmed I/O this poses no memory access problems either, since the transmitting process is running when the send function copies data from user space to the network adapter.

We exploited the same kind of optimization on the receiving end. Each time a process expects to receive messages from one of its own ports, it must provide in advance a pointer to a user-space buffer where the incoming message is to be stored.

This pointer is kept in the port table of the process, at the entry corresponding to the receiving port, together with the pointer to the receiver handler. When a message arrives at the network adapter, the process which is the destination of that message might not be running. In order to correctly deliver data while avoiding buffering in kernel space, it is thus necessary to lock user memory pages in RAM, to avoid swapping of the receive buffer. Moreover, on message reception the device driver must restore the receiver process' segment descriptors and page table pointer in the CPU registers, in order to copy data into the user-space buffer and execute the receiver handler. A kind of "temporary context switch" to the receiving process is thus accomplished in order to deliver the message without storing it in kernel space. To accomplish this rather tricky operation, the kernel must keep a copy of the segment descriptors and of the page table pointer of each process involved in a running parallel program.

Each message is sent on the Fast Ethernet LAN as a sequence of one or more Ethernet frames. Each frame is labelled with a progressive sequence number, which allows the receiver to detect possible frame losses. Each frame has a 20 byte header:

- The first 12 bytes are the Ethernet addresses of the receiving adapter and the sending adapter, respectively.
- The next 2 bytes distinguish between GAMMA frames and frames of standard communication protocols. This allows a single Fast Ethernet to be used for both Active Message and standard communications simultaneously.
- The subsequent 2 bytes designate the receiving port by parallel PID and port number.
- The last 4 bytes encode the frame sequence number and the length in bytes of the significant portion of the frame itself.

The extremely simple communication protocol of GAMMA provides neither message acknowledgement nor error recovery. We claim that the high cost of acknowledgements, in terms of loss of bandwidth and additional latency, is not worthwhile in a communication system with an extremely low probability of frame corruption and frame loss. In our experiments we were able to exchange several hundreds of Mbytes from node to node without incurring any corruption or loss of frames. In the (very sporadic) cases of error detection, the custom device driver of our Active Message layer activates a user defined error handler (in the same way as for a receiver handler), which is responsible for implementing user-level error recovery or aborting the application.
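For illustration, the 20-byte header just described could be laid out as follows. The field names and the exact split of the last two groups of bytes (PID/port and sequence/length) are our own guesses; the text only fixes the byte budget of each group.

#include <stdint.h>

/* Sketch of the 20-byte GAMMA frame header; field names are ours. */
struct gamma_frame_header {
    uint8_t  dst_mac[6];   /* Ethernet address of the receiving adapter          */
    uint8_t  src_mac[6];   /* Ethernet address of the sending adapter            */
    uint16_t frame_type;   /* distinguishes GAMMA frames from standard protocols */
    uint8_t  dst_par_pid;  /* receiving parallel PID (256 possible values)       */
    uint8_t  dst_port;     /* receiving port number                              */
    uint16_t seq_number;   /* progressive frame sequence number (split assumed)  */
    uint16_t payload_len;  /* length in bytes of the significant payload         */
};                         /* 6 + 6 + 2 + 1 + 1 + 2 + 2 = 20 bytes               */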

At the moment, the following system calls have been implemented:

- int gamma_init(int N);

  This function spawns a virtual GAMMA and initiates the parallel computation as explained before.

- int gamma_setport(port, partner_node, partner_par_pid, partner_port, (*handler_in)(), (*handler_err)(), *buffer_in);

  This function activates one of the 255 ports of the calling process, which must have previously registered itself in the "table of parallel processes" by calling gamma_init(). It specifies which port to activate, which process on which node is the partner, which of the partner's ports to connect to, the two user defined handlers for message reception and error recovery, and a pointer to a user-space buffer for storing incoming messages.

- int gamma_atomic(void (*funct)(void));

  This function executes its parameter function funct() atomically, i.e., without any interruption, neither by other processes nor by the process' own receiver handlers.

- int gamma_send(port, void *data, len);

  This function sends a message through the fast LAN to the node, process and port unambiguously designated by the port number specified as a parameter. The message is specified by providing both a pointer to the user-space buffer where the message lies and its length in bytes.

- int gamma_sync(void);

  This function implements a barrier synchronization among all processes sharing the same par_pid in the virtual GAMMA: a process may resume execution without error code after calling gamma_sync() only when all other processes running with the same par_pid have also called gamma_sync().

- int gamma_exit(void);

  This function unregisters the calling process from the "table of parallel processes." If the calling process was registered on node 0 of a virtual GAMMA, then the virtual GAMMA itself is destroyed.

In addition to these system calls, other utilities have been defined in order to facilitate the creation of handlers and atomic functions (see [3] for more details).
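As a sketch of how these calls fit together, here is a hypothetical two-node program written against the interface above. The prototypes are our best guess from the text (the paper does not give exact argument types), gamma_my_par_pid() is an invented helper standing in for whatever mechanism supplies the common parallel PID, and the handler is the one sketched in Section 2.1.

/* Prototypes as we assume them; the real GAMMA header may differ.        */
int gamma_init(int n);
int gamma_setport(int port, int partner_node, int partner_par_pid, int partner_port,
                  void (*handler_in)(void), void (*handler_err)(void), void *buffer_in);
int gamma_send(int port, void *data, int len);
int gamma_sync(void);
int gamma_exit(void);
int gamma_my_node(void);     /* utility mentioned in Section 3            */
int gamma_my_par_pid(void);  /* hypothetical helper, not from the paper   */

extern volatile int msg_arrived;   /* from the handler sketch in Section 2.1 */
extern char recv_buf[];
extern void my_handler(void);
extern void my_err_handler(void);

#define DATA_PORT 1          /* ports 1..255 are available to applications */

int main(void)
{
    char greeting[] = "hello";

    gamma_init(2);                       /* create/join a 2-node virtual GAMMA   */

    /* connect our DATA_PORT to DATA_PORT of the same program on the other node */
    gamma_setport(DATA_PORT, 1 - gamma_my_node(), gamma_my_par_pid(), DATA_PORT,
                  my_handler, my_err_handler, recv_buf);
    gamma_sync();                        /* make sure both ends are set up       */

    if (gamma_my_node() == 0)
        gamma_send(DATA_PORT, greeting, sizeof(greeting));
    else
        while (!msg_arrived)             /* busy-wait, as in the ping-pong test  */
            ;

    gamma_sync();
    return gamma_exit();
}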

3 Preliminary Results

Although GAMMA is still incomplete, we were able to perform some experiments and obtain some measurements regarding the communication performance delivered at the application level. We carried out our measurements on a small GAMMA configured with only two Pentium 133 workstations connected to the LAN hub by 100 meter long cables.

Time measurements were taken by means of the TSC register of the Pentium CPU, which is incremented by one at every clock tick. By reading the TSC content before and after the execution of a piece of code, we were able to measure the execution time of that code with an accuracy of 10 ns.

We define the message delay as the time interval between the instant the sender process invokes the send system call and the instant the interrupt handler on the receiver node terminates the execution of the user defined receiver handler, which is invoked upon completion of the copy of the data into the user-space receive buffer. The message delay depends on the message size. Let D(s) denote the delay for a message of size s bytes. We define the message latency as the delay for zero-length messages, D(0). We define the communication throughput T(s) as the transfer rate perceived by the application when sending a message of size s:

T(s) = s / D(s).

We define the communication bandwidth B(s) so that the following equation holds:

D(s) = D(0) + s / B(s).

If B(s) varies little as a function of s, this allows us to derive a linear approximation of D(s) by substituting a constant value B.

The delay was measured by means of a "ping-pong" application. A virtual GAMMA with two nodes is created; node 0 acts as "ping" while node 1 behaves as "pong" (actually both are launched as the same executable and then differentiate from each other by calling the gamma_my_node() utility function, which returns the node number of the local workstation in the virtual GAMMA). Both processes define a receive flag which is set by their receiver handlers and tested by the main process thread. "Ping" first sends a message to "pong", then busy-waits until it gets the message back. "Pong" first busy-waits on receiving a message, then sends it back to "ping". The ping-pong is repeated 1000 times in a loop, collecting the message round-trip times. The communication delay is then computed as half the average round-trip time measured by the "ping" process.

As memory accesses are involved in the experiment, the exploitation of the cache may affect performance. Therefore data were collected in two cases: hot cache, when the send and receive buffers are pre-loaded into the cache before invoking the communication system calls; and cold cache, when the send and receive buffers are flushed from the cache before invoking the communication system calls.
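For concreteness, the timing part of the experiment can be sketched as follows (our own code, simplified to return only the average; the real measurements also recorded min, max and variance per iteration). read_tsc() uses the rdtsc instruction via GCC inline assembly; at 133 MHz one tick is about 7.5 ns.

/* Simplified sketch of the TSC-based delay measurement.                  */
static unsigned long long read_tsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));  /* cycle counter in EDX:EAX */
    return ((unsigned long long)hi << 32) | lo;
}

#define CPU_MHZ 133.0   /* clock frequency of the Pentium nodes           */
#define REPS    1000    /* ping-pong iterations, as in the experiment     */

/* do_ping_pong() is assumed to perform one send plus a busy-wait for the
 * reply; the one-way delay D(s) is half the average round-trip time.     */
double measure_delay_us(void (*do_ping_pong)(void))
{
    unsigned long long t0, t1;
    int i;

    t0 = read_tsc();
    for (i = 0; i < REPS; i++)
        do_ping_pong();
    t1 = read_tsc();

    return (double)(t1 - t0) / (CPU_MHZ * REPS) / 2.0;   /* microseconds  */
}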

time in µs                      min    avg.    max   variance
send sys. call (hot cache)       4.5    4.7   13.8    0.13
send sys. call (cold cache)      8.6    9.9   10.7    0.06
total delay (hot cache)         18.1   18.8   26.8    0.29
total delay (cold cache)        24.3   25.5   27.0    0.45

Table 1 Communication delay with 0 byte messages.
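(By our own arithmetic on the hot-cache averages in Table 1, the send system call accounts for about 4.7/18.8, roughly 25%, of the total zero-byte delay; the remaining 14 µs or so are spent in the adapters, on the wire, and in the receive-side interrupt and handler path.)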

Due to the use of the IEEE 802.3 standard on the physical communication channel, messages are split into frames, each of total size ranging from 60 to 1536 bytes. As previously mentioned, each GAMMA frame carries a 20 byte header. For user messages ranging from 0 to 40 bytes, fewer than 60 bytes are written to and read from the network adapters; the network adapter on the sender side fills the remaining gap with padding bytes in order to conform to the IEEE 802.3 standard, and a minimal-length frame of 60 bytes is sent on the channel just the same. For user messages ranging from 40 to 1516 bytes, the number of bytes written to and read from the network adapter is the same as the length of the Ethernet frame, which in this case carries the whole message. When the message size exceeds 1516 bytes, more than one frame is used. In this case all frames but possibly the last one are full sized, and are transmitted as a "rapid burst" without flow control.

A first experiment was aimed at measuring the message latency D(0) by exchanging empty messages. The only effect of such empty messages is to activate the receiver handler associated with the receiving port. In this case only the 20 byte header is written into and read from the communication adapter, while a regular 60 byte frame travels along the Fast Ethernet (the sender adapter completes the frame with 40 padding bytes). The results obtained are reported in Table 1.

As a consequence of our implementation of message splitting into Ethernet frames, we would expect the linear approximation for D(s) obtained with a fixed value of the bandwidth to provide very good results, at least in the range from 40 to 1516 bytes. We measured throughput and bandwidth for various message sizes and obtained the results listed in Table 2 and plotted in Figure 1. In the case of hot cache we can observe that B(s) takes an almost constant value of about 5.64 MB/s in the range from 256 to 1516 bytes. The small increase in bandwidth for messages ranging from 40 to 256 bytes is due to one of the optimizations that we introduced in the GAMMA driver in order to reduce latency: the receiver adapter is programmed in such a way that the interrupt request is raised upon receipt of the first few bytes of an incoming frame (the threshold was chosen so as to minimize latency), rather than waiting for the complete frame to be stored in the receive queue.

Message size s        hot cache                 cold cache
(bytes)          T(s) (MB/s)  B(s) (MB/s)   T(s) (MB/s)  B(s) (MB/s)
   40               1.53         5.48          1.20         5.33
   64               2.11         5.57          1.63         4.85
  128               3.07         5.59          2.62         5.59
  256               3.99         5.64          3.55         5.57
  512               4.67         5.64          4.37         5.61
  1 K               5.11         5.64          4.89         5.58
 1516               5.28         5.65          5.01         5.48
  2 K               6.08         6.44          5.80         6.26
  4 K               7.78         8.06          7.65         8.04
  8 K               8.78         8.96          8.67         8.91
 16 K               9.45         9.55          9.44         9.59
 32 K               9.80         9.86          9.79         9.87
 64 K               9.86         9.89          9.82         9.86

Table 2 Throughput and bandwidth with messages ranging from 40 bytes to 64 Kbytes.
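As a quick consistency check of the linear model (our own arithmetic, using the hot-cache figures): with D(0) of about 18.8 µs and B of about 5.65 MB/s, i.e. 5.65 bytes/µs, a 1516 byte message gives D(1516) = 18.8 + 1516/5.65, roughly 287 µs, and hence T(1516) = 1516/287, roughly 5.28 MB/s, matching the measured value in Table 2.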

This threshold optimization was introduced with the intent of compensating for the latency of the CPU in answering an interrupt request in the case of short frames. Indeed, before the introduction of this optimization D(0) was 22.8 µs instead of 18.8 µs, and the linear approximation held over the complete interval from 40 to 1516 bytes (to within 1% error), but with a lower constant bandwidth of 3.89 MB/s. In the case of cold cache, the slightly irregular behaviour of the bandwidth as a function of message size up to 128 bytes is due to the significant impact of cache misses and to the fixed size of the cache lines (32 bytes).

Notice that the ping-pong application implicitly controls the frame flow onto the physical channel in the case of messages of 1516 bytes or less: "ping" cannot send another message until "pong" has sent the previous one back, thus behaving like a stop-and-wait communication protocol. This is not the case with longer messages, where a message gives rise to a sequence of frames sent at full speed. Our adapters implement buffering of incoming and outgoing frames in internal FIFO queues: each adapter contains 128 KBytes of internal memory used to implement two FIFO queues, with the send queue taking one fourth of the internal memory and the receive queue taking three fourths of it.
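(In other words, by our own arithmetic, each adapter dedicates 32 KB to the send queue and 96 KB to the receive queue, so the send queue can hold roughly 21 full-sized 1536 byte frames.)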

[Figure 1 shows throughput and bandwidth (in MByte/s), for both hot and cold cache, plotted against message size (in bytes, logarithmic scale).]

Fig. 1. Throughput and bandwidth with messages ranging from 64 bytes to 64 Kbytes.

Transmission of a frame is triggered by writing the frame into the send queue. Subsequent frames may be queued up, before the previously requested frame transmissions are completed, until the entire 32 KB send queue is filled. At that point the send system call waits until enough space is freed in the adapter send queue to contain the next frame. This way the speed of the sender is automatically limited by the speed of the physical link, if the link is slower. Thanks to this organization of the adapters, one-way communication of long messages may be implemented without explicit flow control, in a pipelined fashion where the two CPUs (sender and receiver) work simultaneously with the Ethernet to perform the data copies from/to user memory space.

This pipeline effect improves the exploitation of the raw communication bandwidth. We may also observe that the cache effect decreases as the message becomes larger, due to the limited size of the cache itself as well as to the saturation of the Fast Ethernet bandwidth.

During our experiments no frame was lost (the error handler of the ping-pong application was never invoked), even with very long messages (2 MB long, much larger than the size of the adapters' FIFO queues) sent at full speed without flow control.

Notice that having a single sender at a time handling the whole network bandwidth, sending long bursts of frames at full speed, represents the worst case in terms of possible frame overflow on the receiver side.

Platform                                      Latency (µs)   Bandwidth (MByte/s)
Linux TCP/IP, Pentium 133 MHz                     151.0              5.3
PARMA2, Pentium 100 MHz (no flow control)          73.5              6.6
U-Net, Pentium 133 MHz                             30.0             12.1
GAMMA, Pentium 133 MHz (cold cache)                25.5              9.9
GAMMA, Pentium 133 MHz (hot cache)                 18.8              9.9
CM-5 CMMD                                          93.7              8.3
T3D PVM                                           136.7             25.1
T3D PVMFAST                                        30.0             25.1
SP2 PVMe                                          209.0             16.5
SP2 MPL                                            44.8             34.9

Table 3 Message latency and communication bandwidth: comparison of platforms.

In a more realistic situation, in which the bandwidth of the fast LAN is partitioned among various senders, the speed of each sender is reduced accordingly. This consideration suggests that no flow control protocol may actually be needed in order to implement reliable Active Message communication in our NOW. The same idea of avoiding flow control was also adopted in the PARMA2 prototype, in the form of the PRP protocol replacing standard TCP/IP. In the PARMA2 case, however, the receiver is not fast enough in consuming incoming messages and providing fresh memory buffers for storing them, so that sending messages larger than 300 KBytes currently results in frame loss due to overflow.

Using results provided by our colleagues working on the PARMA2 project [10, Table 2.1], we can compare our latency and bandwidth estimates to similar measurements carried out on other platforms. Table 3 reports some of these results, together with the GAMMA and U-Net [13] figures. As expected, PARMA2, U-Net and GAMMA provide a dramatic improvement with respect to TCP/IP socket communication on the same hardware configuration (first row of the table). Like PARMA2 and U-Net, GAMMA cannot compete with real parallel platforms in terms of total bandwidth. Notice also that the 6.6 MByte/s measured on PARMA2 is somewhat unrealistic for very long messages, since a form of flow control should be introduced on that platform to avoid packet loss.

However, in terms of pure message latency, GAMMA provides an excellent improvement over PARMA2 (whose communication protocol already provides a remarkable improvement with respect to TCP/IP) as well as over U-Net. Indeed, GAMMA appears to be quite competitive even compared to much more expensive parallel platforms, provided that we do not saturate the communication channel with long messages.

Of course, comparing GAMMA performance measured through its very low level communication interface to the performance of other platforms which offer MPI at user level is not completely fair, since this does not account for the software overhead of the higher level communication interface. However, we do not expect the overhead of an MPI implementation over GAMMA to increase message latency in a substantial way. On the other hand, we feel that the figures for the other platforms are all affected by some form of caching effect, due to the repetition of the same ping-pong experiment a number of times; hence a fair comparison should take only the GAMMA hot cache figures into account.

4 Conclusions and Future Work

Our preliminary experimental results show an improvement of more than four times in message delay and latency compared to the PARMA2 prototype, which in turn had already shown a substantial improvement over the standard implementation of socket based inter-process communication in a LAN environment. An improvement of almost two times in latency has also been obtained with respect to the "raw, user level protocol" approach implemented in U-Net on similar hardware. It is worth noting that GAMMA is capable of delivering lower latency than U-Net while providing user applications with a more abstract and flexible virtualization of the network interface. However, it should be noted that the Fast Ethernet prototype of U-Net exploits the raw network bandwidth even better than GAMMA, probably thanks to the DMA features of its network interface.

The network interface supported by our current prototype of GAMMA provides some DMA features, also called "bus mastering". We carried out some preliminary experiments exploiting bus mastering for message transmission. As expected, a larger fraction of the raw communication bandwidth can be delivered this way in the case of long messages. However, the delay for short messages increases too, due to the overhead of setting up the DMA transfer. A trade-off between programmed I/O and bus mastering may be found by retaining programmed I/O only for short messages and switching to DMA for longer ones; work on this aspect is in progress.
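Purely as an illustration of this trade-off (the threshold value and the function names below are arbitrary, not measured or taken from the GAMMA driver):

/* Illustrative only: choose programmed I/O for short messages and DMA
 * (bus mastering) for long ones; the threshold is an arbitrary placeholder. */
#define PIO_DMA_THRESHOLD 1024            /* bytes; to be tuned experimentally */

extern void send_frame_pio(const void *data, int len);  /* hypothetical driver internals */
extern void send_frame_dma(const void *data, int len);

static void send_frame(const void *data, int len)
{
    if (len < PIO_DMA_THRESHOLD)
        send_frame_pio(data, len);   /* low latency: CPU copies into the adapter   */
    else
        send_frame_dma(data, len);   /* higher bandwidth: adapter bus-masters data */
}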

Our results may suffice to make the use of inexpensive NOW platforms feasible and convenient, with respect to much more expensive "commercial parallel platforms", for a large class of applications and with a fair number of processors. Applications well suited to the GAMMA platform are those involving extensive use of small messages exchanged by asynchronous processes running on different processors. No other message passing machine is currently able to deliver lower message latency than GAMMA, even if we compare GAMMA with very expensive parallel platforms. A good example of an application for GAMMA is distributed discrete event simulation, in which low message latency is crucial to obtaining good speed-up [4].

Of course, our current NOW prototype lacks one of the main requirements of a massively parallel architecture, namely scalability with respect to the number of processing nodes. If we increase the number of processing nodes in GAMMA, sooner or later we will saturate the LAN bandwidth. No doubt a Fast Ethernet bus is the worst choice from this point of view: a bus LAN technology can be used only if we assume that we will never try to scale up the number of processing nodes. However, Fast Ethernet switches are starting to become a cheap alternative to other switched LAN technologies, allowing for some bandwidth scalability (currently up to 8-12 nodes) at a reasonably low price. Moreover, the use of a switch allows each channel to become full-duplex, thus completely avoiding collisions. In this sense, with a moderate increase in hardware cost, our platform could easily benefit from switched LAN technology to support applications requiring higher global bandwidth.

Up-to-date information on the GAMMA project is maintained at the URL http://www.disi.unige.it/project/gamma/.


References

[1] Connection Machine CM-5 Technical Summary. Technical report, Thinking Machines Corporation, Cambridge, Massachusetts, 1992.
[2] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. Technical report, Ohio Supercomputer Center, Columbus, Ohio, 1994.
[3] G. Chiola and G. Ciaccio. GAMMA: Architecture, Programming Interface and Preliminary Benchmarking. Technical Report DISI-TR-96-??, DISI, Universita di Genova, November 1996.
[4] G. Chiola and A. Ferscha. Performance Comparable Design of Efficient Synchronization Protocols for Distributed Simulation. In Proc. MASCOTS'95, Durham, North Carolina, January 1995. IEEE-CS Press.
[5] W. Gropp and E. Lusk. User's Guide for mpich, a Portable Implementation of MPI. Technical Report MCS-TM-ANL-96/6, Argonne National Lab., University of Chicago, 1996.
[6] R.W. Hockney. The Communication Challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing, 20(3):389-398, March 1994.
[7] L.T. Liu and D.E. Culler. Measurement of Active Message Performance on the CM-5. Technical Report CSD-94-807, Computer Science Dept., University of California at Berkeley, May 1994.
[8] S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. In Proc. Supercomputing '95, San Diego, California, 1995. ACM Press.
[9] T. Sterling, D.J. Becker, D. Savarese, J.E. Dorband, U.A. Ranawake, and C.V. Packer. BEOWULF: A Parallel Workstation for Scientific Computation. In Proc. 24th Int. Conf. on Parallel Processing, Oconomowoc, Wisconsin, August 1995.
[10] The Computer Engineering Group. PARMA2 Project: Parma PARallel MAchine. Technical report, Dip. Ingegneria dell'Informazione, University of Parma, October 1995.
[11] The Message Passing Interface Forum. MPI: A Message Passing Interface Standard. Technical report, University of Tennessee, Knoxville, Tennessee, 1995.
[12] T. von Eicken, D.E. Culler, S.C. Goldstein, and K.E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In Proc. 19th Int. Symp. on Computer Architecture, Gold Coast, Australia, May 1992. ACM Press.
[13] M. Welsh, A. Basu, and T. von Eicken. Low-latency Communication over Fast Ethernet. In Proc. Euro-Par'96, Lyon, France, August 1996.

