TR-98-003
High Performance Communication using a Gigabit Ethernet
Shinji Sumimoto, Hiroshi Tezuka, Atsushi Hori, Hiroshi Harada, Toshiyuki Takahashi, and Yutaka Ishikawa
Received Sep 28, 1998
Tsukuba Research Center, Real World Computing Partnership Tsukuba Mitsui Building, 16th floor, 1-6-1 Takezono Tsukuba-shi, Ibaraki 305, Japan
Abstract

A high performance communication facility, called the GigaE PM, has been designed and implemented for parallel applications on clusters of computers using a Gigabit Ethernet. The GigaE PM provides not only a reliable, high bandwidth, low latency communication function, but also supports existing network protocols such as TCP/IP. In the design of the GigaE PM, it is assumed that a Gigabit Ethernet card has a dedicated processor and that its program can be modified. A reliable communication mechanism for a parallel application is implemented in the firmware, while existing network protocols are handled by the operating system kernel. A prototype system has been implemented using an Essential Communications Gigabit Ethernet card. The performance results show that a 47.7 microsecond round trip time for a four-byte user message and a 58.3 MBytes/sec bandwidth for a 1,468-byte message have been achieved on Intel Pentium 150 MHz PCs.
Table 1: Performance of TCP/IP on Gigabit Ethernet

NIC        Packet Engines[4]     Essential                   Essential                   Essential
Machine    DEC Alpha, 533 MHz    Intel Pentium II, 333 MHz   Intel Pentium II, 333 MHz   Intel Pentium, 150 MHz
OS         Windows NT            Windows NT                  Linux Redhat 5.0            Linux Redhat 5.0
Bandwidth  29.6 MBytes/sec       27.5 MBytes/sec             26.3 MBytes/sec             12.0 MBytes/sec
1 Introduction

A cluster of computers is widely accepted as an inexpensive means of building a parallel computing system. There are two approaches to building such a system. One is to develop a high performance communication facility on a gigabit class network, such as the AM[5], FM[6], and PM[1, 2] communication facilities. Such a system is comparable in performance to, and less expensive than, existing parallel computers[3]. Since it is mainly designed to replace a parallel computing system or MPP, the computers are installed in a system rack or in a computer room instead of connecting physically distributed machines in a LAN environment. The other approach is to use commodity software and networks such as MPI, TCP/IP, and 100Base-T or ATM-LAN. However, since the current commodity LAN speed is around 100 Mbps, such a system is far slower than a parallel system. On the other hand, it can easily be built on an existing distributed environment without excluding distributed applications.

Recently, the Gigabit Ethernet has become widely available and is the commodity LAN of the next generation. Using the Gigabit Ethernet, a high performance cluster system can be configured in a LAN environment. However, the existing network protocols on the Gigabit Ethernet are not capable of supporting the high performance communication used in a parallel application. As shown in Table 1, TCP/IP achieves less than 30 MBytes/sec although the physical layer's bandwidth is 125 MBytes/sec.

The GigaE PM communication facility has been designed to support high performance communication for parallel applications as well as existing network protocols. Therefore, a cluster system can be constructed in a distributed environment where parallel applications coexist with distributed applications. We assume that a Gigabit Ethernet card has a dedicated processor and that its firmware can be modified. The GigaE PM network protocol is implemented on the network interface card (or NIC for short) so that the data exchange between the host and the NIC is minimized to reduce the overhead. The GigaE PM supports reliable, low latency, high bandwidth communication, i.e., a 47.7 microsecond round trip time for a 4-byte user message and a 58.3 MBytes/sec bandwidth for a 1,468-byte message.

In this paper, the network protocol implementation overhead is analyzed in section 2. This section concludes that the network protocol should be implemented on the NIC to realize a high performance communication facility, but that a network protocol such as TCP/IP cannot be implemented on the NIC due to its restricted memory resources. The GigaE PM design requirements are presented in section 3. The design and implementation of the GigaE PM are described in section 4. In section 5, the basic performance, i.e., latency and bandwidth, is measured and compared with TCP/IP. Related work on high performance communication facilities is described in section 6. Finally, section 7 presents our conclusions.
Table 2: Data transfer costs between the host and NIC memories, and system call costs
(N is the number of 4-byte words transferred; costs in microseconds)

Transfer Method                   Cost (microseconds)
Host to Host by host processor    0.104 x N
Host to NIC by host processor     0.25 x N
NIC to Host by host processor     0.45 x N
Host to NIC by NIC DMA            2.21 + 0.04 x N
NIC to Host by NIC DMA            1.99 + 0.04 x N

Interrupt/System Call             Cost (microseconds)
interrupt                         5.6
ioctl                             1.9
2 Network Protocols and NIC

A modern network interface card (or NIC for short) has an on-board processor and memory so that the data-link layer, or Layer 2, can be handled by the NIC. A message is passed to the host memory via an I/O bus, e.g., PCI. The host processor handles the upper layers, such as TCP/IP. Splitting the protocol handling across the NIC and the host in this way incurs costs. These costs are analyzed using an Essential Communications PCI Gigabit Ethernet NIC plugged into a Pentium 150 MHz PC. Table 2 shows the basic costs of i) data transfer between host and NIC memories, ii) an interrupt, and iii) a system call.

2.1 Cost Estimation
Figure 1 and the following description show the data and control flow model for handling TCP/IP in a typical implementation. This model does not reflect the actual implementation, but rather focuses on the information movement between the host and the NIC. It is assumed that the message sender and receiver buffers, and their descriptors, are located in the host memory. The cost is calculated based on Table 2.
Transfer between the user and kernel spaces. The user data is transferred to the kernel space by invoking a system call. This incurs the system call and memory copy overheads shown in c1) and d1) of Figure 1. The total cost is 1.9 + 0.104 x N microseconds, where N is the number of words.
An IP packet transfer from the host to the NIC. TCP and IP headers are assembled with the data, and then the IP packet is transferred to the NIC. Two pointers, pointing to the header and data areas, are stored in a message sender descriptor. In d2-1) of the figure, it is assumed that the header creation cost is the same as the header copy cost. The host tells the NIC that the new descriptor is available. This cost is shown in c2). Then the NIC issues a DMA so that the packet header and data are transferred to the NIC, whose cost is shown in d2-2). The total cost becomes 5.95 + 0.04 x N microseconds.
An IP packet transfer from the NIC to the host. When an IP packet arrives at the NIC, the packet is transferred to the host. Then the Layer 2 handler in the NIC informs the host of the packet arrival by issuing an interrupt signal. These two costs, shown in d3) and c3), total 10.48 + 0.04 x N microseconds.
Figure 1: Data and Control Flow Model in TCP/IP Handling. (The figure shows the user memory area in user space, the kernel memory area with the TCP and IP protocol handlers in kernel space, and the Layer 2 handler on the NIC, annotated with per-step costs: c1) system call, 1.9 usec; d1) memory copy between user and kernel memory, 0.104 x N usec; d2-1) header creation, 0.6 usec; c2) flag set, 0.25 usec; d2-2) transfer from host to NIC, descriptor 2.21 + 0.08 usec and IP packet 2.21 + 0.6 + 0.04 x N usec; d3) transfer from NIC to host, descriptor 2.21 + 0.08 usec and IP packet 1.99 + 0.6 + 0.04 x N usec; c3) interrupt, 5.6 usec. N is the number of words, four bytes per word.)

2.2 Discussions
According to the above cost estimation, a one-word message transfer from the sender to the receiver requires 20.518 microseconds of overhead for communication between the host and the NIC. For the total cost of message passing from the sender to the receiver, the TCP/IP protocol handling overhead in the host processor and the Layer 2 protocol handling in the NIC must be added. To implement the TCP/IP protocol, control packets, such as ACK, are passed between the sender's and receiver's protocol handlers, where the communication overhead between the host and the NIC is 16.43 microseconds according to our cost estimation.

To build a high performance communication facility, the overhead between the host and the NIC shown above must be reduced as much as possible. One approach is to perform the TCP/IP protocol handling in the NIC. In such a TCP/IP implementation, both the sender and receiver need some message buffer area to implement a sliding window protocol in which the sender may send messages asynchronously. The number of messages sent asynchronously is specified as the window size, and the sender and receiver must keep message buffer area for those N messages. A parallel application, especially a data parallel application, is realized by processes, each of which runs on a computer and communicates with the other processes. If such an application uses TCP/IP, each process must establish TCP/IP connections with all the other processes. For example, if a process has 128 connections, the window size is 8, and the MTU is 1.5 KB, 1.5 MBytes of memory is required. Since most NICs have only 1 MByte of memory or less, a TCP/IP protocol handler cannot be implemented in a NIC. Therefore, we have designed a high performance communication facility, called the GigaE PM, whose transport protocol provides TCP/IP-like reliability yet can be implemented in a NIC.
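For reference, the 20.518 and 16.43 microsecond figures above follow directly from the per-step costs of Figure 1. The short derivation below is a sketch under the model's assumptions: a one-word (N = 1) payload crosses the user-to-kernel copy plus system call, the host-to-NIC packet transfer, the NIC-to-host transfer plus interrupt, and the kernel-to-user copy plus system call on the receiving side, while a control packet is treated as a header-only (N = 0) transfer between the two NICs' protocol handlers.

\begin{align*}
T_{\mathrm{data}}(N{=}1) &= (1.9 + 0.104) + (5.95 + 0.04) + (10.48 + 0.04) + (1.9 + 0.104) = 20.518\ \mu\mathrm{s},\\
T_{\mathrm{ctrl}}(N{=}0) &= 5.95 + 10.48 = 16.43\ \mu\mathrm{s}.
\end{align*}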
Figure 2: Virtual Network. (Parallel applications A1 and A2 run on Node#0 through Node#N; on each node, the process for A1 sends and receives over one channel of the PM interface while the process for A2 uses another, each PM interface providing channels ch#0 through ch#3.)
3 Design Requirements

The GigaE PM design requirements can be summarized as follows:

1. A simple network protocol for a parallel application. According to the previous section's discussion, the network protocol must not require large message buffers in order to communicate with many processes. It must be simple in the sense that it can be implemented on a NIC, which has restricted hardware resources.

2. Reliability and FIFO-ness are guaranteed. Since the Gigabit Ethernet does not guarantee message arrival, the network protocol must support reliable communication.

3. Low overhead information exchange between the NIC and the host. As previously stated, the location of the information shared by the host and the NIC dominates the information exchange overhead.

4. The GigaE PM protocol coexists with other protocols such as TCP/IP. Since the communication facility is used in a LAN environment where other protocols such as TCP/IP are also used, the facility must support both the dedicated protocol for high performance computation and traditional network protocols.
4 Design and Implementation

4.1 Virtual Network
The GigaE PM communication facility is designed to suit a parallel application, especially a data parallel application. It provides channels to support the virtual network concept introduced in PM[1]. The processes of a parallel application exclusively use the same channel number, which represents a virtual network. The number of channels depends on the NIC's hardware resources. In the current implementation, four channels are supported.
Figure 2 shows an example of virtual network usage. It is assumed that parallel applications A1 and A2 run on Node#1 through Node#N, with a process for A1 and a process for A2 running on each node. In this figure, the processes for A1 use channel 1 while the processes for A2 use channel 2.

The API of the GigaE PM is based on PM[1]. Message sending proceeds as follows: a process i) obtains a message buffer area by issuing the PM_getsendBuf function, whose arguments are a channel number and a buffer size, ii) constructs a message in the buffer, and iii) issues the PM_sendmsg function, whose arguments include the buffer address obtained by PM_getsendBuf. Message receiving proceeds as follows: a process i) obtains a pointer to a message buffer by issuing the PM_receive function, ii) processes the message, and iii) releases the message buffer by issuing the PM_putreceiveBuf function. A minimal usage sketch is given below. To allow more parallel processes to run than the number of channels, we have developed the SCore global operating system, which multiplexes the channels of the PM[1]. It should be noted that the GigaE PM is mainly used to implement upper-level communication libraries such as MPI.
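The following sketch illustrates how a process might use this API. The exact declarations are not given in the paper, so the argument and return types below (channel numbers as int, buffers as void pointers) are assumptions for illustration only; only the calling pattern is the point.

#include <string.h>

/* Assumed prototypes for the GigaE PM library functions described above. */
void *PM_getsendBuf(int channel, int size);            /* obtain a send buffer      */
int   PM_sendmsg(int channel, void *buf, int size);    /* send the composed buffer  */
void *PM_receive(int channel, int *size, int *sender); /* poll for an arrived msg   */
int   PM_putreceiveBuf(int channel, void *buf);        /* release the recv buffer   */

#define CH 1   /* the channel assigned to this parallel application (assumed) */

static void send_example(const char *data, int len)
{
    void *buf = PM_getsendBuf(CH, len);   /* i)   get a buffer on channel CH */
    memcpy(buf, data, len);               /* ii)  construct the message      */
    PM_sendmsg(CH, buf, len);             /* iii) hand it to the GigaE PM    */
}

static void receive_example(void)
{
    int size, sender;
    void *msg;

    /* i) PM_receive polls the arrival flag on the NIC without a kernel trap. */
    while ((msg = PM_receive(CH, &size, &sender)) == 0)
        ;                                 /* spin until a message arrives    */

    /* ii) process the message (omitted), then iii) release the buffer.      */
    PM_putreceiveBuf(CH, msg);
}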
4.2 Reliable Communication
To guarantee message delivery and FIFO-ness, the GO back N protocol with STOP-and-GO flow control has been adopted.
4.2.1 GO back N Protocol
In the GO back N protocol, the sender may send the i-th to (i+N)-th data messages to the receiver without waiting for acknowledgments of message reception from the receiver; that is, the (i-1)-th ACK message from the receiver has already been received by the sender. The receiver sends the i-th ACK message back when it receives the i-th data message. When the sender receives the i-th ACK message, the sender may send the (i+1)-th to (i+1+N)-th data messages to the receiver. If the sender does not receive the i-th ACK message within a certain time, the sender sends the i-th and the following data messages again. This happens when the i-th data message or ACK message is lost. The sender needs a buffer area to keep the i-th to (i+N)-th messages, but the receiver does not need a buffer area to keep N data messages. When the receiver receives a data message whose sequence number is larger than i, the received data is discarded and a LOOSE message is sent back to the sender. This differs from the TCP/IP sliding window protocol. When the receiver's buffer area becomes full, a STOP message is sent to stop the sender. The receiver sends a GO message when it again has enough buffer area. This scheme is well known as STOP-and-GO flow control. The detailed network protocol is described in section 4.4.
4.3 Information Exchange between the Host and the NIC
As described in section 2, information exchange between the host and the NIC is crucial in the design of a high bandwidth and low latency communication facility. The information is exchanged through a message descriptor. Each descriptor entry has a message address, a message size, and a sender or receiver identifier, depending on the type of message (a minimal sketch of such a descriptor layout is given after this list). The location of the descriptors and their access methods are designed as follows:

1. Descriptor Location. The message descriptors can be located either in the host or in the NIC. Since the descriptors are accessed by both the host and the NIC, they should be located where the total access cost is minimized. In our implementation using the Essential Communications PCI Gigabit Ethernet, the descriptors are located in the NIC according to Table 2.

2. Descriptor Access. If the user process writes a send message descriptor and informs the NIC that a new descriptor is ready, no kernel calls are required to send a message. To guarantee a safe communication facility, i) send message descriptors must be isolated from other user processes so that a user process accesses only its own area, and ii) the NIC must verify the send message descriptor because the user process might have written a bad message address. In the Essential Communications PCI Gigabit Ethernet NIC, the host can access the whole memory of the NIC by writing a control register of the NIC. If all send message descriptors were located in the NIC, accessing a send message descriptor would require writing the control register, and if the user had write access to that register, the user could write other registers as well. Thus, to guarantee a safe communication facility, the current GigaE PM provides a kernel function that accesses a send message descriptor in the NIC. The host can, however, access a restricted memory area without writing a control register of the Essential Communications NIC. To utilize this functionality, the receive message descriptors for the users are provided in this restricted memory area so that the user process may read a message descriptor without a kernel call. In other words, the NIC has two kinds of receive message descriptors: the descriptors for the user processes and the internal descriptors.

3. Trigger. The following methods of informing the NIC by the host, and the host by the NIC, must be considered. From the host to the NIC, two methods can be used: i) the host writes a flag in the memory of the NIC, or ii) the NIC polls a flag area in the host. According to Table 2, the host processor writes a flag in the memory of the NIC. From the NIC to the host, there are three methods: i) the NIC writes a flag in the host's memory, ii) the host polls a flag area in the NIC, or iii) the NIC issues an interrupt signal to the host. According to Table 2, the host polls a flag area in the NIC.
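The sketch below illustrates one possible layout of such a descriptor together with the two chosen trigger directions. The field names, sizes, and flag encoding are assumptions made for illustration; the paper does not specify the actual descriptor format.

#include <stdint.h>

/* Hypothetical GigaE PM message descriptor, resident in NIC memory. */
struct pm_msg_desc {
    volatile uint32_t flag;      /* 0 = free, 1 = ready/arrived (assumed encoding) */
    uint32_t addr;               /* message address in the registered buffer       */
    uint32_t size;               /* message size in bytes                          */
    uint32_t peer_id;            /* sender or receiver identifier                  */
};

/* Host -> NIC trigger: the GigaE PM driver writes the descriptor and then a
 * flag in NIC memory (cheaper than an interrupt or NIC-side polling of host
 * memory, per Table 2). */
static void post_send(volatile struct pm_msg_desc *d,
                      uint32_t addr, uint32_t size, uint32_t receiver)
{
    d->addr = addr;
    d->size = size;
    d->peer_id = receiver;
    d->flag = 1;                 /* c2)-style flag set, about 0.25 usec */
}

/* NIC -> host trigger: the user library polls the arrival flag of a receive
 * descriptor mapped into its address space, avoiding both an interrupt and a
 * kernel call. */
static int poll_receive(volatile struct pm_msg_desc *d)
{
    return d->flag != 0;         /* rc1)-style poll of the arrival flag */
}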
Let us describe the information exchange between the host and the NIC in the GigaE PM using Figure 3.

Sender:
1) The PM_sendmsg function, realized by the GigaE PM library, invokes the GigaE PM driver in sc1) so that the driver writes a descriptor on the NIC, as shown in sc2).
2) The NIC transfers the message in user space to a NIC memory area, as shown in sd1) of Figure 3.

Receiver:
1) The PM_receive function, realized by the GigaE PM library, polls the arrival flag in the NIC without any kernel trap, as shown in rc1) of Figure 3. If the arrival flag has been set, the receive message descriptor is read so that the message buffer address, the message size, and the sender identifier are obtained.
2) When the NIC receives a new incoming message, the message is transferred to the user message buffer, whose area has been registered by the initialization routine of the GigaE PM library, and the NIC updates the arrival flag.
3) The PM_putreceiveBuf function, realized by the GigaE PM library, invokes the GigaE PM driver in rc2) to inform the NIC that the message buffer has been released.

Figure 3: Descriptors, host, and NIC in GigaE PM. (For sending, the figure shows the GigaE PM library, the GigaE PM driver, and the send message descriptors in the NIC memory area with steps sc1), sc2), and sd1); for receiving, it shows the receive message descriptors for the user, the internal receive message descriptors, and the GigaE PM firmware with steps rc1), rc2), rc3) update of the receive descriptor, and rd1) transfer from the NIC to the host.)
4.4 GigaE PM Protocol
In this subsection, the GigaE PM network protocol is described in detail. There are a sending buffer and a receiving buffer for each channel. All outgoing messages are stored in the sending buffer and all incoming messages are stored in the receiving buffer. Unlike TCP/IP, buffers are allocated not per peer-to-peer connection but per channel. A message is represented by Msg(SenderID, ReceiverID, DataMessageSequenceNumber). Let N be the number of messages that the sender may send asynchronously without waiting for an ACK, and let T be the timeout. Let SBuf(r, i) be the sender message buffer for the i-th message to receiver r, and let STime(r, i) be the time when the i-th message is sent to receiver r. MsgIdSent(r) keeps the largest sequence number of a message which has been sent to receiver r. MsgIdRecv(s) keeps the largest sequence number of a message sent by sender s and received by the receiver. At initialization time, both MsgIdSent(r) and MsgIdRecv(s) are undefined. The GigaE PM protocol on the sender and receiver nodes is described as follows:
On Sender Node:

S1. Sender s may send receiver r all messages Msg(s, r, j) where MsgIdSent(r) < j < i + N. For each message Msg(s, r, i), the following procedure is performed:
  1. Msg(s, r, i) is sent to the receiver.
  2. The sender message buffer SBuf(r, i) of Msg(s, r, i) is created and kept.
  3. The current time is kept in STime(r, i).
  4. MsgIdSent(r) ← MsgIdSent(r) + 1.

S2.
  - When the difference between the current time and STime(r, i) is larger than T, perform S1.
  - If a LOOSE message LOOSE(r, k) where i ≤ k < i + N is received,
    1. release the sender message buffers SBuf(r, j) where j < k,
    2. i ← k,
    3. perform S1.
  - If a STOP message STOP(r, k) where i ≤ k < i + N is received,
    1. release the sender message buffers SBuf(r, j) where j < k,
    2. i ← k,
    3. stop sending.
  - If a GO message GO(r, k) where i ≤ k < i + N is received, perform S1.
  - If an ACK message ACK(r, k) where i ≤ k < i + N is received,
    1. i ← k,
    2. perform S1.

On Receiver Node:

  - When receiver r receives message Msg(s, r, i),
    - in the case of MsgIdRecv(s) + 1 = i and the receiver buffer on the host is not full:
      an ACK message ACK(r, i) is sent back to the sender s, the message is transferred to the host, and MsgIdRecv(s) ← MsgIdRecv(s) + 1;
    - in the case of MsgIdRecv(s) + 1 = i and the receiver buffer on the host is full:
      a STOP message STOP(r, i) is sent back to the sender s, and all following received messages are discarded;
    - in the case of MsgIdRecv(s) + 1 < i, which means that a message has been lost:
      a LOOSE message LOOSE(r, MsgIdRecv(s)) is sent back to the sender s.
  - When the receiver buffer on the host again has room and a STOP message has been sent, a GO message GO(r, MsgIdRecv(s) + 1) is sent back to sender s. The receiver r can again receive messages.
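To make the sender-side state machine concrete, here is a compact sketch in C of the GO back N bookkeeping described above. Everything in it is illustrative: the window size, the timeout value, and the helpers raw_send() and now_usec() are assumptions, and the real logic lives in the NIC firmware rather than in portable C.

#include <stdint.h>
#include <string.h>

#define WINDOW_N    8        /* N: messages in flight without an ACK (assumed)       */
#define TIMEOUT_T   1000     /* T: retransmission timeout in microseconds (assumed)  */
#define MAX_MSG     1468     /* largest payload used in the evaluation               */

/* Hypothetical services provided by the firmware environment. */
extern void     raw_send(uint32_t recv_id, uint32_t seq, const void *buf, uint32_t len);
extern uint64_t now_usec(void);

/* Per-receiver sender state, following the notation of section 4.4. */
struct sender_state {
    uint32_t msg_id_sent;             /* MsgIdSent(r): largest sequence number sent */
    uint32_t base;                    /* oldest unacknowledged sequence number      */
    int      stopped;                 /* set by STOP, cleared by GO                 */
    uint64_t stime[WINDOW_N];         /* STime(r, i) for in-flight messages         */
    uint32_t slen[WINDOW_N];          /* sizes of the kept copies                   */
    uint8_t  sbuf[WINDOW_N][MAX_MSG]; /* SBuf(r, i): copies kept for retransmission */
};

static void sender_init(struct sender_state *s)
{
    memset(s, 0, sizeof *s);
    s->base = 1;                      /* sequence numbers assumed to start at 1     */
}

/* S1: send one data message and keep a copy until it is acknowledged.
 * Returns -1 when the window is full or flow control is in effect. */
static int s1_send(struct sender_state *s, uint32_t recv_id,
                   const void *data, uint32_t len)
{
    if (s->stopped || s->msg_id_sent + 1 >= s->base + WINDOW_N)
        return -1;

    uint32_t i = s->msg_id_sent + 1, slot = i % WINDOW_N;
    raw_send(recv_id, i, data, len);        /* 1. send Msg(s, r, i)   */
    memcpy(s->sbuf[slot], data, len);       /* 2. keep SBuf(r, i)     */
    s->slen[slot] = len;
    s->stime[slot] = now_usec();            /* 3. record STime(r, i)  */
    s->msg_id_sent = i;                     /* 4. MsgIdSent(r) + 1    */
    return 0;
}

/* Retransmit every unacknowledged message starting at sequence `from`.
 * Used both on timeout and when a LOOSE message reports a gap. */
static void go_back(struct sender_state *s, uint32_t recv_id, uint32_t from)
{
    for (uint32_t i = from; i <= s->msg_id_sent; i++) {
        uint32_t slot = i % WINDOW_N;
        raw_send(recv_id, i, s->sbuf[slot], s->slen[slot]);
        s->stime[slot] = now_usec();
    }
}

/* S2: control message and timeout handling. */
static void s2_on_ack(struct sender_state *s, uint32_t k)  { s->base = k + 1; }
static void s2_on_go(struct sender_state *s)               { s->stopped = 0; }
static void s2_on_stop(struct sender_state *s, uint32_t k) { s->base = k + 1; s->stopped = 1; }

static void s2_on_loose(struct sender_state *s, uint32_t recv_id, uint32_t k)
{
    s->base = k + 1;                  /* messages up to k were delivered   */
    go_back(s, recv_id, k + 1);       /* resend everything after the gap   */
}

static void s2_on_timeout(struct sender_state *s, uint32_t recv_id)
{
    if (s->base <= s->msg_id_sent &&
        now_usec() - s->stime[s->base % WINDOW_N] > TIMEOUT_T)
        go_back(s, recv_id, s->base);
}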
4.5 GigaE PM and Other Protocols
The GigaE PM network protocol is handled in the NIC, while other network protocols such as IP are handled in the host. When a network packet whose type is not GigaE PM arrives, the packet is transferred to the host and a protocol handler in the host is triggered. A sketch of this demultiplexing step is shown below.
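The following is a minimal sketch of that Layer 2 demultiplexing in the NIC firmware. The EtherType value and the two handler functions are hypothetical; the paper does not state which frame type identifies GigaE PM packets.

#include <stdint.h>

#define ETH_TYPE_GIGAE_PM 0x88B5u    /* hypothetical frame type for GigaE PM */

struct eth_frame {
    uint8_t dst[6], src[6];
    uint8_t type[2];                 /* EtherType, big-endian on the wire    */
    uint8_t payload[];
};

extern void gigae_pm_handle(struct eth_frame *f);  /* protocol stays on the NIC       */
extern void forward_to_host(struct eth_frame *f);  /* DMA to the host, notify kernel  */

static uint16_t be16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }

/* Called by the Layer 2 handler for every received frame. */
static void on_frame_arrival(struct eth_frame *f)
{
    if (be16(f->type) == ETH_TYPE_GIGAE_PM)
        gigae_pm_handle(f);          /* GigaE PM: handled by the firmware    */
    else
        forward_to_host(f);          /* TCP/IP etc.: handled by the host     */
}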
5 Evaluation

The basic performance, i.e., latency and bandwidth, is measured in this section. Table 3 shows our evaluation environment. The evaluation covers two cases: i) the two machines are connected via the Extreme switch, and ii) the two machines are directly connected without the switch.
Table 3: Machine Environment

Hardware   Pentium 150 MHz, 430FX chipset, 64 MB fast page memory
NIC        Essential's PCI Gigabit Ethernet NIC, model EC-440-SF (33 MHz clock, DMA for PCI, SEEQ 8100 MAC)
Switch     Extreme's Summit 2
Host OS    Redhat 5.0 Linux (2.0.32 kernel)
5.1 Bandwidth
The bandwidth is shown in Figure 4. The figure contains the performance of the GigaE PM, TCP/IP on the GigaE PM, and the TCP/IP provided by Essential. The GigaE PM achieves a 58.3 MBytes/sec bandwidth for a 1,468-byte message. In contrast to the GigaE PM, TCP/IP on both the GigaE PM and the Essential firmware achieves only about 11 MBytes/sec. The bandwidth of TCP/IP on the GigaE PM is comparable to that of TCP/IP using the Essential's firmware.
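For context, and assuming the 125 MBytes/sec physical-layer rate quoted in the introduction, the peak measured bandwidth corresponds to roughly

\[ 58.3 / 125 \approx 0.47, \]

i.e., about 47% of the raw link rate at a 1,468-byte payload.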
Figure 4: Bandwidth. (Application-level bandwidth in MB/s versus payload length in bytes, up to 1,600 bytes, for the GigaE PM with the switch, the GigaE PM without the switch, TCP/IP on the GigaE PM, and TCP/IP on the ESS firmware.)

5.2 Latency
Figure 5 shows the round trip time. The GigaE PM achieves a 47.7 microsecond round trip latency for a four-byte user message. When a Summit 2 switch is inserted between the two hosts, the latency increases to 57.7 microseconds. This means that the one-way message transfer latency of the switch is 5 microseconds. The round trip latency of TCP/IP on the Essential's firmware is 428.0 microseconds, while the latency of TCP/IP on the GigaE PM is 291.8 microseconds. Thus, the GigaE PM realizes better TCP/IP performance even while coexisting with the high performance network protocol.
Figure 5: Round Trip Time. (Application-level round trip time in microseconds versus payload length in bytes, up to 1,600 bytes, for the GigaE PM with the switch, the GigaE PM without the switch, TCP/IP on the GigaE PM, and TCP/IP on the ESS firmware.)
6 Related Works

Many high performance communication facilities have been developed. AM[5], FM[6], and PM[1] are based on Myrinet[7]. Myrinet, a gigabit class network, supports reliable message transfer at the hardware level, so such a facility does not need to worry about messages being lost. To decrease the kernel trapping overhead, user-level communication is realized by PM and U-Net[8]. PM achieves a 15 microsecond round trip time on a Myricom Myrinet network, where the kernel trapping overhead is therefore crucial. In our experience, however, the kernel trapping overhead is 3.8 microseconds of the 47.7 microsecond round trip time, or only 7.9% of the round trip time, and is not considered crucial.

VIA[9], the Virtual Interface Architecture, is being widely implemented for gigabit class networks on Microsoft's Windows operating system. VIA is designed so that other communication facilities, such as sockets and MPI, are implemented on top of VIA. VIA supports connection oriented communication; reliable communication support is an option according to the VIA specification Version 1.0. As described in section 2, if a reliable connection oriented communication facility is realized on top of an unreliable network such as the Gigabit Ethernet, large communication buffers are required, and such a facility cannot be implemented in a NIC. Thus, we think that it is difficult for VIA to support high performance communication for parallel applications, especially data parallel applications.
Since the GigaE PM protocol is simple and does not require large message buffers, unlike other connection oriented communication facilities, the handler is implemented in a NIC and provides reliable communication on top of the Gigabit Ethernet.
7 Conclusions

In this paper, it has been pointed out that information exchange between the NIC and the host is crucial in the design of high performance communication for parallel applications on clusters of computers using the Gigabit Ethernet. Therefore, a high performance communication facility called the GigaE PM has been designed. A prototype system has been implemented using an Essential Communications Gigabit Ethernet card. The performance results show that a 47.7 microsecond round trip time for a four-byte user message and a 58.3 MBytes/sec bandwidth for a 1,468-byte message are achieved. The GigaE PM also supports the TCP/IP protocol without decreasing the performance of the original Gigabit Ethernet card.

This paper contributes to the design of a high performance communication facility using the Gigabit Ethernet. The GigaE PM is not dedicated to the Essential Communications NIC; rather, it is a general facility which can be implemented on other modern NICs, because a modern NIC has the on-board processor that the GigaE PM implementation assumes. The GigaE PM facility provides not only a reliable, high bandwidth, low latency communication function but also supports existing network protocols such as TCP/IP. Using the facility, a high performance cluster system can be constructed in a distributed environment where parallel applications coexist with distributed applications.

Future work includes the implementation of an MPI library on top of the GigaE PM and the investigation of GigaE PM scalability using parallel applications.
References

[1] Hiroshi Tezuka, Atsushi Hori, Yutaka Ishikawa, and Mitsuhisa Sato. PM: An Operating System Coordinated High Performance Communication Library. In Bob Hertzberger and Peter Sloot, editors, High-Performance Computing and Networking, Vol. 1225 of Lecture Notes in Computer Science, pp. 708-717. Springer-Verlag, April 1997.

[2] Hiroshi Tezuka, Francis O'Carroll, Atsushi Hori, and Yutaka Ishikawa. Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication. In IPPS/SPDP'98, pp. 308-314. IEEE, April 1998.

[3] http://www.rwcp.or.jp/lab/pdslab/benchmarks/

[4] http://www.packetengines.com/products/performance/gnicntperf.htm

[5] http://now.cs.berkeley.edu/AM/lam_release.html

[6] Scott Pakin, Mario Lauria, and Andrew Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing '95, San Diego, California, 1995.

[7] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and Wen-King Su. Myrinet: A Gigabit-per-Second Local-Area Network. IEEE Micro, Vol. 15, No. 1, pp. 29-36, February 1995.

[8] Anindya Basu, Vineet Buch, Werner Vogels, and Thorsten von Eicken. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Third International Symposium on High Performance Computer Architecture (HPCA), February 1997.

[9] http://www.viarch.org/