PM: An Operating System Coordinated High Performance Communication Library

Hiroshi Tezuka, Atsushi Hori, Yutaka Ishikawa, and Mitsuhisa Sato
Real World Computing Partnership
16F Mitsui bldg., 1-6-1 Takezono, Tsukuba, Ibaraki, 305, JAPAN
ftezuka,hori,ishikawa,[email protected]
TEL: +81-298-53-1693  FAX: +81-298-53-1652

Abstract. We have developed a new communication library, called PM, for the Myrinet gigabit LAN card, which has a dedicated processor and on-board memory to handle a communication protocol. In order to obtain high performance communication and support a multi-user environment, we have co-designed PM, an operating system realized by a daemon process, and the run-time routine for a programming language. Several unique features, e.g., network context switching and a modified ACK/NACK flow control algorithm, have been developed for PM. The PM library has been implemented on two types of clusters: Sun Sparc station model 20/71 workstations and Intel Pentium based PCs. PM for the Suns achieves a 20 microsecond round trip for a user-level 8 byte message and 38.6 Mbytes/second bandwidth for an 8 Kbyte message. The result of a NAS parallel benchmark shows that a Sparc 20 workstation cluster achieves almost the same performance as the Cray T3D.

1 Introduction

It has become possible to cluster high-performance workstations and PCs to create cost-effective parallel machines using high-speed interconnect technologies such as ATM, Fibre Channel, and Myrinet[1]. To achieve low latency and high bandwidth communication, user memory mapped network drivers are widely employed, such as Active Messages (AM)[2], Fast Messages (FM)[3], and U-Net[4]. In these drivers, the user process directly accesses the network hardware to eliminate kernel traps and data copies. However, the communication facility can only be used by a single process at a time because that process exclusively uses the network hardware resource.

We have co-designed a run-time system for a programming language called MPC++[5, 6], an operating system called SCore, and a communication library called PM to support multi-user parallel processing environments. PM provides the reliable FIFO order message passing facility which the MPC++ run-time routine assumes. SCore is implemented as a daemon process called SCore-D on top of a Unix operating system and manages the distribution of user processes to several workstations. We call these simultaneous user processes a parallel process. SCore-D, cooperating with the MPC++ run-time system, employs gang scheduling to change the context of the parallel process, including its network state. This allows multiple users to access the workstation cluster in a time and space sharing (TSSS) fashion[7, 8]. This technique overcomes the drawback of parallel software environments that use a memory mapped network driver, such as AM and FM. To accomplish gang scheduling, SCore-D requires PM to support multiple communication paths for the user and SCore-D daemon processes.

In this paper, we propose several techniques for implementing PM to support high performance communication in a multi-user environment using the Myrinet card, which has an on-board processor and memory. We introduce a communication channel, which represents the network status and includes a message buffer in the on-board memory, and its context switching capability to support multiple communication paths. A new flow control algorithm called Modified ACK/NACK, which assumes hardware-level reliable FIFO order message delivery unless the communication buffer in the on-board memory overflows, is proposed. The algorithm is scalable and preserves the message order even if a communication buffer overflows. The Immediate Sending technique has been developed to increase the throughput and lower the latency of large messages.

2 Design and Implementation

2.1 Gigabit LAN Card

Myrinet is a gigabit LAN commercially produced by Myricom Inc. The Myrinet LAN interface card has a 32-bit microprocessor, an SRAM, a network interface, and an SBus or PCI DMA controller. The on-board processor executes programs stored in the SRAM to control the network interface and the DMA controller. Thus, we can implement new protocols on Myrinet by changing the program in the card. Because all messages transmitted to and received from the link must be stored in the SRAM, a separate data transfer between the SRAM and main memory is needed. The on-board SRAM is located in the SBus or PCI address space. By mapping this SRAM address area into the user address space, users can directly access data in the SRAM. This results in low latency message transfer using the polling technique. Myrinet is distinguished from other LAN hardware such as Ethernet and ATM by 1) its guarantee of message delivery using hardware flow control and 2) its preservation of the order of messages transferred via the same path.
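As a concrete illustration of this memory-mapped access, the following is a minimal sketch of how a host process might map the on-board SRAM into its own address space. The device path, SRAM size, and offset are assumptions made for illustration; the actual values depend on the host's Myrinet driver and board revision.

```c
/* Sketch: map the Myrinet on-board SRAM into user space for direct access.
 * The device path, SRAM size, and offset are hypothetical; the real values
 * depend on the host's Myrinet driver and the board revision. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define LANAI_SRAM_SIZE   (256 * 1024)   /* assumed size of the on-board SRAM     */
#define LANAI_SRAM_OFFSET 0              /* assumed offset within the device file */

int main(void)
{
    int fd = open("/dev/myrinet0", O_RDWR);        /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* After mmap(), loads and stores through `sram` go directly to the card,
     * so the host can inspect message buffers without any system call. */
    volatile uint8_t *sram = mmap(NULL, LANAI_SRAM_SIZE,
                                  PROT_READ | PROT_WRITE, MAP_SHARED,
                                  fd, LANAI_SRAM_OFFSET);
    if (sram == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* ... set up channel descriptors and poll for messages here ... */

    munmap((void *)sram, LANAI_SRAM_SIZE);
    close(fd);
    return 0;
}
```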

2.2 Requirements

Requirements of the run-time system for the MPC++ programming language and the SCore-D operating system are described in this subsection.

Existing communication libraries support not only simple message delivery mechanisms but also high-level communication mechanisms for programming parallel applications. For example, MPI supports several features including the tagged message delivery mechanism. AM supports the handler call mechanism with the restriction that the execution of a handler program may not block. Although high-level communication mechanisms can be supported in a communication library at some execution penalty, it is hard to write a parallel program without programming language support. Using the C++ template feature, however, a type-safe and language-integrated communication library can be realized. The MPC++ version 2 template library[6], called the Multiple Threads Template Library (MTTL), supports local/remote thread invocation, synchronization structures, and global pointers. A global pointer allows access to a remote object and its distribution to multiple processors. Such a facility is useful for implementing data parallel applications in which each processor's local data is often distributed to other processors. The MPC++ programming model is based on multiple threads on a distributed memory parallel machine in which thread execution continues unless it waits for synchronization or exits. In other words, an interrupt-driven communication mechanism is not required in the MPC++ run-time system, which deals with communication and user-level threads. The MPC++ run-time system does not assume any high-level communication mechanism, but it does assume a reliable FIFO order message delivery mechanism. Since the MPC++ programming model is based on a distributed memory parallel machine, low latency and high throughput communication is required to communicate with remote threads.

The operating system on our workstation cluster has been designed as a daemon process called SCore-D which communicates with other SCore-D daemon processes via Myrinet to implement gang scheduling. That is, the user and the daemon processes access the Myrinet hardware resources simultaneously. To allow multiple processes (SCore-D and user processes) to use the communication facility simultaneously, such a connection is realized by a channel. A message sent from a channel of the sender node is transferred to the same channel on the receiver node. Each channel is occupied by one process. A channel must have its own message buffer so that congestion in one channel does not affect other channels. Because it is not possible to create as many channels as needed with the limited resources of a LAN card's on-board memory, a channel must be shared by multiple processes in a time-sharing fashion. To realize this, a mechanism for channel context switching is required. A channel context consists of all memory associated with the channel, including the message buffer and messages in flight on the network.
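To make the channel abstraction above more concrete, the following sketch shows what a channel context might contain. Field names and buffer sizes are illustrative assumptions and do not reflect PM's actual data layout.

```c
/* Sketch of a channel context. Names and sizes are illustrative only. */
#include <stdint.h>

#define MAX_NODES     64            /* assumed number of nodes in the cluster  */
#define SEND_BUF_SIZE (12 * 1024)   /* assumed send buffer size per channel    */
#define RECV_BUF_SIZE (32 * 1024)   /* assumed receive buffer size per channel */

typedef struct channel_context {
    /* Message buffers kept in the on-board SRAM; saved and restored on a
     * channel context switch. */
    uint8_t  send_buf[SEND_BUF_SIZE];
    uint8_t  recv_buf[RECV_BUF_SIZE];

    /* Per-destination sequence numbers: Indx_s(r) in the flow control of
     * subsection 2.5. */
    uint32_t next_seq[MAX_NODES];

    /* Flow control state per peer (Normal / Not-to-Send on the sender side,
     * Normal / Not-to-Receive on the receiver side). */
    uint8_t  send_state[MAX_NODES];
    uint8_t  recv_state[MAX_NODES];
    uint32_t want[MAX_NODES];       /* Want_r(s): sequence awaited after a NACK */
} channel_context_t;
```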

2.3 Decreasing latency

Protocol implementations which use system calls and interrupts, as in ordinary LANs such as Ethernet, cannot satisfy the low latency requirement. Although AM on the Sun workstation with ATM (called SSAM)[9] uses a special trap instruction to solve this problem, modifications to the operating system kernel are required.

We adopted a polling method to decrease the communication latency. In PM, the LAN card's on-board processor transfers messages and the receiver thread of the user process polls for arriving messages. Polling immediately notifies the receiver thread of message arrivals and makes it possible to communicate with low latency. If the host program does not receive incoming messages, the message buffer in the on-board memory may overflow. We therefore designed and implemented a new flow control based on the ACK/NACK protocol, as described in subsection 2.5.
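A minimal sketch of such a polling receive is shown below. The receive-slot layout and the ready flag written by the on-board processor are assumptions for illustration, not PM's actual interface.

```c
/* Sketch: user-level polling for message arrival. The slot layout and the
 * `ready` flag set by the on-board processor are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

typedef struct recv_slot {
    volatile uint32_t ready;        /* set by the LANai when a message is complete */
    uint32_t          length;       /* message length in bytes                     */
    uint8_t           payload[2048];
} recv_slot_t;

/* Spin until a message arrives in the mapped SRAM, copy it out, and hand the
 * slot back to the card. No system call or interrupt is involved. */
static uint32_t pm_poll_receive(recv_slot_t *slot, void *dst)
{
    while (!slot->ready)
        ;                           /* busy-wait on the memory-mapped flag */
    uint32_t len = slot->length;
    memcpy(dst, (const void *)slot->payload, len);
    slot->ready = 0;                /* release the slot */
    return len;
}
```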

2.4 Increasing throughput

Because the Myrinet network interface can only access data in the on-board SRAM, transferring data between the main memories of two nodes involves four steps: 1) transferring the data by DMA from main memory to the on-board SRAM, 2) sending the data from the on-board SRAM to the network, 3) receiving the data from the network into the on-board SRAM, and 4) transferring the data from the on-board SRAM to main memory. Sequential execution of the DMA transfer and the message transmission cannot utilize the full bandwidth of the DMA controller and the network interface. Although double buffering allows the DMA transfer of the next message to be executed while the current message is being sent, which increases the throughput, it cannot lower the latency of each message transfer. We developed a new technique called Immediate Sending that starts sending data from the SRAM to the network immediately after the DMA transfer from main memory to the SRAM begins, as in pipeline processing. Immediate Sending can both increase the throughput and lower the latency. On a receiver node, it is not possible to receive a message and issue a DMA transfer simultaneously, because a CRC error can only be detected after the whole message has been received. PM therefore uses the double buffering technique on the receiver node to increase throughput.
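The following sketch illustrates the Immediate Sending idea on the sender side. dma_start, dma_bytes_done, link_send, and the chunk size are hypothetical placeholders for the LANai's DMA controller and network-interface operations, not actual PM routines.

```c
/* Sketch of Immediate Sending: the link-level send chases the host-to-SRAM
 * DMA transfer instead of waiting for it to finish. All helper functions and
 * the chunk size are hypothetical placeholders. */
#include <stddef.h>
#include <stdint.h>

extern void   dma_start(void *sram_dst, const void *host_src, size_t len);
extern size_t dma_bytes_done(void);   /* bytes copied into the SRAM so far */
extern void   link_send(const void *sram_src, size_t len);

#define CHUNK 512                     /* assumed pipelining granularity */

static void immediate_send(void *sram_buf, const void *host_buf, size_t len)
{
    dma_start(sram_buf, host_buf, len);        /* main memory -> on-board SRAM */

    size_t sent = 0;
    while (sent < len) {
        size_t avail = dma_bytes_done();       /* data already landed in SRAM */
        /* Forward whatever has arrived, in CHUNK-sized pieces, while the DMA
         * transfer of the rest of the message is still in progress. */
        while (avail > sent) {
            size_t n = avail - sent;
            if (n > CHUNK) n = CHUNK;
            link_send((const uint8_t *)sram_buf + sent, n);   /* SRAM -> network */
            sent += n;
        }
    }
}
```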

2.5 Flow control

The flow control method for a workstation cluster should be scalable because it must work efficiently even with a large number of nodes. A window-based flow control algorithm is not scalable, because the receiver node must manage receive buffers dedicated to each sender node, and a larger number of nodes makes the effective buffer size smaller. Although the ordinary "ACK/NACK and Re-transmit" method and the "Return to Sender"[3] method, which sends back messages that cannot be received, do not have such a buffer-partitioning problem, they do not preserve the message order either.

A scalable flow control algorithm which preserves the order of messages has been designed. This algorithm is called Modified ACK/NACK in the sense that it uses the ACK/NACK and Re-transmit method, and sender/receiver states are introduced to preserve message order. In this algorithm, message order is preserved because succeeding messages are not received until a message which could not be received initially is eventually received normally. This algorithm requires the sender to keep messages in a sender message buffer. A message buffer is freed only when the sender receives the ACK for that message. This buffer does not occupy a large space. The Myrinet hardware, which guarantees the delivery of messages, allows us to use this simple flow control algorithm.

The Modified ACK/NACK algorithm is described in detail as follows. Msg(s, r, n) represents the n-th message sent from the sender node s to the receiver node r. Ack(s, r, n) and Nack(s, r, n) represent the positive and negative acknowledgments for Msg(s, r, n), respectively. Buf_s(r, n) represents the sender buffer which keeps Msg(s, r, n) in the sender node s. Indx_s(r) keeps the number of messages sent to r, so Msg(s, r, Indx_s(r)) is the latest message sent to r. Sender nodes have two states, Normal and Not-to-Send, while receiver nodes have two states, Normal and Not-to-Receive. A sketch of the sender-side handling appears after this list.

Sender node:
  Normal state:
  - s transmits the message Msg(s, r, i) stored in Buf_s(r, i) to r. Buf_s(r, i) is not freed after the transmission.
  - When s receives Ack(s, r, i), it frees Buf_s(r, j) where j ≤ i.
  - When s receives Nack(s, r, i), it frees buffers Buf_s(r, j) where j < i and re-transmits messages Msg(s, r, l) where i ≤ l ≤ Indx_s(r). Then s enters the Not-to-Send state, which inhibits transmission of messages except for those being re-transmitted.
  Not-to-Send state:
  - When s receives Ack(s, r, k), it frees Buf_s(r, k) and enters the Normal state to resume transmission.
  - When s receives Nack(s, r, k), it re-transmits messages Msg(s, r, l) where k ≤ l ≤ Indx_s(r) again.

Receiver node:
  Normal state:
  - When r receives Msg(s, r, i), it returns Ack(s, r, i) to the sender s.
  - If r receives adjacent messages Msg(s, r, i) ... Msg(s, r, j), it returns only Ack(s, r, j) for the last message.
  - If r cannot receive Msg(s, r, i) because of a receive buffer shortage, it discards the message and returns Nack(s, r, i) to s. Let Want_r(s) be the value i. r enters the Not-to-Receive state to reject all Msg(s, r, j) where j ≠ Want_r(s).
  Not-to-Receive state:
  - r does not receive Msg(s, r, j) where j ≠ Want_r(s) and does not reply to the sender.
  - If r receives the re-transmitted message Msg(s, r, Want_r(s)), it returns Ack(s, r, Want_r(s)) to s and enters the Normal state to resume receiving messages.
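The following is a compact sketch of the sender side of this protocol for a single receiver r. The helper functions and buffer bookkeeping are illustrative placeholders; only the state transitions follow the description above.

```c
/* Sketch of the sender-side Modified ACK/NACK handling for one receiver r.
 * Helper functions and buffer bookkeeping are illustrative placeholders. */
#include <stdint.h>

enum sender_state { NORMAL, NOT_TO_SEND };

extern void     retransmit_range(int r, uint32_t from, uint32_t to); /* resend Msg(s,r,from..to)        */
extern void     free_buffers_upto(int r, uint32_t upto);             /* free Buf_s(r,j) for all j<=upto */
extern uint32_t indx(int r);                                         /* Indx_s(r): latest sequence sent */

static enum sender_state state = NORMAL;

static void on_ack(int r, uint32_t i)
{
    free_buffers_upto(r, i);          /* Ack(i) acknowledges all messages up to i  */
    state = NORMAL;                   /* an ACK always resumes normal transmission */
}

static void on_nack(int r, uint32_t i)
{
    if (i > 0)
        free_buffers_upto(r, i - 1);  /* messages before i were received normally  */
    retransmit_range(r, i, indx(r));  /* resend Msg(s,r,i) .. Msg(s,r,Indx_s(r))   */
    state = NOT_TO_SEND;              /* hold new messages until an ACK comes back */
}

/* New messages may be injected only while in the Normal state. */
static int may_send_new(void) { return state == NORMAL; }
```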

There are two advantages of the Modified ACK/NACK flow control. One is that the sender node knows whether or not a message has arrived at the receiver node. We use this feature to implement channel context switching, as described in the next section. The other advantage is that all messages are always sent to the destination without being blocked, whether or not the receiver message buffer overflows. If a user process could send a huge amount of messages over a channel and block the network, other channels would never have a chance to send a message, and the SCore-D operating system using its own channel could not proceed in such a case. The Modified ACK/NACK flow control precludes such a case from occurring. The drawback of this flow control algorithm is that it may increase the network load when re-transmission occurs, because the sender continues to transmit messages until a NACK arrives at the sender node.

2.6 Context Switch

The context of a PM channel consists of the data structures and buffers in the host main memory and in the on-board SRAM corresponding to the channel. To switch the channel context, these memory areas are saved to the host memory and the next channel context is restored. The context switching of channels must not cause duplicate messages, missing messages, or the mixing of messages between contexts. To guarantee this, the context must be switched only when no outgoing messages or incoming acknowledgments are present. Using the proposed flow control, receiving all ACK/NACKs for messages already sent confirms that no messages remain on the network.
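A hedged sketch of this switching sequence is shown below. The helpers for counting outstanding acknowledgments and locating a channel's SRAM area are hypothetical; the essential point is that the copy happens only after the channel has been quiesced.

```c
/* Sketch of channel context switching. Helper functions are hypothetical;
 * the key step is quiescing the channel before copying its state. */
#include <stddef.h>
#include <string.h>

extern int    outstanding_acks(int channel);   /* messages sent but not yet ACK/NACKed */
extern void   poll_network(int channel);       /* process incoming ACK/NACK messages   */
extern void  *channel_sram(int channel);       /* mapped on-board SRAM area of channel */
extern size_t channel_sram_size(int channel);

static void switch_channel(int channel, void *save_to, const void *restore_from)
{
    /* 1. Quiesce: stop injecting new messages and wait until an ACK or NACK
     *    has come back for every message already sent, so nothing is in flight. */
    while (outstanding_acks(channel) > 0)
        poll_network(channel);

    /* 2. Save the on-board part of the old context into host memory. */
    memcpy(save_to, channel_sram(channel), channel_sram_size(channel));

    /* 3. Restore the next channel context into the same SRAM area. */
    memcpy(channel_sram(channel), restore_from, channel_sram_size(channel));
}
```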

3 Evaluation

3.1 Basic Performance

[Fig. 1. Round trip time: round trip time (µsec) versus message size (bytes), comparing Immediate Sending with Sequential transfer.]

We have measured the basic performance of PM using two Sun Sparc station model 20/71 workstations connected by a Myrinet switch, with the Myrinet LANai 4.0 hardware. Figure 1 shows the round trip time of PM. In this evaluation, we varied the message size from 8 to 8192 bytes and measured 1) with Immediate Sending and 2) without Immediate Sending. The PM round trip time for an 8 byte message is about 20 microseconds. Figure 1 demonstrates that the Immediate Sending technique decreases the latency for large messages.
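For reference, a round-trip measurement of this kind can be sketched as a simple ping-pong loop; pm_send and pm_receive below are stand-ins for the library's actual send and polling-receive primitives, not PM's real API.

```c
/* Sketch of a ping-pong round-trip measurement. pm_send()/pm_receive() are
 * stand-ins for the library's real send and polling-receive primitives. */
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

extern void   pm_send(int dest, const void *buf, size_t len);
extern size_t pm_receive(void *buf, size_t maxlen);   /* polls until a message arrives */

#define ITERATIONS 10000

static double pingpong_usec(int peer, size_t msg_size)
{
    static char buf[8192];

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        pm_send(peer, buf, msg_size);          /* ping */
        pm_receive(buf, sizeof(buf));          /* wait for the echoed pong */
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (double)(t1.tv_usec - t0.tv_usec);
    return usec / ITERATIONS;                  /* average round-trip time */
}
```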

[Fig. 2. Throughput: bandwidth (MB/s) versus message size (bytes), comparing Immediate Sending with receive double buffering, Immediate Sending only, and Sequential transfer.]

Figure 2 shows the throughput of PM. In this evaluation, we varied the message size from 8 to 8192 bytes and measured 1) with Immediate Sending and receive double buffering, 2) with Immediate Sending only, and 3) without Immediate Sending and receive double buffering. The figure demonstrates that the Immediate Sending technique increases the throughput of large messages, and receive double buffering increases the throughput for all message sizes. The maximum throughput of PM is about 36.8 Mbytes per second with 8192 byte messages.

Table 1 shows the time to save and restore a channel context. We measured four cases: 1) no message in the buffer, 2) the send buffer is full: 511 messages (12 Kbytes) are in the send buffer, 3) the receive buffer is full: 2730 messages (32 Kbytes) are in the receive buffer, and 4) both send and receive buffers are full. It takes longer to save and restore the channel context when messages are in the buffers. This is because PM has a relatively large message buffer, and the overhead of accessing the on-board SRAM via the SBus is large. Although the amount of data transferred when saving and restoring the context is the same, saving the context takes longer, because reading data from the SBus space takes longer than writing data to it.

Table 1. Context switch time (milliseconds)

  Condition                       Save   Restore
  No message                      0.13   0.11
  Send buffer full                1.88   1.40
  Receive buffer full             3.39   1.95
  Send and receive buffers full   5.15   3.22

[Fig. 3. Network topology. A circle and a square represent a node and a switch, respectively.]

3.2 Benchmark Program

We have evaluated the CG (Conjugate Gradient) benchmark of the NAS Parallel Benchmarks using our workstation cluster, whose network topology is shown in Figure 3. The figure below compares the benchmark result of our parallel software environment with that of the Cray T3D. The Cray T3D result was reported on the NAS Parallel Benchmark Results web page. In this benchmark, our parallel software environment achieves almost the same performance as the Cray T3D.

4 Related Work

There are several different implementations of the message layer on Myrinet, such as Myrinet API[10], FM[3], and AM[2]. None of these has a context switching mechanism like PM's, so they are difficult to use with an arbitrary number of processes as in our multi-user environment. Myrinet API supports multiple channels and performs scatter receiving and gather sending of messages as well as dynamic routing information generation. Myrinet API is optimized for large message throughput, and its maximum bandwidth is about 26 Mbytes per second for 8192 byte messages. However, the minimum round trip time of Myrinet API is large: about 100 microseconds. Moreover, Myrinet API has no flow control mechanism, and message delivery is made reliable by an upper layer library called Mt.

[Figure: CG benchmark results. Execution time (sec) versus number of PUs for the Cray T3D and the MPC++/Workstation Cluster.]

We cannot compare PM with FM or AM in terms of latency and bandwidth because the results we obtained for them are based on the Myrinet LANai 2.3 hardware, a predecessor of the LANai 4.0 used here. FM uses the Return To Sender method for flow control in version 1.0 and a window-based technique in another version. The former does not guarantee the order of messages, and the latter has a scalability problem, as far as we understand.

5 Conclusion

We have designed a new communication library called PM on the Myrinet network to cooperate with the SCore-D operating system daemon and the MPC++ programming language run-time system. PM supports reliable FIFO order message delivery with low latency and high bandwidth for a multi-user processing environment. To accomplish this, we have proposed a communication channel, its context switching mechanism, and the Modified ACK/NACK flow control algorithm. To increase the throughput of large messages, the Immediate Sending technique has been proposed. Basic performance results showed that PM on the Sun workstation cluster achieves a 20 microsecond round trip for a user-level 8 byte message and 38.6 MBytes/second bandwidth for an 8 Kbyte message using the Myrinet LANai 4.0 hardware. The result of a NAS parallel benchmark, CG, showed that our parallel software environment achieves almost the same performance as the Cray T3D. We have demonstrated that a parallel system based on commodity components, i.e., an operating system, workstations, PCs, and a gigabit LAN card, can achieve almost the same performance as a massively parallel machine.

Though the techniques presented in this paper were implemented only on the Myrinet network, they are applicable to other LAN cards having an on-board processor and memory to handle communication. If a LAN card's on-board memory can be accessed by the host machine, the channel context switching mechanism is also applicable. The Modified ACK/NACK flow control algorithm requires hardware that guarantees 1) message delivery unless the communication buffer overflows and 2) FIFO message delivery. If the LAN card has DMA facilities between the host and the on-board memory and between the on-board memory and the network, the Immediate Sending technique is applicable. We have also developed an Intel Pentium based PC cluster consisting of 32 processors. PM on this cluster achieves a 14.4 microsecond round trip for a user-level 8 byte message and 117.6 MBytes/second bandwidth for an 8 Kbyte message.

References

1. N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and Wen-King Su. "Myrinet: A Gigabit-per-Second Local-Area Network". IEEE MICRO, Vol. 15, No. 1, pp. 29-36, February 1995.
2. http://now.cs.berkeley.edu/AM/lam release.html
3. Scott Pakin, Mario Lauria, and Andrew Chien. "High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet". In Proceedings of Supercomputing '95, San Diego, California, 1995.
4. Thorsten von Eicken, Anindya Basu, and Werner Vogels. "U-Net: A User Level Network Interface for Parallel and Distributed Computing". In Fifteenth ACM Symposium on Operating Systems Principles, pp. 40-53, 1995.
5. Y. Ishikawa, A. Hori, H. Tezuka, M. Matsuda, H. Konaka, M. Maeda, T. Tomokiyo, and J. Nolte. "MPC++". In Gregory V. Wilson and Paul Lu, editors, Parallel Programming Using C++, pp. 429-464. MIT Press, 1996.
6. Y. Ishikawa. "Multi Thread Template Library: MPC++ Version 2.0 Level 0 Document". Technical Report TR-96012, RWC, September 1996. http://www.rwcp.or.jp/lab/mpslab/mpc++/mpc++.html
7. A. Hori, T. Yokota, Y. Ishikawa, S. Sakai, H. Konaka, M. Maeda, T. Tomokiyo, J. Nolte, H. Matsuoka, K. Okamoto, and H. Hirono. "Time Space Sharing Scheduling and Architectural Support". In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, Vol. 949 of Lecture Notes in Computer Science. Springer-Verlag, April 1995.
8. A. Hori, H. Tezuka, Y. Ishikawa, N. Soda, H. Konaka, and M. Maeda. "Implementation of Gang-Scheduling on Workstation Cluster". In D. G. Feitelson and L. Rudolph, editors, IPPS'96 Workshop on Job Scheduling Strategies for Parallel Processing, Vol. 1162 of Lecture Notes in Computer Science, pp. 76-83. Springer-Verlag, April 1996.
9. T. von Eicken, V. Avula, A. Basu, and V. Buch. "Low-Latency Communication over ATM Networks using Active Messages". In Proceedings of Hot Interconnects II, Palo Alto, August 1994.
10. http://www.myri.com
