Performance Evaluation of Some MPI Implementations on Workstation Clusters

Natawut Nupairoj and Lionel M. Ni
Department of Computer Science
Michigan State University
East Lansing, MI 48824-1027
{nupairoj, [email protected]}

Abstract

Message Passing Interface (MPI) is an attempt to standardize the communication library for distributed-memory computing systems. Since the release of the recent MPI specification, several MPI implementations have been made publicly available. Different implementations employ different approaches, and thus the performance of each implementation may vary. Since communication performance is extremely crucial to message-passing applications, selecting an appropriate MPI implementation becomes critical. Our study is intended to provide a guideline on how to perform such a selection on workstation clusters, which are known to be an economical and effective platform for high-performance computing. We investigate several aspects of MPI, including its functionality and performance. Our results also point out the strengths and weaknesses of each implementation on our experimental system.

1 Introduction

Message Passing Interface (MPI) has become an emerging standard for implementing message-based parallel programs in distributed-memory computing environments. One major goal of MPI is to provide a widely portable and easy-to-use programming library without sacrificing performance [1]. To establish an efficient standard for many platforms, MPI provides several mechanisms to perform point-to-point and collective communications. The performance of these mechanisms may vary depending on the software implementation and the underlying hardware. Several other features, such as derived datatypes, persistent communication, and the group concept, are also introduced to improve the ease of use. However, if MPI is not carefully implemented, the communication cost can be so expensive that programs may not gain any benefit from parallel processing.

(*) This work was supported in part by NSF grants CDA9121641 and MIP-9204066, and DOE grant DE-FG0293ER25167.

Since the release of the recent MPI specification [1], several MPI implementations have been made publicly available. While these respected groups are still improving the quality of their MPI implementations, the released versions already reflect their early efforts in promoting MPI. The purpose of this paper is to establish a set of MPI communication benchmark programs to evaluate the communication performance of these implementations. In this study, four public-domain MPI implementations, shown in Table 1, are considered. Our testing environment consists of 6 DEC Alpha workstations interconnected via both Ethernet and a DEC GIGAswitch. The DEC GIGAswitch can provide up to 100 Mbps per channel. We have developed a set of benchmarks to evaluate the performance of both point-to-point and collective communication services. These benchmark programs include:

1. Ping: measures the peak performance of point-to-point communication over a communication channel;

2. PingPong: evaluates the end-to-end communication latency, which includes the effect of the communication protocol; and

3. Collective: evaluates the performance of some collective communication operations, including broadcast and barrier synchronization.

Our evaluation focuses not only on the raw performance, but also on other performance issues, such as the effect of user buffer allocation, the utilization of the underlying hardware support, the performance of each sending mode as well as persistent communication, and the effect of derived datatypes, including contiguous-space and non-contiguous-space datatypes and pack-unpack operations. We also indicate how to interpret these results and show the relative performance of the initial implementations of the four MPI libraries.

Table 1. Summary of the tested MPI implementations.

  MPI     Organization                Version         ftp site              Reference
  CHIMP   U. of Edinburgh             2.1             ftp.epcc.ed.ac.uk     [2]
  LAM     Ohio State                  5.2             tbag.osc.edu          [3]
  MPICH   Argonne-Mississippi State   July 13, 1994   info.mcs.anl.gov      [4]
  Unify   Mississippi State           0.9.1           ftp.erc.msstate.edu   [5]

The rest of this paper is organized as follows. In Section 2, we explain the fundamental concepts and some terminology of MPI. The model and performance metrics used in our study are presented in Section 3. Section 4 contains the details of our benchmark set, including how to interpret the results. Experimental results and analysis are discussed in Section 5. Due to space limitations, only partial results are presented; interested readers may refer to [6] for additional performance results. Related work is given in Section 6. We conclude the paper in Section 7.

2 Message Passing Interface

In the message-passing paradigm, a program consists of a set of processes or tasks. Each process performs computation independently and communicates with other processes via communication channels. For workstation clusters, processes can be executed on distinct workstations connected to a high-speed switch that provides the communication channels among processes. At some point, those processes may need to synchronize or exchange data by passing messages through the communication channels. These communications can serialize the program and hence prevent it from achieving an ideal speedup. Thus, the implementation of MPI is crucial to the performance of applications. To investigate this problem, we have to understand the semantics of some MPI functions.

Point-to-Point Communication

MPI provides two mechanisms of data transmission: blocking and nonblocking. A blocking call guarantees that resources such as buffers can be safely reused when it returns, which simplifies the need for hand-shaking and resource polling. A nonblocking call allows the overlapping of communication and computation: with suitable hardware, data copying can be concurrent with computation [1]. Nevertheless, the return of a nonblocking call does not guarantee that the resources can be reused.
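As a concrete illustration of the difference, the following C sketch (the peer rank, tags, and message buffer are illustrative assumptions, not taken from the paper's benchmarks) contrasts a blocking standard send with a nonblocking send that overlaps communication and computation.

/* Sketch: blocking vs. nonblocking point-to-point sends (MPI-1 C binding). */
#include <mpi.h>

void send_examples(int peer, double *buf, int n)
{
    MPI_Status  status;
    MPI_Request request;

    /* Blocking standard send: when it returns, buf may be safely reused. */
    MPI_Send(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);

    /* Nonblocking send: returns immediately; buf must not be modified
       until MPI_Wait completes the request. */
    MPI_Isend(buf, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &request);
    /* ... computation that does not touch buf can proceed here ... */
    MPI_Wait(&request, &status);
}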

Figure 1. The model of the communication modes (standard, buffered, synchronous, and ready), between a sender S and a receiver R.

Both blocking and nonblocking calls can use one of the following communication modes: standard, buffered, synchronous, and ready. These communication modes are illustrated in Figure 1. In standard mode, MPI chooses how to send a message: any combination of the three other modes, or communication modes not defined in the MPI specification, can be used. As mentioned in [7], one possible criterion for selecting the communication mode is the size of the message being sent. MPI also allows a user to control the communication mode explicitly through the three other modes. The buffered mode always buffers an outgoing message when no matching receive is posted. The sender can resume computing very quickly, since the send is considered complete once the message is placed in the buffer. Messages are usually bundled together and sent in groups, which improves overall communication throughput. However, this mode requires memory for the buffer, and an extra copy is needed to move the message into the buffer before sending; this is even worse when sending a large message. In synchronous mode, MPI ensures that the receiver has reached a certain point in its execution. A message does not have to be buffered at the sender in this mode, but an acknowledgment from the receiver might be necessary. Finally, a ready-mode send can complete only when the matching receive has already been posted. This mode is available because it is more efficient on some communication systems; as noted in the MPI proposal [1], a ready-mode send can be safely replaced with a standard-mode send. Our experiments are mostly based on the blocking standard and buffered sending modes. For measurement purposes, these two modes provide useful information regarding the performance of MPI on our testing platform. Other platforms may have to use different modes.
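The following C sketch shows the four send modes side by side (the peer rank, tags, and buffer size are assumptions for illustration; the peer is assumed to post matching receives, with the ready-mode receive pre-posted).

/* Sketch: the four MPI send modes applied to the same message. */
#include <mpi.h>
#include <stdlib.h>

void send_mode_examples(int peer, char *msg, int n)
{
    int bufsize = n + MPI_BSEND_OVERHEAD;   /* room for one buffered send */
    void *attach_buf = malloc(bufsize);
    MPI_Buffer_attach(attach_buf, bufsize);

    MPI_Send (msg, n, MPI_CHAR, peer, 0, MPI_COMM_WORLD);  /* standard    */
    MPI_Bsend(msg, n, MPI_CHAR, peer, 1, MPI_COMM_WORLD);  /* buffered    */
    MPI_Ssend(msg, n, MPI_CHAR, peer, 2, MPI_COMM_WORLD);  /* synchronous */
    /* Ready mode is valid only because the receiver is assumed to have
       already posted the matching receive for tag 3. */
    MPI_Rsend(msg, n, MPI_CHAR, peer, 3, MPI_COMM_WORLD);

    MPI_Buffer_detach(&attach_buf, &bufsize);
    free(attach_buf);
}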

Collective Communication

Parallel programs often require collective communications, which involve a group of processes. MPI provides many operations that support these functionalities. Collective communications, such as barrier synchronization and multicast, can be implemented with point-to-point communications. However, this straightforward implementation is inefficient and does not scale well when the number of processes in the group is large. New high-speed switches usually provide some basic support for those operations, and an efficient implementation of MPI must utilize this support to reduce the overhead.
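A minimal sketch of the naive point-to-point implementation described above, contrasted with the library collective (root rank and datatype are illustrative assumptions):

/* Sketch: broadcast built from point-to-point sends vs. the MPI collective. */
#include <mpi.h>

void naive_bcast(double *data, int n, int root, MPI_Comm comm)
{
    int rank, size, p;
    MPI_Status status;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == root) {
        for (p = 0; p < size; p++)       /* one send per process: latency grows linearly */
            if (p != root)
                MPI_Send(data, n, MPI_DOUBLE, p, 0, comm);
    } else {
        MPI_Recv(data, n, MPI_DOUBLE, root, 0, comm, &status);
    }
}

/* The collective call lets the library (or switch hardware) use a better algorithm:
   MPI_Bcast(data, n, MPI_DOUBLE, root, comm);                                        */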

Persistent Communication Requests

In some applications, communications occur repeatedly between the same source-destination pairs. Since some transmission arguments, for example the header, stay the same, the communication can be optimized by allowing the application to reuse the old arguments without re-initialization. This functionality is known as a persistent communication request. An application first initializes the communication by creating the persistent request. When the application wants to send, it invokes an MPI function call to start the request. Since part of the initialization was done when the request was first created, the setup overhead can be reduced.
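A sketch of this pattern in C (peer rank, tag, and iteration count are assumed values): the arguments are bound once with MPI_Send_init and the prepared request is then restarted on every iteration.

/* Sketch: persistent send request reused across iterations. */
#include <mpi.h>

void persistent_example(double *buf, int n, int peer, int iters)
{
    MPI_Request req;
    MPI_Status  status;
    int i;

    /* Bind the send arguments once. */
    MPI_Send_init(buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);

    for (i = 0; i < iters; i++) {
        MPI_Start(&req);         /* reuse the prepared request */
        MPI_Wait(&req, &status); /* complete this send before the next start */
    }
    MPI_Request_free(&req);
}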

Non-Contiguous Datatype

Some applications, such as matrix calculations, send sub-blocks of a matrix which are non-contiguous in memory. Traditionally, the sender has to gather the non-contiguous data into a contiguous buffer (pack), and the receiver has to distribute the data from the contiguous buffer into a non-contiguous buffer (unpack). Some communication systems have special hardware to support the transmission of non-contiguous data. To utilize this hardware, MPI allows users to send non-contiguous datatypes, namely vector and indexed. A vector datatype consists of multiple data blocks which are evenly spaced; all blocks are the same size and can be derived from multiple copies of the old datatype. An indexed datatype is similar to a vector datatype, except that the size of each block and the space between blocks can be arbitrary.
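As an illustration of the vector datatype, the sketch below sends one column of an N x N row-major matrix without packing it first (N, the column index, and the peer rank are assumed example parameters; the receiver is assumed to receive N doubles or use a matching derived type).

/* Sketch: sending a strided column with a vector datatype. */
#include <mpi.h>

void send_column(double *matrix, int N, int col, int peer)
{
    MPI_Datatype column_type;

    /* N blocks of 1 element each, spaced N elements apart. */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    MPI_Send(&matrix[col], 1, column_type, peer, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column_type);
}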

3 Model and Metrics

To make a fair comparison among different MPI implementations, our benchmarks evaluate the performance of the communication channel at the application level. At this level, we can measure both software and network overhead. Theoretically, the communication throughput available to a process is limited by the peak throughput of the channel. However, the sustained throughput is much lower due to software overhead and network congestion.

3.1 Measurement Model

Figure 2. The measurement model (processes P0, P1, ..., Pn on separate workstations connected through a high-speed switch).

Figure 2 illustrates the conceptual model of our experiments. Each benchmark may consist of two or more processes. We execute each process on a separate workstation in order to eliminate the problem of having two processes competing for the communication channel. Since our experiments are done in the Unix environment, which is a multi-tasking system, our processes still have to share the CPU with other processes, including daemons. By running the benchmarks long enough, the measured performance is accurate enough to be used in the comparison. If we inspect the message transmission between two processes carefully, it can be decomposed into three steps:

1. Send: the sender spends time on packet processing, such as message copying, packetizing, and checksum computation. We call this latency the "sending latency" ($t_{send}$).

2. Network: after being put onto the communication channel, the message experiences some delay in the network before it reaches the destination. We call this latency the "network latency" ($t_{net}$).

3. Receive: once the message arrives, the receiver picks up the message from the communication channel and performs some packet processing, such as message reassembly and checksum computation. We call this latency the "receiving latency" ($t_{recv}$).
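Taken together, these three components make up the delivery time of a single message; the end-to-end delay later measured by the PingPong benchmark (see Figure 6) corresponds to their sum:

$$t_{\mathrm{end\text{-}to\text{-}end}} = t_{send} + t_{net} + t_{recv}$$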

3.2 Performance Metrics

Comparing two communication systems requires measuring several metrics. In our study, we compare the implementations of different communication libraries; thus, only two metrics are sufficient for the evaluation.

Communication latency (t)

We define the communication latency ($t$) to be the time that a process spends when it sends or receives (or both) a message. The communication latency is proportional to the message size and is given by

$$t = t_s + n \cdot t_t + \lfloor n/p \rfloor \cdot t_p \qquad (1)$$

where $t_s$ is the start-up latency, which is fixed for each message, $n$ is the size of the message, $t_t$ is the transmission latency (usually much less than $t_s$), $p$ is the packet size, and $t_p$ is the packetization latency. The start-up latency also includes the fixed cost of system calls and initialization overhead. Communication latency is highly dependent on the characteristics of the program. Suppose that a process sends a message to another process. If it has to wait for a reply from that process (PingPong), the communication latency includes both the software overhead from MPI and the network overhead. This type of latency is called the end-to-end delay. However, if the process does not have to wait and the send is nonblocking or in the buffered mode (Ping), the process suffers only the software overhead.

Channel throughput ($\theta$)

The channel throughput ($\theta$), or bandwidth, is the rate at which the network can deliver data (usually in Mbits per second). It is widely used among vendors because of its simplicity. We use this metric when we compare the performance for different message sizes. The throughput can be computed directly from the communication latency by

$$\theta = \frac{n}{t \times 10^6} \qquad (2)$$

If we substitute $t$ with Equation (1), the throughput becomes

$$\theta = \frac{10^{-6}}{t_s/n + t_t} \qquad (3)$$

Thus the peak throughput is limited to $10^{-6}/t_t$ when the message size is infinite. We further define the sustained throughput as the maximum throughput that can actually be achieved. By injecting messages into the communication channel as fast as possible, for example by repeatedly sending messages in the buffered mode, we can compute the sustained throughput from Equation (2).
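As a purely illustrative example with assumed (not measured) parameter values, take $t_s = 500\,\mu\mathrm{s} = 5\times 10^{-4}$ s and $t_t = 0.1\,\mu\mathrm{s}$ per bit $= 10^{-7}$ s/bit. For an 8192-bit (1 Kbyte) message, Equation (3) gives

$$\theta = \frac{10^{-6}}{5\times 10^{-4}/8192 + 10^{-7}} \approx 6.2\ \mathrm{Mbps},$$

while the peak throughput is $10^{-6}/t_t = 10$ Mbps; the start-up term $t_s/n$ is what keeps short messages well below the peak.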

3.3 Communication Parameters

Some communication parameters may have a dramatic impact on the communication performance, which can be greatly improved when appropriate values are used. In our benchmarks, we study two major parameters: the message size and the buffer size.

Message size (n)

As shown in Equation (1), the message size has a major effect on the communication latency. Sending a small message is not efficient, since the communication latency is dominated by the start-up latency ($t_s$) [8]. Increasing the message size improves the performance, because the effect of the start-up latency is reduced as the transmission time ($n \cdot t_t$) increases. When the message size is big enough, the communication latency approaches the transmission time.

Buffer size (B)

In the buffered mode, a sending call can return as soon as the message is placed in the buffer, which can effectively decrease the sending latency. A larger buffer implies that more messages can be placed in the buffer; hence, more messages can be injected without the sender getting blocked. To buffer messages, an application must dedicate a portion of its memory as the buffer. Using a large buffer consumes a lot of memory, and the performance gained by buffering may not outweigh the degradation due to reduced memory resources. Thus, choosing the right buffer size is crucial in the buffered mode.
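In MPI, the user supplies this buffer explicitly. The sketch below sizes an attached buffer for up to k outstanding buffered sends; k and msg_size stand for the tuning parameters discussed above and are assumptions, not values from our experiments.

/* Sketch: attaching a user buffer for buffered-mode sends. */
#include <mpi.h>
#include <stdlib.h>

void attach_send_buffer_demo(int k, int msg_size)
{
    /* Each pending MPI_Bsend needs room for its data plus bookkeeping. */
    int bufsize = k * (msg_size + MPI_BSEND_OVERHEAD);
    void *buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);

    /* ... up to k MPI_Bsend calls can now complete locally without blocking ... */

    MPI_Buffer_detach(&buf, &bufsize);   /* blocks until all buffered sends drain */
    free(buf);
}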

4 Benchmarking

In this section, we describe our benchmark programs, how the metrics are measured, and how to interpret the results from our benchmarks.

4.1 Ping

The purpose of the Ping benchmark is to measure the effective bandwidth of MPI by sending messages from one processor to another. The sender keeps sending data unless it is blocked, and the receiver keeps consuming data. The communication model is shown in Figure 3.

Figure 3. Ping communication model (P1 repeatedly sends to P2).

The algorithm for Ping is straightforward. Each message sent by the sender constitutes one iteration of the Ping communication. We measure the elapsed time of k iterations and compute the average delay accordingly. Throughput can be obtained using Equation (3).

Sender:
    measure start-time;
    For i := 1 to k Do
        send(message);
    EndDo
    measure stop-time;
    compute elapsed time and throughput.

Receiver:
    measure start-time;
    For i := 1 to k Do
        recv(message);
    EndDo
    measure stop-time;
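A possible C realization of this pseudocode is sketched below (the iteration count, ranks, and message size are example assumptions; a sufficiently large buffer is assumed to have been attached with MPI_Buffer_attach for the buffered sends).

/* Sketch: Ping benchmark using buffered sends and MPI_Wtime timing. */
#include <mpi.h>
#include <stdio.h>

#define K 1000   /* iterations per measurement */

void ping(int rank, char *msg, int n)
{
    double t0, t1;
    int i;
    MPI_Status status;

    if (rank == 0) {                        /* sender */
        t0 = MPI_Wtime();
        for (i = 0; i < K; i++)
            MPI_Bsend(msg, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        t1 = MPI_Wtime();
        printf("avg send latency: %f usec, throughput: %f Mbps\n",
               (t1 - t0) / K * 1e6,
               (8.0 * n * K) / ((t1 - t0) * 1e6));
    } else if (rank == 1) {                 /* receiver */
        for (i = 0; i < K; i++)
            MPI_Recv(msg, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
    }
}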

The elapsed time measured by the Ping benchmark (with the buffered mode) is illustrated in Figure 4. The down arrows represent calls from the process, the up arrows represent returns from MPI, and time flows from left to right. On the sender, after the send function is invoked, the message is copied to the buffer. If buffer space is available, the call returns as soon as the message has been successfully copied. Hence we can use this benchmark to estimate the sending latency ($t_{send}$), given that the buffer is big enough.

Figure 4. Ping timing diagram.

If we focus on the elapsed time measured on the sender end, we can use Equation (3) to explain the relationship between the latency and the message size. When the message size is small, the throughput is relatively low due to the start-up latency. As the message size gets larger, $t_{net}$ dominates. The throughput calculated from Equation (3) becomes stable once the size reaches a certain point; it represents the sustained network bandwidth. When the message size becomes too large, the throughput may decrease because of the overhead of the congestion control of the network protocol.

The buffer size is another parameter that can be varied in the buffered mode. When the buffer size is small, the buffer can fill up after a few messages have been sent, especially when the receiver cannot consume the messages fast enough. The sender is then blocked and has to wait until space is available in the buffer. A larger buffer size reduces the chance of the sender getting blocked. At some point, however, increasing the buffer size no longer improves the performance; even worse, a large buffer consumes a lot of memory and can cause additional overhead due to memory page faults.

4.2 PingPong

The PingPong benchmark aims to measure the end-to-end delay by sending a message back and forth between two processes. Unlike Ping, each process takes turns being the sender, and there is only one sender at a time.

Figure 5. PingPong communication model (P1 and P2 exchange a message).

Figure 5 shows the communication model of PingPong. Initially, P1 sends a message to P2 and waits for a returned message from P2. When P2 receives the message, it replies back to P1. This is one iteration of the PingPong communication in the blocking standard mode. We measure the elapsed time of k iterations of the PingPong communication and then compute the average delay. The algorithm of the PingPong benchmark is given below:

P1:
    measure start-time;
    FOR i := 1 to k DO
        send(message);
        recv(message);
    ENDDO
    measure stop-time;
    compute elapsed time and throughput.

P2:
    FOR i := 1 to k DO
        recv(message);
        send(message);
    ENDDO
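A corresponding C sketch using blocking standard-mode sends is given below (iteration count, ranks, and message size are example assumptions).

/* Sketch: PingPong benchmark in blocking standard mode. */
#include <mpi.h>
#include <stdio.h>

void pingpong(int rank, char *msg, int n, int k)
{
    MPI_Status status;
    double t0, t1;
    int i;

    if (rank == 0) {                                    /* P1 */
        t0 = MPI_Wtime();
        for (i = 0; i < k; i++) {
            MPI_Send(msg, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        }
        t1 = MPI_Wtime();
        /* One iteration is one send plus one matching receive (a full round trip). */
        printf("avg round-trip time: %f usec\n", (t1 - t0) / k * 1e6);
    } else if (rank == 1) {                             /* P2 */
        for (i = 0; i < k; i++) {
            MPI_Recv(msg, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(msg, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
}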

Since P1 has to wait for a message from P2 before it can begin the next iteration, the measured time represents the actual time needed to deliver a message. In other words, the PingPong benchmark accounts for the impact of the network latency. Figure 6 shows the timing diagram of the PingPong benchmark.

Figure 6. PingPong timing diagram (one iteration spans $t_{send}$, $t_{net}$, and $t_{recv}$).

The behavior of the PingPong benchmark with respect to the message size should be similar to that of the Ping benchmark. Using a small message size yields low throughput due to the start-up latency. As the message size is increased, the transmission latency starts to dominate and the throughput improves. However, the peak throughput that can be achieved by PingPong must be less than the peak throughput achieved by Ping, because the measured latency includes the network latency. Moreover, the impact of the buffer size is limited, since only one message is in transit at a time.

4.3 Collective

In our study, we focus on two collective communication functions: broadcast and barrier synchronization, because most parallel programs use these two functions.

Bcast:
    measure start-time;
    For i := 1 to k Do
        bcast(message);
    EndDo
    measure stop-time;
    compute elapsed time.

Barrier:
    measure start-time;
    For i := 1 to k Do
        barrier();
    EndDo
    measure stop-time;
    compute elapsed time.

The benchmarks for these functions are quite simple. To measure broadcast (in blocking mode), we repeatedly broadcast messages to the other processes in the group. The elapsed time before and after broadcasting can be used to estimate the latency of the broadcast. We can vary the size of the messages and the number of processes involved. Measuring barrier synchronization is done similarly. The measured latency indicates the scalability of broadcast and barrier synchronization. One simple implementation of broadcast is to send separate messages to each individual process; with such an implementation, the latency of broadcast should increase linearly with the number of processes involved. A C sketch of both measurements follows.
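This sketch times repeated broadcasts and barriers with MPI_Wtime (root rank, message size, and iteration count are example assumptions; in the broadcast loop every rank calls MPI_Bcast, so the loop itself keeps the group loosely synchronized).

/* Sketch: Collective benchmark for broadcast and barrier latency. */
#include <mpi.h>
#include <stdio.h>

void collective_bench(char *msg, int n, int k)
{
    int rank, i;
    double t0, t1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (i = 0; i < k; i++)
        MPI_Bcast(msg, n, MPI_CHAR, 0, MPI_COMM_WORLD);   /* rank 0 is the root */
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg broadcast latency: %f usec\n", (t1 - t0) / k * 1e6);

    t0 = MPI_Wtime();
    for (i = 0; i < k; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg barrier latency: %f usec\n", (t1 - t0) / k * 1e6);
}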

5 Experiments

5.1 Testing Environment

We performed our tests on a cluster of six DEC Alpha 3000/400 workstations, each with 32 MBytes of main memory and 424 MBytes of local disk. Each workstation runs OSF/1 version 2.0 and is connected to the GIGAswitch via FDDI. The tested MPI implementations are summarized in Table 1.

5.2 Experimental Results

In this section, we present the results from our experiments. Each data point in our results is the average of 10 experiments (1000 iterations each). Due to memory allocation problems of MPICH and Chimp on our testing platform, we could only perform experiments with message sizes of less than 4 Kbytes for MPICH and less than 1 Kbyte for Chimp.

Figure 7. Sending latency in µsec (short messages), for Chimp, LAM, MPICH, and Unify.

Figure 7 contains the results from the Ping benchmark with the standard sending mode. Obviously, MPICH has a very small sending latency compared with the others. However, this is not the case when the message size is large: when the message size exceeds 1 Kbyte, the sending latency of MPICH increases rapidly. MPICH uses two different algorithms for sending short and long messages. When the message is short, MPICH sends it directly from the user program's buffer without performing any packetization; when the message is longer than 1 Kbyte, MPICH starts packetizing the message, which greatly increases the sending latency. The other implementations show steady performance as the message size is increased, which indicates that their communication latency is dominated by the software start-up overhead.

Figure 8. Sustained bandwidth in Mbps (long messages), for LAM and Unify.

We ran our Ping benchmark with long messages on LAM and Unify in order to study the sustained bandwidth of these two implementations, using Equation (2) to compute the sustained bandwidth presented in Figure 8. Unify has a higher sustained bandwidth than LAM: the organization of Unify is quite simple, and hence it incurs less software overhead than LAM. Although our GIGAswitch and FDDI connection can deliver a peak performance of 100 Mbps, and more than 80 Mbps when using TCP/IP directly, high software overhead limits the sustained performance of both LAM and Unify to under 20 Mbps. We also notice that packetization occurs every 4 Kbytes.

Figure 9. Standard vs. synchronous sending latency in µsec (MPICH).

The impact of the sending modes of MPICH is presented in Figure 9. As mentioned in Section 2, one possible implementation of a synchronous send is a request-and-acknowledge scheme, which implies extra overhead when sending a message. Thus, the communication latency of the synchronous mode is usually longer than that of the standard sending mode.


Figure 10. Ping vs. PingPong latency in µsec (LAM).

To account for the latency of the underlying network, we ran our PingPong benchmark; the results are shown in Figure 10. As expected, the end-to-end delay (measured by PingPong) is higher than the sending latency (measured by Ping). The end-to-end delay increases as the message size increases, because long messages require a longer transmission time than short messages.

Figure 11. Broadcast latency in µsec for two message sizes (LAM and MPICH).

All four MPI implementations use software-based approaches to provide the collective communications, so the performance of collective communication depends on the performance of point-to-point communication. In Figure 11, broadcast experiments with two message sizes are conducted on LAM and MPICH. When the message size is small (1 byte), MPICH, which has a lower software start-up latency for short messages, performs very well. When broadcasting longer messages (1 Kbyte), LAM performs better, since it has lower point-to-point software overhead. However, the cost of broadcasting on LAM grows quickly when many workstations are involved, while MPICH shows very good scalability.

Figure 12. Barrier synchronization latency in µsec.

Figure 12 shows the performance of barrier synchronization. Since barrier synchronization can be implemented with point-to-point communication of short messages, a low software start-up latency allows a very efficient barrier implementation. As expected, MPICH requires the least time to complete the barrier synchronization.

6 Related Work

The issues of small-packet communication on high-speed networks have been discussed by Thekkath and Levy [9], who present the design and evaluation of a new low-latency RPC system that can be applied to new-technology networks. Douglas et al. investigated the performance of several important programming environments for workstation clusters [10]; they use several benchmarks, including ping-pong and a molecular dynamics program, to evaluate five different programming systems. An efficient implementation of MPI on the IBM SP1, and how to measure its performance, is presented by Franke et al. [7].

7 Conclusion

In this paper, we evaluate some MPI implementations that are currently publicly available for workstation clusters. Our results indicate that the software overhead is very high and has to be greatly reduced in order to fully exploit the bandwidth of the high-speed switch. Among the four MPI implementations, we choose LAM as the best implementation currently available for our environment. The LAM implementation is very stable and very easy to install, and we did not experience any major bug with it, although LAM incurs a relatively high software start-up overhead for short messages. Unify shows very strong performance; however, more development is needed in order to fully implement the MPI library. We do not choose MPICH and Chimp because of their memory management problems.

Because of time limitations, we could not conduct an extensive set of experiments on different distributions of non-contiguous datatypes. However, our initial results based on a simple vector datatype show that the cost of sending a non-contiguous datatype is not much higher than sending a contiguous datatype of the same size. Further investigation of the impact of non-contiguous datatypes is needed. We are also investigating the performance of other collective communication services.

Acknowledgments

The authors wish to thank Chi-Ming Chiang and Sherry Q. He for many useful discussions and suggestions.

References

1. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Mar. 1994.
2. R. Alasdair A. Bruce, J. G. Mills, and A. G. Smith, CHIMP Version 2.0 User Guide, University of Edinburgh, Mar. 1994.
3. G. Burns, R. Daoud, and J. Vaigl, LAM: An Open Cluster Environment for MPI, Ohio Supercomputer Center, May 1994.
4. B. Gropp, R. Lusk, T. Skjellum, and N. Doss, Portable MPI Model Implementation, Argonne National Laboratory, July 1994.
5. F.-C. Cheng, P. Vaughan, D. Reese, and A. Skjellum, The Unify System, Mississippi State University, July 1994.
6. N. Nupairoj and L. Ni, "Performance evaluation of some MPI implementations," Tech. Rep. MSUCPS-ACS-94, Department of Computer Science, Michigan State University, Sept. 1994.
7. H. Franke, P. Hochschild, P. Pattnaik, and M. Snir, "MPI-F: An efficient implementation of MPI on IBM-SP1," in Proceedings of the 1994 International Conference on Parallel Processing, St. Charles, IL, pp. 197-201, Aug. 1994.
8. L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, pp. 62-76, Feb. 1993.
9. C. Thekkath and H. Levy, "Limits to low-latency communication on high-speed networks," ACM Transactions on Computer Systems, vol. 11, pp. 179-203, May 1993.
10. C. Douglas, T. Mattson, and M. Schultz, "Parallel programming systems for workstation clusters," Tech. Rep. YALEU/DCS/TR-975, Department of Computer Science, Yale University, Aug. 1993.
