Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters using Shared Memory Backed Windows

Sreeram Potluri, Hao Wang, Vijay Dhanraj, Sayantan Sur, and Dhabaleswar K. Panda
Department of Computer Science and Engineering, The Ohio State University
{potluri,wangh,dhanraj,surs,panda}@cse.ohio-state.edu

Abstract. The Message Passing Interface (MPI) has been very popular for programming parallel scientific applications. As multi-core architectures have become prevalent, a major question has emerged about the use of MPI within a compute node and its impact on communication costs. The one-sided communication interface in MPI provides a mechanism to reduce communication costs by removing the matching requirements of the send/receive model. The MPI standard provides the flexibility to allocate memory windows backed by shared memory. However, state-of-the-art open-source MPI libraries do not leverage this optimization opportunity for commodity clusters. In this paper, we present a design and implementation of the intra-node MPI one-sided interface using shared memory backed windows on multi-core clusters. We use the MVAPICH2 MPI library for design, implementation and evaluation. Micro-benchmark evaluation shows that the new design can bring up to 85% improvement in Put, Get and Accumulate latencies with the passive synchronization mode. The bandwidth performance of Put and Get improves by 64% and 42%, respectively. The Splash LU benchmark shows an improvement of up to 55% with the new design on a 32-core Magny-Cours node, and a similar improvement on a 12-core Westmere node. The mean BFS time in Graph500 reduces by 39% and 77% on the Magny-Cours and Westmere nodes, respectively.

Keywords: MPI, shared memory, one-sided

1 Introduction

Message Passing has been the most popular model for developing parallel scientific applications. However, with the advent of multi-core processors, researchers have questioned the use of message passing for intra-node communication, since the sender-receiver interaction and tag matching pose a significant overhead. MPI One Sided Communication provides a better alternative by avoiding these overheads. In this model, the origin process can independently initiate and complete transfers from remote memory without requiring any involvement from the process owning it. Traditionally, many MPI libraries have implemented intra-node one-sided communication calls over the two-sided model. This prevents them from achieving the true potential of one-sided communication within a node.

1.1 Motivation

The one-sided communication model provides the flexibility to create windows in shared memory, thus providing true one-sided intra-node transfers with minimal overheads. MPI implementations on shared memory machines and on platforms with the SHM protocol [4, 13] have taken advantage of this to achieve good performance. However, their usage is limited to particular platforms or to a single node. MVAPICH2 supports the use of kernel modules to implement one-sided communication [6]. The usability of kernel-assisted methods is in general limited due to the requirement for kernel compatibility, the need for system administrator support during installation, and overheads for small messages. This has motivated the work in this paper to design shared memory backed windows in MVAPICH2 for use on multi-core InfiniBand clusters. Figure 1 depicts the design choice presented in this paper.

1.2 Contributions

In this paper, we make the following key contributions:

1. We design and implement shared memory backed windows in MVAPICH2.
2. We discuss the interactions of this design with the existing designs for inter-node communication on InfiniBand clusters.
3. Using micro-benchmarks and application-level benchmarks, we show the improved performance with our new design.

Micro-benchmark evaluation shows that the design using shared memory backed windows can achieve up to 85% improvement in latency for Put, Get and Accumulate in passive mode. Bandwidth for Put and Get improves up to 64% and 42%, respectively. The Splash LU and Graph500 benchmarks show up to 55% and 39% improvement, respectively, on a 32-core AMD Magny-Cours node. They show 55% and 77% improvement, respectively, on a 12-core Intel Westmere node.

Fig. 1. MPI Intra-Node Communication Design (design choice in this paper is colored): MPI communication is divided into two-sided message passing (shared memory or kernel assisted) and one-sided communication, which is implemented either over two-sided message passing or as true one-sided communication (kernel assisted or shared memory backed windows)

2 Background and Related Work

In this section, we provide an overview of the one-sided interface in MPI-2 and the extensions in MPI-3 Draft [2] that are relevant to our work. We also discuss the existing implementations for one-sided communication within a node.


2.1 MPI One-sided Communication Model

The MPI one-sided interface enables direct remote memory access through the concept of a "window". A window is a region of memory that each process exposes to others through a collective operation: MPI Win create. After that, each process can directly read or update data in the windows at all other processes in the communicator. Semantically, this differs from two-sided communication in that the origin specifies all the parameters required for a communication (including the remote address and datatype). In MPI-2, the user has to allocate the buffer and pass it to the MPI Win create function. The standard suggests the use of MPI Alloc mem to allocate window memory, as this allows MPI libraries to allocate memory in a way that achieves high performance. The MPI-3 draft introduces a new function, MPI Win allocate, which lets the library both allocate a buffer and create the window in one call.

MPI-2 provides three communication operations, all of which are non-blocking: MPI Put (write), MPI Get (read) and MPI Accumulate (update). Access to windows and the start/completion of communication operations are managed through synchronization calls. The synchronization modes provided by MPI-2 RMA can be classified into passive (no explicit participation from the target) and active (involving both origin and target). In the passive mode, an epoch (the period in which a window can be accessed) is bounded by calls to MPI Win lock and MPI Win unlock. As the name suggests, this does not require any participation from the remote process. The MPI-2 one-sided interface provides two modes of active synchronization: a) a collective synchronization on the communicator, MPI Win fence; and b) a group-based synchronization, MPI Win post-MPI Win wait and MPI Win start-MPI Win complete. In the latter mode, the origin process uses MPI Win start and MPI Win complete to specify an access epoch for a group of target processes, and a target process calls MPI Win post and MPI Win wait to specify an exposure epoch during which a group of origin processes can access its window. The communication operations issued by an origin can execute only after the target has called post, and the target can complete an epoch only when all the origins in the post group have called complete on the window. Normally, multiple RMA operations are issued in an epoch to amortize the synchronization overhead. The draft MPI-3 standard adds several new calls for communication and synchronization. However, we focus our discussion in this paper on the functions discussed above.
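As a concrete illustration of these semantics, the sketch below allocates window memory with MPI Alloc mem, creates a window, and issues a Put inside a passive-target lock/unlock epoch. The buffer name, window size and target rank are illustrative choices, not taken from this paper.

/* Sketch: a passive-target Put over a window whose memory comes from
 * MPI_Alloc_mem. Names (winbuf, WIN_COUNT, target) are illustrative. */
#include <mpi.h>

#define WIN_COUNT 4096

int main(int argc, char **argv)
{
    int rank, nprocs;
    double *winbuf, value = 1.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* The standard recommends MPI_Alloc_mem so that the library can place
     * the window memory where it performs best (e.g., in shared memory). */
    MPI_Alloc_mem(WIN_COUNT * sizeof(double), MPI_INFO_NULL, &winbuf);

    /* Collective window creation: every process exposes its buffer. */
    MPI_Win_create(winbuf, WIN_COUNT * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0 && nprocs > 1) {
        int target = 1;
        /* Passive-target epoch: lock, issue the RMA operation, unlock. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
        MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);
    }

    MPI_Win_free(&win);
    MPI_Free_mem(winbuf);
    MPI_Finalize();
    return 0;
}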

2.2 Existing Implementations for Intra-node MPI One-sided Communication

One-sided communication within a node can be designed in several ways. Most libraries implement it over Send/Receive calls using shared memory buffers. The origin process copies the data into a shared buffer and the destination process copies it out. This approach cannot provide the asynchronous progress that is desired in one-sided communication. Also, for large messages, the copies into and out of shared memory buffers become a significant overhead. Kernel-assisted techniques like LIMIC2 [5] and KNEM [3] have been proposed to avoid these extra copies. The source and destination buffers are mapped into the kernel's address space and the data is copied directly between the two buffers through a kernel module. However, this still does not solve the problem of asynchronous progress. More recent work [6] has proposed designs for a direct implementation of one-sided communication using kernel assistance; this is supported in MVAPICH2 from version 1.6. The window buffers are mapped into the kernel address space, which enables zero-copy transfers. The usability of kernel-assisted methods is in general limited due to the requirement for kernel compatibility, the need for system administrator support during installation, and overheads for small messages.

One-sided communication has also been efficiently implemented over shared memory systems [13, 4] where memory is globally accessible. However, these designs are platform-dependent or are restricted to a single node. In this paper, we propose the use of shared memory backed windows for intra-node one-sided communication on modern multi-core clusters.

3 Design and Implementation

In this section, we describe in detail the design of the one-sided communication and synchronization semantics using shared memory backed windows. We also discuss the interaction of the proposed intra-node design with existing implementations of inter-node communication.

3.1 Window Creation

In this section, we present the design of shared memory backed windows in the context of two window creation mechanisms: MPI Win create (MPI-2 and draft MPI-3) and MPI Win allocate (draft MPI-3).

Using MPI Alloc mem: In our design, memory regions created using MPI Alloc mem are allocated as shared memory files. They are mapped and maintained as a list sorted by their starting addresses. When the application calls MPI Win create with a buffer, we parse the shared memory file list to check whether the buffer was allocated in shared memory. If the window memory at any of the processes on a given node was not allocated in shared memory, the processes fall back to the default implementation that uses Send/Receive calls. Otherwise, each process checks if it has already mapped the corresponding shared memory file. This check is required because two or more windows can be created from memory in the same file/buffer. If not, the process maps the shared memory files into its virtual address space and generates the base addresses.

Using MPI Win allocate: Window allocation in shared memory becomes easier with the new semantics. Each buffer will have only one window associated with it. The steps for window creation are similar to the earlier case, except that the information about the shared memory files can be associated with and stored within the window structure rather than maintained in an external data structure.
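As an illustration of the first mechanism, the following sketch backs an allocation with a POSIX shared memory file so that intra-node peers can later map it. The file-naming scheme, function name and error handling are hypothetical and do not reflect the actual MVAPICH2 code.

/* Sketch: backing an MPI_Alloc_mem request with a POSIX shared memory file.
 * Naming and bookkeeping are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

void *alloc_shared_backed(size_t size, int node_rank, int alloc_id)
{
    char name[64];
    /* One file per allocation; peers derive the same name from
     * (node_rank, alloc_id) when they map it during window creation. */
    snprintf(name, sizeof(name), "/mpi_win_%d_%d", node_rank, alloc_id);

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;                      /* caller falls back to malloc() */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                            /* the mapping keeps the file alive */
    if (base == MAP_FAILED)
        return NULL;

    /* Record (base, size, name) in a list sorted by base address so that a
     * later MPI_Win_create can test whether a user buffer lies in a shared
     * memory region. */
    return base;
}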

3.2 Communication

Implementation of communication operations over shared memory backed windows requires two simple steps. The process initiating the call calculates the target address using the base address of the window (mapped from the shared memory file) and the displacement specified in the operation. It can then directly copy or compute on the data at the target address.
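A minimal sketch of these two steps is shown below; win_bases[] and disp_units[] are assumed per-window bookkeeping arrays filled in during window creation, not part of the MPI interface.

/* Sketch: direct intra-node Put over a shared memory backed window.
 * win_bases[] holds each local peer's mapped base address and disp_units[]
 * its displacement unit; both are illustrative bookkeeping structures. */
#include <mpi.h>
#include <string.h>

static inline void shm_put(const void *origin_addr, size_t nbytes,
                           int target_rank, MPI_Aint target_disp,
                           char **win_bases, const int *disp_units)
{
    /* Step 1: compute the target address from the mapped base address and
     * the displacement specified in the operation. */
    char *target_addr = win_bases[target_rank]
                        + (size_t)target_disp * (size_t)disp_units[target_rank];

    /* Step 2: copy (or, for Accumulate, compute on) the data directly. */
    memcpy(target_addr, origin_addr, nbytes);
}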


3.3 Synchronization

Lock-Unlock: MPI Win lock and MPI Win unlock can be implemented over Send/Receive using lock request and lock grant messages. When there is just one communication operation between the lock and unlock calls, the whole lock-op-unlock sequence can be accomplished through a single message. This implementation applies to both inter-node and intra-node scenarios in the current implementation of MVAPICH2. In the new design, each intra-node lock is implemented as two shared memory counters allocated during window creation. One counter holds the kind (exclusive/shared) of lock any process holds on the window. The second counter contains the number of processes holding the lock when the type is shared. When a process handles a request it received from an off-node process, it checks the shared memory lock counters for any existing conflicting locks. While granting a lock, it sets its shared memory lock counters appropriately so that no intra-node peer can acquire a conflicting lock. Locks can also be implemented using InfiniBand RDMA atomics as described in [9, 7]. However, InfiniBand does not provide atomicity between CPU-initiated lock operations and network-initiated atomic operations. Therefore, we implement locks between intra-node processes using loop-back network atomic operations as suggested in [9].

Post-Wait/Start-Complete: In a Send/Receive based implementation of Post-Wait/Start-Complete, an MPI Win post call at the target translates into a message being sent to each process in the specified group. On the other end, an origin process calling MPI Win start waits for post messages to arrive from all the processes in the specified group before issuing communication calls. A similar interaction happens in the opposite direction when MPI Win complete and MPI Win wait are called. When communication operations are deferred until the second set of synchronization calls, complete messages can be piggybacked onto the last data message being sent from a source to a target. We implement Post-Wait/Start-Complete synchronization within a node using shared memory counters. Each process has a Post and a Complete flag per window for every other process on the same node. Processes can directly set these counters to signal Post and Complete operations instead of sending messages. The processes calling Start and Wait poll on these counters. A similar intra-node design works in conjunction with the RDMA-based implementation, where the signaling happens through RDMA writes to pre-allocated counters.

Fence: MPI Win fence is a collective operation. In a single call, it marks the end of the previous epoch, ensuring the completion of all operations issued before it, and also marks the beginning of the next epoch. In Send/Receive based implementations, each process obtains the count of operations that were issued in the epoch with its window as the target; MPI Reduce Scatter is used for this exchange. Then each process waits for the expected number of messages to arrive from every other process. An MPI Barrier is called to ensure completion at all the processes before starting the next epoch. In our design, intra-node communication operations are blocking and can be considered complete when they return. So no additional handling is required during the Fence call; the existing Barrier ensures the required synchronization. Several design choices exist for implementing Fence using RDMA operations. One can use counters as in the case of Post-Wait/Start-Complete, maintaining Post and Complete counters for all the processes in the communicator. More optimized schemes can be designed using RDMA Write with Immediate Data as described in [11]. In either case, the intra-node synchronization can still be handled by calling a Barrier on the shared memory (intra-node) communicator.
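A minimal sketch of the shared memory flag scheme for intra-node Post-Wait/Start-Complete follows. The flag layout, the use of C11 atomics and the local-rank indexing are assumptions made for illustration; they are not the actual MVAPICH2 data structures.

/* Sketch: intra-node Post-Wait/Start-Complete over shared memory flags.
 * post[t * nlocal + o] is set by target t for origin o; complete[o * nlocal + t]
 * is set by origin o for target t. Both arrays live in a shared segment
 * created along with the window. Illustrative only. */
#include <stdatomic.h>

typedef struct {
    int nlocal;            /* number of processes on the node             */
    atomic_int *post;      /* nlocal x nlocal flags in the shared segment */
    atomic_int *complete;  /* nlocal x nlocal flags in the shared segment */
} shm_sync_t;

/* Target side of MPI_Win_post: expose the window to a group of local origins. */
void shm_win_post(shm_sync_t *s, int me, const int *origins, int n)
{
    for (int i = 0; i < n; i++)
        atomic_store_explicit(&s->post[me * s->nlocal + origins[i]], 1,
                              memory_order_release);
}

/* Origin side of MPI_Win_start: wait until every target has posted. */
void shm_win_start(shm_sync_t *s, int me, const int *targets, int n)
{
    for (int i = 0; i < n; i++) {
        atomic_int *flag = &s->post[targets[i] * s->nlocal + me];
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            ;                                   /* spin until posted */
        atomic_store_explicit(flag, 0, memory_order_relaxed);
    }
}

/* Origin side of MPI_Win_complete: signal the end of the access epoch. */
void shm_win_complete(shm_sync_t *s, int me, const int *targets, int n)
{
    for (int i = 0; i < n; i++)
        atomic_store_explicit(&s->complete[me * s->nlocal + targets[i]], 1,
                              memory_order_release);
}

/* Target side of MPI_Win_wait: wait for all origins in the post group. */
void shm_win_wait(shm_sync_t *s, int me, const int *origins, int n)
{
    for (int i = 0; i < n; i++) {
        atomic_int *flag = &s->complete[origins[i] * s->nlocal + me];
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            ;                                   /* spin until completed */
        atomic_store_explicit(flag, 0, memory_order_relaxed);
    }
}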

4 Experimental Results

In this section, we describe our experiments and give an in-depth analysis of the results. We have carried out all micro-benchmark experiments on the Intel Westmere platform. Each node has 8 cores (4 per socket) running at 2.67 GHz with 12 GB of DDR3 RAM. The operating system is RHEL Server 5.4. We use MVAPICH2 version 1.6 and the LIMIC2 0.5.4 kernel module. We have evaluated the designs with application benchmarks on two platforms. One is Trestles at the San Diego Supercomputer Center; each node contains four sockets, each with an 8-core AMD Magny-Cours processor, and 64 GB of memory. The other is Lonestar at the Texas Advanced Computing Center; each node has two sockets, each with a 6-core Intel Westmere processor, and 24 GB of memory. Both use CentOS 5.5.


4.1 Micro-Benchmark Evaluation

In order to avoid the effects of compiler and system optimizations on the memory-to-memory copies, we have modified the OSU Micro-Benchmarks (OMB) [8] to use a different buffer in each transfer of a given message size and to modify the contents of the buffers after each message size. All point-to-point experiments were run across cores on different sockets with no shared cache. "MV2" refers to the version that uses the Send/Receive based implementation of one-sided communication. "MV2-LIMIC" refers to the designs using LIMIC2 discussed in [6]. "MV2-SWIN" represents the designs presented in this paper, using shared memory backed windows.
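The buffer rotation can be sketched as follows; the pool size, iteration count and the use of a passive-target epoch are illustrative choices, not the actual OMB code.

/* Sketch of the modified latency loop: a pool of buffers is cycled so that
 * each transfer of a given size uses a different buffer, and the buffer
 * contents are rewritten after every message size. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE 64
#define ITERS     1000
#define MAX_MSG   (16 * 1024)

void run_put_latency(MPI_Win win, int peer)
{
    char *pool[POOL_SIZE];
    for (int b = 0; b < POOL_SIZE; b++)
        pool[b] = malloc(MAX_MSG);

    for (int size = 1; size <= MAX_MSG; size *= 2) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            char *buf = pool[i % POOL_SIZE];   /* different buffer per transfer */
            MPI_Win_lock(MPI_LOCK_SHARED, peer, 0, win);
            MPI_Put(buf, size, MPI_CHAR, peer, 0, size, MPI_CHAR, win);
            MPI_Win_unlock(peer, win);
        }
        printf("%d bytes: %.2f us\n", size,
               (MPI_Wtime() - t0) * 1e6 / ITERS);

        for (int b = 0; b < POOL_SIZE; b++)    /* modify contents after each size */
            memset(pool[b], size & 0xff, MAX_MSG);
    }

    for (int b = 0; b < POOL_SIZE; b++)
        free(pool[b]);
}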

Fig. 2. Latency Performance: (a) MPI Put, (b) MPI Get, (c) MPI Accumulate (Time (us) vs. message size for MV2, MV2-LIMIC and MV2-SWIN)

Latency: Figure 2 shows the latency performance of MPI Put, MPI Get and MPI Accumulate. These benchmarks use the Post-Wait/Start-Complete mode of synchronization. MVAPICH2 enables LIMIC2 in one-sided communication for message sizes greater than 4 Kbytes; for smaller messages, the overhead from LIMIC2 is greater than the benefit it gives. The slight improvement of "MV2-LIMIC" over "MV2" for small messages comes from shared-memory based synchronization. Compared with "MV2-LIMIC", "MV2-SWIN" does not have the overhead of the kernel module and can be used for all message sizes. It performs consistently better than "MV2-LIMIC" for small messages. For larger messages, the overhead from the kernel module becomes negligible, so "MV2-SWIN" and "MV2-LIMIC" converge as the message size increases. At 16 KB, "MV2-SWIN" performs 16% better than "MV2-LIMIC" and 59% better than "MV2". MPI Accumulate does not use the kernel-assisted design; hence we see similar performance for "MV2" and "MV2-LIMIC". We see a 53% improvement in Accumulate latency with "MV2-SWIN" for 8 Kbyte messages.

Bandwidth: The bandwidth performance is shown in Figure 3. "MV2-SWIN" clearly outperforms the existing techniques and achieves close to the system peak performance. We see up to 64% improvement in bandwidth for Put and 42% for Get compared with "MV2-LIMIC". The "MV2-LIMIC" and "MV2" designs queue operations until the synchronization phase. These queuing overheads are removed in "MV2-SWIN", where operations are executed as soon as they are issued.

Fig. 3. Bandwidth Performance: (a) MPI Put, (b) MPI Get (Bandwidth (MB/s) vs. message size for MV2, MV2-LIMIC and MV2-SWIN)

Fig. 4. Performance with Off-cache Data, Passive Synchronization and Busy Target: (a) Off-cache MPI Put Latency, (b) Passive MPI Put Latency, (c) MPI Get Latency with Busy Target (Lock-Get-Unlock time vs. computation on target)


Latency Performance with Off-Cache Data: In Figure 4(a), we have modified the Put latency benchmark to ensure the data is off-cache, by accessing data from a different page in each iteration and by flushing the cache after each message size. We believe this experiment better reflects the performance behavior of the designs in real-world applications. For 4 Kbyte messages, "MV2-SWIN" shows a 43% improvement compared to "MV2-LIMIC".

Latency with Passive Synchronization: Figure 4(b) shows Put latency with the passive mode of synchronization. One-sided communication in this mode has not been optimized with the LIMIC2 kernel module; hence we observe similar performance for "MV2" and "MV2-LIMIC" up to 16K. We see more than 85% improvement using "MV2-SWIN".

Passive Latency with Busy Target: This experiment shows the asynchronous progress that can be achieved using shared memory backed windows. We measure the latency of Lock-Put-Unlock when the remote process is busy in computation for an increasing amount of time (a sketch of this measurement loop appears below). The results are shown in Figure 4(c). For "MV2-SWIN", where the completion of passive operations does not require any remote process intervention, we see that the time remains constant.

Multi-pair Bandwidth and Message Rate: We use all 8 cores on the Westmere node with 4 pairs of MPI processes. Figure 5(a) shows the multi-pair bandwidth performance using MPI Put. "MV2-SWIN" gives 69% better bandwidth than "MV2-LIMIC" for 4 Kbyte messages. For messages beyond 4 Kbytes, the two designs perform similarly. Figure 5(b) shows the multi-pair message rate performance of the different designs. "MV2-SWIN" achieves nearly 4.5 times improvement in message rate compared to the other designs for 4-byte messages.
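The busy-target measurement can be sketched as follows; the helper names, the two-rank setup and the busy-wait loop are illustrative, not the actual benchmark code.

/* Sketch of the busy-target experiment: rank 0 times a Lock-Put-Unlock epoch
 * while rank 1 computes for busy_us microseconds without making MPI calls.
 * With shared memory backed windows the timed epoch completes without any
 * involvement of the busy target. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

static void compute_for(double busy_us)
{
    double t0 = MPI_Wtime();
    while ((MPI_Wtime() - t0) * 1e6 < busy_us)
        ;                                      /* simulated computation */
}

void busy_target_test(MPI_Win win, double *buf, int rank, double busy_us)
{
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(buf, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);
        printf("busy = %5.0f us, latency = %6.2f us\n",
               busy_us, (MPI_Wtime() - t0) * 1e6);
    } else if (rank == 1) {
        compute_for(busy_us);                  /* target busy, no MPI calls */
    }
    MPI_Barrier(MPI_COMM_WORLD);
}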

Fig. 5. Multi-Pair Put Bandwidth and Message Rate: (a) MPI Put Bandwidth (MB/s), (b) MPI Put Message Rate (Messages/s), vs. message size for MV2, MV2-LIMIC and MV2-SWIN

4.2 Application Benchmark Evaluation

In this section, we evaluate our new design with application benchmarks. The application benchmarks use communication paths (passive synchronization and accumulate operations) that were not optimized using LIMIC2, so we present results comparing "MV2-SWIN" with the default design "MV2". All the experiments were run on a single node, i.e., with 32 cores on Trestles and 12 cores on Lonestar.

Splash LU: We use the Splash LU benchmark [12] modified to use MPI-2 communication calls. The design of this benchmark is outlined in [10]; it uses passive synchronization and MPI Get calls. We run the test on a 16K x 16K matrix of doubles with a block size of 128x128. From the results shown in Figure 6(a), the new design "MV2-SWIN" shows a 55% improvement compared to "MV2" on Trestles. It shows a similar improvement on Lonestar.

Graph500: Next, we compare the performance of the Breadth First Search (BFS) kernel from the Graph500 [1] benchmark suite using "MV2" and "MV2-SWIN". This kernel uses Fence synchronization and Accumulate operations. Graphs of size 2^18 and 2^16 nodes were used on Trestles and Lonestar, respectively. Results in Figure 6(b) show that "MV2-SWIN" achieves a 39% lower mean BFS time compared to "MV2" on Trestles. It achieves a 77% improvement on Lonestar.

Fig. 6. Splash LU and Graph500: (a) Splash-LU Time (sec) on Trestles and Lonestar, (b) Graph500 Mean BFS Time (sec) on Trestles {2^18, 16} and Lonestar {2^16, 16}; labels give Architecture {Scale, Edge Factor} (MV2 vs. MV2-SWIN)

5 Conclusion

Two-sided message passing leads to significant overhead for intra-node communication due to the requirement for sender-receiver interaction and tag matching. The one-sided communication interface in MPI provides a better alternative. In this paper, we design one-sided communication using shared memory backed windows for use on multi-core InfiniBand clusters. Experimental evaluation using micro-benchmarks and application benchmarks has shown that our design performs significantly better than the existing kernel-assisted and send/receive-based options. We have implemented our design in MVAPICH2 and will make it publicly available through a release in the near future.


Acknowledgments. This research is supported in part by U.S. Department of Energy grants #DE-FC02-06ER25749 and #DE-PFC02-06ER25755; National Science Foundation grants #CCF-0833169, #CCF-0916302, #OCI-0926691 and #CCF-0937842; grants from Intel, Mellanox, Cisco, QLogic, and Sun Microsystems; and equipment donations from Intel, Mellanox, AMD, Appro, Chelsio, Dell, Microway, QLogic, and Sun Microsystems.

References

1. Graph500. http://www.graph500.org/
2. MPI-3 RMA. https://svn.mpi-forum.org/trac/mpi-forum-web/rawattachment/wiki/mpi3-rma-proposal1/one-side-2.pdf
3. Barrett, B., Shipman, G.M., Lumsdaine, A.: Analysis of Implementation Options for MPI-2 One-Sided. In: Proceedings of PVM/MPI. pp. 242–250 (2007)
4. Booth, S., Mourao, E.: Single sided MPI implementations for SUN MPI. In: Proceedings of the ACM/IEEE Conference on Supercomputing. pp. 2–2 (2000)
5. Jin, H.W., Sur, S., Chai, L., Panda, D.K.: Lightweight kernel-level primitives for high-performance MPI intra-node communication over multi-core systems. In: Proceedings of IEEE International Conference on Cluster Computing. pp. 446–451 (2007)
6. Lai, P., Sur, S., Panda, D.K.: Designing Truly One-Sided MPI-2 RMA Intra-node Communication on Multi-core Systems. In: Proceedings of International Supercomputing Conference (ISC). vol. 25, pp. 3–14 (2010)
7. Narravula, S., Mamidala, A., Vishnu, A., Vaidyanathan, K., Jin, H.W., Panda, D.K.: High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations. In: Proceedings of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid). pp. 583–590 (2007)
8. OSU Micro-Benchmarks. http://mvapich.cse.ohio-state.edu/benchmarks/
9. Santhanaraman, G., Balaji, P., Gopalakrishnan, K., Thakur, R., Gropp, W., Panda, D.K.: Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand. In: Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid). pp. 380–387 (2009)
10. Santhanaraman, G., Narravula, S., Panda, D.K.: Designing Passive Synchronization for MPI-2 One-Sided Communication to Maximize Overlap. In: Proceedings of International Parallel and Distributed Processing Symposium (IPDPS). pp. 1–11 (2008)
11. Santhanaraman, G., Gangadharappa, T., Narravula, S., Mamidala, A., Panda, D.K.: Design Alternatives for Implementing Fence Synchronization in MPI-2 One-sided Communication on InfiniBand Clusters. In: Proceedings of IEEE Cluster. pp. 1–9 (2009)
12. Singh, J.P., Weber, W., Gupta, A.: SPLASH: Stanford Parallel Applications for Shared-Memory. Tech. rep., Stanford, CA, USA (1991)
13. Thakur, R., Gropp, W., Toonen, B.: Optimizing the Synchronization Operations in MPI One-Sided Communication. International Journal of High Performance Computing Applications (IJHPCA). pp. 119–128 (2005)
