Design of High Performance MVAPICH2: MPI2 over InfiniBand ∗

W. Huang†, G. Santhanaraman†, H.-W. Jin‡, Q. Gao†, D. K. Panda†

† Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
{huanwei,santhana,gaoq,panda}@cse.ohio-state.edu

‡ Computer Engineering Department, Konkuk University, Seoul 143-701, Korea
[email protected]

∗ This research is supported in part by Department of Energy grant #DE-FC02-01ER25506; National Science Foundation grants #CNS-0403342 and #CCR-0509452; grants from Intel, Mellanox, Sun, Cisco, and Linux Networx; and equipment donations from Apple, AMD, IBM, Intel, Microway, Pathscale, Silverstorm and Sun.

Abstract

MPICH2 provides a layered architecture for implementing MPI-2. In this paper, we present a new design for implementing MPI-2 over InfiniBand by extending the MPICH2 ADI3 layer. The new design aims to achieve high performance by providing a multi-communication method framework that uses the appropriate communication channels/devices to attain optimal performance without compromising scalability or portability. We also compare the performance of the new design with our previous design based on the MPICH2 RDMA channel, and show significant improvements in micro-benchmarks and the NAS Parallel Benchmarks.

1 Introduction

The Message Passing Interface (MPI) has become the de facto standard for writing parallel applications. The need for high performance, scalable MPI implementations is growing with the increasing proliferation of clusters in high performance computing and the increasing size of next generation supercomputers. Over the past decade there has been a strong trend towards RDMA-capable networks because of their improved performance and scalability. InfiniBand [2], with its RDMA capability and advanced hardware features, has emerged as a strong player in the area of cluster computing.

The MPI-2 standard [5] was proposed to extend the functionality of the earlier MPI-1 standard [7]. MPICH2 is a popular MPI-2 implementation from Argonne National Laboratory that aims to combine performance with portability across different interconnects. In our previous work, we implemented MPI-2 over InfiniBand based on the MPICH2 RDMA channel [3]. Implementing at a lower layer benefits from ease of porting, but it lacks the flexibility to incorporate novel solutions, as studied in [8].

In this paper we describe a new extended ADI3-level design for MPI-2 over InfiniBand. We provide a framework that addresses performance and scalability mainly in the context of InfiniBand, yet remains portable to other programming APIs provided by RDMA-capable interconnects. We propose a layered design consisting of an extended ADI3 layer, a multi-communication method (MCM) layer that chooses the communication channels/devices that can deliver optimal performance and improve scalability, and a device layer that encapsulates all device-specific information to preserve the portability of the original MPICH2 design.

The rest of the paper is organized as follows. In Section 2 we describe the layered design of MPICH2 as background. We present our new design in Section 3 and the performance evaluation in Section 4. Finally, we conclude in Section 5.

2 Layered Design of MPICH2

MPICH2 from Argonne National Laboratory is a popular implementation of MPI-2. It aims to combine performance with portability across different interconnects. It follows a layered design that offers three implementation alternatives over InfiniBand, as illustrated in Figure 1.

Figure 1. Layered Design of MPICH2 (1: ADI3-level design; 2: CH3-level design; 3: RDMA channel-level design)

The ADI3 layer is a full-featured abstract device interface used in MPICH2. It is responsible for both point-to-point and one-sided communication. ADI3 completes its tasks through function interfaces provided by the CH3 layer to increase portability. The CH3 layer provides a communication channel that consists of a dozen interfaces implementing the basic functionality of the eager and rendezvous communication protocols, as well as auxiliary functions such as process startup and finalization. As shown in the figure, MPICH2 provides several channel implementations for sockets, shared memory, etc.

To further increase portability over RDMA-capable networks, MPICH2 also provides a CH3 implementation based on the RDMA channel interface. All communication operations that MPICH2 supports are mapped to just five functions at the RDMA channel, so the RDMA channel incurs the least porting overhead. However, our earlier study of a design of MVAPICH2 (MPI-2 over InfiniBand) based on the RDMA channel revealed that implementing at this layer rules out many performance optimizations [8], because the implementation cannot access the data structures that carry the information those optimizations need.
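To make this constraint concrete, the sketch below shows what a channel narrowed down to a handful of byte-stream operations might look like. The names and signatures are hypothetical illustrations of the idea, not the actual MPICH2 RDMA channel interface; the point is that a layer that only sees opaque buffers cannot, for instance, recognize a derived datatype's layout or an MPI header that could be cached.

```c
/* Hypothetical sketch of a minimal RDMA-channel-style interface.
 * These are NOT the real MPICH2 RDMA channel functions; they only
 * illustrate how every MPI operation is funneled through a few
 * byte-stream calls, so the channel never sees MPI-level information
 * (datatype layout, matching headers, one-sided targets). */
#include <stddef.h>
#include <sys/uio.h>

struct rdma_channel_vc;                 /* opaque virtual connection */

typedef struct rdma_channel_ops {
    int (*init)(int rank, int size);
    int (*finalize)(void);
    /* Send a vector of contiguous byte regions to a peer. */
    int (*put_bytes)(struct rdma_channel_vc *vc,
                     const struct iovec *iov, int iov_count);
    /* Receive whatever bytes the peer has made available. */
    int (*get_bytes)(struct rdma_channel_vc *vc,
                     struct iovec *iov, int iov_count, size_t *received);
    /* Poll for progress on outstanding operations. */
    int (*poll)(void);
} rdma_channel_ops_t;
```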

3 Design and Implementation

In this section we present a new design of MVAPICH2 that extends the MPICH2 ADI3 layer. The main objectives of this new design are:

Performance: The new design should take advantage of the communication methods available on a cluster, such as shared memory for intra-node communication, to achieve the best performance.

Portability: It is desirable to have a layered design with a concise device abstraction for each of these communication methods, while giving the lower layer enough information to perform most hardware-specific optimizations, such as zero-copy datatype communication [6].

Figure 2. Overall new design of MVAPICH2 (ADI3-Ex layer; MCM layer; device layer with inter-node devices over Gen2, VAPI, and uDAPL and an intra-node communication device)

Our overall design is illustrated in Figure 2 and follows the basic idea of the layered structure of MPICH2. We start with an extension of the ADI3 layer (ADI3-Ex). Below that sit our own multi-communication method (MCM) layer and device layer. The device layer provides abstractions of the communication methods on the cluster, each of which can be either an intra-node or an inter-node communication device, and the MCM layer exploits the performance benefits provided by these abstract devices. The following subsections briefly describe the designs of these three layers. A more detailed technical report is available as [1].

3.1 ADI3-Ex Layer

The ADI3-Ex layer, which extends the ADI3 layer in MPICH2, is the highest level in our design. Inherited from MPICH2, the main responsibility of the ADI3-Ex layer is selecting the appropriate internal communication protocol (eager or rendezvous) for each point-to-point or one-sided operation issued by the user application. Our extensions to the ADI3 layer are mainly along the following aspects:

Query: For RDMA-enabled interconnects, two protocols are commonly used for message transfer: the eager protocol, which uses pre-registered buffers for small messages, and the zero-copy rendezvous protocol for large messages [3]. The current MPICH2 design supports both protocols but decides between them solely at the ADI3 layer. In our design, the ADI3-Ex layer queries the preferred protocol from the MCM layer, which dynamically chooses among the available communication devices and therefore has better knowledge to make this decision.

Point to Point operations: Point-to-point communication can take advantage of header caching, a technique that caches the content of internal MPI headers at the receiver side. If the next message to that receiver carries the same header fields, the sender does not need to send those fields again, which reduces the message size. Header caching can only be implemented at this layer, since only the ADI3-Ex layer understands the content of the message; a small sketch of this idea follows at the end of this subsection.

Collectives: Currently, the collective operations in our design are inherited from MPICH2 and are implemented on top of point-to-point communication, so they benefit from the performance improvements of the new design. In the future, the framework can be extended to incorporate optimized collective algorithms that directly utilize InfiniBand hardware capabilities, such as hardware multicast [4].

One Sided Operations: We also incorporate optimized one-sided operations in the design. The detailed design and implementation are described in [1].
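As an illustration of the header caching idea described above, the following sketch shows a sender-side check that skips the MPI header when it matches what the receiver already holds. The structure layouts and names are our own assumptions, not the actual MVAPICH2 packet format.

```c
/* Minimal sketch of header caching, assuming hypothetical header and
 * packet layouts (illustrative only, not the MVAPICH2 structures).
 * The sender remembers the last header sent to each peer; when the
 * fields repeat, it sets a flag and omits the header, and the
 * receiver fills it in from its cached copy. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PEERS 1024

typedef struct {
    int32_t src_rank;
    int32_t tag;
    int32_t context_id;
} mpi_hdr_t;

typedef struct {
    uint8_t   hdr_cached;   /* 1: receiver reuses its cached header   */
    mpi_hdr_t hdr;          /* only meaningful when hdr_cached == 0   */
} pkt_t;

static struct { int valid; mpi_hdr_t hdr; } last_sent[MAX_PEERS];

/* Returns the number of bytes that actually need to go on the wire. */
size_t build_packet(int dest, const mpi_hdr_t *hdr, pkt_t *pkt)
{
    if (last_sent[dest].valid &&
        memcmp(hdr, &last_sent[dest].hdr, sizeof(*hdr)) == 0) {
        pkt->hdr_cached = 1;
        return offsetof(pkt_t, hdr);     /* header omitted */
    }
    pkt->hdr_cached = 0;
    pkt->hdr = *hdr;
    last_sent[dest].valid = 1;
    last_sent[dest].hdr = *hdr;
    return sizeof(pkt_t);
}
```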

3.2 Multi-Communication Method Layer

The multi-communication method (MCM) layer implements the communication protocols (eager or rendezvous) selected by the ADI3-Ex layer using the communication primitives supported by the device layer. Figure 3 illustrates the interfaces and the basic functional modules of this layer.


Figure 3. Design details of the MCM layer

The Device Selection Component understands the performance features of each communication device exposed by the device layer and chooses the most suitable one to complete each communication at runtime. Our current implementation supports one inter-node and one intra-node communication device simultaneously; in the future we plan to extend this framework to support multiple communication devices.

Once the ADI3-Ex layer decides the communication protocol, communication progress is handled by the progress engine at the MCM layer. The progress engine has several important components. The eager and rendezvous progress components implement the eager and rendezvous communication protocols. Typically, for the eager protocol, messages are sent through the copy-and-send primitives provided by the device layer. For the rendezvous protocol, the user buffers involved in the communication are first registered through the registration primitives provided by the device layer and the data is then sent through the zero-copy primitives.

In our design the progress engine understands the capabilities of the device layer and can internally handle its limitations, transparently to the ADI3-Ex layer. For example, the ADI3-Ex layer may issue an eager message that is too large for the device to send in one operation; in that case the message is broken into smaller pieces and sent as multiple packets. This is especially important for scalability, where the lower-level device can shrink its pre-registered buffers to reduce memory consumption. As shown in Section 4, hiding such details from the ADI3-Ex layer through this "packetized send" greatly benefits performance when the pre-registered buffers are small. Similarly, the device may fail to register a user buffer for the zero-copy rendezvous protocol, in which case large messages are also broken into pieces and sent through the copy-and-send primitives. This allows our MPI design to work seamlessly with network programming interfaces that do not support RDMA features: the device layer simply returns failure on every registration attempt.

The ordering component keeps the correct ordering, at the receiver side, of messages sent through different devices. Incoming messages from a specific device are detected through the message detection interface provided by the device layer. The One Sided Scheduling component implements re-ordering and aggregation for one-sided operations, which significantly improves their performance [9].
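The sketch below illustrates the packetized eager send described above, under assumed device primitives (dev_max_eager_payload and dev_copy_and_send are hypothetical names, not the real MVAPICH2 internals): the MCM progress engine, not the ADI3-Ex layer, handles the case where a message exceeds the device's pre-registered buffer.

```c
/* Sketch of a packetized eager send: split a message larger than the
 * device's pre-registered buffer into device-sized packets.
 * Device primitive names and signatures are assumptions. */
#include <stddef.h>

size_t dev_max_eager_payload(void);                 /* e.g. 2 KB */
int    dev_copy_and_send(int dest, const void *chunk,
                         size_t len, int more_to_come);

int mcm_eager_send(int dest, const void *buf, size_t len)
{
    const char  *p     = buf;
    const size_t chunk = dev_max_eager_payload();

    while (len > 0) {
        size_t n = len < chunk ? len : chunk;
        /* more_to_come tells the receiver to keep reassembling. */
        if (dev_copy_and_send(dest, p, n, len > n) != 0)
            return -1;
        p   += n;
        len -= n;
    }
    return 0;
}
```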

3.3 Device Layer

The lowest layer, the device layer, implements the basic network transfer primitives. This layer hides the differences between the various communication schemes provided by the system while exposing as many features as possible (such as zero-copy communication) to the MCM layer. The interfaces provided by the device layer include copy-and-send primitives, zero-copy primitives, and registration primitives that prepare buffers for zero-copy communication. The communication primitives at this layer are implemented directly on top of the programming libraries provided by the communication systems.

Since multiple devices can exist at the same time, each device also implements a query primitive that provides the device selection component of the MCM layer with its basic performance characteristics as well as any specialized hardware features it supports, such as scatter/gather for non-contiguous data communication. This enables the MCM layer to select at runtime the most suitable device for a particular message. The message detection primitive is queried by the progress engine at the MCM layer to detect the next incoming packet.
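A device abstraction along these lines could look like the following sketch (field names are ours; the actual MVAPICH2 data structures may differ). Each device, whether it wraps Gen2, VAPI, uDAPL, or shared memory, exports the same small set of primitives, and the query primitive lets the MCM layer learn each device's capabilities at startup.

```c
/* Sketch of a device abstraction for the device layer described above.
 * Field names are illustrative assumptions, not MVAPICH2 internals. */
#include <stddef.h>

typedef struct dev_caps {
    size_t max_eager_payload;   /* pre-registered buffer size           */
    int    supports_zero_copy;  /* RDMA-style zero-copy available?      */
    int    supports_sgl;        /* scatter/gather for non-contiguous?   */
    double approx_latency_us;   /* rough hint for device selection      */
} dev_caps_t;

typedef struct comm_device {
    const char *name;           /* "gen2", "vapi", "udapl", "shmem", ...*/
    int (*query)(dev_caps_t *caps);                       /* query primitive   */
    int (*copy_and_send)(int dest, const void *buf, size_t len, int more);
    int (*register_buf)(void *buf, size_t len, void **handle);   /* may fail */
    int (*zero_copy_send)(int dest, void *handle, size_t len);
    int (*detect_message)(int *src, size_t *len);         /* message detection */
} comm_device_t;
```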

4 Performance Evaluation

In this section we evaluate our new design with micro-benchmarks and the NAS Parallel Benchmarks. We compare the new design with MVAPICH2-0.6.5, our old design based on the RDMA channel over InfiniBand. Our experimental testbed is a cluster of eight nodes, each equipped with dual 3.0 GHz Intel Xeon CPUs, 2 GB of memory, and a Mellanox MT23108 PCI-X InfiniBand HCA.

4.1 Inter-Node and Intra-Node Communication

We first look at the improvement in basic point-to-point communication performance, evaluating latency and throughput between two processes for both inter-node and intra-node communication. Figures 4(a) and 4(b) show the results. For inter-node communication, the 1-byte message latency is reduced from 5.8 µs to 4.9 µs, a 16% improvement for the new design. This can be attributed to the header caching scheme implemented in the ADI3-Ex layer. Since the new design also overcomes the old design's limitation of allowing only one outstanding communication request [8], message throughput increases greatly, especially for medium-sized messages, where we observe up to 29% improvement. For intra-node communication, the MCM layer automatically chooses shared memory as the communication device, which reduces small-message latency from 5.7 µs to 1.1 µs and drastically increases throughput, especially for medium-sized messages: for 16 KB messages, throughput improves by 150%, from 398 MB/s to 1019 MB/s.
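For reference, a typical ping-pong micro-benchmark of the kind behind Figure 4(a) looks like the following generic sketch; this is not the exact benchmark code used for the numbers reported here.

```c
/* Generic MPI ping-pong latency test between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000, size = 1;            /* 1-byte messages */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```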

4.2 Non-Contiguous Datatype Communication

Performance can benefit significantly from hardware-specific optimizations at the device level. As an example, we implement the SGRS scheme [6] at the device level to enhance non-contiguous datatype communication. The earlier RDMA channel design does not have enough information (the layout of the whole message) to perform such an optimization. The benchmark transfers an increasing number of columns of a two-dimensional M x 8192 integer matrix between two processes on two cluster nodes. These columns can be represented as a vector datatype, which is widely used in scientific applications. Figures 4(c) and 4(d) compare the latency and throughput of the new and old designs. The old design uses the pack/unpack approach, which involves extra copies, while the new design uses the SGRS scheme. As we increase the number of columns from 64 to 4096, the zero-copy SGRS scheme clearly outperforms the copy-based scheme.
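A vector datatype of the kind used in this benchmark can be described to MPI as in the sketch below (ROWS is an illustrative value and the row-major layout is an assumption; the benchmark varies the matrix and column counts). Passing the layout to MPI, rather than packing the columns by hand, is what allows a device-level SGRS implementation to use scatter/gather transfers.

```c
/* Send the first ncols columns of a row-major ROWS x COLS int matrix
 * as a strided vector datatype instead of a packed copy. */
#include <mpi.h>

#define ROWS 1024             /* illustrative matrix height           */
#define COLS 8192             /* matrix width used in the benchmark   */

void send_columns(const int *matrix, int ncols, int dest, MPI_Comm comm)
{
    MPI_Datatype coltype;
    /* ROWS blocks of ncols ints each, one block per row, stride COLS. */
    MPI_Type_vector(ROWS, ncols, COLS, MPI_INT, &coltype);
    MPI_Type_commit(&coltype);
    MPI_Send((void *)matrix, 1, coltype, dest, 0, comm);
    MPI_Type_free(&coltype);
}
```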

Figure 4. Micro-benchmark evaluations: (a) MPI latency, (b) MPI bandwidth, (c) vector communication latency, (d) vector communication throughput, (e) packetization effect (latency), (f) packetization effect (throughput)

4.3 Effects of Packetization

For scalability reasons, the device layer reduces the size of the pre-registered communication buffers as the MPI job size increases. In the old design (MVAPICH2-0.6.5), messages bigger than the communication buffer are forced to go through the rendezvous protocol, which hurts performance because of the additional handshake phase. Figures 4(e) and 4(f) show that with the packetization scheme in the new design, the performance drop is minimal even when the pre-registered buffer size is reduced to only 2 KB, a substantial improvement over the old design.

4.4 NAS Parallel Benchmarks

We evaluate application-level performance with the NAS Parallel Benchmarks. Figure 5 shows the results of these experiments, conducted on 8 dual-processor nodes (16 processes). Due to the combined effect of the intra-node communication device and the optimized progress engine, the new design performs considerably better. For the CG benchmark we see a 24% improvement in execution time, and for the MG and IS benchmarks we see improvements of 5% and 14%, respectively.

Figure 5. NAS execution time comparison

5 Conclusions

In this paper we have presented a new MVAPICH2 design that extends MPICH2's ADI3 layer. We illustrate the overall design and explain its advantages. Within the proposed design, most hardware-related performance optimizations can be incorporated smoothly at the device layer, which brings high performance and scalability, while keeping the portability originally delivered by MPICH2's RDMA channel interface.

References

[1] W. Huang, G. Santhanaraman, H.-W. Jin, Q. Gao, and D. K. Panda. Design and Implementation of High Performance MVAPICH2 (MPI2 over InfiniBand). Technical Report OSU-CISRC-12/05-TR76.

[2] InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.2.

[3] J. Liu, W. Jiang, P. Wyckoff, D. K. Panda, D. Ashton, D. Buntinas, W. Gropp, and B. Toonen. Design and Implementation of MPICH2 over InfiniBand with RDMA Support. In IPDPS 2004.

[4] J. Liu, A. Mamidala, and D. K. Panda. Fast and Scalable MPI-Level Broadcast Using InfiniBand's Hardware Multicast Support. In IPDPS 2004.

[5] MPICH2. http://www-unix.mcs.anl.gov/mpi/mpich2/.

[6] G. Santhanaraman, J. Wu, W. Huang, and D. K. Panda. Designing Zero-Copy MPI Derived Datatype Communication over InfiniBand: Alternative Approaches and Performance Evaluation. Special Issue of IJHPCA, 2005.

[7] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI: The Complete Reference, Volume 1 - The MPI-1 Core, 2nd edition. The MIT Press, 1998.

[8] W. Huang, G. Santhanaraman, H.-W. Jin, and D. K. Panda. Design Alternatives and Performance Trade-offs for Implementing MPI-2 over InfiniBand. In EuroPVM/MPI 2005.

[9] W. Huang, G. Santhanaraman, H.-W. Jin, and D. K. Panda. Scheduling of MPI-2 One Sided Operations over InfiniBand. In CAC 2005 (in conjunction with IPDPS 2005).