Efficient MPI for Virtual Interface (VI) Architecture

Rossen Dimitrov and Anthony Skjellum
Mississippi State University and MPI Software Technology, Inc.
Starkville, MS, U.S.A.

Abstract

Efficient Message Passing Interface implementations for emerging cluster interconnects are an important requirement for useful parallel processing on cost-effective clusters of NT workstations. This paper reports on a new implementation of MPI for VI Architecture networks. Support for high bandwidth, low latency, and low overhead is considered, as is the match of the MPI specification to the VI Architecture features. The ability to offer thread safety, computation and communication overlapping, optimized protocols, and full MPI progress is highlighted.

Keywords: MPI/Pro, multi-device MPI, clusters, VI Architecture

1 Introduction

In recent years, MPI has become a de facto standard for developing portable parallel applications [1, 2]. At present, there are a number of MPI implementations that target various parallel platforms and operating systems. One of the most extensively used implementations is MPICH from Mississippi State University and Argonne National Laboratory [3]. The rapid increase of microprocessor performance and the development of gigabit-per-second SAN interconnects have made clusters of workstations a powerful environment for parallel processing. There has been significant interest in MPI for enabling high-performance computing on clusters. A detailed review of MPI implementations for clusters of workstations is presented in [4].


2 VI Architecture

MPI/Pro has been designed and implemented specifically to target VI Architecture networks. The VI Architecture is a new industry-driven standard that specifies the interface between a high-performance SAN and a computer system [5]. A major objective of the VI Architecture is to reduce the message-passing overhead and the number of intermediate data copies induced by traditional general-purpose protocol stacks, such as TCP/IP. This is achieved by collapsing excessive software layers, eliminating the host operating system from the critical data path, and providing a hardware thread of control implemented by the VI network interface controllers (NICs). The VI Architecture allows user processes to interact directly with the NIC without intervention by the host operating system. This avoids unnecessary context switches and significantly reduces software overhead. Also, data transfers can continue even when the processes participating in the communication are not scheduled for execution. Transferring data in and out of user memory while processes are swapped out is achieved through the NIC DMA engines and the VI memory registration mechanism. Memory registration translates virtual to physical addresses and pins user buffers in physical memory. Another advanced feature of the VI Architecture is remote direct memory access (RDMA). RDMA increases the effective throughput between network nodes while minimizing the CPU time spent on communication tasks. RDMA allows a user process to deliver or obtain data over a VI network transparently to the remote system. A more detailed discussion of the VI Architecture, and specifically of the VI registration mechanism and RDMA, can be found in [4]. The VI Architecture is an important step toward creating a uniform view of high-speed networks: the underlying details of the data-link and physical network layers remain hidden from the upper communication software layers. This significantly increases the portability of message-passing systems while preserving maximum performance.
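To make the registration and RDMA mechanics concrete, the following schematic C sketch mocks the two steps a VI consumer performs: registering (pinning and translating) a buffer once, then issuing an RDMA write that the NIC's DMA engine carries out without host involvement. All function and type names here are hypothetical illustrations with stub bodies, not the actual VIPL API defined in [5].

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a VI memory handle; not a real VIPL type. */
typedef struct { void *addr; size_t len; } vi_memhandle;

/* Mock of memory registration: a real implementation pins the pages
   and records the virtual-to-physical translation in the NIC. */
static vi_memhandle vi_register_memory(void *addr, size_t len)
{
    vi_memhandle mh = { addr, len };
    printf("registered %zu bytes at %p (pages pinned)\n", len, addr);
    return mh;
}

/* Mock of an RDMA write: a real NIC DMA engine moves the data directly
   into the remote process's registered buffer, with no kernel
   transition and no CPU copy on either side. */
static void vi_rdma_write(vi_memhandle local, void *remote_addr,
                          unsigned remote_handle)
{
    printf("RDMA write of %zu bytes to remote %p (handle %u)\n",
           local.len, remote_addr, remote_handle);
}

int main(void)
{
    double *buf = malloc(1 << 20);

    /* Register once; reuse the handle so the high registration cost
       is amortized over many transfers. */
    vi_memhandle mh = vi_register_memory(buf, 1 << 20);
    for (int i = 0; i < 4; i++)
        vi_rdma_write(mh, (void *)0x1000, 42); /* placeholder remote info */

    free(buf);
    return 0;
}
```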

3 MPI/Pro Features

MPI/Pro has a number of features that make it a valuable tool for efficient programming of clusters of workstations. MPI/Pro provides:

- a multi-device architecture that allows MPI applications to efficiently exploit SMP parallelism;
- a multithreaded design that achieves independent message progress;
- user-level thread safety, so user applications can effectively exploit multi-grained parallelism through full-fledged MPI processes or operating-system-specific threads;
- an asynchronous method of synchronization and notification;
- an optimized persistent mode of MPI point-to-point operations;
- optimized protocols for short and long messages over both the VI and SMP devices;
- multiple queues for receive requests, which reduce the time for processing receive requests and increase the degree of concurrency; and
- optimized derived data types, which allow applications to exploit the high abstraction power of derived data types without loss of performance.

4 MPI/Pro Design Objectives

MPI/Pro has been designed to take full advantage of the VI Architecture's high-performance features and of some well-understood (but rarely used in other MPI implementations) programming techniques, such as multithreading and asynchronous notification. Although the first public version of the VI Architecture specification had just been released when the design of MPI/Pro was initiated, we believed that this new standard would attract the attention of network vendors and software developers from different areas. Currently, there are already several commercially available VI Architecture interconnects, among them GigaNet, ServerNet, and FC-VI [4].

Nowadays, workstations and servers with multiple processors are commonly used in practice. A computer platform with two or four processors is often considered cost-effective because of the relatively low additive cost of the extra processors with respect to the increased computational power. We wished to provide efficient mechanisms for optimal resource utilization, so we considered an MPI design with a multi-device architecture incorporating a VI network device and an optimized SMP device; the SMP device ensures efficient interprocess communication within a single multiprocessor machine.

Multithreading was another major design objective. An MPI implementation using multiple threads can achieve independent message progress and asynchronous notification, and it creates the potential for a higher degree of computation and communication overlapping. A number of operating systems for workstations, such as Solaris, Linux, and Windows NT, offer efficient preemptive thread packages. Extending the idea of multithreading, we considered user-level thread safety as another important design objective. Thread safety provides users with the opportunity to exploit local parallelism, efficient intra-SMP processing, and techniques for optimal CPU utilization.

Most of the existing MPI implementations for workstations, including MPICH, rely on polling for notification and synchronization. Although polling can achieve the lowest message-passing latency, it wastes CPU cycles spinning on notification flags and leaves little room for programming techniques that overlap computation and communication or that use resources efficiently [6]. A user thread that polls for synchronization does not yield the CPU until it is descheduled by the operating system, thus reducing the CPU time available for useful computation. Furthermore, latency is inferior to processor overhead as a realistic measure of performance for applications that can overlap computation and communication.

Optimizing persistent MPI operations was another important design consideration. The VI Architecture requires registration of all memory segments before they can participate in data transfers. Memory registration is a high-overhead operation, and to achieve better performance the VI Architecture specification recommends that the number of registration operations be minimized. A careful implementation of persistent operations can reduce the registration overhead by reusing the same registered segments multiple times. Additionally, a large group of applications can benefit from optimized persistent operations directly, because these operations efficiently reflect the temporal locality present in data-parallel and other regular parallel algorithms.

Optimized user-defined (derived) data types were another important design goal for MPI/Pro. MPI derived data types have gained broad acceptance as a powerful data-marshalling technique: through derived data types, MPI programmers can easily describe complex data structures and achieve efficient data distributions among processors. However, in most existing MPI systems, derived data types are implemented inefficiently and cause a significant loss of performance. With the right design and implementation, this performance loss can be reduced to a minimum or eliminated entirely.

Given these requirements, an ab initio implementation of MPI/Pro was chosen. Later in the development phase, this decision allowed us to incorporate a number of new architectural solutions for scalability and performance and to accomplish desirable features such as thread safety and asynchronous notification of completion through new, efficient programming techniques.
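The persistent-operation optimization described above is visible at the standard MPI level. In the sketch below (plain MPI C, not MPI/Pro internals), the same send buffer is bound to a request once with MPI_Send_init, so an implementation such as MPI/Pro can register the buffer a single time and reuse that registration on every MPI_Start; the iteration count and buffer size are arbitrary choices for illustration.

```c
#include <mpi.h>

#define NITER 100   /* arbitrary iteration count for illustration */
#define COUNT 4096  /* arbitrary buffer size for illustration */

int main(int argc, char **argv)
{
    int rank;
    static double buf[COUNT];
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Bind buffer, peer, and tag to a persistent request once;
           a VI-based MPI can register 'buf' here and keep it pinned. */
        MPI_Send_init(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        for (int i = 0; i < NITER; i++) {
            MPI_Start(&req);            /* reuses the registered buffer */
            MPI_Wait(&req, &status);
        }
        MPI_Request_free(&req);
    } else if (rank == 1) {
        for (int i = 0; i < NITER; i++)
            MPI_Recv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}
```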

5 MPI/Pro Architecture

MPI/Pro incorporates several new architectural solutions that minimize message-processing overhead, reduce the CPU time spent on communication, and improve scalability. Additionally, these solutions facilitate a high degree of overlap between communication and computation and an efficient multi-device mode of operation. Below, the most important of these architectural solutions are described.

5.1 Progress Threads

MPI/Pro uses a progress thread in each of its VI and SMP communication devices to implement independent, non-polling message progress. In most existing MPI implementations, progress on non-blocking or long messages is made only while user processes are calling into the MPI library. In contrast, MPI/Pro makes progress on all messages independently of the sequence of user calls: once a non-blocking send request is posted, MPI/Pro completes the associated transfer even if the user never makes a subsequent MPI call. Using a library thread per communication device also facilitates an asynchronous model for notification of communication events. The asynchronous model eliminates the need for polling and ensures minimal CPU involvement in data transfers. As a consequence, a user thread that participates in a data exchange may execute useful computations while the progress thread blocks waiting for an incoming message, so a high degree of communication and computation overlapping can be achieved. The use of progress threads also helps MPI/Pro comply with the strict interpretation of the MPI progress rule [1]. An important consequence of the threaded architecture of MPI/Pro is efficient processing in multi-device mode. Unlike conventional MPI implementations, MPI/Pro processes messages arriving on different devices independently of each other; hence, the performance of faster devices is not limited by the presence of slower devices. Finally, the multithreaded architecture was a natural basis for achieving user-level thread safety. By providing thread safety, MPI/Pro meets the requirements of MPI users who wish to exploit local parallelism through threads.
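The practical payoff of independent progress can be illustrated with plain MPI C: a polling-based library would only advance the transfer below inside MPI calls, whereas with a progress thread the message moves while the computation runs. The function compute_step and the buffer sizes are placeholders for illustration.

```c
#include <mpi.h>

#define COUNT (1 << 20)

/* Placeholder for useful local computation that does not touch the
   communication buffer. */
static void compute_step(double *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] = data[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    int rank;
    static double sendbuf[COUNT], work[COUNT];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        /* With a progress thread, the send proceeds during this call;
           with polling-only progress it would largely wait for MPI_Wait. */
        compute_step(work, COUNT);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(sendbuf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```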

5.2 Multiple Queues

Multiple queues for receive requests can be used to enhance performance. MPICH, for example, implements a single pair of receive queues for posted and unexpected receive requests. The basic algorithm for matching arriving messages with receive requests is as follows: all messages that arrive before a matching request has been posted are queued on the unexpected queue as unexpected requests; similarly, receive requests for messages from any rank are placed on the posted queue if they cannot be matched against the unexpected queue. MPI/Pro employs a new approach for matching incoming messages and receive requests. A pair of posted and unexpected queues is associated with each process, regardless of the device that services the communication to and from that process. In effect, MPI/Pro distributes the single pair of queues into multiple pairs, thus reducing the search time for request matching and eliminating the synchronization point caused by the single-pair solution. Further detail is provided in [4]. Through the distributed receive-queue mechanism, MPI/Pro achieves faster demultiplexing of incoming messages and, at the same time, a higher degree of internal concurrency, which is especially important for optimizing performance in multi-device mode.
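The following toy C sketch illustrates the data-structure change only: one posted-receive queue per peer instead of a single global queue. All names are hypothetical; this does not reproduce MPI/Pro's actual internals, and a real implementation must also keep per-peer unexpected queues and handle wildcards such as MPI_ANY_SOURCE, which this toy omits.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy posted-receive record; hypothetical, for illustration only. */
typedef struct recv_req {
    int tag;
    struct recv_req *next;
} recv_req;

static recv_req **posted; /* posted[src] = list head for that source */

static void post_recv(int src, int tag)
{
    recv_req *r = malloc(sizeof *r);
    r->tag = tag;
    r->next = posted[src];
    posted[src] = r;
}

/* Match an arriving message against the queue of its source only, so
   search length and contention shrink as the number of peers grows. */
static recv_req *match_arrival(int src, int tag)
{
    for (recv_req **pp = &posted[src]; *pp; pp = &(*pp)->next) {
        if ((*pp)->tag == tag) {
            recv_req *r = *pp;
            *pp = r->next;  /* unlink the matched request */
            return r;
        }
    }
    return NULL;            /* unmatched: would be queued as unexpected */
}

int main(void)
{
    posted = calloc(8, sizeof *posted); /* 8 peers, arbitrary */
    post_recv(3, 42);
    printf("match from rank 3, tag 42: %s\n",
           match_arrival(3, 42) ? "found" : "not found");
    return 0;
}
```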

5.3 Optimized Protocols

MPI/Pro has been optimized to use the most efficient mechanism for each message transfer, reflecting the overhead trade-offs, message sizes, and specific device characteristics involved. For both the VI and SMP devices, MPI/Pro has been designed to achieve low latency for short messages and maximum throughput for long messages. Depending on the particular communication device, the optimal protocol for short messages may involve an additional intermediate copy. The decision is based on the relative cost of setting up a zero-copy transfer versus the overhead of an intermediate memory copy. Here, we review the short and long protocols of the VI device of MPI/Pro. As discussed earlier, VI RDMA transfers achieve the highest bandwidth. However, setting up a zero-copy transfer with RDMA requires an initialization phase for acquiring the address and memory handle of the remote process's buffer. This phase involves two more messages in addition to the actual RDMA data transfer. Up to a certain message size, these extra messages introduce a delay that is higher than the overhead of one memory copy. Consequently, to achieve optimal latency, MPI/Pro's short protocol uses one extra copy at the receive side. In contrast, MPI/Pro uses a rendezvous protocol for long messages. This protocol first exchanges control information and then performs the actual RDMA data transfers. RDMA operations significantly reduce the CPU time spent on communication activities: the CPU utilization of an MPI process transferring long messages over the VI device is in the range of only 3-4%. Thus, MPI/Pro offers MPI applications the opportunity to truly hide communication overhead by overlapping it with useful computation. This is clearly one of the most important opportunities that MPI/Pro provides to user applications for improving their overall performance.
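The eager-versus-rendezvous decision reduces to comparing the fixed cost of the rendezvous handshake against the per-byte cost of one extra copy. The toy C function below captures that trade-off; the cost constants are made-up placeholders, not measured MPI/Pro parameters.

```c
#include <stdio.h>

/* Made-up illustrative costs, in microseconds; not MPI/Pro's values. */
#define HANDSHAKE_COST_US   30.0  /* two extra control messages   */
#define COPY_COST_US_PER_KB 0.25  /* one intermediate memory copy */

/* Choose the short (eager, one extra copy) protocol while the copy is
   cheaper than the rendezvous handshake; switch to zero-copy RDMA
   rendezvous beyond the crossover size. */
static const char *choose_protocol(size_t bytes)
{
    double copy_cost = (bytes / 1024.0) * COPY_COST_US_PER_KB;
    return (copy_cost < HANDSHAKE_COST_US) ? "eager" : "rendezvous";
}

int main(void)
{
    for (size_t n = 1024; n <= (1u << 24); n *= 8)
        printf("%8zu bytes -> %s\n", n, choose_protocol(n));
    return 0;
}
```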

5.4 Optimized Derived Data Types

As emphasized earlier, MPI/Pro aims to optimize derived data types in order to allow MPI programmers to take advantage of the high abstraction power of this MPI mechanism without sacrificing performance. This is a rather challenging task, inasmuch as there have been various projects that have attempted to improve the performance of MPI derived data types [7]. The MPI standard describes a data type as a sequence of (type, displacement) pairs, called a type map. The first element of a pair in the type map refers to an already defined data type; the second element represents the displacement of that type with respect to the beginning of the derived data type. Many MPI implementations have chosen to use this type-map representation, as specified in the MPI standard, as their internal representation of derived data types. However, this representation leads to significant overhead during MPI calls that use derived data types, because it implies a recursive algorithm for determining the memory locations of data in user buffers. In addition to the extra memory copy necessary for constructing a contiguous message suitable for transmission over a network, the overhead of this recursive algorithm significantly reduces the effective throughput of messages that use derived data types. MPI/Pro employs a different approach to the internal representation of data types. Instead of using a type map, MPI/Pro constructs a table of (displacement, length) pairs for each type, where each pair represents a contiguous segment of memory. In effect, MPI/Pro "flattens" derived data types at the time they are constructed. Subsequently, when a user makes an MPI call with a derived data type, MPI/Pro simply "walks" through the user buffer and immediately accesses the appropriate memory locations. There is almost no overhead associated with this processing; the only remaining factor that reduces the effective throughput is the memory copy necessary for constructing the contiguous block to be transmitted over the network.
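A minimal sketch of the flattening idea in C follows: a type map given as element displacements is converted once, at construction time, into a table of (displacement, length) pairs with adjacent segments merged, so a later pack is a straight walk over contiguous runs. The names and structure are illustrative assumptions, not MPI/Pro's internal representation.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative flattened entry: one contiguous memory segment. */
typedef struct { long disp; long len; } segment;

/* Flatten a type map given as element indices (each element elem_len
   bytes), merging adjacent elements into longer segments. Done once
   when the datatype is constructed. */
static int flatten(const long *disps, int n, long elem_len, segment *out)
{
    int nseg = 0;
    for (int i = 0; i < n; i++) {
        if (nseg > 0 && out[nseg - 1].disp + out[nseg - 1].len
                            == disps[i] * elem_len) {
            out[nseg - 1].len += elem_len;   /* extend previous segment */
        } else {
            out[nseg].disp = disps[i] * elem_len;
            out[nseg].len = elem_len;
            nseg++;
        }
    }
    return nseg;
}

/* Packing is then a simple walk over the segments: one memcpy each,
   with no recursive type-map interpretation at call time. */
static void pack(const char *base, const segment *segs, int nseg, char *dst)
{
    for (int i = 0; i < nseg; i++) {
        memcpy(dst, base + segs[i].disp, (size_t)segs[i].len);
        dst += segs[i].len;
    }
}

int main(void)
{
    long disps[] = { 0, 1, 2, 5, 6, 9 };    /* element indices */
    segment segs[6];
    int nseg = flatten(disps, 6, sizeof(double), segs);
    printf("%d segments\n", nseg);          /* prints 3 */

    double src[10] = {0,1,2,3,4,5,6,7,8,9}, dst[6];
    pack((const char *)src, segs, nseg, (char *)dst);
    return 0;
}
```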

6 Performance

6.1 Experimental Environment

The computer systems used in the experiments are shown in Table 1. All systems were running Windows NT 4.0 and were equipped with 33 MHz, 32-bit PCI buses. The cluster interconnect was GigaNet cLAN with one eight-port GNX5000 switch. Test applications were compiled with Microsoft Visual C++ 5.0 and Digital Visual Fortran 5.0. The SMP graphs in the figures represent a mode of operation with simultaneous use of the VI and SMP devices.

Table 1: Experimental configurations

           CPU Type @ MHz         RAM [MB]
  Config1  1 x Pentium II @ 400   128
  Config2  2 x Xeon @ 400         512
  Config3  2 x Xeon @ 450         512

6.2 Point-to-Point Performance

The graphs in Figure 1 and Figure 2 present one-way latency and bandwidth measurements from point-to-point experiments with MPI/Pro using both the VI and SMP devices. The message sizes are user-level sizes and do not include the constant-size 32-byte MPI/Pro header. The one-way latency is computed as the ratio T/(N+1), where T is the time for transmitting N consecutive messages in one direction and one message in the opposite direction. The bandwidth numbers are based on the round-trip time. More performance data on various point-to-point tests and hardware configurations can be found in [8].
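For reference, a minimal MPI kernel matching the stated methodology might look as follows: it times N messages one way plus one reply and reports T/(N+1). The message count and size are arbitrary placeholder choices.

```c
#include <mpi.h>
#include <stdio.h>

#define N    1000  /* arbitrary number of one-way messages */
#define SIZE 64    /* arbitrary message size in bytes */

int main(int argc, char **argv)
{
    int rank;
    char buf[SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank == 0) {
        /* N consecutive messages in one direction ... */
        for (int i = 0; i < N; i++)
            MPI_Send(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        /* ... and one message in the opposite direction. */
        MPI_Recv(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        double t = MPI_Wtime() - t0;
        printf("one-way latency: %.2f us\n", t / (N + 1) * 1e6);
    } else if (rank == 1) {
        for (int i = 0; i < N; i++)
            MPI_Recv(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Send(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```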

[Figure 1: One-way latency. Curves: Config1-VI, Config2-VI, Config1-SMP, Config2-SMP; x-axis: message size [bytes], 4 to 4096; y-axis: latency [microsec], roughly 15 to 70.]

[Figure 2: Bandwidth. Curves: Config1-VI, Config2-VI, Config1-SMP, Config2-SMP; x-axis: message size [Kbytes], 1 to 1024; y-axis: bandwidth [MB/sec], 0 to 200.]

6.3 NAS Parallel Benchmarks

Here, we present experimental data from one of the eight NAS Parallel Benchmarks [9], the LU benchmark (Figure 3). The NAS experiments were carried out on Config3 as specified in Table 1. The cluster consisted of 8 dual-CPU nodes, so the maximum number of processors we could use was 8 for the VI device alone and 16 for the VI and SMP devices combined. The results from the other NAS benchmarks, with different problem sizes, are summarized in [8]. Performance increases nearly linearly with the number of processors, which clearly demonstrates the communication performance and scalability of a parallel system built with GigaNet's VI Architecture interconnect and enabled by MPI/Pro.

[Figure 3: LU performance, Class W. Curves: Config3-VI, Config3-SMP; x-axis: number of processors, 0 to 16; y-axis: performance [Mflops], 0 to 900.]

6.4 Derived Data Types

Here, we present an experiment that shows the performance of the MPI/Pro derived data types subsystem. For this purpose, we constructed a test application that uses a square matrix of one million complex float elements. The test transfers the upper-right triangular part of the matrix between two processes. We evaluated three cases. In the first case, we constructed a derived data type that described the triangular matrix subject to exchange and used this type directly in the MPI calls. In the second case, we used the data type from the first case to pack the data into a contiguous buffer that was then used in the transfer. In the third case, we did not use the derived data type and packed the data "manually" into a contiguous buffer. The effective bandwidth of the three cases is presented in Table 2. The difference in performance between Case 1 and Case 3 is about 5%, and between Case 2 and Case 3 it is less than 1%.

Table 2: Derived data types performance

  Case 1       Case 2       Case 3
  32.41 MB/s   34.33 MB/s   34.45 MB/s
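The paper does not give the construction of the triangular datatype; a plausible reconstruction in MPI C uses MPI_Type_indexed with one block per matrix row, as sketched below. The matrix dimension (1000 x 1000, giving one million elements) and the complex-element representation (two MPI_FLOATs) are assumptions. Sending the triangle is then a single MPI_Send with count 1; the small gap between Case 1 and Case 3 in Table 2 indicates that interpreting such a type costs little beyond the unavoidable packing copy.

```c
#include <mpi.h>

#define DIM 1000  /* assumed: 1000 x 1000 = one million elements */

/* Build a datatype covering the upper-right triangle (row i keeps the
   DIM - i elements starting at column i) of a row-major DIM x DIM
   matrix of complex floats, represented here as two MPI_FLOATs. */
static MPI_Datatype make_upper_triangle(void)
{
    MPI_Datatype cplx, tri;
    int lengths[DIM], disps[DIM];

    MPI_Type_contiguous(2, MPI_FLOAT, &cplx); /* one complex element */
    MPI_Type_commit(&cplx);

    for (int i = 0; i < DIM; i++) {
        lengths[i] = DIM - i;   /* elements kept in row i            */
        disps[i] = i * DIM + i; /* offset of element (i, i), in elements */
    }
    MPI_Type_indexed(DIM, lengths, disps, cplx, &tri);
    MPI_Type_commit(&tri);
    MPI_Type_free(&cplx);
    return tri;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Datatype tri = make_upper_triangle();
    /* usage: MPI_Send(matrix, 1, tri, dest, tag, comm); */
    MPI_Type_free(&tri);
    MPI_Finalize();
    return 0;
}
```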

7 Future Work

Improving and refining the functionality, architecture, and performance characteristics of MPI/Pro are ongoing efforts. Immediate plans for MPI/Pro updates are as follows:




- Provide better specialization and optimization of communication protocols in order to reflect the specific differences among the various VI Architecture implementations;
- Develop optimal mechanisms to reflect the varying overheads of memory registration and memory copying;
- Optimize collective operations by providing separate communication channels for different communicators; and
- Use the optimal synchronization methods on different platforms.

MPI Software Technology, Inc. is also working on some longer-term goals. Some of the features that will be included in the next versions of MPI/Pro are as follows:

- Flexible multi-device architecture;
- Cross-platform interoperability;
- Object-oriented design;
- C++ and Fortran 90 bindings; and
- Support for the MPI-2 one-sided communication operations.

8 Summary and Conclusions

In this paper, we presented MPI/Pro, an efficient multi-device MPI implementation that targets VI Architecture networks and SMP platforms. MPI/Pro incorporates a number of architectural solutions for delivering maximum performance to user processes, achieving optimal resource utilization, and facilitating a high degree of computation and communication overlapping. MPI/Pro complies with the MPI progress rule and provides user-level thread safety, which allows user applications to exploit local parallelism through threads and to benefit from SMP platforms. MPI/Pro uses efficient data transfer protocols, optimizes the demultiplexing of incoming messages, and emphasizes the persistent mode of MPI point-to-point operations for increased message-passing performance. There are already multiple installations of MPI/Pro for VI Architecture networks. The largest is at Sandia National Laboratories, where a Windows NT cluster of 72 dual-CPU nodes interconnected with a ServerNet I VI network was built. MPI/Pro has demonstrated scalability and robustness in this production environment. Our conclusion, based on the effort thus far, is that new implementations of MPI, such as MPI/Pro, are needed in order to exploit emerging architectural features and performance-oriented networks, typified here by the VI Architecture.

References

[1] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994. http://www.mpi-forum.org/docs.
[2] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, 1997. http://www.mpi-forum.org/docs/mpi-20.ps.
[3] W. Gropp et al. A High-Performance, Portable Implementation of the Message Passing Interface. Parallel Computing, 22:789-828, 1996.
[4] R. Dimitrov and A. Skjellum. An Efficient MPI Implementation for Virtual Interface (VI) Architecture-Enabled Cluster Computing. In Proceedings of the Third MPI Developer's Conference, March 1999.
[5] Intel, Compaq, and Microsoft. Virtual Interface Architecture Specification, Version 1.0, 1997. http://www.viarch.org.
[6] L. S. Hebert et al. MPI for NT: Two Generations of Implementations and Experience with the Message Passing Interface for Clusters and SMP Environments. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, July 1998.
[7] W. Gropp, E. Lusk, and D. Swider. Improving the Performance of MPI Derived Datatypes. In Proceedings of the Third MPI Developer's Conference, March 1999.
[8] MPI Software Technology, Inc. MPI/Pro Performance, 1999. http://www.mpisofttech.com/performance/mpi-via.html.
[9] D. Bailey et al. The NAS Parallel Benchmarks. International Journal of Supercomputer Applications, 5:63-73, 1991.
