Minimizing Synchronization Overhead in the Implementation of MPI One-Sided Communication

Rajeev Thakur, William Gropp, and Brian Toonen
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL 60439, USA

Abstract. The one-sided communication operations in MPI are intended to provide the convenience of directly accessing remote memory and the potential for higher performance than regular point-to-point communication. Our performance measurements with three MPI implementations (IBM MPI, Sun MPI, and LAM) indicate, however, that one-sided communication can perform much worse than point-to-point communication if the associated synchronization calls are not implemented efficiently. In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH-2. We describe our optimizations for all three synchronization mechanisms defined in MPI: fence, post-start-complete-wait, and lock-unlock. Our performance results demonstrate that, for short messages, MPICH-2 performs six times faster than LAM for fence synchronization and 50% faster for post-start-complete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.
1 Introduction
MPI defines one-sided communication operations that allow users to directly access the memory of a remote process [9]. One-sided communication is convenient to use and has the potential to deliver higher performance than regular point-to-point (two-sided) communication, particularly on networks that support one-sided communication natively, such as InfiniBand and Myrinet. On networks that support only two-sided communication, such as TCP, it is harder for one-sided communication to do better than point-to-point communication. Nonetheless, a good implementation should strive to deliver performance as close as possible to that of point-to-point communication.

One-sided communication in MPI requires the use of one of three synchronization mechanisms: fence, post-start-complete-wait, or lock-unlock. The synchronization mechanism defines the time at which the user can initiate one-sided communication and the time when the operations are guaranteed to be completed. The true cost of one-sided communication, therefore, must include the time taken for synchronization. An unoptimized implementation of the synchronization functions may perform more communication and synchronization than necessary (such as a barrier), which can adversely affect performance, particularly for short and medium-sized messages.
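To make the role of synchronization concrete, the following sketch illustrates one of the three mechanisms, lock-unlock (passive target). The window, buffer, and count arguments are illustrative assumptions, not code from this paper: the put is initiated inside the access epoch and is guaranteed to be complete only when MPI_Win_unlock returns.

/* Hedged sketch: passive-target (lock-unlock) one-sided communication.
 * The window "win" is assumed to expose a buffer on every process;
 * all names are illustrative. */
#include <mpi.h>

void put_to_target(double *local_buf, int count, int target_rank,
                   MPI_Aint target_disp, MPI_Win win)
{
    /* Begin an access epoch on the target's window. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);

    /* Initiate the one-sided transfer; it need not complete here. */
    MPI_Put(local_buf, count, MPI_DOUBLE,
            target_rank, target_disp, count, MPI_DOUBLE, win);

    /* End the epoch; only now is the put guaranteed complete. */
    MPI_Win_unlock(target_rank, win);
}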
We measured the performance of three MPI implementations, IBM MPI, Sun MPI, and LAM [8], for a test program that performs nearest-neighbor ghost-area exchange, a communication pattern common in many scientific applications such as PDE simulations. We wrote four versions of this program: using point-to-point communication (isend/irecv) and using one-sided communication with fence, post-start-complete-wait, and lock-unlock synchronization. We measured the time taken for a single communication step (each process exchanges data with its four neighbors) by performing the step a number of times and computing the average. Figure 1 shows a snippet of the fence version of the program, and Figure 2 shows the performance results.
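The Figure 1 snippet is not reproduced here; the following is a minimal sketch of what a fence-synchronized ghost-area exchange might look like, assuming a window win exposing each process's grid and hypothetical neighbor rank and displacement arrays. The communication step described above corresponds to one pair of fences enclosing the four puts.

/* Hedged sketch of a fence-synchronized ghost-area exchange, in the
 * spirit of the snippet described as Figure 1; the buffer layout,
 * neighbor arrays, and displacements are illustrative assumptions. */
#include <mpi.h>

#define NBRS 4   /* four nearest neighbors */

void exchange_ghost_areas(double *local_grid, MPI_Win win,
                          int nbr_rank[NBRS], MPI_Aint nbr_disp[NBRS],
                          int ghost_len)
{
    int i;

    /* Open the access/exposure epoch on all processes in the window. */
    MPI_Win_fence(0, win);

    /* Put this process's boundary data into each neighbor's ghost area;
     * boundary data is assumed to be stored contiguously per neighbor. */
    for (i = 0; i < NBRS; i++) {
        MPI_Put(&local_grid[i * ghost_len], ghost_len, MPI_DOUBLE,
                nbr_rank[i], nbr_disp[i], ghost_len, MPI_DOUBLE, win);
    }

    /* Close the epoch; all puts are complete when this fence returns. */
    MPI_Win_fence(0, win);
}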