Optimizing the Synchronization Operations in MPI One-Sided ...

Recommend Documents

Optimizing the Synchronization Operations in MPI One-Sided

lock-unlock. Our performance results demonstrate that, for short messages, MPICH2 performs six times faster than LAM for fence synchronization and 50% ...

Optimizing Noncontiguous Accesses in MPI-IO

a single I/O function call, unlike in Unix I/O. In this paper, we explain how critical .... a well-known MPI-IO benchmark developed at NASA Ames Research Center. .... DEC, and IBM; and networks of workstations (Sun, SGI, HP, IBM, DEC, Linux,.

Minimizing Synchronization Overhead in the Implementation of MPI ...

to provide the convenience of directly accessing remote memory and the ... networks that support only two-sided communication, such as TCP, it is harder.

Optimizing Synchronization, Flow, and Robustness in Weighted ...

Department of Physics and Social and Cognitive Networks Academic Research ... The ultimate challenge in network optimization (of synchronization and flow).

optimizing synchronization in multiprocessor dsp ... - Semantic Scholar

access shared memory for the purpose of synchronization in embedded, ..... [3] The number of tokens in any cycle of the IPC graph is always conserved over all ...... fb. G s. 4n ff. 2n fb. +. (. ) G. G e. G p e. ( )â . G e( ) src e( ) snk p( ). Dela

Finding synchronization-free slices of operations in

Abstract: This paper presents a new approach for extracting synchronization- free parallelism ... resolved in order to utilize the entire power of this approach. ...... parallel machine SGI Altix 3700, 128 x Itanium2 1.5GHz (NUMA), 256 GB. RAM ... qu

OneSided Fixed Paper - CiteSeerX

He showed that, if the gauge limits are properly chosen, grouped data are an excellent alternative to exact measurement since the small loss in statistical ...

OPTIMIZING MARITIME CONTAINER TERMINAL OPERATIONS

Figure 19 Sample of EdiRite presentation of a BAPLIE file .......................................... ...................... 51. Figure 20 Output of the current way of working heuristic .

Sparse Collective Operations for MPI - CiteSeerX

Torsten Hoefler. Open Systems ...... Publishing Co., Inc., 1995. [7] F. Gygi ... [16] T. Hoefler, F. Lorenzen, and A. Lumsdaine, âSparse Non-Blocking. Collectives in ...

Sparse Collective Operations for MPI - CiteSeerX

Email: [email protected]. Jesper Larsson ..... typically have different layout in memory and thus different MPI datatypes. ..... Newsletter, vol. 73, pp. 145â173 ...

Power Redistribution for Optimizing Performance in MPI Clusters - arXiv

Oct 24, 2014 - Email: [email protected]. AbstractâPower .... program running using equal-share, ILP based, and heuristic ..... A broadcast operation is ...

Optimizing Timetable Synchronization for Rail Mass Transit

In most urban public transit rail systems, passengers may need to make several ... our algorithm for the Mass Transit Railway (MTR) system in Hong Kong, which ...

Efficient MPI Collective Operations for Clusters in Long ... - GitHub Pages

where a message is split and distributed to the nodes, and then gathering the split .... The counter copy function can overflow the bottle- neck link when all the ...

Collective Operations in an Application-level Fault Tolerant MPI ... - ISS

are now significantly greater than the mean-time-between- failures (MTBF) of the ... This has too much overhead to be practical on high- performance .... to save late messages after taking its local checkpoint; once logging is complete, the .... a pr

Topology-Aware Strategy for MPI-IO Operations in Clusters

Oct 16, 2018 - assignment algorithm to minimize communication between computing ..... complexity. (ii) Jonker and Volgenant algorithm [23]: A shortest aug-.

An Implementation of the MPI-2 Extended Collective Operations

There have been many research projects on MPI collective operations. Most of ... 1 by the processes ranked 0 and 2 in the root group), call no message passing ...

Eliminating Synchronization-Related Atomic Operations ... - CiteSeerX

Oct 26, 2006 - Categories and Subject Descriptors D.3.4 [Programming Lan- guages]: ... Keywords Java, synchronization, monitor, lock, atomic, optimization ...

Eliminating Synchronization-Related Atomic Operations ... - CiteSeerX

Oct 26, 2006 - The technique is applicable to any virtual machine-based programming ... the lightweight locking technique in the Java HotSpot VM and its.

Optimizing Hedge Fund Business Operations - FRA, LLC

Jan 16, 2013 - business operations. Effective tax plan. Enterprise risk management ..... regulatory reporting and risk m

Optimizing Hedge Fund Business Operations - FRA, LLC

Jan 10, 2012 - complex operational challenges, then the Hedge Fund Business Operations. Association's (HFBOA) ... How go

Modeling and Optimizing Military Air Operations

military operation that involves two opposing forces engaged in a battle. ... a SEAD mission, in which the Blue force and the Red force engage in multiple ...... 205â222, 2000. [7] F. A. Lanchester, Aircraft in warfare: The dawn of the fourth arm.

Optimizing Hedge Fund Business Operations - FRA, LLC

Jan 10, 2012 - The Conference Organizer ... hand-picked by the brightest minds in the business. ... Call 800-280-8440 or

Optimizing Aircraft and Operations for Minimum Noise

aircraft design and mission for minimum cost under ... ley's Aircraft Noise Prediction Program (ANOPP) .... bulence interference is neglected, leaving jet mixing.

Optimizing Aircraft and Operations for Minimum Noise

By optimizing the aircraft design and mission for minimum cost under ... operating cost and certification noise; this will allow a range of ..... App. Angle (deg.).

Optimizing the Synchronization Operations in MPI One-Sided ...

Download PDF

13 downloads 12941 Views 67KB Size Report

Comment

sided operations after the first call to MPI Win fence returns. ..... Don Frederick for giving us access to the IBM SP at the San Diego Supercomputer Center.

Optimizing the Synchronization Operations in MPI One-Sided Communication∗ Rajeev Thakur William Gropp Brian Toonen Mathematics and Computer Science Division Argonne National Laboratory 9700 S. Cass Avenue Argonne, IL 60439, USA {thakur, gropp, toonen}@mcs.anl.gov

Abstract One-sided communication in MPI requires the use of one of three different synchronization mechanisms, which indicate when the one-sided operation can be started and when the operation is completed. Efficient implementation of the synchronization mechanisms is critical to achieving good performance with one-sided communication. Our performance measurements, however, indicate that in many MPI implementations, the synchronization functions add significant overhead, resulting in one-sided communication performing much worse than point-to-point communication for short- and medium-sized messages. In this paper, we describe our efforts to minimize the overhead of synchronization in our implementation of one-sided communication in MPICH2. We describe our optimizations for all three synchronization mechanisms defined in MPI: fence, post-start-complete-wait, and lock-unlock. Our performance results demonstrate that, for short messages, MPICH2 performs six times faster than LAM for fence synchronization and 50% faster for post-startcomplete-wait synchronization, and it performs more than twice as fast as Sun MPI for all three synchronization methods.

∗

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram

of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38.

1 Introduction MPI defines one-sided communication operations that allow users to directly read from or write to the memory of a remote process [12]. One-sided communication both is convenient to use and has the potential to deliver higher performance than regular point-to-point (two-sided) communication, particularly on networks that support one-sided communication natively, such as InfiniBand and Myrinet. One-sided communication in MPI requires the use of one of three synchronization mechanisms: fence, post-start-complete-wait, or lock-unlock. The synchronization mechanism defines the time at which the user can initiate one-sided communication and the time when the operations are guaranteed to be completed. The true cost of one-sided communication, therefore, must include the time taken for synchronization. An unoptimized implementation of the synchronization functions may perform more communication and synchronization than necessary (such as a barrier), which can adversely affect performance, particularly for short and medium-sized messages. To determine how the different synchronization mechanisms perform in existing MPI implementations, we used a test program that performs nearest-neighbor ghost-area exchange, a communication pattern common in many scientific applications such as PDE simulations. We wrote four versions of this program—using point-to-point communication (isend/irecv) and using onesided communication with fence, post-start-complete-wait, and lock-unlock synchronization— and ran them on an IBM SP with IBM MPI, on a Sun SMP with Sun MPI, and on a fast-ethernetconnected Linux cluster with LAM [11]. We measured the time taken for a single communication step (each process exchanges data with its four neighbors) by doing the step a number of times and calculating the average. Figure 1 shows a template of the fence version of the test; the other versions are similar. for (i=0; i