MPI SH: a Distributed Shared Memory Implementation of MPI

Davide Guerri
Synapsis s.r.l., Piazza Dante 19, 57121 LIVORNO
[email protected]

Paolo Mori
Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 PISA
[email protected]

Laura Ricci
Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 PISA
[email protected]

ABSTRACT
Most current MPI implementations are developed on top of message passing libraries. This paper considers a rather different approach, where MPI primitives are supported through a distributed shared memory. To evaluate this approach, we describe MPI SH, an implementation of MPI on top of DVSA, a package that emulates a shared memory on top of a distributed memory architecture. DVSA structures the shared memory as a set of variable size areas and defines a set of operations, each involving a whole area. The various kinds of data required to implement MPI, i.e. to handle a communicator, a point to point or a collective communication, can be mapped onto these areas so that a large degree of concurrency, and hence good performance, can be achieved. Performance figures of MPI SH are presented. According to these figures, the proposed approach may achieve better performance than more traditional implementations for collective communications.
General Terms: Distributed shared memory, Collective communications
1. INTRODUCTION
DVSA, Distributed Shared Areas [7], is a distributed shared memory abstract machine developed within the PQE2000 project [11]. Since the goal of the project is a general purpose parallel architecture, DVSA has not been designed to let an application directly share information; rather, it is a common layer for the definition of higher level virtual machines, each oriented toward a given class of applications. This implies that the DVSA user is the designer of a programming environment to be supported by our architecture [2]. To evaluate the feasibility and the effectiveness of adopting DVSA to support both message passing and shared memory environments, we have developed MPI SH, an implementation of MPI on top of DVSA. The implementation exploits DVSA constructs only. As an example, MPI non blocking communications are defined through DVSA non blocking primitives, even if a thread mechanism supported by the architecture might be more efficient. This choice favors the portability of the MPI support across the distinct physical architectures to be supported by the PQE2000 project. Several other implementations of MPI [3, 4, 5, 9, 10] have been developed on top of general purpose or special purpose message passing libraries. Currently, DVSA has been implemented on a Meiko CS-2 and on a cluster of workstations running Linux.

The performance figures of our implementation show that MPI collective communications can benefit from an implementation on top of a shared memory abstract machine. From another point of view, these figures confirm that one of the major problems of current implementations of MPI on top of general purpose or special purpose message passing libraries is to define a strategy that efficiently supports both MPI point to point and collective communications. [8], for instance, shows that the efficiency of current implementations of MPI collective communications on the CS-2 is not satisfactory. It is worth stressing that, while data parallel algorithms with static data allocation can be easily implemented through point to point communications only, more complex algorithms require some form of data re-mapping that heavily exploits collective communications. Hence, the efficiency of these constructs cannot be neglected in several applications [1]. Our results suggest the adoption of a hybrid approach, where MPI point to point communications are directly implemented on top of the communication library of the considered architecture, while MPI collective communications are implemented on top of a distributed shared memory. The additional overhead due to this layer, which could be implemented on the same communication library as MPI, may be recovered if each MPI primitive is implemented through a few invocations of operations on the distributed shared memory.

Section 2 introduces the main characteristics of DVSA. The overall structure of the MPI implementation is presented in Section 3. Section 4 shows the strategy to support MPI communicators. The implementations of point to point and of collective communications are shown, respectively, in Section 5 and in Section 6. Section 7 shows some experimental results and draws the conclusions.
2. DVSA
The DVSA library supports the definition of a shared memory space on a distributed memory system. The space is structured as a set of areas. The size of each area can be freely chosen, within an architecture dependent range, when the area is declared. Hence, any application can freely choose the size of the areas its processes are going to share. Any DVSA operation, i.e. a read, an update or a synchronization, involves a whole area, i.e. all the values in the area are either returned to the invoker or updated through the values supplied by the invoker. The DVSA operations can be partitioned into four classes:
• Notification
• Access
• Synchronization
• Utility
Operations in the Notification class allow a process to declare all and only the areas it is going to share and to notify the termination of its operations on these areas. The Synchronization class includes operations to acquire exclusive access, i.e. a lock, to an area; distinct operations are defined to acquire access to read or to update an area. The Access class includes operations to read and/or update an area. Operations in this class may be synchronous or asynchronous, and the access to the area may or may not be synchronized. A synchronous operation starts a read/write operation and waits till the operation is completed. An asynchronous operation terminates immediately and returns a handle, i.e. an identifier, that may be used to test for its completion. For instance, a process can start an asynchronous read operation identified by a handle H, continue the execution of its program, and wait for the completion of the operation identified by H only when the read result is required to continue the execution. Each access to the shared memory may be synchronized or not synchronized: a synchronized access operation preserves the consistency of the shared information by properly synchronizing the processes accessing the area. A not synchronized access operation, instead, does not guarantee the serialization of the accesses to the area; in this case, the consistency of the information in the area is preserved either by the semantics of the computation or by enclosing these operations between proper synchronization operations. Utility operations return general information about the shared memory, such as the list of all the areas shared by a process. The whole set of operations is defined in [7].
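To make the classification above concrete, the fragment below sketches how a DVSA client might combine the four classes of operations. It is only an illustrative sketch: every identifier (dvsa_attach, dvsa_read_async, dvsa_wait, dvsa_lock_write, dvsa_write_sync, dvsa_unlock, dvsa_detach) is a hypothetical name introduced here, not the actual DVSA interface, which is defined in [7].

/* Hypothetical sketch of a DVSA client; the real interface is defined in [7].
 * All function and type names below are invented for illustration only.     */
#include <stddef.h>

typedef int dvsa_area_t;    /* identifier of a shared area      */
typedef int dvsa_handle_t;  /* handle of an asynchronous access */

/* Notification: declare the areas this process is going to share. */
extern dvsa_area_t dvsa_attach(size_t size);
extern void        dvsa_detach(dvsa_area_t a);

/* Access: synchronous / asynchronous, synchronized / not synchronized. */
extern void          dvsa_read_sync(dvsa_area_t a, void *buf);
extern dvsa_handle_t dvsa_read_async(dvsa_area_t a, void *buf);
extern void          dvsa_wait(dvsa_handle_t h);      /* wait for completion */
extern void          dvsa_write_sync(dvsa_area_t a, const void *buf);

/* Synchronization: exclusive access to a whole area. */
extern void dvsa_lock_write(dvsa_area_t a);
extern void dvsa_unlock(dvsa_area_t a);

void example(void)
{
    char local[256];
    dvsa_area_t a = dvsa_attach(sizeof local);

    /* Start an asynchronous read, overlap it with other work,
       and wait only when the values are actually needed.       */
    dvsa_handle_t h = dvsa_read_async(a, local);
    /* ... useful computation not involving 'local' ... */
    dvsa_wait(h);

    /* Synchronized update of the whole area. */
    dvsa_lock_write(a);
    dvsa_write_sync(a, local);
    dvsa_unlock(a);

    dvsa_detach(a);
}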
3. OVERALL STRUCTURE OF THE IMPLEMENTATION
An important assumption that has driven our work is that an effective implementation may be designed by properly minimizing the contention on an area due to synchronized operations issued by distinct processes, as well as by properly allocating the areas in the local memories of the processing nodes. Hence, the first step in the implementation of MPI SH has been the definition of the overall structure of the areas used to implement message exchange and process synchronization. These areas record both the data exchanged among processes and the state of the processes involved in ongoing communications. According to our initial assumption, the data structures required to implement MPI communications are mapped onto the areas so that:

• synchronized operations on an area are used only if this is the only way to preserve the MPI semantics. As an example, this implies that two data structures that are accessed through, respectively, synchronized and not synchronized access operations should be mapped onto distinct areas;

• the access to each area A should be restricted to the processes exploiting the data stored in A, i.e. the address of A should be known only to them. This implies that an area including the definition of an MPI communicator C should be shared among the processes belonging to C only. In the same way, an area storing a point to point message should be accessed only by the communication partners.

These principles can be satisfied by a dynamic assignment of the areas to the processes. A static approach is not always feasible, because the MPI semantics does not support the definition of a static analysis that detects both the communications involving a process P and the communicators it belongs to. On the other hand, DVSA, for efficiency reasons, does not support a dynamic management of the shared areas, and each process has to declare the areas it is going to share at the beginning of the execution of its program. For this reason, the dynamic management of the areas is explicitly implemented by MPI SH. Before starting the execution of the processes, MPI SH defines a pool of areas shared by all the processes. The size of this pool depends upon the number of processes of the application and the maximum number of communicators to be supported. The MPI SH primitives dynamically fetch areas from this pool and assign them to the requesting processes. In this way, an area A is allocated when a process starts a point to point communication, and the address of A is notified to the partner of the communication when it executes the corresponding primitive. The addresses of the areas shared by a process P are dynamically stored in a table local to P. MPI SH guarantees that each process accesses only the areas whose addresses are stored in its local tables.

The areas used by MPI SH are structured into a hierarchy, where each level is characterized by both the number of processes sharing an area A of that level and the interval of time within which A is allocated:

• Level1: these areas are always shared among all the processes of the application. They include global information to create new communicators, such as a global counter to assign unique contexts to them. They also store a set of pointers to a pool of free areas to be allocated for communicator descriptors.
• Level2: these areas are allocated as soon as a new communicator is created and are shared by all the processes of the communicator. This set includes the areas recording the information on the communicator and the areas used to implement the collective communications occurring within the communicator.

• Level3: these areas are allocated when the first communication between two processes occurs within a communicator C, and record the state of all the pending communications between the two processes. Each area is shared by the partners of the communication as long as C is active.

• Level4: this level includes areas dynamically allocated to support the transmission of point to point messages. Each area is allocated by the process initiating the communication and is shared between the communication partners. An area is freed as soon as the transmission of the message is completed.

By structuring the areas into a hierarchy of levels, we achieve a better memory utilization, since the size of an area can be chosen according to the level it belongs to. Furthermore, the number of processes sharing an area decreases as the level of the area increases, and better allocation strategies can be adopted for the areas of the higher levels; as an example, as detailed in the following, each Level4 area is always allocated in the local memory of one partner of the communication. The hierarchy is summarized by the sketch below.
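The following declarations give a minimal, purely illustrative picture of the hierarchy and of the per-process table of shared areas; the type and field names are ours and do not reproduce the MPI SH sources.

/* Illustrative sketch of the area hierarchy described above.
 * Names and fields are invented; they do not reproduce the MPI SH sources. */
#include <stddef.h>

typedef int dvsa_area_t;                 /* DVSA area identifier (hypothetical) */

typedef enum {
    LEVEL1,   /* shared by all processes for the whole execution         */
    LEVEL2,   /* shared by the processes of one communicator             */
    LEVEL3,   /* shared by one pair of processes of a communicator       */
    LEVEL4    /* shared by one pair for the lifetime of a single message */
} area_level_t;

typedef struct {
    dvsa_area_t  area;      /* address of the shared area     */
    area_level_t level;     /* level in the hierarchy         */
    int          in_use;    /* 0: still in the free pool      */
} area_slot_t;

/* Table local to each process: the addresses of the areas it shares.
 * MPI SH guarantees that a process accesses only areas listed here.   */
typedef struct {
    area_slot_t *slots;
    size_t       n_slots;
} local_area_table_t;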
4. IMPLEMENTATION OF MPI COMMUNICATORS
The execution environment of MPI SH is set up by the function MPI_Init_SH. This function allocates a pool of free areas, initially shared by all the processes and exploited to support the creation of communicators. An efficient allocation of the free areas is a fundamental issue in the definition of MPI SH, and the allocations of areas for distinct communicators should be executed concurrently. For this reason, MPI_Init_SH partitions the pool among the processes of the application, to avoid the bottleneck arising when a single pool is adopted; as a matter of fact, a single pool serializes, at some level, the accesses to the pool. Each process stores in a local table the addresses of the free areas it has been assigned. Furthermore, the areas are partitioned according to both the communicator they are associated with and the semantics of the data they record. Each process taking part in the creation of a communicator assigns some of its areas to the new communicator. MPI_Init_SH initializes the Level1 areas and creates the MPI_COMM_WORLD communicator. Further communicators can be explicitly created through MPI_Comm_Create_SH. According to the MPI semantics, the creation of a new communicator C involves all the processes of an existing communicator and is implemented through two successive phases.
Figure 1: The MPI_COMM_WORLD environment (mother page, DVSA free areas pool, communicator descriptor, and the Collective, Process and Point to Point areas)
In the first phase, each process informs its partners of its participation in the creation of C. To this purpose, it decrements a counter stored in the descriptor of C. The descriptor is allocated by the first process invoking MPI_Comm_Create_SH, which initializes it by storing a new context, the number of processes of the communicator and their identifiers. The context is assigned by accessing a global context counter allocated in a Level1 area. The communicator descriptor also stores pointers to a set of free areas to support the communications within the communicator. As an example, Fig. 1 shows the areas after the execution of MPI_Init_SH, which initializes both the descriptor of the MPI_COMM_WORLD communicator and the execution environment. A communicator descriptor includes both a pointer to a pool of free areas, i.e. Level3 and Level4 areas, which will be dynamically allocated during the communicator lifetime, and a set of Level2 areas, one for each process of the communicator, shared among all the processes of the communicator during its lifetime. Free areas are used to support point to point communications; the other areas are partitioned according to their use: Collective areas are exploited to support collective communications, Point to Point areas to implement point to point synchronizations, and Process areas to store the addresses of dynamically allocated channel areas. These areas will be described in more detail in the following, together with the implementation of the different kinds of communications. Each process invoking MPI_Comm_Create_SH allocates a subset of the areas of the communicator by selecting their addresses from its local tables. The second phase synchronizes all the processes involved in the creation of the communicator.
This synchronization is necessary to meet the MPI semantics, which requires that all the processes of a communicator complete the creation before any process uses the communicator. The synchronization can be simply implemented by inserting an explicit barrier after MPI_Comm_Create_SH. However, it is possible to relax the synchronization so that processes are delayed only when they execute the first operation that refers to the communicator. In this way, each process can continue the execution of its program while its partners are still involved in the creation of the communicator. This solution can be simply implemented by checking, in the communicator descriptor, the number of processes that have notified that they are taking part in the creation. This guarantees that any communication starts only after all the areas of the new communicator have been allocated. At the end of the second phase, each process can copy the addresses of the areas shared within the communicator into a set of local tables.
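The two phases can be summarized by the following sketch. It assumes that the descriptor is read and written with whole-area, synchronized DVSA accesses hidden behind hypothetical helpers (read_descriptor, write_descriptor, next_global_context); it illustrates the protocol and is not the MPI SH code.

/* Sketch of the two-phase creation of a communicator; all names are
 * invented and the DVSA accesses are hidden behind hypothetical helpers
 * that read/write the whole descriptor area.                             */
typedef struct {
    int context;   /* unique context, taken from the global Level1 counter */
    int n_procs;   /* number of processes of the new communicator          */
    int missing;   /* processes that have not yet joined the creation      */
    /* ... identifiers of the processes, pointers to Level2 areas and
           to the pools of free Level3/Level4 areas ...                    */
} comm_descriptor_t;

extern void read_descriptor(comm_descriptor_t *d);        /* synchronized read  */
extern void write_descriptor(const comm_descriptor_t *d); /* synchronized write */
extern int  next_global_context(void);  /* Level1 context counter, synchronized */

/* Phase 1: executed by every process taking part in the creation. */
void comm_create_phase1(comm_descriptor_t *d, int i_am_first, int n_procs)
{
    read_descriptor(d);
    if (i_am_first) {                 /* the first invoker initializes it   */
        d->context = next_global_context();
        d->n_procs = n_procs;
        d->missing = n_procs;
        /* ... it also stores the process identifiers and contributes
               some of its free areas to the communicator pools ...         */
    }
    d->missing -= 1;                  /* notify participation               */
    write_descriptor(d);
}

/* Phase 2: lazy synchronization. Instead of a barrier after the creation,
 * a process is delayed only at its first operation on the communicator.   */
void comm_wait_ready(comm_descriptor_t *d)
{
    do {
        read_descriptor(d);           /* re-read the (possibly remote) area */
    } while (d->missing > 0);
}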
5. POINT TO POINT COMMUNICATIONS
This section describes the implementation of a subset of the point to point communications; [6] describes the whole set of primitives, including the non deterministic ones. Point to point communications are implemented through Channel, Buffer and Point to Point areas. Channel areas store information about the pending communications between two partners, and Buffer areas store the corresponding messages. Point to Point areas implement process synchronization. Each Channel area is allocated by the process starting the communication, which selects a free area from the communicator pool and stores its address in a Process area. In the current implementation, the Process area is always associated with the receiver. The partner process can select the proper Channel area by accessing this Process area. To avoid repeated accesses to the communicator descriptor, both partners of a communication cache the addresses of the Channel areas in their local tables. Each Channel area stores a set of queues, each corresponding to the communications exploiting a different tag. Each element of a queue records the kind of primitive that has initiated the communication, a pointer to the corresponding Buffer area and the message size; a possible layout of such an element is sketched below. Each process executing a point to point communication checks whether the communication is currently pending by inspecting the proper queue. To synchronize the partners in the case of a synchronous send, a set of semaphores is allocated in the Point to Point areas; the semaphores are allocated in these areas to simplify the implementation of non deterministic communications. These communications, and in particular the one that allows a process to receive a message without specifying the tag or the sender, have forced us to synchronize the accesses to the queues, because the ability of leaving the tag or the sender process undefined enables the receiver to choose the message to be received from a set of queues that cannot be known in advance. Hence, no optimization of the accesses to the queues can be supported. This is the most critical feature of MPI as far as the overall performance of MPI SH is concerned. The partitioning of the areas into several levels supports several optimization strategies related both to the allocation of the areas and to the accesses to them.
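The sketch below shows one possible layout of a queue entry of a Channel area; the type and constant names are invented for illustration and do not reproduce the actual MPI SH data structures.

/* Illustrative layout of one entry of a Channel area queue; each queue
 * collects the pending communications for one tag. Names are invented.   */
#include <stddef.h>

typedef int dvsa_area_t;                /* DVSA area identifier (hypothetical) */

typedef enum {                          /* primitive that initiated the entry  */
    PP_SEND, PP_SSEND, PP_ISEND, PP_RECV, PP_IRECV
} pp_kind_t;

typedef struct pending_entry {
    pp_kind_t             kind;         /* kind of initiating primitive         */
    dvsa_area_t           buffer;       /* Buffer area holding the message      */
    size_t                size;         /* message size in bytes                */
    struct pending_entry *next;         /* next pending communication, same tag */
} pending_entry_t;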
Figure 2: Collective areas of a communicator

As far as the allocation is concerned, MPI SH allocates the pool of areas managed by a process P in the local memory of the node executing P. This guarantees, for instance, that the Channel and the Buffer areas are always allocated either in the memory of the sender or in that of the receiver. MPI SH optimizes the accesses to shared areas as well, by defining a proper caching strategy. Each access to a shared area requires, in general, an access to a remote memory. To reduce the cost of these accesses in the implementation of point to point communications, MPI SH caches several data in the local memories. This strategy is adopted, for instance, for Channel areas. When a process P accesses a channel to receive a message, it copies into its local memory the information regarding any pending communication. We have chosen the receiver process because the number of pending sends is generally larger than that of pending receives. When a process receives a message, it first checks the pending communications in the cache and, only if the cache does not include any matching pending communication, it accesses the possibly remote Channel area. When a process accesses a Channel area, it also copies into the area the updated information from the cache. The performance figures in Section 7 show that this optimization may largely improve the overall efficiency.
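The receive path with this caching strategy can be sketched as follows; cache_match and cache_refresh are hypothetical helpers standing for the accesses to the local cache and to the possibly remote Channel area, so the fragment illustrates the control flow only.

/* Sketch of the receive path with the Channel-area cache; the helpers
 * below are hypothetical.                                               */
#include <stdbool.h>

extern bool cache_match(int src, int tag, void *buf); /* match in local cache */
extern void cache_refresh(int src);  /* copy every pending communication of the
                                        Channel area into the local cache and
                                        write back the updated cache state    */

bool receive_message(int src, int tag, void *buf)
{
    /* 1. Look for a matching pending send in the local cache:
          if found, no access to the remote Channel area is needed.       */
    if (cache_match(src, tag, buf))
        return true;

    /* 2. Otherwise access the Channel area, refreshing the cache.        */
    cache_refresh(src);

    /* 3. Retry the match on the refreshed cache; if it still fails, the
          receive is recorded as pending (not shown in this sketch).      */
    return cache_match(src, tag, buf);
}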
6. COLLECTIVE COMMUNICATIONS

Collective communications exploit the Collective areas of the communicator descriptor for both message transmission and process synchronization. The definition of a distinct set of Collective areas for each communicator guarantees that collective communications involving different communicators can be concurrently executed without side effects. According to the MPI semantics, collective communications are blocking, but not synchronous. This implies that a process may start a collective communication before the previous one has been completed by all the other processes. For this reason, MPI SH further partitions the set of Collective areas, so that a different set of areas is associated with each process of the communicator. In this way, collective communications with distinct root processes can be simultaneously active within the same communicator, because the communications exploit different areas, and no side effect within a communicator can arise.
Figure 3: Execution times of broadcast (execution time in microseconds vs. message size in bytes, MPI_SH vs. MPI)

Figure 4: Execution times of scatter (execution time in microseconds vs. message size in bytes, MPI_SH vs. MPI)
For the time being, let us assume that the data exchanged in a collective communication always fits in one area. MPI SH pairs two data areas, DATA_IN and DATA_OUT, with each process P: the former is exploited when P is the root receiver of a communication, the latter when P is the root sender. Furthermore, each data area is paired with a corresponding synchronization area, respectively SYNC_IN and SYNC_OUT, to synchronize the processes involved in the collective communication. Each synchronization area includes nprocs binary semaphores, one for each process of the communicator. These semaphores enable a receiver process to check whether the data it is waiting for is present in the corresponding data area, and a sender process to check whether the data area is free or still holds previously transmitted data. The structure of the Collective areas paired with a communicator is shown in Fig. 2. Let us now consider the implementation of a collective communication, such as a broadcast. In this case, the root process P checks whether its DATA_OUT area is free by inspecting the corresponding SYNC_OUT area. If all the processes involved in a previous operation with the same root sender have fetched the message from the DATA_OUT area, then all the semaphores are set to 0; in this case, P writes the message in DATA_OUT and sets all the semaphores to 1, otherwise it waits till all the semaphores are set to 0.
Figure 5: Execution times of alltoall (execution time in microseconds vs. message size in bytes, MPI_SH vs. MPI)

The i-th process involved in the communication checks the presence of the message by inspecting the i-th semaphore of SYNC_OUT. When the value of this semaphore is 1, the data can be read and the value of the semaphore is reset. It is worth noticing that no synchronization or synchronized access to the areas is required to implement MPI collective communications. This is feasible because of the MPI semantics and of the partitioning of the areas into synchronization and data areas. It is also worth noticing that some DVSA operations not described here support concurrent accesses to the same area from different processes by merging the concurrent operations; as an example, concurrent accesses to check and modify the semaphores of a SYNC_IN or SYNC_OUT area are possible. The assumption that a message always fits in an area is too restrictive, because the size of the messages is not known when the Collective areas are allocated. To handle messages of any size, the support defines k DATA_OUT and k DATA_IN areas for each process, each one associated with a corresponding synchronization area. In this way, a sender can partition a message into packets and store each one in a distinct area. This strategy further increases concurrency, because it loosens the synchronization between the readers and the writer of the same area: the sender can write the i-th packet of a message while the previous packets are read by the other processes. This is possible also because a distinct synchronization area is paired with each of the k areas of a process.
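The broadcast protocol described above can be sketched as follows, under the single-area assumption. The helpers accessing the root's DATA_OUT and SYNC_OUT areas are hypothetical, and the handling of the root's own semaphore is our assumption, since it is not detailed in the text.

/* Sketch of the broadcast protocol over the Collective areas, assuming
 * that the message fits in one DATA_OUT area; helper names are invented. */
#include <stddef.h>

extern int  sync_out_get(int root, int i);             /* read i-th semaphore   */
extern void sync_out_set(int root, int i, int v);      /* write i-th semaphore  */
extern int  sync_out_all_zero(int root);               /* all semaphores == 0 ? */
extern void sync_out_set_all(int root, int v, int n);  /* write all semaphores  */
extern void data_out_write(int root, const void *m, size_t len);
extern void data_out_read(int root, void *m, size_t len);

/* Root side: wait until the previous message has been consumed by every
 * process, write the new message, then raise all the semaphores.         */
void bcast_root(int root, const void *msg, size_t len, int nprocs)
{
    while (!sync_out_all_zero(root))
        ;                               /* previous broadcast still pending */
    data_out_write(root, msg, len);     /* store the message in DATA_OUT    */
    sync_out_set_all(root, 1, nprocs);  /* signal every process             */
    sync_out_set(root, root, 0);        /* assumption: the root does not
                                           read its own message             */
}

/* Process i: wait for its own semaphore in the root's SYNC_OUT area,
 * read the message and mark it as consumed.                              */
void bcast_leaf(int root, int i, void *msg, size_t len)
{
    while (sync_out_get(root, i) != 1)
        ;                               /* message not yet available        */
    data_out_read(root, msg, len);
    sync_out_set(root, i, 0);           /* reset: message consumed          */
}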
7. EXPERIMENTAL RESULTS AND CONCLUSIONS

This section describes some performance figures of the implementation of MPI SH on a Meiko CS-2 and compares them against those of an MPI implementation developed on top of a native communication library. Figs. 3, 4 and 5 show the execution times of the broadcast, scatter and alltoall primitives in the case of 8 processes. Fig. 6 shows the execution time of the barrier primitive when varying the number of processes. The performance of collective communications largely benefits from an implementation on top of a distributed virtual shared memory. The curves in Fig. 7 show the effectiveness of the caching strategy of MPI SH. In the considered program, at first two processes send each other k messages, where the number of pending messages labels the x coordinate in the figure; then, they execute a barrier synchronization and receive the messages.
Figure 6: Execution times of barrier (execution time in microseconds vs. number of processes, MPI_SH vs. MPI)

Figure 8: Comparison between MPI and MPI-SH on point to point communication (execution time in microseconds vs. number of pending messages, for 4 byte and 1024 byte messages)
Figure 7: Comparison between MPI and MPI-SH on cache optimization efficiency (execution time in microseconds vs. number of pending messages)

Fig. 8 compares the performance of MPI SH against that of the MPI implementation on a message passing library. These curves show that the performance of MPI SH point to point communications is worse than the one that can be achieved by a message passing library. We believe that a large amount of the overhead of MPI SH in the implementation of point to point communications is due to the synchronizations introduced to support the MPI non deterministic constructs, as recalled in Section 5. Our experiments show that a highly efficient MPI implementation can be developed through a hybrid approach that exploits both a native message passing library for point to point communications and a distributed shared memory for collective communications. Furthermore, key factors for the efficiency of such an implementation are proper caching strategies and the minimization of the contention on shared information.
8. ADDITIONAL AUTHORS
Additional authors: Fabrizio Baiardi and Leonardo Vaglini, Dipartimento di Informatica, Università di Pisa, Corso Italia 40, 56125 PISA. email: {baiardi,vaglini}@di.unipi.it

9. REFERENCES

[1] F. Baiardi, S. Chiti, P. Mori, and L. Ricci. Adaptive multigrid methods in MPI. Proceedings of Euro PVM/MPI 2000, Lecture Notes in Computer Science, 1908:80–87, Sept. 2000.
[2] F. Baiardi, D. Guerri, P. Mori, L. Moroni, and L. Ricci. Two layers distributed shared memory. Proceedings of HPCN 2001, Lecture Notes in Computer Science, 2110:302–311, June 2001.
[3] R. Calkin, R. Hempel, and P. Wypior. Portable programming with the PARMACS message-passing library. In Parallel Computing, volume 20, pages 615–632. North-Holland, 1994.
[4] K. Cameron, L. Clarke, and A. Smith. CRI/EPCC MPI for CRAY T3D. Edinburgh Parallel Computing Centre, The University of Edinburgh, 1995.
[5] J. Dongarra and T. Dunigan. Message passing performance on various computers. Technical Report UT/CS-95-229, Computer Science Dept., University of Tennessee, Knoxville, 1995.
[6] F. Baiardi and D. Guerri. Implementazione di MPI mediante la memoria virtuale condivisa. Technical report, University of Pisa, 1998.
[7] F. Baiardi, D. Guerri, P. Mori, and L. Ricci. Evaluation of a virtual shared memory machine by the compilation of data parallel loops. In 8th Euromicro Workshop on Parallel and Distributed Processing, 2000.
[8] G. Folino, G. Spezzano, and D. Talia. Performance evaluation and modelling of MPI communications on the Meiko CS-2. Proceedings of HPCN Europe '98, Lecture Notes in Computer Science, 1401:932–936, 1998.
[9] W. Gropp and E. Lusk. The MPI communication library: its design and a portable implementation. Technical report, MCS, Argonne National Laboratory, 1996.
[10] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra. MPI: The Complete Reference. MIT Press, Cambridge, Massachusetts, 1996.
[11] M. Vanneschi. PQE2000: HPC tools for industrial application. IEEE Concurrency, pages 68–73, October-December 1998.