Highly Efficient Implementation of MPI Point-to-point Communication Using Remote Memory Operations
Osamu Tatebe, Yuetsu Kodama, Satoshi Sekiguchi, Yoshinori Yamaguchi
Electrotechnical Laboratory
1-1-4 Umezono, Tsukuba, Ibaraki 305-8568 JAPAN
{tatebe, kodama, sekiguchi, yamaguchi}@etl.go.jp
Abstract

MPI point-to-point communication is a basic operation; however, it requires runtime matching of sends and receives, which reduces performance. This paper proposes a new approach that sends messages by remote memory write, without inquiring of the receiver, under a communication pattern in which the nonblocking receive is issued in advance. This approach basically makes it possible to obtain latency and bandwidth close to the hardware specification. MPI-EMX, our implementation of MPI on the EM-X multiprocessor, achieves a zero-byte latency of 13.4 μsec. and a maximum bandwidth of 31.4 MB/s, which can compete with commercial MPPs. This approach to reducing communication latency is widely applicable to other systems and is quite a promising technique for achieving low latency and high bandwidth.

1 Introduction

The MPI standard[11] is designed to achieve high performance as well as portability; in particular, it is designed to encourage overlap of communication and computation to hide communication overhead. Point-to-point communication is not only a basic operation of a message-passing library but also quite a useful operation for parallel programming. Point-to-point communication can be considered a generalization of a FIFO memory: send and receive correspond to store and load, respectively, if the rank of the source or destination, the message tag and the communicator are regarded as a generalized address space and messages as data. FIFO means that the order of communications is preserved. Point-to-point communication can also be considered a mechanism for local synchronization: a receive buffer is not overwritten before the receive is issued, and the receive buffer holds valid data after the completion of the receive.

Since point-to-point communication is a basic and useful operation, its efficient implementation is a key to achieving high performance. In current MPI implementations, a sender always inquires of the receiver at send time whether the corresponding receive has been issued. This design is simple and straightforward but may reduce performance, because the inquiry at send time increases the communication latency and, moreover, interrupts the receiving process. This paper proposes a one-sided implementation of point-to-point communication using remote memory operations for the case in which the nonblocking receive is issued in advance. At send time, the sender delivers the message by remote memory write without inquiring of the receiver. To implement this design, the receiver must send a receive request to the sender, and the corresponding receive is determined by the sender, not by the receiver. Unfortunately, sender-side matching cannot be done when the receive specifies a wildcard for the rank of the source, so receiver-side matching is still necessary. This paper therefore also proposes consistent message matching by both sender and receiver. Using this consistent matching, point-to-point communication can be implemented for all communication patterns, while the sender can still send messages by remote memory operations whenever nonblocking receives are issued in advance. Our design can be implemented on massively parallel processors (MPPs) and symmetric multiprocessors (SMPs) that support remote memory operations in hardware. We implement part of the MPI standard on the EM-X parallel computer[5] using special communication support, such as remote memory write, remote thread invocation and I-structures[1]. We evaluate MPI-EMX and also comment on how an MPI program can leave the MPI runtime library room to execute it efficiently. In Section 2, MPI point-to-point communication is briefly explained, and an MPI programming style for overlapping communication and computation using nonblocking communications is described. Section 3 shows the basic idea of a one-sided implementation of point-to-point communication using remote memory operations. Section 4 discusses consistent message matching by both sender and receiver.
To appear in the 12th ACM International Conference on Supercomputing, July 1998, Melbourne, Australia.
Implementation of MPI-EMX is described in Section 5. In Section 6, MPI-EMX is evaluated by basic benchmarks and a simple kernel.

2 MPI point-to-point communication

In MPI point-to-point communication, both sender and receiver specify a message buffer, a number of entries, a datatype, a rank of source or destination, a message tag and a communicator. Send and receive are matched by the rank of source and destination, the message tag and the communicator. MPI nonblocking communications are declared as follows in C.

int MPI_Isend(buf, count, datatype, dest, tag, comm, req)
void *buf;
int count, dest, tag;
MPI_Datatype datatype;
MPI_Comm comm;
MPI_Request *req;

int MPI_Irecv(buf, count, datatype, source, tag, comm, req)
void *buf;
int count, source, tag;
MPI_Datatype datatype;
MPI_Comm comm;
MPI_Request *req;

The rank of source and the tag in MPI_Irecv() can be specified by the wildcard values MPI_ANY_SOURCE and MPI_ANY_TAG. Nonblocking communications complete locally and return a request object for notifying the completion of the communication, after which the message buffer can be modified or accessed validly. Using nonblocking communications, overlapping communication with computation is expressed by the following code.

// Post send and receive operations
MPI_Irecv(recv_buf, n, datatype, source, tag, comm, &recv_req);
computation 1
MPI_Isend(send_buf, n, datatype, dest, tag, comm, &send_req);
computation 2    // Overlapped region
// Wait for the completion of send
MPI_Wait(&send_req, &send_status);
computation 3    // Overlapped region, too
// Wait for the completion of receive
MPI_Wait(&recv_req, &recv_status);

There is room to overlap communication with computation between the nonblocking communications and the completion operations. In the above code, computations 2 and 3 may overlap communication. Computation 1 is necessary not for overlapping but for the one-sided implementation of point-to-point communication using remote memory operations that is proposed in this paper.

3 One-sided implementation

3.1 Remote memory operations

Recent MPPs support remote memory operations with special hardware, such as PUT/GET on the Fujitsu AP1000+ and remote DMA on the Hitachi SR-2201. An important aspect is that operations are done by one side without interruption of the target processor. Since load and store operations in shared memory or distributed shared memory also have this property, these operations can also be considered remote memory operations. A software implementation of remote memory operations is, of course, possible on distributed memory machines, for example using Berkeley Active Messages[12]; however, such a software implementation does not bring good performance, since the target processor has to execute a message handler by interrupting target processes or by polling incoming messages periodically.

3.2 One-sided send for point-to-point communication

Corresponding sends and receives cannot always be determined statically from an MPI program. When each MPI process is single-threaded, no receive specifies a wildcard for the source rank, no multiple-completion operation is issued, and so on, static determination may be possible. Even if there is nondeterminism in an MPI program, it is possible to fix corresponding sends and receives statically in some arbitrary order; however, deadlock or starvation of messages may then occur unless deadlock-freedom and starvation-freedom are checked. MPI-2[7] introduces dynamic process creation and communication between two sets of MPI processes, which makes such static analysis much harder. An efficient mechanism for dynamic, runtime message matching is therefore important. When a receive specifies a wildcard for the rank of the source and/or the message tag, the corresponding send is expected to be determined in arrival order at the receiver. The matching of corresponding sends and receives is therefore normally done by the receiver. In this receiver-side matching, however, inquiring of the receiver is always necessary at send time, which may lower communication performance since the inquiry affects the receiving process. Our approach, depicted in Figure 1, is

1. MPI_Irecv() sends a receive request that inserts the addresses of the receive buffer and the request object, the message tag and the communicator into an associative memory on the sender.

2. MPI_Isend() or MPI_Send() checks the associative memory and writes remotely to the receive buffer and the request object.

This implementation makes it possible to send by remote memory write without inquiring of the receiver at send time. The key of the associative memory is the destination, the message tag and the communicator. Since there might be several receive requests with the same key, and since MPI guarantees the order of messages between two processes, the associative memory needs to be implemented as a FIFO queue in each entry. Unfortunately, we cannot always use this message matching by the sender, because MPI_Irecv() with a wildcard source cannot send the receive request1. In this case, receiver-side matching is indispensable. For a one-sided implementation of point-to-point communication, message matching should be done by the sender, so it is necessary to match corresponding sends and receives by both sender and receiver. The next section considers this message matching by both sender and receiver and its consistency.

1 Even if the receive request were broadcast, there is another problem of matching only one pair of send and receive, which would require another collective communication.
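To make steps 1 and 2 concrete, the following is a minimal sketch, with invented names and a plain linked list standing in for the hardware-supported associative memory, of how the sender-side table of receive requests could be organized; it is not the MPI-EMX code.

#include <stdlib.h>

/* One posted receive, as seen by the sender after a receive request arrives. */
typedef struct recv_req {
    void *recv_buf;        /* remote address of the receive buffer */
    void *req_obj;         /* remote address of the request object */
    int   rank, tag, comm; /* matching key                         */
    struct recv_req *next; /* arrival (FIFO) order                 */
} recv_req_t;

static recv_req_t *reqs;   /* stands in for the associative FIFO memory */

/* Invoked on the sender when MPI_Irecv() on the receiver sends a receive request. */
void post_recv_request(void *rbuf, void *robj, int rank, int tag, int comm)
{
    recv_req_t *r = malloc(sizeof *r), **p = &reqs;
    r->recv_buf = rbuf; r->req_obj = robj;
    r->rank = rank; r->tag = tag; r->comm = comm; r->next = NULL;
    while (*p) p = &(*p)->next;   /* append, keeping arrival order */
    *p = r;
}

/* Invoked by MPI_Isend()/MPI_Send(): find the oldest matching receive request. */
recv_req_t *match_recv_request(int dest, int tag, int comm)
{
    recv_req_t **p;
    for (p = &reqs; *p; p = &(*p)->next)
        if ((*p)->rank == dest && (*p)->tag == tag && (*p)->comm == comm) {
            recv_req_t *r = *p;
            *p = r->next;          /* unlink; caller performs the remote write */
            return r;
        }
    return NULL;                   /* miss: fall back to sending a send request */
}

The per-key FIFO discipline required by MPI message ordering is preserved here simply by appending in arrival order and always unlinking the oldest matching entry.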
Figure 1: One-sided send using message matching by the sender

Figure 2: Sync protocol and get protocol in the case that nonblocking receive has been issued
4 Consistent message matching by both sender and receiver
Ideally, the decision of which side matches a message would depend on whether the corresponding receive specifies a wildcard for the rank of the source process and on whether the receive is issued very late. This decision policy is, however, not realistic, since this information is not known until matching time.
Figure 3: Sync protocol in the case that nonblocking receive has not been issued
4.1 Receiver-side message matching
There are several protocols for sending messages based on message matching by the receiver. If no temporary (system) buffer is used, the sender sends a send request, which includes the addresses of the send buffer and the request object, the message tag and the communicator, to the receiver to inquire whether the corresponding receive has been issued. If it has been issued, the receiver reads from the send buffer remotely (Figure 2). If not, the receiver inserts the send request into its associative memory; when the corresponding receive is issued, the receiver checks the associative memory and reads remotely from the send buffer (Figure 3). We call this method the sync protocol; it is similar to the protocol method in [10]. Notice that this protocol offers only synchronous-mode send, although there is never an extra copy. In the sync protocol, MPI_Send() is blocked for too long, until the corresponding receive is issued. To avoid this blocking, an intermediate buffer is used. This buffer is best allocated on the receiver side, to reduce latency at receive time and to cope with receives that specify a wildcard. Since the sender does not know the address of the intermediate buffer in the receiver, it needs to inquire of the receiver. There are two protocols, depending on whether matching is done at the time of this inquiry. In the protocol without matching, which we call the buffered protocol, the sender requests the allocation of an intermediate buffer from the receiver, writes the message remotely, and sends the send request to the receiver (Figure 4). The equivalent operation can be done using remote memory read by the receiver after allocating the intermediate receive buffer. In the other protocol, the sender sends the send request to inquire whether the corresponding receive has already been issued. If the corresponding receive is found, the message is read directly from the send buffer into the receive buffer without any extra copy (Figure 2). If not, the receiver allocates an intermediate buffer and reads from the send buffer remotely, or the sender writes remotely (Figure 5). This protocol is called the get protocol in MPICH.
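The receiver-side handling of a send request under the get protocol can be sketched as follows; the helper primitives (match_posted_receive, remote_read, complete_receive, enqueue_unexpected) are hypothetical stand-ins for implementation-specific operations, not functions from the paper or from MPICH.

#include <stdlib.h>

/* Hypothetical primitives; declarations only, assumed provided by the runtime. */
typedef struct posted_recv { void *recv_buf; } posted_recv_t;
posted_recv_t *match_posted_receive(int src, int tag, int comm);
void remote_read(void *dst, void *remote_src, int nbytes, int src_pe);
void complete_receive(posted_recv_t *r, int src, int tag, int nbytes);
void enqueue_unexpected(void *buf, int nbytes, int src, int tag, int comm);

/* Receiver side of the get protocol: invoked when a send request arrives. */
void on_send_request(void *remote_send_buf, int nbytes, int src, int tag, int comm)
{
    posted_recv_t *r = match_posted_receive(src, tag, comm);
    if (r) {
        /* Matching hit: fetch directly into the user buffer, no extra copy. */
        remote_read(r->recv_buf, remote_send_buf, nbytes, src);
        complete_receive(r, src, tag, nbytes);
    } else {
        /* Matching miss: stage through an intermediate buffer on the receiver,
         * to be matched (and copied) when the corresponding receive is issued. */
        void *tmp = malloc(nbytes);
        remote_read(tmp, remote_send_buf, nbytes, src);
        enqueue_unexpected(tmp, nbytes, src, tag, comm);
    }
}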
4.2 Consistent message matching by both sender and receiver

Our sender-side matching design is intended to execute efficiently code in which MPI_Irecv() is issued in advance. To decide on which side message matching is done, we therefore basically give up sender-side matching if the receive request has not arrived before the send is issued. The basic strategy for which side matches messages is:

1. The sender checks for the corresponding receive request first. If it is found, the message is sent by remote memory write. If not, the sender sends a send request.

2. The receiver checks for the corresponding send request first. If it is found, the message is copied to the receive buffer (remotely). If not, and if the receive does not specify a wildcard for the rank of the source, the receiver sends a receive request.

In this design there is a consistency problem. If a send and a receive are issued at almost the same time, the sender sends a send request because the corresponding receive request has not arrived, and the receiver also sends a receive request because the corresponding send request has not arrived. The basic solution is to ignore one of the two requests. In the buffered protocol, it is proper to ignore the receive request, since the message is copied to the receiver as a result of the send request (Figure 6). In the sync protocol and the get protocol, the send request is ignored (Figure 7). To implement this solution, the sender enqueues an entry onto a queue, called the send queue2, recording that a send request has been sent, and the receiver likewise enqueues onto a receive queue. When a request that should be ignored arrives, an entry is dequeued. Note that dequeuing is necessary even if there is no request to be ignored, because otherwise a subsequent request that should not be ignored may be ignored. In a multithreading environment, note that the section that checks a request and deletes from or inserts into the associative memory is a critical section.

2 The send queue may also be implemented by an associative memory or a hash table.
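A minimal sketch of this decision rule on the sender side, reusing the hypothetical receive-request table above and using a simple counter in place of the send queue (buffered protocol only; all names are invented):

static int pending_send_reqs;   /* stands in for the send queue of Section 4.2 */

/* Sender: a receive request has just arrived from the receiver. */
void on_receive_request(int src, int tag, int comm, void *rbuf, void *robj)
{
    if (pending_send_reqs > 0) {
        /* It crossed one of our send requests; under the buffered protocol the
         * message travels with that send request, so the receive request is ignored. */
        pending_send_reqs--;
        return;
    }
    post_recv_request(rbuf, robj, src, tag, comm);   /* keep it for later sends */
}

/* Sender: MPI_Isend()/MPI_Send(). */
void one_sided_send(const void *buf, int nbytes, int dest, int tag, int comm)
{
    recv_req_t *r = match_recv_request(dest, tag, comm);
    if (r) {
        /* Hit: deliver 'nbytes' of 'buf' into r->recv_buf and complete r->req_obj
         * by remote memory write (hardware-dependent, omitted here). */
        free(r);
    } else {
        pending_send_reqs++;   /* remember that a send request is now in flight */
        /* fall back: send a send request to the receiver */
    }
}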
Figure 4: Buffered protocol

Figure 5: Get protocol in the case that nonblocking receive has not been issued

Figure 6: In the case of buffered protocol

Figure 7: In the case of sync and get protocols
4.3 Order of communications
MPI guarantees the order of communications between two processes. When the associative memory is implemented as a FIFO, our design for point-to-point communication also guarantees this FIFO property, provided the network ensures in-order delivery, except when the receive specifies a wildcard for the source. Because a receive with a wildcard cannot send a receive request, a following receive that does not specify a wildcard might match a send that the earlier wildcard receive should have matched, if the following receive were allowed to send its receive request. Therefore, a receive request issued after a receive with a wildcard must not be sent until the wildcard receive has been matched.
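A small sketch of this rule on the receiver side (invented names; only the wildcard bookkeeping is shown):

#include <mpi.h>

static int wildcard_pending;   /* unmatched wildcard receives on this process */

/* Called by MPI_Irecv(): decide whether a receive request may be sent now. */
void maybe_send_receive_request(int source, int tag, MPI_Comm comm)
{
    if (source == MPI_ANY_SOURCE) {
        wildcard_pending++;      /* must be matched on the receiver side;    */
        return;                  /* decremented when this receive is matched */
    }
    if (wildcard_pending == 0) {
        /* safe: send a receive request to rank 'source' for one-sided delivery */
    } else {
        /* defer: keep this receive local until every earlier wildcard receive
         * has been matched, so it cannot steal a message meant for one of them */
    }
}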
5 Implementation of MPI-EMX
MPI-EMX is our MPI implementation on the EM-X multiprocessor[5]; it supports both sender-side and receiver-side matching.
5.1 EM-X multiprocessor
The EM-X is a distributed memory machine developed at the Electrotechnical Laboratory (ETL). Its processing element, the EMC-Y[4], supports packet sending and packet matching mechanisms based on a dataflow execution model. The EMC-Y also supports serial and atomic execution of a block (or thread), which we call a strongly connected block (SCB), based on the von Neumann execution model. An SCB is invoked by one or two packets and executed by the execution unit. When a packet arrives, it is stored in an input buffer. If the packet is intended to invoke an SCB with two inputs and the other packet has not arrived, the packet is removed from the input buffer and stored in memory until the other packet arrives; this is called packet matching. Packets in the input buffer are ready to execute an SCB. One or more SCBs can have an operand segment (or function frame) that is used for the arguments of a function call, local variables and so on. Using the SCB invocation mechanism and the operand segment, the EMC-Y supports multithreaded execution in hardware. Local synchronization mechanisms, I-structures[1] and Q-structures[2], are implemented on top of the packet matching mechanism. The I-structure is a data structure introduced for functional languages. Each element of an I-structure can be assigned only once, and a read operation blocks until the corresponding write is issued. The EM-X implements the I-structure using packet matching for two inputs: when the matching is done, the blocked read operation is resumed. Note that no interruption and no polling are needed to resume a blocked read operation. The Q-structure is an extension of the I-structure that allows multiple read and multiple write operations using a FIFO queue; it is implemented in software on the EM-X. Remote memory operations can be implemented by invoking an SCB, but the EM-X also supports two special hardwired packets, SYSWR and SYSRD. When a SYSWR packet arrives at the packet input buffer, the data is stored by the input buffer unit, not by the execution unit. Using SYSWR and SYSRD packets, the EM-X supports remote memory operations.
5.2 Implementation of MPI-EMX
A key feature of the proposed one-sided implementation of point-to-point communication is how sending a send request and a receive request is implemented. Both the send request and the receive request need to be inserted into an associative FIFO memory on some other processor. Besides, before insertion it is necessary to check the send queue or the receive queue to see whether the corresponding request has been sent. This operation is implemented by remote thread invocation. Since several threads may execute simultaneously, a mutual exclusion lock for the associative memory is necessary. This operation on the associative memory and the queue is a main factor in the communication latency. We implement it as an atomic operation using an SCB, in EMC-Y assembly, without explicit locking. Remote memory operations are used not only for communicating data but also for writing the source rank, the message
tag and the message size into the receive request object. To wait for the completion of a send or receive, the local synchronization mechanism, the I-structure, is used instead of polling or interruption. Thanks to the I-structure, it is possible to wait for completion almost ideally, without any load on the processor. There is one small problem: the I-structure does not support a test operation that asks whether the write has been issued, because the I-structure, having been introduced for functional languages, does not keep such a state. To use the I-structure for the completion of a send or receive, however, a test operation is necessary, for example for MPI_Test(). We extend the I-structure with an asynchronous read operation and implement it on the EM-X. Since the I-structure on the EM-X is implemented by the packet matching mechanism and a data tag, this asynchronous read operation can be implemented by checking the tag of the I-structure. Q-structures might be considered for matching messages, but there are two difficulties. One is the address space: the key needs three words (12 bytes, or 96 bits) for the source or destination, the message tag and the communicator. The other is quite serious: the receive may specify a wildcard. Remote thread creation on the EM-X ordinarily consists of a request for an operand segment, remote memory writes of the arguments and invocation of the thread by a packet. A special packet handler can be invoked by a single packet, although it does not use an operand segment and the number of arguments is restricted to two. We implement the thread that deletes an entry from the send queue or the receive queue using this special packet handler, since it has few arguments. Because a request object has a fixed length, allocating and freeing request objects are highly optimized in assembly using free-list management. MPI_Wait() is also implemented in assembly.
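The role of the I-structure in completing a request can be sketched conceptually as follows; this is a portable analogue with invented names, not EM-X code, and the busy-wait loop stands in for what the EM-X does in hardware by blocking the read packet.

/* One completion slot per request object. */
typedef struct {
    volatile int written;   /* plays the role of the I-structure data tag */
    int status;             /* e.g. source rank, tag and message size     */
} completion_slot_t;

/* The matching write: performed (remotely) when the message has arrived. */
void slot_write(completion_slot_t *s, int status)
{
    s->status = status;
    s->written = 1;          /* on the EM-X this releases the blocked read */
}

/* MPI_Wait() analogue: a blocking read of the I-structure element. */
int slot_wait(completion_slot_t *s)
{
    while (!s->written)
        ;                    /* the EM-X blocks in hardware instead of spinning */
    return s->status;
}

/* MPI_Test() analogue: the asynchronous read that only inspects the tag. */
int slot_test(completion_slot_t *s, int *status)
{
    if (!s->written)
        return 0;
    *status = s->status;
    return 1;
}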
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Barrier(MPI_COMM_WORLD);
if (rank == 1) {
    time = MPI_Wtime();
    for (i = 0; i < NUM_ITER; ++i) {
        MPI_Irecv(rbuf, 1, MPI_INT, 2, 100, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &st);
        MPI_Isend(sbuf, 1, MPI_INT, 2, 100, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &st);   /* or MPI_Request_free(&req); */
    }
    time = MPI_Wtime() - time;
} else if (rank == 2) {
    time = MPI_Wtime();
    for (i = 0; i < NUM_ITER; ++i) {
        MPI_Isend(sbuf, 1, MPI_INT, 1, 100, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &st);   /* or MPI_Request_free(&req); */
        MPI_Irecv(rbuf, 1, MPI_INT, 1, 100, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &st);
    }
    time = MPI_Wtime() - time;
}
Figure 8: Normal code for pingpong

Table 1: One-way time of pingpong on the EM-X (μsec.)
                        both sides             only receive side
                    buf.    get    sync      buf.    get    sync
    normal          36.1    25.4   27.9      42.8    21.4   22.1
    pre-request             15.4             42.9    20.6   21.4
6 Performance evaluation and programming examples

6.1 Pingpong
An ordinary pingpong program written in MPI is shown in Figure 8; this program is not written with the intention of issuing MPI_Irecv() in advance. Figure 8 can be rewritten so that every MPI_Irecv() is issued in advance3 (Figure 9). In this program, since all receive requests arrive before the sends are issued, every send is executed by remote memory write without inquiring of the receiver. Table 1 shows the one-way time of pingpong in each protocol for Figure 8, denoted by 'normal', and for Figure 9, denoted by 'pre-request', on the EM-X. The pre-request program with message matching by both sides achieves a minimal 4-byte latency of 15.4 μsec., which is the same in each protocol because message matching by the sender always succeeds. Programs that the MPI library can execute efficiently, and an MPI library under which such programs run efficiently, are both quite important. In the normal program, the buffered protocol with both-side matching is faster than the buffered protocol with only receiver-side matching, because all sender-side matching succeeds in the process that initially receives a message. The other two protocols with both-side matching are slower than with only
receiver-side matching, because in that case every sender-side matching attempt fails in vain. The latency without communication, which includes message matching and writing to the request object, is 13.4 μsec. in the case of both-side matching. This accounts for 87% of the 4-byte latency, mainly because the CPU clock of the EM-X is only 16 MHz. The one-way latency of pingpong would be expected to be 3-4 μsec. if the CPU clock were ten times faster.
6.2 Throughput
The throughput of MPI-EMX, measured by a pingpong program in which all nonblocking receives are issued in advance, is shown in Figure 10. The top line, denoted by 'remote write', shows the performance of remote memory write of a contiguous block, which is used when sender-side matching succeeds. This remote write routine is unrolled 16 times, so there is a jump around 64 bytes. Because the EM-X has quite a low communication latency and the receiver is dedicated to receiving without any other computation, the difference in throughput between both-side and receiver-side-only matching is slight. Since the buffered protocol has to allocate an intermediate receive buffer remotely and copy to and from that buffer, its throughput remains low. Since MPI-EMX has a zero-byte latency of 13.4 μsec., throughput increases slowly; however, half of the peak performance of 32 MB/sec. is achieved at 512 bytes. MPI-EMX reaches its maximum performance of 31.4 MB/sec. at 1.6 MB.
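As a rough cross-check, not given in the paper, a simple latency-bandwidth model t(n) = t0 + n/Bpeak predicts a half-peak message size of

    n_1/2 = t0 x Bpeak = 13.4 μs x 32 MB/s = approximately 430 bytes,

which is consistent with the observed half-peak point of about 512 bytes.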
3 This program is valid in the sense of benchmarking; however, nonblocking receives with the same receive buffer should not in general be outstanding at the same time, because receiving messages then becomes a race condition.
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 1) {
    for (i = 0; i < NUM_ITER; ++i) {
        MPI_Irecv(rbuf, 1, MPI_INT, 2, 100, MPI_COMM_WORLD, &req[i]);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    time = MPI_Wtime();
    for (i = 0; i < NUM_ITER; ++i) {
        MPI_Isend(sbuf, 1, MPI_INT, 2, 100, MPI_COMM_WORLD, &r);
        MPI_Wait(&r, &s);   /* or MPI_Request_free(&r); */
        MPI_Wait(&req[i], &st[i]);
    }
    time = MPI_Wtime() - time;
} else if (rank == 2) {
    for (i = 0; i < NUM_ITER; ++i) {
        MPI_Irecv(rbuf, 1, MPI_INT, 1, 100, MPI_COMM_WORLD, &req[i]);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    time = MPI_Wtime();
    for (i = 0; i < NUM_ITER; ++i) {
        MPI_Wait(&req[i], &st[i]);
        MPI_Isend(sbuf, 1, MPI_INT, 1, 100, MPI_COMM_WORLD, &r);
        MPI_Wait(&r, &s);   /* or MPI_Request_free(&r); */
    }
    time = MPI_Wtime() - time;
} else {
    MPI_Barrier(MPI_COMM_WORLD);
}

Figure 9: Pingpong code issuing MPI_Irecv() in advance

Figure 10: Throughput on the EM-X (throughput in MB/sec. versus data size in bytes, for the remote write, both sides, get, sync and buffered cases)
6.3 Back substitution

Back substitution solves an upper triangular system, which typically appears after Gaussian elimination. The upper triangular matrix is distributed in a row block-cyclic manner, and the blocks are processed in parallel in a pipeline. The row denoted by 'pre-request' in Table 2 refers to a program in which all MPI_Irecv() calls are issued in advance. Since the performance of this program on a single processor is 3.66 MFlops, the pre-request version achieves 85% parallel efficiency using message matching by both sides. Because SCBs are not interrupted, explicit calls of yield() are necessary for efficient execution, especially when computation is dominant. The effect of yield() is remarkable with receiver-side-only matching, because otherwise the receiver cannot execute a thread invoked by the sender for a long time. However, yield() is pure overhead for the computation. Using the pre-request style and message matching by both sides, there is no need to execute a thread remotely at send time, and there is also no need to call yield() explicitly. The pre-request style for this kind of pipelined computation is illustrated by the sketch below.
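The following minimal sketch, with invented names and a simplified data flow, illustrates pre-posting all pipeline receives before the pipelined sweep; it is not the benchmark source, and the actual back-substitution update is only indicated by a comment.

#include <stdlib.h>
#include <mpi.h>

/* Pre-post every pipeline receive, then sweep over the blocks.  'prev' and
 * 'next' are the pipeline neighbours; each block of x is blksz doubles. */
void pipelined_sweep(double *x, int nblocks, int blksz, int prev, int next)
{
    MPI_Request *rreq = malloc(nblocks * sizeof(MPI_Request));
    MPI_Status st;
    int k;

    for (k = 0; k < nblocks; ++k)   /* all receives issued in advance */
        MPI_Irecv(x + k * blksz, blksz, MPI_DOUBLE, prev, k,
                  MPI_COMM_WORLD, &rreq[k]);

    for (k = 0; k < nblocks; ++k) {
        MPI_Wait(&rreq[k], &st);    /* block k of x has arrived */
        /* ... update the locally owned rows with this block and compute the
         *     locally owned entries of x for the next pipeline stage ... */
        MPI_Send(x + k * blksz, blksz, MPI_DOUBLE, next, k, MPI_COMM_WORLD);
    }
    free(rreq);
}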
Table 2: Back substitution on the EM-X (MFlops/#PE)
                                     both sides             only receiver side
                                 buf.    get    sync      buf.    get    sync
    w/o yield()   normal         2.55    2.33   2.17      2.55    0.71   0.71
                  pre-request            3.10             2.55    0.71   0.71
    w. yield()    normal         2.73    2.78   2.84      2.73    2.78   2.84
                  pre-request            2.84             2.74    2.85   2.85
    (#PE = 5, N = 2000, n of block = 25)

7 Related Works

MPICH[3] is the most famous implementation of the MPI standard. MPICH has three protocols, eager, rendezvous and get, for point-to-point communication. In every protocol, control messages are sent from the sender to the receiver, and message matching is done by the receiver. Even if MPI_Irecv() is issued in advance, the communication latency is greater than in our proposal, and moreover the receiver has to pay extra costs such as context switches. MPI implementations based on MPICH, such as MPI-FM[6] and MPI-AM[13], behave the same way. MPIAP[9, 10] is a native implementation on the Fujitsu multicomputers AP1000, AP1000+ and AP3000. MPIAP implements two protocols, the in-place method and the protocol method. The in-place method is similar to our buffered protocol, except that it does not use remote memory operations and the intermediate receive buffer is a ring buffer. The protocol method is similar to our sync protocol. In both methods, however, message matching is done by the receiver.

8 Summary

This paper proposed a design and implementation of MPI point-to-point communication using remote memory write, without inquiring of the receiver at send time, for the communication pattern in which the nonblocking receive MPI_Irecv() is issued in advance. The design includes consistent message matching by both sender and receiver. MPI-EMX, our implementation of MPI on the EM-X, achieves a minimal latency of 13.4 μsec., which is competitive with other commercial MPPs. The main factor in this latency is, however, the processor speed, not the communication latency; if the CPU clock were ten times faster, the latency would be expected to drop to 3-4 μsec. MPI-EMX attains almost the full throughput of the network. To achieve this performance, message matching by the sender is indispensable. In currently available MPI implementations, issuing nonblocking receives in advance does not always result in good performance; however, it gives an MPI library the chance to execute efficiently using special hardware support. Currently, MPI-EMX supports point-to-point communication, part of the collective communications, Cartesian topologies and so on; it is planned to support the entire MPI standard. MPI-2[7] introduces one-sided communications, such as MPI_Put() and MPI_Get(). One-sided communication separates two features of message passing: communication of data and synchronization. Since our design of sender-side matching is also applicable to the synchronization, it should be possible to implement one-sided communications using remote memory operations without inquiring of the target at communication time.
Acknowledgments

We are grateful to Dr. M. Sato at RWCP for discussions about EM-C[8], a C compiler supporting multithreading, global address pointers, I-structures, Q-structures and other extensions for the EM-X. We appreciate valuable discussions with the members of the parallel architecture group at ETL, Dr. H. Sakane, Dr. H. Yamana and Dr. H. Koike. We also give great thanks to Dr. K. Ohmaki, director of the computer science division at ETL, for supporting this research.

References

[1] Arvind, Nikhil, R. S., and Pingali, K. K. I-structures: Data structures for parallel computing. ACM Transactions on Programming Languages and Systems 11, 4 (1989), 598-632.

[2] Bic, L. F., Nicolau, A., and Sato, M. Parallel Language and Compiler Research in Japan. Kluwer Academic Publishers, 1995.

[3] Gropp, W., Lusk, E., Doss, N., and Skjellum, A. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22 (1996), 789-828.

[4] Kodama, Y., Koumura, Y., Sato, M., Sakane, H., Sakai, S., and Yamaguchi, Y. EMC-Y: Parallel processing element optimizing communication and computation. In Proceedings of International Conference on Supercomputing 1993 (1993), pp. 167-174.

[5] Kodama, Y., Sakane, H., Sato, M., Yamana, H., Sakai, S., and Yamaguchi, Y. The EM-X parallel computer: Architecture and basic performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (1995), pp. 14-23.

[6] Lauria, M., and Chien, A. MPI-FM: High performance MPI on workstation clusters. Journal of Parallel and Distributed Computing 40, 1 (January 1997), 4-18.

[7] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 1997. http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html.

[8] Sato, M., Kodama, Y., Sakai, S., Yamaguchi, Y., and Koumura, Y. Thread-based programming for the EM-4 hybrid dataflow machine. In Proceedings of the 19th Annual International Symposium on Computer Architecture (1992), pp. 146-155.

[9] Sitsky, D., and Hayashi, K. An MPI library which uses polling, interrupts and remote copying for the Fujitsu AP1000+. In Proceedings of International Symposium on Parallel Architectures, Algorithms, and Networks (June 1996), IEEE.

[10] Sitsky, D., and Hayashi, K. Implementing MPI for the Fujitsu AP1000/AP1000+ using polling, interrupts and remote copying. In Proceedings of Joint Symposium on Parallel Processing 1996 (June 1996), pp. 177-184.

[11] Snir, M., Otto, S., Huss-Lederman, S., Walker, D., and Dongarra, J. MPI: The Complete Reference. The MIT Press, 1996.

[12] von Eicken, T., Culler, D. E., Goldstein, S. C., and Schauser, K. E. Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture (May 1992), pp. 256-266.

[13] Wong, F. C., and Culler, D. E. Message Passing Interface Implementation on Active Messages. http://now.CS.Berkeley.EDU/Fastcomm/MPI/.