A Thread Taxonomy for MPI

Anthony Skjellum, Boris Protopopov, and Shane Hebert

Integrated Concurrent and Distributed Computation Research Laboratory
Department of Computer Science, Mississippi State University
PO Box 9637, Mississippi State, MS 39762
{tony, [email protected], [email protected]}
Abstract

In 1994, we presented extensions to MPI and offered an early paper on potential thread extensions to MPI, as well as non-blocking collective extensions to MPI [14]. The present paper is a thorough review of thread issues in MPI, including alternative models, their computational uses, and the impact on implementations. A number of issues are addressed: barriers to thread safety in MPI implementations (with MPICH as an example) and changes to the semantics of non-thread-safe MPI calls; different thread models, their uses, and possible integration. Minimal portable API extensions to MPI for thread management and synchronization mechanisms are considered. A tentative design for a multi-threaded, thread-safe ADI and Channel Device for MPICH is proposed. We consider threads both as an implementation device for MPI and as a user-level mechanism for achieving fine-grain concurrency. The reduction of the process to a simple resource container (as considered by Mach), with the thread as the main named computational unit, is suggested. Specific results thus far with the Windows NT version of MPICH are mentioned.
1 Introduction

The MPI message-passing specification is generally a thread-friendly specification. This means that conceptually nothing prevents application programmers from using MPI calls in multiple user threads. Since the multi-threaded computational model offers many advantages, the MPI standard would benefit from allowing multiple application-level threads of execution in MPI programs and from incorporating definitions for portable thread management and synchronization mechanisms. The disadvantages of the multi-threaded model are also considered. In order to allow multiple application-level threads, an MPI implementation should be thread-safe. The requirements for thread-safe MPI implementations are stated in this paper. The main issues in the design of a thread-safe MPI implementation are illustrated with the MPICH implementation as an example. The well-known MPICH implementation is highly portable and has served as a basis for several other MPI implementations. For this reason, the design of a thread-safe MPICH implementation is attractive. We determined that an efficient thread-safe MPICH implementation requires the Device to execute in separate threads. That is why we propose a tentative design for a multi-threaded, thread-safe ADI and Channel Device. Since inter-thread communications are less costly than inter-process ones, threads replace processes in many computational models. We propose to make threads independent named units of execution in MPI; processes become "resource holders" (holders of system resources shared by a pool of threads, plus cheap shared memory in the form of a common address space), an idea similar to the one instantiated in the Mach kernel design [11]. The rest of the paper is structured as follows. Advantages of multi-threaded computational models are briefly reviewed in the second section; obstacles to thread safety in MPI implementations (MPICH as an example) are discussed in the third section; and different thread models, their uses, and possible integration are considered in the fourth section. The fifth section surveys the requirements for thread packages that can be used to implement MPICH MT, the sixth section considers a portable thread management and synchronization API, and the seventh section describes the current state of the multi-threaded, thread-safe version of WinMPICH for Windows NT. Conclusions and future plans concerning WinMPICH MT are given at the end of the paper.
2 Advantages of Multi-threaded Computational Models

The advantages of using threads are discussed in detail in a number of publications [10, 13]. Among the most obvious are the easy use of multiple processors when they are available, easy latency hiding, and cheap inter-thread (as opposed to inter-process) communication and synchronization. Threads also allow a programmer to create well-structured code and to design it in a natural fashion as a collection of (possibly concurrent) sub-tasks with easily controlled priority [10]. An obvious disadvantage of using threads is the additional latency caused by thread synchronization and scheduling. Since the MPI software itself is "highly parallelizable," in the sense that all communication operations are independent of each other, it is natural to try performing them in parallel, using all the concurrency that particular hardware can provide. At the same time, one should be aware of the costs of thread synchronization and scheduling for a particular runtime environment, use a reasonable number of threads in applications, and try to hide the additional latency by amortizing these costs over the performance gains obtained with the use of multiple threads.
3 Obstacles to Thread Safety in Major Implementations

As was mentioned, the MPI message-passing specification is generally a thread-friendly one. This means that almost nothing prevents application programmers from using MPI calls in multiple user threads. Most MPI API calls have thread-safe semantics. In addition, the MPI Standard specifically states that MPI does not use any global data containing its internal state [1], so one does not have to worry about consistent updates to such data. At the same time, there are a number of issues that make the majority of current MPI implementations non-thread-safe. Several requirements should be met for an MPI implementation to be thread-safe. First of all, the implementation should use thread-safe runtime libraries (thread-safe versions of the C runtime libraries are now available on Solaris, Windows NT, etc.) and take special care with the correct use of system resources (file descriptors, environment variables, memory management, signals, etc.). Second, the design of the implementation should inherently consider multiple threads of execution calling its services. This assumes that all calls are reentrant, access to the data representing internal state is mutually exclusive, all static or global thread-specific data are made thread-specific with the means provided by a thread package (TLS), and thread-safe calls are called "in a thread-safe manner" (without inducing deadlocks or potential system malfunction). A simple example of a set of calls, each thread-safe on its own but whose sequence is not thread-safe, is the "test-and-do" operations (test whether a queue is non-empty and, if so, dequeue an item); all such operations should be atomic in multi-threaded implementations. One should make sure that all cases of split-phase "test-and-do" operations, especially those with a blocking "do" function, are eliminated from a thread-safe implementation. We should mention here that, upon thorough consideration of the MPI API calls, we found only two of them, namely MPI_Probe and MPI_Iprobe, having non-thread-safe semantics. We believe that multi-threaded extensions to the MPI standard should be transparent to single-threaded applications; hence, we can either specifically mention that the MPI_Probe and MPI_Iprobe calls are non-thread-safe and allow their usage only in single-threaded applications, or enforce their thread safety [12]. We can also change the semantics of the MPI_Probe and MPI_Iprobe calls in order to make them thread-safe. Specifically, the calls take two more arguments, a pointer to a preallocated buffer and its maximum size; now we probe for a message and, if it is there, we receive it; otherwise we indicate the absence of the message. One can notice the differences in semantics and effects between these calls and the MPI_Irecv call: the latter posts a receive operation and returns a request object which can be used later to obtain the status of the operation in progress. As was mentioned, the original calls have non-thread-safe semantics, so it is quite difficult to implement thread-safe versions of them efficiently. Nevertheless, the modified calls are correct thread-safe calls, and they can be added to the MPI API in place of, or along with, the current versions of these calls.
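To make the proposed semantics concrete, the following is a minimal sketch of such a combined probe-and-receive call, written against the standard MPI API. The name MPIX_Probe_recv and the flag output argument are our illustrative assumptions; only the two extra arguments (a preallocated buffer and its maximum size) come from the discussion above. Note that a user-level emulation like this one is still racy; in a real implementation, the probe and the receive would execute under a single internal lock.

```c
#include <mpi.h>

/* Hypothetical thread-safe replacement for MPI_Probe: probe for a
 * matching message and, if one is present, receive it into the
 * caller's preallocated buffer; otherwise report its absence. */
int MPIX_Probe_recv(void *buf, int maxsize, MPI_Datatype type,
                    int source, int tag, MPI_Comm comm,
                    int *flag, MPI_Status *status)
{
    int err = MPI_Iprobe(source, tag, comm, flag, status);
    if (err != MPI_SUCCESS || !*flag)
        return err;                    /* no message: indicate absence */

    /* A message is present: receive it using the exact source and tag
     * reported by the probe (important when wildcards were used).
     * Inside an implementation, this step and the probe above would
     * form one atomic operation. */
    return MPI_Recv(buf, maxsize, type, status->MPI_SOURCE,
                    status->MPI_TAG, comm, status);
}
```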
3.1 Non-thread-safe features of the current MPICH design

We are not able to discuss all existing MPI implementations in this paper; rather, we consider one of the best-known and most portable implementations, MPICH. Due to its generality, the current MPICH implementation neither directly supports a multi-threaded ADI nor allows multiple user threads to make MPI communication calls. It is "thread-friendly" [5, 6], and most of the design is thread-safe; indeed, all operations are monitored with the use of handles, which are created internally by the implementation and are opaque to users (users have request objects associated with handles). If applications use MPI operations in a meaningful manner (for example, they do not try to post an operation in one thread and complete it in another), the handles are thread-safe. Nevertheless, such a design works only if the ADI and Device are thread-safe and if they are called in a thread-safe manner (see above). This is not the case for ADI-1 [5, 6] or for ADI-2 [7, 8] (we will refer to ADI-1 and ADI-2 collectively as "the ADI" while considering their common problems concerning multi-threading). A simple example of non-thread-safe calls in the current MPICH Channel Device is the sequence "MPID_ControlMsgAvail(...); MPID_RecvAnyControl(...)", which often deadlocks with multiple application threads. Another important impediment in the ADI design is its "push-and-poll" strategy. Pushing communication operations and polling the Device for incoming messages are used to ensure that every process makes progress in its communications. These operations are necessary because the Device and user code are executed in the same thread. If there are several user threads, pushing and polling become inconvenient, because threads have to do them in a mutually exclusive manner and constantly check for messages in the unexpected-receives queue (other threads could already have placed a message there). We believe that the main flaw of such a design is that it takes the user's processor time to perform internal system operations, which should not be the case on multi-processor platforms. This is why we propose to implement the ADI and Device as a collection of threads that work asynchronously with respect to the user's code (and physically concurrently, on other processors, if possible); in this case, it does not make much sense to poll a multi-threaded device, since asynchronous notifications are used. Pushing is also unnecessary; instead, we can use separate sender and receiver threads and design them in such a way that they do as much as possible and wait only on empty queues. It would be wrong to say that it is impossible to modify the standard ADI for use with an MPICH implementation allowing multiple MPI user threads (we further call it MPICH MT). The designers of the ADI provided basic mechanisms for communicator-level and data-level mutual exclusion. Nevertheless,
the "push-and-poll" and split-phase "test-then-do" nature of the standard ADI, as well as some other points discussed below, make it quite difficult (but not impossible) to make the required modifications while following good software engineering practices and still producing fairly efficient software. Consider the message-processing engine of MPICH. Communication operations are monitored with the use of handles, data structures which entirely describe a communication operation (source, destination, communication context, completion status, etc.). The MPICH implementation internally allocates such handles for every operation. These handles are supposed to be transparent to users; the request objects which users allocate for non-blocking operations are actually pointers to the handles. There are two queues of handles for incoming messages: the queue of posted handles and the queue of handles for unexpected messages. Incoming messages which are received from the Device are matched against the information in the handles of the posted-handles queue. If a match is found, the message is associated with the handle in the queue; otherwise, a new handle is created by the ADI and enqueued into the queue for unexpected messages. If the code of the ADI is executed in user threads, the ADI should poll for incoming messages, remove them from the Device, and process them in every user thread. The current ADI uses blocking and non-blocking forms of a function MPID_CheckDevice(...) for this purpose. This function is quite simple (we now consider the Channel Device). It uses the MPID_ControlMsgAvail(...) and MPID_RecvAnyControl(...) functions to check for and receive messages. Such a design is not thread-safe for the reasons discussed above. As a matter of fact, one needs an atomic check-and-receive operation. A non-blocking version of the MPID_RecvAnyControl(...) function, which can be added to the Channel Device, could serve such a purpose (see below). But one cannot use the blocking version of MPID_CheckDevice(...) in the thread-safe ADI anyway, because it is impossible to know whether the message is still in the Device or has already been placed in one of the queues by some other thread which also polls the same device. The current MPICH implementation liberally uses the blocking version of MPID_CheckDevice(...). This blocking call must be emulated with a loop around the non-blocking version, accompanied by calls which look for the required message in the unexpected-receives queue and process it if it is found. Otherwise, the current application thread is supposed to yield its time slice in order to allow other threads to be scheduled. This form of busy waiting does not benefit the performance of the whole implementation and is a direct consequence of the design of the ADI. Another blow to the efficiency of the system is that multi-threaded implementations cannot use optimized blocking receive operations, for the same reason as with the blocking MPID_CheckDevice: a message can already have been placed in the unexpected-receives queue by some other application thread. An efficient implementation of a thread-safe MPICH should also consider using multiple-reader/single-writer data-level locks instead of the basic mechanisms for data-level mutual exclusion proposed and implemented in the current MPICH. Such locks are available in some thread libraries (Solaris 2.5 threads) or can be implemented using basic mutual-exclusion constructs. In general, a portable thread-safe MPICH implementation should have a platform-independent set of synchronization primitives. This can be a set of macros which provides the functionality of the Solaris primitives (mutexes, semaphores, condition variables, multiple-reader/single-writer locks, etc.) or an equivalent set of macros, which should be considered as a part of the MPI API. Section 6 considers this issue in more detail.
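As an illustration, the following is a minimal sketch of a multiple-reader/single-writer lock built from basic mutual-exclusion constructs of the kind suggested above. POSIX threads are used for concreteness, and the type and function names (mpid_rwlock_t, mpid_rdlock, etc.) are our own; Solaris 2.5 provides equivalent locks directly. This simple version can starve writers; a production version would also track waiting writers.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;      /* signaled whenever the lock state changes */
    int             readers; /* number of active readers                 */
    int             writer;  /* nonzero while a writer is active         */
} mpid_rwlock_t;

void mpid_rwlock_init(mpid_rwlock_t *rw)
{
    pthread_mutex_init(&rw->m, NULL);
    pthread_cond_init(&rw->cv, NULL);
    rw->readers = 0;
    rw->writer  = 0;
}

void mpid_rdlock(mpid_rwlock_t *rw)       /* shared (read) access */
{
    pthread_mutex_lock(&rw->m);
    while (rw->writer)                    /* wait out any active writer */
        pthread_cond_wait(&rw->cv, &rw->m);
    rw->readers++;
    pthread_mutex_unlock(&rw->m);
}

void mpid_wrlock(mpid_rwlock_t *rw)       /* exclusive (write) access */
{
    pthread_mutex_lock(&rw->m);
    while (rw->writer || rw->readers > 0) /* wait for exclusive access */
        pthread_cond_wait(&rw->cv, &rw->m);
    rw->writer = 1;
    pthread_mutex_unlock(&rw->m);
}

void mpid_unlock(mpid_rwlock_t *rw)
{
    pthread_mutex_lock(&rw->m);
    if (rw->writer) rw->writer = 0;
    else            rw->readers--;
    pthread_cond_broadcast(&rw->cv);      /* wake waiting readers/writers */
    pthread_mutex_unlock(&rw->m);
}
```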
4 Different Thread Models, Their Uses, and Possible Integration

As was mentioned earlier, one can consider multi-threaded implementations of MPICH; on the other hand, it is beneficial to allow multiple threads in the application programs which use MPICH. Let us consider these two cases in order.
4.1 Tentative design of multi-threaded thread-safe MPICH (MPICH MT)

The most important advantages of the current MPICH design are portability and efficiency. Of course, we would like to retain portability and give up as little performance as possible in the design of a multi-threaded ADI and Channel Device. We hope to obtain substantial gains in the performance of the multi-threaded implementation by eliminating all the polling, pushing, and busy waiting which take place in the current implementation. This will allow us to amortize the additional latencies and processor time taken by thread scheduling and synchronization over these expectedly large performance gains. We propose the following tentative design for MPICH MT (Figure 1).
[Figure 1 appears here: Tentative design of the MPICH MT. It depicts the three software layers (API, ADI, Device); user threads and device threads (Sender, Receiver, and Terminator); and the queues of posted sends, posted receives, unexpected receives, synchronous sends in progress, and rendezvous sends in progress.]
As in the original MPICH [6, 5], there are three layers of software: API, ADI, and Device. An MPI application consists of user threads, which call API and ADI functions, and device threads, which execute asynchronously and possibly concurrently with the user threads. Multiple devices can be loaded as necessary at run time as dynamically linked libraries (DLLs). The Device consists of three specialized threads: Sender, Receiver, and Terminator. The Sender and Receiver make sure that everything that can be sent (or received) is sent (or received) at any particular moment in time. The Terminator sits in an efficient wait state, waiting for the signal to clean up its environment and terminate the process. Such a design allows us to dispense with pushing and polling: draining the Device and pushing sends are performed asynchronously with respect to the user code. At the ADI level, there are three queues of pointers to communication handles (we use the terminology adopted in the MPICH implementation of MPI; the handles are the MPICH internal handles mentioned above). It is important to note that we propose to use an asynchronous event object instead of the completion flag in the handle object, in order to allow efficient test and wait operations on the handles. Asynchronous events are available in POSIX-compliant thread packages (Solaris and Windows NT, for example). We realize that the creation of such objects incurs some overhead, which depends on the thread package used. That is why caching and reuse of communication handles will be employed to reduce this overhead. (We are also working on a design
which does not require events in handles; instead, threads are suspended and resumed as needed.) The three queues mentioned above contain pointers to pending send handles, posted receive handles, and unexpected receive handles, respectively. These queues are thread-safe and are accessed by the user threads as well as by the Sender and Receiver. Three thread-safe operations are provided: MPID_EnqueueHandle(...), MPID_DequeueHandle(...), and MPID_ScanQueue(...) (a sketch of the queue operations is given at the end of this subsection). It is important to mention that the queues have dedicated events associated with them which signal their special states, such as the queue being non-empty. These events are used to notify Device threads to do some work. The rest of the ADI calls mostly repeat the functionality of the existing ADI, except for the device-draining calls. In particular, there are functions which initialize, finalize, and abort the operation of the ADI; return a process's rank and the world size; post, test, complete, and cancel communication operations (short, long, synchronous, ready, etc.); and allocate, free, and reuse communication handles. As in the case of the API, MPID_Probe(...) and MPID_Iprobe(...) have the changed semantics described above. At the Device level, there are queues of the handles associated with operations that require collaboration between the Sender and Receiver, such as synchronous sends, "very long" operations performed with the use of the rendezvous protocol, operations using the get protocol, etc. One can notice that all functions of the ADI can be implemented with the use of the MPID_ScanQueue(...), MPID_EnqueueHandle(...), and MPID_DequeueHandle(...) calls and with test/wait operations on the event objects in the communication handles. A substantial part of MPICH MT's functionality is delivered by the Sender and Receiver threads in this design, especially in the case of operations which require collaboration between the Sender and Receiver in order to complete. This increases the complexity of the threads, which is not desirable. Trade-offs in terms of moving operation-completion control functionality between the threads and the ADI can be considered. At the same time, the Device described above can also take advantage of the modular approach taken in designing previous MPICH ADIs. The Device can use simple low-level communication operations such as RecvAnyControl(...), RecvData(...), SendControl(...), and SendData(...), which constitute the Channel Device. The semantics of these calls remains the same; special care should be taken while implementing them for a given platform. The only call that should be omitted is
ControlMsgAvail(...), because it is not thread-safe when used with RecvAnyControl(...) (compare with the MPI_Probe(...), MPI_Recv(...) call sequence). Instead, a non-blocking version of RecvAnyControl(...), which checks the device for an incoming message (and, if there is a message, receives it; otherwise returns false), should be added to the minimal version of the Channel Device. With these changes, and with macros for thread-management calls and synchronization mechanisms, we could provide a way to easily port this multi-threaded, thread-safe design to different platforms.
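The following is a minimal sketch of the thread-safe handle queues and their "not empty" notification, assuming POSIX threads (a condition variable plays the role of the event object). Only the operation names come from the design above; the structure layouts and the blocking-dequeue behavior are our assumptions for illustration.

```c
#include <pthread.h>
#include <stddef.h>

/* Placeholder for the MPICH internal handle; the real structure holds
 * source, destination, context, completion state, and so on. */
typedef struct MPID_Handle {
    struct MPID_Handle *next;
    /* ... operation data ... */
} MPID_Handle;

typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  not_empty;   /* the "queue not empty" event */
    MPID_Handle    *head, *tail;
} MPID_HandleQueue;

void MPID_EnqueueHandle(MPID_HandleQueue *q, MPID_Handle *h)
{
    pthread_mutex_lock(&q->m);
    h->next = NULL;
    if (q->tail) q->tail->next = h; else q->head = h;
    q->tail = h;
    pthread_cond_signal(&q->not_empty);  /* wake a Sender/Receiver thread */
    pthread_mutex_unlock(&q->m);
}

/* Blocking dequeue: a device thread waits only on an empty queue,
 * never by polling, which is the point of the design above. */
MPID_Handle *MPID_DequeueHandle(MPID_HandleQueue *q)
{
    MPID_Handle *h;

    pthread_mutex_lock(&q->m);
    while (q->head == NULL)
        pthread_cond_wait(&q->not_empty, &q->m);
    h = q->head;
    q->head = h->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->m);
    return h;
}
```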
4.2 Multiple user threads in MPICH MT

In general, several cases of multiple-thread usage with currently existing MPICH applications are possible. We should mention that the default requirement for the practical realization of all cases discussed below is a thread-safe MPICH implementation. The classification has two dimensions and is by no means exhaustive. One dimension: I. usage of multiple threads which requires no changes or additions to the semantics of MPI; II. usage of multiple threads which requires some changes or additions to the semantics of MPI. The other dimension concerns communicating entities: I. the MPICH implementation has ranks assigned at the process level, so one cannot make MPICH aware of different threads explicitly; II. the MPICH implementation can assign ranks to threads. The main difficulty of cases (I,I) and (II,I) is that it is impossible to directly identify threads from different processes which share the same communication space, and hence to construct meaningful communication patterns. A detailed discussion of this issue, with a survey of related approaches and our suggestions, can be found in [16]. In case (I,I), there may be several MPICH application-level threads communicating in parallel among each other. In this case, applications use MPICH to perform thread-safe point-to-point communications, but they are responsible for making sure that messages are received by the correct threads (for example, they use custom tag-assignment strategies; one should use wildcards cautiously). This requires a thread-safe MPICH implementation: access to global variables and shared data is synchronized, functions are reentrant, deadlock-free semantics of calls is enforced (see the consideration of "test-and-do" calls above), runtime system libraries are thread-safe (or the programmer takes special care while using system resources), etc. At the same time, one cannot explicitly use tags for collective operations among several groups of threads. Hence, it is natural to use communicators, an already existing means of providing a safe communication space. The only complication is that one has to consider the overhead of creating new communicators for new threads as necessary. One should also make sure that MPI communicator-management calls (which are collective operations in their own right) are executed in a safe manner. A straightforward way to accomplish this is to preallocate N collective contexts with every communicator and then use them for communicator management; all parallel communicator-management calls beyond N will be executed sequentially. The other possible solution to the problem of thread identification is case (II,I), where we propose to extend the notion of communicators by allowing a variable number of contexts for point-to-point as well as collective operations inside one communicator, and by providing an extended API for users to dynamically allocate contexts. The threads share the communication subspaces associated with each separate context in accordance with the order of thread creation. The apparent flaw of this design is that it works only for the first nested level of thread creation, but this is probably enough for a wide range of threaded applications. The details, as well as the consideration of an alternative approach [2], can be found in [16]. All of the above concerns MPI implementations which have system ranks assigned at the process level; one cannot make MPI aware of different threads explicitly. The alternatives are cases (I,II) and (II,II) in our classification: the MPI implementation has ranks assigned to threads, but this does not necessarily cause changes or additions to the semantics of MPI communication calls, except for the obvious one: the communicating entities are threads rather than processes. This model is actually very similar to the one represented in the Mach kernel design, where the role of processes is reduced to simple resource containers [11] with threads as the main named computational units. Basically, two kinds of threads can be considered: MPI threads and user threads. The latter do not have distinct MPI ranks; their role is the same as the role of multiple threads in case (I,I) described above. MPI threads are used in place of MPI processes. There are several advantages to this model. First, we save a lot on inter-thread as opposed to inter-process communications. Second, all threads freely share
memory through the common address space; hence, most of the ADI and the platform-dependent device code can be simplified to a substantial extent. With this view of the role of threads in mind, one can easily conceive of a "shared-memory" MPICH implementation which uses threads instead of processes in SPMD applications, as an example. It does not require any changes or additions to the semantics of MPI calls or the MPI runtime system. But since mpirun creates threads instead of processes, the way MPI programs are written must be modified. In particular, a user application should be built as a dynamically linked library (DLL), and its main function should be callable in a standard way. The mpirun in this case is a simple program which loads the DLL and starts the number of threads specified on the command line (a sketch is given below). The rest of the picture is the same as for MPI with multiple processes. This design also allows natural dynamic creation of communicating entities. One simply has to spawn a pool of threads in the current or a separate process address space and update the global information about running entities. If a separate address space is created, one also has to establish an inter-process channel, which can be a shared-memory region for the one-platform case and, for example, a TCP device for the inter-platform case. A detailed discussion of the architecture of MPI software with user threads as named execution units can be found in [16]. The design of a multi-threaded and thread-safe MPI implementation is not just an interesting mental exercise; it finds practical application as a convenient means for implementing non-blocking collective operations [4, 12, 15, 16], and it extends the sphere of MPI applications to span new models of parallel client-server computation based on threads instead of processes, which take advantage of reduced thread context-switch times and easy inter-thread communications.
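As a sketch of this style of mpirun, the following Win32 program loads the application DLL and starts the requested number of MPI threads. The exported entry-point name (MPIUserMain), its signature, and the command-line format are our illustrative assumptions, not part of any existing mpirun.

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed signature of the user program's exported entry point;
 * the argument carries the thread's would-be rank. */
typedef DWORD (WINAPI *UserMainFn)(LPVOID);

int main(int argc, char **argv)
{
    HMODULE    app;
    UserMainFn user_main;
    HANDLE    *threads;
    int        i, n;

    if (argc < 3) {
        fprintf(stderr, "usage: mpirun <app.dll> <nthreads>\n");
        return 1;
    }

    app = LoadLibrary(argv[1]);                 /* load the user program */
    if (!app) { fprintf(stderr, "cannot load %s\n", argv[1]); return 1; }

    user_main = (UserMainFn)GetProcAddress(app, "MPIUserMain");
    if (!user_main) { fprintf(stderr, "no MPIUserMain in %s\n", argv[1]); return 1; }

    n = atoi(argv[2]);                          /* threads in place of processes */
    threads = malloc(n * sizeof(HANDLE));
    for (i = 0; i < n; i++)                     /* one MPI thread per "rank" */
        threads[i] = CreateThread(NULL, 0, user_main,
                                  (LPVOID)(INT_PTR)i, 0, NULL);

    /* For simplicity this assumes n <= MAXIMUM_WAIT_OBJECTS (64). */
    WaitForMultipleObjects(n, threads, TRUE, INFINITE);

    free(threads);
    FreeLibrary(app);
    return 0;
}
```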
5 Requirements for Thread Packages for MPICH MT

In order to implement the ideas described above, one needs to use an existing thread package which provides the required API and functionality. Since case (I,I) is the closest non-trivial extension of the existing software, we will concentrate on it below. In general, a thread library which can be used to implement case (I,I) in our classification should provide a programmer with generic means of thread management and synchronization. With careful design, one might not need thread-level intra-process synchronization. It should be mentioned that it might be necessary to use thread-specific storage while implementing case (I,I) (for example, to make the sequence RecvAnyControl(...) and RecvData(...) in the Channel Device thread-safe). Most existing thread packages provide the necessary functionality. There are also several runtime environments [3, 9] which provide support for multiple threads of execution and could theoretically be used to implement MPICH MT. At the same time, the usage of such systems would incur a substantial overhead connected with runtime-system initialization and operation. Unless these runtime systems are used to reimplement MPI completely (in which case one could take advantage of the attractive abstractions supported there, such as contexts, global pointers, remote service requests, etc.), or it is possible to use some of their standalone modules (such as the Nexus protocol and thread modules), we believe it does not make much sense to use them in MPI. Existing thread packages support two different multi-threaded programming models. The more general one has two levels: lightweight processes (LWPs), which are directly bound to kernel threads and require a context switch to the kernel to be rescheduled, and user-level threads, which are used by application programmers through the API. These user-level threads can be either bound to particular LWPs (one-to-one) or multiplexed over a pool of LWPs (many-to-many). If a thread is bound to an LWP, it is rescheduled with that LWP; otherwise, it is executed by different LWPs, and switching between such threads occurs without kernel intervention. Such a model (Solaris 2.5 threads) makes it possible to meet the requirements of a wide range of applications. It separates logical parallelism from physical parallelism and allows control over the amount of system resources consumed by the program. Thread packages which implement this model also have synchronization mechanisms inexpensive enough to match the speed of switching between threads. This allows for the efficient implementation of applications with high logical parallelism. The other model has only one level, which corresponds to the LWPs: these are schedulable entities in their own right, but their rescheduling is expensive. If a process dynamically creates a variable number of threads, one should make sure that the application uses a bounded amount of system resources (Windows NT 3.51 threads). Obviously, the first model has the advantages mentioned above, but careful coding can
ensure proper use of system resources. At the same time, some such systems (Windows NT threads) also have advantages: they provide a clean and intuitive API with a rich set of synchronization primitives and the means to control scheduling dynamically through thread priorities. (A brief sketch of the two-level model in use is given below.)
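For concreteness, here is a minimal sketch of the two-level model using the Solaris threads API, contrasting an unbound thread multiplexed over the LWP pool with a bound thread that owns its LWP. The worker function and its use are illustrative assumptions.

```c
#include <thread.h>   /* Solaris 2.x threads API */

/* Illustrative worker; in MPICH MT this might be a device Receiver. */
void *worker(void *arg)
{
    /* ... communication or computation ... */
    return arg;
}

int start_workers(void)
{
    thread_t unbound, bound;

    /* Unbound: multiplexed over the pool of LWPs; switching between
     * such threads does not involve the kernel. */
    thr_create(NULL, 0, worker, (void *)0, 0, &unbound);

    /* Bound: tied one-to-one to its own LWP and rescheduled with it;
     * appropriate for a thread that must run concurrently with user
     * code, such as a device thread on another processor. */
    thr_create(NULL, 0, worker, (void *)1, THR_BOUND, &bound);

    thr_join(unbound, NULL, NULL);
    thr_join(bound, NULL, NULL);
    return 0;
}
```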
6 Portable Thread Management Calls and Synchronization Mechanisms API

Since MPICH MT must be portable, one can provide macro interfaces for thread management and synchronization. A natural approach is to provide a minimal set of macros which mimic the API of one of the thread packages of the "Windows NT" type described above, and also to have a set of auxiliary macros which extend this interface to take advantage of thread packages of the "Solaris" type. The same approach can be taken for defining MPI's portable synchronization mechanisms as macros. This could be a minimal subset of the Windows NT synchronization-object API [13]. We show a tentative API for thread management and synchronization in [16].
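To illustrate the flavor of such a macro layer, the following is a minimal sketch of mutual-exclusion macros mapped onto the native primitives of Windows NT and Solaris. The MPID_MUTEX_* names are hypothetical placeholders, not the API proposed in [16].

```c
/* Portable mutual-exclusion macros: each expands to the native
 * primitive of the platform's thread package. */
#ifdef _WIN32

#include <windows.h>
typedef CRITICAL_SECTION MPID_Mutex;
#define MPID_MUTEX_INIT(m)    InitializeCriticalSection(m)
#define MPID_MUTEX_LOCK(m)    EnterCriticalSection(m)
#define MPID_MUTEX_UNLOCK(m)  LeaveCriticalSection(m)
#define MPID_MUTEX_DESTROY(m) DeleteCriticalSection(m)

#else  /* Solaris 2.x threads */

#include <thread.h>
#include <synch.h>
typedef mutex_t MPID_Mutex;
#define MPID_MUTEX_INIT(m)    mutex_init((m), USYNC_THREAD, NULL)
#define MPID_MUTEX_LOCK(m)    mutex_lock(m)
#define MPID_MUTEX_UNLOCK(m)  mutex_unlock(m)
#define MPID_MUTEX_DESTROY(m) mutex_destroy(m)

#endif
```

Auxiliary macros for condition variables, semaphores, and reader/writer locks would follow the same pattern, with the "Solaris"-type extensions supplied only where the underlying package supports them.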
7 Multi-threaded MPICH for Windows NT

We have developed a multi-threaded MPICH device for the Windows NT operating system using the Win32 API. The device has three threads which act on behalf of the user process to handle message passing. These threads operate with thread-safe queues, ensure in-order transfer of packets to other processes, and fairly service incoming packets from other processes. Designing a multi-threaded device was our first step in designing a multi-threaded MPI implementation for Windows NT. At present, we have partially implemented a thread-safe MPICH for Windows NT which allows multiple user threads to use point-to-point MPI communication calls. It should be mentioned that we have retained the original design of MPICH, which, as argued above, does not allow for the more efficient threaded approach. We plan to finish our work on this thread-safe implementation for Windows NT in Q2'96 and to test the performance of our implementation using the set of performance tests provided with the MPICH distribution, in order to compare our results with other threaded and non-threaded implementations.
8 Conclusions

As was shown, the current ADI design jeopardizes the efficiency of a thread-safe MPICH implementation and should be reconsidered. We have proposed another design to remedy this deficiency. We believe that thread-safe runtime environments and POSIX-compliant thread packages will be fully developed in the near future for a wide variety of platforms, and that multi-threaded computational models will replace multi-process ones in the majority of cases. For these reasons, we think that the provision of an efficient thread-safe MPI implementation is an important task.
Acknowledgements

We would like to thank our colleagues and coworkers, especially Purushotham Bangalore, for their help with TeX and proofreading of this paper.
References

[1] MPI Forum. The MPI Message-Passing Interface Standard. http://www.mcs.anl.gov/mpi/mpireport/mpi-report.html, November 1995.

[2] Ian Foster, Carl Kesselman, and Marc Snir. Generalized Communicators in MPI. http://extreme.indiana.edu/ports/proposals/endpoints.ps, May 1996.

[3] Ian Foster, Carl Kesselman, and Steven Tuecke. The Nexus Task-Parallel Runtime System. ftp://ftp.mcs.anl.gov/pub/nexus/reports/india_paper.ps.Z, April 1996.

[4] Al Geist, Ewing Lusk, William Gropp, William Saphir, Steve Huss-Lederman, Tony Skjellum, Andrew Lumsdaine, and Marc Snir. MPI-2: Extending the Message-Passing Interface. In EuroPar'96, February 1996.

[5] William Gropp and Ewing Lusk. MPICH ADI Implementation Reference Manual. ftp://info.mcs.anl.gov/pub/mpi/adiman.ps, December 1995.

[6] William Gropp and Ewing Lusk. MPICH Working Note: Creating a New MPICH Device Using the Channel Interface. http://www.mcs.anl.gov/home/lusk/mpich/workingnote/newadi/note.html, December 1995.

[7] William Gropp and Ewing Lusk. MPICH Working Note: The Implementation of the Second-Generation MPICH ADI. ftp://info.mcs.anl.gov/pub/mpi/workingnote/adi2imp.ps, May 1996.

[8] William Gropp and Ewing Lusk. MPICH Working Note: The Second-Generation ADI for the MPICH Implementation of MPI. ftp://info.mcs.anl.gov/pub/mpi/workingnote/nextgen.ps, May 1996.

[9] Matthew Haines, David Cronk, and Piyush Mehrotra. On the Design of Chant: A Talking Threads Package. http://meru.uwyo.edu/~haines/proj/papers/sc94/sc94.html, April 1996.

[10] Bil Lewis and Daniel J. Berg. Threads Primer: A Guide to Multithreaded Programming. SunSoft Press, 1996.

[11] Keith Loepere. Mach 3 Kernel Principles. ftp://www.cs.cmu.edu/afs/cs/project/mach/public/doc/osf/kernel_principles.ps, May 1996.

[12] MPI Forum. MPI-2: Extensions to the Message-Passing Interface (Draft Proposal). http://www.mcs.anl.gov/Projects/mpi/mpi2/mpi2.html, April 1996.

[13] Jeffrey Richter. Advanced Windows: The Developer's Guide to the Win32 API for Windows NT and Windows 95. Microsoft Press, 1995.

[14] Anthony Skjellum, Nathan Doss, and Kishore Viswanathan. Inter-communicator Extensions to MPI in the MPIX Library. Poster at Supercomputing'94, 1994.

[15] Anthony Skjellum, Nathan E. Doss, Kishore Viswanathan, Aswini Chowdappa, and Purushotham V. Bangalore. Extending the Message-Passing Interface (MPI). In Anthony Skjellum and Donna Reese, editors, Scalable Parallel Libraries Conference II, Mississippi State University. IEEE Computer Society Press, October 1994.

[16] Anthony Skjellum, Boris Protopopov, and Shane Hebert. A Thread Taxonomy for MPI Specification. Technical report, Department of Computer Science, Mississippi State University, May 1996. Under preparation.