Chant: Lightweight Threads in a Distributed Memory Environment

Matthew Haines
Piyush Mehrotra
David Cronk
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center, Mail Stop 132C
Hampton, VA 23681-0001

June 8, 1995

Abstract
Lightweight threads are becoming increasingly useful in supporting parallelism and asynchronous events in applications and language implementations. Traditionally, lightweight threads are supported only within the single address space of a process, or in shared memory environments with multiple processes. We introduce and describe the design of Chant, a runtime system supporting lightweight threads in a distributed memory environment. In addition to communication between any two threads in the system, Chant provides support for remote service requests, remote thread operations, and collective communication between thread groups called ropes. Chant provides the first implementation of lightweight threads for a distributed memory platform whose design incorporates existing standards for lightweight threads and interprocess communication. This paper details the issues that arise in extending a standard threads package to support distributed memory execution, and the solutions that are provided by the Chant system.
Index Terms: Lightweight threads, interprocess communication, pthreads, MPI, distributed computing, parallel computing.

(Research supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480, while the authors were in residence at ICASE, NASA Langley Research Center, Hampton, VA 23681.)
1 Introduction

Lightweight, user-level threads are becoming increasingly useful in supporting parallelism and asynchronous events in applications and language implementations, for both parallel and sequential machines. Threads are used in simulation systems [17, 40] to represent asynchronous events that can be mapped onto single or multiple processors; they are used in language implementations to provide support for coroutines [39], Ada tasks [32], and parallel C++ method invocations [31]; and they are used in generic runtime systems [15, 20, 46] to support fine-grain parallelism, multithreading, and language interoperability. In light of their increasing use, the IEEE committee for Portable Operating System Interfaces for Computer Environments (POSIX) has adopted a standard interface for lightweight threads within a Unix process [25], and numerous thread libraries have been designed and implemented for workstations and shared memory multiprocessors [1, 6, 17, 27, 33, 42]. Despite their popularity and utility in shared memory systems, lightweight thread packages designed for distributed memory systems have received little attention.
[Figure 1: The Chant runtime system. The Chant user interface (ropes, remote thread operations, remote service requests, point-to-point message passing) sits atop the Chant system interface, which maps onto a communication library (MPI, NX, ...) and a lightweight thread library (Ports0/pthreads).]
This is unfortunate: in a distributed memory system, lightweight threads can overlap communication with computation [11, 12, 19]; they can emulate virtual processors [29, 35]; and they can permit dynamic scheduling and load balancing [9]. However, there is no widely accepted implementation of a distributed memory threads package.

We introduce the term talking threads to represent the notion of two threads in direct communication with each other, regardless of whether they exist in the same address space or not. In this paper, we describe the design of a runtime system for talking threads called Chant. Chant is capable of supporting point-to-point communication between any two threads in the system using standard lightweight thread and interprocess communication libraries. Point-to-point primitives [13] are needed to support programs using portable communication libraries [8, 44] and those generated by parallelizing compilers [24, 28, 47]. In addition, Chant can support remote service request primitives, used for RPC communication [38, 45], client-server applications, and irregular codes. Finally, Chant provides direct support for data parallel applications in the form of thread groups called ropes.

Chant is designed as a layered system (as shown in Figure 1), where efficient point-to-point communication provides the basis for implementing remote service requests and, in turn, remote thread operations and collective communication. Our overall goal is to build a runtime system capable of supporting talking threads based on accepted standards for lightweight threads and interprocess communication. Our design goals are:

1. portability, achieved by implementing Chant atop existing lightweight thread and communication systems that have been designed to be portable;

2. efficiency, achieved by ensuring a constant time overhead in processing messages without making extra buffer copies, and by exposing all levels of the design to the user so that the appropriate level of speed and functionality can be employed; and

3. utility, achieved by building support for additional functionality atop efficient point-to-point message passing.

This system is being used to support, among other things, our extensions to the High Performance Fortran standard for integrating task and data parallelism [22].

The remainder of the paper is organized as follows: Section 2 provides background on lightweight thread and interprocess communication systems. Section 3 details the design of Chant. Section 4 provides performance results describing the overhead of Chant as compared with the underlying systems. Section 5
discusses related research projects, and we provide concluding remarks and future directions in Section 6. The Appendix provides a complete listing of the Chant user interface.
2 Background

Chant provides an interface for talking threads by building atop existing systems for lightweight threads and interprocess communication. Therefore, our design provides two interfaces (cf. Figure 1):

- the user interface, shown in Appendix A, through which a programmer or compiler interacts with the system, and

- the system interface, which defines the capabilities required by the underlying system so that Chant can be ported to any sub-system onto which this interface can be mapped.

The system interface is divided into two portions, one for lightweight threads and one for interprocess communication. Thus, the portability of Chant is defined by the system interface and the underlying packages that can accommodate that interface. By basing the system interface on existing standards for lightweight threads and interprocess communication, Chant is positioned to be highly portable across many distributed platforms, including multicomputers and workstation clusters. We now provide a brief discussion of these systems and their required capabilities.
2.1 Lightweight Threads

A thread represents an independent, sequential unit of computation that executes within the context of a kernel-supported entity, such as a Unix process. Threads are often classified by their "weight," which corresponds to the amount of context that must be saved when a thread is removed from the processor, and restored when a thread is reinstated on a processor (i.e., a context switch). For example, the context of a Unix process includes the hardware registers, kernel stack, user-level stack, interrupt vectors, page tables, and more [2]. The time required to switch this large context is typically on the order of thousands of microseconds, and therefore a Unix process represents a heavyweight thread. Contemporary operating system kernels, such as Mach, decouple the thread of control from the address space, allowing for multiple threads within a single address space and reducing the context of a thread. However, the context of a thread and all thread operations are still controlled by the kernel, which must often include more state than a particular application cares about. Context switching times for kernel-level threads are typically in the hundreds of microseconds, resulting in a medium or middleweight thread. By exposing all of the thread state and operations at the user level, a minimal context for a particular application can be defined, and operations to manipulate threads can avoid crossing the kernel interface. As a result, user-level threads can be switched in the order of tens of microseconds, and are thus termed lightweight. For the remainder of this paper, we will use the terms "thread," "lightweight thread," and "user-level thread" synonymously.

Execution of lightweight threads within a process is controlled by a thread-level scheduler, whose job is to determine the next available thread that should execute and to switch between threads at context-switching points. Thread scheduling can either be

- non-preemptive (typically FIFO), in which a thread will continue to execute until it completes, explicitly yields control of the processor, or blocks on a synchronization primitive; or

- preemptive (typically round-robin), in which each thread is given a time quantum and is interrupted when the quantum expires.
Thread Package      Create   Switch   Description
cthreads [33]        423       81     Originally developed as the Mach user-level
                                      threads package; has been ported to many
                                      machines.
REX [30]             230       60     The REX lightweight process library defines
                                      a minimal, non-preemptive, priority-based
                                      threads package for a number of workstations
                                      and shared memory multiprocessors.
pthreads [32]       1260       43     A library implementation of the POSIX
                                      pthreads standard interface, draft 6.
Sun LWP [42]         400       25     The Sun Lightweight Process library provides
                                      a comprehensive set of thread routines
                                      supporting priorities, user-defined contexts,
                                      and stack management routines; only available
                                      under the SunOS 4.x operating system.
Quickthreads [27]    440       21     A low-level, portable set of stack primitives
                                      for writing efficient thread packages.
Table 1: Performance (in µs) of several thread packages on a Sun SparcStation 10

Most lightweight thread packages contain functionality for creating, deleting, scheduling, and synchronizing threads in a shared memory (uniprocessor or multiprocessor) environment. Other features, such as control over stacks, signal handling within threads, thread-local data, and priority scheduling, are only available in certain systems. Table 1 lists several of these lightweight thread packages, including a comparison of their thread creation and context switching times that we gathered on a Sun SparcStation 10.

Although the POSIX committee has established a standard interface for lightweight threads, the actual implementations are limited to a few systems. Therefore, in cooperation with the Portable Runtime Systems (PORTS) consortium [37], we have established a minimal pthreads interface, called ports0, that can be easily ported to any of the thread-based systems listed in Table 1. The minimal functionality required by the Chant system interface for a lightweight thread package includes thread creation (create), thread preemption (yield), thread destruction (exit), and synchronization with another thread (mutex). This functionality is supported by pthreads, ports0, and most other lightweight thread packages. Chant can also accommodate both preemptive and non-preemptive thread scheduling policies.
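For concreteness, the following sketch expresses this minimal required functionality directly in terms of POSIX pthreads calls; the chant_ wrapper names are our illustration, not part of the Chant or POSIX interfaces:

    #include <pthread.h>
    #include <sched.h>

    /* Minimal thread functionality required by the Chant system interface,
       mapped onto POSIX pthreads (a sketch, not the actual Chant source). */
    int chant_thread_create(pthread_t *t, void *(*fn)(void *), void *arg)
    {
        return pthread_create(t, NULL, fn, arg);   /* create */
    }

    void chant_thread_yield(void)
    {
        sched_yield();   /* yield; draft-6 pthreads offered pthread_yield() */
    }

    void chant_thread_exit(void)
    {
        pthread_exit(NULL);                        /* exit */
    }

    int chant_mutex_lock(pthread_mutex_t *m)       /* mutex */
    {
        return pthread_mutex_lock(m);
    }

    int chant_mutex_unlock(pthread_mutex_t *m)
    {
        return pthread_mutex_unlock(m);
    }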
2.2 Interprocess Communication

Communication systems for distributed memory architectures have traditionally been provided by the vendors, such as the Intel NX primitives [26] and nCUBE Vertex primitives [34]. In response to the increasing demands of portability, several communication libraries have been established that provide a portable message passing interface over a wide variety of systems. Among these libraries, p4 [8] and PVM [44] have received the most attention. Recently, in an effort to unify the message passing community and entice vendors to support a single message passing interface, the Message Passing Interface Forum was established to prepare a standard interface that could be supported directly by vendors (for efficiency) and would provide a portable interface for application programmers and compilers. The result is the message passing interface standard (MPI) [13].
The minimal functionality required by the Chant system interface for an interprocess communication system includes nonblocking send and receive (send, recv), message polling (test), collective communication (bcast, barrier, reduce), and process management (rank, nprocs). This functionality is supported by MPI and most other communication libraries, including Intel NX and PVM.
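As an illustration, each of these requirements maps onto a standard MPI call as follows (the demo function is our sketch, not Chant code):

    #include <mpi.h>

    /* Sketch: the minimal MPI functionality required by the Chant system
       interface for interprocess communication. */
    void chant_ipc_demo(void *sbuf, void *rbuf, int count, int peer, int tag)
    {
        MPI_Request sreq, rreq;
        MPI_Status  status;
        int         flag = 0, rank, nprocs, local, total;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);    /* nprocs */

        MPI_Irecv(rbuf, count, MPI_BYTE, peer, tag, MPI_COMM_WORLD, &rreq); /* recv */
        MPI_Isend(sbuf, count, MPI_BYTE, peer, tag, MPI_COMM_WORLD, &sreq); /* send */

        while (!flag)
            MPI_Test(&rreq, &flag, &status);       /* poll for completion (test) */
        MPI_Wait(&sreq, &status);

        MPI_Bcast(sbuf, count, MPI_BYTE, 0, MPI_COMM_WORLD);    /* bcast   */
        MPI_Barrier(MPI_COMM_WORLD);                             /* barrier */
        local = rank;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); /* reduce */
    }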
3 Design

Chant is designed as a layered system (as shown in Figure 1), where efficient point-to-point communication (Section 3.2) provides the basis for implementing remote service requests (Section 3.3) and, in turn, remote thread operations (Section 3.4) and collective communication (Section 3.5). The Chant user interface is comprised of calls from each of these layers. We begin our discussion of the design by introducing and defining the concept of a global thread.
3.1 Global Threads

Chant extends the pthreads interface and functionality with a new thread object, called a chanter thread. A chanter thread is, both conceptually and functionally, an extension of a pthread, where the pthread has been extended with a global thread identifier and the state necessary to operate on a remote processor. Appendix A.2 lists the Chant user interface for creating and managing global threads.

All chanter threads are identified as a specific thread within a specific process. Since the term "process" typically refers to a Unix OS process, which combines a single thread of control with a single address space, it can be confusing when dealing with multiple threads of control and multiple address spaces. Another confusing aspect of a process is its "name" within a global system, since process identification can vary depending on the underlying architecture. For example, on a network of Sun workstations, a process is uniquely identified by its machine name (or IP address) and process number. On the Intel Paragon, a process is represented by node number and ptype. To solve these problems, Chant separates a process into two concepts: threads, which provide the threads of control within a single address space, and contexts, which define the address space boundaries. Threads are defined by the entities provided by the lightweight thread package, such as pthreads. Contexts define the boundary of an address space, such that any two threads located within the same context are guaranteed to have access to the same memory locations, whereas two threads in different contexts must use message passing for communication. Notice that this definition does not exclude multiple processing elements within a single context, nor multiple contexts within a single addressing space, both of which are options that can be explored for specific architectures. However, the typical mapping for Chant is that a context is in one-to-one correspondence with a Unix process.

All of the contexts in a system are linearly ordered and given a unique identifier, called a context identifier or cid. For example, if MPI is the underlying communication system, then the cids correspond directly to a process' MPI rank within MPI_COMM_WORLD. The underlying thread system ensures that all threads within a context are uniquely identified with a thread identifier, or tid. Therefore, a chanter (or global thread) identifier is defined as a (cid, tid) doublet. In addition to its global identifier, each chanter thread maintains a pointer to its underlying pthread object, whether locally or in another context. Thus a chanter thread is simply a wrapper for a local pthread; any of the pthread operations may be applied to a chanter by extracting the local pthread object.
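A sketch of the global identifier and wrapper just described follows; the field and type names are our assumptions, not the published interface:

    #include <pthread.h>

    /* Sketch: a chanter (global thread) identifier is a (cid, tid) doublet. */
    typedef struct {
        int cid;   /* context identifier, e.g. the MPI rank of the process */
        int tid;   /* thread identifier, unique within the context */
    } chanter_id_t;

    /* Sketch: a chanter thread is a wrapper for a local pthread. */
    typedef struct {
        chanter_id_t id;        /* global identifier */
        pthread_t    pthread;   /* underlying pthread object, valid in the
                                   owning context; remote operations go
                                   through remote service requests */
    } chanter_t;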
[Figure 2: Communication between threads in different processes: a thread in one context sends to a thread in another context across the network.]
3.2 Point-to-Point Communication

Point-to-point communication is defined by the fact that both the sending thread and the receiving thread agree that a message is to be transferred. Although there are various forms of send and receive primitives, the understanding on both sides that a communication is to occur makes it possible for the operating system to efficiently handle the message transfer without buffer copies or user-level interrupts, assuming that the receive is registered with the operating system before the message arrives. However, if a message arrives before the corresponding receive has been posted, the operating system will be forced to copy the incoming message into a system buffer, since it does not yet know the user-space destination. The Chant point-to-point communication layer ensures that no additional message buffer copies are performed beyond those required by the underlying message passing system to handle asynchronous messages that arrive before a receive has been posted. For many high-performance multicomputers with low-latency communication, this assurance is tantamount to efficient message passing.

The basic point-to-point communication operations are chanter_send and chanter_recv, where a send operation creates a message and places it into the network for a given destination thread, and the receive operation takes a message from a specified source thread and removes it from the network. A nonblocking version of the receive operation, chanter_irecv, allows the computation to initiate the communication operation and perform additional computations or tasks until the completion of the operation is required. Test and wait operations are provided in conjunction with nonblocking operations to determine if and when an initiated operation completes. The complete list of Chant point-to-point communication operations is given in Appendix A.3.

Supporting thread-to-thread communication, as depicted in Figure 2, requires solutions to the problems of identifying global threads within a message and polling for messages. We discuss these solutions in the next two sections.
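As a usage illustration of these primitives (a sketch only; the argument lists below are assumptions, the authoritative signatures being those in Appendix A.3):

    #include "chant.h"   /* hypothetical header for this sketch */

    void exchange(chanter_id_t peer)
    {
        char            buf[1024];
        chanter_request req;   /* hypothetical request type */

        chanter_send(peer, /*tag=*/0, buf, sizeof buf);   /* blocking send    */
        chanter_irecv(peer, 0, buf, sizeof buf, &req);    /* nonblocking recv */
        /* ... overlap useful computation here ... */
        chanter_wait(&req);   /* or poll with chanter_test(&req, &done) */
    }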
3.2.1 Identifying Threads in Messages
A message is typically comprised of two parts (cf. Figure 3): a fixed-length header followed by a variable-length body. When the operating system receives a message, the header must contain sufficient information so that the message can be delivered without having to be copied to an intermediate buffer for processing. For Chant, this implies that the entire destination chanter identifier must appear in the message header. Otherwise, the operating system will not be able to distinguish between messages for two threads within the same context, forcing a buffer copy.

MPI already provides a space in the message header for the destination cid, but the destination tid does not have a reserved location. Therefore, we simply overload the user-defined tag field of the message header to contain both the tid and the user-defined tag (cf. Figure 3). While this effectively reduces the
[Figure 3: Message headers in Chant. A fixed-length header, containing the context field and a tag field split between the destination tid and the user-defined tag, precedes a variable-length body.]
number of user-defined tags by half, in practice this has little effect on a program since most tags are not used. The alternative to overloading the tag field, packing the destination tid into the message body, forces buffer copies on both the sending and receiving sides, and is therefore unacceptable with respect to our design goals. Chant assumes the burden of correctly packing and unpacking the global thread identifier into the message header when the Chant communication routines are used.

There are two problems with overloading the tag. First, the source thread id is not transmitted with the message. Second, a wildcard cannot be used as a tag to receive the message, since the receiving thread has to post a receive with its own tid built into the tag.
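A sketch of the tag overloading just described; the exact bit split is not specified in the text, so the widths below are assumptions:

    /* Sketch: pack the destination tid and the user-defined tag into the
       single MPI tag field. TAG_BITS is an assumption; the paper states
       only that the user-defined tag space is effectively reduced. */
    #define TAG_BITS 15
    #define TAG_MASK ((1 << TAG_BITS) - 1)

    static int pack_tag(int tid, int user_tag)
    {
        return (tid << TAG_BITS) | (user_tag & TAG_MASK);
    }

    static int unpack_tid(int mpi_tag)      { return mpi_tag >> TAG_BITS; }
    static int unpack_user_tag(int mpi_tag) { return mpi_tag & TAG_MASK; }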
3.2.2 Polling for Messages
The problem of waiting for a message to arrive is solved by polling for incoming messages, which is done either by the operating system or by user-level code. Operating system polling either blocks the entire process while the operating system spins and waits (a typical blocking receive), or allows the process to continue execution and generates an interrupt when a message arrives (interrupt-driven reception). Neither of these solutions is desirable. Blocking the entire process eliminates the opportunity to execute other (ready) threads, effectively disabling a primary feature of using threads (i.e., multithreading). User-level interrupts can be disruptive to processor pipelines and caches [38], and are problematic for programmers using a non-preemptive threads package, since interrupts effectively make the system preemptive. Finally, many message passing libraries, including the proposed MPI standard, do not support interrupt-driven message reception.

The alternative to operating system polling (and interrupts) is to perform user-level polling, in which user-level code is responsible for checking for the completion of message receptions. Although user-level polling avoids the problems of blocking and interrupts, it suffers from the inability to respond quickly to asynchronous requests, and thus may be an unacceptable implementation option for real-time applications. However, real-time applications are not in the current target domain for Chant, so we consider this to be an acceptable limitation for now. Chant employs user-level polling, and there are three policies that we investigate:

1. Thread Polls, in which each thread continually re-polls every time it is scheduled for execution. This policy has the advantage of being simple and portable, but can result in unnecessary context switching overheads in the case when a thread is re-scheduled for execution but cannot complete the receive operation.

2. Server Polls, in which a single thread in each context is responsible for all message polling. User threads waiting for a message register the receive operation with the server thread and are then removed from the ready queue and placed on a waiting queue. When the server thread is re-scheduled, it will poll for all outstanding messages and re-enable the threads whose messages have arrived. This approach eliminates the overhead of unnecessary context switching. However, it can be slower than the Thread Polls policy when there are many outstanding messages, since the server has to test for each waiting thread individually. This overhead could be reduced if the underlying system supports a testany call, which tests for multiple outstanding messages in a single call.

3. Scheduler Polls, in which threads register their receive requests with the scheduler, but remain on the ready queue. The scheduler, with modification, can then determine if an outstanding request has been satisfied before incurring the overhead of a full context switch. This approach is a hybrid of the Thread Polls and Server Polls methods, and offers the advantages of both. However, since it requires modification of the thread scheduler, it is not a portable solution and is therefore incongruous with our design goals.

Our results (see Table 2 in Section 4) indicate that, although some form of scheduler polling for the threads results in the best performance, having the threads poll for themselves is only slightly worse in the average case. This result is significant because some underlying lightweight thread packages do not allow modification of the scheduler's activities [32], but the Thread Polls policy can be safely implemented on all packages.
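For illustration, the Thread Polls policy reduces to a loop of this form in each waiting thread (a sketch using MPI and a generic yield, not the actual Chant source):

    #include <mpi.h>
    #include <sched.h>

    /* Sketch of the Thread Polls policy: test for the message each time
       this thread is scheduled, and yield the processor otherwise. */
    void thread_polls_recv(void *buf, int len, int src, int tag)
    {
        MPI_Request req;
        MPI_Status  status;
        int         done = 0;

        MPI_Irecv(buf, len, MPI_BYTE, src, tag, MPI_COMM_WORLD, &req);
        while (!done) {
            MPI_Test(&req, &done, &status);  /* re-poll on each re-schedule */
            if (!done)
                sched_yield();               /* full context switch back to
                                                the thread scheduler */
        }
    }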
3.3 Remote Service Requests

Having established a mechanism by which lightweight threads located in different contexts can directly communicate, we now address the issue of supporting remote service requests. Remote service request (RSR) messages are distinguished from point-to-point messages in that the destination is a context rather than a thread, and no thread within the context is expecting the message. Rather, the message details some request that is to be performed on behalf of the source thread in the destination context. The nature of the request can range from returning a value from a local address space that is wanted by a thread in a different addressing space (remote fetch), to executing a local function (remote procedure call), to processing system requests necessary to keep the global state up-to-date (coherence management).

In developing a solution for remote service requests, we built upon our design for point-to-point communication and introduce a system thread called the server thread, which is continually posting receives for remote service requests from any other thread in the system. Each context has a unique server thread which is created when Chant is initialized. When the server thread is scheduled for execution, it will process all outstanding remote service requests before yielding. Each RSR message, created and sent using the chanter_rsr routine, contains the name of a handler function, which is invoked by the server thread with the rest of the message as an argument. Since the starting address for a handler function is not guaranteed to be the same across all processors, remote service request functions must be registered with the server thread on each context before being invoked. The registration process, accomplished with the chanter_register routine, informs the system of the identifier string that will be used to refer to a certain local function pointer. The handler function, when invoked by the server thread, is passed the chanter identifier for the calling thread, so that a reply can be sent if required.

Since there is only a single server thread on each context, it is imperative that it not block while executing a handler function. One solution is to always fork a new thread to execute the actual RSR handler function, but this overhead is often unwarranted in the case of simple RSR requests, such as a remote fetch. Therefore, Chant allows the user to specify, at registration time (cf. pthread_chanter_register), whether a handler function will block or not, and only blocking functions will cause the creation of a new thread for their execution.

To decrease the response time for a remote service request, the server thread operates at a higher priority than the computation threads, ensuring that its polling request will always be checked before those of the computation threads. However, without interrupts the only way to guarantee minimal response time to remote service requests is to either employ a preemptive scheduling policy for the threads, or have the threads preempt themselves occasionally using the yield() call. The tradeoffs of these approaches and their influence on the response time to remote service requests is an area of future research for Chant.

If the underlying architecture supports a low-latency remote service request mechanism, such as Active Messages [45], in addition to the point-to-point primitives, then Chant would ideally shortcut the remote service request mechanism just described to take advantage of these low-level primitives. This, too, is an area of future research.
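The following sketch illustrates the registration and invocation pattern just described; the signatures and the handler calling convention are our assumptions:

    /* Sketch: an RSR handler is registered under a string name in every
       context, then invoked remotely via chanter_rsr. */
    void fetch_handler(chanter_id_t caller, void *msg, int len)
    {
        /* look up the requested datum locally, then reply to `caller`
           (its chanter identifier is passed in by the server thread) */
    }

    void setup_and_use(int dest_cid)
    {
        /* at initialization, in every context; 0 = handler does not block,
           so the server thread may run it inline instead of forking */
        chanter_register("remote_fetch", fetch_handler, /*blocking=*/0);

        /* later, from any thread: ship the request to dest_cid's server */
        chanter_rsr(dest_cid, "remote_fetch", /*args=*/0, /*len=*/0);
    }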
[Figure 4: Example of a remote join operation: a thread in one context joins with a thread in a second context, which notifies the caller on exit.]
3.4 Remote Thread Operations

In addition to supporting communication between any two chanter threads, Chant extends the pthreads interface to provide thread routines that operate on a global, possibly remote, chanter thread. Thus, for each routine in pthreads which takes a pthread identifier as an argument, Chant supplies an equivalent routine that takes a chanter identifier as an argument. The result is a coherent lightweight thread interface that is well-supported across multiple address spaces. However, not all pthread routines are supported in this manner. For example, it would be inappropriate and prohibitively expensive to support the pthreads shared memory synchronization primitives (mutual exclusion and condition variables) in a distributed memory environment.

In some cases, such as chanter_create, invoking the local thread operation on a remote context, via a remote service request, will suffice. However, in most cases some additional software "glue" is needed to fully support the semantics of the operation across separate address spaces. For example, consider the pthread_chanter_join() operation, which blocks the calling thread until the specified chanter thread has completed. For two threads within the same context, this operation is handled by the underlying pthread join routine, pthread_join(). However, when the calling thread and the argument thread are in separate contexts (cf. Figure 4), executing the local join in the remote context is not sufficient, since pthreads has no knowledge of the calling thread in another context. Therefore, a remote join is accomplished by adding, to the list of exit handlers for the argument thread, a routine which will send a message back to the calling context with the effect of unblocking the calling thread. Chant maintains control over the entry and exit segments of a thread, so invoking call-back functions upon thread termination can be supported even without such support from the underlying thread package.

Each remote thread operation requires a different level of support, but the essential work is always handled by the underlying thread package. The result is a user-level threads interface that is a logical extension of pthreads, where global (chanter) thread identifiers can be substituted for local (pthread) thread identifiers.
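A sketch of the remote join glue described above; helper names such as chanter_self_id, chanter_block_self, and the "add_join_notify" handler are illustrative, not the published interface:

    /* Sketch: join on a possibly remote chanter thread. */
    int chanter_join_sketch(chanter_t *t)
    {
        chanter_id_t me = chanter_self_id();

        if (t->id.cid == me.cid)                    /* same context: plain */
            return pthread_join(t->pthread, NULL);  /* pthread_join suffices */

        /* remote context: ask it (via an RSR) to add an exit handler to the
           target thread that sends an "unblock" message back to us */
        struct { chanter_id_t target, caller; } args = { t->id, me };
        chanter_rsr(t->id.cid, "add_join_notify", &args, sizeof args);
        chanter_block_self();   /* woken when the exit notification arrives */
        return 0;
    }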
3.5 Ropes

Chant provides a concept, called ropes, for exploiting data parallelism in a multithreaded environment. Consider a simple data parallel algorithm for computing the sum over a distributed array (cf. Figure 5).
[Figure 5: One-dimensional array distributed among four threads in a rope. The array spans contexts C1, C2, and C3, and the rope translation table maps relative ranks to global thread identifiers: rank 1 -> (C1, T2), rank 2 -> (C2, T1), rank 3 -> (C2, T3), rank 4 -> (C3, T1).]
In this example, each process will compute its local sum and then participate in a global reduction to obtain the total sum. To execute this example as a set of distributed threads in the midst of other thread activity, and without involving the other threads, a scoping mechanism is needed for identifying the threads that will contribute to the global reduction. Specifically, only threads (C1, T2), (C2, T1), (C2, T3), and (C3, T1) should participate in the reduction. Ropes provide this mechanism.

The key to collective operations is the ability for the programmer (or compiler) to specify the scope of the operation; that is, the entities that will be involved in the operation. Collective operations are supported at the context level by the underlying communication system, MPI, using a scoping mechanism called groups. However, grouping threads within processes is not currently supported by either MPI or related thread-based runtime systems [7, 15], yet such support is clearly needed if threads are to perform collective operations within a subset of the threads in the system.

Relative indexing allows the programmer to specify spatial relationships among the parallel execution units, which express the natural "neighboring" relationships in data parallel algorithms. Without support for relative indexing among threads, the programmer would be required to assign and maintain the relative identifiers for the threads participating in the data parallel operations. Also, with proper support for mapping processes to processors, relative indexing can be used to optimize performance by ensuring that an algorithm is correctly mapped onto the underlying topology.

Support for these two features, collective communication and relative indexing on a subset of threads, allows a data parallel compiler, such as HPF, to target a multithreaded environment. Instead of targeting 0 to p - 1 processes, the compiler requires minor changes to target 0 to p - 1 threads spread across a set of contexts [23]. Appendix A.4 lists the Chant operations supporting ropes.

A system for implementing thread collections (i.e., ropes) must satisfy the following requirements:

1. The collections are entities whose members can span contexts, and thus their identifiers must be unique within the system.

2. Each collection must keep track of its constituent contexts and threads, and operations to add and delete from this list must be performed atomically.

3. Thread ranks within a collection must be unique, so that there exists a one-to-one mapping between the thread identifier with respect to the context (global thread id) and the thread identifier with respect to the rope (relative index).

We now describe a design for ropes which satisfies these requirements.
3.5.1 Rope Servers
The requirements listed above can typically be satisfied by having a centralized name server responsible for allotting rope identifiers and for performing atomic updates to the internal data structures. However, a centralized solution for naming and updating ropes will certainly cause hot-spots. Therefore, our initial design uses a two-level approach, derived from the idea of two-level page management schemes for distributed shared memory systems [5], that allows the user to control the contention among the servers by dividing the work between two types of centralized servers:

1. a single, global name server used to allot identifiers for new ropes, and

2. a separate rope server associated with each rope that is responsible for maintaining the information pertaining to the rope, including the set of participating contexts, the set of participating threads and their relative ids, and a translation table (see Section 3.5.3), which maps the relative rope ranks to global thread ids.
3.5.2 Rope Creation
A rope is a set of threads that defines a scope for collective operations, and creating a rope is tantamount to specifying this set of threads. In some instances, it may be useful to create a set of new threads which will define a rope. For example, a host program may create a rope with a set of new threads as node programs for a data parallel computation. This would also be the model employed by a data parallel compiler. In other instances, it may be useful to add existing threads into an extant rope. For example, a threaded system may start with a single thread on each processor, and each of these threads may add themselves into a rope representing the global set of threads. Therefore, the rope creation mechanism must be capable of both creating new threads to comprise a rope, and adding existing threads to a rope. This is accomplished by separating the tasks of creating a rope and specifying membership in a rope.

Creating a rope is done using the rope_create call, resulting in a message being sent from the source thread to the global name server, which returns the next available rope identifier. To avoid further messages and a more complicated protocol, the context of the calling thread is designated as the rope server for the new rope. Thus, distribution of the rope servers is accomplished by having different threads invoke the rope_create routine, which is under direct control of the user. The global name server keeps track of which context is the server for each active rope. This allows any thread in the system to find out (via the global name server) the rope server for a particular rope. A newly-created rope is initially empty, and the user can use the following two mechanisms for specifying membership:

1. rope_addnew, which creates a specified number of threads on a set of contexts and adds them to a rope; and

2. rope_addself, which adds the calling thread to a rope.
In the case of rope_addnew, the calling thread sends a message to the server for the specified rope, indicating how many threads are to be created and on which contexts. The rope server assigns the ranks and sends messages to the specified contexts requesting them to create the required number of local threads, and to update their local rope translation tables with the ranks of the new threads. The individual contexts send the thread identifiers for the new threads back to the rope server so that the rope server can update the master copy of the translation table. If the rope is using the strong consistency model (see Section 3.5.3), then an image of the new rope translation table is propagated to all member contexts. The rope_addself routine is a special case of rope_addnew, in which the step of creating the threads is simply bypassed, and only a single thread is to be added.
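Putting the two mechanisms together, rope construction might look as follows (a sketch; the argument lists are assumptions, cf. Appendix A.4):

    /* Sketch: create a rope and populate it. */
    void build_rope(void *(*worker)(void *))
    {
        int contexts[] = { 0, 1, 2, 3 };

        /* the global name server assigns the rope id; the calling thread's
           context becomes the rope server for this rope */
        rope_t r = rope_create(ROPE_WEAK_CONSISTENCY);

        /* spawn 4 new threads on each listed context and enroll them */
        rope_addnew(r, /*threads per context=*/4, contexts, 4, worker);

        /* alternatively, an existing thread enrolls itself */
        rope_addself(r);
    }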
[Figure 6: Data structure for the local rope list: each rope entry records the rope server identifier, the consistency requirement, the context list, the local thread list, and the rope translation table.]
3.5.3 Relative Indexing
Spatial relationships play an important role in data parallel algorithms, and most communication systems provide a linear ordering of the participating processes, which allows for relative indexing of the processes independent of their actual system id. In addition to supporting collective operations, ropes provide a relative ordering for a set of threads that is independent of their actual global id. Thus we say that each thread within a rope is assigned a unique rank, starting from zero. This makes it possible to send a message from thread i to thread i + 1 within a rope, without regard to the physical location of those threads. Spatial ordering can also be used to gain performance by exploiting the underlying connectivity of the architecture. However, for this to happen the user must be able to specify a mapping of threads to processes (allowed in Chant) and of processes to processors (currently not allowed in MPI).

To support relative indexing, the system must provide a one-to-one mapping between the rank within a rope and the global address of a chanter thread (the (cid, tid) doublet). This is accomplished via a rope translation table (cf. Figure 5) which stores and retrieves this mapping information. If the translation table were kept in a centralized location, then remote references would be necessary for translating all relative indices, which would be prohibitively expensive. Therefore we replicate this information and keep a copy of the table on each participating context for the rope. Figure 6 depicts the data structure for the local rope list.

Again, borrowing from earlier work in the area of page coherence for distributed shared memory systems [4], we adopt two options for keeping the distributed translation tables consistent: new information is broadcast so that all tables are kept up-to-date at all times (strong consistency), or tables are allowed to remain out-of-date until a reference for a thread is generated, causing the information to be retrieved and stored (cached) in the local table (weak consistency). If each thread in a rope communicates with only a small number of other threads in the rope, and the rope is short-lived, then the weak consistency model should result in better performance, since the creation cost is so much less. If, on the other hand, each thread in a rope will communicate with many other threads in the rope, the strong consistency model should result in better performance. Determining the crossover point for a given application is an open question depending on the overheads of the two approaches (see Section 4). Therefore, the system supports both strong and weak consistency on a per-rope basis by providing an argument to the rope_create routine to specify the consistency requirement.
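For example, a nearest-neighbor send within a rope might be written as follows (a sketch; rope_rank, rope_size, and rope_lookup are illustrative helper names for the translation step):

    /* Sketch: thread i sends to its right neighbor, rank (i+1) mod p,
       translating the relative index through the rope translation table. */
    void send_right(rope_t r, void *buf, int len)
    {
        int i = rope_rank(r);
        int p = rope_size(r);

        /* under weak consistency this lookup may trigger a remote fetch
           from the rope server; the result is then cached locally */
        chanter_id_t right = rope_lookup(r, (i + 1) % p);

        chanter_send(right, /*tag=*/0, buf, len);
    }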
3.5.4 Collective Operations
MPI provides the group facility for specifying which processes will participate in a collective operation, and ropes extend this idea to the thread level. To do this, each context participating in a rope must know the other contexts in the rope as well as the list of local threads in the rope, and so this information is maintained for each rope in a rope table (cf. Figure 6).

In order to take advantage of system-specific optimizations for collective operations among processors, all collective operations among threads are performed in two steps: at the thread level and at the context level. For example, consider the rope_barrier operation, which performs a barrier synchronization among all threads in a rope. The barrier is performed first at the thread level within a context, and then at the context level, as described by the following algorithm:

1. Each thread, upon executing the barrier command, will send a message to an accumulator thread local to its context, which will accumulate the count of the number of messages received. After sending the message, the calling thread is blocked on an appropriate event.

2. After the local accumulator thread has collected a barrier message from each of the threads in the rope within this context (this information is stored in the local rope table), a message is sent to the rope server for this rope. The accumulator thread then waits for a reply from the rope server.

3. When the rope server has collected a message from each context participating in the rope, a message is returned to the accumulator threads on the participating contexts, informing them that the barrier is complete.

4. The accumulator threads then trigger the events for the local waiting threads, thus completing the barrier.

Ideally, we would like to utilize the context-level primitives from MPI, such as MPI_BARRIER, to replace steps 2 and 3 in our algorithm. However, the MPI_BARRIER call invoked by the local accumulator threads would block the entire process, including any other threads in that context not related to the rope, until all participating contexts had invoked the MPI_BARRIER call. This would remove one of the key features of a multithreaded system: the ability to overlap useful computation (in the form of ready, waiting threads) with long-latency, blocking operations. As a result, our design does not use the MPI_BARRIER call, but rather a simple message-combining scheme that allows other ready threads to execute while the barrier operation proceeds. Whenever possible, we utilize the MPI collective operations, and should the MPI committee see fit to extend the standard with nonblocking versions of the collective operations, we would certainly incorporate them into the design as mentioned above. Other collective communication operations, such as reduction functions, can be implemented in a similar two-level fashion.
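In outline, the four-step algorithm above reduces to the following sketch (the counters and helper names are illustrative):

    /* Sketch of the two-level rope barrier. */
    void rope_barrier_sketch(rope_t r)
    {
        notify_local_accumulator(r);   /* step 1: check in with this context's
                                          accumulator thread ...             */
        block_on_barrier_event(r);     /* ... and block until step 4 releases us */
    }

    /* Run by the per-context accumulator thread. */
    void accumulator_step(rope_t r)
    {
        if (arrived_incr(r) == local_member_count(r)) {   /* step 2 */
            chanter_rsr(rope_server_cid(r), "ctx_at_barrier", &r, sizeof r);
            wait_for_server_reply(r);  /* step 3: the rope server replies once
                                          every participating context has
                                          checked in */
            release_local_waiters(r);  /* step 4 */
            arrived_reset(r);
        }
    }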
4 Performance

It is not our goal in this paper to argue that threads themselves are useful for programming distributed memory multiprocessors. Instead, our goal is to demonstrate that our design decisions were effective in implementing the various layers of Chant. Chant is currently implemented and has been tested on a network of Sun workstations and on the Intel Paragon; however, its portability is constrained only by the underlying support systems (pthreads and MPI).
                Sun Sparc 10                     Intel Paragon
Operation   pthreads    Chant    Overhead    pthreads    Chant    Overhead
Create      1260 µs    1350 µs      7%       1490 µs    1540 µs      3%
Switch        43 µs      45 µs      5%         94 µs      95 µs      1%

Figure 7: Performance of thread operations on the Sparc 10 and Intel Paragon

    loop {
        compute(alpha);
        send();
        recv();
    }

Figure 8: Pseudo-code for the threads in the polling exercise
4.1 Thread Operations

We first examine the overhead of the two main thread operations, thread creation and context switching. Figure 7 details the performance of these operations for both a Sun Sparc 10 workstation and the Intel Paragon. The data indicate that the overhead for Chant is relatively negligible, and corresponds to the time required to extract a pthread from a chanter thread (context switch) and to allocate the few extra data structures that chanter threads require (creation).
4.2 Message Polling

In Section 3.2.2 we introduced the problem of polling for outstanding messages in point-to-point communication. We now take a closer look at that problem by measuring the three scheduling policies (Thread Polls, Server Polls, and Scheduler Polls) to determine their effect on the overall performance of the system. Figure 8 presents the thread code used in the experiment, where the parameter alpha represents the number of iterations for a generic computation. Alpha is modified to affect the average number of outstanding receive requests (or waiting threads) at each scheduling point: by increasing alpha, we increase the number of threads waiting for outstanding requests. Intuitively, increasing the time between the posting of a receive and the execution of the corresponding send will increase the number of threads waiting for messages.

The experiment is to execute the code given in Figure 8 on 12 threads in each of two contexts, with each thread performing 100 iterations of the outer send/receive loop. The experiment was carried out on the Intel Paragon. Table 2 presents these results, where Time represents the total running time (ms) of the test, CtxSw represents the total number of complete context switches performed, and Test represents the total number of MPI_TEST calls attempted in waiting for the messages to arrive.
          Thread Polls           Scheduler Polls        Server Polls
alpha    Time   CtxSw   Test    Time   CtxSw   Test    Time   CtxSw   Test
100      2730   6655    2662    2413   5580    2011    5950   5488    11817
1000     2860   6655    2693    2515   5630    2010    6090   5489    11942
10000    4000   7029    3057    3660   5579    2535    6123   5509    11875
100000   7260   7977    3975    6815   5649    3723    9990   5534    13238

Table 2: Execution times (ms) for the polling experiment on the Intel Paragon

The Scheduler Polls algorithm yields the lowest running times. The Thread Polls algorithm must perform full context switches at each scheduling opportunity to check for the completion of a message, whereas the Scheduler Polls algorithm need only perform a partial switch to check for outstanding messages. This overhead can be seen by comparing the total number of context switches for the two methods. However, since the Scheduler Polls algorithm requires altering the thread-level scheduler, which is not permitted with the standard pthreads system, it is encouraging to note that the Thread Polls algorithm suffers, on average, only a 10% degradation in performance when compared to the Scheduler Polls method.

Table 2 also shows that the Server Polls algorithm performs much worse than the other two, as a result of having to check all outstanding requests at each scheduling opportunity. The Server Polls algorithm performs far more MPI_TEST calls than the other two algorithms, accounting for its degraded performance. Since the Intel Paragon does not correctly support the functionality of MPI_TESTANY, each message is tested in turn at each context switch point, accounting for the large number of tests and poor performance. However, the Server Polls algorithm does achieve the lowest number of context switches of the three methods, and we hope that the performance of this policy will improve on architectures that provide a better, parallel implementation of MPI_TESTANY.
4.3 Point-to-Point Communication

To measure the overhead of Chant's point-to-point communication primitives (send/recv), we employ one thread on each of two contexts to exchange messages, and encode the same problem using the native MPI calls at the context (process) level. Figure 9 graphs the execution time of exchanging messages as a function of the message size for both Chant and native MPI on a network of Sun Sparc 10 workstations and on the Intel Paragon. The initial gap in execution times between the Intel version of Chant and native MPI corresponds to the extra function call overhead and the context switching between the computation thread and the server thread that Chant requires (using the Thread Polls policy). As the execution time for an exchange increases, either because of an increase in message size or because of a slower interconnect, the overhead for the Chant operations is amortized and becomes negligible. This would not be the case if Chant caused extra message buffer copies, since the overhead would then be proportional to the message size.
4.4 Rope Creation

Creating a rope results in a remote service request being sent to the global name server, and its reply, which requires about 375 µs on the Intel Paragon. After the rope server has returned the new rope identifier, some data structures are initialized and the rope creation is complete. If only one rope is being created in the system, then the creation time is independent of the number of contexts; otherwise, contention for the global rope server will degrade rope creation time depending on the number of simultaneous rope_create calls.
[Figure 9: Overhead of point-to-point communication on the Paragon and Sun Sparc 10: time (µs) per exchange vs. message size (bytes), on log-log axes, for Intel Chant, Intel MPI, Sun Chant, and Sun MPI.]
Adding new threads to a rope, using the rope_addnew call, requires sending a message to the rope server, indicating how many threads are to be created on the specified list of contexts. The rope server must then broadcast a request to those contexts, informing them to create the new threads. After creating the threads, the participating contexts send a message back to the rope server, detailing the thread identifiers of the new threads so that the rope server can complete the rope translation table. Finally, if the consistency mode for the rope is STRONG, then the rope server must broadcast the new rope table to the participating contexts. Therefore, the total number of message exchanges required to add threads on N contexts is 2N + 1 for weak consistency and 3N + 1 for strong consistency. Figure 10 shows the times (in ms) for creating four (4) new threads on each of 2 to 32 contexts and adding them to an existing rope. We can see that the cost of adding threads to a rope increases linearly, as expected, at a rate of either 2N (weak) or 3N (strong).
4.5 Rope Communication

To measure the effects of relative indexing on point-to-point communication, we measure message exchange times between two chanter threads on two separate contexts. Figure 11 depicts the execution time for message exchange using the Chant send/recv primitives with global identifiers, using relative indexing when the translation information is stored locally, and using relative indexing when the translation information must be retrieved from the per-rope server. The data confirm that relative indexing is an inexpensive operation when the translation information is cached locally, and that it doubles the exchange time when the information is stored remotely, accounting for the extra remote service request message.
[Figure 10: Execution times for rope_addnew on the Intel Paragon: time (ms) vs. number of contexts, for strong and weak consistency.]
[Figure 11: Execution times for message exchange using relative indexing on the Intel Paragon: time (µs) per exchange vs. message size (bytes), comparing relative indexing without a local table entry, relative indexing with a table entry, and global identifiers.]
5 Related Work

There are a variety of packages and systems supporting lightweight threads in a single address space, including [17, 27, 30, 32, 33, 42], but only a few systems support any form of communication between threads in a distributed address space:

- NewThreads [12], which supports a C++ thread class on the Intel iPSC/860 with blocking send and receive member functions. Addressing for communication is ports-based, rather than thread-based, so a global name server is required to hand out the port identifiers.

- Nexus [15] and Panda [7], which support task-level parallelism for several parallel languages, including Fortran-M [14] and Orca [3] respectively. Threads are used to represent parallel task invocations, and all communication between address spaces is based on a remote service request mechanism; direct communication between threads is not possible.

- Proposed MPI extensions [41], supporting long-lived threads capable of executing user code and using the full range of MPI primitives.

- Various application-specific runtime systems, including a runtime system supporting parallel simulations [35] and runtime systems supporting parallel functional languages [10, 19].

The term "rope" was first coined in the pthreads++ system [43], in which a rope is a C++ class that provides support for data parallel execution of a task in a shared memory environment, and was later extended to a distributed memory environment. The only other mention of collective communication among threads is in [41], which suggests altering the role and functionality of communicators to allow for multiple threads per communicator, thus permitting collective operations among the threads.

The contribution, and distinction, of Chant is to combine existing standards for message passing and lightweight threads, to provide support for both point-to-point communication and remote service requests, to provide support for collective operations among threads, and to experiment with the issues and algorithms involved in providing an efficient implementation of "talking threads." An earlier version of this paper, prior to the work on ropes, appeared in Supercomputing 94 [21].
6 Conclusions and Future Work

Chant provides a solution to the problem of supporting lightweight threads in a distributed memory environment by extending the existing standards for lightweight threads and interprocess communication to create a "talking threads" package. This paper has addressed the design and implementation issues of providing such support.

There are still many areas in which this research can be extended, including improved support for lightweight threads and debugging/performance-analysis tools. Improving support for lightweight threads is already an active area of research, including ideas for better stack management [16, 18], support for reentrant system libraries [37], and support for threads within MPI [41]. Debugging and performance-analysis tools for lightweight threads in a distributed memory environment are still virtually non-existent, although recent efforts in the PORTS [37] and Pablo [36] groups are directed at alleviating this shortcoming.

Currently, Chant is being used both as a compiler target for parallel language implementations [20, 22] and as a stand-alone system for supporting multithreaded applications research [11]. Additional areas of research, including parallel graphics rendering, parallel PDE applications, and parallel I/O, are all currently being considered for integration with the Chant system, and we hope to report on the application of Chant to these problem domains in the future.
References

[1] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. In ACM Symposium on Operating Systems Principles, pages 95-109, 1991.
[2] Maurice J. Bach. The Design of the UNIX Operating System. Software Series. Prentice-Hall, 1986.
[3] Henri E. Bal, M. Frans Kaashoek, and Andrew S. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, 18(3):190-205, March 1992.
[4] John K. Bennett, John B. Carter, and Willy Zwaenepoel. Munin: Distributed shared memory based on type-specific memory coherence. Technical Report Rice COMP TR89-98, Rice University, November 1989.
[5] John K. Bennett, John B. Carter, and Willy Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. Technical Report Rice COMP TR90-109, Rice University, March 1990. Appears in Proceedings of ISCA 17.
[6] Brian N. Bershad, Edward D. Lazowska, Henry M. Levy, and David B. Wagner. An open environment for building parallel programming systems. Technical Report 88-01-03, Department of Computer Science, University of Washington, January 1988.
[7] Raoul Bhoedjang, Tim Ruhl, Rutger Hofman, Koen Langendoen, Henri Bal, and Frans Kaashoek. Panda: A portable platform to support parallel programming languages. In Symposium on Experiences with Distributed and Multiprocessor Systems IV, pages 213-226, San Diego, CA, September 1993.
[8] Ralph Butler and Ewing Lusk. User's guide to the p4 parallel programming system. Technical Report ANL-92/17, Argonne National Laboratory, October 1992.
[9] T. C. K. Chou and J. A. Abraham. Load balancing in distributed systems. IEEE Transactions on Software Engineering, SE-8(4), July 1982.
[10] D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In 4th International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.
[11] Thomas Fahringer, Matthew Haines, and Piyush Mehrotra. On the utility of threads for data parallel programming. In Proceedings of the ACM International Conference on Supercomputing, Barcelona, Spain, July 1995. Also appears as ICASE Technical Report 95-35.
[12] Edward W. Felten and Dylan McNamee. Improving the performance of message-passing applications by multithreading. In Proceedings of the Scalable High Performance Computing Conference, pages 84-89, April 1992.
[13] Message Passing Interface Forum. Document for a Standard Message Passing Interface, draft edition, November 1993.
[14] I. T. Foster and K. M. Chandy. Fortran M: A language for modular parallel programming. Technical Report MCS-P327-0992 Revision 1, Mathematics and Computer Science Division, Argonne National Laboratory, June 1993.
[15] Ian Foster, Carl Kesselman, Robert Olson, and Steven Tuecke. Nexus: An interoperability layer for parallel and distributed computer systems. Technical Report Version 1.3, Argonne National Laboratory, December 1993.
[16] Seth Copen Goldstein, Klaus Erik Schauser, and David Culler. Lazy threads, stacklets, and synchronizers: Enabling primitives for compiling parallel languages. In Proceedings of the Third Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, Troy, NY, May 1995.
[17] Dirk Grunwald. A user's guide to AWESIME: An object oriented parallel programming and simulation system. Technical Report CU-CS-552-91, Department of Computer Science, University of Colorado at Boulder, November 1991.
[18] Dirk Grunwald, Brad Calder, Suvas Vajracharya, and Harini Srinivasan. Heaps o' stacks: Combined heap-based activation allocation for parallel programs. Technical report, Computer Science Department, University of Colorado, April 1994.
[19] Matthew Haines and Wim Bohm. An evaluation of software multithreading in a conventional distributed memory multiprocessor. In IEEE Symposium on Parallel and Distributed Processing, pages 106-113, December 1993.
[20] Matthew Haines and Wim Bohm. On the design of distributed memory Sisal. Journal of Programming Languages, 2(1):209-240, Spring 1993.
[21] Matthew Haines, David Cronk, and Piyush Mehrotra. On the design of Chant: A talking threads package. In Proceedings of Supercomputing '94, pages 350-359, Washington, D.C., November 1994. Also appears as ICASE Technical Report 94-25.
[22] Matthew Haines, Bryan Hess, Piyush Mehrotra, John Van Rosendale, and Hans Zima. Runtime support for data parallel tasks. In Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation, pages 432-439, McLean, VA, February 1995. Also appears as ICASE Technical Report 94-26 and Technical Report TR 94-2, Institute for Software Technology and Parallel Systems, University of Vienna.
[23] Matthew Haines, Piyush Mehrotra, and David Cronk. Ropes: Support for collective operations among distributed threads. ICASE Report 95-36, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, VA 23681, May 1995.
[24] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.
[25] IEEE. Threads Extension for Portable Operating Systems (Draft 7), February 1992.
[26] Intel Corporation, Beaverton, OR. Paragon OSF/1 User's Guide, April 1993.
[27] David Keppel. Tools and techniques for building fast portable threads packages. Technical Report UWCSE 93-05-06, University of Washington, 1993.
[28] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440-451, October 1991.
[29] Ravi Konuru, Jeremy Casas, Robert Prouty, Steve Otto, and Jonathan Walpole. A user-level process package for PVM. In Proceedings of the Scalable High Performance Computing Conference, 1994.
[30] Jeff Kramer, Jeff Magee, Morris Sloman, Naranker Dulay, S. C. Cheung, Stephen Crane, and Kevin Twindle. An introduction to distributed programming in REX. In Proceedings of ESPRIT-91, pages 207-222, Brussels, November 1991.
[31] Jenq Kuen Lee and Dennis Gannon. Object oriented parallel programming: Experiments and results. In Proceedings of Supercomputing '91, pages 273-282, Albuquerque, NM, November 1991.
[32] Frank Mueller. A library implementation of POSIX threads under UNIX. In Winter USENIX, pages 29-41, San Diego, CA, January 1993.
[33] Bodhisattwa Mukherjee, Greg Eisenhauer, and Kaushik Ghosh. A machine independent interface for lightweight threads. Technical Report CIT-CC-93/53, College of Computing, Georgia Institute of Technology, Atlanta, Georgia, 1993.
[34] nCUBE, Beaverton, OR. nCUBE/2 Technical Overview: Programming, 1990.
[35] David M. Nicol and Philip Heidelberger. Optimistic parallel simulation of continuous time Markov chains using uniformization. Journal of Parallel and Distributed Computing, 18(4):395-410, August 1993.
[36] The Pablo performance analysis group. http://bugle.cs.uiuc.edu/Pablo.html.
[37] Portable Runtime Systems (Ports) consortium. http://www.cs.uoregon.edu:80/paracomp/ports/.
[38] Matthew Rosing and Joel Saltz. Low latency messages on distributed memory multiprocessors. ICASE Report No. 93-30, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, Virginia, June 1993.
[39] Carl Schmidtmann, Michael Tao, and Steven Watt. Design and implementation of a multithreaded Xlib. In Winter USENIX, pages 193-203, San Diego, CA, January 1993.
[40] H. Schwetman. CSIM Reference Manual (Revision 9). Microelectronics and Computer Technology Corporation, 9430 Research Blvd., Austin, TX, 1986.
[41] Anthony Skjellum, Nathan E. Doss, Kishore Viswanathan, Aswini Chowdappa, and Purushotham V. Bangalore. Extending the message passing interface (MPI). Technical report, Computer Science Department and NSF Engineering Research Center, Mississippi State University, 1994.
[42] Sun Microsystems, Inc. Lightweight Process Library, Sun Release 4.1 edition, January 1990.
[43] Neelakantan Sundaresan and Linda Lee. An object-oriented thread model for parallel numerical applications. In Proceedings of the Second Annual Object-Oriented Numerics Conference, pages 291-308, Sunriver, OR, April 1994.
[44] Vaidy Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4):315-339, December 1990.
[45] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: A mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256-266, May 1992.
[46] Mark Weiser, Alan Demers, and Carl Hauser. The portable common runtime approach to interoperability. In ACM Symposium on Operating Systems Principles, pages 114-122, December 1989.
[47] Hans P. Zima and Barbara M. Chapman. Compiling for distributed memory systems. Proceedings of the IEEE, Special Section on Languages and Compilers for Parallel Machines (to appear 1993), 1993. Also: Technical Report ACPC/TR 92-16, Austrian Center for Parallel Computation (November 1992).
A The Complete Chant User Interface

A.1 Initialization and Context Management

chant_init(int *argc, char **argv)
    Initialize the Chant system.
chant_start(void)
    Start multithreaded execution.
chant_num_contexts(int *ncontexts)
    Return the number of contexts in the system.
chant_timer(double *current)
    Return the current value of the system wall clock.
chant_shutdown(void)
    Terminate multithreaded execution.
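As an illustration, the following minimal sketch exercises these calls in the conventional order: initialize, query the contexts, start multithreaded execution, and shut down. The header name chant.h and the error-handling conventions are assumptions, since the table above specifies only the call signatures.

    #include <stdio.h>
    #include "chant.h"   /* assumed header name; not specified by the table */

    int main(int argc, char **argv)
    {
        int ncontexts;
        double t0, t1;

        chant_init(&argc, argv);         /* initialize the Chant system       */
        chant_num_contexts(&ncontexts);  /* query the number of contexts      */
        printf("running on %d contexts\n", ncontexts);

        chant_timer(&t0);                /* read the system wall clock        */
        chant_start();                   /* start multithreaded execution     */
        chant_timer(&t1);
        printf("multithreaded phase: %f seconds\n", t1 - t0);

        chant_shutdown();                /* terminate multithreaded execution */
        return 0;
    }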
A.2 Thread Management

pthread_chanter_create(pthread_chanter_t *chanter, ports0_threadattr_t *attr, const char *fname, any_t arg, size_t argsize, int cid, int tid, int detached)
    Create a new chanter thread.
pthread_chanter_yield(void)
    Yield the processor to the next ready thread.
pthread_chanter_join(pthread_chanter_t chanter, any_t *stat)
    Block until the specified thread exits.
pthread_chanter_joinall(int count, pthread_chanter_t chanters[], any_t stat[])
    Join with a list of threads.
pthread_chanter_exit(void)
    Terminate execution of the calling thread.
pthread_chanter_equal(pthread_chanter_t c1, pthread_chanter_t c2, int flag)
    Return TRUE if two chanter threads are equal.
pthread_chanter_self(pthread_chanter_t *id)
    Return the chanter id structure for the calling thread.
pthread_chanter_context(pthread_chanter_t chanter, int *cid)
    Return the context id for the given chanter thread.
pthread_chanter_thread(pthread_chanter_t chanter, int *tid)
    Return the thread id for the given chanter thread.
pthread_chanter_pthread(pthread_chanter_t chanter, pthread_t *pthread)
    Return the pthread portion of the given chanter thread.
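The sketch below creates a chanter thread on a remote context and joins with it. Several details are assumptions rather than documented behavior: the header name chant.h, the thread-body signature taking a single any_t argument, naming the body by the string "worker" (as the const char *fname parameter suggests), passing NULL attributes, and letting tid 0 denote a system-chosen thread id.

    #include "chant.h"   /* assumed header name */

    /* thread body; any_t is assumed to be a generic pointer type */
    void worker(any_t arg)
    {
        int *value = (int *)arg;
        /* ... perform work with *value ... */
        pthread_chanter_exit();              /* terminate the calling thread */
    }

    void spawn_and_join(void)
    {
        pthread_chanter_t child;
        any_t status;
        int arg = 42;

        /* create a joinable (detached == 0) thread named "worker" on
           context 1; NULL attributes and tid 0 are assumptions */
        pthread_chanter_create(&child, NULL, "worker", (any_t)&arg,
                               sizeof(arg), 1, 0, 0);

        pthread_chanter_yield();              /* let the new thread run      */
        pthread_chanter_join(child, &status); /* block until the child exits */
    }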
A.3 Thread Communication

pthread_chanter_send(any_t buf, int count, msgdata_t datatype, int cid, int tid, int tag)
    Send a message.
pthread_chanter_recv(any_t buf, int count, msgdata_t datatype, int cid, int tag, msgstatus_t *stat)
    Receive a message, blocking.
pthread_chanter_irecv(any_t buf, int count, msgdata_t datatype, int cid, int tag, msgid_t *handle)
    Receive a message, nonblocking.
pthread_chanter_msgsource(msgstatus_t stat, int *source)
    Return the source context for the given message.
pthread_chanter_msgcount(msgstatus_t stat, msgdata_t datatype, int *count)
    Return the datatype count for the given message.
pthread_chanter_msgtest(msgid_t *handle, int *flag, msgstatus_t *stat)
    Test for message completion, nonblocking.
pthread_chanter_msgtestany(int count, msgid_t handles[], int *index, int *flag, msgstatus_t *stat)
    Test message completion for any in a set of messages.
pthread_chanter_msgtestall(int count, msgid_t handles[], int *flag, msgstatus_t stat[])
    Test message completion for all in a set of messages.
pthread_chanter_msgwait(msgid_t *handle, msgstatus_t *stat)
    Wait for message completion, blocking.
pthread_chanter_msgwaitany(int count, msgid_t handles[], int *index, msgstatus_t *stat)
    Wait for any message completion in a set of messages.
pthread_chanter_msgwaitall(int count, msgid_t handles[], msgstatus_t stat[])
    Wait for all message completions in a set of messages.
pthread_chanter_rsr(const char *fname, any_t arg, size_t argsize, int cid)
    Send a remote service request (RSR) to the specified context.
pthread_chanter_register(const char *name, thread_func_t func, int thread_spawn)
    Register an RSR handler; create a new thread for each invocation if thread_spawn is true.
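The following sketch pairs a tagged blocking send with both blocking and nonblocking receives. The datatype constant MSG_INT and the use of -1 as a wildcard source context are assumptions for illustration only; the table does not enumerate Chant's msgdata_t values or wildcard conventions.

    #include "chant.h"   /* assumed header name */

    /* sender side: tagged send to thread <peer_cid, peer_tid> */
    void ping(int peer_cid, int peer_tid)
    {
        int value = 7;
        pthread_chanter_send((any_t)&value, 1, MSG_INT,  /* MSG_INT assumed */
                             peer_cid, peer_tid, 99);
    }

    /* receiver side: blocking receive, then the nonblocking variant */
    void pong(void)
    {
        int value, source, done;
        msgstatus_t stat;
        msgid_t handle;

        /* blocking receive of one int with tag 99 (-1 as a source
           wildcard is an assumption) */
        pthread_chanter_recv((any_t)&value, 1, MSG_INT, -1, 99, &stat);
        pthread_chanter_msgsource(stat, &source);

        /* nonblocking variant: post the receive, poll once, then wait */
        pthread_chanter_irecv((any_t)&value, 1, MSG_INT, -1, 99, &handle);
        pthread_chanter_msgtest(&handle, &done, &stat);
        if (!done)
            pthread_chanter_msgwait(&handle, &stat);
    }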
A.4 Rope Management

pthread_rope_create(pthread_rope_t *rid, int coherence_mode)
    Create a new rope with strong or weak consistency and return its identifier.
pthread_rope_addnew(pthread_rope_t rid, pthread_attr_t attr, const char *function, any_t args, int argsize, int ncontexts, int context_list[], int nthreads)
    Create nthreads new threads on each specified context and add them to the specified rope.
pthread_rope_addself(pthread_rope_t rid)
    Add the calling thread to the specified rope.
pthread_rope_send(any_t buffer, int count, msgdata_t datatype, pthread_rope_t rid, int rank, int tag)
    Send a message to the thread specified by its relative index (rank) within the rope.
pthread_rope_barrier(pthread_rope_t rid)
    Participate in a barrier for the specified rope.
pthread_rope_bcast(any_t buffer, int count, msgdata_t datatype, pthread_rope_t rid, int root_rank)
    Participate in a broadcast for the specified rope, originating from the specified thread within the rope.
pthread_rope_exit(pthread_rope_t rid)
    Initiate exit; the rope terminates when all member threads have invoked this function.
pthread_rope_join(pthread_rope_t rid)
    Wait for the specified rope to exit.
pthread_rope_self(pthread_rope_t *rid)
    Return the rope identifier for the calling thread.
pthread_rope_rank(pthread_rope_t rid, int *rank)
    Return the rank of the calling thread in the specified rope.
pthread_rope_maxrank(pthread_rope_t rid, int *rank)
    Return the maximum rank (number of threads) for the specified rope.
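To close, the sketch below creates a rope, runs a broadcast-and-barrier body in each member thread, and waits for the rope to exit. The constants ROPE_STRONG and MSG_INT, the zeroed attribute standing in for defaults, and the registered name "rope_body" are all assumptions; the table states only that ropes are created with strong or weak consistency.

    #include <string.h>
    #include "chant.h"   /* assumed header name */

    /* body executed by every thread in the rope */
    void rope_body(any_t arg)
    {
        pthread_rope_t rope;
        int rank, maxrank, token = 0;

        pthread_rope_self(&rope);             /* rope id for this thread     */
        pthread_rope_rank(rope, &rank);       /* my rank within the rope     */
        pthread_rope_maxrank(rope, &maxrank); /* number of threads in rope   */

        if (rank == 0)
            token = 17;
        /* broadcast the token from rank 0 to all members (MSG_INT assumed) */
        pthread_rope_bcast((any_t)&token, 1, MSG_INT, rope, 0);

        pthread_rope_barrier(rope);           /* synchronize all members     */
        pthread_rope_exit(rope);              /* rope ends once all call it  */
    }

    void rope_example(int ncontexts, int context_list[])
    {
        pthread_rope_t rope;
        pthread_attr_t attr;

        /* zeroed attributes stand in for defaults; the attribute
           initialization convention is not specified by the table */
        memset(&attr, 0, sizeof(attr));

        pthread_rope_create(&rope, ROPE_STRONG);   /* ROPE_STRONG assumed */
        /* four threads per listed context, each running rope_body() */
        pthread_rope_addnew(rope, attr, "rope_body", (any_t)0, 0,
                            ncontexts, context_list, 4);
        pthread_rope_join(rope);              /* wait for the rope to exit */
    }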