Conference on Parallel Architectures and Compilation Techniques, October 1996, Boston, Massachusetts, USA. ©1996 IEEE. Personal use of this material is permitted. Permission for other copying, reprint, or republication uses must be obtained from IEEE.

Nomadic Threads: A Migrating Multithreaded Approach to Remote Memory Accesses in Multiprocessors

Stephen Jenks
University of Southern California
Dept. of Electrical Engineering–Systems
Los Angeles, CA 90089
[email protected]

Jean–Luc Gaudiot
University of Southern California
Dept. of Electrical Engineering–Systems
Los Angeles, CA 90089
[email protected]

This paper describes an abstract multithreaded architecture for distributed memory multicomputers that significantly reduces the number of message transfers when compared to conventional “remote memory access” approaches. Instead of statically executing on its assigned processor and fetching data from remote storage, a Nomadic Thread transfers itself to the processor which contains the data it needs. This enables Nomadic Threads to take advantage of spatial locality found in the usage of many data structures, because the migration of a thread to a node makes access to surrounding data local. By reducing the number of messages and taking advantage of locality, the Nomadic Threads approach allows programs to use fewer data transfers than conventional approaches while providing a simple runtime interface to compilers. The Nomadic Threads runtime system is currently implemented for the Thinking Machines Corp. Connection Machine 5 (CM5), but is portable to other distributed memory systems, including networks of workstations.

1. Introduction

Distributed–memory multicomputers provide tremendous processing and communications capability that is potentially much more scalable than shared–memory multiprocessors. This computational power is achieved at the expense of ease of programming such systems and high latencies for remote memory operations. Parallelizing compilers for standard languages, standard languages enhanced with parallel features (C* [1], for example), and functional languages (such as SISAL [2] and Id [3]) attempt to solve the programmability problem by exploiting the parallelism in programs and mapping concurrent operations to different processors. These languages and their compilers allow programmers to write a single program that is partitioned across the nodes in the system.

Programs may be partitioned among the processors in the system in several ways, depending on the architecture of the machine. Systolic distributed memory systems, such as Warp [4], split the user program into a series of pipeline stages and send data periodically from one stage to the next. Other machines use a Multiple–Instruction, Multiple–Data (MIMD) approach, where each node runs an independent task that interacts with the tasks on other nodes as required. Many current distributed memory systems operate in a Single–Program, Multiple–Data (SPMD) mode, in which all nodes run the same program independently but with their own data. These node programs interact with each other to fetch data and exchange results. The approach presented in this paper is currently aimed at SPMD machines and has been applied to the CM5 [5], but it can be applied to MIMD machines as well.

Programs running on SPMD machines typically need access to large arrays, lists, or trees of data, often larger than the memory available in a single node. These data structures may either be stored on special–purpose data storage nodes (I–Structure [6] nodes, for example) or partitioned and distributed across the system. When a program running on a given node needs to access data that is not in the node's local memory, it performs a remote fetch operation to get the data. A fetch consists of a request message to the node where the data resides and a response message conveying the requested data back to the requester. The latency of the remote memory fetch depends on the characteristics of the underlying hardware, including the communications network, network interface, and interrupt service times (if interrupts are used), as well as the runtime system that provides remote memory operation support to the user program. No matter how fast the hardware, the two messages used in a remote memory fetch make the average latency of remote memory fetch operations higher than the memory access time of a conventional uniprocessor. This paper presents an abstract architecture that typically eliminates one quarter to more than one half of such messages by changing how programs execute when access to a remote piece of data is required.

Section 2 reviews existing hardware and software approaches to accessing remote data. Section 3 discusses our Nomadic Threads scheme in detail and contrasts it with other distributed memory access approaches. Section 4 presents results comparing the performance of Nomadic Threads execution to equivalent programs with remote memory fetches on the CM5. Finally, section 5 summarizes the results of this paper and lists some future work and research directions to further develop Nomadic Threads and remote data handling for distributed–memory machines.

2. Background and related research

Much work has gone into reducing the latency of remote memory accesses, tolerating the remaining latency, and partitioning data to reduce the number of remote accesses needed. Efforts have been directed at designing special–purpose architectures as well as software architectures and compilers that analyze data access patterns and map programs accordingly.

2.1. Hardware architectures

Dynamic and static data–flow systems [7] tolerate communications latency by exposing sufficient parallelism in the application program to keep the processors busy as data items are sent from the computation node that generated them to nodes that use them. These systems were built as specialized machines without a von Neumann processor–to–memory architecture. Because these research machines are complex, often have high overhead to match data with instructions, and do not have strong industry support, they tend to lag commercial RISC processors in terms of CPU speed. The data–flow machines prove that communications latency can be tolerated when there is sufficient parallelism exposed in the program. This idea is fundamental to most latency tolerance approaches, including Nomadic Threads. Data–flow research continues with Monsoon [8], Sigma-1 [9], and other recent data–flow machines.

While data–flow machines handled the transfer of data tokens well, they had trouble with large arrays, because it is not reasonable to send copies of arrays around the system. I–structures [6] were added to solve the problem. They provide storage on dedicated nodes to hold arrays of data items. I–structure fetch requests are “split–phase” operations consisting of a request message to an I–structure node and a response message with the data. If the requested data item is not yet available, the response is delayed until it becomes available (is written), thus providing a natural synchronization mechanism. The operation is called split–phase because the caller does not wait for the response to the fetch request. When the response arrives, the instruction waiting for the data will be matched and executed just like other data–flow instructions. I–structures can be applied to data–driven execution on conventional machines, as discussed in section 2.2.

Instead of building special–purpose processors that require new compilers, some computer makers have connected commercial processors, each with local memory, via high–speed interconnection networks. The CM5 uses SPARC processors connected by fat trees, while the Intel Paragon series uses i860s in a mesh, and the Cray T3D uses Alphas in a mesh. Such machines provide scalable computing and memory capacity and can use existing optimizing compiler back–ends, since the processors are standard. Because these types of distributed–memory computers comply with international and de facto commercial standards, are readily available, and can contain many processors and much memory, they have been popular for large–scale parallel processing. This type of machine is assumed for the remainder of this paper, although the techniques described are also applicable to networks of workstations working together as a parallel system.

2.2. Multithreaded architectures

Two types of multithreaded architectures are gaining popularity. Hardware multithreading [10] allows a processor to fetch instructions from several different instruction streams in order to keep the processor busy during memory waits and resource conflicts. Software multithreading approaches allow a processor to rapidly switch from one thread to another when the first thread finishes or executes a long– latency (remote) operation. Software multithreading is the assumed approach for this paper. Dennis and Gao [11] provide an in–depth discussion of multithreading principles. Functional languages like SISAL and Id expose both fine–grained and coarse–grained parallelism in programs. It is easy to utilize fine–grained parallelism in data–flow machines, but it is more difficult to map such fine–grained execution to a collection of conventional processors connected by an interconnection network. The problem is that message transfers take several orders of magnitude longer than an instruction’s execution time. Instead of executing one instruction for every one or two tokens that arrive, as in data– flow systems, many instructions must be executed to balance the message transfer time and increase the efficiency of the processor significantly above zero. Multithreaded systems collect sequential operations together into threads of instructions that can run on conventional or hybrid data–flow processors [12], [13]. Each of these threads becomes ready to run when its inputs are available, much like data–flow operations. Ready threads may be executed in parallel or in series based on available processing resources. In most systems, threads run to completion without interruption. Threads derived from a given source–language function or algorithm constitute a code block and use frames for their input storage and scratch space. The combination of a frame
and the code block that uses it is called an activation, as shown in figure 1. The activation's frame stores the state of execution for the threads and can represent a single iteration of a loop or other suitable unit of execution that can be executed independently and concurrently with other, similar activations. Each node in the system can have one or more activations, some of which share code blocks, but all have separate frames to store their state. The threads associated with these activations run concurrently and perform the source algorithm in parallel.

Figure 1. Relationship between code blocks, frames, and activations

In most multithreaded systems, when a thread performs a split–phase I–structure fetch, that thread does not wait for the fetch result. The result from the fetch is used as input for a thread that follows the one that issued the fetch instruction. Once all its inputs are available, the subsequent thread may run. Because threads do not wait for fetch completion, thread scheduling is very simple: any thread whose inputs are all available may be run, and once a thread is started, it is run to completion without preemption. Since thread scheduling is simple, multithreaded systems have reasonably low overhead.

The goal of multithreaded systems, like their data–flow predecessors, is to keep the processors in the system busy doing application program work while remote memory accesses take place, so the latency of remote operations is “tolerated.” The way to accomplish this is to make threads large enough that their computation time is not completely overwhelmed by the time required to fetch input data from other nodes. Having many threads available to run also helps keep the processor from becoming idle, but switching between many small threads requires more overhead than using fewer, larger threads.

The Threaded Abstract Machine (TAM) [14] is a software implementation of a multithreaded architecture that runs on conventional distributed–memory computers. It executes compiler–generated threads in parallel and emulates I–structure operations for array handling. A TAM thread is a collection of sequential instructions that do not jump out of the thread and only reference data available in the current frame, though they may issue I–structure fetches. Results from I–structure fetches and data from other threads are placed into inlets in the frame. Each thread has a set of inlets that, when full, allows the thread to become enabled. Results of a thread may be sent to the inlets of other threads.

Cilk [15] is another software multithreading system that uses threads specified in a modified C language. A closure, which stores the inputs of a thread, is ready to run when all its argument slots are full. Ready closures may be stolen by idle processors to balance the load.
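To make the inlet/closure mechanism concrete, the following sketch shows one way such enabling logic can work: a frame counts its outstanding inputs, and its thread is queued as ready when the last input arrives. The data structures and names here are illustrative only; they are not TAM's or Cilk's actual implementation.

// Minimal sketch (illustrative, not TAM/Cilk code): a frame whose inlets
// enable its thread once every expected input has arrived.
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Frame {
    std::vector<double> inlets;          // one slot per expected input
    int missing;                         // inputs still outstanding
    std::function<void(Frame&)> thread;  // runs to completion, never blocks
};

static std::queue<Frame*> ready;         // threads whose inputs are all present

// Called when a split-phase response (or another thread's result) arrives
// for inlet `slot` of frame `f`.
void deposit(Frame& f, int slot, double value) {
    f.inlets[slot] = value;
    if (--f.missing == 0) ready.push(&f);    // all inputs present: enable thread
}

int main() {
    Frame add{{0.0, 0.0}, 2, [](Frame& f) {
        std::printf("sum = %g\n", f.inlets[0] + f.inlets[1]);
    }};
    deposit(add, 0, 3.0);    // e.g., response to an earlier I-structure fetch
    deposit(add, 1, 4.0);    // second response enables the thread
    while (!ready.empty()) {
        Frame* f = ready.front();
        ready.pop();
        f->thread(*f);       // run to completion without preemption
    }
    return 0;
}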

2.3. Thread migration

Thread migration is an alternative to conventional message–based remote data access, because threads are sent to the node that contains a needed data item instead of bringing the data item to the node that contains the consumer thread. Nomadic Threads is a thread migration approach. Several thread migration approaches have been proposed previously. One approach [16] uses hardware support and explicit remote data access instructions to implement thread migration. Three remote data access operations are provided: one always causes the thread to transfer to the remote node that holds the data, another always fetches the data to the current node, and the third option migrates the thread or not at the discretion of the runtime system.

A software–based thread migration approach is implemented in the Olden project [17]. Olden uses a modified version of the C programming language where parallel operations are explicitly specified. When a data item that resides on a remote node is referenced, the runtime system causes the accessing thread to migrate to the remote node. Olden is implemented for the CM5 and the Intel iPSC/860.

In all these thread migration approaches, the code block associated with the thread is not moved, because it is replicated on the SPMD nodes of the machine. Instead, only the state of the current thread is migrated. In the previous two examples, this state consists of the current program counter, some status information, and the thread's stack frame, which stores the inputs and scratch data for the thread. Migrating stack frames and allowing threads to return after their migration requires sophisticated stack implementations for efficiency. Nomadic Threads takes a different approach to migration, which will be discussed in section 3.4.

2.4. Data partitioning

Proper partitioning and distribution of large data arrays across the nodes of the system is critical to reducing the number of remote memory accesses and increasing reference locality [18]. Data partitioning is applicable to Nomadic Threads as well as to the other multithreading approaches discussed above.

The goal of data partitioning is to keep as much data as possible local to the node that uses it. Given enough memory, all data could be replicated on all nodes, but this is unrealistic for most problems that require the class of machines assumed here. Therefore, the data must be distributed across the nodes of the machine just like the computation. The partitioning scheme is determined in several ways: the programmer may explicitly specify the partitioning in the application source code; the programmer may give partitioning hints to the compiler in the source code; or the compiler can attempt to analyze the data access patterns of algorithms and determine an optimal partitioning. The latter case is preferred, as it is the most automatic, but it is by far the most difficult and is the topic of ongoing research [19]. The other two approaches are used in many current parallel programming languages, but may require the programmer to be intimately familiar with the machine architecture and the exact data access pattern of their algorithm. The Nomadic Threads architecture allows any data partitioning approach to be used and takes advantage of the locality provided by it.
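As a concrete illustration of the kind of mapping a partitioning scheme provides, the sketch below computes which node owns a given row under a simple block-row distribution. The function and constants are our own example, not a scheme prescribed by the paper.

// Illustrative block-row partitioning: rows are dealt out in contiguous
// blocks of ceil(rows/nodes), so a row's owner and local offset can be
// computed on any node without communication.
#include <cstdio>

struct Placement { int node; int local_row; };

Placement owner_of_row(int row, int total_rows, int num_nodes) {
    int block = (total_rows + num_nodes - 1) / num_nodes;   // ceiling division
    return { row / block, row % block };
}

int main() {
    // Example: 480 image rows spread over 32 nodes gives 15 rows per node.
    Placement p = owner_of_row(137, 480, 32);
    std::printf("row 137 lives on node %d, local row %d\n", p.node, p.local_row);
    return 0;
}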

2.5. Caching

Caching is used in distributed memory systems to eliminate some remote memory accesses. When a data item is fetched from a remote node, it and some nearby (usually consecutive) data items may be cached on the local node. If any activation on the node needs to access any of the cached data items, they will be supplied from the cache. Because of locality principles, this will often be the case, thus eliminating some remote fetch operations and providing fast access to cached data items.

Caching adds complexity and overhead, however. All potentially remote data accesses must check if the data they need is in the local cache. Then, fetching several data items instead of one takes longer than only fetching the one item. If the other fetched items are never accessed, that additional time is wasted. A cache replacement policy must be added to determine how long to keep cached copies of data and which ones to replace when the cache is full. Finally, the issue of cache coherence can add a great deal of overhead in systems where data items can be overwritten. A way to invalidate the cached copies on other nodes must be implemented so nodes do not use stale data. The use of a single–assignment language, such as SISAL, can eliminate the need for cache coherence because each data item is only written once. The use of caching is not inconsistent with the Nomadic Threads approach, as will be discussed in section 3.3.

3. The Nomadic Threads architecture

The Nomadic Threads architecture [20] is an abstract architecture that provides a simple multithreading application program interface and supports thread migration. Details of the underlying hardware are hidden by a small set of data access and thread migration calls. These calls are not used by user programs, but the compiler makes use of them to perform runtime operations. The Nomadic Threads architecture is built around a philosophy that gives threads autonomous control over their own destiny. This applies to scheduling, as discussed in section 3.1, and thread migration, section 3.4.
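To give a feel for what such a compiler-facing interface might look like, the sketch below lists one possible shape for it. Every name and signature here is an illustration of the ideas developed in sections 3.1 through 3.4, not the actual runtime's API.

// Illustrative only: one possible shape for a compiler-facing interface.
// None of these names or signatures are the actual runtime's API.
#include <cstddef>

using NodeId = int;
struct Frame;                                   // per-activation state (section 3.1)
using ControlThread = int (*)(Frame&);          // reports finished / suspended
using ResultThread  = void (*)(Frame& parent, unsigned result_flags);

// Activation management (section 3.2)
Frame* create_frame(std::size_t bytes);         // allocate a child frame
void   spawn(Frame* child, NodeId where);       // current node, a chosen node, or any node

// Distributed array access (section 3.3): a read yields the element if it is
// local, or else reports the node that owns it; writes go to local storage.
bool array_read(int array_id, int index, double* value, NodeId* owner);
void array_write(int array_id, int index, double value);

// Thread migration (section 3.4): ship this activation's frame to `target`,
// where the runtime will call its Control Thread again.
void migrate(Frame* self, NodeId target);

int main() { return 0; }                        // declarations only, for shape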

3.1. Threads, activations, and scheduling

An activation is the combination of a set of threads in a code block and a frame that stores the inputs and state of those threads, as defined in section 2.2. Not all threads in an activation need to be ready to run at all times. Some or all of the threads in an activation may be waiting for data from other threads in the activation or other activations. Results may be generated and placed in their destination storage space at any time during the execution of the threads.

The granularity of an activation greatly depends on the algorithm from which it is derived. It could be an entire inner loop, one or more iterations of a loop body, or a single recursive function call, for example. In the naive matrix multiply benchmark, discussed in section 4.1, the innermost loop that computes the value of one result matrix element is turned into an activation. The activation traverses the appropriate row and column of the source matrices and performs an inner product as it does so. The pixel averaging benchmark described in section 4.2 uses a different approach. The obvious solution is to use one activation to compute the average for one pixel of the resulting image. This requires the creation of an activation for each pixel in the image, but since each activation does very little work, the activation creation overhead is high and more parallelism is exposed than is needed. Instead, if an activation is used to compute the result for a number of pixels, sufficient parallelism exists to keep the processors busy, but the overhead is reduced and the computation takes 30% less time.

Frames are storage blocks that contain inputs and the state of an activation. Figure 2 shows the frames for the matrix multiply and pixel averaging benchmarks. The first four fields exist in all frames and contain status flags used for synchronization, flags used to notify the parent activation when results are complete, and the node and address of the parent activation. The parent activation is the activation that created the current activation by making the frame, filling it in with initial values, and spawning its execution. When the current activation produces some results or finishes, it notifies the parent activation using the result flags to modify the parent's status flags.

For activations that have a small set of synchronization points (i.e., an input becoming available), a bit of the status flags field can be used per synchronization event. In other cases, the flags can be encoded for greater flexibility.

Figure 2. Example activation frames. Matrix multiply frame: Status Flags, Result Flags, Caller Node, Caller Activation, Dest X, Dest Y, Index, Count, Final, Operand, Accumulator. Pixel averaging frame: Status Flags, Result Flags, Caller Node, Caller Activation, X, Y, Y Max, X Max, End, Sum, Divisor.

The rest of the frames consist of inputs and state information used during the execution of the activation. In both examples in the figure, the bounds of the computation are passed in by the parent activation. In addition, both frames have an accumulator or sum field that is updated as the computation progresses. The unique element in the matrix multiply frame is the space for an operand. A matrix multiply operation consists of a large number of c = c + a·b operations, where c is the result element and a and b are elements from the source matrices. If the data items a and b reside on different nodes, one of them must be carried to the other node, where the multiply occurs. The algorithm always tries to fetch a local operand before migrating to the node containing the other operand if both are not local. This autonomous, rather than automatic, migration gives Nomadic Threads its flexibility.

Each activation's code block has at least two threads, but may have more. The first required thread is called the Control Thread, which is used for scheduling and, often, computation. The other is the Result Thread, which is used to receive results and to synchronize with child activations.

The minimum unit of scheduling in Nomadic Threads is the activation. The runtime system calls the Control Thread of an activation to start it running. Then, the Control Thread schedules the threads associated with the activation based on the availability of data as defined by the status flags. This approach exploits locality by allowing as much computation as possible for the current activation frame. When either the activation is finished or there is insufficient data available to continue, the Control Thread returns control back to the runtime system, which will cause the next activation to run. The return value of the Control Thread causes one of several events to happen: if the activation is finished, the frame will be deleted; if some threads are waiting on data or synchronization, the activation will be temporarily suspended until the appropriate data or synchronization event occurs. In the cases where the activation was suspended, the runtime system will call the Control Thread again, which will run any threads whose synchronization events have happened.

Overall, it is a simple scheduling scheme from both the runtime system's and the compiler's perspective, and it allows the runtime system to be unaware of activation details.

The Result Thread of an activation is called by child activations to synchronize with the parent and possibly return a result. This event normally occurs when the child activation has finished execution, but may occur at any time required by the application and the compiler. There are two main types of synchronization events defined for Nomadic Threads: normal rendezvous and counting rendezvous. In the normal rendezvous, the result flags of the child are passed to the parent activation's Result Thread, which modifies the status flags of the parent. Then, when the parent activation is executed, it sees that the child activation has synchronized with it. The counting rendezvous is used when a parent activation spawns many children. This case occurs often in loops, where the parent is the loop counter and the children are loop body iterations. A count is kept of the number of children spawned. Each time a child synchronizes by calling the Result Thread, that counter is decremented. When the counter reaches zero, the appropriate flag is set in the parent's status flags field. A final use for the Result Thread occurs when the parent activation migrates from the node where it spawned a child thread. In that case, the Result Thread is set up to forward the results to the parent's new node. This option has not been required in any benchmarks to date.
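The sketch below models this scheduling contract and the counting rendezvous with hypothetical types and names; it is a minimal illustration of the behavior described above, not the runtime's code.

// Minimal model: the Control Thread reports DONE or WAITING based on its
// status flags, and a counting rendezvous lets a parent wait for many
// children with a single counter. Names and types are illustrative.
#include <cstdio>

enum Status { DONE, WAITING };

struct Activation {
    unsigned status_flags;                  // synchronization bits set as inputs arrive
    int      pending_children;              // counting rendezvous counter
    Status (*control_thread)(Activation&);
};

// Result Thread for a counting rendezvous: each child rendezvous decrements
// the counter, and the last one sets the parent's "children done" flag.
void result_thread(Activation& parent, unsigned children_done_flag) {
    if (--parent.pending_children == 0)
        parent.status_flags |= children_done_flag;
}

Status example_control_thread(Activation& a) {
    if (!(a.status_flags & 0x1)) return WAITING;    // data not ready: suspend
    std::puts("all children finished; running the dependent thread");
    return DONE;                                    // runtime may now delete the frame
}

int main() {
    Activation parent{0, 2, example_control_thread};
    parent.control_thread(parent);      // WAITING: runtime suspends the activation
    result_thread(parent, 0x1);         // first child synchronizes
    result_thread(parent, 0x1);         // second child synchronizes, flag is set
    parent.control_thread(parent);      // now runs to completion
    return 0;
}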

3.2. Activation creation

Activation creation in the Nomadic Threads architecture is very simple. First, the parent activation creates storage for the child's frame, then fills it in with required initial information, and finally calls a spawn function that places the child frame in the execution list. The initial information contains the address and node of the parent activation, as well as result flags that will be used to notify the parent when the child is finished. Finally, parameters, such as loop bounds and initial values, can be stored into the child's frame by the parent.

The spawning function allows three options: 1) the child activation starts on the current node; 2) the child activation starts on a node specified by the parent; and 3) the child activation starts on any node. The third option can be used by the runtime system to spread load across the nodes of the system, although it may not be very helpful if the activation must immediately migrate to fetch data.

In addition to this simple activation creation scheme, a broadcast creation scheme is provided to create a very simple activation on each node in the system. These activations can go on to spawn other activations and act as loop counters or perform other functions. This mechanism is very fast and simple, but does not allow many parameters to be passed.
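A minimal sketch of the create, fill, and spawn sequence described above; the field names and the spawn_child helper are illustrative, not the runtime's actual calls.

// Sketch of the three-step creation sequence: allocate the child's frame,
// fill in parent linkage and parameters, then place it on the execution list.
#include <queue>

struct Frame {
    int      caller_node;         // where the parent activation lives
    Frame*   caller_activation;   // parent's frame address on that node
    unsigned result_flag;         // bit the child sets in the parent when done
    int      lo, hi;              // loop bounds or other initial parameters
    double   accumulator;         // scratch state used while the child runs
};

static std::queue<Frame*> run_list;   // stand-in for the node's execution list

Frame* spawn_child(Frame* parent, int my_node, unsigned flag, int lo, int hi) {
    Frame* child = new Frame{my_node, parent, flag, lo, hi, 0.0};   // create and fill
    run_list.push(child);                                           // spawn locally
    return child;
}

int main() {
    Frame parent{0, nullptr, 0, 0, 0, 0.0};
    spawn_child(&parent, /*my_node=*/0, /*flag=*/0x1, /*lo=*/1, /*hi=*/512);
    while (!run_list.empty()) {          // the runtime frees frames when activations finish
        delete run_list.front();
        run_list.pop();
    }
    return 0;
}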

3.3. Distributed data access

Two types of distributed data access are supported in the Nomadic Threads architecture: distributed array access and dynamic data structure access. Array access allows access to elements of multidimensional arrays that are partitioned or replicated across the nodes of the system. The size of the array elements and shape of the arrays are not specified in the architecture and depend only on the application program and the runtime system implementation. Therefore, irregular arrays of data structures are allowed and handled in the same way as matrices of integers.

Two array access function calls are provided. One allows a thread to read an array element. If the element is local to the current node, the call returns the value to the calling thread; otherwise the call returns the node ID where the data resides. The other call stores a value into a local array location. Currently no per–item synchronization, like that in I–structures, is provided, but that could very easily be added. The activation is completely isolated from the partitioning scheme and array handling implementation, thus either can be modified to increase performance or capability without modifying the Nomadic Threads application code.

Because the array handler implementation is invisible to the application code, it can use remote memory fetch with caching if desired. Since the implementation will simply return a cached result to the application, the application will never know what happened. A hybrid approach, in which remote fetches are used by the runtime system if the compiler determines that it is advantageous, would be a very reasonable extension to Nomadic Threads. Since the Nomadic Threads architecture is primarily intended for software implementation on a message–passing machine, some additional overhead is required to check if data is cached, but that time is less than the time required to send a message. Finally, the application could implement a type of caching where it could use its frame to carry around information that will soon be used. This final type of activation–specific caching would be completely under compiler control and would reduce the number of migrations at the expense of a larger frame.

The other type of data access is for dynamic data structures, such as lists and trees. This support is not needed for SISAL, but is useful for other languages, such as C. These data structures are implemented with link pointers that contain node information as well as address information. Then, if a link points to another node, the activation migrates to that other node to follow the link. Olden [17] has similar provisions for distributed dynamic data structures.
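A sketch of the node-aware link pointers described for dynamic structures; the GlobalRef layout and the follow helper are illustrative, not the runtime's actual representation.

// Sketch: a link that can cross node boundaries, so following it either
// dereferences locally or tells the activation where it must migrate.
#include <cstdio>

struct GlobalRef {              // a link that can cross node boundaries
    int   node;                 // which node holds the target
    void* addr;                 // address meaningful only on that node
};

struct TreeNode {
    double    value;
    GlobalRef left, right;      // children may live on other nodes
};

int my_node = 0;

TreeNode* follow(const GlobalRef& ref, int* migrate_to) {
    if (ref.node == my_node) {
        *migrate_to = -1;                           // local: keep traversing here
        return static_cast<TreeNode*>(ref.addr);
    }
    *migrate_to = ref.node;                         // remote: migrate before continuing
    return nullptr;
}

int main() {
    TreeNode leaf{3.0, {0, nullptr}, {0, nullptr}};
    TreeNode root{1.0, {0, &leaf}, {2, nullptr}};   // right child lives on node 2
    int where;
    follow(root.left, &where);                      // stays local
    follow(root.right, &where);                     // reports node 2
    std::printf("right child is on node %d, so the activation migrates there\n", where);
    return 0;
}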

3.4. Thread migration

The difference between the Nomadic Threads architecture and most other multithreaded architectures is thread migration. Many distributed memory parallel computers use an “owner computes” paradigm, where computation takes place on the node where the result will be stored and operands are fetched from other nodes as required. Nomadic Threads uses a different paradigm, in which an activation may start anywhere, but may migrate throughout the system in search of data and, finally, to a node with storage for the result. Figure 3 shows a notional example of the two paradigms being used to compute an element of an array that is distributed across several nodes. In this simple example, the owner computes case requires ten message transfers to request and receive remote data items, while the Nomadic Threads case requires only four activation transfers and exploits locality by using nearby items before migrating.

Figure 3. Notional access patterns (“Owner Computes” vs. Nomadic Threads)

Other architectures have proposed or implemented thread migration, as discussed in section 2.3, but none are as simple as the approach used in Nomadic Threads. While the approaches discussed in that section need to transfer stack frames during migration and merge results back into the stack upon return, Nomadic Threads just transfers the activation's frame to the remote node. All the information required by an activation, including its state and where to return, is included in the frame. Since frames are self–contained entities, transferring them can be done very simply with minimal overhead. Once on the new node, the activation's Control Thread is called and execution continues by retrieving the now–local data required. The activation can continue migrating throughout the system during its execution, never needing to return to its home node (except to get data that resides there) until it is completely finished.
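Mechanically, migration under this scheme amounts to shipping the frame bytes and re-queuing the activation on arrival. The sketch below models that; send_bytes stands in for the machine's message layer, and none of these names are real CM5 or runtime calls.

// Rough model of frame migration: the frame travels as a self-contained
// block of words, and the destination simply queues it so its Control Thread
// runs there. send_bytes() is a placeholder, not a real message-layer call.
#include <cstddef>
#include <cstring>
#include <queue>

struct Frame {
    int    control_thread_id;   // code blocks are replicated, so an id is enough
    int    caller_node;         // where results are eventually reported
    double state[8];            // inputs, indices, accumulator, ...
};

void send_bytes(int dest_node, const void* data, std::size_t len);   // placeholder

void migrate(const Frame& f, int dest_node) {
    send_bytes(dest_node, &f, sizeof f);    // the frame is the entire payload
}

// Handler on the destination node when the frame bytes arrive.
void on_frame_arrival(const void* data, std::size_t len, std::queue<Frame>& run_list) {
    Frame f;
    std::memcpy(&f, data, len);
    run_list.push(f);                       // scheduled normally; no stack to rebuild
}

void send_bytes(int, const void*, std::size_t) {}   // no-op stub so the sketch links

int main() { return 0; }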

An issue that arises as a result of thread migration is frame size. As can be seen from figure 2, frames are sometimes ten or more words long. This is usually longer than either of the request or response messages used in conventional systems to fetch remote data, but probably not larger than both such messages combined. In addition, for most machines, the overhead of sending a message outweighs the actual transfer time, so slightly longer messages do not take significantly more time than short ones. For this reason, frames can normally be ten or twenty words with no adverse effects, but if frames grow too large, transferring them during migration will take an unreasonable time.

This tradeoff is compounded by the fact that many thread generation techniques try to coalesce many fine–grain operations into each thread. This is normally good, because even though thread switching has low overhead, it still takes time. Therefore, the fewer thread switches, the better, assuming there are enough threads running to maximize parallelism. The problem is that longer threads often require more state information and more inputs, thus an increased frame size. So the thread generator for Nomadic Threads must trade off frame size vs. thread size. This is a topic of ongoing research.

4. Implementation and Results

The Nomadic Threads runtime system implements the Nomadic Threads architecture for the CM5. Thread migration uses active messages [21] built with the Connection Machine Active Message Library (CMAML) [22]. Active messages are very small messages that cause the execution of a specified handler routine on the remote node. They avoid the overhead of a higher–level protocol stack, but a simple protocol must be implemented using active messages to send frames, since they are always four words or larger.

The CM5 provides both polled– and interrupt–driven reception of messages. The polled approach has lower overhead if there is almost always a message ready to be received. The interrupt–driven approach keeps the networks from backing up with stalled messages waiting for a poll operation, but has more overhead. The Nomadic Threads runtime system was initially developed to use polling, but interrupt–driven messaging was added to test its characteristics. Not surprisingly, experiments showed that interrupts slowed down the execution of all the benchmarks they were used with, sometimes more than doubling the run time. This occurred because the amount of computation in each thread is quite small in these benchmarks, so polling occurs frequently enough. If the thread code were long and complex, interrupt–driven messaging would probably fare better than polling.

The runtime system also provides a simple array access implementation that complies with the Nomadic Threads architecture and allows application programs read and write access to arrays. The current implementation provides regular arrays and matrices and very simple partitioning schemes. Plans include the addition of support for irregular arrays and more complex partitioning schemes.
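As an illustration of the kind of simple framing protocol the text mentions, the sketch below slices a frame into small fixed-size packets and reassembles it on the receiving side. The packet size, the stand-in "network" vector, and all names are assumptions for this sketch; this is not the CMAML interface.

// Illustration of a simple framing protocol over small messages: a frame is
// sliced into fixed-size packets and reassembled by a handler on the receiver.
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr std::size_t kWordsPerPacket = 4;       // payload of one small message

struct Packet {
    int      frame_id;                           // which in-flight frame this belongs to
    int      offset;                             // word offset within that frame
    unsigned words[kWordsPerPacket];
};

std::vector<Packet> network;                     // stand-in for messages in transit

void send_frame(int frame_id, const unsigned* frame, std::size_t nwords) {
    for (std::size_t off = 0; off < nwords; off += kWordsPerPacket) {
        Packet p{frame_id, static_cast<int>(off), {}};
        std::size_t n = std::min(kWordsPerPacket, nwords - off);
        std::memcpy(p.words, frame + off, n * sizeof(unsigned));
        network.push_back(p);                    // one "active message" per packet
    }
}

void receive_frame(unsigned* frame) {            // handler side: reassemble in place
    for (const Packet& p : network)
        std::memcpy(frame + p.offset, p.words, sizeof p.words);
}

int main() {
    unsigned out[12], in[12];
    for (int i = 0; i < 12; ++i) out[i] = 100 + i;
    send_frame(7, out, 12);                      // a 12-word frame needs 3 packets
    receive_frame(in);
    std::printf("reassembled word 11 = %u\n", in[11]);
    return 0;
}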

% ******* Matrix Multiplication code fragment
% ******* From SISAL 1.2: A Brief Introduction
% ******* and Tutorial by David C. Cann

type OneDim = array[ integer ];
type TwoDim = array[ OneDim ];

function MatMult( A, B: TwoDim; M, N, L: integer returns TwoDim )
  for i in 1, M cross j in 1, L
    S := for k in 1, N
         returns value of sum A[i,k]*B[k,j]
         end for
  returns array of S
  end for
end function

Figure 4. Matrix multiply function in SISAL

The runtime system is written to be as architecture–independent as possible. Therefore, it can be ported relatively easily to other distributed memory systems. The message passing routines and the timers are all that need to be modified to port it to another machine. A simple port was done to Unix workstations [23], with encouraging results. On a small number of machines, the results were comparable to a small CM5, but as the number of machines grew, the lack of network scalability caused performance to tail off quickly.

The runtime system and benchmarks are written in C++. Some effort is made in the runtime system to reduce the number of frame allocations and deallocations, but the existing prototype code is designed for easy modifications and has not been highly optimized by hand. Once the desired operational characteristics are met, profiling will be used to determine critical areas for optimization. Ongoing experiments include rapid activation creation and control transfer to support fast recursive function calls. The current implementation of this technique is used in the Tree Add benchmark, discussed in section 4.3, but there is some room for improvement in terms of both speed and storage.

4.1. Matrix multiplication

A simple implementation of matrix multiplication was chosen as the first benchmark for several reasons. Though the simple algorithm, shown in figure 4, is not very efficient on distributed memory machines, it exposes a great deal of parallelism and has excellent locality. Since one of the goals of the Nomadic Threads project is to provide a runtime system that works with a SISAL compiler instead of requiring programmers to explicitly specify parallelism, the benchmark code is derived from the SISAL code shown. The hand compilation did not assume any brilliant compilation techniques, although it is possible that the hand generated code is more optimized than a compiler could produce.

The performance results of the Matrix Multiply benchmark are shown in figure 5. The NT–PMM (Parallel Matrix Multiply) entries use thread migration and the Nomadic Threads runtime system. The number in parentheses following each title is the number of CM5 nodes used. The PMM column shows the results for an implementation that uses remote memory access (without caching) instead of thread migration. Finally, the last column shows the timing of the Nomadic Threads version with interrupt–based message passing instead of polling. Figure 6 compares the execution times on a 32-node CM5.

Figure 7 shows the speedup curve for the NT–PMM benchmark using 512x512 element matrices. As can be seen in the figure, the speedup is linear up to 32 nodes. Beyond that point, the locality starts decreasing because not many rows or columns of the partitioned source matrices reside on each node. In fact, in the 512–node case, where only one row or column of the source matrices resides on each node, there is a slight slowdown from the 256–node case. Different partitioning and replication schemes could be used to increase the available locality, but this benchmark clearly demonstrates that Nomadic Threads takes advantage of data locality.

Figure 5. Matrix multiplication benchmark timings in seconds (the number in parentheses in each column heading is the number of CM5 nodes)

Matrix Dim.   NT–PMM   NT–PMM   NT–PMM   NT–PMM   NT–PMM   PMM      NT–PMM w/Int
(x * x)       (8N)     (32N)    (64N)    (256N)   (512N)   (32N)    (32N)
32            0.08     0.08     -        -        -        0.08     0.19
33            0.1      0.09     -        -        -        0.15     0.21
64            0.46     0.33     0.33     0.37     0.37     0.57     0.78
128           2.87     1.63     1.36     1.23     1.3      4.8      3.36
256           19.85    8.51     6.83     5.04     4.94     36.43    15.28
512           170.03   46.66    31.58    22.1     23.17    297.42   78.16

Figure 6. Matrix multiply CPU time on 32 nodes (seconds vs. matrix size for the remote memory access, Nomadic Threads, and Nomadic Threads w/Int versions)

Figure 7. Matrix multiplication speedup (512x512 matrices, up to 512 processors)

4.2. Pixel averaging

The pixel averaging benchmark averages each pixel's value with the values of its eight surrounding pixels in a grayscale image, resulting in an image that has been somewhat blurred. The computation required for each pixel is small, but the bounds checking required around the image edges complicates the computation. This benchmark has good locality, but there is no way to partition the data so that all accesses are local without replicating the image on all nodes. While replication is possible for small images, it is unreasonable for larger ones due to memory limitations on processor nodes. It is possible to rewrite this program to eliminate all communications after the initial setup by having each node send one row (or column) to its neighbors. This type of optimization runs contrary to efforts to make programs portable and easy to write, so it is not applied here.

The source and destination images were partitioned among the nodes in complete rows. The number of contiguous rows allocated to each node was computed as r = ⌈y/n⌉, where r is the number of contiguous rows, y is the height of the image, and n is the number of processor nodes. Therefore, some nodes may not have any rows to compute for small images, and if the image height is not evenly divisible by the number of processors, the last node may have fewer rows than others. This partitioning scheme was chosen because it allowed use of the CM5's parallel I/O functions. During execution, all nodes fetch their segment of the image simultaneously with a single command. Writing the resulting image to disk also used the parallel write features.

The statistics for the pixel averaging benchmark are listed in figure 8. The problem clearly scales with image size: the largest image is over 13 times the size of the smallest, and the ratio of their CPU times for 32 processors or fewer is also around 13. Figure 9 shows the speedup curves of the pixel averaging benchmark. The curves clearly show that problem size greatly affects speedup. With the larger images, there is much more locality available to be used for the configurations with more processors. In all cases, the speedup is linear for small numbers of processors, but the largest case shows a nearly linear speedup to 128 nodes, where the curve starts to level off.

Because of the very regular and local data access pattern of the pixel averaging benchmark, caching will provide a significant benefit. To compare Nomadic Threads with caching, we modified the Nomadic Threads runtime system to issue split–phase remote memory access requests when non–local data was required.

This required essentially no modification to the application program or the activation scheduling, so the comparison is fair. Because it is possible to fit four pixel values into a single active message, we implemented a 4 Kbyte direct–mapped cache with a line size of four items. Each remote request and response took a single message in each direction, so it was very efficient. For data items larger than a single byte, a multi–message scheme would be needed, so this is the best case for a cache fill scheme.

The last column of figure 8 lists the timing results of the cached remote memory access scheme. The timing of both approaches is quite comparable, with Nomadic Threads doing slightly better with fewer processors and the cached scheme running faster with very large numbers of processors. Since Nomadic Threads takes advantage of spatial locality, it does better when more data resides on each node, as in the fewer–processor cases. Since the image is spread thinly across the nodes in the 128–node and larger cases, spatial locality is subsumed by temporal locality, which is well supported by the caching scheme. Though there is much temporal locality in this benchmark, Nomadic Threads does remarkably well by running faster than the cached scheme in many cases and nearly as fast in most of the others.

Figure 8. Pixel averaging timing (CPU time in seconds)

              Nomadic Threads Runtime System          Remote Access w/ Cache
Proc. Nodes   640x480    1280x1024   2048x2048        2048x2048
1             12.11      50.34       168.22           161.56
2             5.95       25.23       83.28            84.54
4             2.9        13.66       41.77            45.08
8             1.55       6.39        20.53            21.44
16            0.82       3.21        10.47            10.64
32            0.48       1.79        5.4              5.58
64            0.33       1.01        2.88             2.76
128           0.24       0.67        1.65             1.28
256           0.2        0.49        1.1              0.65
512           0.19       0.4         0.89             0.36
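For reference, the sketch below shows the shape of the comparison cache described above: a direct-mapped cache whose four-item lines mean one remote fill brings in four neighboring pixels. The sizes follow the text (4 Kbyte, lines of four one-byte items); the code itself is illustrative, not the experiment's implementation.

// Sketch of a direct-mapped cache with four-pixel lines: consecutive reads
// hit after one fill, so far fewer remote messages are needed.
#include <cstdio>

constexpr int kLine  = 4;                        // pixels per line (one message's worth)
constexpr int kLines = 1024;                     // 4 Kbyte / 4 one-byte pixels per line

struct CacheLine { int tag = -1; unsigned char pix[kLine]; };
CacheLine cache[kLines];
int remote_fills = 0;

void remote_fetch_line(int base, unsigned char* out);   // one request/response pair

unsigned char read_pixel(int index) {
    int base = index / kLine, line = base % kLines;
    if (cache[line].tag != base) {               // miss: fill four pixels at once
        remote_fetch_line(base, cache[line].pix);
        cache[line].tag = base;
        ++remote_fills;
    }
    return cache[line].pix[index % kLine];       // hit: no message at all
}

void remote_fetch_line(int base, unsigned char* out) {   // dummy remote image data
    for (int i = 0; i < kLine; ++i) out[i] = static_cast<unsigned char>(base + i);
}

int main() {
    for (int i = 0; i < 16; ++i) read_pixel(i);  // 16 consecutive pixel reads
    std::printf("remote fills: %d (instead of 16 individual fetches)\n", remote_fills);
    return 0;
}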

Figure 9. Pixel averaging speedup (640x480, 1280x1024, and 2048x2048 images, up to 512 processors)

4.3. Tree addition

The tree addition benchmark recursively adds the values of nodes in a balanced tree that is distributed evenly across the processors. The tree addition algorithm is based on the algorithm given in figure 10. The activation tasked to add the root of the tree spawns an activation to recursively add the values of each child. These two activations, in turn, spawn their own child activations to do the same until the leaf nodes of the tree are reached. If any activation determines that the node it needs to add is on a different processor, it migrates to that processor. Once the leaves are reached, the recursion unwinds until the root node receives the results back from its child activations. The number of migrations in this benchmark is quite small and a great deal of locality is exploited.

Figure 11 shows the execution time for this benchmark using trees containing 100,000 and 200,000 nodes. In the speedup graph, shown in figure 12, the curve is quite linear up to 128 processors, but the slope is only 2/3, not the usual slope of 1. This loss of efficiency occurs due to a delay in migration. Though migrating activations queue themselves for migration during the initial recursion, the messages are delayed until the recursive spawn sequence is finished. This delay causes a reduction in available parallelism. As the number of processors grows, the number of recursive steps, hence the delay, decreases linearly, which causes the linear speedup curve. Once the delay is over, the computation takes advantage of the locality due to the tree distribution.

double TreeAdd(tree_node* my_node)
{
    double sum, r, l;
    if (my_node == NULL)      /* empty subtree contributes nothing */
        return 0.0;
    r = TreeAdd(my_node->right_child);
    l = TreeAdd(my_node->left_child);
    sum = my_node->value + r + l;
    return sum;
}

Figure 10. Tree addition algorithm

Figure 11. Tree addition timing (CPU time in seconds)

CM5 Nodes   100,000 Tree Nodes   200,000 Tree Nodes
1           3.655                4.743
2           2.456                3.114
4           1.292                1.553
8           0.644                0.776
16          0.323                0.397
32          0.163                0.205
64          0.083                0.111
128         0.048                0.061
256         0.034                0.038
512         0.0255               0.029

Figure 12. Tree addition speedup (100,000 and 200,000 tree nodes, up to 512 processors)

5. Conclusions and future work

We have shown that thread migration, as implemented in Nomadic Threads, is a viable alternative to conventional remote memory fetch approaches for distributed data. Benchmark results showed that Nomadic Threads programs took advantage of spatial locality. This locality, plus the autonomous migration capability of Nomadic Threads, significantly decreases the amount of communications required to access data across the machine. A Nomadic Threads version of a benchmark well–suited to cached remote memory access performs as well as the cached approach in many cases.

Nomadic Threads provides a simple, abstract multithreaded architecture model to compilers. The architecture provides thread migration to other processors to find required remote data. Spawning and synchronization mechanisms round out the architecture. The architecture has been implemented for the CM5 and networks of workstations.

There are several significant areas of future work planned for Nomadic Threads. The first is to build a SISAL compiler backend that targets the Nomadic Threads architecture. This will allow us to develop larger benchmarks and measure the performance of Nomadic Threads with compiled code. This requires significant study because of the thread building issues described in section 3.4. In addition, studies of a hybrid thread migration/remote data access system and efficient scheduling of recursion need to be done.

Acknowledgment

This research was supported in part by ARPA Grant No. DABT63-95-0093 and by a Northrop Grumman fellowship.

References

[1] C* User's Guide–Version 6.0.2, Thinking Machines Corporation, Cambridge, MA, 1991.

[2] J. R. McGraw, S. Skedzielewski, S. Allan, D. Grit, R. Oldehoeft, J. R. W. Glauert, I. Dobes and P. Hohensee, “SISAL: Streams and Iterations in a Single Assignment Language: Language Reference Manual, version 1.2,” Technical Report TR M–146, University of California – Lawrence Livermore Laboratory, 1985.
[3] R. S. Nikhil, “Id (Version 88.0) Reference Manual,” Tech. Report CSG Memo 284, MIT Lab for Computer Science, Cambridge, MA, 1988.
[4] M. Annaratone, F. Bitz, E. Clune, H. T. Kung, P. Maulik, H. Ribas, P. Tseng and J. Webb, “Applications and Algorithm Partitioning on Warp,” in Proc. COMPCON Spring '87, 1987.
[5] The Connection Machine CM5 Technical Summary, Thinking Machines Corporation, Cambridge, MA, 1991.
[6] Arvind and R. E. Thomas, “I–Structures: An efficient data type for functional languages,” Technical Report LCS/TM–178, MIT Laboratory for Computer Science, 1980.
[7] Arvind, L. Bic and T. Ungerer, “Evolution of Data–Flow Computers,” in Advanced Topics in Data–Flow Computing, J–L. Gaudiot and L. Bic, (Eds.), Prentice Hall, Englewood Cliffs, NJ, 1991, pp. 3–33.
[8] D. E. Culler and G. M. Papadopoulos, “The Explicit Token Store,” Journal of Parallel and Distributed Computing, Vol. 10, pp. 289–308, 1990.
[9] Electrotechnical Laboratory, Computer Architecture Section, http://www.etl.go.jp:8080/etl/comparc/welcome.html, 1996.
[10] M. Gulati and N. Bagherzadeh, “Performance Study of a Multithreaded Superscalar Microprocessor,” in Proc. Second International Symposium on High–Performance Computer Architecture, pp. 291–301, 1996.
[11] J. B. Dennis and G. R. Gao, “Multithreaded Architectures: Principles, Projects, and Issues,” in Multithreaded Computer Architecture: A Summary of the State of the Art, R. Iannucci, G. Gao, J. R. Halstead and B. Smith, (Eds.), Kluwer Academic Publishers, Boston, pp. 1–72, 1994.
[12] G. R. Gao, “A Flexible Architecture Model for Hybrid Data–Flow and Control–Flow Evaluation,” in Advanced Topics in Data–Flow Computing, J–L. Gaudiot and L. Bic, (Eds.), Prentice Hall, Englewood Cliffs, NJ, pp. 327–346, 1991.
[13] P. Evripidou and J–L. Gaudiot, “The USC Decoupled Multilevel Data–Flow Execution Model,” in Advanced Topics in Data–Flow Computing, J–L. Gaudiot and L. Bic, (Eds.), Prentice Hall, Englewood Cliffs, NJ, pp. 347–379, 1991.
[14] D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken and J. Wawrzynek, “Fine–grain Parallelism with Minimal Hardware Support: A Compiler–Controlled Threaded Abstract Machine,” in Proc. 1991 International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 164–175, 1991.
[15] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall and Y. Zhou, “Cilk: An Efficient Multithreaded Runtime System,” in Proc. Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.
[16] H. H. J. Hum and G. R. Gao, “Supporting a Dynamic SPMD Model in a Multi–Threaded Architecture,” in Proc. Compcon '93, pp. 165–174, 1993.
[17] A. Rogers, M. C. Carlisle, J. H. Reppy and L. J. Hendren, “Supporting Dynamic Data Structures on Distributed Memory Machines,” ACM Transactions on Programming Languages and Systems, vol. 17, no. 2, pp. 233–263, 1995.
[18] J. Ramanujam and P. Sadayappan, “Compile–Time Techniques for Data Distribution in Distributed Memory Machines,” IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 4, pp. 472–481, 1991.
[19] E. H.–Y. Tseng and J–L. Gaudiot, “Multi–Dimensional Modular Hyperplane and Automatic Array Partitioning,” Technical Report PPDC 96–02, Dept. of EE–Systems, University of Southern California, 1996.
[20] S. Jenks and J–L. Gaudiot, “Nomadic Threads: A Runtime Approach for Managing Remote Memory Accesses in Multiprocessors,” Tech. Report 95–01, Dept. of EE–Systems, University of Southern California, 1995.
[21] T. von Eicken, D. E. Culler, S. C. Goldstein and K. E. Schauser, “Active Messages: a Mechanism for Integrated Communication and Computation,” Communications of the ACM, pp. 256–266, 1992.
[22] CMMD Reference Manual–Version 3.0, Thinking Machines Corporation, Cambridge, MA, 1993.
[23] N. Guérin and J–L. Gaudiot, “Simulation of the Communications Libraries of the CM–5 on UNIX Workstations,” Technical Report 95–19, Dept. of EE–Systems, University of Southern California, 1995.