Towards a First Vertical Prototyping of an Extremely Fine-grained Parallel Programming Approach

Dorit Naishlos (1), Joseph Nuzman (2,3), Chau-Wen Tseng (1,3), Uzi Vishkin (2,3,4)

(1) Dept of Computer Science, University of Maryland, College Park, MD 20742
(2) Dept of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742
(3) University of Maryland Institute for Advanced Computer Studies, College Park, MD 20742
(4) Dept of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel

Tel: 301-405-8010, Fax: 301-405-6707
{dorit, jnuzman, tseng, vishkin}@cs.umd.edu
Abstract

Explicit-multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. As such, XMT introduces a computational framework with 1) a simple programming style that relies on fine-grained PRAM-style algorithms, and 2) hardware support for low-overhead parallel threads, scalable load balancing, and efficient synchronization. The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. This paper takes this new opportunity to 1) evaluate the overall effectiveness of the interaction between the programming model and the hardware, and 2) enhance its performance where needed, incorporating new optimizations into the XMT compiler. We present a wide range of applications which, when written in XMT, obtain significant speedups relative to the best serial programs. Results show that the XMT architecture generally succeeds in providing low-overhead parallel threads and uniform access times on-chip. However, compiler optimizations to cluster (coarsen) threads are still needed for very fine-grained threads. We find that the lack of explicit task assignment has only a slight effect on performance for the XMT architecture. The prefix-sum instruction provides more scalable synchronization than traditional locks, and the flexible programming style also encourages the development of new algorithms that take advantage of the properties of on-chip parallelism. We show that XMT is especially useful for more advanced applications with dynamic, irregular access patterns, while for regular computations we demonstrate performance gains that scale up to much higher configurations than have been demonstrated before for on-chip systems.
1. Introduction

The challenge of speeding up the completion time of a single task has motivated the exploration of parallel algorithms and architectures. Researchers have considered parallelism in two main categories. The first is at an ultra fine-grained level, among instruction sequences. A range of hardware and compiler techniques exists to increase instruction-level parallelism (ILP) [HP96]. However, conditional branches, variable memory access latencies, and unpredictable program behavior prevent compiler and processor techniques from uncovering much ILP [LW92], [Wall93]. These barriers, as well as the generally limited amount of ILP present in programs, prevent computers from fully exploiting the large and fast-growing number of transistors available on modern chips. An alternative source of parallelism is at the level of coarse-grained multithreading, which has traditionally been expressed under the shared-memory or message-passing programming models, either directly by programmers or with the aid of parallelizing compilers. However, these traditional programming models were designed to target off-chip parallel architectures, where communication latencies are long and communication bandwidth between processing units is limited. As such, traditional parallel programming focuses on limiting communication, synchronization and load imbalance, and tends to express very coarse-grained parallelism, which does not fit a single-chip environment. XMT (explicit multi-threading) attempts to bridge this gap between ultra fine-grained on-chip parallelism and coarse-grained multi-threaded programs by proposing a framework that encompasses both programming methodology and hardware design. XMT tries to increase available ILP using the rich knowledge base of parallel algorithms. Relying on parallel algorithms, programmers express extremely fine-grained parallelism at the thread level, relieving the hardware of the need to detect parallelism at run time. This directly results in reduced hardware complexity and an easier model for programming, which together allow XMT to gain speedups on a larger range of applications.
Previous papers on XMT have discussed in detail its fine-grained SPMD multi-threaded programming model, architectural support for concurrently executing multiple contexts on-chip, and preliminary evaluations of several parallel algorithms using hand-coded assembly programs [VDB+98], [DV99]. The introduction of an XMT compiler allows us to evaluate XMT for the first time as a complete environment (“vertical prototyping”), using a much larger benchmark suite (with longer codes) than before. Due to the rather broad nature of our work, specialized parts of it – the evaluation of the XMT compiler and the evaluation of the programming model – were published in two respective specialized workshops [N+00], [N+01]. This paper incorporates the feedback from these workshops, and is the first to present the whole work, including the integrated results, as well as the interactions between the compiler and the programming model. The main contributions of this paper are as follows:
1. A prototype implementation of an XMT compiler.
2. The first experimental evaluation of the XMT environment using a compiler, reporting results from a wide range of experiments on a wide range of applications.
3. Presentation and evaluation of specific compiler optimizations for XMT.
4. Evaluation of the XMT programming features, and their interplay with the other components of XMT (compiler and architecture).
We begin by reviewing the XMT multi-threaded programming model in section 2. Section 3 reviews the XMT architecture, especially the features which support multi-threading. Section 4 presents the prototype XMT compiler and code generation model. We then describe the XMT evaluation environment in section 5, and evaluate the efficiency of our implementation in section 6. Section 7 discusses compiler optimizations to coarsen threads. Section 8 puts it all together by revisiting the XMT programming features and evaluating to what extent the hardware and compiler are able to support them efficiently. Section 9 presents related work, and section 10 concludes.
2. XMT programming model

When discussing parallel programming models, the parallel computing community usually considers two models: message-passing and shared-memory. Both models usually require some form of domain partitioning as well as load balancing. For dynamic, adaptive applications this effort can amount to 25% of the entire code and become a significant source of overhead [Henty00], [SSC+00]. Message-passing in addition requires distributing data structures across processors and explicitly handling inter-processor communication. Many of these issues, however, are of lesser importance for exploiting on-chip parallelism, where parallelism overhead is low and memory bandwidth is high. This observation motivated the development of the XMT programming model. XMT is intended to provide a parallel programming model which is simpler to use, yet efficiently exploits on-chip parallelism. These goals are achieved by a number of design elements. The XMT architecture attempts to take advantage of the faster on-chip communication times to provide more uniform memory access latencies. In addition, a specialized hardware primitive (prefix-sum) exploits the high on-chip communication bandwidth to provide low-overhead thread creation. These low overheads make it possible to support fine-grained parallelism efficiently. Fine granularity is in turn used to hide memory latencies, which, in addition to the more uniform memory accesses, supports a programming model where locality is less of an issue. The XMT hardware also supports dynamic load balancing, relieving programmers of the task of assigning work to processors. The programming model is simplified further by letting threads always run to completion without synchronization (no busy-waits), and by synchronizing accesses to shared data with a prefix-sum instruction. All these features result in a flexible programming style, which encourages the development of new algorithms and is expected to target a wider range of applications. The XMT programming model has a number of key features:
- Explicit spawn-join parallel regions
- Threads run to completion (no busy-waiting)
- Shared accesses synchronized with the prefix-sum instruction
The programming model underlying the XMT framework is an arbitrary CRCW (concurrent read concurrent write) SPMD (single program multiple data) programming model. In the XMT programming model, an arbitrary number of virtual threads, initiated by a spawn and terminated by a join, share the same code. At run time, different threads may have different lengths, based on control flow decisions made at run time. The arbitrary CRCW aspect dictates that when several threads write concurrently to the same memory location, an arbitrary one of them commits; no assumption can be made beforehand about which will succeed. This permits each thread to progress at its own speed from its initiating spawn to its terminating join, without ever having to wait for other threads; that is, no thread busy-waits for another thread. An advantage of using this SPMD model is that it is an extension of the classical PRAM model, for which a vast body of parallel algorithms is available in the literature. The programming model also incorporates the prefix-sum statement. The prefix-sum operates on a base variable, B, and an increment variable, R. The result of a prefix-sum (similar to an atomic fetch-and-increment) is that B gets the value B + R, while the return value is the initial value of B. The primitive is especially useful when several threads simultaneously perform a prefix-sum against a common base, because multiple prefix-sum operations can be combined by the hardware to form a multi-operand prefix-sum operation. Because each prefix-sum is atomic, each thread will receive a different return value. This way, the parallel prefix-sum command can be used to implement efficient and scalable (i) load balancing (a parallel implementation of a queue/stack), and (ii) inter-thread synchronization, by providing an ordering between the threads. The XMT high-level language is an extension of standard C. A parallel region is delineated by spawn and join statements. Every thread executing the parallel code is assigned a unique thread ID, designated TID. The spawn statement takes as arguments the number of threads to spawn and the ID of the first thread (the other TIDs follow consecutively). Prefix-sum takes the form of the ps function. It adds its second argument (the increment) to the first argument (the base) and returns the original value of the base.
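For concreteness, the following minimal illustration of the ps semantics uses variable names and values of our own choosing; it is a sketch, not an example taken from the XMT benchmarks.

    int B = 10;                /* base variable */
    int old;
    old = ps(&B, 3);           /* atomically: old receives 10, B becomes 13 */
    /* If two threads issue ps(&B, 3) concurrently, one receives 10 and the other 13;
       the order is arbitrary, but each thread is guaranteed a distinct return value. */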
Consider the following example of a small XMT program. Suppose we have an array of n integers, A, and wish to “compact” the array by copying all non-zero values to another array, B, in an arbitrary order. The code below spawns a thread for each element in A. If its element is non-zero, a thread performs a prefix-sum to get a unique index into B where it can place its value.

    m = 0;
    spawn(n,0);
    {
        int TID;
        if (A[TID] != 0) {
            int k;
            k = ps(&m,1);
            B[k] = A[TID];
        }
    }
    join();
The SpawnMT model of [VDB+98] does not allow for nested initiation of an arbitrary-size spawn within a parallel spawn region. Such a feature, while useful, would be difficult to realize efficiently with hardware support. As an alternative, [Vishkin00] extended the programming model to support a fork operation. A thread can perform a fork operation to introduce a new virtual thread as work is discovered. Forks must be executed one at a time by a single thread, but forks from multiple threads can be performed in parallel. A parallel region ends after all the threads, whether specified by the original spawn or initiated by subsequent forks, have been executed. The fork extension allows the programmer to approach many problems in a more asynchronous and dynamic manner. The fork is implemented by using a parallel prefix-sum to increment the number of virtual threads to be executed. The caller can use the result of the prefix-sum to set up initialization for the new thread. In XMT C, fspawn is used when forking may be necessary, and xfork performs the fork operation.
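As a rough sketch of this style (under our own assumptions: the exact fspawn/xfork signatures and the work/first/adj arrays are illustrative, not the definitive XMT-C interface), a thread processing a graph node might fork one new virtual thread per outgoing edge it discovers:

    /* Hypothetical sketch: fork a new virtual thread for each edge discovered at a node.
       work[], first[], adj[] and nroots are illustrative names, not taken from the paper. */
    fspawn(nroots, 0);                 /* like spawn, but threads inside may fork */
    {
        int node = work[TID];          /* the node this virtual thread is responsible for */
        int e;
        for (e = first[node]; e < first[node + 1]; e++) {
            int t = xfork();           /* assumed: allocates and returns a new virtual thread ID */
            work[t] = adj[e];          /* initialize the forked thread's work item */
        }
    }
    join();                            /* the region ends once original and forked threads finish */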
3. XMT architecture

The XMT programming model allows programmers to specify an arbitrary degree of parallelism in their code. Clearly, real hardware has finite execution resources, so not all virtual threads can run simultaneously. In an XMT machine, a thread control unit (TCU) executes an individual virtual thread. Upon termination, the TCU performs a prefix-sum operation in order to receive a new thread ID. The TCU then emulates the thread with that ID. All TCUs repeat the process until all the virtual threads have been completed. This functionality is enabled by support at the instruction set level. With our architecture, all TCUs independently execute a serial program. Each accepts the standard MIPS instructions, and possesses a standard set of MIPS registers locally. The expanded ISA includes a set of specialized global registers, called prefix-sum registers (PR), and a few additional instructions. New instructions are used for thread management. A spawn instruction interrupts all TCUs and broadcasts a new PC at which all TCUs will start. The pinc instruction operates on the PR registers, and performs a parallel prefix-sum with the value 1. A specialized global prefix-sum unit can handle multiple pincs to the same PR register in parallel. Simultaneous pincs from different TCUs are grouped, and the prefix-sum is computed and broadcast back to the TCUs. This process is pipelined and completes within a constant number of cycles. pread performs a parallel read (prefix-sum with the value 0) of a PR register, and pset is used (serially) to initialize a PR register. The psm instruction allows for communication and synchronization between threads. It performs a prefix-sum operation with an arbitrary increment to any location in memory. It is an atomic operation, but due to hardware limitations it is not performed in parallel (i.e., concurrent psm's are queued). This is equivalent to a fetch-and-increment. A ps command at the XMT-C level is translated to a psm instruction.
Figure 1: XMT compilation system. The XMT-C source (f.xmt) is translated by Xpass into C code (f.c), which the C compiler compiles to assembly (f.s) and the assembler turns into an executable.
Additional instructions exist to support the nested forking mechanism. A new thread to be forked likely requires some form of initialization. This initialization can be performed by the forking thread with the aid of psalloc and pscommit. The psalloc instruction works like a pinc, but the increment to the PR register is not visible to anyone else until the forking thread performs the corresponding pscommit. This allows the forking thread to initialize data before a forked thread starts. Note that, like the pinc instruction, psalloc/pscommit from many TCUs can be performed in parallel batches. The last new instruction, suspend, is also used when forking may occur. An idle TCU can suspend, waiting for its assigned thread ID to become valid, without consuming any execution resources.
4. XMT compiler

This section describes how the XMT compiler generates C source code from an XMT program. Parallel execution in the XMT architecture requires handling the following issues:
1. Transition to parallel mode: activating all the TCUs and setting up their environment.
2. Thread creation and termination: emulating the virtual threads on each TCU - obtaining a thread ID for each, and verifying that it is a valid ID (i.e., less than the spawn size).
3. Transition back to serial mode: detecting when all threads have terminated, and resuming serial execution.
In earlier presentations of XMT, these tasks were handled entirely by hardware automata. In this paper, we present a scheme whereby the preceding tasks are orchestrated by the compiler. This choice pays off in performance and flexibility. For example, the compiler is free to schedule certain operations to have a per-TCU cost rather than a per-thread cost. Additionally, the more general hardware allows for various extensions, such as different forking
schemes, and can easily support parallelization models other than XMT. The prototype XMT compiler consists of two phases, the front end (Xpass) and the back end (gcc). Figure 1 presents the XMT compilation process. The front end (Xpass) is a source-to-source translator based on SUIF [Wilson94]. This phase converts the XMT code with its parallel constructs into regular C code with specialized assembly templates for runtime threading support. The general scheme used by Xpass is based on transforming parallel codes into parallel procedures. The compiler transforms the parallel region (the code in the spawn-join block) into the body of a procedure. When the procedure is called, the processing units are awakened, and each starts to execute the procedure body, which emulates the threads on each TCU. Figure 2 presents a high-level example of the transformations performed by our compiler. Producing this structure involves two tasks:
1. Outlining. Detect all parallel regions (spawn-join blocks) and create a function definition for each (a “spawn-function”). Replace the spawn-join block with a call to the spawn-function.
2. Spawn-function transformation. Add TCU initialization code and thread emulation constructs to the spawn-function. These constructs include wrapping the body of the spawn-join block with a loop to emulate the threads, and inserting assembly templates.
Xpass is preceded by a number of SUIF passes, and may be followed by XMT optimization passes. The back end builds an executable from the C code produced by Xpass. As we based our simulator implementation on the SimpleScalar ISA, we used the version of gcc from the SimpleScalar 2.0 package, gcc 2.6.3.
Figure 2: XMT code shape.

Original XMT-C program:

    main() {
        spawn(num_threads, offset);
        {
            int TID;
            **THREAD-CODE**
        }
        join();
    }

Transformed to:

    main() {
        spawn_setup(num_threads, offset);
        main_0_spawn();
    }

    main_0_spawn() {
        int TID, max_tid, offset;
        spawn_init(&max_tid, &offset);
        TID = TCUID + offset;
        while (TID < max_tid) {
            **THREAD-CODE**
            TID = get_new_tid();
        };
        tcu_halt_suspend();
    }

5. XMT evaluation environment
A behavioral simulator, comparable to SimpleScalar [BA97], has been developed for an XMT architecture. The fundamental units of execution for the simulated machine are the multiple TCUs, each of which contains a separate execution context. In hardware, an individual TCU basically consists of the fetch and decode stages of a simple pipelined processor. To increase resource utilization and to hide latencies, sets of TCUs are grouped together to form a cluster. The TCUs in a cluster share a common pool of functional units, as well as memory access and prefix-sum resources. The clusters can be replicated repeatedly on a given chip. More details about the simulated architecture are described elsewhere [BNF+99]. Unlike previous designs, the simulated architecture does not have hard-wired thread management, and it uses a banked memory rather than a monolithic memory. For our experiments, we specify 8 TCUs in each cluster. Each cluster contains 4 integer ALUs, 2 integer multiply/divide units, 2 floating point ALUs, 2 floating point multiply/divide units, and 2 branch units. All functional unit latencies are set to the SimpleScalar sim-outorder defaults: integer divide, multiply and ALU ops take 20, 3 and 1 cycles respectively; floating point divide, multiply and ALU ops take 12, 4 and 2 cycles respectively; and square root takes 24 cycles. Each cluster has an L1 cache of 8 KB, and there is a shared, banked L2 cache of 1 MB. The number of banks is chosen to be twice the number of clusters. The L2 cache latency is 6 cycles and the memory latency is 25 cycles. A penalty of 4 cycles is charged each way for inter-cluster communication. Configurations are simulated with 1, 4, 16, 64, and 256 TCUs. (The 1 and 4 TCU configurations obviously have fewer than 8 TCUs per cluster.) Keep in mind that these numbers indicate the number of simultaneous execution contexts, and do not imply hardware functionality equivalent to the same number of standard microprocessors. The highest-end configuration simulated uses 32 clusters. At this point, connectivity to this degree has not been demonstrated for a single-chip system. The interconnection implementation is an important element of a scalable XMT hardware architecture. The simulator used reflects the results of VLSI experiments with a specific design, which are discussed in detail elsewhere [Nuzman01]. For the purposes of this paper, then, the results for the high-end configuration can be considered indicative of the potential for the XMT threading model to scale to high degrees of parallelism. This scalability is one of the most important features of the methods presented here. For our evaluation, we used a set of twelve benchmark codes taken from a variety of application areas. Their characteristics are shown in Appendix A.
6. Evaluation of the XMT Environment

This section evaluates the efficiency of the XMT environment by examining 1) the overheads that the parallel constructs incur; 2) memory stall behavior; 3) load balance; and 4) the scalability of the system. We focus here on general features of our platform, independently of programming considerations. Programming issues are discussed in section 8.

6.1 Overheads

Setting up a parallel region and managing the threads incurs an overhead. We can break this cost down into the following elements:
- Spawn Setup: setting up the environment, broadcasting data.
- TCU-Init: initializing the TCU's context.
- Thread Overhead: emulating threads on each TCU - obtaining a thread ID and verifying that it is less than the spawn size.
- Load Imbalance: idling at the end of a spawn until all threads complete, then transitioning back to serial mode.

We examined the costs that the different kinds of overheads incur, and observed several trends:

1. Overheads are generally very low. This allows XMT to obtain good speedups even for very small problem sizes and very fine-grained parallelism. Figure 3 reports spawn-block overhead costs for mmult.

Figure 3: Spawn block overheads (cycles spent on spawn setup, TCU init, and spawn-end in mmult, for 1, 4, 16, 64, and 256 TCUs).

These results, typical of XMT programs, demonstrate that setting up the parallel region is a cheap operation. The Spawn Setup and TCU-Init overheads are in general negligible, and remain low under increasing problem sizes and increasing numbers of TCUs. As a result, programs that involve many spawns and joins still perform well. As the number of TCUs increases, the opportunity cost of idle TCUs at the end of the parallel region becomes more significant. Note, however, that these overheads amount to less than 0.01% of the entire execution time of the program.

2. The most dominant overhead is the one charged to thread creation. We therefore concentrate on optimizations that aim to reduce this overhead (section 7). Figure 4 presents the overhead breakdown for jacobi.

Figure 4: Granularity effects on overheads (jacobi on 64 TCUs; spawn setup, TCU init, thread overhead, and load imbalance as a percentage of the fine-grained execution time, for the fine, clustered, and by-row versions at problem sizes 64x64, 128x128, and 512x512).

We report results for three problem sizes, comparing three versions for each: 1) “fine”: a fine-grained program, where each thread operates on a single matrix entry and consists of only 3 source lines; 2) “clustered”: the fine-grained version automatically coarsened by the XMT compiler, where each thread operates on a group of matrix entries (details on this optimization appear in the next section); 3) “byrow”: a coarse-grained version, where a thread operates on an entire matrix row. Observe that fine pays a heavy penalty for thread overhead, while byrow's most significant cost is due to load imbalance. The relative cost of thread overhead increases as the problem size becomes larger. The clustered version is able to eliminate most of the thread overhead, making it competitive with the coarse-grained program. As we describe below, this optimization can be tuned to reduce the load imbalance cost by reserving unclustered threads at the tail of a spawn, at the expense of a minimal increase in thread overhead.

3. The thread structure of the parallel algorithm greatly affects the overhead distribution.
Tradeoffs between different choices made by the programmer are clearly illustrated by the overhead distribution. To demonstrate this, we highlight the difference between synchronous and asynchronous programming styles on the dag computation. The synchronous version uses frequent spawns and joins, while the asynchronous version forks new threads to explore nodes as they are discovered. Figure 5 shows overhead breakdowns and speedups for both versions on two different graph sizes. As illustrated, the synchronous version pays a heavy price in load imbalance. The forking version is able to adapt to the unpredictable computational demands and avoid these costs. This advantage is evident in the speedups achieved, especially with more TCUs.

Figure 5: Fork versus synchronous programming. Left: overhead breakdown for dag on 64 TCUs (spawn setup, TCU init, thread overhead, wait for tid, and load imbalance as a percentage of execution time) for the sync and async versions at problem sizes 100 and 1024. Right: speedups over the best serial version on 1-256 TCUs.

6.2 Memory stall behavior

An interesting factor to examine is how memory stall behavior scales with the number of TCUs. We found that the ratio of time spent waiting on memory to time spent on processing was largely constant from 1 to 256 TCUs for most of the programs tested. As an example, Figure 6 shows the breakdown of TCU time between active processing (CPU), memory stalls, and idling for dbtree. As the number of TCUs increases, the memory stall share does not excessively increase. This can be attributed to the XMT architecture design, which relies on a high-bandwidth, scalable on-chip memory system.

Figure 6: Memory stall behavior, dbtree (breakdown of execution time into cpu, memory, and idle shares, for 1-256 TCUs).

6.3 Dynamic load balancing

The XMT architecture provides dynamic load balancing; newly created threads are automatically assigned to TCUs without complicated programmer intervention. This dynamic load balancing is particularly useful for handling cases where work partitioning is of an unpredictable nature. Figure 7 reports results for the dag program running on 16 TCUs, showing the execution time breakdown for each TCU.

Figure 7: Load balance in dag (per-TCU breakdown of cycles into setup, init, thread total, and end of spawn, for TCUs 0-15).

We observe that the disparity between TCU workloads is relatively small considering the irregular nature of the computation, with a large disparity between thread lengths. Each phase of the dag program advances in a topological-sort manner, where threads that handle nodes with in-degree greater than 0 terminate immediately and are therefore very short. On the other hand, threads that operate on nodes with in-degree 0 loop through all their outgoing edges. Out-degree varies considerably between nodes. As a result, thread lengths range from 96 to 516 cycles.

6.4 Scalability

To demonstrate the scalability of XMT we present speedups of XMT programs relative to the best serial version, for applications that are considered to be relatively parallelizable. These programs feature regular computations that operate on different entries of a data structure independently of one another. This allows a very simple parallelization scheme that involves minimal extra overhead, where basically a thread is spawned for each loop iteration in the serial version. The results shown in figure 8 demonstrate that XMT programs are able to obtain good speedups that scale up to much higher configurations (256 TCUs) than have been demonstrated before for a single-chip system. The low speedups demonstrated by tomcatv are attributed to a problem size that was too small (64 columns), limiting the available parallelism for the scheme used.

Figure 8: Scaling of speedups (speedups over the best serial version for jacobi, tomcatv, mmult, dot, dbscan, dbtree, and convolution on 1, 4, 16, 64, and 256 TCUs).
7. Compiler Optimizations

The XMT programming methodology encourages the programmer to express any parallelism, regardless of how fine-grained it may be. The low overheads involved in emulating the threads allow this fine-grained parallelism to be competitive. However, despite the efficient implementation, extremely fine-grained programs can benefit from coarsening, as it decreases the thread count, consequently reducing the overall thread overhead.
Furthermore, thread coarsening may make it possible to exploit spatial locality and to reduce duplicate work; however, these opportunities occur only in regular codes (such as scientific array-based computations), where they are also easy to detect and optimize automatically. By grouping consecutive threads, clustering exploits spatial locality and allows the programmer to ignore granularity and task assignment considerations, which are otherwise relevant. The XMT compiler detects cases where the length of the thread is sufficiently small (such that the thread overhead constitutes a significant enough portion of the thread). This parameter is evaluated at compile time using SUIF's static performance estimation utility. The compiler then automatically transforms these spawn-blocks such that fewer but longer threads are used. Furthermore, our optimization takes load balance considerations into account by reserving unclustered threads at the tail of the spawn. Tuning this value can reduce the load imbalance cost, at the expense of a minimal increase in thread overhead. Therefore, two sets of threads are emulated: the first set consists of the coarse clustered threads, and the second is the set of remaining fine-grained unclustered threads. There are two obvious ways to realize this: 1) split the parallel region into two different spawn blocks, one for each set of threads; or 2) add a conditional to the spawn block that determines, according to the thread ID, whether it is a clustered or a single-unit thread. The drawback of the first option is the cost of an additional full-weight join and spawn. The second option pays the additional cost of a conditional statement for each and every thread. We avoid both costs with the following scheme: we create a single spawn block, which contains two separate thread emulation loops, one for the clustered threads and one for the single-unit threads. Thus, given a fine-grained spawn-block of n threads, our compiler approach results in the following execution scheme. The execution starts with the coarse-grained version, and then, after m out of the original n threads have been emulated through the coarse-grained version, the execution switches to the finer-grained version to finish all n threads. Once the XMT execution crosses this threshold, the SPMD code becomes the fine-grained version, so any TCU that picks up virtual threads from that point on executes the fine-grained version directly, rather than having each thread determine that it is in the tail case and only then jump to the tail-case code.
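A minimal sketch of the resulting code shape, patterned after the thread-emulation loop of Figure 2, is shown below; CLUSTER, num_clustered, THREAD_BODY and the index arithmetic are illustrative names and assumptions of ours, not the compiler's actual output.

    /* Hypothetical code shape after clustering (illustrative names). */
    TID = TCUID + offset;
    /* First emulation loop: each clustered thread covers CLUSTER consecutive
       fine-grained threads of the original spawn. */
    while (TID < num_clustered) {
        int t;
        for (t = TID * CLUSTER; t < (TID + 1) * CLUSTER; t++)
            THREAD_BODY(t);            /* body of original fine-grained thread t */
        TID = get_new_tid();           /* prefix-sum allocation of the next virtual thread */
    }
    /* Second emulation loop: the reserved unclustered tail threads, one original thread each.
       A TCU whose allocated TID falls past the clustered set drops directly into this loop. */
    while (TID < max_tid) {
        THREAD_BODY(num_clustered * CLUSTER + (TID - num_clustered));
        TID = get_new_tid();
    }
    tcu_halt_suspend();

In this shape, reserving more tail threads reduces the load imbalance at the end of the spawn, at the cost of the small extra per-thread overhead discussed above.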
Figure 9 presents the overall improvement obtained by clustering, as a percentage of the original (fine-grained) execution time. We report results for LU, dbscan and jacobi. Two major factors contribute to the performance improvement: 1) Exploiting spatial locality: clustering reduces overall memory stall time by 20%, 64% and 14% for LU, jacobi and dbscan, respectively. 2) Eliminating duplicate work: clustering can potentially incur less duplicate work by scheduling operations to have a per-cluster cost rather than a per-thread cost, so the overall active processing time can be reduced. In LU and jacobi this is indeed the case; clustering reduces CPU time by 13% for LU, and by 21% for jacobi. In dbscan, however, clustering actually increases CPU time by 18%, due to the overheads that the clustering transformation introduces. The prefix-sum operation that the threads perform in dbscan inhibits the conservative compilation scheme from extracting computation out of the cluster loop. Consequently, clustering does not improve performance for dbscan.

Figure 9: Impact of clustering (overall improvement due to clustering, as a percentage of the original fine-grained execution time, for LU, jacobi, and dbscan; the bar values across the three programs are 35.24%, 12.12%, and 0.20%).
8. Evaluating the XMT programming model

Previous sections have established that 1) the XMT implementation is efficient: overheads are low and scalable, as are memory access times (section 6); and 2) XMT takes advantage of the on-chip environment to support a simple and flexible programming model (section 2). While traditional shared-memory programming consists of assigning chunks of work to processes that are as coarse-grained as possible, using locks and barriers for synchronization, XMT programs feature 1) no task assignment, 2) fine-grained parallelism, and 3) no busy-waiting. It remains to examine to what extent this programming methodology is able to excel in an on-chip environment. In this section we discuss how these features are manifested in the programming and performance of three different types of algorithms: 1) regular computations; 2) general reduction operations; 3) irregular dynamic computations. (Sorting algorithms and divide-and-conquer computations are discussed in [N+01].)

8.1 Regular Computations

This family of algorithms encompasses any computation that takes the form of looping through a sequence of independent operations, all consisting of the same amount of work, typically applied to different elements of a data structure. The access pattern characteristic of these algorithms is therefore very regular and predictable. Many scientific applications rely on this type of algorithm, including PDE solvers, mesh generators and linear algebra codes. The independence between the elements of computation in these programs naturally allows the parallelism present across all elements to be exploited. In XMT, this translates to a simple parallelization scheme, where a thread is spawned for each loop iteration. Traditional parallelization differs from XMT in having to first partition the domain into (coarse-grained) units of work, and devise a scheme that assigns these blocks of work to the processing units. For the simple family of algorithms discussed here, this domain partitioning and task assignment take the form of scheduling loop
iterations across the processors. For array-based applications, locality considerations determine the scheduling scheme, whether it is block-wise, cyclic, or blocked-cyclic (Appendix B demonstrates the difference between the two programming schemes). The different programming style directly affects the granularity of the parallelism that is expressed. The XMT programs, taking advantage of the parallelism present to the largest degree possible, often result in very fine-grained computations. For the programs we examined, the typical length of a thread is between 3 and 6 source lines. Traditional programming neglects the per-entry parallelism and distributes the work in a coarse-grained fashion. This in turn results in a very coarse-grained computation. To evaluate the effects of granularity and task assignment on performance we use LU, mmult, convolution and jacobi (figure 10). We compare the following versions for each: 1) “by-entry”/“by-row”, where a thread is spawned directly for each entry/row. For these regular computations, the coarser by-row versions outperform the fine by-entry versions, demonstrating the need for thread coarsening for this type of application. (The by-entry versions of mmult and conv are already relatively coarse-grained, which is why these programs are missing results for a by-row version.) 2) “trad”, where traditional-style programming with respect to task assignment is used. Here, a thread is spawned for a group of rows/entries. For mmult and convolution, both versions achieve similar speedups. The traditional mmult is able to amortize some duplicate work, while the XMT version of convolution takes the lead by avoiding some task assignment overhead. For the same reason, the by-row versions of LU and jacobi outperform the trad versions.

8.2 General Reduction

Reduction operations are characterized by being both commutative and associative. Algorithms that rely on this type of operation are therefore free to decompose the computation and perform it in any order. Reduction operations are used in programs from diverse application domains, from linear algebra to image processing.
Figure 10: Granularity and task assignment (speedups of the by-entry, trad., and by-row versions of LU 128, jacobi 512, mmult 128, and conv 128).

The different parallelization schemes for reduction operations vary in the way they handle the decomposition of the computation into parallel blocks of work and the update of the global reduction result. The first two algorithms we describe perform traditional-style work decomposition (i.e., coarse-grained chunks of work assigned to each processor) and differ in the mechanisms they use to update the global result:

Algorithm 1 (Atomic update): Each thread accesses the global variable to contribute its part of the computation. The individual updates are monitored using a synchronization utility. In traditional programming this utility takes the form of a lock. The “trad-PRlock” version uses the most efficient but somewhat specialized parallel prefix-sum operation, while “trad-memlock” uses a less scalable but more general fetch-and-add mechanism. Although traditional-style busy-wait locks can be supported in XMT, an XMT approach tries to use a prefix-sum instead. In the “x_trad-ps” version, the TCUs update the global dot product atomically with their portion of the computation using the parallel prefix-sum mechanism, instead of locking.

Algorithm 2 (Serial update): Here, after each thread has completed computing its portion of the reduction, the thread stores it into a global array (at the index that corresponds to its thread ID). When all threads have completed, a single thread sequentially scans the global array to compute the global result. This scheme avoids locks altogether, and relies on the implicit barrier at the join to ensure
that the global array is ready for the serial scan (the “x_trad-join” version). The traditional version (“trad-barr”) explicitly uses a barrier instead. Appendix C demonstrates the differences between the above implementations. Reduction operations are a classic example of a case where XMT employs an entirely different algorithmic approach than the traditional one. We present two algorithms, both of which utilize a binary tree structure [Vishkin00]. (To simplify the description of the algorithms, we assume that the reduction operation is a summation; however, they apply to any reduction calculation.)

Algorithm 3: This algorithm propagates values up the tree in a synchronous fashion, using a spawn and join for each layer of the tree (“xmt-sync”). Each spawn block consists of a thread for each node in that layer, which sums up the two children of that node. This is repeated for the next level, until the root of the tree is reached.

Algorithm 4: Here, the values are propagated in an asynchronous fashion, involving only one spawn-join block, within which threads advance without busy-waiting (“xmt-async”). This solution requires maintaining an additional data structure, a gatekeeper, to ensure that the sum of a node is computed only after the sums of both its children are ready. The computation proceeds as follows: after the sum of a node has been computed, the thread performs a prefix-sum on the gatekeeper of that node's parent. The result of the prefix-sum indicates whether it was the first thread to do so, in which case it terminates. Otherwise it was the second; it proceeds to calculate the sum for the parent, and continues.
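A minimal XMT-C sketch of this gatekeeper scheme follows; starting one thread per leaf and the names sum, gate, A, parent, left, and right are our assumptions for illustration, not code from the paper.

    /* Hypothetical sketch of the asynchronous tree summation (Algorithm 4). */
    spawn(n_leaves, 0);
    {
        int node = first_leaf + TID;               /* each thread starts at one leaf */
        sum[node] = A[TID];                        /* leaf value */
        while (node != root) {
            int arrived;
            arrived = ps(&gate[parent(node)], 1);  /* gatekeeper of the parent node */
            if (arrived == 0)
                break;                             /* first child to arrive: this thread terminates */
            /* second child to arrive: both children are ready, so compute the parent's sum */
            node = parent(node);
            sum[node] = sum[left(node)] + sum[right(node)];
        }
    }
    join();                                        /* sum[root] holds the reduction result */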
The general trend, illustrated in figure 11, is that programs using non-XMT synchronization (trad-PRlock, trad-memlock, trad-barr) scale poorly with the number of TCUs compared to the others (x_trad-ps, x_trad-join, xmt-sync). The exception is the xmt-async version, due to the amount of storage and extra work that it involves. We discuss other programming and performance trends in the next subsection, as the two types of algorithms demonstrate similar behavior.

Figure 11: Synchronization options in dot (4096 elements) and dag (100 nodes, 473 edges). Speedups of the parallel versions over the serial version on an increasing number of TCUs (x_trad-join, x_trad-ps, xmt-sync, trad-PRlock, trad-barr, trad-memlock, and xmt-async for dot; async-edge, async-node, sync, trad., and trad-lock for dag).

8.3 Irregular, dynamic computations

The algorithms we discuss here are characterized by highly irregular and unpredictable access patterns. Specifically, we consider computations that begin with a limited amount of work and discover additional work as they proceed. The newly discovered work typically requires splitting the processing into subtasks and combining the contribution of each as it completes. For example, consider rendering techniques that rely on ray tracing. There, primary rays are fired from a viewpoint through the pixels in an image plane and into a space that contains objects to be rendered. When a ray encounters an object, it is reflected from and refracted through the object, spawning new rays. The same operations are performed recursively on the new rays. In this manner, a primary ray generates a tree of rays, all contributing to the intensity and color of the same pixel. This pattern of computation appears in many applications beyond image processing, and in particular in applications that rely on breadth-first-search style algorithms. The irregular nature of these algorithms makes them good candidates for a less synchronous parallelization scheme. This is the programming approach that XMT employs, using dynamic forking as new units of work are discovered (be they rays, graph nodes/edges, etc.). The traditional parallel programming scheme typically involves the same features reviewed earlier, namely coarse-grained
work decomposition and task assignment. Here, however, a simple static task assignment, such as the one used for the regular array-based computations, leads to severe load imbalances. Traditional parallelization techniques therefore require explicitly balancing processor workloads, either by intelligent partitioning or by dynamic work stealing (such as the SPLASH-2 implementations of volume rendering, radiosity and ray tracing [NL92], [SGL94]). An XMT implementation would take one of the following approaches: 1) A synchronous approach, with a spawn at each stage of the computation. The first spawn block creates a thread for each preliminary unit of work (a primary ray, a node with in-degree 0, etc.); after a join, threads are spawned for newly discovered units of work. The process is repeated until all the work units (rays/nodes) are processed (“sync”). 2) Less synchronous approaches that fork a thread for every new piece of work as it is discovered. Typically, only one spawn block is used. The granularity of the work units for which a thread is forked varies between the different approaches. For example, in a breadth-first search, a thread can be spawned for every node (“async-node”) or, alternatively, for every outgoing edge (“async-edge”) for a more fine-grained, less synchronous algorithm. Traditional-style versions are based on the “sync” version, adding the necessary task decomposition. In either approach, certain data accesses in this computation require synchronization. Where the traditional programming style uses locks (“trad-lock”), XMT programs use the prefix-sum instead (“trad”).

Figure 12: Speedups for irregular applications (perimeter, quicksort, radix, treeadd, and dag) over the best serial version.

We demonstrate the programming tradeoffs of irregular applications using the dag program (figure 11). Similar trends to those observed for reduction computations are evident here, but to a greater extent:
1. XMT programs can be much simpler, as domain distribution and task assignment are not needed. This is particularly important here, where devising a scheme that achieves good load balance may be very challenging and requires substantial effort.
2. Reduced synchrony is often achieved at the expense of some additional programming effort. However, asynchronous programs should excel by enabling parallelism as soon as it is discovered (illustrated by the superiority of “async-edge/node” over “sync”).
3. Traditional programming using locks and barriers can be supported in XMT; programs can be implemented in XMT in the same way they are implemented under traditional parallel programming models. Furthermore, the traditional synchronization mechanisms can be replaced with the XMT utilities. This often
results in a somewhat simpler program (avoiding the need to declare and initialize the lock variables), which is also more efficient and scalable (illustrated by the superiority of “trad” over “trad-lock”).
4. The impact of fine granularity on programming is more significant in irregular programs that have traditionally resisted parallel solutions due to their unpredictable access patterns. Algorithms for such applications can often take advantage of fine-grained parallelism (illustrated by the superiority of “async-edge” over “async-node”).

Figure 12 shows results for other programs that have resisted parallel solutions due to the dynamic, irregular access patterns of the computation. Radix is an example of a program that is known to be very problematic with regard to obtaining speedups by parallelization; such programs require a lot of all-to-all communication under other programming models. SPLASH-2 reports very low speedups on their shared-memory multiprocessor, “due to a parallel prefix computation in each phase that can not be completely parallelized” [WOT+95]. To maximize scalability, our implementation of radix uses fine-grained parallelism wherever possible. This algorithm is much more work-intensive than the serial version, and hence does not achieve speedups for fewer than 16 TCUs. Perimeter and treeadd both involve traversing a tree from the root down, forking threads along the way, until the leaves are reached. The threads then work their way back up the tree, performing the fine-grained computation.
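A rough sketch of this pattern, combining the fork mechanism of section 2 with the gatekeeper idea of section 8.2, is given below; node_of, value, gate, sum and the tree helpers are illustrative names of ours, and the exact xfork interface is assumed rather than quoted from the paper.

    /* Hypothetical sketch of a treeadd-style traversal: fork on the way down,
       combine partial sums on the way back up. */
    node_of[0] = root;
    fspawn(1, 0);                              /* start with a single virtual thread at the root */
    {
        int node = node_of[TID];
        /* Downward phase: hand the right subtree to a forked thread, keep the left. */
        while (!is_leaf(node)) {
            int t = xfork();                   /* assumed: returns the new virtual thread's ID */
            node_of[t] = right(node);
            node = left(node);
        }
        sum[node] = value[node];
        /* Upward phase: the second child to reach a parent computes the parent's sum. */
        while (node != root && ps(&gate[parent(node)], 1) == 1) {
            node = parent(node);
            sum[node] = sum[left(node)] + sum[right(node)];
        }
    }
    join();                                    /* sum[root] now holds the total */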
In quicksort we use a hybrid algorithm, where we start in a synchronous, fine-grained fashion until sufficient partitions have been created. We then switch to handling all the partitions in parallel. The first part involves a lot of spawning and joining, whereas the second part is a single spawn that forks threads as new partitions are created. Results show speedups topping out at 64 TCUs. The speedups of these irregular programs demonstrate that XMT is able to obtain speedups for programs where traditional programming approaches have achieved very limited success. The reliance on the knowledge base of PRAM-style algorithms often encourages algorithms that result in more data movement than existing multi-threading approaches. However, as the XMT model is more resilient to unpredictable data access times, these algorithms are able to obtain good speedups.
9. Related work

XMT has tried to build on available technologies to the extent possible. The relaxation in the synchrony of PRAM algorithms is related to the work of [CZ89] on asynchronous PRAMs. Basic insights concerning the use of a prefix-sum-like primitive go back to the Fetch-and-Add or Fetch-and-Increment [FG91] primitives (cf. [AG94]). Another design point for on-chip parallelism is that occupied by chip multiprocessors (CMP), where independent processing units share an L2 cache, and each processor executes a different thread. A classical example of a CMP is the Stanford Hydra architecture [HNO97]. Research in this area has tended to focus on multiprogramming, and on speculative execution to extract threads from a single program. Other proposed multi-threaded architectures, such as Simultaneous Multithreading (SMT) [TEL95] or Multiscalar [Franklin93], also feature multiple program counters and make useful points of comparison. Recent work on SMT [TLE+99] has proposed light-weight synchronization methods for multithreading. In fact, the Acquire primitive is very similar to the suspend primitive presented here. The two instructions share motivations, since the execution of an XMT cluster, with its functional units shared among its TCUs, is quite similar in spirit to an SMT processor. However, threading support in SMT is not targeted towards supporting a PRAM-style program. XMT, with the parallel prefix-sum for example, and its support of SPMD programs, aspires to scale up to much higher levels of parallelism than other single-chip multi-threaded architectures currently consider.

Tile-based architectures, such as MIT's Raw [WTS97], also expect to scale to high levels of parallelism. However, Raw utilizes a message-passing model rather than the shared-memory model of XMT. In addition, Raw relies heavily on compiler technology to manage data distribution and movement between tiles. In comparison, the XMT architecture is much easier to program, and it is also expected to address a wider range of applications. The Tera (now Cray) Multi-Threaded Architecture (MTA) [AC+90] supports many threads on a given processor. The processors switch between threads to hide latencies, rather than running multiple threads concurrently. The MTA, like other MPP machines, is designed for big computations with large inputs. XMT aims to achieve speed-ups for smaller-input computations, such as those in desktop applications. Recent work on comparing different parallel programming models [Henty00], [CE00], [SSC+00], [CDS00] typically focuses on the shared-memory and message-passing programming models on multiprocessor systems. Our work attempts to examine parallel programming with respect to the different assumptions implied by an on-chip environment. MIT's Cilk [FLR98] provides a multi-threaded programming interface and execution model. There are two important differences in scope. First, since Cilk is targeted at compatibility with existing SMP machines, load balancing in software is an important element of the project. XMT requires hardware support to bind virtual threads to thread control units (TCUs) exactly as the TCUs become available. The low overhead of XMT is designed to be applicable to a much broader range of applications. Second, Cilk presents a programming model that tries to match standard serial programming constructs very closely, where forking a thread takes the form of a function call. While XMT also bases its
programming model on standard C, the programmer is expected to rethink the way parallelism is expressed. The wide-spawn capabilities and prefix-sum primitive are present to support the many algorithms targeted to the PRAM model.
10. Conclusion

XMT is a computation paradigm that spans from parallel algorithms, through their programming, to the hardware design. The compilation techniques described here enable competitive performance for XMT programs. We find that in some cases the programming features described above encourage the development of new algorithms that take better advantage of the properties of on-chip parallelism; in other cases, results show that thread coarsening is very important for exploiting spatial locality. The compilation scheme, combined with the efficient architecture and simple programming model, allows XMT to realize its uncompromising approach to parallelism. Performance gains are achieved for a wider range of problem sizes, granularities, and types of algorithms and computations. In particular, XMT is able to obtain speedups for small problem sizes, fine granularity, PRAM-style algorithms, and irregular computations that have resisted traditional parallel solutions. We present results of simulations on a variety of applications. The speedups that XMT programs obtain over serial programs demonstrate the applicability of the XMT framework and its efficiency across a variety of problem domains.
11. References

[AC+90] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, “The Tera Computer System,” Proc. International Conference on Supercomputing, 1990.
[AG94] G.S. Almasi and A. Gottlieb, Highly Parallel Computing, Second Edition, Benjamin/Cummings, 1994.
[AUS98] A. Acharya, M. Uysal, J. Saltz, “Active Disks: Programming Model, Algorithms, and Evaluation,” Proc. ASPLOS'98, October 1998.
[BA97] D. Burger and T. M. Austin, “The SimpleScalar Tool Set, Version 2.0,” Tech. Report CS-1342, University of Wisconsin-Madison, June 1997.
[BNF+99] E. Berkovich, J. Nuzman, M. Franklin, B. Jacob, U. Vishkin, “XMT-M: A scalable decentralized processor,” UMIACS TR 99-55, September 1999.
[CDS00] B.L. Chamberlain, S.J. Deitz, L. Snyder, “A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures,” Proc. of Supercomputing (SC), 2000.
[CE00] F. Cappello, D. Etiemble, “MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks,” Proc. of Supercomputing (SC), 2000.
[CZ89] R. Cole and O. Zajicek, “The APRAM: incorporating asynchrony into the PRAM model,” Proc. 1st ACM-SPAA, pp. 169-178, 1989.
[DV99] S. Dascal and U. Vishkin, “Experiments with List Ranking on Explicit Multi-Threaded (XMT) Instruction Parallelism,” Proc. 3rd Workshop on Algorithms Engineering (WAE-99), July 1999, London, U.K.
[Franklin93] M. Franklin, “The Multiscalar Architecture,” Ph.D. thesis, Technical Report TR 1196, Computer Sciences Department, University of Wisconsin-Madison, December 1993.
[FG91] E. Freudenthal and A. Gottlieb, “Process Coordination with Fetch-and-Increment,” Proc. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), 1991.
[FLR98] M. Frigo, C. Leiserson, K. Randall, “The Implementation of the Cilk-5 Multi-threaded Language,” Proc. of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1998.
[Henty00] D.S. Henty, “Performance of Hybrid Message-Passing and Shared-Memory Parallelism for Discrete Element Modeling,” Proc. of Supercomputing (SC), 2000.
[HMT95] H. Hum, O. Macquelin, K. Theobald, X. Tian, G. Gao, P. Cupryk, N. Elmassri, L. Hendren, A. Jimenez, S. Krishnan, A. Marquez, S. Merali, S. Nemawarkar, P. Panangaden, X. Xue, and Y. Zhu, “A design study of the EARTH multiprocessor,” Proc. Int. Conf. on Parallel Architectures and Compilation Techniques, 1995.
[HNO97] L. Hammond, B. Nayfeh, and K. Olukotun, “A Single-Chip Multiprocessor,” IEEE Computer, Vol. 30, pp. 79-85, September 1997.
[HP96] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publishers, Inc., 1996.
[LW92] M. S. Lam, R. P. Wilson, “Limits of Control Flow on Parallelism,” Proc. of the 19th International Symposium on Computer Architecture (ISCA), May 19-21, 1992.
[N+00] D. Naishlos, J. Nuzman, C.-W. Tseng, and U. Vishkin, “Evaluating Multi-threading in the Prototype XMT Environment,” Proc. 4th Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC2000), December 2000. Best Paper Award.
[N+01] D. Naishlos, J. Nuzman, C.-W. Tseng, and U. Vishkin, “Evaluating the XMT Parallel Programming Model,” to appear in Proc. of the 6th Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS-6), April 2001.
[Nuzman01] J. Nuzman, Master's thesis, University of Maryland, Department of Electrical and Computer Engineering, 2001.
[NL92] J. Nieh, M. Levoy, “Volume Rendering on Scalable Shared-Memory MIMD Architectures,” in Proceedings of the Boston Workshop on Volume Visualization, Boston, MA, October 1992.
[SGL94] J. P. Singh, A. Gupta, M. Levoy, “Parallel Visualization Algorithms: Performance and Architectural Implications,” IEEE Computer 27(7):45-55, July 1994.
[SSC+00] H. Shan, J.P. Singh, L. Oliker, R. Biswas, “A Comparison of Three Programming Models for Adaptive Applications on the Origin2000,” Proc. of Supercomputing (SC), 2000.
[TEL95] D. M. Tullsen, S. J. Eggers, H. M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proc. of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.
[TLE+99] D. Tullsen, J. Lo, S. Eggers, H. Levy, “Supporting Fine-Grained Synchronization on a Simultaneous Multi-threading Processor,” Proc. of the 5th International Symposium on High-Performance Computer Architecture, 1999.
[VDB+98] U. Vishkin, S. Dascal, E. Berkovich, and J. Nuzman, “Explicit Multi-threaded (XMT) Bridging Models for Instruction Parallelism,” Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 140-151, 1998.
[Vishkin00] U. Vishkin, “A No-Busy-Wait Balanced Tree Parallel Algorithmic Paradigm,” Proc. 12th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2000.
[Wall93] D. W. Wall, “Limits of Instruction-Level Parallelism,” DEC-WRL Research Report 93/6, Nov. 1993.
[Wilson94] R. Wilson et al., “SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers,” ACM SIGPLAN Notices, v. 29, n. 12, pp. 31-37, December 1994.
[WTS97] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal, “Baring It All to Software: Raw Machines,” IEEE Computer, Vol. 30, pp. 86-93, September 1997.
[WOT+95] S. Woo, M. Ohara, E. Torrie, J. Singh, A. Gupta, “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. of the 22nd Annual International Symposium on Computer Architecture, pp. 24-36, June 1995.
Appendix A: Benchmark programs

Domain                  | Program     | Description                                                              | Source          | Input Data Set           | # of CPU Cycles
Scientific Computations | jacobi      | 2D PDE kernel                                                            |                 | 512 x 512                | 7739190
Scientific Computations | tomcatv     | Mesh generation program                                                  | SPEC95          | 64 x 64                  | 209250860
Linear Algebra          | mmult       | Matrix multiplication                                                    | Livermore loops | 300 x 300                | 529615119
Linear Algebra          | dot         | Inner product                                                            | Livermore loops | 2 x 64K                  | 2064536
Database                | dbscan      | SQL Select query on a non-indexed attribute relation                     | [AUS98]         | 2100000 nodes            | 128975233
Database                | dbtree      | A batch of indexed-tree searches                                         | MySQL           | 131072 nodes             | 3504809
Image processing        | convolution | Image convolution                                                        | [AUS98]         | 128 x 128                | 64228266
Image processing        | perimeter   | Compute the total perimeter of a region in a binary image represented by a quadtree | Olden | 128 x 128 pixel quadtree | 1632686
Sorting algorithms      | quicksort   | Recursive sort using pivots                                              |                 | 16K nodes                | 7645175
Sorting algorithms      | radixsort   | Integer sort into buckets                                                |                 | 16K nodes                | 2112779
Misc                    | treeadd     | Summation of binary tree nodes                                           | Olden           | 64K nodes                | 5209216
Misc                    | dag         | Find maximum path in a DAG                                               |                 | 1024 nodes, 131157 edges | 1140753
Appendix B

We provide a high-level example to demonstrate task assignment effects on programming. We compare two versions, XMT and Traditional. The traditional version spawns exactly one thread for each TCU, and a loop is added to the thread body to span all the blocks of work assigned to that thread. The XMT version spawns a thread for each block of work directly. Note that even though the entire task of decomposition and assignment involves only a few extra source lines (the lb/ub computation and the loop in the traditional version), we may still get up to a 30% increase in the length of the code.

XMT:

    spawn(SIZE,0);
    {
        compute(TID);
    }
    join();

Traditional:

    spawn(tcus,0);
    {
        lb = SIZE*TID/tcus;
        ub = SIZE*(TID+1)/tcus;
        for (m = lb; m < ub; m++) {
            compute(m);
        }
    }
    join();
Appendix C

Atomic-update approach (Traditional, using locks):

    xmt_lock *lock;
    xlock_init(lock);
    spawn(nprocs, 0);
    {
        lb = N*TID/nprocs;
        ub = N*(TID+1)/nprocs;
        my_part = compute(lb,ub);
        xlock(lock);
        global_val += my_part;
        xunlock(lock);
    }
    join();

Atomic-update approach (XMT, using prefix-sum):

    spawn(nprocs, 0);
    {
        lb = N*TID/nprocs;
        ub = N*(TID+1)/nprocs;
        my_part = compute(lb,ub);
        ps(global_val, my_part);
    }
    join();

Serial-update approach (Traditional/XMT, using a barrier/join):

    spawn(nprocs, 0);
    {
        lb = N*TID/nprocs;
        ub = N*(TID+1)/nprocs;
        my_part = compute(lb,ub);
        arr[TID] = my_part;
    }
    join();
    for (i = 0; i < nprocs; i++)
        global_val += arr[i];