StackThreads/MP: Integrating Futures into Calling Standards

TECHNICAL REPORT TR99-01

StackThreads/MP: Integrating Futures into Calling Standards

Kenjiro Taura, Kunio Tabata, and Akinori Yonezawa

February 1999

Department of Information Science, Faculty of Science, University of Tokyo

7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan

TECHNICAL REPORT TR99-01

TITLE

StackThreads/MP: Integrating Futures into Calling Standards

AUTHORS

Kenjiro Taura, Kunio Tabata, and Akinori Yonezawa

KEY WORDS AND PHRASES

threads, stack management, dynamic load balancing, shared memory multiprocessors

ABSTRACT

An implementation scheme of fine-grain multithreading that requires no changes to current calling standards for sequential languages and only modest extensions to sequential compilers is described. Like previous similar systems, it performs an asynchronous call as if it were an ordinary procedure call, and detaches the callee from the caller when the callee suspends or either of them migrates to another processor. Unlike previous similar systems, it detaches and connects arbitrary frames generated by off-the-shelf sequential compilers obeying calling standards. As a consequence, it requires neither a frontend preprocessor nor a native code generator that has a builtin notion of parallelism. The system practically works with the unmodified GNU C compiler (GCC). Desirable extensions to sequential compilers for guaranteeing portability and correctness of the scheme are clarified and claimed to be modest. Experiments indicate that sequential performance is not sacrificed for practical applications and that both sequential and parallel performance are comparable to Cilk [10], whose current implementation requires a fairly sophisticated preprocessor to C. These results show that efficient asynchronous calls (a.k.a. future calls) can be integrated into current calling standards with a very small impact both on sequential performance and on compiler engineering.

REPORT DATE

February 1999

WRITTEN LANGUAGE

English

TOTAL NO. OF PAGES

33

NO. OF REFERENCES

30

ANY OTHER IDENTIFYING INFORMATION OF THIS REPORT

A summary of this report [28] has been published in ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '99). Updates are available from http://www.yl.is.s.u-tokyo.ac.jp/sthreads/.

DISTRIBUTION STATEMENT

First issue 35 copies.

SUPPLEMENTARY NOTES

DEPARTMENT OF INFORMATION SCIENCE Faculty of Science, University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan

StackThreads/MP: Integrating Futures into Calling Standards Kenjiro Taura, Kunio Tabata, and Akinori Yonezawa

ftau,tabata,[email protected] University of Tokyo 7-3-1 Hongo Bunkyo-ku, Tokyo 113-0033, Japan June 22, 1999

1 Introduction

1.1 Background

Threads are becoming common practice for programmers [13, 14]. Programs that handle asynchronous inputs, such as GUIs and network servers, are naturally written using threads, and parallel programs enjoy performance benefits by multiplexing a single CPU and/or executing threads on multiple CPUs. Threads are even more useful when they can be fine-grained, in the sense that the programmer can create a large number of threads without worrying about the overhead of creating them. Fine-grain threads significantly enhance the maintainability of parallel programs, which is increasingly becoming an important issue as parallel platforms become widespread. Fine-grain multithreaded systems have typically been provided in programming languages that have a builtin notion of parallelism [1, 4, 12, 18, 26, 29, 30], whereas coarse-grain systems, such as Pthreads and Solaris threads, are available as libraries accessible from standard sequential languages, whose compilers are not necessarily aware of multiple threads of control. This is not an accident, because the underlying execution mechanism of fine-grain multithreaded systems assumes control flows and frame management schemes that are not present in (most) sequential languages. Thus, these primitives have been implemented with a native code generator [8, 11] or a preprocessor that performs a fair amount of transformation [10, 20]. As a consequence, previous fine-grain multithreaded systems suffer from one (or more) of the following drawbacks:

Low Accessibility: The function is accessible only from a particular programming language. If the language does not meet the requirements of the user's interest in any way (e.g., interoperability with other languages, supported platforms, or programming environment), the system will not be used and the user cannot access the function at all.

Poor Performance: Some systems translate sources into portable C. Such systems generally suffer in the performance of multithreaded operations; they must somehow emulate non-standard control transfers by combining calls, returns, and jumps within a function, and perform general frame management in addition to the stack management performed by C. Other systems translate sources into native code for fast thread management. Ironically, these systems also suffer in performance in practice, because they cannot readily exploit optimizations implemented in sequential compilers.

Engineering Issues: Having non-standard calling conventions and stack frame formats, some systems interact poorly with code compiled by sequential compilers, and lack debugger support.

Of these three factors, performance has been the primary concern while the community was investigating fundamental performance issues and parallel platforms were uncommon, so that people's interest in parallel software was necessarily limited. Today, the community has enough experience in understanding performance issues [5, 8, 16, 20, 22, 27], and small- or medium-scale shared-memory computers are very common. We therefore believe that the other two factors will become equally important.

Figure 1: Compilation for StackThreads/MP. A source file is compiled by a sequential compiler to assembly, which is processed by our postprocessor. The output of the postprocessor is another assembly file, which is fed to a standard assembler. The object file is linked with the StackThreads/MP runtime library. The dotted rectangles are the components developed for StackThreads/MP, whereas the white ones are standard.

1.2 Overview of Our Approach

Based on the above observations, this paper describes a new implementation strategy for fine-grain multithreading that does not require a sophisticated preprocessor or a code generator built from scratch. The system is called StackThreads/MP; it is a fine-grain thread library usable from standard sequential languages. By library, we specifically mean that the function is implemented without relying on sophisticated preprocessors or code generators. By standard sequential languages, we specifically mean languages that obey the calling standard on a given platform. With these features, StackThreads/MP programs can benefit from future advances in compilation technology of mainstream languages such as C/C++ and can interoperate with existing sequential sources as well as other StackThreads/MP sources. On multiprocessors, StackThreads/MP supports thread migration for load balancing between processors. To summarize, StackThreads/MP is a fine-grain analogue of traditional coarse-grain thread libraries such as Pthreads. StackThreads/MP implements non-standard control flows by a combination of a simple assembly language postprocessor and runtime libraries (Figure 1). The programmer can attach a new thread to any procedure call. The source program is fed to an unmodified sequential compiler that generates assembly, which is then processed by the postprocessor. The postprocessor examines the generated assembly, assuming that procedures obey the calling standard. It extracts such information as the location of the return address in stack frames and attaches the extracted pieces of information to the object file. The runtime system implements non-standard control transfers by directly reading or modifying some slots in stack frames, consulting the tables that describe their formats. The present work extends the authors' previous work [27] in the following three ways.



- It employs a more involved stack frame management in which frames of suspended threads are not copied but retained in the stack, whereas the previous work uses a simpler method that moves frames of suspended threads out of the stack. The present work removes the limitation of the previous work that aggregate data cannot be put on the stack. It also achieves faster context switching.



- It supports dynamic thread migration on shared-memory multiprocessors, whereas the previous work only deals with multithreading within a single processor.



- It employs an assembly language postprocessor to guarantee its portability, whereas the portability of the previous work was unclear. To obtain stack frame formats, the previous work depended on a particular stack frame format (on SPARC) or on the runtime procedure descriptor (on Alpha). By writing our own assembly language postprocessor, the present work is more versatile and its portability is guaranteed on a wide range of platforms.

As of this writing, the system runs on Pentium, SPARC (without register windows), Mips, and Alpha. The library is approximately 10,000 lines long, of which approximately 1,000 lines are CPU-dependent (in total over all the CPUs). The assembly language postprocessor is an AWK script approximately 4,000 lines long. As the sequential compiler that compiles the user program, we use the GNU C compiler (GCC) [23] in our experiments. We currently rely on some extensions provided by GCC; therefore, although StackThreads/MP in principle works with other sequential compilers obeying the calling standard, at present it works only with GCC. Nevertheless, the degree of departure from standard sequential compilers is much smaller than in previous multithreading systems, and it is practical to extend existing sequential compilers to support StackThreads/MP. In Section 6 and Section 7, we summarize the extensions and the code generation style needed to guarantee portability and safety of StackThreads/MP. To test the viability of this approach, we compared the performance of StackThreads/MP with that of Cilk [10]. Cilk is an extension to C, and the current implementation of Cilk translates Cilk programs into C that performs its own frame management. Cilk was chosen for comparison because its function is similar to ours (C + multithreading on shared-memory multiprocessors), it is widely available with benchmarks, and porting applications from Cilk to StackThreads/MP is straightforward. Performance was comparable both on uniprocessors and on multiprocessors.

1.3 The Structure of the Paper

Section 2 mentions related work. Section 3 describes the low-level machinery. Section 4 is devoted to the implementation of thread migration on multiprocessors. Section 5 details the stack management and proves its correctness. Section 7 clarifies features of GCC that we exploited but that are not available in all compilers. It also proposes desirable compiler extensions to guarantee portability and safety of StackThreads/MP programs. Section 8 reports the performance of the system. Finally, Section 9 states conclusions.

2 Related Work

Traditional thread libraries, such as Pthreads, share a goal with our work in that they provide multithreading usable by programs compiled by sequential compilers (such as C/C++ compilers). They can therefore make use of optimizations implemented in sequential compilers. To be compatible with the standard stack management and the calling convention of sequential languages, however, they sacrifice the performance of thread creation in several ways. First, creating a thread requires allocation of a new stack and saving all callee-save registers of the parent thread. It typically costs thousands of instructions on typical RISC processors. Second, threads of the same priority are typically scheduled in FIFO order, increasing space requirements for stacks. For these reasons, creating a large number of threads is almost always impractical. A common practice in parallel programming is to create a fixed number of threads, which is typically the number of processors, design a data structure that represents a unit of work, and share a pool of available work among these threads. Fine-grain multithreading systems can be viewed as an automation of this process. The programmer simply creates a new fine-grain thread wherever it is desired, and the system represents a fine-grain thread by a compact data structure. A thread is typically represented by a list of activation records (stack frames), which is much smaller than a whole stack. Moreover, threads are typically scheduled in

LIFO order (more precisely, each processor gives a higher priority to a child thread than to its parent), which is typically very efficient in terms of space requirements. In [2], Blumofe and others showed that for a class of computations in which a thread is blocked only to wait for the completion of its descendant (strict computation), LIFO scheduling requires at most p times the space (for activation frames) of the sequential execution, where p is the number of processors.1 These systems therefore tolerate a large number of threads, as long as the computation is strict or almost so. Having LIFO scheduling, a thread creation involves operations very similar to a procedure call in sequential languages. Multithreaded execution under LIFO scheduling still departs from true sequential execution mainly in the following two ways. (1) A thread creation must have a provision to schedule the caller (parent thread) even if the callee (child thread) is not finished (i.e., a thread may block). (2) Once a thread is blocked, we must have a way to restart its continuation at any future point of execution. In Sections 3 and 4, we will make it clear that the implementation of multithreading boils down to the implementation of the above two operations. Many multithreading systems have been proposed, and their basic execution schemes are common, as mentioned above. For the purpose of this paper, we focus on how they differ in implementing the above operations.



- Some systems define a custom stack frame format and a calling convention, and generate native code that obeys these conventions [8, 11, 19]. Implementing non-standard control transfers is not an issue in these systems. To make the implementation simpler and multithreading operations faster, they are typically incompatible with the standard; some require a dedicated free list for allocating stack frames, and some prohibit callee-save registers. The main advantage is the efficiency of multithreading operations, which comes from the freedom in choosing an appropriate stack frame format, frame allocation strategy, register-usage conventions, and conventions for migrating threads between processors. A drawback is the amount of implementation effort; writing a native code generator under a non-standard convention requires re-implementing a large part of a sequential compiler and other low-level tools such as debuggers.



- Some systems define their own stack frame formats and generate portable C. A thread creation is compiled into a procedure call plus some bookkeeping operations in C. The main issue is how to block a C procedure in the middle of its execution and restart it later. A common strategy is that each procedure explicitly allocates a frame (in addition to the native stack frame implicitly allocated as a consequence of a procedure call) and keeps its live values in that frame [10, 20, 25]. When a procedure blocks, the control simply returns from the C procedure, discarding the implicitly allocated stack frame while keeping the explicitly managed one. Restarting a C procedure calls the C procedure again, giving the explicitly managed frame as an argument, and the procedure jumps to the blocking point (using a `goto' statement) after loading the live values. An obvious advantage is its portability and simplicity. A drawback is that the generated code may disable some of the optimizations by the C compilers, because such a transformation obscures the control flow and the definitions and uses of variables in the original source. Another drawback is that the implementation effort is not readily shared by multiple languages, because this approach relies on a preprocessor that performs a substantial amount of transformation.2



- The authors' previous work [27] demonstrated fine-grain multithreading that requires neither a nontrivial frontend nor a native code generator. The present work extends that work in several ways. First, unlike the previous work, it does not move stack frames; hence it permits address-exposed local variables and stack-allocated aggregates (arrays and structures). Second, it implements thread migration (dynamic load balancing) on shared-memory multiprocessors. Finally, it shows how to ensure the portability of the method by assembly language postprocessing.

Table 1 lists existing fine-grain multithreading systems. To the authors' best knowledge, the present work is the first system that supports thread migration on shared-memory multiprocessors and allows the user program to be directly compiled with a sequential C compiler, without a sophisticated preprocessor that transforms sources into C.

1 On the other hand, FIFO scheduling requires space proportional to the number of threads in the worst case.
2 Note that the preprocessor must perform a form of live-range analysis to determine which values are spilled out to the explicitly managed frame.

Name                    MP    compilation strategy
LTC [17]                yes   compile to native
MP-LTC [7]              yes   compile to native
Schematic [19]          yes   compile to C
Cilk [10]               yes   compile to C
Concert [20]            yes   compile to C
Lazy Threads [11]       no    compile to native
Olden [21]              no    compile to native
Old StackThreads [27]   no    use standard C compiler
present work            yes   use standard C compiler

Table 1: Comparison of the present work to existing fine-grain multithreading systems. `MP' is `yes' when it supports transparent thread migration on multiprocessor systems. `Compilation strategy' is `compile to native' if it compiles into native code that adopts a non-standard stack usage, `compile to C' if it outputs C after a fair amount of transformation, and `use standard C compiler' when it directly uses a standard C compiler as a code generator (without sophisticated preprocessing).

3 The Low-Level Machinery

3.1 Assumptions

StackThreads/MP assumes that the operating system supports multiple threads of control that share memory (either via kernel threads or via multiple processes that share a part of their address spaces). To distinguish a fine-grain thread provided by StackThreads/MP from a thread directly supported by the operating system, this paper uses the words OS-thread, or worker, to refer to a thread in the latter sense, no matter whether it is actually a kernel thread or a process. A worker has its own stack, which serves as the free storage for allocating new stack frames. Workers are invisibly preempted and scheduled by the operating system. Simply stated, StackThreads/MP is a library that provides a means to multiplex a single OS-thread among multiple fine-grain threads and to migrate a fine-grain thread between OS-threads. It is a library in the sense that the user program can be compiled by conventional sequential compilers. As mentioned in Section 1.2, we currently assume that the user program is compiled by GCC, but the critical assumption is that the user program is compiled into assembly code that obeys the calling standard defined in the environment. Therefore, StackThreads/MP is likely to be usable from other procedural/object-oriented languages, such as Modula-3 [3] and Eiffel [15], with modest extensions to their compilers. We assume that each non-leaf procedure generated by the C compiler keeps a separate frame pointer register (FP), aside from the stack pointer register (SP), and that SP is not used to access local and temporary variables.3 Of the four CPUs on which StackThreads/MP is currently running, this is not the case on Alpha and Mips, where FP is not kept for procedures that use fixed-sized frames. Fortunately, every C compiler has a provision to force a procedure to keep a separate FP, because it is necessary to support procedures that use variable-sized frames (i.e., procedures that call alloca).
Thus it should be straightforward for a C compiler to support a compile-time option that forces every procedure to keep FP. Currently, GCC and DEC CC support such an option, and our experiments use this option. See Section 6 for the more precise meaning of this assumption.

3.2 The Basic Execution Model

Provided the above assumption, a stack frame is linked to its parent (caller) by storing the frame pointer of its parent into its stack frame. All the stack frames in a stack effectively form a linear list of frames from the stack top to the stack bottom (Figure 2). When a procedure returns, it fetches the saved frame pointer and other callee-save registers that were saved at the entry of the procedure, and jumps to the return address. This effectively implements the LIFO execution of sequential languages, but the return sequence itself does not quite assume LIFO order; a return sequence is just a general mechanism that loads some registers with whatever values are written in its stack frame and jumps to whatever location is written in the return address slot. The basic idea of StackThreads/MP is to twist this normal execution order by examining/modifying these links between procedures. In the true LIFO execution, addresses of frames in a stack monotonically decrease from the bottom to the top (assuming a stack grows towards lower addresses). In StackThreads/MP execution, on the other hand, frames may be detached from or appended to a stack in a much more liberal manner, as a result of a suspension or a resumption of a thread. Each worker still owns a linear list of frames, which we call the logical stack of the worker, but the sequence of frame addresses in a logical stack may not be monotonic; a link between two frames may go in the same direction as the stack growth or even connect frames in different stacks (Figure 3). We use the term physical stack to refer to the contiguous region from which each worker allocates new frames. Given these terminologies, the invariants of StackThreads/MP execution can be stated as follows.

3 More precisely, we assume that SP-relative addresses are used solely to place outgoing arguments to procedure calls (except in procedure prologues and epilogues).

Figure 2: Structure of a stack in the sequential (LIFO) execution. A frame saves a pointer to its parent, effectively forming a linear list of frames starting from the current frame. Both SP and FP point to the current frame. Addresses of frames are monotonic.

Invariant 1: FP of a worker points to the frame at the logical stack top,4 whereas SP points to the physical stack top.

Invariant 2: When FP does not point to the frame at the physical stack top, the physically top frame is extended to place outgoing arguments of the procedure being executed.

The first invariant guarantees that the frame allocation sequence generated by the compiler works without any modification. The second invariant may not be very obvious. When a procedure calls another procedure, it may write the arguments in the caller's stack. This region, which we hereafter call the arguments region of the frame, is generally addressed by SP+x, where x is an offset towards the stack bottom (i.e., positive when a stack grows towards lower addresses). In typical calling standards, the compiler calculates the maximum of such x's for each procedure and allocates a stack frame of an appropriate size at the entry of a procedure.5 In other words, a prologue sequence of a procedure preallocates its arguments region, and the compiler assumes that the region is always accessible via SP whenever the procedure is being executed. The problem is that this assumed size of the arguments region varies from one procedure to another; thus, if SP does not point to the frame of the currently running procedure, the procedure may destroy the frame pointed to by SP when it passes large arguments to another procedure. We fix this problem simply by allocating enough space on the physical stack top whenever the currently executing frame is physically not on the stack top. The size of the allocation is simply the largest size of the arguments region over all procedures, so that we need not adjust the size on each procedure return. Other potential problems exist in the execution under the invariants stated above. We explore this issue further in Section 6 and Section 7.

4 Where FP exactly points depends on the machine convention and is not important for this paper. On some conventions, such as those of Pentium and SPARC, FP points to the bottom-most end of the frame, whereas on others, such as those of Mips and Alpha, it points to the topmost end of the frame.

Figure 3: Structure of stacks in StackThreads/MP execution. Unlike sequential execution, a link between frames may connect arbitrary two frames. Even two frames that are not in the same physical stack may be linked. Each worker still has a linear list of frames starting from the frame pointed to by its FP, called its logical stack. The dotted box (the frame currently executed) and the shaded boxes form the logical stack of worker 0. White boxes may or may not belong to another worker's logical stack. Each worker has a physical stack, which is a contiguous region from which it allocates frames.

3.3 Postprocessing

The assembly language postprocessor takes compiler-generated code as input, and performs the following tasks.



- It tampers with the epilogue sequence of each procedure for our frame management; the original epilogue code unconditionally frees a stack frame (by moving SP to the bottom-most end of the stack frame), but this is safe only when the frame is physically on top of the worker's stack. We insert a few instructions that check whether it is safe to free the frame and, if not, keep SP at its original value. Other registers (FP and other callee-save registers) are restored normally. Details of the stack management and its correctness are given in Section 5.



- It generates an almost identical replica of the epilogue sequence for each procedure. The replica is used to virtually unwind a stack frame of the procedure by restoring FP and other callee-save registers that are saved by the procedure, while keeping SP at the same position. The replica differs from the above tampered sequence in just two ways.

5 The only exception we know of is Pentium, on which an arguments region is not allocated at entry and arguments are dynamically pushed onto the stack top. The problem described here does not occur in this convention.

#define ASYNC_CALL(e) { __st_fork_block_begin(); e; __st_fork_block_end(); }

Figure 4: The definition of the ASYNC_CALL macro. It simply surrounds procedure call e by a pair of procedure calls __st_fork_block_begin and __st_fork_block_end. These procedure calls indicate that e is a fork and are removed by the postprocessor.

  - The replica never frees the frame.
  - The replica is made `pure,' in the sense that it performs nothing other than restoring FP and callee-save registers. In most cases, an epilogue sequence generated by C compilers is already pure, but compilers may in general interleave instructions for restoring registers with other instructions. We do not include these instructions in the replica to avoid unexpected side effects during unwinding.



- It generates a table that describes the frame format and some other pieces of information for each procedure. Specifically, a descriptor of a procedure includes:

  - the address of its pure epilogue,
  - the FP-relative offsets of the slots where it saves its return address and the parent's FP,
  - the maximum SP-relative offset used to address its arguments region. This is necessary to maintain the second invariant described in Section 3.2. The postprocessor simply checks every store instruction in the procedure body that uses SP as the base register and computes the maximum of such offsets, and
  - the addresses of fork points in the procedure, which are the addresses of call instructions that are marked as forks. A call instruction is marked as a fork simply when it is surrounded by a pair of dummy procedure calls. The names of the dummy procedures are __st_fork_block_begin and __st_fork_block_end, but the names are arbitrary and their only purpose is to indicate that the procedure call between them is a fork. The postprocessor removes these procedure calls.

Other pieces of information are also included, but we will introduce them as they become relevant. Descriptors from several object files are collected into a single table at link time, and the runtime system accesses the descriptor of a procedure by searching the table using any address within the procedure as a key. This architecture is very similar to, and was in fact inspired by, the runtime procedure descriptors found on some platforms [6], supported primarily for exception handling.

3.4 The Core Runtime APIs

Now we explain the core primitives of StackThreads/MP. The following description ignores stack management issues, which are fully described in Section 5.



- ASYNC_CALL(e), where e is a procedure call expression, simply executes procedure call e and marks the call site as a fork point in the way described in Section 3.3. We assume there are no nested procedure calls in the argument positions of e.6 At runtime, ASYNC_CALL involves no operations other than e. Unlike a normal procedure call, however, e may return to the caller even when e has not finished. This gives us the basic mechanism to interleave the caller with the callee, either when the callee is blocked or when another worker becomes available for executing the caller in parallel with the callee.

6 Actually, we permit procedure calls that never call StackThreads/MP primitives, such as procedures in standard C libraries, to appear in argument positions.


Figure 5: A stack before and after a call. Curved arrows represent links between frames (f → g when g is the parent of f). The dotted box and the shaded boxes form the logical stack of the worker. The dotted box is the currently running frame. Whether the call is a normal call or an asynchronous call, a new frame is allocated on the physical stack top and linked to the frame which was the logical stack top.

ASYNC_CALL is a C macro that simply surrounds e by a pair of procedure calls __st_fork_block_begin and __st_fork_block_end (Figure 4), which will be removed by the postprocessor.

Both a normal procedure call and an asynchronous call by ASYNC_CALL allocate a new stack frame from the physical stack top and link the new frame to the logical stack top (Figure 5). This happens as a consequence of a C procedure call; we do not explicitly allocate a separate heap-allocated frame for procedures.



- suspend(c, n) detaches some frames from the logical stack top; it unwinds frames from the logical stack top towards the bottom until it encounters n fork points. It fills the structure pointed to by c with information about the unwound frames, so that a worker can later continue execution of these frames. After unwinding, control reaches the nth fork point. In other words, the execution continues as if the unwound frames had finished normally (Figure 6). Setting n to one effectively blocks the currently running thread. The reason for unwinding more than one fork point will become clear when we discuss thread migration in Section 4. The machinery of unwinding is as follows. To unwind a frame, we jump to the pure epilogue sequence of the frame. As mentioned in Section 3.3, the pure epilogue restores callee-save registers and FP, but does not change SP. Since SP does not change, we effectively retain unwound frames at their original locations. This is in contrast to our previous work [27], in which we copy unwound frames out of the physical stack.



restart(c), where c is a pointer to a structure filled by a call to suspend, concatenates the chain of frames represented by c with the worker's logical stack, and continues execution of c (Figure 7). More precisely, assume c represents frames c1, …, cn, where ci+1 is the parent of ci (i = 1, …, n−1), and frame f calls restart(c). As a result of the call, c1 becomes the logical top of the stack and f becomes the parent of cn. The execution continues as if the call to suspend that unwound the ci's had just returned to c1. To link cn to f (i.e., to make f the parent of cn), we modify two slots in cn so that cn looks as if it were called from f. That is, we write the return address of the restart to the return address slot



Figure 6: A stack before and after suspend. The list of frames represents the logical stack (locations of frames in the figure do not reflect their addresses). A thick arrow represents a fork and a thin arrow a normal call. Frames are unwound from the logical stack top until the designated number of forks has been found. Execution continues as if the unwound frames had finished normally. In the figure, the original logical stack consisted of 8 frames (the frame of suspend and the frames marked 1, …, 7) before suspend. After the operation, the logical stack shrinks to 4 frames (the frames marked 4, …, 7) and the frame marked 4 becomes the current frame.



Figure 7: A stack before and after restart(c). Frames represented by c become the top frames of the logical stack. The frame that called suspend becomes the new current frame. The bottom frame of c is linked to the frame that called restart(c). A slashed box indicates an invalid frame.


typedef struct join_counter {
  /* number of unfinished threads */
  int n;
  /* the context of the waiting thread */
  context *waiting;
} *jc_t;

void init_join_counter(jc_t j, int n)
{
  j->n = n;
  j->waiting = 0;
}

void finish(jc_t j)
{
  /* decrement the counter */
  if (--j->n == 0 && j->waiting) {
    /* wake up the waiting thread */
    restart(j->waiting);
  }
}

void join(jc_t j)
{
  /* check if everybody has finished */
  if (j->n > 0) {
    /* if not, sleep on the counter */
    context c[1];
    j->waiting = c;
    suspend(c, 1);
  }
}

void example()
{
  jc_t j = (jc_t)malloc(sizeof(*j));
  /* say I will wait for two tasks */
  init_join_counter(j, 2);
  /* create a new thread */
  create_work(j);
  /* create a new thread */
  create_work(j);
  /* wait for completion */
  join(j);
}

Figure 8: An example of a synchronization primitive on top of the low-level APIs of StackThreads/MP. init_join_counter(j, n) declares that j will synchronize n threads. finish(j) declares the completion of a thread, and join(j) waits for the completion of the threads. They assume that only one thread waits on a counter. Code for mutual exclusion is omitted. create_work(j) is a user-defined function that forks a thread and calls finish(j) when the work is finished. A join simply checks whether the counter is already zero; if it is not, it suspends itself. A finish first decrements the counter; if the counter becomes zero and a context is waiting, it restarts the context. In general, a mechanism that postpones the scheduling of the resumed context may be necessary.

of cn, and write f to the slot where cn saves the pointer to its parent frame. When cn returns or is unwound, it loads FP with f and jumps to the address just after the point where f called restart. With these tricks, we might expect that f successfully continues its execution after cn returns or is unwound, but unfortunately this is still not the case. When cn returns or is unwound, the values loaded into callee-save registers are the values that were saved when cn was originally called. On the other hand, f expects that callee-save registers hold the values at the point where it called restart. We say a frame like f in this situation is invalid, in the sense that control will return to f with invalid contents in callee-save registers. We handle this problem simply by saving all callee-save registers at the point where restart is called and by restoring them when control returns to an invalid frame.

It is straightforward to build various synchronization patterns on top of the primitives just described. For example, Figure 8 shows pseudo code for a simple `join counter' in which a thread can wait for the completion of a given number of tasks. It assumes that only one thread waits on a counter and that the number of tasks to be synchronized is known at its creation time. For the sake of simplicity, the figure omits code for mutual exclusion; each procedure is assumed to operate atomically on counter j. The restart in finish, which is executed when join was called before the last finish, restarts the waiting context immediately. In the actual implementation, it is often better to postpone scheduling the waiting context; Section 4 shows such an alternative policy.

[Figure 9 panels: (a) suspend(c, n−1) — the n−1 threads above t are suspended; (b) suspend(…, 1) — t suspends itself; (c) restart(c) — the suspended threads are pushed back on the stack, while t is scheduled by the requesting worker.]
Figure 9: Thread migration between workers. The requested worker first suspends all threads above the thread that was selected to be migrated (thread t in the figure). When control reaches t, it suspends itself. The parent of t then restarts the threads above t. The requesting worker restarts t.

4 Thread Migration between Workers

4.1 Basics

The previous section made it clear that we can multiplex a single worker among multiple fine-grain threads on top of the core primitives. More interestingly, we can implement thread migration between workers on top of these primitives. By thread migration, we specifically mean that a thread created by a worker can be executed by another worker, even when the thread has not been blocked. The basic idea, shown in Figure 9, is quite simple. Suppose thread t is currently the nth top-most thread in worker A's logical stack and we want to migrate t to an idle worker B. Worker A first unwinds all frames above t in its logical stack by suspend(…, n − 1) ((a) in the figure). Then t suspends itself by suspend(…, 1), reaching the point where t was forked ((b) in the figure). Finally, A notifies B of the context of t (via shared memory) and restarts the frames just unwound ((c) in the figure). Worker B picks up the context of t and restarts it. To summarize, worker A pulls t out of its logical stack. This mechanism does not assume any policy as to which threads should be selected for migration. This mechanism assumes that each worker periodically polls for requests from other workers. An efficient scheme for inserting pollings has been investigated by Feeley [9] and is beyond the scope of this paper. In the experiments in Section 8, we manually insert pollings according to his method.

Figure 10 shows a simplified pseudo code for the thread migration protocol between workers. Given a procedure call expression e, ST_THREAD_CREATE(e) creates a new thread that evaluates the body of e. It first calls e using ASYNC_CALL(e). There are three possible situations in which e returns to its caller (p):

1. e normally finishes or blocks, in which case p just continues.

2. A steal request has been picked up and p was selected for migration, in which case p suspends itself, reaching its caller.

3. A steal request has been picked up, e was selected for migration and suspended itself, reaching the current frame (p). In this case, p restarts the frames that have just been unwound.

The actual ST_THREAD_CREATE is optimized for the first case (normal finish), which is expected to be

/* create a thread that evaluates procedure call e */
#define ST_THREAD_CREATE(e)
  ASYNC_CALL(e);              /* asynchronously call e */
  while (1) {
    if (why_am_I_here == YOU_WILL_BE_STOLEN) {
      /* this thread has been selected for migration.
         suspend itself and reach the parent */
      context c[1];
      /* respond to the requester */
      steal_request->reply = c;
      /* the parent will perform the rest of the protocol */
      why_am_I_here = YOUR_CHILD_WILL_BE_STOLEN;
      suspend(c, 1);
      /* now this thread is scheduled by the requester */
      break;
    } else if (why_am_I_here == YOUR_CHILD_WILL_BE_STOLEN) {
      /* the child thread has been selected for migration */
      why_am_I_here = ANYTHING_ELSE;
      /* push the unwound frames on top of the stack */
      ASYNC_CALL(restart(ctxts_above_victim));
      continue;
    } else {
      /* the child normally blocked or finished */
      break;
    } /* end of if .. */
  } /* end of while (1) */

/* request another worker to give a task */
void steal_thread()
{
  struct steal_request req[1];
  req->reply = 0;
  steal_request = req;        /* send the request */
  while (!req->reply) ;       /* wait for the reply */
  /* schedule the replied context */
  ST_THREAD_CREATE(restart(req->reply));
}

/* give a task to a requesting worker */
void check_steal_request()
{
  if (steal_request) {
    /* a request is there; unwind to the frame to be migrated */
    why_am_I_here = YOU_WILL_BE_STOLEN;
    suspend(ctxts_above_victim, select_thread());
  }
}

Figure 10: Simplified pseudo code for thread creation and the task migration protocol. Two workers send and receive a request through a shared request port, steal_request, which exists for each worker in the real implementation. why_am_I_here and pickup_context are thread-specific variables. ST_THREAD_CREATE(e) first makes an asynchronous call to e. After e returns, the caller either continues, suspends itself as a means to migrate to another worker, or restarts the threads just unwound in order to migrate the thread just above the caller. A requester writes to the request port and waits for the reply (steal_thread). The requested worker picks up the request, selects a thread to be migrated, and transfers control to that thread (check_steal_request). The code is simplified in many ways. It does not show the code for deadlock avoidance, rejecting a request (in case the requested worker does not have a thread to migrate), canceling a request (due to timeout), or mutual exclusion between multiple requesters and between the requester and the requested worker (needed to safely cancel a request).


common. Thread migration procedures in the implementation are also more complicated for deadlock avoidance, rejecting/canceling a request, and mutual exclusion.

4.2 Lazy Task Creation

The thread migration just described does not assume any policy as to which threads should be selected when the requested worker has multiple threads. It is instructive to see that a good scheduling policy can be straightforwardly and efficiently implemented on top of it. This section shows LTC [17] built on top of the mechanism; its policy can be summarized as follows.



- Each worker keeps a lazy task queue, a doubly-ended queue that keeps the threads the worker is currently responsible for.

- When a new thread is created by a worker, the worker pushes the thread to the head of the lazy task queue.

- When a thread is finished or blocked, its parent is resumed (i.e., the thread at the head of the lazy task queue is removed).7

- When a worker receives a request from an idle worker, it selects the thread at the tail of its lazy task queue and gives the thread to the requester.

- When a worker resumes a blocked thread, the thread enters the tail of the worker's lazy task queue (i.e., the resumed thread is not scheduled immediately).

To implement LTC using StackThreads/MP, each worker keeps a doubly-ended queue of threads, called readyq, which holds contexts that are schedulable but have not been linked into the logical stack (Figure 11). The lazy task queue of a worker is represented by its logical stack and the readyq of the worker; we maintain the invariant that the concatenation of the logical stack and the readyq of a worker is equivalent to its lazy task queue. Specifically, when a logical stack becomes empty, the thread at the head of the readyq is removed and restarted (by restart). When a worker receives a request from an idle worker, it simply gives the context at the tail of the readyq when the readyq is not empty. Otherwise it detaches the thread at the bottom of the logical stack and gives it to the requester. Figure 12 shows the pseudo code that implements LTC.

5 Stack Management

5.1 Overview

This section details the stack management and proves its correctness. The goal of the presented stack management is to maintain the two invariants stated in Section 3.2. It keeps SP above any frame that is still used and allocates a new frame above SP (throughout this section, we use the word `above' (`below') to mean the direction in which a stack grows (shrinks)). It also maintains the condition that whenever the currently executing frame is not physically on the stack top, the arguments region of the physically top frame is extended. We do not try to reuse free space sandwiched between two frames that are still used. An easily expected drawback is that, in the worst case, the space utilization of the stack may be arbitrarily low. On the other hand, by maintaining SP above any used frame, this scheme requires no modifications to compiler-generated prologue sequences of a procedure (thus guaranteeing a very fast allocation of a frame) and only very small modifications to epilogue sequences. A safer scheme in terms of space requirements would manage multiple physical stacks per worker, each of which is managed in a way similar to the one described below and can individually be reclaimed when it becomes empty. Nonetheless, we believe that the presented scheme is practical. Its correctness is not trivial, especially in the presence of thread migrations; it thus deserves a detailed study, which will give us the basis for more elaborate management schemes.

7 The original paper [17] describes an alternative policy upon blocking, which resumes the thread at the bottom. It is also easy to implement this policy.



Figure 11: Data structure for Lazy Task Creation. A lazy task queue is represented by the concatenation of a worker's logical stack and a doubly-ended queue readyq (the long dotted curve in the figure). readyq keeps threads that are schedulable but are not linked to the logical stack. A thread enters its tail when it is resumed from blocking.


/* nothing special in fork */
#define LTC_fork(e) ST_THREAD_CREATE(e)

/* nothing special in suspend */
void LTC_suspend(context *c) { suspend(c, 1); }

/* a resumed context enters the ready queue */
void LTC_resume(context *c) { enq_to_tail(readyq, c); }

/* the scheduler loop at the bottom of the logical stack */
void LTC_scheduler_loop()
{
  while (!done) {
    if (!empty(readyq)) {
      /* there are tasks in readyq; schedule its head */
      restart(deq_from_head(readyq));
    } else {
      /* this worker is idle; try to steal a task */
      steal_thread();
    }
  }
}

/* give a task to a requesting worker */
void LTC_check_steal_request()
{
  if (steal_request) {
    if (!empty(readyq)) {
      /* give the task at the tail of readyq */
      steal_request->reply = deq_from_tail(readyq);
    } else {
      /* give the task at the bottom of the stack */
      why_am_I_here = YOU_WILL_BE_STOLEN;
      suspend(pickup_context, n_threads_in_stack - 1);
    }
  }
}

Figure 12: Pseudo code implementing Lazy Task Creation. LTC_fork, LTC_suspend, and LTC_resume are called when a thread is created, suspended, and resumed, respectively. LTC_fork and LTC_suspend are straightforward. LTC_resume enqueues the resumed thread at the tail of readyq. At the bottom of the logical stack is a scheduler loop (LTC_scheduler_loop) that dequeues a thread from the head of readyq, if there is any. When receiving a task steal request, the worker gives the thread at the tail of readyq, if there is any. Otherwise it gives the thread at the bottom of the logical stack (except for the scheduler).


The basic management scheme is as follows. Aside from its logical stack and SP, each worker maintains a set of frames, called the exported set.8 Maintaining an exported set, a worker prevents its SP from moving below any frame in the exported set. The return sequence of a procedure checks whether the finishing frame is above every frame in the worker's exported set. If this is the case, it resets SP to just below the finishing frame, just as the original return sequence generated by the sequential compiler does. Otherwise, SP remains unchanged and the frame is marked as `finished' for future reclamation.

5.2 A Formal Model of the Stack Management

We model the execution of a worker as a sequence of call (a procedure call, either synchronous or asynchronous), return (a return from a procedure), suspend_n (a thread suspension), restart_~c (a thread resumption), and two other operations discussed shortly. From the viewpoint of stack management, the effects of these operations are summarized as follows. A call allocates a new frame on the physical stack top, a return declares that the logically top frame is no longer used, a suspend_n detaches the top n frames from the logical stack, and a restart_~c appends ~c to the logical stack. For the purpose of this section, it suffices to consider a single worker; we model the other workers as an activity that may use and finish any frame that does not belong to the logical stack of the worker in question. For this purpose, we introduce an event remote_finish_f, in which another worker finishes a frame f, where f is a frame in the physical stack of the worker in question. Finally, each worker periodically performs a shrink operation, which shrinks the physical stack to an appropriate point. Ideally, a stack should shrink as soon as the physically top frame is finished, but this is in general impractical because the top frame may be finished by another worker. Thus we assume that each worker occasionally checks which frames have been finished and shrinks its stack on its own initiative. Having a separate shrink operation also makes return sequences simpler.

Figure 13 formalizes the stack management scheme. A frame in the worker's physical stack is denoted by a non-negative natural number, increasing towards the physical stack top (Figure 14). Specifically, we write n to denote the nth bottom-most frame in the worker's physical stack (counting the bottom-most frame as 0). We refer to a frame in another worker's physical stack by a negative number for notational convenience. That is, we write f > g either when f is in the worker's physical stack but g is not, or when both are in the worker's physical stack and f is above g (when neither is in the worker's physical stack, it does not matter whether f > g holds or not). In particular, f ≥ 0 means f is in the worker's physical stack. We call such a frame a local frame. A chain of stack frames is represented by a list of frames. We use an ML-like notation for lists and their operations (see the caption of Figure 13 for details). A worker's state is a five-tuple (~s, t, E, R, X), where ~s represents the list of frames in its logical stack, t the stack pointer (SP), and E its exported set. In addition, we introduce two auxiliary sets, the retired set (R) and the extended set (X). Intuitively, a frame is in a worker's retired set when it has been exported by the worker and has been finished by some worker, but the owner has not yet `observed' that it has finished; upon a shrink operation, the worker removes frames in its retired set from its exported set, calculates the physically top-most frame in the new exported set, and resets its SP accordingly. A frame is in a worker's extended set when its arguments region is extended. These sets do not exist at runtime and are introduced only for stating and proving correctness. The operations involved in each transition are summarized below.

  

- call just pushes a new frame above SP. SP is incremented by one.

- suspend_n detaches the top n frames from the logical stack, exporting all local frames detached from the stack. The physically top frame enters the extended set.

- return first checks whether the finishing frame is (strictly) above the maximum frame in the exported set. If the condition is true, SP points to just below the finishing frame after the return, effectively freeing the finishing frame and all the frames above SP. Otherwise, SP remains unchanged and the finishing frame enters the retired set. A subtle case is when the finishing frame is the very maximum frame in the exported set. In this case, we do not free the frame in the return sequence. This is necessary to maintain the second invariant, as will be illustrated in Section 5.3.

8 The name `exported set' came from an analogous term in the distributed GC community. In a distributed GC, an object is exported when a reference to the object is passed to another processor, hence the owner no longer has complete control over its reclamation. A frame in an exported set is similar to an exported object in this sense.


call(~s, t, E, R, X) = ((t+1) : ~s, t+1, E, R, X)

suspend_n(~u @ ~r, t, E, R, X) = (~r, t, E + {u_i | u_i > 0}, R, X + {t})
    where ~u = (u_1, …, u_n)

return(f_1 : ~r, t, E, R, X) =
    (~r, f_1 − 1, E, R, X − {x | x ≥ f_1})    if f_1 > max E
    (~r, t, E, R + {f_1}, X)                  if f_1 ≤ max E

restart_~c(f_1 : ~r, t, E, R, X) =
    (~c @ (f_1 : ~r), t, E + {f_1}, R, X + {t})    if f_1 > c_n and f_1 ≥ 0
    (~c @ (f_1 : ~r), t, E, R, X + {t})            otherwise
    where we assume ~c = (c_1, …, c_n) and {c_i | c_i > 0} ⊆ E

shrink(f_1 : ~r, t, E, R, X) =
    (f_1 : ~r, f_1, E', R − {m}, X)                     if m ∈ R and f_1 > max E'
    (f_1 : ~r, max E', E', R − {m}, X + {max E'})       if m ∈ R and f_1 ≤ max E'
    (f_1 : ~r, t, E, R, X)                              if m ∉ R
    where m = max E and E' = E − {m}

remote_finish_f(~s, t, E, R, X) = (~s, t, E, R + {f}, X)    where we assume f ∉ ~s

S → S'  iff  S' = call S; S' = suspend_n S (for some n); S' = restart_~c S (for some ~c);
S' = return S; S' = shrink S; or S' = remote_finish_f S.

Figure 13: A formal description of the stack management. The nth (physically) bottom-most frame in the worker's physical stack is denoted by the natural number n (starting from zero). A frame in another worker's physical stack is denoted by a negative number for notational convenience. A chain of frames is represented by a list. We use ML-like notation for operations on lists (i.e., (a : ~b) denotes the cons of a and ~b, and (~a @ ~b) the concatenation of two lists ~a and ~b). When A is a set, we write max A to denote the maximum element of A. If A is empty, we define max A to be zero. At several places, we abuse a list in a context where a set is expected (as in (~s + E) or (max ~s)); a list in such contexts is interpreted as the set consisting of the elements of the list. call pushes a new frame, which becomes the stack top. suspend_n detaches the top n frames from the logical stack, exporting any local frame encountered during unwinding. It also extends the arguments region of the physically top frame. return frees a frame if the finishing frame is above all frames in the exported set. restart_~c concatenates ~c with the current logical stack. If f_1 > c_n, f_1 is included in the exported set. It also extends the arguments region of the physically top frame. remote_finish_f models another worker's activity that finishes f. shrink first checks whether the maximum frame in the exported set has been finished. If it has, the worker removes that frame from the exported set and adjusts the stack pointer according to the current frame and the new exported set.


Figure 14: Representation of a worker's stack, with SP t = 7. The dotted box represents the current frame. The dotted box and the shaded boxes form the logical stack of the worker, which is denoted as the list (6, 4, 5, 1, 0) in the formal model.



- restart_~c concatenates ~c with the current logical stack. A subtle point is that it compares the addresses of the two frames, c_n and f_1, that join these two lists. If f_1 > c_n, f_1 enters the exported set. The reason for this is made clear in Section 5.3.

- shrink first checks whether the maximum element of the exported set, m, has been finished. If it has, we remove m from the exported set and reset SP to the larger of the current frame and the new maximum frame in the exported set. In practice, the shrink operation repeats this process until an unfinished frame becomes the maximum of the exported set.

- remote_finish_f just inserts the finishing frame f into the retired set.

Note that the only operations performed on an exported set are (1) adding an element to it, and (2) reading or removing its maximum element. In particular, we never perform a general search that queries whether a given element is in an exported set. This makes it possible to implement an exported set as a simple heap.9 A heap is a binary tree in which the value of a node is never smaller than those of its children. Inserting an element into, or deleting the maximum element from, a heap takes O(log n) time, where n is the number of elements in the heap. More importantly, finding the maximum element of a heap takes O(1) time, since it is the root node. We do not explicitly manage a retired set or an extended set at runtime; they are introduced to state and prove the correctness of the stack management scheme. At runtime, adding a frame to the retired set is done by marking the frame as `finished'. More specifically, we simply write a zero to the slot in which the return address of the frame was stored. The choice of this slot is somewhat arbitrary; any place that normally contains a non-zero value suffices. Adding the physically top frame to the extended set is done by extending the stack by an amount determined at postprocessing time (see Section 3.2). We never add a frame that is not the physical top of the stack to the extended set.

Let us now look at the operations involved in return more carefully, to examine what actually occurs at runtime. return first tests the condition f_1 > max E, which is true when the current frame (pointed to by FP) is above max E. On machines where the stack grows toward lower addresses, this test is performed by: SP < FP < max E, where < denotes the unsigned integer comparison on the machine. Note that the comparison FP < max E on the machine returns true not only when f_1 > max E actually holds, but also when FP points to a frame of a different physical stack whose address happens to be less than that of the physical stack in question. Therefore we need the other inequality, SP < FP, to guarantee that f_1 is local. This test typically takes one load (to read max E), two compares, and two conditional branches. When the condition turns out to be true, the rest of the operation does whatever the return sequence generated by the sequential compiler does. Otherwise, the return sequence zeroes the return address slot and retains SP. The assembly postprocessor's job is quite simple; it merely inserts the instructions that check the above condition, branch to the original epilogue sequence if the condition is true, and zero the return address slot otherwise.

5.3 Subtle Cases

The key invariant maintained by the algorithm is that if the current frame is above all frames in the exported set of a worker, it is above all other frames (exported or not) in the physical stack of the worker; thus the above return and shrink transitions are safe. With this invariant in mind, the motivation for exporting all unwound frames in suspend will be obvious: it prevents these frames from being reclaimed before being finished. There are a couple of subtle points, however.



- restart_~c exports the current frame f_1 when f_1 > c_n, where c_n is the bottom-most frame of ~c. Otherwise, a shrink operation may invalidate the first invariant. Consider the following scenario:

struct context f_ctxt[1];
main() {
  ASYNC_CALL(f());
  g();
}
f() {

9 This heap should not be confused with the same word in the memory management community.



  suspend(f_ctxt, 1);
  shrink();
}
g() {
  restart(f_ctxt);
}

main first creates a thread that evaluates f, which immediately blocks. main then calls g, which restarts f. At this point, the frame of g is above the frame of f; therefore g is exported. After being restarted, f performs a shrink operation. Had the frame of g not been exported, this shrink operation would reset SP to the frame of f, wrongly discarding the frame of g.

Figure 15: Our second invariant would be violated if return reclaimed a frame that is on the stack top and in the exported set. main forks f, which forks g. Then g blocks itself and f, resuming main. Finally, main resumes g (the figure on the left). When g is finished, its frame is on the stack top and exported. If g reclaimed this frame at this point, control would return to main while SP points to the top of f's frame, which has not been extended for the argument stores performed by main (the figure on the right). This violates the second invariant mentioned in Section 3.1.



- return does not reclaim the finishing frame when the finishing frame equals the maximum frame in the exported set of the worker. At first glance, it seems safe to reclaim the frame at this point because, provided the above invariant is maintained, all local frames that are still alive are below this frame. Reclaiming such a frame, however, may break our second invariant (Section 3.1). Consider the following scenario:

struct context g_ctxt[1];
main() {
  ASYNC_CALL(f());
  restart(g_ctxt);
}
f() {
  ASYNC_CALL(g());
}
g() {
  suspend(g_ctxt, 2);
}

In this program, main creates a thread that evaluates f, which in turn creates a thread that evaluates g. g blocks itself and f, resuming main, and main immediately restarts g. Assuming main is the beginning of the program, the only exported frames at this point are the frames of f and g, and the frame of g is the maximum of the exported set (Figure 15). Then g finishes. Were its frame reclaimed at this point, SP would point to the top of the frame of f, and control would then return to main. Since the arguments region of f has not been extended, the second invariant is invalidated. This problem could be fixed by extending the arguments region in the epilogue sequence of g, but then the resulting epilogue sequence would be slower, and it would complicate the postprocessor's job.

5.4 Correctness

This section shows that the invariants introduced in Section 3 are maintained throughout a program. We write S → S' when S' is the result of applying one of the six operations above to S.

The following lemma states that if the stack is managed in such a way that property (3) below is satisfied, then the exported set can be used to identify the maximum frame that is still used.

Lemma 1 Let ~s be (f_1, …, f_m) and assume a state S = (~s, t, E, R, X) satisfies the following property: f_{i−1} < f_i ⇒ f_i ∈ E (2 ≤ i ≤ m) — (3). Then,

1. f_1 > max E ⇒ f_1 = max ~s (hence max(~s + E) = f_1).

2. f_1 ≤ max E ⇒ max ~s ≤ max E (hence max(~s + E) = max E).

Proof:

1. By contradiction. Suppose max ~s = f_i (i ≠ 1). Since f_{i−1} < f_i, we have f_i ∈ E by (3), hence max E ≥ f_i ≥ f_1. This contradicts the assumption f_1 > max E.

2. Suppose max ~s = f_i (i ≥ 1). If i > 1, we have max E ≥ f_i = max ~s by (3). Otherwise (i = 1), max E ≥ f_1 = max ~s. In either case, max ~s ≤ max E holds.

Now we state that the above property (3) and our first invariant are always maintained. We introduce an auxiliary proposition for induction.

Lemma 2 Suppose S = (~s, t, E, R, X), ~s = (f_1, …, f_m), and S → S'. If S satisfies the following properties, so does S':

1. f_{i−1} < f_i ⇒ f_i ∈ E (2 ≤ i ≤ m).

2. f_{i−1} > f_i + 1, f_{i−1} > 0, and f_{i−1} ∉ E ⇒ f_{i−1} − 1 ∈ E (2 ≤ i ≤ m).

3. t = max(~s + E).

The last property states that (1) t ≥ max(~s + E − R), hence SP never points below any unfinished frame, and that (2) if max E ∉ R, then max(~s + (E − R)) = max(~s + E) = t. Therefore the stack management is reasonably prompt, in that we can always achieve t = max(~s + E − R) by repeating shrink operations until we have max E ∉ R.

Proof:

1. Verify that if S satisfies property 3, S' satisfies property 1.

2.

- S' = call S: Suppose a pair f'_{i−1} and f'_i satisfies the condition part. If i > 2, any such pair appears in S as well, hence satisfies the consequence. Thus we focus on the pair f'_1 and f'_2 below. If f_1 > max E, Lemma 1 implies that f_1 = max ~s, hence f_1 = max(~s + E) = t. Therefore f'_1 = t + 1 = f_1 + 1 = f'_2 + 1, hence the pair does not satisfy the condition part. Otherwise (i.e., f_1 ≤ max E), Lemma 1 implies that max(~s + E) = max E. Therefore f'_1 − 1 = t = max(~s + E) = max E ∈ E, satisfying the consequence part.

- S' = suspend_n S, S' = return S, and S' = remote_finish_f S: Since ~s ⊇ ~s' and E ⊆ E', these operations preserve this property.

- S' = restart_~c S: Note that all local frames added to ~s' are already included in E. Therefore this operation does not create any pair of frames that satisfies the condition part.

- S' = shrink S: Since ~s' = ~s and E' retains all elements in ~s, this operation preserves this property.

3.

- S' = call S: max(~s' + E') = max((t+1) : ~s + E) = t + 1 = t'.

- S' = suspend_n S or S' = restart_~c S: Since ~s + E and ~s' + E' differ only in negative frames, we have max(~s' + E') = max(~s + E). Combining this with t = t', it follows that these operations preserve the property.

- S' = return S: Consider the two cases separately. (a) f_1 ≤ max E: We have max(~s + E) = max((f_1 : ~r) + E) = max(~r + E) = max(~s' + E'). Combining this with t = t', it follows that this case preserves the property. (b) f_1 > max E: Lemma 1 states that f_1 = max(f_1 : ~r). Therefore we have f_1 > max E and f_1 > max ~r, that is, f_1 − 1 ≥ max E and f_1 − 1 ≥ max ~r. To sum up, t' = f_1 − 1 ≥ max(~s' + E'). Below we show that t' ≤ max(~s' + E'), to conclude that t' = max(~s' + E'). Let us distinguish the following two cases: (b1) f_1 = f_2 + 1 and (b2) f_1 > f_2 + 1. For (b1), since f_2 ∈ ~s', max(~s' + E') ≥ f_2 = f_1 − 1 = t'. For (b2), since f_1 > f_2 + 1, f_1 > 0, and f_1 ∉ E, it follows from property 2 that f_1 − 1 ∈ E. Hence t' = f_1 − 1 ≤ max E ≤ max(~s' + E').

- S' = remote_finish_f S: Since t = t' and ~s + E = ~s' + E', this operation preserves the property.

- S' = shrink S: When m ∈ R and f_1 > max E', we have f_1 = max ~s' from Lemma 1. Thus t' = f_1 = max(~s' + E'). When m ∈ R and f_1 ≤ max E', we have max ~s' ≤ max E' from Lemma 1. Thus t' = max E' = max(~s' + E'). When m ∉ R, the property is clearly preserved.

 

S

Finally, the last proposition in the following lemma states that the second invariant is also preserved in a one-step transition.

Lemma 3 Suppose S = (~s, t, E, R, X), ~s = (f_1, ..., f_m), and S → S'. If S satisfies the properties listed in Lemma 2 as well as the following properties, so does S':

1. ∃e ∈ E (f_i ≤ e < f_{i-1}, f_{i-1} ∉ E) ⇒ f_{i-1} - 1 ∈ X.

2. t ≤ max E ⇒ t ∈ X.

Note that from the second proposition, we particularly have f_1 < t ⇒ t ∈ X, because f_1 < t implies f_1 ≤ max E (the contraposition of f_1 > max E ⇒ f_1 = t), hence t = max E.

Proof: 1.

- S' = call S: The only non-trivial part is the pair f'_1 and f'_2. For other pairs of f'_{i-1} and f'_i (i > 2), any such pair appears in S as well, hence if it does satisfy the condition part of the property, it satisfies the consequence as well, because of the assumption that S satisfies the property. Thus we only consider the pair f'_1 and f'_2. If f_1 > max E, we have f'_2 > max E', thus the pair does not satisfy the condition part. Otherwise (i.e., f_1 ≤ max E), property 2 implies that t ∈ X. Therefore we have f'_1 - 1 = t ∈ X = X', satisfying the consequence.

- S' = suspend S: Suppose, for some i (≥ 1) and some e ∈ E', f'_i ≤ e < f'_{i-1} and f'_{i-1} ∉ E', that is, f_{i+n} ≤ e < f_{i+n-1} and f_{i+n-1} ∉ E. If e ∈ E, we immediately have f_{i+n-1} - 1 ∈ X (i.e., f'_{i-1} - 1 ∈ X'), because S satisfies property 1. So let us assume e ∉ E, in which case we have e = f_j for some j (1 ≤ j ≤ n). Since e = f_j ≠ f_{i+n}, we have f_{i+n} < e < f_{i+n-1}, that is, f_{i+n} + 1 < f_{i+n-1}. Recall that we have f'_{i-1} ∉ E', hence f_{i+n-1} ∉ E. Therefore, from the second proposition in Lemma 2, we have f_{i+n-1} - 1 ∈ E. In particular, we have ∃e' ∈ E (f_{i+n} ≤ e' < f_{i+n-1}, f_{i+n-1} ∉ E), implying that f'_{i-1} - 1 = f_{i+n-1} - 1 ∈ X = X'.

- S' = restart S: Suppose, for some i (≥ 1) and some e ∈ E', f'_i ≤ e < f'_{i-1} and f'_{i-1} ∉ E'. We consider the following three cases separately: (a) i ≤ n + 1, (b) i > n + 1 and e ∈ E, and (c) i > n + 1 and e ∉ E. (a) i ≤ n + 1: We are assuming such f'_{i-1} is negative or an element of E', thus it does not satisfy the condition part. (b) i > n + 1 and e ∈ E: We have f_{i-n} ≤ e < f_{i-n-1}. Since S satisfies property 1, f_{i-n-1} - 1 ∈ X, that is, f'_{i-1} - 1 ∈ X'. (c) i > n + 1 and e ∉ E: We have e = f_1, because E' ⊆ E + {f_1}. Since f_1 ≠ f_{i-n}, we have f_{i-n} < e < f_{i-n-1}. The rest of the discussion is exactly the same as the above suspend case.

- S' = return S, shrink S, or remote_finish_f S: Since ~s' ⊆ ~s and E' = E, they preserve this property.

2.

- S' = call S: t' = f'_1 = t + 1 > max E = max E', hence S' does not satisfy the condition part.

- S' = suspend S, or S' = restart S: We always have t' ∈ X'.

- S' = return S: We consider the two cases separately. (a) f_1 > max E: Suppose t' ≤ max E'. In this case we have f_2 ≤ max E < f_1 and f_1 ∉ E. Therefore, from property 1, we have t' = f_1 - 1 ∈ X = X'. (b) f_1 ≤ max E: Since t = t' and X = X', this case simply preserves the property.

- S' = shrink S: We consider the three cases separately. (a) max E ∈ R and f_1 > max E': t' = f_1 > max E' does not satisfy the condition part. (b) max E ∈ R and f_1 ≤ max E': We have t' = max E' ∈ X', satisfying the consequence part. (c) max E ∉ R: Since t = t' and X = X', this case simply preserves the property.

- S' = remote_finish_f S: Since t = t' and X = X', this case preserves the property.

The following theorem is an immediate consequence of Lemmas 2 and 3. We write the initial state of a worker, ((0), 0, ∅, ∅, ∅), as S_0.

Theorem 4 Suppose S_0 →* (~s, t, E, R, X) and write ~s = (f_1, ..., f_m). Then,

1. t ≥ max(~s + (E - R)). When max E ∉ R, the equality holds.

2. f_1 < t ⇒ t ∈ X.

6 Correctness of Generated Code

6.1 Problem Description

The previous section has shown that our stack management is correct in the sense that SP always points to free space. In other words, the compiler-generated code never reuses space that is still in use. In general, however, this does not necessarily mean that the code generated by a sequential compiler always works as expected; our execution scheme certainly departs from the LIFO execution that normally occurs in sequential programs. For example, as we have seen in Section 3.2, StackThreads/MP programs may result in a situation in which SP and FP point to different frames, a situation which never occurs in sequential programs. The fundamental question, then, is: can we still believe that our execution scheme never `surprises' sequential compilers in any significant way?

A close look at the procedure calling mechanism answers this question. When a sequential compiler generates instruction sequences for an unknown procedure call, the compiler assumes certain conditions hold between machine states before and after the procedure call. This is generally called a calling standard (or calling convention), and the compiler-generated code works provided that all unknown procedures follow this convention. More specifically, typical calling standards state that:

1. Preserve Callee-save Registers: The callee-save registers (specified in the calling standard) hold the same value before and after a procedure call. The callee-save registers specifically include FP.

2. Preserve Caller's Frame: The local and temporary area in the caller's stack frame holds the same value before and after a procedure call.

3. Preserve Stack Pointer: SP also holds the same value before and after a procedure call.10

Notice that the compiler does not quite assume a "sequential execution," by which we mean that the control transfers from one procedure to another in the LIFO order. It only assumes that the above conditions hold at every procedure call; it does not care what actually happens during the called procedure. As a matter of fact, the LIFO execution order is routinely violated by operations that involve non-standard control transfers, such as setjmp/longjmp, conventional thread libraries, and exception handling. Yet, these operations are safe, simply because they do not violate any of the above assumptions. Following this principle, checking whether the StackThreads/MP execution scheme is correct amounts to checking that it does not violate any of the above assumptions:

10 This is not true in all conventions; for example, a parameter passing convention on i386 architectures states that the callee is responsible for freeing received arguments, so the caller assumes SP after a procedure call holds the old SP + argument size (stacks grow towards lower addresses). This difference is not important for the following discussion. The only important point here is that the caller is allowed to assume that SP after a procedure call holds a value specified by the calling convention.


C program:

    void h() { f(g(a), b); }

=> its possible assembly code:

    1: h:
    2: store b,[SP+4]   # the second arg to f
    3: store a,[SP+0]   # the first arg to g
    4: call g           # call g
    5: store r,[SP+0]   # the first arg to f
    6: call f           # call f

Figure 16: A C program and its possible assembly code. The assembly assumes that SP is preserved across a procedure call.

- Callee-save registers are preserved across a procedure call by the mechanism described in Section 3.2.

- Stack frames are preserved across a procedure call by the stack management, which we have shown correct in Section 5.

- SP, however, is not preserved. The simplest example is the following: a procedure f calls g, which then calls suspend. Clearly, the values of SP before and after f calls g are different.

Therefore, unfortunately, the StackThreads/MP execution scheme is not perfectly safe with compilers that assume SP is preserved across a procedure call. In practice, compilers may exploit this assumption in two ways:

- In many calling conventions on recent RISC processors, such as Mips and Alpha, a procedure does not set up FP unless it has a variable-sized frame (i.e., unless it calls alloca). Such a procedure can substitute SP+constant for FP, because SP always has the same value whenever the procedure is running.

- Compilers may optimize argument-passing sequences for procedure calls. For example, consider the program in Figure 16 and its possible machine code, assuming a hypothetical convention in which the ith argument to a procedure is placed at address SP + 4(i - 1). The procedure h first calls g(a) and then f(g(a), b). The compiled code first writes b, which is the second argument to the second procedure call. The compiler assumed that the value of SP at line 6 equals the value of SP at line 1, hence b is still available at [SP+4]. This is not the case if, for example, g calls suspend and returns to line 5, leaving g's frame on the stack. If the compiler did not exploit this assumption, on the other hand, the generated sequence would first store b at some FP-relative position, call g, and then move b to SP+4.

6.2 Proposed Solution

In theory, this problem can only be fixed by telling the compiler that SP may not be preserved across a procedure call. We hereby propose a simple compiler change and discuss how difficult it is to apply this change to existing compilers. The proposed change is a compile-time option, which could be called something like:

-call-destroys-sp

This flag tells the compiler that every procedure call may destroy SP. It is clear that adding this flag solves our problem. Note that this flag effectively rules out non-leaf procedures that do not have a separate FP. The question, then, is whether this change is substantial from a compiler engineering point of view. Fortunately, C/C++ compilers support dynamic memory allocation from the stack, called alloca. alloca(n) extends the stack by n bytes (or more, to guarantee proper alignment) and returns the pointer to the allocated storage. No matter how it is implemented, the compiler must have builtin knowledge that alloca, unlike other regular procedure calls, does not preserve SP. Otherwise, the problem we are now facing would occur across alloca. The proposed compile-time option merely instructs the compiler to treat all function calls like alloca. We believe this is not a substantial change, as long as the compiler supports alloca or a similar dynamic memory allocation from the stack. GCC (version 2.7), for example, required less than 30 lines of additions to support this option, and the change affects only the Pentium port. Although we could not look into the sources of other compilers, compilers for RISC processors are unlikely to pay attention to optimizing argument-passing sequences, because most arguments are passed via (caller-save) registers. In such compilers, adding this flag in practice means that every non-leaf procedure must be treated as if it had a variable-sized frame, and therefore must have a separate FP.

7 Other Compiler Extensions for Portability and Safety

This section summarizes other minor features of GCC that we exploited, which we believe are easy to implement in any existing compiler.

Pass arguments via SP (desirable): When a procedure makes two or more procedure calls that are logically concurrent, the caller must write their arguments to different locations, or each callee must copy the received arguments to another location before they are overwritten by subsequent calls. However, since sequential compilers assume these procedure calls are performed one after another, they have a right to allocate the same storage for their arguments. Problems do not arise for arguments passed via registers. More interestingly, problems do not occur either for arguments passed via memory locations accessed via SP + compile-time constants. Two procedures that share a parent actually interleave only when the first procedure calls suspend. SP is kept above or equal to this frame until it finishes. Therefore, the region from which it accesses its parameters is not exposed at the stack top until it finishes, hence is not reused for subsequent procedure calls. Most calling conventions are designed as such. This is because arguments are written by the caller; therefore it is convenient to place them at a constant offset (determined by the convention) from the stack top, so that the callee can access them using a constant offset from its FP. Passing arguments in any other way requires passing an extra argument that points to the region in which they are stored. Among the four CPUs on which StackThreads/MP is currently running, the SPARC convention for passing structures is the only problematic case. On SPARC, when a procedure passes an argument of a struct type in C, it writes the structure at an FP-relative position and passes the pointer to it to the callee. A desirable change to the compiler is to assume that such a region may be destroyed across a procedure call, hence each procedure must copy incoming structure parameters as necessary. GCC requires less than 30 lines of changes to support this.
Disable Inlining (desirable): A procedure that is called by ASYNC_CALL(e) must not be inlined, because inlining would change the semantics of suspend within e (inlining sequential calls of course causes no problem). In practice, this can be achieved by disabling optimizations (or just inlining, if the compiler supports such an option), but the performance penalty would be significant. There are many ways to practically disable inlining of a call site (e.g., separately compile the definition of a procedure and its call sites), but we eventually wish to have a way that is guaranteed to work. Once an appropriate syntax is determined, implementing this option will be a trivial task. We currently use GCC's option that disables inlining altogether (-fno-inline-functions) and write an explicit inline directive where it is important.11

11 Experiments in Section 8 do not use explicit inline directives.

Have a TLS register (desirable): It is highly desirable that the calling standard specify a register that holds a pointer to a thread-local storage, and that a part of the storage be available for user programs. Many multithreaded programs and libraries will benefit from this.

8 Experimental Results

8.1 Overhead on Sequential Applications

This section assesses the overhead imposed by StackThreads/MP on sequential applications, using the SPEC integer 95 benchmarks. Since StackThreads/MP uses GCC without any preprocessing, the overheads imposed on sequential applications are usually small and predictable by the programmer. Yet, there are several sources of overhead:

Postprocessing: This is the overhead inherent in StackThreads/MP. When the postprocessor encounters an epilogue sequence of a procedure, it augments the epilogue by adding 4-7 instructions that check if the frame can be freed (Section 5.2), unless it can prove that these checks are unnecessary. We describe shortly the algorithm which determines whether they are necessary.

No Inlining: As we have described in Section 7, we must disable inlining altogether to guarantee that a procedure call within ST_THREAD_CREATE is not inlined. Although we hope that compilers will eventually support a way to disable inlining on a call-site basis, we are also interested in the penalty we have to pay today.

Other Code Generation Constraints: On SPARC, we disable register windows with the -mflat option. On Mips and Alpha, we force every procedure to have a separate FP with the -fno-omit-frame-pointers option. We also reserve a register to hold worker-local storage.

Thread Library: Thread libraries such as Pthread or Solaris threads redirect some standard library functions to their thread-safe versions. This may slow down programs that use these functions extensively. This overhead seems unreasonably large on some platforms.

We currently determine whether the epilogue of a procedure should be augmented by the following simple criteria:

- a leaf procedure is not augmented;

- the postprocessor records the set of unaugmented procedures, and a procedure is not augmented if it only calls procedures in the set;

- any other procedure is augmented.

In other words, a procedure f is not augmented if all procedures that f calls (either directly or indirectly) have already appeared in the current postprocessing. Under this condition, the control transfers between procedures in the strict LIFO order during an activation of f, thus when f terminates, its activation frame is physically on the stack top. In particular, f is augmented when it or one of its callees may call unknown procedures (procedures defined in a different compilation unit), including StackThreads/MP library procedures. Figures 17-20 show relative execution times in various settings. In all cases, we give the highest optimization option (-O4) to GCC. The settings are:

default, which simply compiles with GCC -O4;

flat (on SPARC), which compiles with GCC -O4, with register windows disabled;

FP (on Mips and Alpha), which compiles with GCC -O4, forcing every procedure to have FP;

{default, flat, FP}+thread, which links a thread library, with compilation flags being the same as default (on i386), flat (on SPARC), and FP (on Mips and Alpha), respectively;

st_inline, which enables postprocessing for StackThreads/MP, with inlining enabled;

Figure 17: Overhead for the SPEC int 95 benchmarks on SPARC (167MHz, 512KB L2 cache).

st, which enables postprocessing for StackThreads/MP, with inlining disabled.

Performance of `st' is what we can safely achieve with current GCC. Observations drawn from these graphs are:

- Performance differences between `st_inline' and `st' are small (less than 2.1% in all cases). That is, the penalty of disabling inlining is small, at least for the SPEC benchmarks.12

- On IRIX (Mips) and Digital UNIX (Alpha), the penalty of linking the thread library seems unreasonably large for two applications (perl and gcc). This results in large increases in the average overhead (11% on Mips and 5.5% on Alpha).

- On SPARC, disabling register windows incurs a large penalty (11%).

- The overheads due to forcing every procedure to have FP are not significant (6.9% on Alpha and 4.1% on Mips). When we give the option -fno-omit-frame-pointers, even leaf procedures have FP, which is unnecessary for StackThreads/MP to work. This overhead may be reduced somewhat if compilers are changed according to our proposal in Section 6, so that leaf procedures can freely omit FP.

To summarize, the overheads of postprocessing per se are 13% (SPARC), 8.6% (Pentium PRO), 1.0% (Mips), and 2.7% (Alpha). Other sources, most notably disabling register windows on SPARC and linking the thread library on IRIX and Digital UNIX, increase the total overhead observed by the programmer; the total overheads are 15% (SPARC), 9.5% (Pentium PRO), 18% (Mips), and 15% (Alpha).

8.2 Performance of Parallel Applications

We compared the performance of StackThreads/MP with Cilk version 5.1 [24]. Cilk is a parallel extension to C that supports fine-grain multithreading and implements multithreading with a preprocessor to C, in which the generated code explicitly manages heap-allocated frames. Table 2 summarizes the settings. We ported all benchmark programs that come with the Cilk distribution, except for two applications (cholesky and queens) that use Cilk's thread abortion function, which we have not implemented yet. The purpose of the comparison is to show that our implementation strategy achieves performance comparable to Cilk, which is known to be a good implementation of multithreading on multiprocessors. The experiments specifically do not intend to conclude which is the better implementation from a performance point of view. Clarifying the performance benefits and drawbacks of both approaches is certainly interesting, but such a study requires a much more careful setting to make the comparison fair. Porting Cilk applications to StackThreads/MP is straightforward; each Cilk procedure manages a synchronization counter that keeps track of the number of outstanding children. spawn in Cilk adds the

12 The penalty is likely to be large on C++ applications.


Figure 18: Overhead for the SPEC int 95 benchmarks on Pentium PRO (200MHz, 512KB L2 cache).

Figure 19: Overhead for the SPEC int 95 benchmarks on Mips R10000 (175MHz, 1MB L2 cache).

Figure 20: Overhead for the SPEC int 95 benchmarks on Alpha (400MHz, 4MB L2 cache).

29

Machine CPU Number of CPUs Memory

Ultra Enterprise 10000 (Star re) 250Mhz, 1MB L2 cache 64 8GB

Table 2: Settings for parallel application benchmark. Cilk runtime library as well as benchmarks are compiled without performance measurement and debugging facilities to maximize performance of Cilk (We de ned CILK CRITICAL PATH, CILK STATS, CILK TIMING, and CILK DEBUG to zero.) GCC compiles both Cilk preprocessor outputs and StackThreads/MP programs with -O3 optimization option, as was done in [10]. Serial Execution Time relative to C StackThreds/MP

Cilk


Figure 21: Uniprocessor overhead on SPARC 250MHz (1MB L2 cache) for parallel applications. Execution times relative to sequential C programs are shown. Except for fib, in which threads are extremely fine-grained, both achieve performance comparable to C.

counter and creates a thread. sync waits for the counter to become zero. Each Cilk procedure takes the synchronization counter as an additional parameter and decrements the counter when it returns. The Cilk runtime library as well as the benchmarks are compiled without performance measurement and debugging facilities, to maximize the performance of Cilk.13 GCC compiles both the Cilk preprocessor outputs and the StackThreads/MP programs with the -O3 optimization option, as was done in [10].

Figure 21 shows the overhead on a uniprocessor. Relative execution times of both StackThreads/MP and Cilk are shown, normalizing the execution times of sequential C programs to one. Except for fib, in which a thread is extremely fine-grained, both achieve performance comparable to C. The overheads imposed for fib are also similar in the two systems.

Figure 22 compares the execution times of StackThreads/MP and Cilk on an Ultra Enterprise 10000 (Starfire), using 1 to 50 processors. Overall performance is similar. knapsack and fib were the only cases where one is consistently faster than the other. Cilk tends to be faster on small numbers of processors, while StackThreads/MP tends to be faster on large numbers of processors (32 or 50).

9 Summary

We have described StackThreads/MP, a fine-grain thread library that is practically usable with unmodified GCC. It supports thread migration on shared-memory multiprocessors. A stack management scheme that requires only a modest amount of change to procedure epilogue sequences has been presented and proved correct. Desirable extensions to compilers, which we believe are modest, have been proposed, so that any standard-conforming compiler can use StackThreads/MP. Performance has been measured and

13 We defined CILK_CRITICAL_PATH, CILK_STATS, CILK_TIMING, and CILK_DEBUG to zero.



Figure 22: Elapsed time of StackThreads/MP relative to Cilk on various numbers of processors. Overall, performance is comparable. Neither was consistently better than the other.

compared with Cilk, another good implementation of fine-grain multithreading on shared-memory multiprocessors which, however, requires a fairly sophisticated preprocessor. Neither was clearly better than the other, and the performance difference was small in most cases. The StackThreads/MP library is available from http://www.yl.is.s.u-tokyo.ac.jp/sthreads.

Acknowledgements

This work is supported by the Information-technology Promotion Agency (IPA) in Japan. We are grateful to our colleagues working on our parallel programming language project, Yoshihiro Oyama and Toshio Endo.

References

[1] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, 1995.

[2] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, 1994.

[3] Luca Cardelli, James Donahue, Lucille Glassman, Mick Jordan, Bill Kalsow, and Greg Nelson. Modula-3 report (revised). Technical Report 52, Digital Systems Research Center, 1989.

[4] Andrew A. Chien, U. S. Reddy, J. Plevyak, and J. Dolby. ICC++: a C++ dialect for high performance parallel computing. In Proceedings of the Second International Symposium on Object Technologies for Advanced Software, 1996.

[5] David E. Culler, Anurag Sah, Klaus Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 166-175, 1991.

[6] Digital Equipment Corporation. DEC OSF/1 Calling Standard for AXP Systems, 1994.

[7] Marc Feeley. A message passing implementation of lazy task creation. In Robert H. Halstead, Jr. and Takayasu Ito, editors, Proceedings of the International Workshop on Parallel Symbolic Computing: Languages, Systems, and Applications, number 748 in Lecture Notes in Computer Science, pages 94-107. Springer-Verlag, 1993.

[8] Marc Feeley. An Efficient and General Implementation of Futures on Large Scale Shared-Memory Multiprocessors. PhD thesis, Brandeis University, 1993.

[9] Marc Feeley. Polling efficiently on stock hardware. In Proceedings of the 1993 ACM SIGPLAN Conference on Functional Programming and Computer Architecture, pages 179-187, 1993.

[10] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), 1998.

[11] Seth Copen Goldstein, Klaus Erik Schauser, and David Culler. Lazy threads: Implementing a fast parallel call. Journal of Parallel and Distributed Computing, 37(1):5-20, August 1996.

[12] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501-538, April 1985.

[13] Steve Kleiman, Devang Shah, and Bart Smaalders. Programming with Threads. Prentice Hall, 1996.

[14] Bil Lewis and Daniel J. Berg. Threads Primer. Prentice Hall, 1996.

[15] Bertrand Meyer. Eiffel: the Language. Object-Oriented Series. Prentice Hall, 1992.

[16] Eric Mohr. Dynamic Partitioning of Parallel Lisp Programs. PhD thesis, Yale University, 1991.

[17] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264-280, July 1991.

[18] Rishiyur S. Nikhil and Arvind. Id: a language with implicit parallelism. Technical report, Massachusetts Institute of Technology, Cambridge, 1990.

[19] Yoshihiro Oyama, Kenjiro Taura, and Akinori Yonezawa. An efficient compilation framework for languages based on a concurrent process calculus. In Proceedings of Euro-Par '97, number 1300 in Lecture Notes in Computer Science, pages 546-553, 1997.

[20] John Plevyak, Vijay Karamcheti, Xingbin Zhang, and Andrew A. Chien. A hybrid execution model for fine-grained languages on distributed memory multicomputers. In Supercomputing '95, 1995.

[21] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren. Supporting dynamic data structures on distributed memory machines. ACM Transactions on Programming Languages and Systems, 17(2):233-263, 1995.

[22] Klaus E. Schauser, David E. Culler, and Seth C. Goldstein. Separation constraint partitioning: a new algorithm for partitioning non-strict programs into sequential threads. In Conference Record of POPL '95: 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 259-272, 1995.

[23] Richard M. Stallman. Using and Porting GNU CC, 1995.

[24] Supercomputing Technology Group, MIT Laboratory for Computer Science. Cilk-5.0 (Beta 1) Reference Manual, 1997. http://theory.lcs.mit.edu/~cilk/.

[25] Kenjiro Taura, Satoshi Matsuoka, and Akinori Yonezawa. StackThreads: An abstract machine for scheduling fine-grain threads on stock CPUs. In Proceedings of the Workshop on Theory and Practice of Parallel Programming, number 907 in Lecture Notes in Computer Science, pages 121-136. Springer-Verlag, 1994.

[26] Kenjiro Taura and Akinori Yonezawa. Schematic: A concurrent object-oriented extension to Scheme. In Proceedings of the Workshop on Object-Based Parallel and Distributed Computation, number 1107 in Lecture Notes in Computer Science, pages 59-82. Springer-Verlag, 1996.

[27] Kenjiro Taura and Akinori Yonezawa. Fine-grain multithreading with minimal compiler support: a cost effective approach to implementing efficient multithreading languages. In Proceedings of the 1997 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 320-333, 1997.

[28] Kenjiro Taura and Akinori Yonezawa. StackThreads/MP: Integrating futures into calling standards. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 1999.

[29] Kazunori Ueda and Takashi Chikayama. Design of the kernel language for the parallel inference machine. The Computer Journal, 33(6):494-500, 1990.

[30] Akinori Yonezawa, Jean-Pierre Briot, and Etsuya Shibayama. Object-oriented concurrent programming in ABCL/1. In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '86), pages 258-268, 1986.
