Hardware Support for Extracting Coarse-grain Speculative Parallelism in Distributed Shared-Memory Multiprocessors

Renato J. Figueiredo
Department of ECE, Northwestern University
[email protected]

Jose A. B. Fortes
Department of ECE, University of Florida
fortes@ufl.edu

Abstract

Data dependence speculation allows a compiler to relax the constraint of data-independence to issue tasks in parallel, increasing the potential for automatic extraction of parallelism from sequential programs. This paper proposes hardware mechanisms to support a data-dependence speculative distributed shared-memory (DDSM) architecture that enable speculative parallelization of programs with irregular data structures and inherent coarse-grain parallelism. Efficient support for coarse-grain tasks requires large buffers for speculative data; DDSM leverages cache and directory structures to provide large buffers that are managed transparently to applications. The proposed cache and directory extensions provide support for distributed speculative versions of cache blocks, run-time detection of dependence violations, and program-order reconciliation of cache blocks. This paper describes the DDSM architecture and presents a simulation-based evaluation of its performance on five benchmarks chosen from the Spec95 and Olden suites. The proposed system yields simulated speedups of 3.8 to 12.5 in a 16-node configuration for programs with coarse-grain speculative windows (millions of instructions and hundreds of KBytes of speculative data).

1 Introduction

Modern high-performance computers exploit implicit parallelism across instructions of a sequential stream (instruction-level parallelism), as well as explicit parallelism across tasks executing in distributed processing units (thread-level parallelism). The former, ILP, is often achieved transparently to the programmer via compiler and hardware techniques. The latter, TLP, is currently achieved either via explicit parallel programming or with the aid of parallelizing compilers [8, 7]. Future high-performance computers will be able to leverage plentiful on-chip resources to support various granularities of parallelism [1, 6, 22]. While implicit parallelism has traditionally been studied at the instruction level, several implicit techniques that exploit the large transistor budget of next-generation processors have recently been pursued in designs that exploit speculative TLP [11, 12, 5, 20, 21, 24].

 This work was partially funded by the National Science Foundation under grants CCR-9970728 and EIA-9975275. Renato Figueiredo is also supported by a CAPES fellowship. Part of this work was performed while the authors were at the Department of ECE, Purdue University.

This paper proposes a novel data dependence speculation technique for distributed shared-memory (DSM) multiprocessors that allows for automatic extraction of speculative TLP from sequential programs with irregular data structures and inherent coarse-grain parallelism. The speculation mechanisms extend existing L2 caches and directory protocols to support a hardware-based data-dependence speculative DSM. The paper describes the proposed technique and evaluates its performance via simulation.

Previous work on thread-level speculation for multiprocessors has considered solutions that exploit parallelism at fine [11, 12, 21] and coarse [24, 25] granularities. It is conceivable that large, high-performance multiprocessors of the future will employ speculative techniques at different granularities. However, there are limitations to the efficiency of previous approaches that need to be addressed to efficiently support highly speculative distributed multiprocessors. First, performance gains for fine-grain applications tend to diminish as more processors are used in computation; the relative cost of the synchronization overheads at start-up and commit time increases, and the system's efficiency is reduced. This behavior is observed in the results reported in [20], where modest relative performance improvements (at most 43%) are achieved as the number of processors is quadrupled from 4 to 16. Second, there are applications that have inherent parallelism that can be exploited speculatively at a coarse granularity [24]. In this case, speculation has the potential to provide good parallel efficiency, since synchronization costs are amortized across large task bodies. For example, increasing the granularity of tasks via loop unrolling has been shown to improve the performance of speculatively parallelized applications [5, 16]. However, in order to support coarse-grain tasks, large speculative buffers must be available.

This paper proposes a technique that addresses these limitations in the context of DSM multiprocessors. The proposed data-dependence speculation technique, DDSM, differs from fine-grain on-chip solutions [11, 12] in that a distributed protocol is employed and large speculative buffers are supported. The proposed technique also differs fundamentally from fine-grain, multi-chip solutions [5, 20] in that (1) all speculative state is encoded in cache lines and directory entries, and (2) protocol transactions follow the same sequence of events of conventional directory-based DSM coherence protocols.

DDSM treats cache blocks as self-contained speculative units, thereby leveraging existing cache-coherence mechanisms of DSMs. In contrast, speculative state in related proposals is distributed across cache lines and additional hardware buffers (LMDT/GMDT tables [5], ORB buffers [20]). Related work on coarse-grain speculation for DSMs has considered solutions that support large buffers [24, 25]. While in these schemes hardware detects data dependence violations, copy-out software is responsible for bringing speculative data to non-speculative state. DDSM differs from these solutions in that speculative versions of blocks are reconciled transparently to applications.

In summary, the main contribution of this paper is a novel data dependence speculation technique for DSMs that (1) is distributed, and (2) supports large cache-based speculative buffers that are managed transparently to applications. Its performance is quantitatively analyzed via simulation, based on a modified version of RSIM [18]. The analysis considers a set of five programs chosen from the Spec95 and Olden [3] benchmark suites, and a sparse matrix-based microbenchmark. These Fortran and C programs operate on both static and pointer-based data structures. The analysis shows that the speculative DSM efficiently supports coarse-grain applications with low occurrence (below 10%) of dependence violations, and delivers speedups of 3.8 to 12.5 for the studied programs. It also determines how performance is affected by the size of the speculative window, the number of processors, and the mis-speculation frequency for the sparse matrix-based microbenchmark.

The rest of this paper is organized as follows. Section 2 describes the DDSM programming and execution models. Section 3 describes the structures that implement the models of Section 2. Section 4 describes the experimental methodology used in the performance analysis; results are summarized in Section 5. Section 6 compares the approach with previous work, and Section 7 presents conclusions.

2 Programming and execution models

The programming model of DDSMs extends the single-program, multiple-data (SPMD) model to support speculative accesses to shared memory. In SPMD, processors execute the same program on multiple data during parallel execution, and communicate and synchronize via shared-memory accesses. Parallel tasks in SPMD may be identified manually by a programmer or automatically by a compiler; they must be data-independent to ensure correct execution. Data-dependence speculation allows a compiler or programmer to relax the constraint of data-independence to issue SPMD tasks in parallel; mechanisms to detect and recover from data dependence violations are provided.

The execution model is based on a hierarchical enforcement of sequential semantics [19, 11, 12, 5]. To facilitate the bookkeeping of data-dependent tasks, the model conservatively assumes that all instructions in a task are data-dependent on the first instruction of the task. Figure 1 shows an example of a speculative execution under this model. Consider a window of instructions consisting of four tasks of a sequential program.

    /* prologue */
    violated = 0;                        /* volatile, private */
    if (MyId == 0) Head = 0;             /* shared */
    Barrier();
    Begin_Spec();
    if (violated)
        while (MyId != Head) { spin(); }
    violated = 1;

    /* speculative window: tasks 0-3 issued in parallel */
    Task(MyId);

    /* epilogue */
    while (MyId != Head) { spin(); }
    End_Spec();
    Head++;
    Barrier();

    /* non-speculative code */
    Flush_Commit();

Figure 1: Example of sequential code speculatively executed in parallel by 4 processors.

Tasks 0 through 3 speculatively execute in parallel: if the tasks in the speculative window are data-independent, their parallel execution preserves the original sequential semantics. If they are not data-independent, the dependence violation(s) must be detected, and the violated tasks must be re-issued to preserve sequential ordering. In the DDSM realization of the hierarchical execution model, speculative tasks are issued in parallel, commit in sequential order, and are synchronized via a barrier. Tasks that violate dependences during execution are blocked by software and are re-executed in sequential order. Sequential order is implied by the unique identifiers of the processors executing speculative tasks. Speculative execution may involve all processors in the system, or only a subset of them. The Head task is defined as the earliest task of a speculative program that has yet to commit.

Program segments that are amenable to speculative parallelization under this model include loops and sequences of subroutine calls [17] with statically known control dependences. These segments are assigned to speculative execution through the addition of prologue and epilogue wrappers to the source code (Figure 1). Section 3.5 describes the DDSM software interface.

3 DDSM speculation methods

A speculative multiprocessor needs to provide mechanisms for (1) buffering speculative state until commit time, (2) detecting data dependence violations, (3) committing program-order speculative data, (4) discarding non-program-order speculative data, and (5) checkpointing the beginning/end of speculative execution. The support for these mechanisms in DDSM is discussed in the rest of this section.

3.1 Speculative buffers

Speculative data in DDSMs is buffered in extended cache lines. In order to leverage conventional cache organizations to hold speculative data, extra state information is appended to the state of conventional cache blocks, and the cache controller is extended to handle speculative accesses.


Figure 2: Overview of DDSM speculation methods described in Sections 3.1-3.5. In the figure, node i issues a speculative write that violates a data dependence, and speculative data from nodes j and j+1 are reconciled: (1) extensions to the L2 cache allow for buffering of speculative shared-memory data; (2) the directory checks the list of read sharers for RAW violations; when a violation occurs, the data-dependent tasks are squashed (4): speculative cache blocks transition to the squashed state, and the processor context, checkpointed at the beginning of speculation (5), is restored; at the end of speculation, caches flush committed speculative blocks to the directory, which employs a reconciling function (3) to commit the program-order version of speculative blocks to memory.

The novel state extensions proposed in this paper encode the speculative state of a memory block in two bits (SP-bits). A cache block in DDSM may be in one of the following four states with respect to its speculative status: SP_IN (speculative), SP_CO (contents are safe to be committed to main memory), SP_SQ (contents have been squashed), and SP_NO (not speculative); conventional cache-coherence block states, such as shared-clean and dirty, apply to non-speculative (SP_NO) blocks. Figure 3 shows the state diagram for speculative DDSM cache blocks. Transitions in the diagram are triggered by speculative reads and writes (Sreads, Swrites) and by commit, squash and flush requests.

In addition to SP-bits, each cache block has an SL bit (flags that the block has been speculatively loaded), a Fwd bit (flags that the block has been forwarded), and a per-word write mask. Write masks allow for accesses at a finer granularity than a cache block [11, 12, 5]: they determine which words of the block have been speculatively written. The resulting cache block is depicted in Figure 2 (1). Although the buffering of speculative state takes place in L2 cache blocks, DDSMs allow caching of speculative data in L1 caches for high performance by extending the L1 cache lines with speculative-state bits.
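For concreteness, the per-line state described above can be pictured as the following C sketch. The field names, word size, and packing are illustrative assumptions made here; the paper specifies only the 2-bit SP state, the SL and Fwd bits, and a per-word write mask for its 64-byte blocks.

    #include <stdint.h>

    /* Sketch of a DDSM-extended L2 cache line (assumed layout). */
    enum sp_state { SP_NO, SP_IN, SP_CO, SP_SQ };  /* encoded in 2 SP-bits */

    #define BLOCK_BYTES      64
    #define WORD_BYTES        4                    /* assumed word size */
    #define WORDS_PER_BLOCK  (BLOCK_BYTES / WORD_BYTES)

    struct ddsm_cache_line {
        unsigned sp  : 2;   /* enum sp_state: speculative status        */
        unsigned sl  : 1;   /* block has been speculatively loaded      */
        unsigned fwd : 1;   /* block has been forwarded                 */
        uint16_t mask;      /* bit i set: word i speculatively written  */
        uint8_t  data[BLOCK_BYTES];
        /* conventional tag and coherence state (shared-clean, dirty,
           ...) omitted */
    };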

Figure 3: State machine for speculative L2 cache blocks (states SP_NO, SP_IN, SP_CO and SP_SQ; transitions are triggered by Sread, Swrite, Commit, Squash and Flush requests).

Tables 1 and 2 show how speculative memory accesses to a sub-word i of a cache block B are resolved by the cache controller. The first speculative read (write) access to B triggers the sending of a Get_SR (Get_SW) message to B's home node. These first accesses are detected by testing the value of the block's SL and write-mask bits, and are used by the directory to dynamically build a record of all speculative readers and writers of a block. Subsequent reads and/or writes are satisfied locally by the L2 cache, without involving the directory. Two special cases are handled by the controller. First, if a speculative read is to a word previously written speculatively by the same processor, the access is satisfied locally by the cache, without involving the home node. Second, if a speculative write accesses a forwarded block, the system squashes subsequent speculative tasks (Section 3.6).

3.2 Ordering violation detection

Out-of-order memory accesses may violate data dependences imposed by the sequential semantics of a program. DDSM maintains ordering information of speculative data to detect violations, at a coherence-block granularity, at the directory.

    Message      Description
    Get_SR       Request spec. block from home; home marks requester as reader
    Get_SW       Request spec. block from home; home marks requester as writer
    Rec_flush    Flush spec. block to home to be reconciled
    Rec_squash   Request a partially reconciled copy of block from directory

Table 1: Speculative messages of DDSM L2 caches.

Figure 4: Extended DDSM state machine for cache blocks at the directory; extended states and transitions are shown in boldface. (In the diagram, blocks move among the conventional Uncached, Shared and Private states on Read/Write requests; a Get_SR or Get_SW request moves a block to the added Speculative state, invalidating sharers or fetching-and-invalidating from the owner with a write-back to memory; while Speculative, Get_SR/Get_SW requests add the requester to the readers/writers vectors, and Rec_flush/Rec_squash requests are handled, with Rec_flush returning the block to a non-speculative state.)

The main motivation for using the directory to track memory ordering and detect memory violations stems from the fact that accesses to a memory block are serialized by the directory controller [24]. The DDSM directory extends conventional CC-NUMA protocols with extra states to track memory-access ordering. Conventional protocols store the state and one read-sharers bit-vector for each memory block [15]. The DDSM directory protocol uses two bit-vectors, to record both the speculative readers and the speculative writers of a block, and defines one extra block state (Speculative).

The use of directory state to track ordering differs from the solution by Steffan et al. [20]. Although their solution requires less storage than DDSM, the fact that the directory has no knowledge of ordering has potential implications on the number of transactions involved in a remote memory access.

    Access        SP-bits   Action
    Sread(B[i])   SP_NO     Send Get_SR to home(B)
                  SP_IN     If SL = 1 or Mask(B[i]) = 1, request is
                            satisfied by L2; else send Get_SR to home(B)
                  SP_SQ     Send Rec_squash to home(B)
    Swrite(B[i])  SP_NO     Send Get_SW to home(B)
                  SP_IN     If Fwd = 1, multicast violation; else, if
                            OR(Mask) = 1, request is satisfied by L2;
                            else send Get_SW to home(B)
                  SP_SQ     Send Rec_squash to home(B)
    Flush         SP_CO     Send Rec_flush to home(B)
    Commit        SP_IN     Change B's state to SP_CO
    Squash        SP_IN     Change B's state to SP_SQ

Table 2: Actions performed by the cache controller for speculative reads/writes to sub-word i of block B (Flush, Commit and Squash requests apply to the entire block B).
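As an illustration of the Swrite column of Table 2, the following C sketch, building on the line layout sketched in Section 3.1, shows one way the controller's decision could be expressed. The message constants and helper functions are assumed names, not the paper's implementation.

    enum ddsm_msg { GET_SR, GET_SW, REC_FLUSH, REC_SQUASH };
    extern void send_to_home(enum ddsm_msg m, struct ddsm_cache_line *b);
    extern void multicast_violation(struct ddsm_cache_line *b);

    void swrite(struct ddsm_cache_line *b, int i, uint32_t value)
    {
        switch (b->sp) {
        case SP_NO:                      /* first speculative access      */
            send_to_home(GET_SW, b);     /* home records us as a writer;  */
            b->sp = SP_IN;               /* reply carries block contents  */
            break;
        case SP_IN:
            if (b->fwd)                  /* re-write of a forwarded block */
                multicast_violation(b);  /* later tasks squashed (3.6)    */
            else if (b->mask == 0)       /* first Swrite to this block    */
                send_to_home(GET_SW, b);
            break;                       /* otherwise satisfied locally   */
        case SP_SQ:                      /* squashed: fetch a partially   */
            send_to_home(REC_SQUASH, b); /* reconciled copy first         */
            b->sp = SP_IN;
            break;
        default:                         /* SP_CO: no Swrites post-commit */
            break;
        }
        ((uint32_t *)b->data)[i] = value; /* buffer the speculative value */
        b->mask |= (uint16_t)(1u << i);   /* mark word i as written       */
    }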

In their solution, any request received by the directory that may trigger an ordering violation must be communicated to all sharers of the block so that dependence checks can be performed. The potential performance degradation of such transactions may not be noticeable in small configurations or when communication occurs over a bus; for larger, distributed configurations, however, these transactions can become costly. In contrast, a DDSM performs dependence checks at the directory, without requiring further messages to be sent to sharers. This functionality requires an extra vector of sharers, to distinguish speculative readers from writers. Although a full-map directory has been assumed in this paper, solutions that provide storage for a limited number of sharers [4] can be leveraged; if the number of speculative writers exceeds the maximum number of sharers, correctness can be ensured by forcing dependence violations.

The DDSM protocol operates with blocks that can be either non-speculative or speculative. Blocks become speculative when the directory receives any Get_SR or Get_SW request. Blocks become non-speculative only after a reconciling request (Rec_flush) is satisfied by the directory. The extended protocol transactions are summarized in Figure 4. When a speculative request for a non-speculative block is received by the DDSM directory controller, all sharers of the block are invalidated. If the block is exclusive (dirty), it is written back to main memory. The block then becomes speculative, and its memory contents are sent to the requester's cache. Subsequent speculative accesses to the block are serialized by the directory controller, and the requester's identifier is recorded in the readers/writers bit-vectors. The directory includes the memory contents for the block in the reply, unless the request requires forwarding.

Blocks in the Speculative state may be modified in the L2 caches of multiple processors; DDSM produces a single, consistent version of the block at the end of speculation (Section 3.3). This support for multiple, distributed writable copies allows for automatic privatization of speculative blocks. Output- and anti-dependences (WAW, WAR) are resolved via this renaming mechanism, and do not cause dependence violations. Violations of true (RAW) dependences, however, are detected and trigger recovery mechanisms. The directory checks for a RAW violation every time a speculative write (Get_SW) request is received. The check consists of comparing the processor identifier of the writer, Wid, with the identifier of the earliest speculative reader that may be data-dependent on the writer (stored in the Readers[] vector): Rid = min{Readers[Wid+1, ..., N]}. If Wid < Rid, a RAW violation is detected, and the directory multicasts a squash message to all processors with identifiers greater than or equal to Rid. Figure 2 (2) shows an example of this check for Wid = i.
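A minimal C sketch of this check, assuming processors numbered 0 to N-1 and an assumed multicast_squash helper, could look as follows.

    extern void multicast_squash(int first, int last);

    /* Run at the directory on every Get_SW from writer wid; readers[]
       is the block's speculative-readers vector. */
    void check_raw_on_get_sw(const unsigned char readers[], int N, int wid)
    {
        /* Rid = min{ Readers[wid+1, ..., N-1] }: the earliest reader
           later in program order than the writer. */
        for (int rid = wid + 1; rid < N; rid++) {
            if (readers[rid]) {               /* wid < Rid: RAW violation */
                multicast_squash(rid, N - 1); /* squash tasks rid, ..., N-1 */
                return;
            }
        }
        /* No later reader: no violation; wid is recorded in the writers
           vector and the access proceeds. */
    }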

3.3 Task commits

The DDSM directory protocol allows multiple outstanding writable copies of a shared-memory block to reside in L2 caches during speculative execution. At the end of speculative execution, the DDSM directory applies a reconciliation operation to all speculative versions of a block to commit its program-order version to main memory.

Committing speculative data to global memory is performed in two steps. In the first step, each processor marks all SP-bits of speculative cache blocks as committed (SP_CO) in the L2 cache, similarly to the gang-commit operation in SVC [11]. This operation is local to a processor; processors are guaranteed to perform this operation in sequential order by synchronizing via the Head variable, as discussed in Section 3.5. In the second step, SP_CO blocks across distributed caches are globally reconciled to produce the program-order version of each block. This step is performed in parallel by the caches and directory controllers.

The second step requires that caches flush SP_CO blocks to their respective home nodes. At the time the flush operation is initiated, speculative blocks are guaranteed to be in the SP_CO state across multiple cache lines, since all processors have finished the first commit step in sequential order. When the home controller receives the first reconciliation request for a given block, it allocates a temporary buffer with a number of entries equal to the number of processors N. The controller then multicasts fetch-and-invalidate requests to the speculative sharers of the block and sets a reply counter with the expected number of replies. When a remote cache receives a fetch-and-invalidate request, it replies with the block's contents and its write mask, and invalidates the block. For each such reply received by the directory, the block's contents and mask are copied to the temporary buffer at the position given by the sender's identifier. When all versions are received (i.e., the reply counter reaches zero), a priority encoding operation, equivalent to the one performed by Hydra [12], produces the final version of the block. The priority encoder selects, among all write masks, the ID of the latest speculative writer for each sub-word of a block. This selection drives the output of data multiplexers that generate the program-order final version of each sub-word.

The reconciling transaction is based on a request-multicast-reply event sequence and can be implemented as an extension to the read-exclusive transaction to a shared block of existing directory protocols [15]. An example of the reconciling protocol transaction is discussed next and illustrated in Figure 2 (A)-(D). It begins when a reconciling request (Rec_flush) is received from a remote cache (node j, (A)). The directory handler allocates a buffer for this transaction, copies the contents and mask of the requester to the buffer, and multicasts fetch-invalidate requests to the sharers of the block (node j+1, (B)). When the directory receives all outstanding replies for the block (C), the priority encoding function is applied to the reconciling buffer (D) and the final version of the block is committed to memory.
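The priority-encoding step can be illustrated in C as follows; the function signature and buffer layout are assumptions for illustration, reusing the constants of the Section 3.1 sketch. Given the versions and write masks flushed by the nversions processors, the latest (highest-identifier) writer of each word wins, and words written by no task keep the memory copy.

    void reconcile_block(uint32_t final[WORDS_PER_BLOCK],
                         const uint32_t version[][WORDS_PER_BLOCK],
                         const uint16_t mask[], int nversions,
                         const uint32_t memory_copy[WORDS_PER_BLOCK])
    {
        for (int w = 0; w < WORDS_PER_BLOCK; w++) {
            final[w] = memory_copy[w];     /* default: pre-speculation data */
            for (int p = nversions - 1; p >= 0; p--) {  /* latest wins */
                if (mask[p] & (1u << w)) {
                    final[w] = version[p][w];
                    break;
                }
            }
        }
    }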

3.4 Task squashes

The DDSM caches are responsible for squashing all data produced by a speculative task that has violated a data dependence. Squashing is achieved by globally setting all SP-bits of speculative cache blocks to the state SP_SQ, and resetting the respective Fwd, SL and Mask bits, as shown in Figure 2 (4). When a restarted task accesses an SP_SQ block, the cache controller requests a partially reconciled copy of the block from the directory (Rec_squash, Table 2).

A partially reconciled block requested by task i is generated from the versions of tasks 0, ..., i-1. When a task is squashed, its execution is restarted by recovering the processor context (saved at the beginning of speculative execution) via an interrupt. Restarted tasks execute in sequential order.
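Using the hypothetical encoder sketched in Section 3.3, a partially reconciled copy for a restarted task i would simply bound the version scan at i:

    reconcile_block(partial, version, mask, /* nversions = */ i,
                    memory_copy);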

3.5 Checkpointing

The DDSM software/hardware interface is provided by a set of three system calls: Begin_Spec, End_Spec and Flush_Commit (Figure 1). The interface requires two extensions to the CPU. First, the processor must be able to save its context before speculation begins, and retrieve it when a violation interrupt is received. Only one speculative context needs to be stored; it is also possible to implement context saving/retrieving in software traps. Second, the instruction set needs to support speculative instances of all memory operations. This can be achieved by assigning one bit in the instruction opcode to determine whether the access is speculative or not. This bit is not used during decoding; it is passed on to the data cache controllers to determine whether the access is subject to the conventional (non-speculative) protocol or to the DDSM reconciling (speculative) protocol.

When a compiler for DDSM speculatively parallelizes a sequence of tasks, it (1) generates prologue code to set up the beginning of speculation, (2) marks the memory instructions inside the body of the task as speculative, and (3) generates epilogue code to set up the end of speculation. The prologue and epilogue consist of barriers, system calls, accesses to a local variable (violated), and non-speculative accesses to a shared-memory variable that stores the identifier of the Head task, as shown in Figure 1.

The prologue code initializes the Head variable, saves the processor context (Begin_Spec), and enforces sequential re-issue of data-dependent tasks by spinning on the value of Head if the task has been restarted (i.e., if the private variable violated equals 1). The epilogue code enforces in-order commits of speculative blocks to local L2 caches (End_Spec, synchronized by Head) and initiates the global reconciliation of committed blocks (Flush_Commit, synchronized by a barrier).

(Since a vendor compiler was used in this paper, the distinction between speculative and non-speculative memory operations was not implemented in the generated instruction set, but emulated via a CPU state bit that is set/reset via system calls. Accordingly, the non-speculative Head variable was represented as a special CPU register in the simulations, with access time equal to the average latency of incrementing a global variable with 16 processors. The performance impact of this simplification is negligible for the coarse-grain speculative windows of the studied benchmarks.)

3.6 Forwarding optimization

Some dynamic RAW dependences can be resolved via data forwarding without incurring violations. The basic DDSM reconciling directory does not provide automatic support for data forwarding, and full support for forwarding across distributed nodes can substantially increase protocol complexity. DDSM therefore supports a write-violate forwarding scheme that allows a processor Pi to forward a speculatively written block to later processor(s), as long as the forwarded block is not re-written speculatively by Pi. Forwarding of a block is initiated by the directory controller if there is any earlier (program-order) writer recorded in the block's bit-vector. The latest of such writers provides the data for the cache-to-cache transaction. If a forwarded block (i.e., Fwd = 1) is speculatively re-written by a task, all later tasks are squashed and restarted (Table 2).
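One way to picture this forwarding decision at the directory is the following C sketch, with assumed helper and type names; it forwards from the latest program-order-earlier writer recorded in the writers vector, and otherwise replies from memory.

    struct dir_block;               /* directory record for one block */
    extern void forward_from_cache(int src, int dst, struct dir_block *b);
    extern void reply_with_memory(int dst, struct dir_block *b);

    /* Assumed handler for a speculative read by processor rid. */
    void on_get_sr(const unsigned char writers[], int rid,
                   struct dir_block *b)
    {
        int src = -1;
        for (int p = 0; p < rid; p++)   /* latest writer earlier than rid */
            if (writers[p]) src = p;

        if (src >= 0)
            forward_from_cache(src, rid, b);  /* sets the Fwd bit at rid */
        else
            reply_with_memory(rid, b);
    }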

    Unit       Configuration
    CPU        300MHz, 4-way out-of-order; 2 ALUs, 2 FPUs, 2 address units
    L1 cache   16KB, write-through, direct-mapped, 1-cycle hit latency
    L2 cache   2MB, write-back, 8-way associative, 8-cycle hit latency
    Memory     60ns DRAM latency; local miss: 48 cycles, remote miss: 134 cycles

Table 3: Model parameters (all caches have 64B blocks).

3.7 Cache replacements

When a speculatively written block is replaced from a cache, it causes false dependence invalidations: the block cannot be written back to memory, since it is not guaranteed to be valid; nor can it simply be discarded, since its contents are needed to perform reconciliation. Since the replacement of speculatively written lines causes false dependence invalidations and performance degradation, it is important to reduce the probability of their occurrence. Although not studied in this paper, the use of main-memory node caches [9], in addition to L2 caches, can potentially provide a large, associative cache for speculative data and reduce the probability of capacity- or conflict-induced restarts. These optimizations may be applied to implementations of DDSMs. In related work, it has been shown that Stache main-memory node caches can be used in conjunction with a reconciling protocol to prevent cache replacements from being sent to remote home nodes [14]. Since in DDSM the entire speculative state is encoded in cache blocks and directory entries, existing mechanisms to store evicted lines in main memory can be leveraged. The experiments described in this paper do not assume the availability of node caches. In this scenario, upon a replacement, the DDSM conservatively squashes all tasks data-dependent on the replaced block.

4 Experimental methodology

4.1 Machine model and configuration

The machine model assumed in this work is a release-consistent CC-NUMA multiprocessor with hardware support for DDSM speculation. (Speculative memory accesses may be issued out of order across DDSM processors, but eventually commit in sequential order; the machine supports release consistency for non-speculative DSM accesses.) The system consists of up to 16 identical nodes connected by a 2-D mesh network. Each node has a memory bus connecting a single processor with on-chip L1 and L2 caches, main memory, a directory controller and a network interface.

Table 3 shows the parameters assumed for processors, caches and memory inside each node of the machine. The average simulated remote vs. local memory read latency ratio between adjacent nodes is 2.8. All simulation parameters not displayed in the table are set to RSIM [18] defaults. The configuration of the L2 cache (size and associativity) loosely models a system that supports speculative execution without the occurrence of replacement-induced restarts for the programs under study, such as one with Simple-COMA main-memory caches [9, 14].

    Bench.      Working set         Tasks
    Turb3d      32x32x32, 3 iter.   loop/subr.
    Health      7 levels, 24 iter.  recursive subr.
    Power       3200 customers      loop/subr.
    Perimeter   4K x 4K, 8 levels   recursive subr.
    TreeAdd     256K nodes          recursive subr.
    Ssm         1K-64K integers     loop iter.

Table 4: Benchmarks used in the performance analysis.

4.2 Workloads and simulation model

The performance analysis of Section 5 is based on the simulation of the speculatively parallelized benchmarks listed in Table 4 (with respective working sets) from the Spec95 and Olden suites, in addition to a sparse-matrix microbenchmark developed by the authors.

Turb3d is a benchmark from the Spec95 suite that simulates isotropic, homogeneous turbulence in a cube. Turb3d has available parallelism across procedures that compute Fast Fourier Transforms (FFTs) along distinct dimensions. Data dependence speculation allows parallelization across FFT subroutine calls without the need for inter-procedural analysis.

The Olden benchmarks Health, Power, Perimeter and TreeAdd are written in C and represent codes with dynamic data structures, where parallelism is hard to detect automatically at compile time. Olden benchmarks are distributed in two versions: a sequential version, and a hand-parallelized version based on futures [3]. Hand-parallelized Olden programs have been studied previously in systems without data dependence speculation. In contrast, this paper begins with the sequential version of the Olden programs, which is manually prepared for speculative parallelization. This preparation consists of the addition of prologue/epilogue wrappers to sequential regions in the source code; no manual dependence analyses are performed. This interface is simpler for a compiler than the traditional SPMD model, because threads do not need to be proved data-independent at compile time. The task of the compiler is then only to identify potentially parallel threads for speculative execution and insert prologue/epilogue code (Figure 1). Such code is application-independent and can be encapsulated in system libraries.

The benchmark Power solves a power-system optimization problem. Parallelism in Power is difficult to detect automatically, due to the pointer structures that are used in its computation. However, Power can be speculatively parallelized at the outermost loop. The Health benchmark simulates the Colombian health-care system.

    /* a) combine via partial-sum reduction */
    Divide() {                    DivideInline() {
        a = Divide();                 a = Divide();   /* P0 spec. task */
        b = Divide();                 b = Divide();   /* P1 spec. task */
        return a + b;                 c = Divide();   /* P2 spec. task */
    }                                 d = Divide();   /* P3 spec. task */
                                      /* barrier */
                                      return (a + b + c + d);
                                  }

    /* b) combine via explicit serialization */
    Divide() {                    DivideInline() {
        Divide();                     Divide();       /* P0 spec. task */
        Divide();                     Divide();       /* P1 spec. task */
        Combine();                    Combine();      /* P1 waits for P0 */
    }                                 Divide();       /* P2 spec. task */
                                      Divide();       /* P3 spec. task */
                                      Combine();      /* P3 waits for P2 */
                                      Combine();      /* P3 waits for P0, P1 */
                                      /* barrier */
                                  }

Figure 5: Partial inlining of recursive calls to expose subroutine parallelism across 4 processors (c). The combine step of the recurrence uses partial-sum reductions (a) or explicit serialization (b).

Health utilizes pointer-based lists with elements that are dynamically inserted/removed and traversed recursively. It is speculatively parallelized by partially inlining recursive calls and speculating across subroutine calls. TreeAdd is a benchmark that computes the summation of values in a tree. Perimeter computes the perimeter of a set of quad-tree-encoded raster images. Both TreeAdd and Perimeter use pointer-based tree structures. Similarly to Health, partial inlining is used in these two benchmarks.

Procedure-level speculation is exploited in the benchmarks with recursive subroutine calls by applying partial inlining (Figure 5). Conventional compilers use inlining both to reduce procedure-call overhead and to enlarge the window of operations that can be inspected for global optimizations. In DDSM, partial inlining does not target either of these goals: it is used to expose speculative parallelism by increasing the number of procedure calls. During speculative execution, these procedures are issued in parallel (Figure 5c). After their execution has completed, the "combine" phase of the recursion may require either code serialization (Figure 5b) or, when applicable, a reduction (Figure 5a) to avoid data dependence violations.

The remaining benchmark, Ssm, is a microbenchmark that models indirect addressing of array elements. The pseudo-code for this program is as follows: for(i=0;i
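A representative indirect-addressing loop of the kind Ssm models, sketched here as an assumption rather than the benchmark's exact code:

    for (i = 0; i < N; i++)
        A[index[i]] += x[i];   /* index[] determines run-time dependences:
                                  iterations touching disjoint A[] entries
                                  are parallel; repeated indices create RAW
                                  dependences that DDSM must detect */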