A Unified Approach to Speculative Parallelization of Loops in DSM Multiprocessors

Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas
University of Illinois and Texas A&M University

Abstract

Speculative parallel execution of statically non-analyzable codes on Distributed Shared-Memory (DSM) multiprocessors is challenging because of the long latencies and memory distribution involved. However, such an approach may well be the best way of speeding up codes whose dependences cannot be analyzed by the compiler. In this paper, we extend past work by proposing a hardware scheme for the speculative parallel execution of loops that have a modest number of cross-iteration dependences. When a dependence violation is detected, we locally repair the state and then, depending on the situation, either re-execute one out-of-order iteration or restart parallel execution from that point on. The general algorithm, called the Unified Privatization and Reduction algorithm (UPAR), privatizes on demand at cache-line level, executes reductions in parallel, and merges the last values and partial results of reductions on-the-fly, leaving minimal residual work at loop end. UPAR allows completely dynamic scheduling and does not slow down if the working set of an iteration is larger than the cache size. Simulations indicate good speedups relative to sequential execution. The hardware support for reduction optimizations brings, on average, a 50% performance improvement and can be used both in speculative and in normal execution.

Keywords: scalable shared-memory multiprocessors, cache coherence protocols, run-time parallelization, speculative execution, reduction parallelization.

1 Introduction

To achieve a high level of performance for a particular program on today's supercomputers, software developers are often forced to tediously hand-code optimizations tailored to a specific machine. Such hand-coding is difficult, increases the possibility of error over sequential programming, and the resulting code may not be portable to other machines. Restructuring, or parallelizing, compilers address these problems by detecting and exploiting parallelism in sequential programs written in conventional languages. Although compiler techniques for the automatic detection of parallelism have been studied extensively over the last two decades (see, e.g., [19, 24]), current parallelizing compilers cannot extract a significant fraction of the available parallelism in a loop if it has a complex and/or statically unknown access pattern. Programs exhibiting this kind of behavior account for more than 50% of all Fortran applications [12] and encompass most C codes. Typical examples of applications containing such loops are complex simulations such as SPICE for circuit simulation, DYNA-3D and PRONTO-3D for structural mechanics modeling, GAUSSIAN and DMOL for quantum mechanical simulation of molecules, CHARMM and DISCOVER for molecular dynamics simulation of organic systems, and FIDAP for modeling complex fluid flows [6]. Thus, in order to realize the full potential of parallel computing, it has become clear that static (compile-time) analysis must be complemented by new methods capable of automatically extracting parallelism at run-time [5, 6, 9]. Run-time techniques can succeed where static compilation fails because they have complete information about the access pattern.


For example, input-dependent or dynamic data distribution, memory accesses guarded by run-time dependent conditions, and subscript expressions can all be analyzed unambiguously at run-time. Recently, new software techniques for the automatic parallelization of loops have been introduced [21, 20] which take a more aggressive approach: they speculate about the parallelism of the access pattern, generate and then execute the loops speculatively in parallel (alternatively, they can inspect the access pattern first), and verify the validity of the parallelization by means of a parallel algorithm. The capability of these software run-time techniques to parallelize otherwise compiler-intractable code makes them extremely attractive when compared with their only alternative, sequential execution. However, their performance, when compared to manual parallelization, leaves a lot of room for improvement. The relative simplicity and repetitive character of the software techniques, as well as their sizable overhead and relatively slow detection of data dependence violations (violations are detected only after loop execution), make them a good target for hardware implementation. For this reason we propose to develop a general system that implements run-time parallelization and related optimizations by combining the flexibility and generality of restructuring compilers and software-based speculative parallelization with the high performance offered by hardware-implemented primitives.

The Goal of the Proposed Design

In this paper we present a comprehensive approach to the design of a parallel processing system that can automatically extract all available parallelism from Fortran and C programs and efficiently execute them on a distributed shared-memory machine in the SPMD model of execution. The system consists of a state-of-the-art compiler, a modestly enhanced run-time system and a modified DSM architecture that work together to achieve our goal of automatic and efficient parallel execution. Our overall philosophy is to first let the compiler extract all statically analyzable information about the available parallelism of loops. Those loops and shared data structures for which no definitive answer has been found at compile time are annotated and transformed for parallel execution. Additional calls to the run-time system (for scheduling and memory allocation) and system calls (for the configuration of the programmable hardware) are inserted at this time. Then, at run-time, the annotated loops are executed in a speculative mode as doalls, i.e., as fully parallel loops. The properly configured hardware monitors the relevant memory references and verifies that the flow of the loop's values respects the semantics of the original, serial loop. Any violation is almost immediately detected and repaired and the parallel execution continued. Such violations are also recorded so that appropriate scheduling techniques can avoid them in future instantiations of the loop, thus closing the feedback loop of our adaptive system. We have made two important decisions for our approach to automatic parallelization:





- We make every effort to execute loops in fully parallel mode (as doalls) because we believe that only from such loops can we expect performance that scales with the number of processors and the data size. Loops requiring many synchronizations at arbitrary points during their parallel execution (as is the case with irregular, dynamic applications) are at a definite disadvantage on large multiprocessors due to the time lost in synchronization and the intrinsically high cost of cross-processor communication. Therefore both compiler and hardware adopt the most promising and effective loop transformations to remove the need for synchronizations and to reduce the chance of a dependence violation and subsequent repair phase during speculative parallel execution.

- The proposed architecture verifies the correctness of the data flow in a radically different manner than previous software approaches. It uses a modified cache coherence protocol that can handle any cross-processor communication, which is the only potential source of violations in a speculatively parallelized loop. We make every effort to shift all such monitoring activity to the memory (directory) side of the system, without interfering with the activity of the processors; this violation detection activity is in effect 'appended' to the communication activity, which can be overlapped with the system's computation, and thus does not necessarily increase the overall execution time of the program.

The proposed system provides a unified mechanism to remove all memory-related data dependences and to parallelize valid reduction operations by copying shared data into dynamically allocated private storage when first accessed and then, at loop end, merging it back out to the original shared space. Cross-processor flow dependences are serviced when possible; when this is not possible, the system recovers from the dependence violation by re-executing a minimal amount of work. We note that the work presented in this paper represents a significant improvement over our previous work [27, 25, 28, 26]. It introduces a unified privatization and reduction algorithm (UPAR) that is capable of removing any memory-related dependences and parallelizing reductions without relying on a specific choice of coherence algorithm. UPAR privatizes on demand (only actual references) at the cache-line level and merges the results of the computation, i.e., last values and reductions, on-the-fly, with minimal additional overhead at the end of the loop. The system described here squashes a minimal number of speculative iterations, and only in the case of an out-of-order flow dependence. The architecture proposed here uses one unified communication algorithm throughout, without any compiler or user specification; of course, if more information is available it can be used. Due to space limitations, in this paper we only outline the functions performed by the compiler and run-time system, but we show the detailed implementation of the proposed architecture. The paper first presents some important principles of loop parallelization and relates them to their practical use in our system. Then we present the overall architecture of the system (compiler tasks, machine architecture and run-time system tasks) and follow up with a detailed description of the hardware. Finally, a performance evaluation and a brief comparison to related work are presented.

2 Foundations of Loop Parallelization

A loop can be executed in parallel without synchronization only if the outcome of the loop does not depend upon the order of execution of the different iterations. To determine whether or not the order of the iterations affects the semantics of the loop, we need to analyze the data dependences across iterations (or cross-iteration dependences) [2]. There are three types of data dependences, namely flow (read after write), anti (write after read), and output (write after write). If there are no anti, output, or flow dependences across iterations, the loop can be executed in parallel. Such a loop is called a doall loop. If, instead, there are flow dependences across iterations, the loop cannot in general be executed in parallel. For example, the loop in Figure 1(a) cannot be executed in parallel because iteration i needs the value that is produced in iteration i-1. Finally, if there are only anti or output dependences, the loop must be modified to remove all these dependences before it can be executed in parallel. In the following, we describe two important transformations that can be used to remove many such dependences: privatization and reduction parallelization.

do i = 1, n
   A(i) = A(i) + A(i-1)
enddo
(a)

do i = 1, n/2
(S1)  tmp = A(2*i)
      A(2*i) = A(2*i-1)
(S2)  A(2*i-1) = tmp
enddo
(b)


do i = 1, n
(S1)  A(f(i)) = ...
(S2)  ... = A(g(i)) + ...
enddo
(c)

Figure 1: Examples of loops.

2.1 Privatization

Through the privatization transformation, we create, for each processor participating in the execution of the loop, private copies of the variables that cause anti or output dependences. The loop can then be executed in parallel. For example, the loop in Figure 1(b) has an anti dependence between statement S2 of iteration i and statement S1 of iteration i+1. This dependence can be removed by privatizing variable tmp. In this paper we consider an array privatizable if every read of it is preceded by a write to it in the same iteration, i.e., every use of a variable is first defined in the same iteration, thus killing any value defined in a previous iteration. In general, privatizable variables are temporary variables used as workspace within an iteration. If the first access(es) to a variable a are read references, and in every later iteration that accesses a it is always written before it is read, then the loop can still be executed as a doall by having the initial accesses to a trigger a copy-in of the global value of a, and having the iterations that wrote a use private copies of a. In this way loops with a (read)*(write|read)* access pattern can be safely transformed into a doall. For example, all the loops in Figure 2 can be run in parallel if the single array element accessed is privatized and the private copies are initialized or read in when needed. Such situations can be detected by keeping track of the maximum iteration i_r^+ that read a (before it was ever written) and the minimum iteration i_w^- that wrote a. Then, if i_r^+ <= i_w^-, the loop can be executed in parallel.

[Figure 2 depicts, for several example loops, the per-iteration sequences of Rd and Wr references (iterations It 1, It 2, It 3) to a single array element.]

Figure 2: Examples of iterations of loops that can be parallelized by privatization with copy-in.

Under this privatization-with-copy-in model of parallel execution, the most efficient way to bring the shared data into private storage is on demand, i.e., a data element is read directly from the shared area until it is written for the first time (into private storage). Equivalently, we can copy the data into the private area at the first read. In the case of irregular, sparse access patterns this scheme ensures that only the minimum number of loads is executed. If the privatized data is needed after loop execution (or it cannot be proven otherwise), then a copy-out phase is necessary to ensure the sequential consistency of the parallelized program. During this phase, called last-value copy-out, the last written data elements, i.e., those written in the highest iteration that modified them (they could have been written in many iterations), are copied out from private to shared memory. Generally speaking, such an operation involves significant cross-processor communication and occurs after loop execution, thus adding quite a bit to the overall execution time.
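To make the copy-in and last-value copy-out mechanics concrete, the following is a minimal C sketch of the software analogue of this transformation for one array; the round-robin schedule, the written/last_iter bookkeeping arrays, and the sequential simulation of the P "processors" are illustrative assumptions, not part of the hardware scheme presented later.

#include <stdio.h>

#define N     8      /* size of the array under test                 */
#define P     2      /* number of simulated "processors"             */
#define ITERS 16     /* loop iterations, scheduled round-robin       */

int A[N];                  /* shared array                           */
int priv[P][N];            /* per-processor private copies           */
int written[P][N];         /* has processor p written element j?     */
int last_iter[N];          /* highest iteration that wrote element j */
int owner[N];              /* processor holding that last value      */

/* Copy-in on demand: read the shared value until the element has
   been written into private storage by this processor.              */
static int rd(int p, int j) { return written[p][j] ? priv[p][j] : A[j]; }

/* Write into private storage and remember the writing iteration.    */
static void wr(int p, int it, int j, int v) {
    priv[p][j] = v; written[p][j] = 1;
    if (it > last_iter[j]) { last_iter[j] = it; owner[j] = p; }
}

int main(void) {
    long checksum = 0;
    for (int j = 0; j < N; j++) { A[j] = j; last_iter[j] = -1; }

    /* iterations 0..7 only read (exposed reads of original values);
       iterations 8..15 write the element before reading it, so the
       condition i_r^+ <= i_w^- holds for every element              */
    for (int it = 0; it < ITERS; it++) {
        int p = it % P, j = it % N;
        if (it < ITERS / 2) {
            checksum += rd(p, j);
        } else {
            wr(p, it, j, it);        /* write first ...              */
            checksum += rd(p, j);    /* ... then a privatizable read */
        }
    }

    /* last-value copy-out: only the value written in the highest
       iteration that modified an element reaches shared memory      */
    for (int j = 0; j < N; j++)
        if (last_iter[j] >= 0) A[j] = priv[owner[j]][j];

    for (int j = 0; j < N; j++) printf("%d ", A[j]);
    printf("(checksum %ld)\n", checksum);
    return 0;
}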

2.2 Reduction Parallelization

Reduction parallelization is a powerful transformation that removes the flow dependences associated with a reduction operation. A reduction variable is a variable whose value is used in one associative and (not necessarily) commutative operation of the form x = x ⊗ exp, where ⊗ is the operator and x does not occur in exp or anywhere else in the loop. A simple example is statement S1 in Figure 3(a). The function performed by the loop is to add a value computed in each iteration to the value stored in A(:). This type of reduction is sometimes called an update. There are several known parallel methods for performing reduction operations. One method is to transform the do loop into a doall and enclose the access to the reduction variable in an unordered critical section [9, 29], or, equivalently, perform an atomic fetch-and-add. The drawbacks of this method are that it is not scalable and that it requires potentially expensive synchronizations. A scalable method can be obtained by noting that a reduction operation is an associative and commutative recurrence and can thus be parallelized using a recursive doubling algorithm [14, 16].

In this case, the reduction variable is privatized in the transformed doall. A scalar is then produced using the partial results computed in each processor as operands for a reduction operation (with the same operator) across the processors (Figure 3(c)). This last cross-processor reduction, called merge-out, can be quite time-consuming if the reduction is performed on a large array (every element on every processor has to be added into its shared counterpart). Furthermore, if the access pattern is sparse, this operation becomes either quite inefficient, because we perform many operations with zero operands (if the partial sums have been kept in private, replicated arrays), or very expensive (if the partial sums have been kept in a compact data structure, e.g., hash tables). The real difficulty encountered by compilers in parallelizing loops with reductions arises from recognizing and validating the reduction statements. So far, this problem has been handled at compile time by syntactically pattern matching the loop statements against a template of a generic reduction, and then performing a data dependence analysis of the variable under scrutiny to guarantee that it is not used anywhere else in the loop except in the reduction statement [29].

do i = 1, n
   do j = 1, m
S1:   A(j) = A(j) + exp()
   enddo
enddo
(a)

do i = 1, n
S1:  A(K(i)) = .......
S2:  ............ = A(L(i))
S3:  A(R(i)) = A(R(i)) + exp()
enddo
(b)

doall i = 1, p
   pA(1:n) = 0
enddoall
doall i = 1, n
S1:  A(K(i)) = .......
S2:  ............ = A(L(i))
S3:  pA(R(i)) = pA(R(i)) + exp()
enddoall
doall i = 1, p
   A(1:n) = A(1:n) + pA(1:n)
enddoall
(c)

Figure 3: (a) Simple example of a reduction; (b) reduction needing run-time validation; (c) reduction transformed for parallel execution.

In the cases where data dependence analysis cannot be performed at compile time, reductions have to be validated at run-time. For example, although statement S3 in the loop in Figure 3(b) matches a reduction statement, it is still necessary to check at run-time that the elements of array A referenced in S1 and S2 do not overlap with those accessed in statement S3. Thus, in order to validate a reduction at run-time we must check that there is no intersection between the references in S3 and those in S1 and/or S2; just as before, all other potential dependences caused by the references in S1 and S2 will have to be checked.
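For reference, the following is a minimal C sketch of the privatized-reduction pattern of Figure 3(c); the round-robin schedule and the sequential simulation of the p processors are simplifying assumptions, and the final loop makes explicit the merge-out cost (proportional to the full array size on every processor) discussed above.

#include <stdio.h>

#define N 4    /* reduction array size    */
#define M 16   /* loop iterations         */
#define P 2    /* simulated "processors"  */

int main(void) {
    int A[N] = {0};
    int pA[P][N] = {{0}};        /* privatized partial sums, neutral element 0 */

    /* the doall of Figure 3(c): each iteration accumulates into its
       processor's private copy, so no cross-processor synchronization  */
    for (int i = 0; i < M; i++) {
        int p = i % P;           /* round-robin schedule (illustrative)  */
        pA[p][i % N] += i;       /* stands in for A(j) = A(j) + exp()    */
    }

    /* merge-out: O(P*N) work regardless of how sparse the accesses were */
    for (int p = 0; p < P; p++)
        for (int j = 0; j < N; j++)
            A[j] += pA[p][j];

    for (int j = 0; j < N; j++) printf("A[%d] = %d\n", j, A[j]);
    return 0;
}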

2.3 Our Approach to Run-Time Application, Verification and Recovery

From this brief introduction we conclude that, by employing the copy-in, process-in-private-storage, merge (copy) out scheme, most data dependences caused by memory reuse in sequential programs (anti and output dependences), as well as a special but very frequent case of flow dependences, reductions, can be removed and the loop executed as a doall. However, the remaining flow dependences, those that could not be statically identified, need to be detected at run-time. The overall criterion for such an occurrence is a Write/Read access pair to a memory location in different iterations. In principle, an ordered critical section could enforce this dependence; however, in the case of un-analyzable codes, this is not possible. Our solution is to detect such Write/Read pairs at run-time and check whether they have occurred in the correct order (a higher iteration should read the data produced by a lower iteration). If this is not the case, then the consuming (higher) iteration has to be re-executed after the writing iteration has finished. In other words, we transform the code into a doall and enforce flow dependences, repairing any violations, by using a Unified Privatization and Reduction (UPAR) algorithm.


3 Speculative Parallelization

In this section we first present the overall hardware and software system that can automatically parallelize irregular, dynamic applications. We then list the additional data structures and their functionality, and follow up with a description of the global data flow of the architecture. Finally, we discuss how flow dependences are detected and the actions that the system initiates to continue a correct execution.

3.1 General System Functionality

3.1.1 At Compile Time

Our system uses Polaris [4], a state-of-the-art Fortran parallelizing compiler that first performs a static analysis and qualifies loops as parallel, serial, or potentially parallel. All arrays and scalars proven independent or privatizable by the compiler are transformed for parallel execution. The potentially parallel loops are then marked for speculative parallel execution, and the code generated for the accesses to their potentially dependent arrays (the load/store instructions) is changed to specialized types: "regular" speculative load/store and "reduction" load/store. The shared arrays 'under test' are themselves left unchanged. Reduction operands, whether they have to be validated at run-time or not, are also left unchanged. System calls are inserted before and after the loop for the proper configuration of the hardware. Additional code is inserted so that the run-time system can use a special scheduler that makes use of data dependence information obtained from previous instantiations of the loop; if this information is not available, code for a generic doall is used. Finally, we insert a call to a library routine that can read the information collected during a speculative run (mostly flow dependence information) and build a mapping of iterations to processors that avoids the occurrence of out-of-order flow dependences.

3.1.2 During Execution

After system calls configure the hardware, the compile-time modified loop is executed as a doall in a speculative mode. During speculative execution the hardware performs the following functions (for all variables under test, including all reduction variables):

 



- Cache-line-level privatization and copy-in. Every first write (WR) operation on a processor creates its own, valid copy in the cache. Through this cache-level duplication all memory-related dependences can be removed. Displaced cache lines are stored in private, dynamically allocated memory. All allocation and de-allocation is performed on demand.

- Flow-dependence violation detection. The normal, unmodified cache coherence traffic is augmented to carry additional information about the iteration number during which variables have been referenced. The home nodes detect the sharing of variables (a flow dependence – all other sharing has been removed through privatization, i.e., cache-line replication) and verify that their actual arrival order is also their sequentially correct order. A flow dependence violation between two iterations is detected when some data-producing, lower iteration occurs after the higher, data-consuming iteration. If no violation has occurred, the hardware finds the correct data in one of the caches or in the memory subsystem and provides it to the consuming iteration.

- Flow-dependence violation recovery. After a flow dependence violation between two iterations is detected, the higher iteration (the read-in, data-consuming iteration), which occurred after the producing iteration, is re-executed. If we detect that the producing iteration (the writing iteration) could have affected more than one consuming iteration, then all higher iterations are squashed, their effect undone, and the execution resumed from this point. The repair phase is triggered by a hardware interrupt and is performed by run-time routines.

- Reduction verification. Verification is performed using a special reduction load instruction to check whether the reduction operands have been accessed only within the reduction statement. If the reduction operand is written outside the reduction statement, then the private reduction variables are re-initialized (since their current private values are no longer needed) and the shared data is updated. If the reduction operand is read outside the reduction statement, then the values of the private reduction variables are merged into shared memory, and the resulting merged value is sent to the requester.

- Merging of results from private storage to shared memory. Merging is performed only at displacement, at two points during execution: (i) during loop execution, if the cache lines get displaced, and (ii) after loop execution, through cache flushing of all written cache lines of the array under test. Lines involved in reductions are 'reduced' (e.g., added in) to the shared memory through a fetch-and-op operation performed by their home directory. From all other duplicated, privatized lines, the home node continuously chooses those written last (according to the highest iteration in which they were modified). This on-the-fly update of the shared memory with the results computed in the private working space of the loop is extremely efficient. At displacement, values are sent by means of non-blocking writes to their home node, thus not increasing total execution time. The time to flush all dirty lines at the end of the loop is proportional to the per-processor cache size. This time can be dramatically shorter than the usual software cross-processor last-value or reduction operation [21], which is proportional to the data size (in the best case) or the array dimensions (the worst case, for sparse accesses).

- Undo log maintenance. In order to be able to restart loop execution from any point of failure, the state of the loop's working space must be kept consistent. This is achieved by keeping a limited and distributed trace of the relevant memory modifications (writes) made by the iterations of the loop, i.e., an undo log. As will be shown in Section 4.3, this trace is limited to the span (minimum to maximum) of active iterations in the system (the active window). The storage associated with it is recycled at the same rate as the lowest active iteration number advances.

We will now describe in detail how these essential functions are achieved by our system.

3.2 Special Data Structures

Two types of additional data structures need to be supported in hardware to correctly and efficiently execute speculative parallel loops: (i) data structures to record the relative iteration-wise order of memory references and (ii) data structures to support the repair of data dependence violations.

3.2.1 Structures for Recording Memory Reference Order

Memory reference order records are necessary for the detection of possible data dependence violations. Each shared memory element involved in a speculative execution (one for which the compiler could not analyze the access pattern) is associated with a set of time stamps, i.e., the iteration numbers in which it was accessed. CTS denotes the time stamp of the currently executing iteration. For data residing in shared memory we need to keep the following timestamps in their respective home directory:

  

- MaxRD – the highest iteration in which the data was read.

- MinWR – the lowest iteration in which the data was written.

- MaxWD – the highest iteration in which the last displaced data was written. At the end of the loop execution this timestamp becomes the timestamp of the last value.

- MR – a flag indicating that the data has been read consecutively by more than one iteration (without an intervening write). It is set only at the second read and is reset by a subsequent write. It can be used to simplify error recovery.

For the data residing in private storage we provide two timestamps in their respective local directory:



- pMaxRD and pMaxWR – the highest iterations in which the data was read or written, respectively, by the corresponding (owning) processor.

To keep track of the relative order of references within one iteration, when the data is in cache, we provide the following tags for each memory element (word) that is under test:

  

- Wr – set when the data is written for the first time in an iteration.

- Rd – set when the data is read for the first time in an iteration, unless it was already written (if it was already written (Wr = 1), then the data is privatizable within the iteration).

- Rx – set when the data is referenced as a reduction operand (with the specialized load instruction).

The general rule for setting the timestamps in the directories is to perform a minimum or maximum operation between the stored timestamp and the current timestamp (the number of the iteration currently referencing the data). They are cleared before each instantiation of the loop. The tags (also called mark bits) of data in the cache are set only once per iteration, at the first reference (for both reads and writes). Tags are reset at every iteration. Setting the tags also triggers setting the private timestamps in the local directory and notifies the home directory if necessary.
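The per-element state and the min/max update rule can be pictured with the following C sketch; it is a software model only (the field widths, the rd_since_wr helper flag and the function boundaries are illustrative), and the per-iteration clearing of the Wr/Rd/Rx tags and the reduction (Rx) path are not shown.

#include <stdint.h>

typedef struct {
    /* home directory (shared data); 0 means "never referenced"             */
    uint32_t MaxRD;        /* highest iteration that read the element       */
    uint32_t MinWR;        /* lowest iteration that wrote it                */
    uint32_t MaxWD;        /* highest iteration whose displaced write reached home */
    uint8_t  MR;           /* read by >1 iteration, no intervening write    */
    uint8_t  rd_since_wr;  /* model-only helper used to set MR              */
    /* local directory (private data)                                       */
    uint32_t pMaxRD, pMaxWR;
    /* cache tags, reset at every iteration                                 */
    uint8_t  Wr, Rd, Rx;
} elem_state_t;

static uint32_t umax(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* First exposed read of the element in iteration CTS. */
static void note_read(elem_state_t *s, uint32_t CTS) {
    if (s->Wr || s->Rd) return;          /* tags are set once per iteration      */
    s->Rd = 1;
    if (s->rd_since_wr) s->MR = 1;       /* second consecutive reading iteration */
    s->rd_since_wr = 1;
    s->MaxRD  = umax(s->MaxRD, CTS);     /* maximum rule in the home directory   */
    s->pMaxRD = umax(s->pMaxRD, CTS);    /* and in the local directory           */
}

/* First write of the element in iteration CTS. */
static void note_write(elem_state_t *s, uint32_t CTS) {
    if (s->Wr) return;
    s->Wr = 1;
    s->MR = 0;                           /* an intervening write resets MR       */
    s->rd_since_wr = 0;
    s->MinWR  = (s->MinWR == 0 || CTS < s->MinWR) ? CTS : s->MinWR;  /* minimum rule */
    s->pMaxWR = umax(s->pMaxWR, CTS);
}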

 



3.2.2 Structures Supporting Repair of Violations

Undo Log. The function of the undo log is to provide a consistent state for failure recovery (in case of an out-of-order flow dependence) and to service cross-processor requests for data generated by earlier iterations. Every processor has its own undo log, residing in memory and accessible through its local directory. The data written in some previous iteration is saved in the record of the current (saving) iteration. The hardware implementation of the undo log is discussed in more detail in Section 4.3. We define the following operations on the undo log (inserts are performed by hardware, the remaining operations by run-time library routines):

- Insert data element: Before a processor overwrites its own data, it sends a copy to the local undo log storage together with its timestamp (available in the local directory).

- Search data element: Services a cross-processor read request for a value that has already been overwritten by an iteration later than the requesting one. For example, suppose processor i writes a variable in iterations 5 and 7, and processor j later executes iteration 6, which needs the value produced in iteration 5 (the value that was saved to the undo log in iteration 7). The undo log will provide such data without causing a general dependence violation.

- Restore data: Provides the consistent state reached by the program for any (active) iteration number. Data restoration is performed by copying back all data saved in the undo log into the loop's data structures. In effect it undoes all modifications performed by squashed iterations.

Private Storage is necessary for the privatized data that is displaced from the cache during an iteration. The function of this structure is to act as a 'displacement overflow' area for the working space of one iteration. It is allocated by the hardware on demand, in subpage or multi-page increments, when a cache line is displaced, and is recycled at the end of its iteration. As will be shown in Section 3.3, the private memory area is addressed by means of an address translation table (which translates between shared and private addresses) that can in fact manage the granularity of the private storage.

Active Window and Sliding Commit Point are two data structures used to optimize the memory usage of all the other additional structures described so far. The active window represents the span of currently executing iterations. The sliding commit point represents the lowest currently executing iteration. In a very balanced loop the active window is almost equal to the number of processors. In the case of dynamic scheduling, this window can increase dramatically if some iterations are much faster than others.

Event History is a simple distributed data structure in which we store the assignment of iterations to processors and the detected data dependence violations. It is used to optimize the following operations:

 

- Provide faster access to data that needs to be searched for in the undo logs in the event of a cross-processor read request; it indicates on which processors and in which iterations to perform the relatively expensive search.

- Provide input to a run-time library routine that computes a partitioning of iterations to processors and/or possibly synchronization points, such that data dependence violations can be avoided in future instantiations of the speculative loop.

3.3 Global Dataflow

The datapaths of the proposed architecture are depicted in Figure 4(a-d). The main idea of this scheme is to copy in, on demand, any data to be processed, to keep it non-coherent in the cache as long as possible (it will not be invalidated if other processors have a duplicate copy of the same address) and, when it is displaced or the loop finishes, to merge it into the shared data space. Any data written by an iteration that needs to be displaced is kept in a private overflow area that is allocated on demand and recycled in the next iteration (if needed). If data written in an iteration is not displaced, then no private storage is allocated. Cross-processor requests (reads) are considered a 'read-in' (they are a flow dependence) and are serviced by a read-in procedure that finds the data in another processor's cache, private storage or undo log. These requests are always serviced but may not correctly match the producer and consumer (an out-of-order flow dependence). When this happens some iterations have to be re-executed. All updates of shared and private memory (off-cache) are handled by the directories. We will now consider several scenarios in more detail. The full description and communication algorithm can be found in [26].

Read Request. If a request hits in the cache and the data was written by the same iteration (Wr tag set), it is privatizable and is serviced from the cache. In all other cases the request is considered a read-in.

Read-in. If a request misses in the cache, it is forwarded to the local node. The local node keeps track of data possibly displaced to private storage. If the requesting iteration is equal to the private storage timestamp (pMaxWR), then the data can be serviced without further delay (it is privatizable data); otherwise the request is forwarded to its home node. If the data has not been written by any other processor (MinWR = MaxWD = 0), then it is serviced, the home node's MaxRD is updated, and the Rd cache tag is set. If the request hits in the cache but the data is not privatizable (it was referenced in a previous iteration), then a request is sent to the local node and, if the data has been written by other processors, forwarded to its home node. At home, the directory performs a find operation and services the data. If the record on the local node indicates a non-shared status, then the data in the cache can be safely used.

Optimization: To reduce the time it takes to check the local node, data is speculatively used from the cache while a request to the local node (to check its shared/not-shared status) is being processed. If the data was actually written by another processor in a later iteration than the data found in the cache, then the iteration using the wrong data is squashed, its effect undone, the new, correct data marked as written (so it does not cause a read-in in the subsequent re-execution) and the iteration restarted. An iteration using such a speculative load blocks all service requests for the data it has modified until it becomes non-speculative (so as not to provide possibly incorrect data and cause a chain effect). The iteration does not formally end until the speculative load is resolved.

Read from another processor's private memory. If the read target of a processor has been written by another processor (the home node has MinWR < CTS), then the home node enters a find phase in which it searches through the caches and private memory (overflow areas and undo logs) for the requested data (paths 2 and 3; see Section 3.4). After the data is found, the home node sends it to the requesting processor (paths 1 and 4) and the Rd cache tag is updated.

Write operations are always performed in the private cache and the Wr tag is updated. The first write to an element in an iteration (known since the tag Wr = 0) sends its iteration number to the home node, where the MinWR stamp is updated. The home node also performs the crucial dependence verification: if (CTS < MaxRD), i.e., an out-of-order flow dependence, then a violation is detected. Before data from a previous iteration is overwritten, it is saved in the current iteration's local undo log. A cache miss at a write causes the last available data (the last value) to be backed up in the local undo log (because we need to be able to undo the effect of the current iteration, which could be an update of the last value).

Reduction Operands. Reduction operands are handled with special loads/stores. In the following we assume that the data has not been accessed outside the reduction statement, i.e., it has never been accessed, or its tags and corresponding timestamps in the directories indicate that it is a reduction operand. On a read, if it misses in the cache and was never accessed before, a private line is allocated and initialized to the neutral element, and the Rx bit is set. On a write, we perform the write and store the old value to the undo log. If a reduction operand is accessed outside a reduction statement, we either merge all previous partial results (for reads) or kill all previous partial results (for writes) and undo their contribution to the current state. Execution can then continue. These rather complex operations are executed with run-time library routines as described below. If a read is performed on a reduction operand outside the reduction statement, then the request is forwarded to its home node. There, all private partial results from the undo logs of all earlier iterations are merged into the shared memory and the result is sent to the requestor. If a write is performed on a reduction operand outside the reduction statement, then an invalidation is sent to all processors that have the line. All processors with smaller timestamps kill their partial results (e.g., sums), since they are no longer needed, and the shared data is written with the new value at the home node. From the undo logs of the processors with higher timestamps we collect all partial results obtained prior to CTS (the current iteration executing the write outside the reduction statement) and subtract the new value from the current value stored in the private partial sums.

Displacement. When data is displaced, it is stored by the local directory in private storage, which is allocated on demand. Reduction operands are also stored in private storage, with their value reset to the neutral element (e.g., 0) but their tags left intact (Rx = 1). The displaced data is also sent to the home directory, where it is merged: reduction operands update the shared memory (e.g., are added in), while all other data update the last value based on CTS and MaxWD. Read-only data (MinWR = 0 at the home node) is not displaced to private storage.
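As a rough summary of the read-request handling described in this section, the following C sketch shows the order in which the possible data sources are considered; the enum, struct and helper names are illustrative assumptions, and the speculative-use optimization is omitted.

#include <stdbool.h>
#include <stdint.h>

typedef enum { FROM_CACHE, FROM_PRIVATE, FROM_SHARED, FROM_HOME_FIND } read_source_t;

typedef struct { bool hit; bool Wr; } cache_view_t;        /* cache tag state  */
typedef struct { uint32_t pMaxWR; } local_dir_view_t;      /* local directory  */
typedef struct { uint32_t MinWR, MaxWD; } home_dir_view_t; /* home directory   */

static read_source_t service_read(uint32_t CTS, const cache_view_t *c,
                                  const local_dir_view_t *l,
                                  const home_dir_view_t *h) {
    if (c->hit && c->Wr)                 /* written in this same iteration:     */
        return FROM_CACHE;               /* privatizable, serve from the cache  */
    if (l->pMaxWR == CTS)                /* displaced earlier in this iteration */
        return FROM_PRIVATE;             /* serve from private storage          */
    if (h->MinWR == 0 && h->MaxWD == 0)  /* never written by anyone:            */
        return FROM_SHARED;              /* serve shared copy, bump MaxRD       */
    return FROM_HOME_FIND;               /* written elsewhere: home runs find   */
}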

3.4 Finding Data and Servicing Data

When data is not found on the local processor (in the cache or in private storage), a request is sent to the home node. If the data is read-only, the request is serviced from shared memory. If it has been written by only one processor, the directory requests a write-back by the owning processor and services the data. If it has been written by multiple processors, the hardware tries to provide the correct data from the private storage and caches of the sharing processors; otherwise, a run-time library routine searches the undo logs. From the possibly multiple copies of the requested data we choose the one with the highest write timestamp that is still lower than the CTS of the requesting read.

3.5 Data Dependence Detection

Due to the copy-in/merge-out paradigm, all output and anti dependences are removed. The only dependences that need to be enforced at run-time are cross-iteration flow dependences. A flow dependence, or potential violation, occurs when a read request has to be served from memory previously written by another processor. A violation occurs if a potential violation is about to be served out of sequential order. The violation detection mechanism is triggered when the timestamps are updated for a write access: if (CTS < MaxRD), a failure recovery or repair phase is activated.
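The detection condition itself is simple; the following C sketch, with illustrative names, shows the check the home directory would apply when the first write of an iteration updates MinWR, together with the read-side test for a (merely in-order) cross-processor flow dependence. Iteration numbers are assumed to start at 1, so that 0 can mean "never written".

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t MaxRD, MinWR, MaxWD; } home_ts_t;

/* Performed when the first write of iteration CTS reaches the home node:
   a violation exists only if a higher iteration has already read the data. */
static bool write_detects_violation(home_ts_t *h, uint32_t CTS) {
    bool violation = (CTS < h->MaxRD);
    h->MinWR = (h->MinWR == 0 || CTS < h->MinWR) ? CTS : h->MinWR;
    return violation;                    /* true: trigger the repair interrupt */
}

/* Performed at a read request: an in-order cross-processor flow dependence
   (a lower iteration wrote the data) is simply serviced by the find phase.  */
static bool read_needs_find_phase(const home_ts_t *h, uint32_t CTS) {
    return h->MinWR != 0 && h->MinWR < CTS;
}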

3.6 Failure Recovery and Optimizations

If an in-order flow dependence is detected, the only negative effect is a slower data service time. However, the detection of a violation implies that an iteration has already 'consumed' the wrong data and computed wrong results, which may in turn have been consumed by other, later iterations, i.e., there may have been a 'chain effect'. To repair such an occurrence, the system first lets the earlier iteration finish writing the correct data, after which all higher iterations have to be squashed, their effect on all shared data undone (including their timestamps), and execution restarted just after the writing iteration. A timeline of the failure recovery routine is:

1. Barrier synchronization.
2. A software routine restores data from the undo logs.
3. Barrier synchronization.
4. The scheduler restarts work from the restart iteration.
5. Dependent iterations are recorded in the execution history.

This recovery procedure is always safe but may often be quite conservative. If, as previously mentioned, the out-of-order reading iteration does not affect the values used by the following iterations (in sequential order), then only a subset (usually one iteration) of the speculative higher iterations needs to be squashed and re-executed. Such a situation lends itself to a precise recovery operation. We will now examine the conditions under which such a desirable situation can occur.

Precise Recovery

A precise recovery is possible only if we can accurately establish the effects of the violation. A violation can cause incorrect results in several ways: (a) the read iteration of the out-of-order write/read reference pair has consumed the wrong value (because it had not been produced yet); the effect is an incorrectly executed read iteration. (b) More than one read iteration has occurred before the late write iteration causes the violation; in other words, the possibility exists that more than one iteration has consumed some wrong value before the correct one was produced. Such a situation causes the incorrect execution of all such read iterations, i.e., a chain effect. (c) The incorrectly executed read iteration has produced (written) possibly incorrect values that have been consumed by later iterations. In the most general case we cannot establish with accuracy which iterations may have been affected by the violation and thus have to play it safe and squash all iterations higher than the writing one. However, we can provide simple additional hardware support to limit the effect of violations and improve, sometimes dramatically, the overall performance.

The MR (multiple read) bit of an element is set only by successive read references to it. This means that, in case of a violation, the MR bit shows whether more than one iteration may have consumed a wrong value, causing a chain effect. If MR is set, we cannot determine which iterations have been affected by the violation (because, unlike the write references, we do not trace the read references) and a general squash and restore needs to be initiated. If MR is not set, the violation has affected only the read iteration of the write/read reference pair, and that iteration is thus the only one that has to be re-executed. However, we still need to make sure that no other, later iteration has consumed its values, or that those values were computed correctly. This check is performed in the repair phase. The repair phase requires a restoration of all the values modified by the iteration to be undone. This is accomplished by restoring all the old values from the undo log. In order to verify whether other, later iterations may have consumed values produced by the squashed iteration, we execute the 'restore' from the undo log as part of the re-execution of the squashed iteration (the writes from the undo log are handled like regular writes). In this way any processor that may have consumed values from these addresses will detect a violation (a write later than its own read reference) and trigger its own squash and restore phase. In case a significant chain effect occurs, we are better off generating a global synchronization and restoration operation.

The reason for incorporating the restoration phase into the re-execution phase, rather than letting the 'replayed' iteration do its own writes and possibly cause violations with iterations that may have consumed its values, requires a bit more explanation. A read of an iteration that uses the wrong value may affect both the values written by this iteration and/or the addresses referenced in this iteration. If only values are affected, then re-execution of the iteration will in itself detect all possible earlier consumers, because the iteration will perform the same access pattern. If, however, the original incorrect execution caused the iteration to access incorrect addresses (because the subsequent data computation affected the control flow or the address computation), then its re-execution may not traverse the same references and thus may not uncover previous consumers of incorrect data. Incorporating the restoration phase into the re-execution phase ensures that all potential consumers are notified that their data may have been wrong. A very practical approach is to use a rather simple and conservative compiler analysis to decide whether data computation can affect later address computation (directly or through control flow). If data and address computation can be established to be decoupled, then the type of repair needed can be decided by the MR bit alone: if MR = 0, the effect of the violation is limited to the read iteration alone and can be repaired by re-executing it. If we have to assume a cycle between address and data computation then, even if MR = 0, the restoration phase has to be incorporated into the re-execution (making the writes 'visible'), possibly generating new violations (due to the write-backs from the undo log, which have been made visible, which would have been re-referenced anyway at re-execution, and which increase the chance of an out-of-order reference). (Note: this optimization has not been implemented yet.)
It is important to understand that the time lost when a violation occurs can be significant. Informally, it is:

TimeLost = GlobalSynchTime + SquashedWorkTime + UndoTime

The above precise recovery scheme will drastically reduce all three components of this overhead. Feeding back the execution history can reduce the probability of such a violation altogether.
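The paper describes the recovery routine only in prose; as a rough illustration, here is a minimal C sketch of the general (conservative) recovery flow listed above, in which every helper routine is a hypothetical stand-in for the run-time library functionality described in Section 5.

#include <stdio.h>

#define NUM_PROCS 4

/* Hypothetical run-time library hooks (illustrative stubs). */
static void global_barrier(void)                          { /* synchronize all processors */ }
static void restore_from_undo_log(int proc, int from_it)  { printf("proc %d: undo iterations >= %d\n", proc, from_it); }
static void scheduler_restart_from(int it)                { printf("restart parallel execution at iteration %d\n", it); }
static void record_violation_in_history(int writer_it)    { printf("record violation caused by iteration %d\n", writer_it); }

/* General (conservative) recovery: squash everything above the writing iteration. */
static void recover_from_violation(int writing_iter) {
    global_barrier();                                  /* 1. stop all processors          */
    for (int p = 0; p < NUM_PROCS; p++)
        restore_from_undo_log(p, writing_iter + 1);    /* 2. undo all higher iterations   */
    global_barrier();                                  /* 3. consistent state everywhere  */
    scheduler_restart_from(writing_iter + 1);          /* 4. resume just after the writer */
    record_violation_in_history(writing_iter);         /* 5. feed the execution history   */
}

int main(void) { recover_from_violation(42); return 0; }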

3.7 Active Window and Sliding Commit Point

As noted in Section 3.2, the lowest executing iteration executes in non-speculative mode. Whenever it finishes execution it can commit all of its results, because the execution will never have to be restarted at an earlier point (provided iterations are scheduled in increasing order).


Figure 4: Data paths of the proposed architecture: (a) copy-in, (b) compute in private, (c) displace/merge out.

Figure 5: State necessary per array element to implement the unified privatization algorithm (UPAR): per-element timestamps in the home directory (MaxRead, MinWrite, MaxDisp) and in the local directory (pMaxRead, pMaxWrite), plus per-word Read/Write bits in the cache tags.

The active window, represented by the difference between the lowest and highest active iterations, is the maximum number of iterations that could be squashed in case of a violation and that therefore need checkpointed data. The most important benefit of advancing the commit point is the ability to recycle the undo logs, a potentially large memory consumer. Whenever an iteration commits, its undo log can be reused by the next scheduled iteration. The detailed functioning of the undo log is explained in Section 4.3.

4 Special Hardware

In the following sections we present some of the special hardware needed to perform the functions of the architecture presented in the previous sections.

4.1 Timestamps for Dependence Violation Detection

The implementation of the UPAR algorithm requires hardware support for storing the access bits, logic for their test and modification, and a translation table in the directory to map a given physical address to these access bits. The mechanisms require the modification of the primary and secondary caches and of the directory. In the following, we refer to the state tags of Figure 5 as the access bits. The design for the primary cache is shown in Figure 6(a). The access bit portion of the cache tags is stored in an SRAM table called the Access Bit Array. The access bit array, the tag array and the data array have the same number of entries. The desired entry is selected with the address lines. Once the correct access bit entry is selected, the Test Logic performs the operations discussed in Section 3, as determined by the Control input (whether it is a read or a write reference). The test logic is simple enough to generate the new access bits, control signals, and a signal indicating whether the test failed at the same time as the tag comparison is done. If the new access bits are different from the old ones, they are saved back into the access bit array in the next cycle. In addition, if the corresponding cache line is not exclusive in the primary cache, the new access bits are immediately propagated down the memory hierarchy to the secondary cache and directory. So, except for the access bit update, the complete operation is hidden behind the cache access.


Figure 6: Hardware for speculative run-time parallelization: (a) primary cache, (b) secondary cache, (c) directory.

For the secondary cache, we also need to provide an access bit array (Figure 6(b)). After a primary cache miss, if the secondary cache hits, the secondary cache provides both the data and the access bits to the primary cache. The access bits are sent directly to the test logic in the primary cache. If the test logic generates a set of access bits that are different from the old ones, they are propagated down to the secondary cache and directory. In any case, the generated bits are stored in the access bit array of the primary cache. Figure 6(c) shows the directory hardware. We use a small dedicated memory for the access bits and a lookup table. The data address serves as a key in the lookup table and generates a pointer to the right entry in the access bit array. The access bits generated by the test logic are sent to the processor. The whole transaction (lookup table and bit array access, test logic operation) is overlapped with the memory and directory access. Note: a protocol processor could perform the functions of the test logic and, partially, those of the lookup table.

4.2 Address Translation Table

The translation between the shared and private addresses of an element is performed by an Address Translation Table (ATT) in the directory. The ATT also keeps the base of the index into the time stamp array in the directory. The ATT is organized as a cache in which each entry contains the page number of the shared data and that of its corresponding private page (if allocated), and a pointer to the beginning of its access bits in the access bit table. The shared page number is used as a tag, while the other fields contain the data stored for each entry. Before the speculative execution starts, the ATT is loaded with the page numbers of the shared arrays under test. During the speculative loop the ATT monitors all memory references and, when it detects an access to the data under test, provides its private data address and the time stamp needed by the algorithm. The number of entries in the ATT is fixed and includes the processor's TLB entries. When a page which contains shared data is swapped out, its entry in the ATT can also be stored in the system page table. If a write hits in the ATT and no private page exists for it, then an ATT-raised exception triggers the OS to allocate a page in its local memory and the data is sent to the local page. From this point on, this page of the shared array has been privatized on this processor. The programmable granularity of the private memory allocation depends on the total size of the ATT, the data size and, most importantly, the sparse/dense character of the access pattern. The compiler can easily determine these characteristics. The allocation can be done at subpage level for small and/or sparse access patterns, and at page or multi-page level (shared and private in contiguous physical pages) for large/dense patterns.
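A possible software model of an ATT entry and its lookup is sketched below; the entry layout, the table size and the linear associative search are illustrative assumptions, not the actual hardware organization.

#include <stdint.h>
#include <stddef.h>

#define ATT_ENTRIES 64

typedef struct {
    uint64_t shared_page;      /* tag: page number of the shared data           */
    uint64_t private_page;     /* corresponding private page, 0 = not allocated */
    uint32_t access_bits_base; /* start of this page's entries in the bit array */
    uint8_t  valid;
} att_entry_t;

/* Returns the private address for a shared address, or 0 when no private page
   exists yet (the case that raises the page-allocation exception).            */
static uint64_t att_translate(const att_entry_t att[ATT_ENTRIES],
                              uint64_t shared_addr, uint64_t page_size) {
    uint64_t page   = shared_addr / page_size;
    uint64_t offset = shared_addr % page_size;
    for (size_t i = 0; i < ATT_ENTRIES; i++) {
        if (att[i].valid && att[i].shared_page == page) {
            if (att[i].private_page == 0) return 0;   /* needs OS allocation    */
            return att[i].private_page * page_size + offset;
        }
    }
    return 0;   /* address not under test: not a speculative reference         */
}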


4.3 Undo Log

The undo log records the data overwritten during speculative execution, to assure a safe, consistent state in case a violation is detected and a previous state has to be reconstructed. It also services read-in requests (Section 3.3) when the data has already been overwritten. Since the unit of work is an iteration, the log needs to record, for a given iteration, the initial value of all the array elements before they are updated in that iteration. The minimum information needed for reconstructing the state is the Physical Address of the shared data, the Old Value of the element being written to, and the pMaxWrite of the former write. Because this logging occurs frequently, it needs to be supported in hardware. The modified cache coherence protocol identifies the first update to each array element in each iteration by checking the mark bits in the tag or the time stamp in the local directory. On a write hit, the old data in the cache is stored in the undo log before being overwritten. On a write miss, the memory copies the data to the undo log while it sends the line to the cache. If pMaxWrite is zero, i.e., it is the first write to the element done by this processor, no log entry is inserted.

Figure 7: Organization of the Undo Log: per-iteration log sections holding (Paddr, Old_Data) entries, indexed through a pointer cache whose lines hold Valid, Ovfl, Curr_Iter, START, END and NEXT fields.

4.3.1 Undo Log Organization

The high frequency of logging and the potentially large size of the undo log require a data structure with fast access time for logging and size management. To satisfy these requirements we have designed a log buffer that can be accessed in constant time. Its size is kept proportional to the window size (the number of uncommitted iterations) and it can be recycled on the fly in constant time (see Figure 7). The undo log is a contiguous block of memory which is partitioned into sections that can be indexed by iteration number. All data, and their corresponding addresses, checkpointed by an iteration are stored in an undo log section. If an iteration overflows its current undo log section, another section is assigned to it; this new section contains a back pointer to the previous section. When an iteration commits, it releases its undo log section(s). A 'free list' structure (e.g., a stack) is used to maintain the addresses of the free sections of the undo log. Before the speculation starts, a contiguous block of memory is preallocated for the undo logs (each processor has its own). Its size is estimated by the compiler (a reasonable task) to be proportional to the maximum number of distinct writes that may occur in the active window. An overflow of the preallocated undo log is handled by an exception handler, which allocates more memory (and things slow down for a while).


For every active iteration, a pointer to the next available entry in its undo log section is stored in a fully associative pointer cache in the directories. The pointer cache uses Curr_Iter and a Valid bit as tags, and its lines contain the fields NEXT (the address of the next available entry, incremented when new data is logged) and BOUND (the boundary of the current undo log section, updated when a new section is allocated). By comparing NEXT and BOUND, it can be determined whether another undo log section should be allocated to the iteration. The number of entries in the pointer cache is proportional to the window size, i.e., on the order of p (the number of processors); as iterations commit, their entries in the pointer cache are invalidated. If the pointer cache becomes full, its entries can be displaced to memory or, alternatively, the scheduler can stop issuing new iterations until the commit point advances and an entry is freed.
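The organization of Figure 7 can be modeled roughly as follows; the section size, the struct layout and the malloc-based free-list stand-in are illustrative assumptions of this software sketch.

#include <stdint.h>
#include <stdlib.h>

#define SECTION_ENTRIES 256

typedef struct { uint64_t paddr; uint64_t old_value; uint32_t pMaxWrite; } undo_entry_t;

typedef struct undo_section {
    undo_entry_t         entries[SECTION_ENTRIES];
    struct undo_section *prev;        /* back pointer to the previous section   */
} undo_section_t;

typedef struct {                      /* one pointer-cache line per active iteration */
    uint32_t        curr_iter;        /* tag                                    */
    uint8_t         valid;            /* tag                                    */
    undo_entry_t   *next;             /* next free entry in the current section */
    undo_entry_t   *bound;            /* end of the current section             */
    undo_section_t *section;          /* model convenience: current section     */
} ptr_cache_line_t;

static undo_section_t *alloc_section(void) {   /* stand-in for the free list    */
    return calloc(1, sizeof(undo_section_t));
}

/* Log the old value before the first overwrite of an element in an iteration. */
static void undo_log_insert(ptr_cache_line_t *pc, uint64_t paddr,
                            uint64_t old_value, uint32_t pMaxWrite) {
    if (pMaxWrite == 0) return;               /* first write ever: nothing to save */
    if (pc->next == pc->bound) {              /* section full: chain a new one     */
        undo_section_t *s = alloc_section();
        s->prev     = pc->section;
        pc->section = s;
        pc->next    = s->entries;
        pc->bound   = s->entries + SECTION_ENTRIES;
    }
    *pc->next++ = (undo_entry_t){ paddr, old_value, pMaxWrite };
}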

4.4 Directory with ALU and Cache

To merge a cache line with its shared copy in regular memory, the directory needs a buffer to hold the merge requests. Simple additional functional units, such as an adder, a comparator, and logic operators, are provided for the reduction operations. Each buffer entry holds the displaced memory block, the time stamps for all elements in the block, and its address. For shared data, the time stamps in the displaced line are compared with MaxWD in the shared array, and the elements with the larger time stamp are written to memory. For reduction data, the directory executes a fetch-and-add operation. When regular and reduction elements are interleaved in the same cache line, the per-word Rx bit is used as a mask for the directory operations. These merging operations are given lower priority than regular coherence transactions, so they do not slow down regular memory traffic; however, they must complete before the global synchronization at the end of the parallel loop.
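The C sketch below models the per-element merge decision at the directory; the element type, field names (rx_mask, max_wd), and line width are illustrative assumptions, not the actual hardware interface.

    /* Hypothetical model of merging one displaced cache line at the directory. */
    #define WORDS_PER_LINE 8

    struct displaced_line {
        double   data[WORDS_PER_LINE];      /* element values from the displaced copy */
        unsigned ts[WORDS_PER_LINE];        /* per-element time stamps */
        unsigned rx_mask;                   /* bit i set => element i is a reduction element */
    };

    void merge_line(struct displaced_line *line, double *mem, unsigned *max_wd)
    {
        for (int i = 0; i < WORDS_PER_LINE; i++) {
            if (line->rx_mask & (1u << i)) {
                mem[i] += line->data[i];    /* reduction element: fetch-and-add partial result */
            } else if (line->ts[i] > max_wd[i]) {
                mem[i]    = line->data[i];  /* shared element: newer time stamp wins */
                max_wd[i] = line->ts[i];    /* assumed: MaxWD updated along with the value */
            }
        }
    }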

5 Special Software – Interrupt and Scheduling Routines

The adaptive parallelization and optimization system presented here needs software support at the user level, at the OS (privileged) level, and embedded in the hardware. This software is used to reduce the run-time overhead by exploiting dynamically acquired information, to deal with exceptions caused by the speculative execution, and to simplify the hardware by implementing complex recovery procedures in software. Below we mention only a partial list of the needed support and its functionality.

Application-level software consists of a special but relatively simple scheduler that can enforce synchronizations (advance/await) between iterations and to which we can add extra work on the fly (in case of a violation); a minimal sketch of such advance/await synchronization appears below. Some memory management routines for reserving storage for the various structures are also needed. The application-level routines also have to provide parallel algorithms that can use the run-time collected dependence information and generate a partitioning of the iteration space such that violations become less likely.

OS-level code is needed to provide handlers for page faults, dynamic allocation of memory for the additional data structures discussed previously, and special exception handlers. For example, special routines are needed to handle page faults or segmentation faults caused by speculative memory references. System calls are needed to configure the protocol and the hardware before and after the speculative execution of a loop. An important set of routines handles exceptions raised by the dependence detection hardware. The find routine, which locates the correct data in another processor's memory, is implemented in software because it is complex and occurs (hopefully) seldom. Another such routine performs the state recovery in case of a dependence violation.

Embedded software is used to write back all modified cache lines at the end of a speculative execution and to recycle the undo log (most of the undo log recycling is done in hardware – only overflows are handled in software).
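As a point of reference only, here is a minimal C sketch of the advance/await pattern such a scheduler can enforce between iterations; the names and the busy-wait implementation are our own illustration, and the paper's scheduler is more elaborate (it can also queue repair work on the fly).

    /* Minimal advance/await sketch: iteration 'iter' waits until all earlier
       iterations have advanced past a given synchronization point. */
    #include <stdatomic.h>

    static atomic_int last_advanced = -1;   /* highest iteration past the sync point */

    void await(int iter)                    /* called before the synchronization point */
    {
        while (atomic_load(&last_advanced) < iter - 1)
            ;                               /* busy-wait; a real scheduler would yield */
    }

    void advance(int iter)                  /* called after the synchronization point */
    {
        atomic_store(&last_advanced, iter);
    }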


6 Performance Evaluation

6.1 Simulation Environment

Our evaluation is based on execution-driven simulations of a CC-NUMA shared-memory multiprocessor using MINT [23]. The modeled multiprocessor has the hardware support discussed in the previous sections. The simulated applications have been pre-processed with the Polaris [4] parallelizing compiler, which has been specifically enhanced to transform selected loops for speculative run-time parallelization. The modeled architecture has 200-MHz 4-way dynamically-scheduled superscalar processors, each with 4 integer, 2 load-store, and 2 floating-point functional units, a 32-entry instruction queue, and 32 integer and 32 floating-point registers. Each node has a 32-Kbyte on-chip primary cache and a 512-Kbyte off-chip secondary cache. Both caches are direct-mapped and have 64-byte lines. The cache sizes have been deliberately chosen small to scale with the reduced working sets of the chosen applications; real-life working sets could not be used because they would have required impractically long simulation times. The caches are kept coherent with a DASH-like cache coherence protocol [17]. Each node holds part of the global memory and the corresponding section of the directory. We have modeled contention in the whole system, with the exception of the global network, which is abstracted away as a constant latency. The average round-trip latencies to the on-chip primary cache, the secondary cache, memory in the local node, memory in a remote node 2 hops away, and memory in a remote node 3 hops away are 1, 12, 60, 208, and 291 cycles, respectively. These figures correspond to an unloaded machine and increase with resource contention. Processes synchronize using locks and barriers. The pages of the shared data are allocated round-robin across the different memory modules; we chose this allocation because these loops have irregular access patterns and iterations are scheduled dynamically. Private arrays are allocated locally.

6.2 Workloads

Due to the impractically long simulation times of full-length applications, we have extracted and measured only the performance of representative loops (in terms of relative execution time) from well-known codes. Table 1 lists the set of simulated loops, their weight relative to the total sequential execution time of their respective application (%Tseq, as measured on 1 processor of an SGI PowerChallenge), the number of times they are instantiated during program execution, their average number of iterations, and the total size of the arrays under test. Adm and Track are Perfect Club [3] codes, Euler and Dsmc3d are HPF-2 applications [8], Rmv is a Spark98 kernel [18], and vml and mml are from the Sparse BLAS library [7].

Appl     Loop Name                                               % of Tseq   Instantiation   Iteration   Array Size
                                                                              Number          Number      (bytes)
Adm      run do[20,30,40,50,60,100]                                 20.6        900            32/64       803072
Track    nlfilt do300                                               40.9         56              480        72400
Euler    dflux do[100,200], psmoo do20, eflux do[100,200,300]       89.9        120            59863       703080
Dsmc3d   move3 goto100                                              32.8         80           383107       158400
Rmv      local smvp for                                             29.0          1            92160       737280
vml      VecMult CAB                                                90.6          1             4929        39432
mml      MatMult CAB                                                89.4          1            39432       315456

Table 1: Application Characteristics.

The Adm loop instantiations have 32 or 64 iterations and a small working set. We test 5 arrays, and the result is a fully parallel loop every time. In Track the algorithm is applied to 4 arrays. Interestingly, 5 of the 56 loop instantiations are not fully parallel. After the first two failures, all remaining dependences appear in order at run-time, so later iterations can be executed in parallel without failure recovery. The dflux loop in Euler has been used in two experiments. First, we treated the array under test as shared and obtained a partially parallel loop that caused dependence-related failures just before it finished; after recovery, the remaining iterations were executed sequentially. A readily available manually parallelized version of this loop has an outer sequential loop and an inner parallel loop that uses sparse reduction parallelization (the general pattern is sketched below). On this inner loop we applied the reduction optimization algorithm and obtained very good results. For the loop move3 goto100 in Dsmc3d, privatization removes all dependences [1] but, due to its sparse nature, causes high initialization and final merging overhead. By treating the tested array as a shared array and applying the UPAR algorithm we obtain good overall results. Spark98 is a sparse-matrix dense-vector multiplication C kernel; Rmv needs only reduction optimization. Sparse BLAS is a set of Level 3 Basic Linear Algebra Subprograms for sparse matrices. vml is a dense-vector by sparse-matrix multiplication and mml is a dense-matrix by sparse-matrix multiplication. They are fully parallel and need reduction optimization.
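For readers unfamiliar with the pattern, a sparse (irregular) reduction loop has the general shape below; the array names are illustrative. The subscript array makes the write pattern statically unknown, which is why such loops need either privatization or reduction parallelization.

    /* Generic sparse reduction: the access pattern of 'sum' depends on the
       run-time contents of 'idx', so the compiler cannot prove independence. */
    void sparse_reduction(int n, const int *idx, const double *contrib, double *sum)
    {
        for (int i = 0; i < n; i++)
            sum[idx[i]] += contrib[i];   /* cross-iteration dependence if idx repeats */
    }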

6.3 Evaluation

We have used Adm, Track, Euler, and Dsmc3d to evaluate the proposed algorithm, and Euler, Rmv, vml, and mml to evaluate the hardware for reduction optimization. We simulated these applications on both 8 and 16 processors.

6.3.1 Overall UPAR Algorithm Performance

When the UPAR algorithm was applied to these applications, only Adm was found to be fully parallel, while all the others experienced violations during the speculative parallel execution. We present the parallel execution time normalized to the serial execution time. Figures 8 and 9 show the comparative performance of applying the algorithm to four applications running on 8 processors. Two different sequential executions were simulated, one using only one (local) memory (Figure 8) and one using distributed memory (Figure 9); all parallel executions used distributed memory (one module per processor). Figures 10 and 11 show analogous results for 16 processors. In the graphs, UPAR1 represents the parallel execution without flushing the cache at the end of the loop, while UPAR2 includes the time to flush the cache. The difference between these two bars represents the overhead of flushing the cache. On average, the UPAR algorithm provides speedups of 2.5 on 8 processors and 3.9 on 16 processors when the sequential execution used only a local memory, and 18.0 on 8 processors and 35.6 on 16 processors when the sequential execution used a distributed memory. The dramatic difference in speedups between the two versions of the sequential execution is explained by the fact that when only one memory is used, all accesses in the sequential execution are to the local memory, while only O(1/p) of the accesses are local when a distributed memory is used. Thus, the superlinear speedups in the distributed case are explained by the fact that the parallel executions have a larger proportion of local accesses than the sequential execution, as the rough estimate below illustrates. We believe the distributed case is more realistic, as data will generally be distributed for applications requiring a parallel implementation. On the other hand, we have used unrealistically small data sets (those available in the regular benchmark suites) to allow for more practical simulation times. We believe that when using a data set size commensurate with a 16-processor machine, the speedups for the distributed-memory case will be significantly lower (no favorable cache effects) and more in line with what we can usually expect from parallelized loops (sub-linear). For Adm, which is fully parallel, the overhead of flushing the cache is around 2%. While superlinear speedups are obtained against the distributed-memory sequential execution, it is hard to get a good speedup on 16 processors in the local-memory case because the total number of iterations is small. Synchronization time also increases when executing on 16 processors.
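As a rough back-of-the-envelope check (our assumption, using the unloaded latencies from Section 6.1 with p = 16 and taking 250 cycles as a representative remote-node latency between the 2-hop and 3-hop values of 208 and 291 cycles), the sequential run on distributed memory pays remote latency for about (p-1)/p of its memory accesses:

\[
\bar{t}_{\mathrm{seq,1mem}} \approx 60 \ \text{cycles}, \qquad
\bar{t}_{\mathrm{seq,16mem}} \approx \tfrac{1}{16}\cdot 60 + \tfrac{15}{16}\cdot 250 \approx 238 \ \text{cycles}.
\]

The roughly 4x higher average access latency of the distributed-memory sequential baseline, combined with the larger fraction of local accesses in the parallel runs, explains why speedups measured against it can be superlinear.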

[Figure 8: Speedup on 8 processors (1mem). Figure 9: Speedup on 8 processors (8mem). Figure 10: Speedup on 16 processors (1mem). Figure 11: Speedup on 16 processors (16mem). Each figure breaks down the normalized execution time of Adm, Track, Euler, and Dsmc3d for the Serial, UPAR1, and UPAR2 executions into Useful, Sync, Control, Data, Memory, Struct, and Other components.]

The loop in Track has 5 dependences in its 56 instantiations. Two of them cause violations at run-time, while the rest appear fully parallel because the flow dependences occur in order. Since we provide an efficient failure recovery scheme, we get good speedups for the overall program, even in the local-memory case. The overhead of flushing the cache is 2%. Besides useful and memory time, instruction-level data dependence is a major part of the execution time. In Euler the loop body is relatively simple, so memory time dominates the total time, as can be seen from the breakdown of the sequential execution. To increase the workload in the loop body, 8 iterations are dynamically scheduled as a super-iteration. A failure occurs very late in the execution of the loop. Dsmc3d has dependences in half of its instantiations. To keep the time stamps within their 2-byte maximum length, the loop is scheduled dynamically in chunks of 16. This method reduces synchronization time and improves data locality. Because its workload (total iteration count) is very large, the overhead of flushing the cache is negligible.

6.3.2 Reduction Optimization

In this section we evaluate the performance of the reduction optimization hardware for Spark98, Euler, and Spblas. The reductions present are regular and the loops are fully parallel. Figures 12 and 13 show the performance gains of the algorithm over the standard software approach (private replicated arrays for partial sums) on 8 and 16 processors, respectively. In the graphs, SW and HW are, respectively, the normalized execution times of the software- and hardware-optimized merging phases. The SW and HW bars are broken down into Loop (initialization and actual loop workload) and Merge phases. Note that for the software scheme, Merge represents the time to merge the private arrays (a sketch of this merge step appears below), while for the hardware scheme it represents the time needed to flush the primary and secondary caches. The size of the SW bars points to the relative importance of optimizing the merge of the private arrays, i.e., the need for optimizing the most commonly used method of parallelizing reductions. The size of the HW bars proves our concept: for Euler and Spark, the overhead of flushing caches is less than 1%, an excellent result. For the Spblas routines, the overhead of flushing the cache is noticeable because their working set is comparable to the cache size. But the difference between SW and HW is still very large, which means this optimization reduces merging overhead significantly. The compiler can evaluate the size of the private structures needed for the partial results and choose the appropriate technique: with or without reduction optimization.
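For reference, the software baseline's merge phase accumulates the per-processor partial-sum arrays into the shared result array; a minimal sketch (with illustrative names) of that step is:

    /* Software reduction baseline: after the parallel loop, the p private
       partial-sum arrays are accumulated into the shared result array.
       This is the 'Merge' component of the SW bars. */
    void merge_private_sums(int p, int n, double **priv_sum, double *shared_sum)
    {
        for (int j = 0; j < n; j++)          /* this loop can itself be parallelized over j */
            for (int proc = 0; proc < p; proc++)
                shared_sum[j] += priv_sum[proc][j];
    }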

[Figure 12: Reduction Execution Time Breakdown (8 proc/mem). Figure 13: Reduction Execution Time Breakdown (16 proc/mem). Each figure shows, for Euler, Spark(sf5), Spark(sf10), vml, and mml, the normalized execution time of the Serial, SW_RED, and HW_UPAR versions, broken down into Loop and Merge phases.]

6.4 More Justification of Architecture Complexity

In our design we provide the possibility of out-of-order execution and overflow private storage for each active iteration. These features add to the overall complexity of the design but, as shown below, contribute significantly to the system's performance.

6.4.1 The benefit of out-of-order finish

In this experiment, we compare our proposed out-of-order finish of iterations to in-order finish. These applications commonly require dynamic scheduling due to imbalanced workloads. Figures 14 and 15 show the beneficial effect of dynamic scheduling on performance. The bars labeled UPAR represent out-of-order, dynamic scheduling, while INF represents in-order finish. The figures show that, after enforcing in-order finish, the execution time of all applications increased due to high synchronization time. Its magnitude depends on the degree of loop imbalance. For Track and Dsmc3d the performance degradation is significant due to a large amount of busy waiting.

[Figure 14: Scheduling comparison (8 proc/mem). Figure 15: Scheduling comparison (16 proc/mem). Each figure shows the normalized execution time of Adm, Track, Euler, and Dsmc3d under UPAR (out-of-order finish) and INF (in-order finish), broken down into Useful, Sync, Control, Data, Memory, Struct, and Other components.]

6.4.2 The benefit of allowing displacement

In this experiment, we compare our method of allowing displacement of data under test to a scheme that stalls the processor when it would have to displace data under test from its level-one cache, and resumes execution only when that iteration becomes non-speculative. This also implicitly enforces in-order finish when all iterations displace data out of the cache. The difference between the stalling experiments and our scheme shows the importance of allowing displacement to memory. The results with stalling on displacement are labeled DPL in Figures 16 and 17. The delay time is counted as memory time. Performance degrades more than in the previous case (in-order finish) because of the serialization of iterations executed by displacing processors and a cumulative in-order finish effect.

[Figure 16: Displacement comparison (8 proc/mem). Figure 17: Displacement comparison (16 proc/mem). Each figure shows the normalized execution time of Adm, Track, and Euler under UPAR and DPL (stall on displacement), broken down into Useful, Sync, Control, Data, Memory, Struct, and Other components.]

7 Related Work

Closely related to our work are four schemes that support speculative parallelization inside a multiprocessor chip [10, 13, 11, 22]. These schemes are relatively similar to each other. The cache coherence protocol inside the chip is extended with versions or time stamps similar to ours. Parallelism is exploited by running one task (for example, one loop iteration) on each of the processors on the chip. One of the tasks is marked non-speculative, while the others are speculative, with a defined order. Tasks are scheduled for execution and committed in order. The data written by a speculative task is kept in the private cache or write buffer until the task becomes non-speculative. At that point, the updates can be merged with memory. Before that, the lines with speculative state must not be displaced from the cache or buffer; if a reference by a task would force a displacement, the processor stalls. Recovery from a wrong speculation is relatively simple: the cache lines with speculative data are invalidated, and the successor tasks are squashed and restarted. Overall, the above schemes target the small-scale parallelism available on a chip. The UPAR algorithm proposed in this paper targets the larger-scale, coarser-grain parallelism exploited in DSM multiprocessors by loops that exhibit significant amounts of parallelism. Indeed, many of the differences between our method and the previously proposed techniques are motivated by the need to exploit large-scale rather than small-scale parallelism. In particular, we note our support for out-of-order commit/completion of iterations, which enables a much better load balance for irregular, imbalanced applications. Our support for displacement without stalling is also key: it allows asynchronous merging of private data into shared data during execution, which greatly reduces the cost of the final merge (as opposed to software schemes), and it provides private displacement storage for the iteration working space, which allows execution regardless of data size. Our cache-line reduction optimization scheme is similar to the one proposed in [15].

8 Conclusions

Speculative parallel execution of statically non-analyzable codes on Distributed Shared-Memory (DSM) multiprocessors is challenging because of the long latencies and memory distribution present. However, such an approach may well be the best way of speeding up codes whose dependences cannot be compiler analyzed. In this paper, we have extended past work [27, 25, 28, 26] by proposing a hardware scheme for the speculative parallel execution of loops that have a modest number of cross-iteration dependences. When a dependence violation is detected, we locally repair the state. Then, depending on the situation, we either re-execute one out-of-order iteration or restart parallel execution from that point on. The general algorithm, called the Unified Privatization and Reduction algorithm (UPAR), privatizes on demand at cache-line level, executes reductions in parallel, and merges the last values and partial results of reductions on the fly, with minimal residual work at loop end. UPAR allows for completely dynamic scheduling and does not slow down if the working set of an iteration is larger than the cache size. Simulations indicate good speedups relative to sequential execution. The hardware support for reduction optimization brings, on average, a 50% performance improvement and can be used in both speculative and normal execution. Further optimizations mentioned in this paper may improve performance significantly once implemented.


References

[1] R. Asenjo, E. Gutierrez, Y. Lin, D. Padua, B. Pottenger, and E. Zapata. On the Automatic Parallelization of Sparse and Irregular Fortran Codes. Technical Report 1512, Center for Supercomputing Research and Development, February 1997.
[2] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA, 1988.
[3] M. Berry et al. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers. International Journal of Supercomputer Applications, 3(3):5–40, Fall 1989.
[4] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Advanced Program Restructuring for High-Performance Computers with Polaris. IEEE Computer, 29(12):78–82, December 1996.
[5] W. Blume and R. Eigenmann. Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs. IEEE Transactions on Parallel and Distributed Systems, 3(6):643–656, November 1992.
[6] W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland. Massively Parallel Methods for Engineering and Science Problems. Communications of the ACM, 37(4):31–41, April 1994.
[7] I. Duff, M. Marrone, G. Radiacti, and C. Vittoli. A Set of Level 3 Basic Linear Algebra Subprograms for Sparse Matrices. Technical Report RAL-TR-95-049, Rutherford Appleton Laboratory, 1995.
[8] I. Duff, R. Schreiber, and P. Havlak. HPF-2 Scope of Activities and Motivating Applications. Technical Report CRPC-TR94492, Rice University, November 1994.
[9] R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua. Experience in the Automatic Parallelization of Four Perfect-Benchmark Programs. Lecture Notes in Computer Science 589, Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, pages 65–83, August 1991.
[10] S. Gopal, T. N. Vijaykumar, J. E. Smith, and G. S. Sohi. Speculative Versioning Cache. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture, February 1998.
[11] L. Hammond, M. Willey, and K. Olukotun. Data Speculation Support for a Chip Multiprocessor. In ASPLOS-8, 1998.
[12] K. Kennedy. Compiler Technology for Machine-Independent Programming. International Journal of Parallel Programming, 22(1):79–98, February 1994.
[13] V. Krishnan and J. Torrellas. Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor. In Proceedings of the 1998 International Conference on Supercomputing, July 1998.
[14] C. Kruskal. Efficient Parallel Algorithms for Graph Problems. In Proceedings of the 1986 International Conference on Parallel Processing, pages 869–876, August 1986.
[15] J. R. Larus, B. Richards, and G. Viswanathan. LCM: Memory System Support for Parallel Language Implementation. In ASPLOS-6, pages 208–218, 1994.
[16] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.

[17] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148–159, May 1990.
[18] D. O'Hallaron, J. Shewchuk, and T. Gross. Architectural Implications of a Family of Irregular Applications. Technical Report CMU-CS-97-189, School of Computer Science, Carnegie Mellon University, November 1997.
[19] D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Communications of the ACM, 29:1184–1201, December 1986.
[20] L. Rauchwerger, N. Amato, and D. Padua. A Scalable Method for Run-Time Loop Parallelization. International Journal of Parallel Programming, 26(6):537–576, July 1995.
[21] L. Rauchwerger and D. A. Padua. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. In Proceedings of the SIGPLAN 1995 Conference on Programming Language Design and Implementation, La Jolla, CA, pages 218–232, June 1995.
[22] J. G. Steffan and T. C. Mowry. The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture, February 1998.
[23] J. Veenstra and R. Fowler. MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '94), pages 201–207, January 1994.
[24] M. Wolfe. Optimizing Compilers for Supercomputers. The MIT Press, Boston, MA, 1989.
[25] Y. Zhang, L. Rauchwerger, and J. Torrellas. Speculative Parallel Execution of Loops with Cross-Iteration Dependences in DSM Multiprocessors. Technical Report 1536, University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, July 1997.
[26] Y. Zhang, L. Rauchwerger, and J. Torrellas. A Unified Approach to Speculative Parallelization of Loops in DSM Multiprocessors. Technical Report 1550, University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, October 1998.
[27] Y. Zhang, L. Rauchwerger, and J. Torrellas. Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA-4), February 1998.
[28] Y. Zhang, L. Rauchwerger, and J. Torrellas. Speculative Parallel Execution of Loops with Cross-Iteration Dependences in DSM Multiprocessors. To appear in Proceedings of the 5th International Symposium on High-Performance Computer Architecture, January 1999.
[29] H. Zima. Supercompilers for Parallel and Vector Computers. ACM Press, New York, NY, 1991.

