Latency-Tolerant Software Pipelining in a Production Compiler

Sebastian Winkel, Rakesh Krishnaiyer, Robyn Sampson
Intel® Compiler Lab, Intel Corporation, Santa Clara, CA / Nashua, NH
{sebastian.winkel,rakesh.krishnaiyer,robyn.sampson}@intel.com

ABSTRACT
In this paper we investigate the benefit of scheduling non-critical loads for a higher latency during software pipelining. "Non-critical" denotes those loads that have sufficient slack in the cyclic data dependence graph so that increasing the scheduling distance to their first use can only increase the number of stages of the software pipeline, but should not increase the lengths of the individual stages, the initiation interval (II). The associated cost is in many cases negligible, but the memory stall reduction due to improved latency coverage and load clustering in the schedule can be considerable. We first analyze benefit and cost in theory and then present how we have implemented latency-tolerant pipelining experimentally in the Intel® Itanium® product compiler. A key component of the technique is the preselection of likely long-latency loads that is integrated into prefetching heuristics in the high-level optimizer. Only when applied selectively based on these prefetcher hints does the optimization give its full benefit even without trip-count information from dynamic profiles. Experimental results show gains of up to 14%, with an average of 2.2%, in a wide range of SPEC® CPU2000 and CPU2006 benchmarks. These gains were realized on top of best-performing compiler options typically used for SPEC submissions.
Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—code generation, compilers, optimization; C.1.1 [Processor Architecture]: Single Data Stream Architectures—RISC/CISC, VLIW architectures

General Terms
Algorithms, Performance, Experimentation
1. INTRODUCTION
Software pipelining [14, 20, 15] is a powerful optimization technique to improve the performance of loops by exploiting parallelism between loop iterations. It transforms the original loop into
a pipeline consisting of multiple stages. These stages execute independent work from different iterations of the original loop in parallel, allowing for a much higher instruction-level parallelism (ILP) and throughput. Pipelining is especially crucial on in-order processor architectures, which cannot rely on an instruction window in hardware to extract parallelism between loop iterations. However, in-order architectures also tend to suffer more from long-latency operations like cache-missing loads, because the processor pipeline has to stall if needed data is not yet available from the memory subsystem. The high static instructions-per-clock rate (IPC) of pipelined loops – often only limited by the number of parallel execution units in the target processor – can be substantially reduced by memory stalls at runtime. Software prefetching has been very successful in reducing such stalls, but it does not work effectively for all memory references (see Sec. 3.2).

This work shows that it is possible to significantly mitigate stalls due to non-critical loads in pipelined loops by scheduling them for a higher latency during pipelining. "Non-critical" denotes those loads that have sufficient slack in the cyclic data dependence graph so that increasing the scheduling distance to their first use can only increase the number of stages of the software pipeline, but should not increase the lengths of the individual stages, the initiation interval (II). More precisely, such non-critical loads have the property that they – when assumed to have a higher latency – do not increase the length of a recurrence cycle¹ beyond the estimated II of the loop. As we will discuss in Sec. 2, the cost is in many cases negligible, but there can be a significant benefit from reducing and overlapping memory stalls.

Earlier work on cache sensitive modulo scheduling [22] has already demonstrated the effectiveness of long-latency scheduling of non-critical loads as an alternative to software prefetching. We go beyond this and apply it not alternatively, but additionally in those instances where the compiler heuristics determine that prefetching is not fully effective. This close coupling with the prefetching heuristics is one of the main contributions of our paper. Also, in addition to [22], we discuss and observe load clustering as a beneficial side effect. Finally, we study cost-benefit tradeoffs of the method and demonstrate in extensive experiments that our heuristics can contain the regression risk well enough for a production compiler (even without profiling). A detailed comparison between [22] and our work can be found in Sec. 5.

¹ A recurrence cycle is a dependence cycle involving a loop-carried dependence in the original loop.
L1:
     ld4      r4 = [r5],4 ;;    // Cycle 0
     add      r7 = r4,r9 ;;     // Cycle 1
     st4      [r6] = r7,4       // Cycle 2
     br.cloop L1 ;;             // Cycle 2

Figure 1: Source loop with compiler-estimated execution cycles.
            From Source Iteration →
Cycle ↓     1     2     3     4     5
   0       ld4
   1       add   ld4
   2       st4   add   ld4
   3             st4   add   ld4
   4                   st4   add   ld4
   5                         st4   add
   6                               st4

Figure 2: Conceptual view of a software pipeline executing instructions from five source loop iterations.

The rest of the paper is organized as follows: The following subsections give a brief introduction to software pipelining and the support for it on the Itanium® architecture. Sec. 2 discusses the benefit and cost of increasing latency tolerance on a higher abstraction level. Subsequently, in Sec. 3, we describe how we have balanced cost and benefit in our experimental implementation in the Itanium compiler in practice. The experimental results are presented in Sec. 4. After discussing related work in Sec. 5 we conclude the paper with a summary and outlook in Sec. 6.
1.1 Software Pipelining
We outline software pipelining and demonstrate our latency-tolerant scheduling technique using a very simple running example. Fig. 1 shows this loop in Itanium assembly code. In the following, we refer to the original loop before software pipelining as the source loop, and we call the iterations of this loop source iterations. Each iteration loads a word from the address in register r5 to r4, adds r9 to it and stores the result at the address in r6. The address registers r5 and r6 are post-incremented by four during each iteration. The branch “br.cloop” decrements a special loop count register and exits the loop when it becomes zero. In the assembly code, the double semi-colons represent stops that are set by the compiler to delimit groups of instructions that can be executed in parallel. Here the first two stops are required because there are register flow dependences between the first three instructions. Under the ideal assumption that the load is always an L1 cache hit with a single-cycle latency, each source loop iteration takes three cycles. The post-increment represents the only loop-carried dependence in the loop (under the assumption that the source and destination memory locations do not overlap). Without the register flow dependences, an Itanium processor would have sufficient execution resources to potentially complete one loop iteration in a single cycle. Fig. 2 depicts an execution scheme that permits this by letting the processor execute three instructions from three different (successive) source loop iterations in each cycle. Because there are no flow dependences between them, they can potentially be issued in parallel. This is the basic idea of software pipelining: The source loop code is partitioned into stages, which all work in parallel in the transformed, pipelined loop kernel (processing data from successive iterations of the source loop in a pipelined fashion). In this example, each of the three stages contains just one of the three instructions.
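As a concrete illustration of this stage overlap, the following small Python sketch (ours, not part of the original paper; the one-cycle-per-stage assumption mirrors the running example) emulates the conceptual pipeline of Fig. 2:

    # Illustrative sketch: emulate the conceptual pipeline of Fig. 2 for a loop
    # whose body has been partitioned into one-cycle stages.
    def pipeline_schedule(stages, iterations, ii=1):
        """Return a dict mapping cycle -> list of (instruction, source_iteration)."""
        schedule = {}
        for it in range(iterations):                 # source iteration (0-based)
            start = it * ii                          # each iteration starts II cycles later
            for stage, instr in enumerate(stages):
                cycle = start + stage                # stage s of iteration it runs at start+s
                schedule.setdefault(cycle, []).append((instr, it + 1))
        return schedule

    if __name__ == "__main__":
        for cycle, ops in sorted(pipeline_schedule(["ld4", "add", "st4"], 5).items()):
            print(cycle, ops)
        # Cycle 2 prints [('st4', 1), ('add', 2), ('ld4', 3)], matching Fig. 2.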
L1:
     (p16) ld4 r32 = [r5],4    // Cycle 0
     (p17) add r34 = r33,r9    // Cycle 0
     (p18) st4 [r6] = r35,4    // Cycle 0
           br.ctop L1 ;;       // Cycle 0
Figure 3: Pipelined version of the example loop.

The length of each stage (in cycles) is equal to the schedule length of the resulting pipelined loop, which executes all these stages in parallel. We refer to the latter loop as the kernel loop, call its iterations kernel iterations, and its schedule length the initiation interval (II). In the pipelining scheme in Fig. 2, the II is one cycle (Fig. 3 shows a kernel loop that implements this pipeline). It is important to note that, as with any pipeline, there are phases of filling and draining where not all stages are active. There is a prolog phase where the pipeline is filled: new source iterations are started, but none are yet ready to be completed (cycles 0-1 in Fig. 2). Then the pipeline is fully working in steady state (cycles 2-4) until it is drained in the epilog phase: no new source iterations are commenced, just older iterations are completed (cycles 5-6). Compared to the source loop, the kernel loop needs an additional number of iterations to fill and drain the pipeline, and this number is exactly one less than the number of stages in the pipeline. This cost is incurred only once per loop execution and can therefore be well amortized if the number of iterations is high enough.

Under the assumption that the loop trip count is high, it is naturally beneficial to partition the software pipeline into as many small stages as possible, in other words, to generate a kernel with a small II. However, there are two fundamental lower bounds on the II of any software-pipelined loop:

• A Resource II, estimating the minimum number of cycles needed to execute all the instructions in the loop body on the execution units of the target processor.

• A Recurrence II, equal to the largest length of a recurrence cycle in the cyclic dependence graph. In the simplest case where a recurrence cycle contains exactly one loop-carried dependence with distance one, the length of the cycle is equal to the sum of the latencies of all contained dependence edges. Such a cycle imposes a lower bound on the stage length and therefore on the II.

As we will see later in Sec. 3.3, these lower bounds are used in modulo scheduling to narrow down the search for a feasible kernel schedule.
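A minimal sketch of these two bounds (illustrative only; the issue width and the edge latencies used below are our own assumptions, not the production compiler's machine model):

    import math

    # Illustrative sketch of the two lower bounds on the II.
    def resource_ii(num_instructions, issue_width):
        return math.ceil(num_instructions / issue_width)

    def recurrence_ii(recurrence_cycles):
        """recurrence_cycles: one list of edge latencies per recurrence cycle
        with loop-carried distance one; the bound is the largest total latency."""
        return max((sum(lats) for lats in recurrence_cycles), default=1)

    # Running example: 4 instructions, an assumed issue width of 6, and one
    # recurrence cycle (the post-increment of r5) with a one-cycle latency.
    res_ii = resource_ii(4, 6)       # -> 1
    rec_ii = recurrence_ii([[1]])    # -> 1
    min_ii = max(res_ii, rec_ii)     # modulo scheduling starts searching here
    print(res_ii, rec_ii, min_ii)    # 1 1 1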
Hardware Support for Software Pipelining in the Itanium Architecture. Before we discuss latency-tolerant software pipelining, we briefly explain how pipelining is supported by the Itanium processor architecture (more comprehensive introductions can be found in [12]). Register rotation is a lightweight hardware mechanism to transfer data between the individual stages of the pipelined loop via register renaming [21]. A fixed-sized area of the architectural predicate and floating-point registers (p16-p63 and f32-f127) as well as a programmable-sized area of the general register file (starting at r32) are defined to rotate. In pipelined loops, such as in Fig. 3, special branch instructions are used that implicitly perform register rotation on each back edge branch. The effect is that the value of register number X appears after rotation in register X + 1. In the example of Fig. 3, the value loaded to r32 appears in the next kernel iteration in r33 and is there read by the add. The
result of the add, r34, appears one iteration later in r35 and is there stored to memory. Without register rotation, explicit register moves or unrolling would be needed to make sure that the same instructions do not overwrite data from previous source iterations that is still needed. The predicate registers are also rotated in order to control the filling and draining of the pipeline. In the example it can be seen that three different stage predicates p16-p18 are assigned to the three stages of the pipeline. The values one and zero are rotated through these stage predicate registers to turn on and turn off stages in the prolog and the epilog phase, respectively. The technical details of programming the register rotation mechanism are beyond the scope of this paper and are explained in [18, 12]; the important observation in the context of this paper is that each register lifetime that spans x stages (= x kernel iterations) also occupies a range of at least x − 1 rotating registers.
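A toy model of this renaming effect (a simplification of our own; the real mechanism uses a rotating register base that the pipelined loop branches manipulate, see [18, 12]):

    # Toy model of register rotation: after each kernel iteration, the value
    # written to architectural register X becomes visible as register X+1.
    class RotatingFile:
        def __init__(self, base=32, size=8):
            self.base, self.size = base, size
            self.values = {}                       # physical slot -> value
            self.rrb = 0                           # rotating register base

        def _phys(self, reg):                      # map architectural reg to a slot
            return (reg - self.base + self.rrb) % self.size

        def write(self, reg, value):
            self.values[self._phys(reg)] = value

        def read(self, reg):
            return self.values.get(self._phys(reg))

        def rotate(self):                          # performed on each back edge branch
            self.rrb = (self.rrb - 1) % self.size

    rf = RotatingFile()
    rf.write(32, "loaded in iteration i")
    rf.rotate()
    print(rf.read(33))   # r32's value from the previous iteration now appears in r33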
2. THEORY OF LATENCY-TOLERANT SOFTWARE PIPELINING
Software pipelining in the literature usually assumes fixed latencies of instructions. While this assumption is valid for most classes of instructions, loads are a critical exception with a wide range of possible latency values at runtime. On the Dual-Core Itanium® 2 processor, the best-case delays until integer loads return data are 1, 5, 14, and more than a hundred cycles, depending on whether the data is found in the L1D, L2D, or L3 cache, or in main memory, respectively [11]. To allow better hiding of such long latencies, most in-order microprocessors (e.g., Itanium) implement a stall-on-use policy for loads: The stall of the execution pipeline does not occur on the cache miss itself, but only when the loaded data is needed by an instruction and not yet available. Itanium processors keep track of the physical target registers of outstanding loads and stall once an instruction tries to access such a register. If a load and its first use are scheduled d cycles apart and the load latency is L cycles, then at least d cycles of the latency can be covered by the schedule and the remaining stall is L − d cycles (or less if further stalls occurred in between). Of course this calculation implies that useful work can be done in the d cycles between the load and the use, which is the case only under certain circumstances.

Apart from a better latency coverage [9], there is a further strong reason for increasing load-use distances in the schedule. On the Itanium 2 processor, the memory subsystem is decoupled from the execution pipeline and can reorder requests based on a relaxed memory ordering model [12]. At least 48 outstanding requests can be active throughout the memory hierarchy without stalling the execution pipeline [16]. Therefore it is highly beneficial for the code generator to increase memory-level parallelism by clustering loads in the schedule, which means issuing several load requests in parallel before the first use of a load. If the first use triggers an execution pipeline stall, then all outstanding load requests can continue processing in the shadow of this stall (of possibly hundreds of cycles). As we will see, increasing load-use distances in the schedule fosters clustering as a side effect and, in doing so, overlaps more stalls at runtime. In the next two subsections we examine the benefit and cost of the optimization more closely.
2.1 Benefit
We investigate the benefit in a simplified scenario consisting of a loop with a single (non-critical) load, such as our running example loop. We assume that the load latency at runtime is constantly L + 1 cycles.
            From Source Iteration →
Cycle ↓     1     2     3     4     5
   0       ld4
   1             ld4
   2                   ld4
   3       add               ld4
   4       st4   add               ld4
   5             st4   add
   6                   st4   add

Figure 4: Pipeline organized for a three-cycle load latency.
Then, if a use is scheduled in the next cycle after the load (the minimum distance needed for a legal schedule), there is a stall of L cycles. In other words, L is the part of the latency that can be exposed as a stall. We denote by d the additional distance between the load and its first use in the schedule, i.e., the amount that exceeds the minimum latency. This is also called the additional scheduled latency in the following. Then d is the part of L that is covered by the schedule and we can define a coverage ratio c as follows:

    c = d / L    (1)
If the trip count of the kernel loop is n, the total number of stall cycles is n(1 − c)L.

Fig. 4 depicts a software pipeline for the example loop where the load is scheduled for a three-cycle load latency (d = 2). In comparison to the previous pipeline in Fig. 2, each stage is still planned to take one cycle, but the number of stages has grown from three to five as now two empty "latency buffer stages" have been added after the load. Code implementing this schematic pipeline is shown below in Fig. 6. We examine now how this pipelined loop is executed under the assumption that the load latency is 14 cycles at runtime (L3 cache latency): In the first three cycles, three instances of the load (from the first three source loop iterations) are issued before any of the uses. Three successive instances of this load are thus clustered, and we say that the load has a clustering factor of k = 3 in the pipeline. In cycle 3, the use of the first load instance is executed and takes a stall of L − d = 13 − 2 = 11 cycles. This stall overlaps with most of the load latency from the second and the third load instance, both of which are actively being executed in the memory subsystem. As a result, none of the next two instances of the add (from source iterations two and three) incurs a further stall. However, the add instance from source iteration four again incurs an 11-cycle stall and the subsequent two instances again none.

The combination of latency coverage and clustering results in a stall of L − d cycles every k kernel loop iterations. This is obviously preferable over a stall of L cycles in each iteration, as it would occur without latency tolerance in the schedule. We can form the ratio of the total accumulated stall cycles with and without latency-tolerant scheduling and simplify it using the coverage ratio (1):

    (n/k)(L − d) / (nL)  =  (n/k)(1 − c)L / (nL)  =  (1 − c) / k

Based on this result, the total reduction of stalls during the loop execution can be expressed – independently of absolute latency values – by the following percentage:

    100 · (1 − (1 − c)/k)    (2)
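Equ. (2) can be evaluated directly; the following short sketch (illustrative only) plugs in the running example and a delinquent-load case:

    # Direct evaluation of Equ. (2): percentage of stall cycles removed for a
    # given coverage ratio c and clustering factor k.
    def stall_reduction_percent(c, k):
        return 100.0 * (1.0 - (1.0 - c) / k)

    # The running example: L = 13, d = 2, so c = 2/13 and k = 3.
    print(round(stall_reduction_percent(2 / 13, 3), 1))   # ~71.8% of stalls removed
    # A delinquent load with c = 0.01 but a clustering factor of 3 still saves ~67%.
    print(round(stall_reduction_percent(0.01, 3), 1))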
L1:
     (p16) ld4 r32 = [r5],4    // Cycle 0
     (p19) add r36 = r35,r9    // Cycle 0
     (p20) st4 [r6] = r37,4    // Cycle 0
           br.ctop L1 ;;       // Cycle 0

Figure 6: Code corresponding to Fig. 4.

[Figure 5: line chart of stall reduction (y-axis, 0-100%) versus clustering factor k (x-axis, 0-8), with one curve per coverage ratio c = 1, 0.5, 0.1, and 0.01.]
Figure 5: Stall reduction (in percent) in relation to the clustering factor (Equ. (2)).

It is interesting to study how c and k influence this percentage, which correlates to the achievable performance gain. The diagram in Fig. 5 shows the stall reduction (y-axis) for different clustering factors and latency coverages of 100%, 50%, 10%, and 1% (the four curves). It can be seen that clustering can compensate well even for very low coverage ratios, which are realistic for delinquent loads with main memory latency. In practice it is not advisable to schedule loads for more than 20-30 cycles of latency because the cost of doing so grows linearly with the latency amount (see next section). Therefore loads with latencies of a hundred and more cycles will realistically experience coverages of only 1-10%. While for such loads the stall reduction due to latency coverage alone is only within the range of 1-10%, the clustering effect gives an additional boost: As the diagram shows, a clustering factor of 3 results in an overall stall reduction of two-thirds. For loads that hit in the lower-level caches with latencies in the range of 5 and 20 cycles, clustering plays a much less important role because most of such latencies can already be covered in the schedule.

There is a simple relationship between the clustering factor k, the II, and the additional latency used for a load during pipelining, d. To derive this relationship, we observe that during each kernel loop iteration – after every II cycles – a new instance of the load (from the next source iteration) is issued. If we want to postpone the first use until k of such instances have been issued (clustered), we have to schedule it with an additional latency of at least

    d = (k − 1) · II    (3)

cycles in the software pipeline. Doing so will guarantee clustering of k successive instances.
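The relationship of Equ. (3) can be turned around to see how much clustering a given latency delta buys; a minimal sketch (the example numbers below are ours):

    # Illustrative helpers for Equ. (3): additional scheduled latency d needed
    # for a target clustering factor k at a given II, and vice versa.
    def latency_for_clustering(k, ii):
        return (k - 1) * ii            # Equ. (3)

    def clustering_for_latency(d, ii):
        return d // ii + 1             # largest k with (k - 1) * II <= d

    print(latency_for_clustering(3, 1))   # d = 2, as in Fig. 4
    print(clustering_for_latency(2, 1))   # k = 3
    print(clustering_for_latency(20, 7))  # a 20-cycle delta at II = 7 clusters 3 loads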
2.2 Cost
Increasing the latencies of non-critical loads primarily adds a one-time cost during each execution of the loop. While the II is expected to remain the same, the number of pipeline stages may grow. As explained in Sec. 1.1, for each additional stage one further kernel loop iteration becomes necessary (per loop execution). In addition, more rotating registers may be needed as a result of the longer scheduled latencies: Longer lifetimes of the load target registers naturally increase the register pressure in the loop. This includes lifetimes that span multiple pipeline stages: The lifetime of the target register of a load with a clustering factor of k is at least k · II cycles long and therefore occupies a range of k or more consecutive rotating registers.
Fig. 6 illustrates this using the example loop: The load is scheduled here for an additional latency of d = 2 cycles. With the II of one this results in a clustering factor of k = 3 (Equ. (3)). At any time during the steady state of the loop execution, three instances of the load associated with the target registers r32, r33, and r34 are being serviced in parallel in the memory subsystem. The longer load target register lifetimes can also extend other, indirectly connected lifetimes as a side effect: If the distance between a load and its uses grows, then this transitively also increases the distances between all data dependence graph (DDG) predecessors of the load and all DDG successors of the uses. Possible other lifetimes between these predecessors and successors will be extended as the pipeline is “stretched”. In general, the rotating register pressure grows linearly with the additional scheduled latency and the number of loads to which it is applied. In the extreme case it can even happen that the pipeliner runs out of rotating registers (on the Itanium architecture, 96 integer and 96 FP registers can rotate). If the rotating register allocation fails after scheduling, it is possible to try again with reduced latency deltas or with a higher II (see Sec. 3.3). Even if the rotating register allocation succeeds, the number of used registers may be higher and there may be additional overhead in the form of spills and fills in the loop prolog and loop epilog blocks (the code segments before and after the loop). Again, the cost of executing these spills and fills is incurred only once. In summary, latency-tolerant scheduling of non-critical loads increases the “fixed costs” per loop execution, but not the “variable costs” per loop iteration. Relative to the total loop execution time the cost will be diminishing if the trip count is high, but it can be dominant in a low-trip-count loop.
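A back-of-envelope estimate of this tradeoff for one loop execution (a sketch under our own simplifying assumptions, not the compiler's heuristic: one L-cycle stall per kernel iteration without the optimization, and a fixed cost of one extra kernel iteration per added stage):

    # Benefit scales with the kernel trip count n, cost is the extra fill/drain stages.
    def net_cycles_saved(n, ii, L, d, extra_stages):
        k = d // ii + 1                       # clustering factor (Equ. (3))
        stalls_before = n * L                 # one L-cycle stall per kernel iteration
        stalls_after = (n / k) * max(L - d, 0)
        extra_fill_drain = extra_stages * ii  # one extra kernel iteration per stage
        return (stalls_before - stalls_after) - extra_fill_drain

    # High trip count: clearly profitable.
    print(net_cycles_saved(n=1000, ii=1, L=13, d=2, extra_stages=2))   # ~9331 cycles
    # Low trip count and no actual miss: the fixed cost dominates.
    print(net_cycles_saved(n=2, ii=1, L=0, d=2, extra_stages=2))       # -2 cycles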
3. A FRAMEWORK FOR LATENCY-TOLERANT SOFTWARE PIPELINING
After studying the cost and benefit of latency-tolerant pipelining in theory in the previous section, we now describe the changes we have made to support this optimization experimentally in a recent version of the Intel® Itanium® C++ and Fortran compilers. In practice, lack of information about trip counts and cache miss behavior can make it hard to evaluate the profitability of the optimization at compile time. Therefore we focus in this section in particular on the prefetcher heuristics we have developed in order to identify and mark presumably cache-missing loads.
3.1 Tradeoffs in Practice
The earlier cost-benefit analysis has pointed out two factors that have a determining impact on the profitability of latency-tolerant pipelining: first, the actual cache miss behavior of the affected loads at runtime, and second, the loop trip count. If either one of these types of information is known at compile time, we can make a relatively safe assessment about the performance impact of the optimization: The cache miss information tells us whether there is an expected benefit and the trip count information helps us estimate the weight of the associated costs.
If, for example, long latencies are expected, then the optimization may be profitable even in a loop with a low trip count (an example will be presented later in Sec. 4.4). If the only available information is that the loop has a high trip count, we can still optimistically schedule for longer latencies because we know that the potential downside is limited. In practice, sampling-based cache miss information is rarely available during static compilation, therefore we resort to static heuristics as described below. Classic block count profiles are more common, and from the execution counts of basic blocks we can easily calculate the average trip counts of loops. Even if the average trip count is high, low-trip-count executions are still possible if the variance in the trip-count distribution is large, but then we know that they must be counterbalanced and dominated by an equal number of executions with a very high trip count. There is another sort of variability that may limit the effectiveness of latency-tolerant scheduling to some extent: latency variations. Unlike the assumptions in our theoretical studies in Sec. 2.1, the latencies of individual loads are not necessarily stable at runtime. Large latency variations could limit especially the effectiveness of clustering because only latencies of similar durations can be effectively overlapped. Luckily, very long latencies – where clustering is especially effective – occur in many applications concentrated in a small set of delinquent loads where we can expect consistently high miss rates [19, 3]. Clustered instances of the same delinquent load (from successive iterations) especially benefit from this.
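The gating decision this amounts to can be sketched as follows (our own illustrative formulation, not the shipping heuristic; the function and flag names are hypothetical, and the threshold value is taken from Sec. 4.2):

    TRIP_COUNT_THRESHOLD = 32   # empirically chosen value, see Sec. 4.2

    # Apply longer scheduled latencies only when either a latency hint marks the
    # load as likely to miss, or the estimated trip count makes the fixed cost cheap.
    def use_long_latency(load_has_hlo_hint, avg_trip_count, trip_count_is_from_pgo):
        if load_has_hlo_hint:                       # expected cache miss: benefit likely
            return True
        if trip_count_is_from_pgo and avg_trip_count >= TRIP_COUNT_THRESHOLD:
            return True                             # cost is well amortized
        return False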
3.2 Support in the Software Prefetcher
Software prefetching is done as part of the High Level Optimizer (HLO) in the compiler. Prefetching enables the compiler to tolerate memory latency by overlapping data access with useful computation as well as accesses to multiple data items simultaneously [17, 5]. It has a large impact on application performance on IPF as demonstrated in previous studies [8, 23]. But there are several factors that can reduce its efficiency: Prefetching is not very effective for low-trip-count loops, since there is not enough time to cover the memory latency. Furthermore, in some cases (explained below) the prefetcher chooses a lower prefetch distance, which may cover only a part of the latency.

The prefetch distance is the number of iterations ahead that a prefetch is issued before the corresponding load or store. This distance is generally computed by applying the formula Lat/II_est, where Lat is the average memory latency that needs to be covered and II_est is the HLO estimate of the initiation interval of the loop. The computation also takes into account any knowledge about the trip count of the loop. If the prefetch distance is too large compared to the estimated trip count, then the distance is adjusted to make sure that at least half of the prefetches issued will be useful. If the compilation options include the use of dynamic profiles, the trip-count information is readily available. In other cases, the trip-count estimation and calculation of prefetch distance makes use of information such as:

• Static array sizes of any array accessed inside the loop. Assuming that there are no out-of-bounds accesses, the access pattern of the static array can be used to get a maximum limit for the trip count of the loop.

• If the data access occurs in a loop nest, and the compiler can prove (using symbolic analysis) that the data access is contiguous across outer-loop iterations, then the prefetch distance can be high even if the inner-loop trip count is small.

When there are multiple data references that access the same cache line inside a loop, prefetching is done only for the leading memory reference.

Prefetcher Heuristics to Set Latency Hints. There is a token associated with each memory reference that is used to provide hints from the prefetcher to the code generator in the back-end. For each prefetch candidate the compiler determines whether the reference should be marked to have a higher expected latency than the default. More precisely, an L2 hint is set for integer loads and an L3 hint for FP loads – one level lower than the highest cache level where these loads can hit (FP loads bypass the L1 cache [11]). This information is then used by the pipeliner to selectively apply high-latency scheduling for non-critical loads, restricting the optimization to loads where the prefetch efficiency is less than optimal. Currently this occurs in the following situations (a small illustrative sketch follows the list):

1. Any non-loop-invariant memory reference that could not be prefetched at all is marked for longer-latency scheduling (see an example in Sec. 4.4).

2. In some cases, the prefetcher decides to reduce the prefetch distance below the calculated "optimal" amount to limit other adverse side effects. The affected loads are also marked since more of their latency is exposed. Reasons for reduced prefetch distance include:

(a) Data access involving an unknown symbolic stride whose value may be large. The compiler limits the prefetch distance for such accesses to limit TLB (translation lookaside buffer [11]) pressure due to outstanding prefetches from different memory pages.
(b) When prefetching indirect references (of the form a[b[i]]), the compiler issues prefetches for both the index reference (b) as well as the indirect reference (a). The indirect reference is prefetched with a lower distance compared to the index reference to ensure that there are no stalls in the prefetch address calculation for the indirect reference. Another reason for the lower distance is that the indirect reference may access many different memory pages, and we want to prevent TLB overflows as described in (a).

3. In loops with a large number of integer data references that miss in L1, there is extra pressure on the OzQ, the queuing structure that sits between L1 and L2 [11]. In such cases, the compiler prefetches the data into L2 only (not into L1), and these references are marked to have the higher L2 latency.

If the efficiency of prefetching is low as determined by the compiler heuristics given above, then all such accesses (to the same cache line) will get marked for higher-latency scheduling.
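The following Python sketch illustrates the prefetch-distance computation and the resulting hint marking (our own rendering; function names, field names, and constants are placeholders rather than the compiler's actual tuning values):

    import math

    def prefetch_distance(avg_latency, ii_est, est_trip_count=None):
        dist = math.ceil(avg_latency / ii_est)          # Lat / II_est
        if est_trip_count is not None:
            # keep at least half of the issued prefetches useful
            dist = min(dist, max(est_trip_count // 2, 1))
        return dist

    def mark_latency_hint(ref):
        """ref: dict with keys 'prefetched', 'distance', 'optimal_distance', 'is_fp'."""
        if not ref["prefetched"] or ref["distance"] < ref["optimal_distance"]:
            return "L3" if ref["is_fp"] else "L2"       # prefetch efficiency suboptimal
        return None                                     # default (minimum) latency

    print(prefetch_distance(avg_latency=200, ii_est=8, est_trip_count=20))  # clamped to 10
    print(mark_latency_hint({"prefetched": False, "distance": 0,
                             "optimal_distance": 4, "is_fp": False}))       # 'L2'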
3.3 Support in the Software Pipeliner
The compiler uses iterative modulo scheduling [20] for scheduling pipelined loops. The loop is first if-converted to remove control flow, then the Resource II is computed as defined in Sec. 1.1. Next, data dependence analysis is done to determine dependence information and compute the loop’s Recurrence II. If the latter is greater than the Resource II, optimizations such as predicate promotion, riffling, and data speculation are done to reduce the recurrence cycle lengths [7].
When computing the Recurrence II, the pipeliner queries the machine model component of the code generator to obtain the latencies of instructions. For loads, an additional parameter is provided with the query that specifies whether the machine model should return the minimum (base) latency of the load, or a (possibly higher) expected latency value specified by HLO hints (as discussed in the previous section). Initially, when the Recurrence II is computed, the pipeliner always requests the base latencies. Then the pipeliner determines which loads should be considered non-critical and which critical, as outlined in Sec. 1. Initially, all loads in the loop are marked as non-critical. Then the pipeliner iterates over all recurrence cycles and checks for each cycle whether increasing the latencies of all loads in this cycle to the expected latency values would increase the Recurrence II to a value higher than the Resource II, and hence would likely lead to an overall II increase. If this is the case, all loads in this cycle are marked as critical, indicating that minimum latencies should be used for them during modulo scheduling.

The modulo scheduler then iterates from Min II = max(Recurrence II, Resource II) until it finds an II at which either the loop can be scheduled successfully or we estimate that the acyclic global code scheduler can do a better job scheduling the loop. During scheduling, the pipeliner communicates the previously computed critical/non-critical information to the machine model when it queries for load latencies, indicating whether it needs the minimum or the hint-derived expected latency value. L2 and L3 latency hints are not translated into the best-case latencies of these cache levels (5/14), but into higher values that are closer to the typical latency values (11/21) specified in the manual [11]. This provides more headroom to allow for latency-increasing dynamic hazards such as conflicting stores and bank conflicts. The latter can occur if multiple accesses to the same L2 cache bank are issued in the same cycle [10]. If such conflicts are statically predictable, the machine model can usually prevent them by rearranging loads [6], but due to the dynamic out-of-order nature of the OzQ this is not possible under all circumstances. The above latency numbers are for integer loads; FP loads require one additional cycle for format conversion.

Once a successful schedule has been achieved, rotating register allocation is done using a blades-style allocation of the rotating registers [21], which takes into account the live-in and live-out values in the loop kernel. Sometimes, the compiler can successfully schedule a loop but fails in rotating register allocation because there are not enough registers available to satisfy the schedule's register requirements. In this case, instead of spilling or giving up, the pipeliner will first reduce the non-critical load latencies in the loop to the base level and then try scheduling/allocating at the same II. If this still fails, it will continue to iterate at successively higher IIs (reducing the register pressure) until either the register requirements for the loop can be met or we estimate that pipelining at this II is not profitable. Hence latency-tolerant pipelining can, as described, lead to additional modulo scheduling attempts if the register allocation fails, but the compile-time increase we measured due to this is in the noise range (0.5%).
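A simplified sketch of the critical/non-critical classification just described (our own formulation, not the pipeliner's source; data representations and names are assumptions):

    # A load stays non-critical only if raising all load latencies on every
    # recurrence cycle it belongs to keeps that cycle's length within the Resource II.
    def classify_loads(recurrence_cycles, loads, resource_ii, expected_latency):
        """recurrence_cycles: list of cycles, each a list of (instr, latency) edges.
        Returns the set of loads that must keep their base latency (critical)."""
        critical = set()
        for cycle in recurrence_cycles:
            length = sum(expected_latency if instr in loads else lat
                         for instr, lat in cycle)
            if length > resource_ii:                    # would force the II up
                critical.update(instr for instr, _ in cycle if instr in loads)
        return critical

    # One recurrence cycle containing a load: with an expected latency of 11 the
    # cycle would exceed a Resource II of 4, so the load keeps its base latency.
    print(classify_loads([[("ld_a", 1), ("add_b", 1)]], {"ld_a"}, 4, 11))  # {'ld_a'}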
4. EXPERIMENTAL RESULTS
We have evaluated the performance of latency-tolerant pipelining on both the SPEC® CPU2000 and CPU2006 benchmark suites on Linux. All benchmarks were run on a 1.6 GHz Dual-Core Intel® Itanium® 2 processor with 12 MB of L3 cache per core (in "speed" runs where one copy of each benchmark program is run in a single thread on a single core). We used compiler options that usually give the best performance and are applied for SPEC submissions:

    -O3 -ipo -static -ansi_alias -IPF-fp-relaxed

These options enable compiler optimizations that include interprocedural optimizations (IPO), loop/memory optimizations such as loop unrolling, fusion, distribution, blocking, unroll-jam, and scalar replacement, as well as scalar optimizations such as partial redundancy elimination (PRE), strength reduction, and various back-end optimizations [7]. For some experiments profiling was additionally used.
4.1 Testing Methodology for SPEC
The results in this paper are for research and academic usage and were run following the guidelines in the SPEC CPU2000 and CPU2006 Run Rules for “Research and Academic Usage” [1]. The measurements are not from a formal SPEC run, but were generated using the SPEC tools where (1) each benchmark was run by itself and (2) triple runs were performed only for benchmarks that showed a performance difference of more than 1%.
4.2 Headroom Experiment
We start our measurements with a headroom experiment that compares two extremes: A compiler that schedules all non-critical loads for the typical L3 latency against a baseline compiler that applies no non-critical latency increases at all. These experiments were conducted with five different trip-count thresholds n between 0 and 64. Only in loops with at least these average trip counts were the longer latencies applied (the case n = 0 means effectively that no such threshold was used at all). Profiling (referred to as PGO in the following) was used for reliable trip count information.

Figure 7 shows the results for CPU2006 and CPU2000. In line with our predictions in Sec. 2.2, the trip count plays a significant role: While the losses almost neutralize the gains in CPU2006 without a trip-count threshold (+0.5%), the speedup on the geomean rises to 1.3%, 2.4%, 2.3%, and 2.1% for n = 8, 16, 32, and 64, respectively. The same effect is observed in CPU2000, where the performance changes on the geomean are -0.7%, 0.8%, 0.6%, 0.6%, and 0.3% for n = 0, 8, 16, 32, and 64, respectively. The numbers for 464.h264ref exemplify this trend: Here a very hot pipelined loop – formed out of a hot trace from the loop in “FastFullPelBlockMotionSearch()” – causes the regression. It has a low trip count of only around 10 and contains a non-critical load with a linear access pattern (high L1D hit rate). Under low trip-count thresholds this load is scheduled at a higher latency, which gives no benefit, but adds five more pipeline stages.

When a trip-count threshold of 32 is used, most of the losses disappear and substantial gains prevail: In CPU2006, the largest gains are observed in 429.mcf (+14%), 444.namd (+10%), 462.libquantum (+7%), and 481.wrf (+7%). In CPU2000, 179.art (+12%) and 200.sixtrack (+8%) benefit most. In a few benchmarks losses persist when the trip-count threshold is increased. We investigated this unexpected behavior in 177.mesa and found out that it is due to a discrepancy between the SPEC training and reference input sets: While the hot loop at line 746 in “gl_write_texture_span()” has an average trip count of 154 in the training sets, it becomes a short-trip-count loop in the reference input sets with 8 iterations on average. As our execution profile and the trip count estimates are based on the training run, scheduled latencies are increased in the loop and lead to a slowdown in the measurement runs where the trip count is low and the loads hit in the cache.
[Figure 7: two bar charts of "% Gain over Baseline" (roughly -25% to +15%) across the SPEC CPU2006 benchmarks (400-483) and the SPEC CPU2000 benchmarks (164-301), with one bar series per trip-count threshold n = 0, 8, 16, 32, 64, plus the geomean.]
Figure 7: Headroom experiment results (with PGO). Results generated per "Research and Academic Usage Guidelines" from SPEC; see Sec. 4.1.

When disabling software prefetching in the compiler, the gain in this headroom experiment grows to 4.6% on the geomean (CPU2000 and CPU2006 combined, with n = 32, graph not shown). In general, the value 32 appears as an empirically reasonable choice for the trip-count threshold: It is more conservative than the value 16, reducing the general regression risk, but it still gives virtually the same gains. Therefore it is used in all the following experimental runs with PGO.
4.3 Results with Prefetcher Hints
Setting the expected latency hints of all loads to the L3 level “across the board” is a quite “pessimistic” setting. Figure 8 shows that comparable speedups can already be achieved by a much more moderate general hint setting, namely by marking all floating-point loads for an L2-level expected latency. Such loads already bypass the L1 and directly access the L2 cache on the Itanium® 2 processor, but as mentioned in Sec. 3.3 the hints are translated into higher typical latency values, therefore non-critical FP loads are scheduled with this setting for a latency that is almost twice the minimum latency. When compared against the same baseline compiler as in the previous section, the gains are 1.1% and 0.6% on the geomean in CPU2006 and CPU2000, respectively. The advantage of such a general, moderate hint setting is its simplicity in combination with a relatively low regression risk, making it interesting as a default compiler setting. In the following, we continue to use the L2 hint as a default for FP loads for which no HLO hint is specified.

The right bars of Fig. 8 eventually show the gains from applying the HLO-directed hints based on the heuristics from Sec. 3.2. With 2.0% and 1.3% on the geomean in CPU2006 and CPU2000, respectively, they give almost twice the speedup of just the default setting. In comparison to the headroom experiment from Fig. 7, the more selective application of longer-latency scheduling reduces the
potential for regressions, as exemplified by the disappearance of the 177.mesa loss. At the same time, the large gains like in 444.namd (+12%) and 200.sixtrack (+11%) are preserved, and integer benchmarks like 429.mcf (+12%) now benefit from increasing the latencies of integer loads as well. We naturally do not observe gains in all 55 SPEC benchmarks because they have highly different execution characteristics. Some do not even contain hot pipelined loops in the first place. A key observation is that there are no more substantial regressions because we have managed to contain the cost component of the optimization.
Results without PGO. The advantage of HLO-directed versus general latency increases becomes much more pronounced when PGO is dropped. In the absence of dynamic profile information the compiler computes a static profile based on heuristic rules. The accuracy of this static profile, and in particular of the trip count estimates, is naturally low. It can occur that more loops with a very low trip count are now assumed to have a higher trip count and are therefore pipelined. If additionally the number of stages in such loops is increased by boosting non-critical loads to L3 latencies, their total execution time can be multiplied, as demonstrated by the -0.7% loss on the geomean in Fig. 9. This is essentially the same headroom experiment as in Fig. 7, but without PGO; the comparison is against the same baseline compiler as previously, but again without PGO. We have conducted this experiment only for CPU2006 because in this suite – in contrast to CPU2000 – PGO is not permitted for base runs, making the results more relevant. Load latency information can compensate for the absence of reliable trip-count information, as discussed in Sec. 3.1: If the latency increases are confined to HLO-selected candidates, the loss is turned into a 2.2% gain on the geomean (Fig. 9, right bars). One outlier is the persisting 445.gobmk loss: It represents a “worst
case scenario” that can happen when both the trip count information and the load latency estimates do not match the reality: Here in several loops indirect references are scheduled for a longer latency, but the trip count as well as the load latencies turn out to be very low at runtime. With PGO these loops are not pipelined in the first place, preventing this scenario from happening (Fig. 8). On the geomean, the few minor losses are dominated by large gains in particular in 444.namd (+11%), 462.libquantum (+14%), 481.wrf (+7%), and 429.mcf (+10%). We take a closer look at the latter gain in the following section.

[Figure 8: two bar charts of "% Gain over Baseline" (roughly -10% to +15%) across the SPEC CPU2006 and SPEC CPU2000 benchmarks, with one bar series for "All FP loads L2 hint" and one for "Plus HLO hints", plus the geomean.]

Figure 8: Gains from marking all FP loads with an L2 hint and from HLO-directed hints (with PGO). Results generated per "Research and Academic Usage Guidelines" from SPEC; see Sec. 4.1.
4.4 Example from 429.mcf
We use a hot loop in “refresh_potential()” in this benchmark as an example to briefly illustrate in practice the clustering benefit, which was theoretically analyzed in Sec. 2.1. The indirect references in the code excerpt below (node->basic_arc->cost and node->pred->potential) are delinquent with average latencies of up to a hundred cycles; they cannot be prefetched since they depend on a pointer-chasing recurrence (“node = node->child”).

    while( node ) {
        if( node->orientation == UP )
            node->potential = node->basic_arc->cost
                            + node->pred->potential;
        else {
            ...
        }
        node = node->child;
    }

Hence they are marked for higher-latency scheduling according to heuristic (1) from Sec. 3.2 and, since they are not on a recurrence cycle, scheduled accordingly in the pipelined loop. As a result, instances of these loads from successive kernel iterations are clustered (k = 2). Although this occurs on average only for two respective instances per loop execution – the average trip count of this loop is 2.3 – there is a 40% speedup for the loop due to reduced stalls.
4.5 Statistics and Performance Counter Measurements
We have collected further compiler and performance counter statistics in order to study the effects and tradeoffs of latency-tolerant pipelining. All the data presented in the following is aggregate data for the entire CPU2006 suite, comparing a compiler that applies the HLO hints against the baseline compiler used in the previous sections, without PGO (i.e., exactly the last experiment from Fig. 9). First, we have counted in the software pipeliner by how much the number of allocated registers in pipelined loops increases (counting both rotating and non-rotating registers): The number of needed general (integer) registers increases by 14%, the number of FP registers by 20%, and the number of predicate registers (required to control the pipeline stages) by 35%. Still, for all register types the number of allocated registers remains less than one fifth of the number of available registers on average – the large supply of architected registers on Itanium® is far from being exhausted by long-latency scheduling. Consequently, the number of spills outside of pipelined loops² (due to increased register usage within these loops) grows by merely 1.8%. The percentage of spills among all instructions remains very low (1.1%). Overall, it appears that the register pressure increase due to our optimization has only a minor impact.

² Within pipelined loops, no spilling occurs (see Sec. 3.3).
[Figure 9: bar chart of "% Gain over Baseline" (roughly -20% to +15%) across the SPEC CPU2006 benchmarks, with one bar series for "All loads L3 hint" and one for "HLO hints", plus the geomean.]
Figure 9: Gains without PGO and two different hint settings (general L3 hints and HLO-directed). Results generated per "Research and Academic Usage Guidelines" from SPEC; see Sec. 4.1.
[Figure 10: stacked bar chart of CPU2006 clock cycles (up to about 4.5E+13) for the Baseline and HLO Hints configurations, subdivided into Unstalled Execution, BE_RSE_BUBBLE.ALL, BE_L1D_FPU_BUBBLE.ALL, BE_FLUSH_BUBBLE.ALL, BE_EXE_BUBBLE.ALL, and BACK_END_BUBBLE.FE components.]

Figure 10: Cycle accounting data corresponding to Fig. 9.
Next, we have used the hardware performance counters to study how long-latency scheduling shifts the ratio between unstalled and stalled execution time, as well as between individual stall components. The data, shown in Fig. 10, was measured using Caliper [2]. Each of the two bars represents the entire amount of clock cycles of a CPU2006 run. The subdivisions of the bars show how much of this time the processor spends in six different microarchitectural states. In this context, only four of these states are of interest: BE_EXE_BUBBLE.ALL represents the cycles in which the in-order pipeline is stalled because data is not yet available – in most cases data from memory. Long-latency scheduling reduces this time by 12% across all of CPU2006. But the percentage of clock cycles in which the OzQ, the central out-of-order memory request queue [11], is full increases from 8.2% to 9.4% (measured using the L2D_OZQ_FULL counter, not shown in the figure). This leads to an 8% increase in the BE_L1D_FPU_BUBBLE.ALL component, which includes execution pipeline stalls due to a full OzQ. This data shows that our optimization stresses the memory subsystem of the Dual-Core Itanium® 2 processor to its limits, and it indicates that the benefit could be much higher if the queuing capacities in the cache hierarchy were increased.

Furthermore, the counter BE_RSE_BUBBLE.ALL shows that the fraction of cycles spent in register stack engine (RSE) activity grows by 14% – this is a side effect of the increased number of allocated stacked registers, which are automatically spilled and filled by this hardware engine [11, 12]. The overall impact is minor because the RSE component is small to begin with. The unstalled execution time also increases slightly (1.2%) due to the additional epilog iterations in pipelined loops.
5. RELATED WORK
Balanced scheduling [13] is one of the first approaches to take uncertain memory latencies into account during instruction scheduling. The algorithm, designed for non-superscalar architectures with non-blocking caches, increases load-use distances in the schedule in order to execute more instructions in the shadow of a cache miss. It tries to balance these increases equally among all loads in a basic block to allow for uncertain latencies and to reduce register pressure. In our work, the available number of rotating registers and the available parallelism in the software pipeline are so large that we can increase load-use distances in the schedule more aggressively. Balancing latency increases between different loads on a recurrence cycle is a possible future extension of our work.

Among later work on latency-aware software pipelining [22, 4], cache sensitive modulo scheduling [22] comes closest to our technique. This scheme combines pipelining with early scheduling of memory instructions instead of compiler-generated prefetches, whereas our method uses latency scheduling on top of our compiler-generated prefetching scheme, and even leverages the latter to heuristically provide latency hints. [22] presents experimental results for traces extracted from the innermost loops of five SPECfp95 benchmarks, run on a simulator. It assumes memory latencies of only 10-20 cycles. In contrast, we demonstrate the general nature of the performance headroom across a wide range of integer and FP benchmarks, compiled at the highest optimization levels and measured on real hardware. We study cost-benefit tradeoffs of the method to make it solid enough for a production compiler (even without PGO). For the Itanium, generating prefetches is key for good performance, and the large number of architected registers mitigates problems with register pressure. In addition, rotating registers easily enable clustering of load instances from successive iterations, one of the key insights of our work. Without rotating registers, this effect could only be achieved with unrolling.
In [19] a classification scheme is described that statically identifies delinquent loads based on their access properties. In this paper, we use prefetching heuristics to mark only those loads that may have less than 100% prefetch efficiency as delinquent. Our way of choosing the delinquent loads follows naturally from the prefetching algorithm and has many elements that are desirable for delinquency as described in [19], e.g., memory references that involve more than one level of dereference.
6. CONCLUSION AND OUTLOOK
We have investigated latency-tolerant software pipelining, an optimization that schedules non-critical loads for a higher latency to mitigate memory stalls. Our technique supplements existing software prefetching and also makes use of it by letting the prefetcher preselect loads with suboptimal prefetching efficiency for long-latency scheduling. Implemented in an Itanium® product compiler, it gives average gains of 2.2% in SPEC® CPU2006 at base optimization levels. Theoretical studies as well as the experimental results demonstrate that the stall reductions come from a combination of latency coverage and clustering effects. We showed that the general applicability of the technique depends significantly on available cache-miss and trip-count information. To make this information more precise and consequently increase the net gain from the optimization, we are looking into dynamic cache-miss sampling, more refined HLO and pipeliner heuristics, and/or trip-count versioning.
Acknowledgments
We would like to thank the Intel® Itanium® compiler team, in particular Daniel Lavery, John Ng, Kalyan Muthukumar, and Wei Li for their support. Our thanks also go to Albert Cohen and the anonymous reviewers for their constructive comments.
7. REFERENCES
[1] Standard Performance Evaluation Corp. www.spec.org.
[2] HP Caliper, 2003. www.hp.com/go/caliper.
[3] S. G. Abraham, R. A. Sugumar, D. Windheiser, B. R. Rau, and R. Gupta. Predictability of Load/Store Instruction Latencies. In Proceedings of the 26th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 139–152, Austin, TX, 1993.
[4] M. Bedy, S. Carr, S. Onder, and P. Sweany. Improving Software Pipelining by Hiding Memory Latency with Combined Loads and Prefetches. In Proceedings of the 5th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT-5). Kluwer Academic Publishers, 2001.
[5] D. Callahan, K. Kennedy, and A. Porterfield. Software Prefetching. In ASPLOS-IV: Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.
[6] J.-F. Collard and D. Lavery. Optimizations to Prevent Cache Penalties for the Intel® Itanium® 2 Processor. In Proceedings of the First IEEE/ACM International Symposium on Code Generation and Optimization (CGO), San Francisco, CA, Mar. 2003.
[7] C. Dulong, R. Krishnaiyer, D. Kulkarni, D. Lavery, W. Li, J. Ng, and D. Sehr. An Overview of the Intel® IA-64 Compiler. Intel Technology Journal, (Q4), 1999.
[8] S. Ghosh, A. Kanhere, R. Krishnaiyer, D. Kulkarni, W. Li, C. Lim, and J. Ng. Integrating High-Level Optimizations in a Production Compiler: Design and Implementation Experience. In Proceedings of the 12th International Conference on Compiler Construction (CC), Apr. 2003.
[9] C. R. Hardnett, R. M. Rabbah, K. V. Palem, and W. F. Wong. Cache Sensitive Instruction Scheduling. Technical Report GIT-CC-01-15, Georgia Institute of Technology, 2001.
[10] Intel. Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization, May 2004.
[11] Intel. Dual-Core Update to the Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization, Jan. 2006.
[12] Intel. Intel® Itanium® Architecture Software Developer's Manuals, Volumes 1–3, Jan. 2006.
[13] D. R. Kerns and S. J. Eggers. Balanced Scheduling: Instruction Scheduling When Memory Latency is Uncertain. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI), pages 278–289, Albuquerque, NM, June 1993.
[14] M. Lam. Software Pipelining: An Effective Scheduling Technique for VLIW Machines. In Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, pages 318–328, June 1988.
[15] D. M. Lavery and W. W. Hwu. Modulo Scheduling of Loops in Control-Intensive Non-Numeric Programs. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), Paris, France, 1996.
[16] T. Lyon, E. Delano, C. McNairy, and D. Mulla. Data Cache Design Considerations for the Itanium® 2 Processor. In Proceedings of the IEEE International Conference on Computer Design (ICCD), Freiburg, Germany, Sept. 2002.
[17] T. C. Mowry, M. S. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), volume 27, pages 62–73, New York, NY, 1992. ACM Press.
[18] K. Muthukumar, D.-Y. Chen, Y. Wu, and D. Lavery. Software Pipelining of Loops with Early Exits for the Itanium® Architecture. In Proceedings of the First Workshop on EPIC Architectures and Compiler Technology (EPIC-1), Austin, TX, Dec. 2001.
[19] V.-M. Panait, A. Sasturkar, and W.-F. Wong. Static Identification of Delinquent Loads. In Proceedings of the Second Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Palo Alto, CA, Mar. 2004.
[20] B. R. Rau. Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops. In Proceedings of the 27th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), San Jose, CA, Nov. 1994.
[21] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register Allocation for Software Pipelined Loops. In Proceedings of PLDI 1992, San Francisco, CA, June 1992.
[22] F. J. Sánchez and A. González. Cache Sensitive Modulo Scheduling. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), Research Triangle Park, NC, Dec. 1997.
[23] X. Tian, R. Krishnaiyer, H. Saito, M. Girkar, and W. Li. Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance. In Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS), Denver, CO, Apr. 2005.