Speculative Prefetching

Y. Jegou, IRISA/INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France
O. Temam, University of Leiden, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands

Abstract

A hardware prefetching mechanism for cache memories named Speculative Prefetching is proposed. This scheme detects regular accesses issued by a load/store instruction and prefetches the corresponding data. The scheme requires no software add-on, and in some cases it is more powerful than software techniques for identifying regular accesses. The tradeoffs related to its hardware implementation are extensively discussed in order to finely tune the mechanism. Experiments show that the average memory access time of regular codes is brought within 10% of optimum for processors with usual issue rates, while the performance of irregular codes is slightly improved and never degraded. The scheme's performance is discussed over a wide range of parameters.

Keywords: cache, hardware prefetch, numerical codes, memory latency.

1 Introduction

The memory latency observed by current processors is high, whether they are superpipelined (fast processor clock), have no secondary caches, or belong to a multiprocessor architecture (high network latency). Moreover, this latency tends to increase rapidly. Consequently, the cost of a cache miss is becoming prohibitive. In addition to high memory latency, superscalar processors have a very high memory request issue rate, which makes them even more sensitive to cache performance. If cache misses are numerous, processor performance is severely degraded. Therefore, though classic cache designs theoretically provide a one-cycle memory access time, they are in general unable to deliver such performance. It then becomes worthwhile, if not critical, to design mechanisms for decreasing the time spent in memory accesses, or in other words, the average memory access time per reference. Another conclusion of the previous observations is that code performance on the above-mentioned architectures is directly related to the data traffic generated.

A major class of applications characterized by large memory workspaces is numerical codes. Another remarkable characteristic of these applications is that data accesses are frequently vector accesses with various strides. Consequently, the memory references of a numerical code are highly predictable. Vector architectures exploit this property using pipelined memories and exhibit very high average memory issue rates, but they do not exploit the potential temporal locality. Current processors rely on caches for exploiting temporal locality, and on cache lines for exploiting spatial locality. However, this latter method loses efficiency or fails when strides are too large, and cannot avoid periodic cache misses during a vector access. Besides, large cache lines induce interference phenomena which can degrade the performance of software methods used to exploit temporal locality [9].

A solution to these problems is to resort to prefetching, i.e., to predict which data will soon be used and load them into upper levels of the memory hierarchy before they are referenced, thereby avoiding cache misses. Prefetching can be managed either by hardware, software, or a combination of both.

This work was funded by the BRA Esprit III European Project APPARC.


In this paper, we propose a hardware scheme named Speculative Prefetching for performing efficient data prefetching in programs with regular accesses. This prefetching technique reduces the average memory access time to within 10% of optimum for regular codes, and slightly improves the behavior of irregular codes. The scheme loses performance when the issue rate is high, though it still brings significant improvements. The mechanism requires no software add-on and, in some difficult cases, it is more successful than software techniques at exploiting stride accesses. Besides, only a reasonable amount of on-chip hardware logic is required, though the mechanism needs to be finely tuned to deliver optimum performance.

In shared-memory multiprocessors, the observed memory latency can be high, and can further increase if memory traffic is high enough to breed network contention. The underlying goal of this paper was to propose a cache memory design suitable for shared-memory multiprocessors. Because this mechanism is capable of hiding high memory latencies without excessive additional memory traffic, it is particularly well suited to multiprocessors. Besides, it is shown that spatial locality is exploited through prefetching, so that the cache line size can be decreased, which is particularly critical in shared-memory multiprocessors where cache line size is a tradeoff between spatial locality exploitation and invalidations (cache line invalidations due to true or false sharing of the line between several processors, i.e., local caches).

In section 2, the respective assets and drawbacks of software and hardware prefetching are discussed so as to define the optimal characteristics of a prefetching technique. In section 3, the workings of the Speculative Prefetching mechanism are presented. In section 4, the performance of this mechanism is discussed. Finally, in section 5, further developments are detailed and conclusions are drawn.

2 Prefetching tradeoffs

2.1 Software prefetching

Prefetch instructions Software prefetching relies on the compiler for predicting when data will be used and for generating the necessary memory requests sufficiently in advance. Because usual load instructions are still necessary for normal execution, the compiler must generate one classic load instruction plus one prefetch instruction for each piece of data to be fetched in advance. Consequently, program size is larger, the processor instruction set must be modified, and processor cycles are lost in executing such instructions. A solution to that problem, proposed by Callahan et al. [2], is to replace nops with prefetch instructions or to dedicate one instruction thread of superscalar processors to this task.

Data traffic A consequence of software prefetching is heavy data traffic (memory requests are nearly doubled). A solution proposed by Veidenbaum et al. [6] is to group prefetch requests in blocks, a solution made possible by the regularity of accesses within numerical codes. Even then, unnecessary prefetch requests can be issued for data already located in cache. In order to avoid that, Klaiber et al. [8] proposed a hardware mechanism for testing whether a prefetch request is in cache before issuing it (prefetch on miss). So solutions to some issues of software prefetching rely on hardware mechanisms. Besides, software prefetching relies on compiler performance for predicting prefetch request issue dates, grouping requests, and detecting regular accesses.

Compiler performance Several recent works on the detection and exploitation of temporal and spatial locality [9, 4] may greatly improve compiler capacity for prefetching. However, techniques for managing numerous regular and concurrent streams of references are still under development. Besides, difficult and not infrequent cases, such as if statements within a loop nest or non-rectangular loops, are poorly handled or not handled at all. Let us consider the two examples of figures 1 and 2. The if statement of the first loop nest has the effect of periodically changing the access stride to array mat (every 1000 iterations). It is probable that a compiler would not be capable of generating efficient prefetching instructions for this loop nest (a solution would be to prefetch for both possible strides). The second loop is a triangular loop.

for (i=0; i < dim; i++) {
    j = i - (i%1000);
    if ((i-j) > 500) {
        mat[(i-j) + j] += 1;
    } else {
        mat[2*(i-j) + j] += 2;
    }
}

Figure 1: IF: Do-Loop with conditional statements.

for (i=0; i < DIM; i++) {
    for (j=0; j < DIM; j++) {
        for (k=1; k < j; k++) {
            mat[i][j] = mat[i][j] + y[i][k]*x[k][j];
        }
    }
}

Figure 2: TRI: Triangular Do-Loop.

Either a prefetching instruction is issued for each reference, in which case prefetching is done correctly though inefficiently as explained above, or prefetch requests are grouped, in which case a compiler could only generate prefetch requests for rectangular blocks of instructions, resulting in unnecessary requests. So, though software prefetching is very promising, it requires hardware developments and improvements of compiler techniques. Hardware prefetching requires no software development, but it can either lack efficiency or be too heavy to implement. The fragment below sketches the per-reference alternative for the loop of figure 1.
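As a hedged illustration (not taken from the paper), the following fragment shows how per-reference software prefetching might look for the loop of figure 1, prefetching a few iterations ahead for both possible strides since the taken branch cannot be predicted statically. The GCC intrinsic __builtin_prefetch and the lookahead distance AHEAD are our own assumptions.

#define AHEAD 8   /* assumed prefetch distance, in loop iterations */

for (i = 0; i < dim; i++) {
    int fi = i + AHEAD;        /* iteration we prefetch for */
    int fj = fi - (fi % 1000);
    /* The compiler cannot tell which branch the future iteration takes,
       so it must prefetch for both possible strides, adding extra memory
       requests on top of the normal load and store, as discussed above. */
    __builtin_prefetch(&mat[(fi - fj) + fj]);
    __builtin_prefetch(&mat[2 * (fi - fj) + fj]);

    j = i - (i % 1000);
    if ((i - j) > 500) {
        mat[(i - j) + j] += 1;
    } else {
        mat[2 * (i - j) + j] += 2;
    }
}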

2.2 Hardware Prefetching

Cache lines The use of cache lines is the simplest means of prefetching. However, not all cache misses of a vector access can be avoided. Besides, cache lines must be small (around 64 bytes) in order to limit cache interferences [12].

Prefetch always This prefetching scheme consists in prefetching the cache line immediately following the last one referenced. It may breed cache pollution. Besides, it is inefficient when the access stride is greater than the line size. Finally, the prefetch request may occur too late, and the processor may miss on the line before it could have been brought to cache. On the other hand, this scheme is simple and easy to implement. Because strides are often small, it may be useful in a large number of cases. It has actually been implemented in the Viking processor [14] (a minimal sketch of this policy is given below, after the discussion of stream buffers).

Stream buffers Stream buffers have been proposed by Jouppi [7]. Four buffers containing four cache lines each are added to the cache. On a miss, the buffers are checked, and if a hit occurs, the line is transferred to cache and the line following the last one in the buffer is prefetched. This technique can accommodate strides four times longer than Prefetch always and avoids cache pollution. Because there are four buffers, concurrent regular accesses can be managed, such as four array references within a loop nest. If there are more than four array references, the scheme fails. Besides, any loss of regularity in the accesses (if statements, non-rectangular loops...) disrupts the stream buffer behavior.
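For concreteness, here is a minimal sketch of the Prefetch always (next-line) policy described above, under our own simplifying assumptions: the line size and the helper functions cache_contains_line() and issue_prefetch() are placeholders for the real cache and memory interfaces, not part of the original design.

#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64   /* bytes per cache line (assumed) */

/* Assumed hooks: a cache lookup and a memory-side prefetch request. */
bool cache_contains_line(uint64_t line_addr);
void issue_prefetch(uint64_t line_addr);

void on_access(uint64_t addr)
{
    uint64_t next_line = (addr / LINE_SIZE + 1) * LINE_SIZE;
    /* Prefetch the line following the one just referenced, unless it is
       already cached.  Useless when the access stride exceeds LINE_SIZE. */
    if (!cache_contains_line(next_line))
        issue_prefetch(next_line);
}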

Combined Hardware/Software prefetching This scheme, proposed by Klaiber et al. [8], uses an on-chip buffer for storing prefetch requests. On a cache miss, the buffer is checked. Prefetch requests are issued only if the data is not in cache. The buffer is efficient at avoiding disrupting the cache, and consequently processor execution, by storing incoming prefetch requests. However, the scheme still strongly relies on the compiler for detecting and issuing prefetches.

Preloading scheme
- The principles of this scheme are close to those we adopted [1]. The address request issued by each load/store instruction is stored in a table indexed by program counters, along with the stride, i.e., the difference between the new and the previous address requests. Using a second sequencer, the program is run a few instructions ahead of the primary sequencer. For each load/store instruction found during future execution of the program, the table is checked to see whether the stride is stable. In that case, a prefetch is issued at the address estimated by adding the stride to the last referenced address. This method relies on branch prediction to predict the future execution of the program. Besides, from a hardware point of view, this scheme seems to require a heavy implementation (a second sequencer, branch prediction support...). Moreover, the issues related to detailed implementation remain to be thoroughly addressed.
- In [5] the technique for computing the predicted address is simpler than in [1] and close to the one we designed. On the other hand, the issues related to implementation are again not thoroughly addressed, since the design proposed is relatively rough. Cache misses are considered as a metric instead of average memory access time, so that delicate implementation tradeoffs are ignored.
- In [11] the preloading technique is proposed for dealing with vector accesses in machines with cache memories. Implementation issues are hardly addressed, but the ability of the design to exploit locality in loop nests where compilers would fail to detect and exploit it is stressed.

Speculative Prefetching The scheme we propose is also based on the detection of stride stability of the references emitted by a load/store instruction. On the other hand, the prefetch decision is based on this sole criterion, which greatly simplifies the mechanism design. This technique can accommodate any reference stride and handle cases where software prefetching would fail. Besides, it is implemented so as not to disrupt cache and processor execution. Data traffic is minimized by avoiding unnecessary prefetch requests. The processor instruction set is not modified; only small caches and a little hardware logic are added. We found our scheme to be of at least similar efficiency and easier to implement than the Preloading scheme. Though branch prediction is not used, it is shown that the scheme behaves well with conditional branches. Besides, we address the issues and tradeoffs raised by hardware implementation. It is shown that implementation is not straightforward and must be thoroughly studied in order to preserve the mechanism's efficiency. It is also demonstrated that it is possible to reduce the line size in order to improve the behavior of codes which do not benefit from Speculative Prefetching (no vector accesses). Finally, for the specific case of superscalar processors, it appears that a more complex design is necessary.

3 Speculative Prefetching mechanism

3.1 Principles and basic implementation of Speculative Prefetching

A basic hardware diagram of Speculative Prefetching is shown in figure 3. Let us describe the scheme's basic principles.

Figure 3: Hardware diagram for Speculative Prefetching.

Load/store cache When a load/store request is issued by the processor, it is simultaneously handled by the cache and sent to a load/store cache. This load/store cache is used to store the program address of load/store instructions, the last data address referenced, and the stride, i.e., the difference between the last and the second-to-last address referenced. In this cache the mapping is done using the instruction address (instead of the data address as in classic caches). On a hit, the difference between the new data address and the one stored in the cache is computed, and the resulting new stride is compared to the one stored in the cache. If the strides are equal, then the load/store instruction is assumed to generate vector accesses, because the distance between two consecutive references has been constant over the last three references. Whether the strides are equal or not, the new address and new stride are stored in the load/store cache, replacing the old address and stride. On a load/store cache miss, the address is stored and the stride is reset to 0.

Prediction mechanism If a vector access has been detected, then a future reference is predicted. Whether the next reference or a further one should be prefetched is an issue discussed in section 3.2.2. In any case, the predicted reference address is approximately equal to n × stride + last address, where n = 1 if the next address is prefetched. Once a reference address has been predicted, the cache is tested to check whether this address is already in cache, in which case the prefetch request is discarded. So as not to disrupt the cache, the prefetch requests to be tested are stored in a small FIFO buffer named the prefetch test buffer (size: 1 entry). Due to the pipelining of instructions, the processor knows a cycle in advance whether the next instruction is a load/store. Therefore, the processor can issue a signal stating that the cache will not be used at the next cycle, i.e., that a prefetch test can be issued to the cache at the next cycle. Once a prefetch request has missed in cache, it must be sent to memory. Because memory may be unavailable due to cache requests, these requests are stored in another small FIFO buffer named the prefetch request buffer (size: 2 entries). Whenever memory is available, a pending prefetch request is issued. When a prefetch request is issued, it is marked present but not yet available in the prefetch cache (see the paragraph below). A sketch of the detection and prediction logic is given at the end of this subsection.

Prefetch cache All memory requests coming back to the processor while no cache requests are outstanding are prefetch requests. Incoming prefetch requests are stored in a small prefetch cache so as not to disrupt normal execution by stalling the cache and the processor. The processor always tests the cache and the prefetch cache simultaneously. When the processor misses in cache but hits in the prefetch cache, the requested word is loaded by the processor, and simultaneously its cache line is sent to the cache and invalidated in the prefetch cache (this is equivalent to a line transfer). In case the prefetch cache is busy (receiving memory requests), the processor waits until it is free and checks it before issuing a miss request. If the required data is in the prefetch cache but not yet available, the processor waits for its availability and then resumes execution.
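To make the detection and prediction steps concrete, here is a minimal sketch in C of the load/store cache logic described above. It is our own illustrative model, not the authors' hardware design: the direct-mapped indexing by instruction address, the 256-entry size (a value suggested in section 3.2.1), and the helper functions cache_contains_line() and enqueue_prefetch_test() are assumptions made for the sake of the example. The real design queues the candidate address in the prefetch test buffer and checks the cache only when it is idle; here the test is done directly.

#include <stdint.h>
#include <stdbool.h>

#define LS_CACHE_ENTRIES 256   /* load/store cache size (assumed, cf. section 3.2.1) */
#define PREFETCH_DISTANCE 1    /* n in the predicted address n * stride + last address */

/* One load/store cache entry: tagged by the instruction address, holding the
   last data address referenced and the last observed stride. */
typedef struct {
    uint64_t pc_tag;     /* program address of the load/store instruction */
    uint64_t last_addr;  /* last data address it referenced */
    int64_t  stride;     /* last_addr minus second-to-last address */
    bool     valid;
} ls_entry_t;

static ls_entry_t ls_cache[LS_CACHE_ENTRIES];

/* Assumed hooks into the rest of the memory system. */
bool cache_contains_line(uint64_t addr);     /* test the normal cache          */
void enqueue_prefetch_test(uint64_t addr);   /* push into the prefetch buffers */

/* Called for every load/store the processor issues. */
void speculative_prefetch(uint64_t pc, uint64_t data_addr)
{
    /* Direct-mapped lookup by instruction address (word-aligned PC). */
    ls_entry_t *e = &ls_cache[(pc >> 2) % LS_CACHE_ENTRIES];

    if (e->valid && e->pc_tag == pc) {
        int64_t new_stride = (int64_t)(data_addr - e->last_addr);
        if (new_stride == e->stride) {
            /* Stride stable over the last three references: assume a vector
               access and predict a future reference. */
            uint64_t predicted = data_addr + PREFETCH_DISTANCE * new_stride;
            if (!cache_contains_line(predicted))
                enqueue_prefetch_test(predicted);
        }
        /* Whether strides match or not, update the stored address and stride. */
        e->last_addr = data_addr;
        e->stride    = new_stride;
    } else {
        /* Load/store cache miss: install the entry, reset the stride to 0. */
        e->pc_tag    = pc;
        e->last_addr = data_addr;
        e->stride    = 0;
        e->valid     = true;
    }
}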

3.2 Implementation details and tradeoffs

In this section, implementation details and tradeoffs are further discussed.

3.2.1 Load/store cache


Figure 4: Optimal choice for load/store cache size.

Figure 5: Influence of load/store cache hit ratio on average memory access time.

Load/store cache size Because numerical codes are mainly made of Do-Loops, load/store instructions corresponding to array references, i.e., to vector accesses, mainly appear within these loops. In assembly code, these loops correspond to basic blocks whose size is relatively small. Thanks to this basic block structure, few load/store instructions are simultaneously active during execution of loop-structured code (the program repeatedly passes over the same instructions until loop completion). Therefore, only a small load/store cache is necessary to contain most or all load/store instructions active within a loop. For most test programs used (see section 4.1), a load/store cache size of 256 entries proved to be sufficient (cf. figure 4). When loop nests are very deep, the basic block corresponding to the outermost loop may be large. Then, references present in outer loops may not benefit from prefetching because they cannot be kept in the load/store cache. However, the farther out these references are located, the smaller the fraction of the total loop nest memory requests they issue, so the loss is relatively negligible. For instance, it can be seen in figure 5 that for program FLO, doubling the load/store cache size (256 to 512) breeds a 19% increase of the load/store cache hit ratio which in turn brings a 0% decrease of the average memory access time. A corollary to the above remarks is that loop unrolling may be damaging to speculative prefetching by inducing too large basic blocks (an illustration is given after figure 6). For programs with large basic blocks, turning off loop unrolling may be worthwhile.

Load/store cache associativity The experiments done proved that associativity is not necessary for the load/store cache (cf. figure 6). This is due to the small size of basic blocks and consequently the small address distance between load/store instructions within a basic block.

Figure 6: Associativity of load/store cache.
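To illustrate the earlier remark on loop unrolling, the fragment below shows, under our own simplifying assumptions, how unrolling multiplies the number of static load/store instructions in a basic block; the array a, its length n, and the unrolling factor are hypothetical and not taken from the paper.

/* Rolled loop: a single static load and a single static store reference
   array a, so the loop occupies very few load/store cache entries. */
for (i = 0; i < n; i++)
    a[i] += 1.0;

/* Unrolled by 4: four static loads and four static stores now appear in
   the basic block, each occupying its own load/store cache entry.  With
   heavy unrolling, a deep loop body may exceed the load/store cache
   capacity, so some references no longer benefit from prefetching. */
for (i = 0; i + 3 < n; i += 4) {
    a[i]   += 1.0;
    a[i+1] += 1.0;
    a[i+2] += 1.0;
    a[i+3] += 1.0;
}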
