Software Assistance for Data Caches
O. Temam
PRiSM Laboratory, Versailles University, 78 Versailles, France
Email: [email protected]
N. Drach
LRI, Orsay University, 91 Orsay, France
Email: [email protected]
Abstract
Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of combined, though simple, software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial locality, two hardware modifications are proposed to support these two kinds of locality. Spatial locality is exploited by using large virtual cache lines, which do not exhibit the performance flaws of large physical cache lines. Temporal locality is exploited by minimizing cache pollution with a bypass mechanism that still allows spatial locality to be exploited. Subsequently, it is shown that simple software information on the spatial/temporal locality of array references, as provided by current data locality optimization algorithms, can be used to increase cache performance significantly. The performance and design tradeoffs of the proposed mechanisms are discussed. Software-assisted caches are also shown to provide very convenient support for further enhancement of data locality optimizations.
Keywords: software-assisted caches, data locality, numerical codes.
1 Introduction
This paper derives from several observations on application codes, cache designs and state-of-the-art compiler-optimizers. Let us first discuss the spatial and temporal locality properties of numerical codes. With respect to temporal reuse, figure 1a shows the reuse distance distribution of the traced memory references for the numerical benchmarks used in this paper (0 corresponds to data referenced only once). First, it appears that a sizable amount of data is used only once or very few times, so that techniques for hiding compulsory misses are required. It also appears that reuse distances are often larger than 1000 references, while for these same traces the average lifetime of a cache line in an 8-kbyte cache with a 32-byte cache line is approximately equal to 2500 references. So, for these codes the temporal reuse is likely to be disrupted by cache pollution. With respect to spatial reuse, figure 1b shows the average vector length of requests issued by load/store instructions.[1] This vector length often proves to be larger than the cache line size currently used in small on-chip caches (32 bytes). In other terms, there is

This work was supported by the Esprit Agency DG XIII under Grant No. APPARC 6634 BRA III.
[1] A vector sequence terminates when the instruction has not been used during more than 500 references (i.e., a value much smaller than the average lifetime of a cache line), or when the stride is greater than 32 bytes (i.e., the corresponding spatial locality would not be exploited with a cache line size of 32 bytes).
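The two trace measurements discussed above (reuse distance and vector length) can be illustrated with a minimal Python sketch. It is not the tool used to produce figure 1; the function names, the trace format (a list of addresses, or of hypothetical `(pc, address)` pairs), and several details are assumptions made for illustration. In particular, the sketch assumes that reuse distance is counted in references between consecutive accesses to the same address, that the stride test uses the absolute address difference, and that vector length is counted in references rather than bytes.

```python
from collections import defaultdict


def reuse_distances(addresses):
    """Return the reuse distances observed in a list of addresses.

    Assumption: the distance is the number of references separating two
    consecutive accesses to the same address. Addresses referenced only
    once contribute a distance of 0, following the convention of figure 1a.
    """
    last_seen = {}            # address -> index of its previous reference
    counts = defaultdict(int) # address -> number of references
    distances = []
    for t, addr in enumerate(addresses):
        if addr in last_seen:
            distances.append(t - last_seen[addr])
        last_seen[addr] = t
        counts[addr] += 1
    # Data referenced only once is recorded with distance 0.
    distances.extend(0 for c in counts.values() if c == 1)
    return distances


def vector_lengths(trace, max_gap=500, max_stride=32):
    """Return the lengths (in references) of per-instruction vector sequences.

    `trace` is a list of (pc, address) pairs, where `pc` identifies the
    load/store instruction (a hypothetical format). Following the footnote,
    a sequence ends when the instruction has not been used for more than
    `max_gap` references, or when the stride exceeds `max_stride` bytes.
    """
    state = {}    # pc -> (time of last use, last address, current length)
    lengths = []
    for t, (pc, addr) in enumerate(trace):
        if pc in state:
            last_t, last_addr, n = state[pc]
            if t - last_t > max_gap or abs(addr - last_addr) > max_stride:
                lengths.append(n)  # close the current sequence
                n = 0
            state[pc] = (t, addr, n + 1)
        else:
            state[pc] = (t, addr, 1)
    # Sequences still open at the end of the trace are closed here.
    lengths.extend(n for _, _, n in state.values())
    return lengths
```

For example, a single instruction that reads three consecutive 8-byte elements and then jumps to a distant array yields two sequences, the first of length 3. Averaging the returned lengths per benchmark would give a quantity comparable in spirit to the one plotted in figure 1b, under the assumptions stated above.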
[Figure 1 (axis label: Vector size)]