
Improving Last-Level Cache Performance by Exploiting the Concept of MRU-Tour

Alejandro Valero, Julio Sahuquillo, Salvador Petit, Pedro López, and José Duato
Department of Computer Engineering
Universitat Politècnica de València
Valencia, Spain
[email protected], {jsahuqui, spetit, plopez, jduato}@disca.upv.es

INTRODUCTION

Last-Level Caches (LLCs) implement the LRU algorithm to exploit temporal locality, but its performance falls far from that of Belady's optimal algorithm as the number of ways increases. One of the main reasons why LRU does not perform well in LLCs is that this policy forces a block to descend to the bottom of the stack before eviction, yet most of the blocks that leave the MRU position are never referenced again before being evicted. This work aims to select candidate blocks for eviction before they reach the bottom of the stack. To this end, it defines the number of MRU-Tours (MRUTs) of a block as the number of times the block enters the MRU position during its lifetime in the cache. Based on the fact that most blocks exhibit a single MRUT, this work presents a family of MRUT-based algorithms aimed at exploiting this behavior to improve performance.

PROPOSED APPROACH AND EXPERIMENTAL RESULTS

The number of MRUTs has been quantified according to whether the accessed blocks exhibit a single or multiple MRUTs at the time they are evicted. Figure 1 depicts the results for the SPEC2000 benchmark suite and a 128B/line 1MB-16way L2 cache under the LRU policy. It can be observed that blocks having a single MRUT dominate those having multiple MRUTs. The devised algorithms only need to track whether each line has had one or multiple MRUTs, thereby reducing hardware complexity. The simplest algorithm of the family (MRUT-1) requires only two status bits per cache line: the MRU-bit, which identifies the MRU block, and the MRUT-bit, which records whether the block has had a single or multiple MRUTs.
Figure 1. Number of replacements split into single and multiple MRUTs.

The policy works as follows. When a block is fetched into the cache, its associated MRUT-bit is reset to indicate that its first MRUT has started. When the block leaves the MRU position for the first time, it becomes a candidate for replacement. Then, if the same block is referenced again, it returns to the MRU position and its MRUT-bit is set to one to indicate that the block has had multiple MRUTs. The algorithm aims to prevent blocks with a single MRUT from staying in the cache. To keep the hardware simple, it randomly selects the victim among the blocks with a single MRUT, excluding the MRU block. If no block has its MRUT-bit cleared, all the blocks (except the MRU one) are candidates for eviction and the victim is selected at random.

Variants of the MRUT-1 policy that maintain recency information for a few blocks have also been studied. In this way, the MRUT-x family of algorithms stores the order of the last x referenced blocks, which are excluded from the set of eviction candidates. Notice that complexity is largely reduced with respect to the LRU policy, which keeps the order of all the blocks in the stack.

Figure 2 plots the MPKI of the LRU, DC-Bubble, and MRUT policies. The proposed MRUT-1 algorithm reduces MPKI on average by 4% and 15% compared to the recently proposed DC-Bubble algorithm and to LRU, respectively. On the other hand, we found that a value of x=3 provides the best results, meaning that, in general, the stack order beyond three positions is not important. The MRUT-3 policy achieves MPKI reductions of up to 4%, 8%, and 19% compared to MRUT-1, DC-Bubble, and LRU, respectively.

ACKNOWLEDGMENTS

This work was supported by the Spanish Ministerio de Ciencia e Innovación (MICINN), jointly financed with Plan E funds under Grant TIN2009-14475-C04-01, and by Consolider-Ingenio 2010 under Grant CSD2006-00046.
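The MRUT-1 policy described above can be sketched as a software model of a single cache set. This is a minimal, illustrative sketch, not the paper's hardware design: the class and field names (`MRUT1Set`, `mru`, `mrut`) are assumptions chosen for clarity, and the two booleans stand in for the two per-line status bits.

```python
import random

class MRUT1Set:
    """Software sketch of one cache set under the MRUT-1 policy.

    Each line carries two status bits: 'mru' (is this the MRU block?)
    and 'mrut' (has the block had multiple MRU-Tours?).
    """

    def __init__(self, ways):
        self.ways = ways
        self.lines = []  # each line: {"tag": ..., "mru": bool, "mrut": bool}

    def _set_mru(self, line):
        # Only one block per set holds the MRU position.
        for l in self.lines:
            l["mru"] = False
        line["mru"] = True

    def _victim(self):
        # Prefer a random block with a single MRUT (MRUT-bit clear),
        # never the MRU block; if none exists, any non-MRU block.
        single = [l for l in self.lines if not l["mrut"] and not l["mru"]]
        pool = single if single else [l for l in self.lines if not l["mru"]]
        return random.choice(pool)

    def access(self, tag):
        """Returns True on a hit, False on a miss (evicting if full)."""
        for line in self.lines:
            if line["tag"] == tag:
                if not line["mru"]:
                    # The block re-enters the MRU position after having
                    # left it: it now has multiple MRUTs.
                    line["mrut"] = True
                self._set_mru(line)
                return True
        # Miss: fetch the block, evicting a victim if the set is full.
        if len(self.lines) == self.ways:
            self.lines.remove(self._victim())
        new = {"tag": tag, "mru": False, "mrut": False}  # MRUT-bit reset
        self.lines.append(new)
        self._set_mru(new)
        return False
```

For example, in a 2-way set with the access stream A, B, A, C, block A gains a second MRUT on its re-reference, so the miss on C evicts B, the only non-MRU block whose MRUT-bit is still clear.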

Figure 2. MPKI of the LRU, DC-Bubble, and MRUT algorithms. Applications with an MPKI lower than one or a percentage of compulsory misses higher than 75% have been omitted.
