Reducing Latency in an SRAM/DRAM Cache Hierarchy via a Novel Tag-Cache Architecture

Fazal Hameed, Lars Bauer, and Jörg Henkel
Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
{hameed, lars.bauer, henkel}@kit.edu
Abstract—Memory speed has become a major performance bottleneck as more and more cores are integrated on a multi-core chip. The widening latency gap between high-speed cores and memory has led to the evolution of multi-level SRAM/DRAM cache hierarchies that exploit the latency benefits of smaller caches (e.g. private L1 and L2 SRAM caches) and the capacity benefits of larger caches (e.g. shared L3 SRAM and shared L4 DRAM cache). The main problem of employing large L3/L4 caches is their high tag lookup latency. To solve this problem, we introduce the novel concept of small, low-latency SRAM/DRAM Tag-Cache structures that quickly determine whether an access to the large L3/L4 caches will be a hit or a miss. The performance of the proposed Tag-Cache architecture depends upon the Tag-Cache hit rate; to improve it, we propose a novel Tag-Cache insertion policy and a DRAM row buffer mapping policy that reduce the latency of memory requests. For a 16-core system, this improves the average harmonic mean instructions-per-cycle throughput of latency-sensitive applications by 13.3% compared to state-of-the-art.

I. INTRODUCTION

Future multi-core systems are expected to integrate more and more cores on a single chip, increasing the aggregate demand for off-chip memory and exacerbating the main memory access latency relative to the increasing core speed. State-of-the-art multi-core systems rely on memory latency hiding techniques such as multi-level caches [1] and using DRAM as the Last-Level Cache (LLC) [2-7]. Multi-level caches have been used in commercial processors; for instance, the Intel Xeon E5-2690 processor introduced in 2012 employs a three-level cache hierarchy. In these hierarchies, fast and small L1 and L2 caches are typically dedicated to a particular core to satisfy the core's need for low latency. The larger L3 cache is shared among all cores to satisfy their need for a reduced miss rate. To further mitigate the core-memory speed gap, recent research in industry [2] and academia [3-7] has introduced on-chip DRAM as an L4 cache between the L3 SRAM cache and main memory. The primary reason for employing an on-chip L4 DRAM cache is that it provides increased cache capacity due to its high density compared to SRAM [4, 5]. At the same time, it provides faster on-chip communication via high-bandwidth, low-latency interconnects compared to off-chip memory [8, 9]. Fig. 1 shows a typical SRAM/DRAM cache organization.
[Fig. 1: SRAM/DRAM cache hierarchy for an N-core system; see Section IV and Table I for timing parameters and cache sizes. Each core has private L1 (2 cycles) and L2 (5 cycles) caches. The shared L3 SRAM cache consists of the SRAM Tag-Cache (STC, 2 cycles) with its insertion policy, the SRAM tag array (STA, 15 cycles), and the SRAM data array (SDA, 20 cycles). The shared L4 DRAM cache (D$) is accessed via the DRAM Tag-Cache (DTC, 2 cycles) with its insertion policy and the MissMap (12 cycles); the D$ controller applies the DRAM row buffer mapping policy, and each D$ bank contains a DRAM array with a row buffer. MissMap misses bypass the D$ and go directly to the main memory (MM) controller; the D$ and MM are connected via separate command and data buses.]
A primary design challenge in architecting a large L4 DRAM cache is the design of the tag store mechanism [4, 5, 7], i.e. where to store the tags for the large L4 DRAM cache and how to access them. For instance, a 512 MB DRAM cache requires a tag store of 48 MB, assuming 6 bytes per tag entry [4, 5]. The tags can be stored in SRAM (the Tags-In-SRAM approach), which provides fast access to them but comes with a huge area overhead. Alternatively, they can be stored in the DRAM cache itself (the Tags-In-DRAM approach), but that significantly increases the access latency, as every L3 miss then requires a high-latency L4 DRAM cache access for tag lookup to determine whether it is an L4 hit or miss. State-of-the-art designs use the Tags-In-DRAM approach along with a low-overhead on-chip SRAM structure, the MissMap (for a 512 MB DRAM cache it requires 3 MB instead of the 48 MB for the tags), that precisely determines whether an access to the L4 DRAM cache will be a hit or a miss [3-5]. If the MissMap identifies a miss, the request bypasses the DRAM cache and is sent directly to the main memory controller (see Fig. 1). This significantly reduces the access latency and the load on the DRAM cache controller. However, if the MissMap identifies a hit, the request still needs to access the DRAM cache twice: once for the tag lookup and once for the data.
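To make the tag storage overhead concrete, the following back-of-the-envelope sketch reproduces the 48 MB tag-store figure quoted above; the 64-byte cache line size is an assumption taken from the row organization in Section II, and the constant names are ours.

```python
# Sketch: tag storage needed for a 512 MB L4 DRAM cache with per-line tags.
CACHE_SIZE = 512 * 1024 * 1024   # 512 MB L4 DRAM cache
LINE_SIZE  = 64                  # bytes per cache line (assumed, see Section II)
TAG_ENTRY  = 6                   # bytes per tag entry [4, 5]

num_lines = CACHE_SIZE // LINE_SIZE            # 8 M cache lines
tag_store = num_lines * TAG_ENTRY              # 48 MB of tag storage
print(tag_store / (1024 * 1024), "MB")         # -> 48.0 MB (vs. 3 MB MissMap)
```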
Summarizing: while significant performance improvements can be achieved by using state-of-the-art SRAM/DRAM cache hierarchies instead of a traditional three-level SRAM cache hierarchy, they incur high L3 and L4 lookup latencies. This paper improves the performance of the SRAM/DRAM cache hierarchy for multi-core systems by reducing the lookup latencies of both the L3 and the L4 cache. We make the following new contributions:
1. We propose small, low-latency SRAM structures, namely the DRAM Tag-Cache (DTC, details in Section III.A) and the SRAM Tag-Cache (STC, details in Section III.B). The STC and DTC hold the tags of the sets that were recently accessed in L3 and L4, respectively. They provide fast lookup because, on a Tag-Cache hit, they quickly identify hit/miss for the larger caches and provide tag-related information (e.g. the location of the data within the cache). In contrast, state-of-the-art SRAM/DRAM cache hierarchies [3, 4, 6] incur high L3 and L4 tag latencies to identify the hit/miss before the request can be sent to the next level.
2. We found that some applications exhibit limited spatial locality, which leads to reduced Tag-Cache hit rates when they run concurrently with applications that have high spatial locality. To overcome this problem, we propose an adaptive Tag-Cache insertion policy (details in Section III.C) that identifies and restricts the number of useless insertions into the Tag-Caches, which increases Tag-Cache hit rates. The proposed insertion policy decides at runtime whether the tags of recently accessed sets shall be inserted into the Tag-Cache or not.
3. To further exploit the latency benefits of the proposed Tag-Cache architecture, we propose a novel DRAM row buffer mapping policy (details in Section III.D) that improves DTC hit rates and performance compared to the row buffer mapping policy proposed in [3].
II. BACKGROUND AND STATE-OF-THE-ART
A typical DRAM cache organization is shown in Fig. 1. A DRAM cache consists of multiple banks, where each bank is arranged into rows and columns of DRAM cells, called the DRAM array. Each DRAM bank provides a row buffer (see Fig. 1) that consists of SRAM cells and buffers one row of the DRAM bank (typically 2 to 8 KB). Data in a DRAM bank can only be accessed after it has been fetched into the row buffer. Any subsequent access to the same row (a so-called row buffer hit) bypasses the DRAM array access, and the data is read directly from the row buffer. Such row buffer locality reduces the access latency compared to accessing the DRAM array. This paper assumes a 4 KB row size for its qualitative and quantitative comparisons; however, the proposed concepts can also be applied to other row sizes.
Fig. 2: DRAM cache row organization used by LAMOST for a 4 KB (4096-byte) row size and a 16-core system; each 6-byte tag entry comprises a valid bit, a dirty bit, the L4 tag (28 bits), coherence state (16 bits), and LRU information (3 bits)

Recent research has proposed various DRAM cache organizations [3, 4, 6]. The concept proposed in this paper can be applied on top of other DRAM cache organizations; we perform a detailed analysis and comparison with the most recently proposed organization, namely LAMOST [3]. In LAMOST, each 4 KB row is divided into 8 cache sets, where each cache set consists of one tag block (64 bytes) and 7 cache lines, as shown in Fig. 2. The 7 cache lines need 7 × 6 = 42 bytes for their tag entries, with 22 bytes of the tag block left unused. A DRAM cache access must first read the tag block before accessing the cache line. After an L4 hit is detected by the MissMap, the row buffer is reserved until both the tag block and the cache line have been read from it. This guarantees a row buffer hit for the cache line access after the tag block is accessed.
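As a quick sanity check of this layout, the following sketch (constant names are ours) verifies that 8 sets of one 64-byte tag block plus 7 cache lines exactly fill a 4 KB row.

```python
# Sketch of the LAMOST row layout: 8 sets per 4 KB row,
# each set = 1 tag block (64 B) + 7 cache lines (64 B each).
LINE_SIZE     = 64
SETS_PER_ROW  = 8
LINES_PER_SET = 7   # 7-way associative sets

set_bytes = LINE_SIZE + LINES_PER_SET * LINE_SIZE   # tag block + data = 512 B
row_bytes = SETS_PER_ROW * set_bytes                # 8 x 512 B = 4096 B
assert row_bytes == 4096                             # exactly one 4 KB row
```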
Fig. 3: Row buffer access latencies for LAMOST after an L4 read hit: (a) row buffer hit (53 cycles), (b) row buffer miss (71 cycles); see Section IV for details of the DRAM cache timing parameters

Fig. 3 shows the row buffer hit and miss latencies of LAMOST for an L4 hit after an L3 miss. LAMOST requires 12 cycles to access the MissMap. If the L4 hit also hits in the row buffer (i.e. tag and cache line reside in the row buffer), LAMOST requires 18 cycles for CAS (to access the tag block from the row buffer), 2 cycles to transfer the tag block (64 bytes) over the 32-byte-wide DRAM cache bus, 1 cycle for the tag check, 18 cycles for CAS (to access the cache line from the row buffer), and 2 cycles to transfer the cache line (64 bytes). CAS is the delay between the moment the DRAM cache controller requests the DRAM cache to access a particular block and the moment the block is available on the DRAM bus. The DRAM cache in the example contains 64 row buffers (one per bank), and each 4 KB row buffer contains 64 blocks; thus, CAS includes the communication delay from a particular DRAM bank to the DRAM bus. If the data is not located in the row buffer (row buffer miss, i.e. the data resides in the DRAM array), an additional 18 cycles for ACT (row activation) are required compared to the row buffer hit latency. The tag latency in LAMOST for an L4 read hit is 33 cycles, as shown in Fig. 3.
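The following sketch simply adds up these published component latencies; the constant names are ours, the cycle counts are taken from the text and Table I.

```python
# Sketch: LAMOST L4 read-hit latency from the component latencies above.
MISSMAP = 12   # MissMap lookup
CAS     = 18   # column access (tag block or cache line)
BUS     = 2    # transfer of one 64-byte block over the 32-byte-wide bus
TAG_CMP = 1    # tag check
ACT     = 18   # row activation on a row buffer miss

tag_latency   = MISSMAP + CAS + BUS + TAG_CMP     # 33 cycles (Fig. 3)
rb_hit_total  = tag_latency + CAS + BUS           # 53 cycles, row buffer hit
rb_miss_total = rb_hit_total + ACT                # 71 cycles, row buffer miss
print(tag_latency, rb_hit_total, rb_miss_total)   # -> 33 53 71
```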
Fig. 4: L4 DRAM row buffer mapping policy of the LAMOST cache organization for a 4 KB row size [3]; the 48-bit main memory block address is split into the L4-Tag (28 bits), the Row-Id (17 bits, selecting one of B = 64 banks and R = 2048 rows per bank), and a 3-bit set-id (S = 8 sets per row); RB-i denotes the row buffer associated with bank i

Each bank of the DRAM cache is associated with a row buffer that holds the last accessed row of that bank, and Fig. 4 illustrates how LAMOST maps main memory blocks to the row buffers. The DRAM cache row number within a bank (the "Row#" field) is determined by the main memory block address. A high row buffer hit rate can effectively amortize the high cost of a DRAM array access by reducing the hit latency. LAMOST exploits row buffer locality by mapping 8 spatially close blocks to the same row buffer. For instance, memory blocks 0 to 7 are mapped to RB-0 and memory blocks 8 to 15 are mapped to RB-1. The DRAM cache set number within a row (each DRAM row contains 8 cache sets, as shown in Fig. 2) is determined by the three least significant bits of the memory block address, as illustrated in Fig. 4. For instance, main memory block 8 is mapped to set 0 and main memory block 15 is mapped to set 7 within RB-1.
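A minimal sketch of this address decomposition, assuming the field widths from Fig. 4 (64 banks, 2048 rows per bank, 8 sets per row); the function name is ours, and the exact interleaving of bank and row bits inside the Row-Id field is an assumption consistent with the block-0..15 example above.

```python
# Sketch: LAMOST mapping of a main memory block address to (bank, row, set).
BANKS, ROWS_PER_BANK, SETS_PER_ROW = 64, 2048, 8

def lamost_map(block_addr):
    set_id = block_addr % SETS_PER_ROW            # 3 LSBs select the set
    row_id = block_addr // SETS_PER_ROW           # 17-bit Row-Id field
    bank   = row_id % BANKS                       # low Row-Id bits: bank (assumed)
    row    = (row_id // BANKS) % ROWS_PER_BANK    # remaining bits: row within bank
    return bank, row, set_id

# Blocks 0..7 share one row buffer (RB-0); block 8 maps to set 0, block 15 to set 7.
assert {lamost_map(b)[:2] for b in range(8)} == {lamost_map(0)[:2]}
assert lamost_map(8)[2] == 0 and lamost_map(15)[2] == 7
```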
III. OUR SRAM/DRAM CACHE ORGANIZATION

Fig. 1 shows our SRAM/DRAM cache organization along with a MissMap, highlighting our novel contributions. Similar to state-of-the-art [3-5, 10], our approach stores the tags in the DRAM cache and employs the MissMap to identify DRAM cache hits/misses. Our DRAM row organization is similar to that of LAMOST [3], as shown in Fig. 2. The novel contributions of our cache architecture are explained in the following subsections. This work reduces the latency of memory requests via novel Tag-Caches (Sections III.A and III.B). It further reduces the latency by improving the Tag-Cache hit rates via a novel adaptive Tag-Cache insertion policy (Section III.C) and an L4 DRAM row buffer mapping policy (Section III.D).
A. DRAM Tag-Cache (DTC) Organization

The tag latency for an L4 cache hit in LAMOST [3] is 33 cycles, as shown in Fig. 3. To reduce it, we add a small, low-latency on-chip SRAM structure named the DRAM Tag-Cache (DTC) that holds the tags of recently accessed rows of the DRAM cache. The integration of the DTC into the cache hierarchy is shown in Fig. 1, and Fig. 5 presents the DTC details. Note that the DTC only stores the tag blocks of recently accessed rows and does not contain any cache data. The DTC has a fast access latency due to its small size. It is accessed right after an L3 miss and, in case of a DTC hit, it reduces the total access latency because it avoids the high-latency MissMap access to identify the L4 hit/miss and it avoids reading the tag block from the DRAM cache.
Fig. 6 shows the steps involved after an L3 miss in our proposed SRAM/DRAM cache architecture. At first, the DTC is accessed. If it hits, the tags from the DTC are used to identify an L4 DRAM cache hit/miss and to identify the location of the cache line (Tag-Compare at the bottom right of Fig. 5). An L4 DRAM cache hit (i.e. a tag matches the incoming cache line tag) then requires only a data access to the DRAM cache; an L4 DRAM cache miss requires a main memory access to get the data. If the DTC misses, the MissMap is queried to identify the L4 hit/miss. Note that our DTC only caches the most recently accessed tags from the L4 cache, whereas the MissMap exactly determines an L4 hit/miss for every access but cannot store the tags for all its entries. After a DTC miss, the Tag-Cache insertion policy (explained in Section III.C) decides whether all tag blocks corresponding to the accessed DRAM row shall be inserted into the DTC or not. The proposed insertion policy exploits the spatial locality of concurrently running applications, i.e. that adjacent tag blocks will most likely be accessed in the near future. The row buffer hit latency is reduced for a DTC hit (Fig. 7-a) compared to a DTC miss (Fig. 7-b). Similarly, the row buffer miss latency is also reduced for a DTC hit (Fig. 7-c) compared to a DTC miss (Fig. 7-d).
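A hedged sketch of this lookup flow (Fig. 6); the function and structure names are ours, and the bookkeeping is simplified to the decisions described in the text.

```python
# Sketch of the L4 lookup flow after an L3 miss (Fig. 6), simplified.
def handle_l3_miss(addr, dtc, missmap, dram_cache, memory, insertion_policy):
    if dtc.lookup(addr):                        # DTC hit: tags available on chip
        if dtc.tag_match(addr):                 # L4 hit: read only the data
            return dram_cache.read_line(addr)
        return memory.read_line(addr)           # L4 miss: go to main memory
    # DTC miss: the MissMap decides whether the L4 cache holds the line at all.
    if missmap.is_hit(addr):
        line = dram_cache.read_tag_block_and_line(addr)
        if insertion_policy.should_insert(addr):    # adaptive policy, Sec. III.C
            dtc.insert(dram_cache.read_all_tag_blocks(addr))
        return line
    return memory.read_line(addr)               # L4 miss: bypass the DRAM cache
```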
[Fig. 5: DTC organization. The DTC has 32 sets and 4-way associativity; each entry holds a valid bit, a dirty bit, a 12-bit Row-Tag, 2 bits of LRU information, and the 8 tag blocks of one DRAM row (42 × 8 = 336 bytes). The main memory block address is split into the L4-Tag (28 bits), the DTC-Row-Tag (12 bits), the DTC-Index (5 bits), and the Set-Id (3 bits); on a DTC hit, the tag compare on the selected tag block yields the L4 hit/miss and the location of the cache line.]

[Fig. 6: DTC and MissMap lookup following an L3 SRAM miss. On a DTC hit, the tag compare determines the L4 hit/miss; on a DTC miss, the MissMap is queried. On an L4 hit, the relevant tag block and the cache line are read from the DRAM cache and, if the insertion policy allows it, all tag blocks of the row are read and inserted into the DTC. On an L4 miss, the cache line is read from memory and inserted into the DRAM cache, the MissMap is updated, and the DTC is updated on a DTC hit.]
Fig. 5 shows the DTC organization with 32 sets and 4-way associativity, where the data payload of each entry contains the tag blocks of a particular Row-Id. On a DTC access, the DTC-Index field is used to index a DTC set in the "DTC Row-Id" array. All 4 Row-Tag entries within that DTC set are then compared to the DTC-Row-Tag field of the main memory block address to identify a DTC hit/miss.
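A minimal sketch of the DTC index and tag compare just described; the bit widths follow Fig. 5, and the data-structure layout is ours.

```python
# Sketch: DTC lookup for a 48-bit main memory block address (Fig. 5).
# Address layout (high to low): L4-Tag (28) | DTC-Row-Tag (12) | DTC-Index (5) | Set-Id (3)
DTC_SETS, DTC_WAYS = 32, 4

def split_address(block_addr):
    set_id    = block_addr & 0x7            # 3 bits: set within the DRAM row
    dtc_index = (block_addr >> 3) & 0x1F    # 5 bits: DTC set index
    row_tag   = (block_addr >> 8) & 0xFFF   # 12 bits: DTC-Row-Tag
    return row_tag, dtc_index, set_id

def dtc_lookup(dtc_sets, block_addr):
    """dtc_sets: list of DTC_SETS lists, each holding up to DTC_WAYS
    (row_tag, tag_blocks) entries. Returns the tag blocks on a hit, else None."""
    row_tag, dtc_index, _ = split_address(block_addr)
    for entry_tag, tag_blocks in dtc_sets[dtc_index]:
        if entry_tag == row_tag:            # DTC hit
            return tag_blocks
    return None                             # DTC miss
```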
[Fig. 7: L4 DRAM cache read-hit latencies (tCAS = 18 cycles, tbus = 2 cycles, tCMP = 1 cycle): row buffer hit with (a) DTC hit (22 cycles) and (b) DTC miss (54 cycles); row buffer miss with (c) DTC hit (40 cycles) and (d) DTC miss (72 cycles).]
Fig. 8: Timing and sequence of commands to fill a DTC entry following a DTC miss for a cache line (CL) that belongs to Set-1

The inclusion of the DTC requires an efficient DRAM cache controller implementation in order to insert 8 tag blocks into the DTC after a DTC miss. In our controller implementation, a read operation is first performed for the requested cache line, and then the non-requested tag blocks are read from the DRAM cache and filled into the DTC. Fig. 8 shows an example of the sequence of commands to fill the DTC after a DTC miss with low latency overhead. Assume there is a DRAM cache read hit for a cache line that belongs to Set-1. The controller issues a read request for tag block T1 (the tag block associated with Set-1), which indicates the location of the cache line in Set-1. The requested cache line is read from the DRAM cache and forwarded to the requesting core, followed by an update of tag block T1 (e.g. updating the LRU information). After that, subsequent read commands are issued to access the remaining tag blocks (i.e. T0, T2, T3, T4, T5, T6, and T7), which are filled into the DTC. All of these operations are performed on the row buffer. In our controller implementation, the non-requested tag block transfers are performed off the critical path so that they do not affect the latency of the demand request. The extra bus latency incurred to read the remaining 7 tag blocks is 14 cycles. This latency overhead of reading 7 additional blocks (to fill the DTC) is compensated by future DTC hits, exploiting the fact that adjacent tag blocks will be accessed in the near future (see Section IV for the evaluation).
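The command sequence of Fig. 8 can be sketched as follows; the commands and cycle counts are taken from the figure and text, while the list-based abstraction is ours.

```python
# Sketch: commands issued by the DRAM cache controller to fill a DTC entry
# after a DTC miss, for a demand access that maps to Set-1 (Fig. 8).
T_BUS = 2   # bus cycles per 64-byte block on the 32-byte-wide bus

demand_path = ["RD T1",   # read the tag block of Set-1 (gives the line location)
               "RD CL",   # read and forward the requested cache line
               "WR T1"]   # write back the updated tag block (e.g. LRU bits)

fill_path = ["RD T%d" % i for i in (0, 2, 3, 4, 5, 6, 7)]   # off the critical path

extra_bus_cycles = len(fill_path) * T_BUS    # 7 x 2 = 14 extra bus cycles
print(demand_path + fill_path, extra_bus_cycles)
```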
B. SRAM Tag-Cache (STC) Organization

A large SRAM tag array (e.g. the L3 SRAM cache requires a tag storage of ~1.7 MB) is composed of multiple banks [11], where each bank consists of multiple sub-banks, with one sub-bank activated per access, as shown in Fig. 9. Each sub-bank is composed of multiple identical mats, and all mats of a sub-bank are activated per access. Each mat is an array of SRAM cells with associated peripheral circuitry, and each row in a mat contains the tags of one cache set. State-of-the-art SRAM/DRAM cache organizations [3-6, 10] always read the tags from this large L3 SRAM tag array (see Fig. 9), which incurs a high L3 tag latency. To reduce the L3 tag latency, we add a small SRAM Tag-Cache (STC) with 32 sets and 4-way associativity that identifies L3 hits/misses in a single cycle. The STC organization is similar to the DTC organization (see Fig. 5), except that it holds the tags of the 8 adjacent sets (i.e. belonging to the same row of the 8 mats) that were recently accessed in the L3 SRAM tag array. Entries in the STC are indexed by the STC-Index field of the main memory block address, as shown in Fig. 9.
Fig. 9: Layout of a large L3 SRAM tag array [11]: 8 banks, each with 8 sub-banks of 8 mats (64 rows per mat); the 48-bit memory block address is split into the L3-Tag (33 bits), sub-bank number (3 bits), row number (6 bits), bank number (3 bits), and set/mat number (3 bits), from which the STC-Row-Tag and the 5-bit STC-Index are derived
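A sketch of the L3 address decomposition shown in Fig. 9; the field widths follow the figure, the field ordering (set/mat bits as the least significant bits) is an assumption, and the helper name is ours.

```python
# Sketch: field extraction from the 48-bit block address for the L3 tag array
# (Fig. 9): L3-Tag (33) | sub-bank (3) | row (6) | bank (3) | set/mat (3).
def split_l3_address(block_addr):
    set_mat  = block_addr & 0x7          # 3 bits: mat (set within the row), assumed LSBs
    bank     = (block_addr >> 3) & 0x7   # 3 bits: bank
    row      = (block_addr >> 6) & 0x3F  # 6 bits: row inside the mat
    sub_bank = (block_addr >> 12) & 0x7  # 3 bits: sub-bank
    l3_tag   = block_addr >> 15          # remaining 33 bits: L3 tag
    return l3_tag, sub_bank, row, bank, set_mat

# The 5-bit STC-Index is drawn from the low-order index bits of this split.
```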
C. Adaptive Tag-Cache Insertion Policy

We found that some applications exhibit limited spatial locality, with a majority of useless Tag-Cache insertions (tag blocks inserted into a Tag-Cache but not used before they are evicted) under a fixed Tag-Cache insertion policy (i.e. always inserting the extra tag blocks into the Tag-Cache after a Tag-Cache miss), which reduces the Tag-Cache hit rate. To overcome this problem, we propose an application-aware adaptive Tag-Cache insertion policy that aims at restricting the number of useless insertions into the Tag-Cache. These useless insertions unnecessarily consume bandwidth (e.g. the extra number of bus cycles required for a DTC insertion is 14 cycles, as shown in Fig. 8, while the extra number of cycles for an STC insertion is 7 cycles) without improving Tag-Cache hit rates or performance. For brevity, we focus on explaining the operation of the adaptive insertion policy for the DTC; the working principle for the STC is similar. The adaptive DTC insertion policy uses an insertion vector for an N-core system, where Ii denotes the DTC insertion policy for core i. If Ii is 1, then core i is allowed to insert the extra tag blocks into the DTC, and the extra tag blocks are read from the DRAM cache (see Fig. 8).
If Ii is 0, then core i does not read the extra tag blocks from the DRAM cache because they are not inserted into the DTC. Ii is determined by the DTC monitor shown in Fig. 10, where each core is associated with a 7-bit access counter and a 7-bit miss counter. The DTC monitor (Fig. 10) is a separate cache structure that only contains the recently accessed DTC-Row-Tag fields and does not contain the tag blocks. Whenever there is an access to L4 from core i, the requesting address and the core-id are forwarded to the DTC monitor, which reports a monitor hit/miss (see the right part of Fig. 10). On a monitor miss, the miss and access counters associated with core i are incremented, and the DTC-Row-Tag field of the incoming memory block address is inserted into the DTC monitor array. On a monitor hit, only the access counter associated with core i is incremented. The DTC monitor thus tracks the DTC miss rate of each core using the access/miss counters. When the access counter associated with core i reaches its maximum value (i.e. 127), the DTC miss rate of core i (Miss-counter-i / Access-counter-i) is used to determine Ii: if the miss rate is greater than a certain threshold thr (0.5 in our case), Ii is set to 0, otherwise it is set to 1. After that, the miss and access counters associated with core i are halved. This allows the DTC monitor to retain past history while giving more weight to future DTC access behavior.
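A minimal sketch of the per-core monitor update and insertion decision described above; the counter widths and threshold follow the text, while the class layout (and the simplified monitor array without evictions) is ours.

```python
# Sketch: adaptive DTC insertion decision using per-core access/miss counters.
COUNTER_MAX = 127   # 7-bit counters
THRESHOLD   = 0.5   # miss-rate threshold 'thr'

class DTCMonitor:
    def __init__(self, num_cores):
        self.access = [0] * num_cores
        self.miss   = [0] * num_cores
        self.insert = [1] * num_cores     # insertion vector I_i (1 = insert)
        self.row_tags = set()             # recently seen DTC-Row-Tag fields

    def on_l4_access(self, core, row_tag):
        self.access[core] += 1
        if row_tag not in self.row_tags:  # monitor miss
            self.miss[core] += 1
            self.row_tags.add(row_tag)    # insert the tag (eviction omitted here)
        if self.access[core] >= COUNTER_MAX:
            miss_rate = self.miss[core] / self.access[core]
            self.insert[core] = 0 if miss_rate > THRESHOLD else 1
            self.access[core] //= 2       # halve both counters: retain history,
            self.miss[core]   //= 2       # but weight recent behavior more
        return self.insert[core]
```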
Fig. 10: DTC monitor for estimating per-core DTC miss rates for an N-core system
D. DRAM Row Buffer Mapping Policy

The DRAM row buffer mapping policy is the method by which memory blocks are mapped to a particular DRAM row buffer. We base our row buffer mapping policy on the observation that the latency of the DRAM cache can be reduced by improving the row buffer and DTC hit rates. The difference between the DRAM row buffer mapping policy employed in LAMOST [3] and the one used in our proposal is shown in Fig. 4 and Fig. 11, respectively.
Fig. 11: Our L4 DRAM row buffer mapping policy; the 48-bit main memory block address is split (from most to least significant bits) into L4-Tag1, the 17-bit Row-Id, the 3-bit set-id, and log2(CM) bits forming L4-Tag2, where the 28-bit L4 tag is the concatenation {L4-Tag1, L4-Tag2}; CM is the number of consecutive main memory blocks mapped to the same set, and RB-i denotes the row buffer associated with bank i

To exploit spatial locality, the DRAM row buffer and DTC hit rates can be improved by mapping more consecutive main memory blocks to the same DRAM row buffer. Fig. 11 shows our DRAM row buffer mapping policy, where the DRAM row buffer and DTC hit rates depend upon the parameter CM (defined as the number of consecutive main memory blocks mapped to the same set). CM is chosen as a power of two. If CM is equal to 1, then 8 spatially close memory blocks are mapped to the same 4 KB row buffer, as in LAMOST [3]. If CM is equal to 2/4/8, then 16/32/64 spatially close memory blocks are mapped to the same 4 KB row buffer. However, the drawback of increasing CM is that it
increases the L4 miss rate due to reduced set-level parallelism (because higher-order address bits are used to select the L4 cache set). In the results section, we explore the impact of CM (1, 2, 4, and 8) on the overall performance.
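A sketch of this CM-based mapping, assuming the field layout reconstructed from Fig. 11 (the lowest log2(CM) address bits become part of the tag, so CM consecutive blocks share a set); the function name is ours.

```python
# Sketch: mapping with CM consecutive main memory blocks per set.
# For CM = 1 this degenerates to the LAMOST mapping (8 blocks per row buffer).
SETS_PER_ROW, BANKS, ROWS_PER_BANK = 8, 64, 2048

def cm_map(block_addr, cm):
    low_tag = block_addr % cm                     # log2(CM) bits join the tag
    set_id  = (block_addr // cm) % SETS_PER_ROW   # 3 bits select the set
    row_id  = block_addr // (cm * SETS_PER_ROW)   # 17-bit Row-Id
    bank    = row_id % BANKS                      # bank selection as in Fig. 4
    row     = (row_id // BANKS) % ROWS_PER_BANK
    return bank, row, set_id, low_tag

# With CM = 2, 16 consecutive blocks land in the same row buffer (bank, row).
assert len({cm_map(b, 2)[:2] for b in range(16)}) == 1
```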
E. Storage Overhead

Each DTC entry requires 1 valid bit, 1 dirty bit, 12 bits for the DTC-Row-Tag field, 2 bits of LRU information, and 336 bytes (42 × 8 bytes) for the 8 tag blocks; note that each tag block requires 42 bytes, as shown in Fig. 2. This leads to a storage overhead of 338 bytes per DTC entry, as illustrated in Fig. 5. Our DTC with 32 sets and 4-way associativity therefore requires ~43 KB of storage (32 × 4 × 338 bytes = 43264 bytes = 42.25 KB). The DTC monitor required for the DTC insertion policy adds 240 bytes (32 × 4 × 15 bits = 1920 bits) for the DTC monitor array and 28 bytes (14 × 16 = 224 bits) for the access and miss counters, as shown in Fig. 10. Similarly, our STC requires ~57 KB of storage and the STC monitor requires 272 bytes in total. The total storage overhead of our proposal is less than 0.1 MB, which is negligible compared to the large L3 SRAM cache (16 MB for the L3 data array) and the MissMap (3 MB required for DRAM cache hit/miss prediction).
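The DTC storage numbers above can be reproduced with a few lines; the constants mirror the text, and the STC figures are quoted rather than re-derived here.

```python
# Sketch: DTC storage overhead from the entry layout in Fig. 5.
SETS, WAYS      = 32, 4
TAG_BLOCK_BYTES = 42               # per-set tag storage (7 entries x 6 bytes)
BLOCKS_PER_ROW  = 8
META_BITS       = 1 + 1 + 12 + 2   # valid, dirty, DTC-Row-Tag, LRU

entry_bytes = BLOCKS_PER_ROW * TAG_BLOCK_BYTES + META_BITS // 8   # 338 bytes
dtc_bytes   = SETS * WAYS * entry_bytes                           # 43264 B ~ 42.25 KB

monitor_array_bytes   = SETS * WAYS * 15 // 8   # 1920 bits = 240 bytes
monitor_counter_bytes = 16 * (7 + 7) // 8       # 16 cores x 14 bits = 28 bytes
print(entry_bytes, dtc_bytes / 1024, monitor_array_bytes, monitor_counter_bytes)
```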
IV. EVALUATIONS
TABLE I: CORE, CACHE, AND MAIN MEMORY PARAMETERS

Core parameters:
- ROB/RS/LDQ/STQ size: 128/32/32/24
- Decode/commit width: 4/4
- Core frequency: 3.2 GHz

SRAM cache parameters:
- Private L1 cache: 32 KB, 8-way, 2 cycles
- Private L2 cache: 256 KB, 8-way, 5 cycles
- STC and DTC latency: 1 cycle for hit/miss + 1 cycle for tag check
- Shared L3 SRAM cache: 16 MB, 8-way, 15-cycle tag latency, 20-cycle data latency

DRAM cache parameters:
- MissMap: 3 MB, 12 cycles to determine L4 hit/miss
- DRAM cache size: 512 MB
- Number of DRAM banks: 64
- Number of channels: 4
- Bus width / bus frequency: 256 bits per channel / 1.6 GHz
- tRAS-tRCD-tRP-tCAS (cycles): 72-18-18-18

Main memory parameters:
- Number of channels: 2
- Bus width / bus frequency: 128 bits per channel / 800 MHz
- tRAS-tRCD-tRP-tCAS (cycles): 144-36-36-36
A. Experimental Set-up and Application Classification

We use a cycle-accurate x86 simulator [12] for evaluation. The configuration of the cores, caches, and memory system of the 16-core system is shown in Table I. We compute the latencies of the Tag-Caches and the L3 SRAM cache using CACTI v5.3 [11] for a 45 nm technology. We run various multi-programmed workloads from SPEC2006 [13], as shown in Table II. We classify the applications into latency-sensitive and memory-sensitive applications. Latency-sensitive applications are very sensitive to the latency incurred in the L3 and L4 caches; reducing their latency provides a noticeable improvement in performance. Memory-sensitive applications, on the other hand, benefit little from reducing the latency of the L3 and L4 caches. We compare our Tag-Cache architecture against the state-of-the-art DRAM cache organization LAMOST [3], which is explained in Section II. We make the following assumptions for the evaluation:
1. Similar to state-of-the-art [3-5, 10], we employ FR-FCFS (First-Ready First-Come First-Serve) access scheduling [14] in the DRAM cache and memory controllers.
2. We employ a state-of-the-art adaptive DRAM insertion policy [10] that decides at runtime whether an incoming line, when brought from off-chip memory, shall be inserted into the DRAM cache or not. Additionally, the line is always inserted into the L4 DRAM cache.
3. We assume four DRAM cache channels, and the DRAM cache bus width per channel is assumed to be 256 bits (32 bytes). Similar to state-of-the-art [3-5, 10], we assume that the DRAM cache timing latencies are approximately half of those of off-chip memory.
4. We assume that the tags are stored in the DRAM cache, similar to state-of-the-art [3-5, 10], while a 3 MB MissMap [4, 5] is employed to identify DRAM cache hits/misses.
TABLE II: APPLICATION MIXES. Latency-sensitive applications are shown in italic; the value in parentheses denotes the number of instances used for that particular application.

- Mix_01: astar.t(4), leslie3d.t(4), milc(4), soplex.r(4)
- Mix_02: astar.b(4), omnetpp(4), libquantum(4), lbm(4)
- Mix_03: astar.b(2), astar.t(2), leslie3d.t(2), omnetpp(2), libquantum(2), mcf(2), milc(1), soplex.r(1), lbm(1), leslie3d.ref(1)
- Mix_04: astar.t(4), omnetpp(4), mcf(4), leslie3d.ref(4)
- Mix_05: astar.b(3), leslie3d.t(3), libquantum(2), soplex.r(2), milc(3), lbm(3)
- Mix_06: astar.b(2), astar.t(2), leslie3d.t(3), omnetpp(2), mcf(2), lbm(2), leslie3d.ref(3)
- Mix_07: leslie3d.t(4), libquantum(4), soplex.r(4), leslie3d.ref(4)
B. Impact of Tag-Cache and Tag-Cache Insertion Policy

This section evaluates the performance benefits of incorporating the Tag-Caches (Sections III.A and III.B) and our Tag-Cache insertion policy (Section III.C). Fig. 12 shows the average normalized harmonic mean instructions-per-cycle (HM-IPC) throughput of latency-sensitive and memory-sensitive applications, with the speedup normalized to LAMOST [3]. On average, our proposed Tag-Cache architecture together with our Tag-Cache insertion policy improves the HM-IPC of latency-sensitive applications by 6.4% and of memory-sensitive applications by 1.4% compared to LAMOST [3].
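For reference, HM-IPC is assumed here to follow the standard harmonic-mean definition over the N applications of a mix:

HM-IPC = N / ( 1/IPC_1 + 1/IPC_2 + ... + 1/IPC_N )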
Fig. 12: Normalized harmonic mean instruction throughput (HM-IPC) speedup compared to LAMOST [3], averaged over all application mixes, for (a) latency-sensitive applications and (b) memory-sensitive applications, comparing the fixed and the adaptive Tag-Cache insertion policies

The performance of the proposed Tag-Cache architecture depends upon the 'useful Tag-Cache insertion rate', which is defined as the number of useful insertions (an insertion is useful if it receives at least one hit after insertion) divided by the total number of insertions. Our adaptive Tag-Cache insertion policy (details in Section III.C) improves the useful insertion rate by 9.6% for the SRAM Tag-Cache (STC) and by 22% for the DRAM Tag-Cache (DTC) compared to a fixed Tag-Cache insertion policy. Thus, by reducing the number of useless insertions, our adaptive Tag-Cache insertion policy provides an additional 1.8% performance improvement for latency-sensitive applications compared to a fixed Tag-Cache insertion policy.
C. Impact of L4 DRAM Row Buffer Mapping Policy
The performance of an L4 DRAM cache based multi-core system depends upon the DRAM row buffer hit rate (higher is better), the DTC hit rate (higher is better), and the L4 miss rate (lower is better). This section evaluates the impact of the parameter CM (the number of consecutive main memory blocks mapped to the same set; details in Section III.D) on these metrics and on the overall performance. For the rest of this paper, we consider the Tag-Cache architecture together with the adaptive Tag-Cache insertion policy for all values of CM. Fig. 13 shows the DRAM row buffer hit rate and the L4 miss rate for different values of CM. Increasing CM improves the DRAM row buffer hit rate (25.9% for CM = 1, 29.5% for CM = 2, 33.5% for CM = 4, and 37.4% for CM = 8), as shown in Fig. 13-(a). It also improves the DTC hit rate (62.3% for CM = 1, 73.8% for CM = 2, 78.8% for CM = 4, and 81.3% for CM = 8). However, the improvement in DRAM row buffer and DTC hit rates for high values of CM comes at the cost of an increased L4 miss rate (see Fig. 13-b).
Fig. 13: (a) DRAM row buffer hit rate and (b) L4 miss rate for different values of CM, averaged over all application mixes
Fig. 14: Normalized HM-IPC speedup compared to LAMOST [3], averaged over all application mixes, for different values of CM: (a) latency-sensitive (LS) applications, (b) memory-sensitive (MS) applications, (c) both LS and MS applications

Fig. 14 shows the impact of the parameter CM on performance in terms of the average normalized harmonic mean instructions-per-cycle (HM-IPC) throughput. On average, our L4 DRAM row buffer mapping policy together with our Tag-Cache architecture improves the HM-IPC (Fig. 14-a) of latency-sensitive applications by 6.4%/13.3%/8.9% for CM = 1/2/4, respectively, compared to LAMOST [3]. A higher value of CM improves the DRAM row buffer hit rate and the DTC hit rate, but it suppresses set-level parallelism, which results in an increased L4 miss rate. For instance, CM = 8 incurs a significant increase in L4 miss rate (38.8% compared to CM = 1) because it maps a large contiguous memory region (8 consecutive blocks) to a single set, which leads to many conflict misses due to the reduced set-level parallelism. For this reason, CM = 8 degrades the overall HM-IPC by 12.1% (Fig. 14-c) compared to LAMOST. Setting CM to 2 provides the best performance improvement because it improves the row buffer hit rate by 13.9% and the DTC hit rate by 18.4% compared to CM = 1, with only a slight increase in L4 miss rate (3%). For this reason, CM = 2 significantly improves the HM-IPC of latency-sensitive applications by 13.3% (Fig. 14-a), the HM-IPC of memory-sensitive applications by 2.4% (Fig. 14-b), and the overall HM-IPC by 6.9% (Fig. 14-c) compared to LAMOST.

V. CONCLUSIONS

This paper presents our novel DRAM Tag-Cache (DTC) architecture along with an adaptive DTC insertion policy for DRAM caches, which reduce the DRAM tag access latency via improved DTC hit rates. In addition, we propose a DRAM row buffer mapping policy that further improves performance via improved DRAM row buffer and DTC hit rates. We further applied the concepts of the Tag-Cache architecture and the Tag-Cache insertion policy on top of the SRAM cache. We have performed extensive evaluations and compared the performance of our approach with the state-of-the-art SRAM/DRAM cache hierarchy. For a 16-core system, our proposed approach (CM with a value of 2) along with our Tag-Cache architecture and adaptive Tag-Cache insertion policy improves the harmonic mean instructions-per-cycle throughput of latency-sensitive applications by 13.3%. Our detailed analysis shows that it is the combination of our Tag-Cache architecture, adaptive Tag-Cache insertion policy, and DRAM row buffer mapping policy that provides these performance improvements. They come at a small area overhead, which makes our approach generally applicable for a wide range of applications and architectures.

VI. ACKNOWLEDGEMENT

This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Center "Invasive Computing" (SFB/TR 89).
REFERENCES
[1] S. Borkar and A. A. Chien, "The Future of Microprocessors", Communications of the ACM, vol. 54, no. 5, pp. 67-77, May 2011.
[2] R. X. Arroyo, R. J. Harrington, S. P. Hartman, and T. Nguyen, "IBM POWER7 Systems", IBM Journal of Research and Development, vol. 55, no. 3, pp. 2:1-2:13, 2011.
[3] F. Hameed, L. Bauer, and J. Henkel, "Simultaneously Optimizing DRAM Cache Hit Latency and Miss Rate via Novel Set Mapping Policies", in International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2013.
[4] G. Loh and M. Hill, "Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches", in 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 454-464.
[5] G. Loh and M. Hill, "Supporting Very Large DRAM Caches with Compound Access Scheduling and MissMaps", in IEEE Micro Magazine, Special Issue on Top Picks in Computer Architecture Conferences, 2012.
[6] M. Qureshi and G. Loh, "Fundamental Latency Trade-offs in Architecting DRAM Caches", in 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2012, pp. 235-246.
[7] L. Zhao, R. Iyer, R. Illikkal, and D. Newell, "Exploring DRAM Cache Architecture for CMP Server Platforms", in 25th International Conference on Computer Design (ICCD), 2007, pp. 55-62.
[8] B. Black, M. Annavaram, N. Brekelbaum et al., "Die-Stacking (3D) Microarchitecture", in 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2006, pp. 469-479.
[9] Y. Deng and W. Maly, "Interconnect Characteristics of 2.5-D System Integration Scheme", in International Symposium on Physical Design (ISPD), 2001, pp. 171-175.
[10] F. Hameed, L. Bauer, and J. Henkel, "Adaptive Cache Management for a Combined SRAM and DRAM Cache Hierarchy for Multi-Cores", in Design, Automation and Test in Europe (DATE), 2013, pp. 77-82.
[11] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, "CACTI 5.1", HPL-2008-20, HP Labs, 2008.
[12] G. Loh, S. Subramaniam, and Y. Xie, "Zesto: A Cycle-Level Simulator for Highly Detailed Microarchitecture Exploration", in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[13] "Standard Performance Evaluation Corporation", http://www.spec.org.
[14] S. Rixner, W. Dally, U. Kapasi et al., "Memory Access Scheduling", in 27th International Symposium on Computer Architecture (ISCA), 2000, pp. 128-138.