Simultaneously Optimizing DRAM Cache Hit Latency and Miss Rate via Novel Set Mapping Policies
Fazal Hameed, Lars Bauer, and Jörg Henkel
Chair for Embedded Systems (CES), Karlsruhe Institute of Technology (KIT), Germany
{hameed, lars.bauer, henkel}@kit.edu
978-1-4799-1400-5/13/$31.00 ©2013 IEEE

ABSTRACT
Two key parameters that determine the performance of a DRAM cache based multi-core system are the DRAM cache hit latency (HL) and the DRAM cache miss rate (MR), as they strongly influence the average DRAM cache access latency. Recently proposed DRAM set mapping policies are optimized either for HL or for MR; none of them provides a good HL and a good MR at the same time. This paper presents a novel DRAM set mapping policy that targets both parameters simultaneously, with the goal of achieving the best of both to reduce the overall DRAM cache access latency. For a 16-core system, our proposed set mapping policy reduces the average DRAM cache access latency (which depends on HL and MR) by 29.3% and 12.1% compared to state-of-the-art DRAM set mapping policies that are optimized for HL and for MR, respectively.

1. INTRODUCTION AND MOTIVATION
A challenging problem in multi-core systems is the high access latency to the off-chip memory, caused by the significantly increasing processor-memory speed gap. To combat this problem, recent research employs on-chip DRAM caches [4, 5, 8, 9, 12] that reduce the number of off-chip accesses in order to maximize performance under limited off-chip memory bandwidth. DRAM caches are used both in industry (the IBM POWER7 uses a 32 MB last-level cache built in embedded DRAM technology [2]) and in academia [3, 5, 8, 9, 12] because they provide greater capacity than SRAM caches (~8× [3]), which leads to fewer off-chip accesses. A DRAM cache therefore plays a critical role in improving the performance of a multi-core system compared to an SRAM cache [5, 8, 9, 10, 12]. Typically, a DRAM cache is composed of multiple banks, where each bank consists of multiple rows. Each row may contain a single set [5, 9, 10] or multiple sets [12], where each set contains "A" memory blocks (also called cache lines), i.e. it has an associativity of A. In the context of DRAM caches, set mapping is the method by which memory blocks are mapped to a particular set of a particular row in a particular bank. The mapping method used in [5, 9, 10, 12] affects the DRAM cache hit latency and miss rate, which determine the efficiency of the DRAM cache. Recently proposed DRAM set mapping policies are biased towards either DRAM cache hit latency or DRAM cache miss rate. The policies proposed in [5, 9, 10] are optimized for DRAM cache miss rate by providing a high associativity. We call this policy the Miss rate Optimized SeT mapping (MOST) policy. The MOST policy has a serialized tag-and-data access and a reduced row buffer hit rate (evaluated in Section 6), which leads to an increased hit latency. Another DRAM set mapping policy [12] optimizes the DRAM cache hit latency by reducing the tag serialization latency and improving the row buffer hit rate (evaluated in Section 6). We call this policy the Latency Optimized SeT mapping (LOST) policy. The LOST policy reduces the DRAM cache hit latency at the cost of an increased DRAM cache miss rate because it employs a direct mapped cache.
Figure 1: (a) DRAM cache hit latency (b) DRAM cache miss rate (c) DRAM cache miss latency (d) Average DRAM cache access latency for 4/8/16-core systems

Figure 1 shows the DRAM cache hit latency, DRAM cache miss rate, DRAM cache miss latency, and average DRAM cache access latency of the state-of-the-art set mapping policies [5, 9, 10, 12] for 4/8/16-core systems. The parameters for the cores, caches, and off-chip memory are the same as in the experimental setup in Section 6.1 (see Table 1), with various multi-programmed workloads from SPEC2006 [1] listed in Table 2. On one extreme, the MOST policy [5, 9, 10] has a high DRAM cache hit latency compared to the LOST policy, as depicted in Figure 1-(a). On the other extreme, the LOST policy [12] has a high DRAM cache miss rate compared to the MOST policy, as depicted in Figure 1-(b). The higher DRAM cache miss rate of the LOST policy also leads to a higher DRAM cache miss latency compared to the MOST policy (Figure 1-c). The DRAM cache miss latency is not determined by the miss rate alone; it also increases with the number of applications running on the multi-core system due to increased contention in the memory controller.

The overall system performance is determined by the average DRAM cache access latency (AL), which depends upon the DRAM cache hit latency (HL), the DRAM cache miss rate (MR), and the DRAM cache miss latency (ML), as shown in Eq. (1).

AL = (1 − MR) × HL + MR × ML    (1)

Figure 1-(d) shows the average DRAM cache access latency of the state-of-the-art set mapping policies for 4/8/16-core systems. On one hand, the LOST policy reduces the average DRAM cache access latency by 16.7% compared to the MOST policy for a small-scale multi-core system (the 4-core system with its low DRAM cache miss rate). On the other hand, the MOST policy reduces the average DRAM cache access latency compared to the LOST policy (by 6.9% for the 8-core and 19.6% for the 16-core system) for medium-scale multi-core systems (the 8-core and 16-core systems with their high DRAM cache miss rates). This shows that, depending on the system size (number of cores), a DRAM set mapping policy optimized for a single parameter does not provide the best DRAM cache access latency. We argue that the DRAM cache must be designed to minimize both DRAM cache hit latency and DRAM cache miss rate at the same time. We make the following new contributions:
1. We propose a novel DRAM set mapping policy named LAMOST (Latency And Miss rate Optimized SeT mapping policy) after analyzing that single-faceted set mapping policies [5, 9, 10, 12] do not work efficiently because they are optimized for only a single parameter. LAMOST (details in Section 4.1) reduces DRAM cache hit latency and DRAM cache miss rate at the same time.
2. We additionally propose a DRAM set balancing policy named LAMOST-SB on top of our LAMOST policy after analyzing that multiple applications running on a multi-core system exhibit a non-uniform distribution of accesses across different DRAM sets. It exploits the imbalance across DRAM sets (details in Section 4.2) and reduces the DRAM cache miss rate by 5.7% compared to our LAMOST policy via improved DRAM cache set utilization.
3. Our proposed policies (LAMOST and LAMOST-SB) reduce the average DRAM cache access latency of applications that are not memory intensive (applications with a low DRAM cache miss rate) via a reduced DRAM cache hit latency, and they reduce the average DRAM cache access latency of memory intensive applications (applications with a high DRAM cache miss rate) via a reduced DRAM cache miss rate. For a 16-core system, our proposed LAMOST-SB policy reduces the average DRAM cache access latency by 12.1%, 29.3%, and 8.6% compared to the MOST policy, the LOST policy, and our LAMOST policy, respectively.

2. BACKGROUND
Figure 2 shows the cache architecture that is used in this work and described in the following. The DRAM cache is composed of DRAM banks, which consist of rows and columns of memory cells called the DRAM array. Data in a DRAM bank can only be accessed from the bank row buffer (see Figure 2), which holds the last accessed row of that bank. Any subsequent access to the same row (a row buffer hit) bypasses the DRAM array access, and the data is read directly from the row buffer. This reduces the access latency compared to actually accessing the DRAM array; the concept is referred to as row buffer locality [7, 14]. Data in the row buffer (row buffer hit) is accessed much faster than data residing in a different row of the DRAM bank (row buffer miss) [5, 7, 9, 10, 12].

Figure 2: DRAM cache architecture based on MissMap

A primary design consideration for the DRAM cache is the tag-store mechanism [9, 15]. A 256 MB DRAM cache can store 2^22 64-byte data blocks, which results in a tag overhead of 24 MB assuming 6 bytes per tag entry [9]. The tags could be stored in a separate SRAM tag array, which would eliminate the DRAM cache access if the tag array indicates a cache miss, but that incurs a significant area and latency overhead [9, 10]. To reduce the overhead, recent research provides an efficient and low-overhead SRAM-based structure named MissMap [9, 10] (2 MB overhead instead of 24 MB) that accurately determines whether an access to the DRAM cache will be a hit or a miss. If the MissMap indicates a hit (Figure 2), the request is sent to the DRAM cache controller. A MissMap miss (i.e. the data is not available in the DRAM cache) makes DRAM cache misses faster by eliminating the DRAM access before sending the request to main memory (Figure 2). Throughout this paper, we make the following assumptions for all DRAM set mapping policies (explained in Section 3) including ours (explained in Section 4):
1. We employ a MissMap [9, 10] to determine whether an access to the DRAM cache will be a hit or a miss, while the tags are stored in the DRAM cache similar to the state-of-the-art [5, 9, 10, 12].
2. The MissMap is accessed in parallel to the L3-SRAM cache (see Figure 2) so that the request can be sent to the DRAM cache (MissMap hit) or to main memory (MissMap miss) as quickly as possible after an L3-SRAM cache miss is detected.
3. We employ a state-of-the-art cache insertion policy on top of the DRAM cache in order to reduce contention in the DRAM cache controller [5].
4. We employ state-of-the-art access scheduling [14] in the DRAM cache controller and the memory controller, similar to the state-of-the-art [5, 9, 10, 12].
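Eq. (1) makes the trade-off concrete. The following sketch (with hypothetical latency and miss-rate values chosen only for illustration, not taken from our measurements) shows how a policy with the lower hit latency can still end up with the higher average access latency once the miss rate grows:

```python
def avg_access_latency(hl, mr, ml):
    """Eq. (1): AL = (1 - MR) * HL + MR * ML."""
    return (1.0 - mr) * hl + mr * ml

# Hypothetical values for illustration only.
ml = 400  # DRAM cache miss latency [cycles]

# Latency-optimized policy: low hit latency, but high miss rate.
al_latency_opt = avg_access_latency(hl=50, mr=0.5, ml=ml)
# Miss-rate-optimized policy: higher hit latency, but lower miss rate.
al_missrate_opt = avg_access_latency(hl=90, mr=0.25, ml=ml)

print(al_latency_opt)   # 225.0
print(al_missrate_opt)  # 167.5
```

With a low miss rate the latency-optimized policy would win instead; this is exactly the crossover visible between the 4-core and the 8/16-core results in Figure 1-(d).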
[Figure 3 diagrams: DRAM row layouts (2 KB row, data bus width 128 bits = 16 bytes). (a) MOST: one row buffer holds one cache set with 29 ways of data (3 tag blocks of 64 bytes each, then 29 data blocks). (b) LOST: one row buffer holds 28 direct-mapped cache sets, each a 72-byte TAD entry (8-byte tag + 64-byte data). (c) LAMOST: one row buffer holds 4 cache sets with 7 ways of data each; per set, one 64-byte tag block (7 × 6 = 42 bytes used, 22 bytes unused) precedes 7 data blocks; the cache set number CS is determined by the two least significant memory block address bits. (d) LAMOST-SB: as (c), but with 7.5-byte tag entries (7 × 7.5 = 52.5 bytes used per tag block).]
Figure 3: DRAM set mapping policies for (a) Miss rate optimized set mapping policy (MOST) [5, 9, 10] (b) Latency optimized set mapping policy (LOST) [12] (c) Latency and Miss rate optimized set mapping policy (LAMOST) [Ours] (d) Latency and miss rate optimized set mapping policy with set balancing (LAMOST-SB) [Ours]
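The byte budgets behind the four layouts of Figure 3 can be checked with simple arithmetic (2 KB row, 64-byte blocks, 6-byte tag entries; sizes as given in Sections 3.1, 4.1, and 4.2.2):

```python
ROW_BYTES = 32 * 64  # 2 KB DRAM row = 32 slots of 64 bytes

# (a) MOST: 3 tag blocks + 29 data blocks fill the row exactly.
assert (3 + 29) * 64 == ROW_BYTES
print(3 * 64 - 29 * 6)       # 18 -> 174 of the 192 reserved tag bytes are used

# (b) LOST: 28 direct-mapped TAD entries of 72 bytes (8 B tag + 64 B data).
print(ROW_BYTES - 28 * 72)   # 32 -> bytes per row left over

# (c) LAMOST: 4 sets, each 1 tag block + 7 data blocks.
assert 4 * (1 + 7) * 64 == ROW_BYTES
print(64 - 7 * 6)            # 22 -> unused bytes per 64-byte tag block

# (d) LAMOST-SB: 7.5-byte tag entries (6 B tag + 12-bit Seg-Row).
print(64 - 7 * 7.5)          # 11.5 -> unused bytes per tag block
```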
3. STATE-OF-THE-ART SET MAPPING POLICIES
An important design consideration for a DRAM cache is the set mapping policy, which determines the DRAM cache efficiency through its effect on hit latency and miss rate. In this section, we present the different state-of-the-art set mapping policies [5, 9, 10, 12].
3.1 MOST policy [5, 9, 10]
A typical DRAM row size of 2 KB can store up to 32 64-byte memory blocks (32 × 64 = 2048 bytes/row). In the Miss rate Optimized SeT mapping (MOST) policy [5, 9, 10], the 2 KB row is partitioned into 29 data blocks (i.e. an associativity of 29 ways per set) and 3 tag blocks (29 × 6 = 174 bytes are used out of the 192 bytes reserved for tags), assuming 6 bytes per tag entry. The DRAM row organization for the MOST policy is illustrated in Figure 3-(a), where each DRAM row comprises one cache set with 29-way set associativity. A cache access must first read the three tag blocks before accessing the data line. Ref. [9] proposes Compound Access Scheduling so that the second access (for the data) is guaranteed to have a row buffer hit after the tag access.
Figure 4 illustrates how main memory blocks are mapped to a row buffer and to a DRAM row within a bank. Each bank is associated with a row buffer that holds the last accessed row of that bank. The DRAM cache row number within a bank is determined by the main memory block address (Figure 4). For instance, main memory block-0 is mapped to row-0 of bank-0 and main memory block-1 is mapped to row-0 of bank-1. Spatially close memory blocks (e.g. main memory block-0, block-1, block-2, and block-3) are mapped to different row buffers, as depicted in Figure 4. For instance, main memory block-0 is mapped to RB-0 (RB-i stands for the row buffer associated with bank i) and memory block-1 is mapped to RB-1.
Figure 4: Main memory block mapping for the MOST [5, 9, 10] policy (B: number of DRAM banks, 32 in our setup; R: number of rows in a DRAM bank, 4096 in our setup; RB-i: row buffer associated with bank i)
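The mapping of Figure 4 can be sketched as follows (a sketch under the stated parameters B = 32 banks and R = 4096 rows; the modulo arithmetic is our reading of the figure, not code from the paper):

```python
B = 32    # number of DRAM banks
R = 4096  # rows per bank

def most_map(block):
    """MOST policy (Figure 4): consecutive blocks go to different banks,
    so spatially close blocks land in different row buffers.
    Each row holds one 29-way set."""
    bank = block % B        # low log2(B) bits select the bank
    row = (block // B) % R  # next log2(R) bits select the row
    return bank, row

# Blocks 0 and 1 are spatially close but hit different row buffers:
print(most_map(0))       # (0, 0)
print(most_map(1))       # (1, 0)
print(most_map(131072))  # (0, 0) -- 32 * 4096 blocks later, same row again
```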
3.2 LOST policy [12]
The tag check in the MOST policy [5, 9, 10] requires a DRAM cache access for the three tag blocks before the data can be accessed. To reduce this tag serialization latency, the Latency Optimized SeT mapping (LOST) policy [12] converts the DRAM cache into a direct mapped cache. The LOST policy tightly integrates tag and data into a single entity called TAD (Tag And Data), as illustrated in Figure 3-(b). Each TAD entry represents one set of the direct mapped cache. Thus, instead of two separate accesses (one for the three tag blocks and one for the data block), the LOST policy requires a single DRAM cache access to fetch the unified TAD entry. As a result, the LOST policy significantly reduces the DRAM cache hit latency compared to the MOST policy, as depicted in Figure 1-(a). However, the LOST policy incurs a higher DRAM cache miss rate than MOST due to increased conflict misses (Figure 1-b), because it employs a direct mapped DRAM cache.
Figure 5 illustrates how main memory blocks are mapped to a row buffer and to a DRAM row within a bank. The LOST policy maps 28 consecutive main memory blocks to the same row buffer. For instance, main memory blocks 0-27 are mapped to RB-0 and main memory blocks 28-55 are mapped to RB-1, as depicted in Figure 5.
Figure 5: Main memory block mapping for the LOST [12] policy

4. OUR SET MAPPING POLICIES
In this section, we introduce our novel DRAM set mapping policies that simultaneously optimize DRAM cache hit latency and DRAM cache miss rate.

4.1 LAMOST policy
The DRAM row organization for our proposed Latency And Miss rate Optimized SeT mapping (LAMOST) policy is illustrated in Figure 3-(c), where each DRAM row comprises four cache sets with 7-way set associativity. Each cache set consists of 1 tag block (64 bytes) and 7 data blocks. The 7 data blocks need 7 × 6 = 42 bytes for their tag entries, with 22 bytes left unused. Figure 6 illustrates how main memory blocks are mapped to a row buffer and to a DRAM row within a bank. The LAMOST policy maps 4 consecutive main memory blocks to the same DRAM row buffer. For instance, main memory blocks 0-3 are mapped to RB-0 and blocks 4-7 are mapped to RB-1, as depicted in Figure 6. The DRAM cache set number within a row is determined by the two least significant bits of the memory block address, as illustrated in Figure 6. For instance, main memory block-0 is mapped to set-0 and main memory block-3 is mapped to set-3 within a DRAM row (see Figure 3-c). A cache access must first read one tag block (in contrast to three tag block accesses in the MOST policy) before accessing the data line. This reduces the tag serialization latency compared to the MOST policy. The LAMOST policy thus reduces the DRAM cache hit latency compared to the MOST policy via reduced tag serialization latency (details in Section 5) and a high row buffer hit rate (evaluated in Section 6). It minimizes the DRAM cache miss rate via high associativity (7-way set associative) compared to the LOST policy (direct mapped).
Figure 6: Main memory block mapping for the proposed LAMOST policy (B: number of DRAM banks, 32 in our setup; R: number of rows in a DRAM bank, 4096 in our setup; RB-i: row buffer associated with bank i)
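For comparison, the block-to-(bank, row, set) decompositions of Figures 5 and 6 can be sketched as follows (our reading of the figures, assuming B = 32 banks and R = 4096 rows; not code from the paper):

```python
B, R = 32, 4096

def lamost_map(block):
    """LAMOST (Figure 6): 4 consecutive blocks share one row buffer
    but fall into 4 different sets within the row."""
    cache_set = block % 4          # 2 LSBs -> set number within the row
    bank = (block // 4) % B        # next log2(B) bits -> bank / row buffer
    row = (block // (4 * B)) % R   # next log2(R) bits -> row within the bank
    return bank, row, cache_set

def lost_map(block):
    """LOST (Figure 5): 28 consecutive blocks (one per direct-mapped
    TAD slot) share one row buffer."""
    cache_set = block % 28
    bank = (block // 28) % B
    row = (block // (28 * B)) % R
    return bank, row, cache_set

print(lamost_map(0))    # (0, 0, 0)
print(lamost_map(3))    # (0, 0, 3) -- same row buffer, different set
print(lamost_map(128))  # (0, 1, 0) -- 4 * 32 blocks later: bank 0, row 1
print(lost_map(27))     # (0, 0, 27)
print(lost_map(28))     # (1, 0, 0)
```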
4.2 LAMOST-SB policy
As the number of DRAM cache sets increases, the efficiency of set-associative caches is reduced because programs exhibit a non-uniform distribution of accesses across different cache sets [13]. For a non-uniform set distribution, some of the DRAM cache sets may be under-utilized, whereas others may be severely over-utilized. As a result, over-utilized sets incur larger miss rates than under-utilized sets, which may degrade the performance. An intuitive approach to achieving a uniform cache set distribution is to assign a row number to each memory block in a round-robin way after a memory block miss. This would require storing the row number (12 bits) with each memory block in the MissMap, leading to a significant storage requirement of 768 bits (12 × 64 = 768 bits) per MissMap entry. Instead, our proposed LAMOST-SB policy stores the DRAM row number at a coarser granularity, as described in the following. Each MissMap entry tracks the memory blocks (we use a memory block size of 64 bytes similar to the state-of-the-art [5, 9, 10]) associated with a memory segment (we use a 4 KB MissMap entry size similar to the state-of-the-art [5, 9, 10]). Each 4 KB MissMap segment is associated with a tag (called Seg-Tag) and a bit vector (called Seg-BV) with one bit per memory block, as shown in Figure 7. The Seg-Tag field determines whether a particular memory segment is present in the MissMap (segment hit) or absent (segment miss). The Seg-BV field determines the hit/miss of a memory block within a particular segment. Our proposed LAMOST-SB policy stores the DRAM row number at segment level, which only requires a storage overhead of 12 bits per MissMap entry. For this reason, we add an additional Seg-Row field to each MissMap entry to support set balancing. The Seg-Row field is assigned to a MissMap entry after a segment miss (see Section 4.2.1).
[Figure 7 diagram: a MissMap entry consists of the Seg-Tag, valid (V), dirty (D), and LRU (L) bits, the Seg-Row field, and the 64-bit Seg-BV bit vector; a set bit in Seg-BV indicates a DRAM cache hit for the corresponding block, a cleared bit a DRAM cache miss.]
Figure 7: MissMap entry covering a 4KB memory segment; we add an additional Seg-Row field to each MissMap entry to support set balancing
LAMOST-SB is built on top of our LAMOST policy. The row buffer and set mapping for the LAMOST-SB policy are similar to the LAMOST policy, except that the row number is stored in the MissMap (Figure 6 and Figure 8). The primary difference between the LAMOST and LAMOST-SB policies is that the LAMOST policy determines the DRAM row number from the memory block address (Figure 6), while the LAMOST-SB policy stores the DRAM row number in the MissMap, where it is assigned to each MissMap entry after a segment miss (see Section 4.2.1).
Figure 8: Main memory block mapping for the LAMOST-SB policy. The row number is stored in the MissMap
4.2.1 Row Assignment
The LAMOST-SB policy assigns a DRAM row number to a MissMap segment after a segment miss (i.e. when the segment is referenced for the first time), as shown in Figure 9. When an application running on core-i accesses a new segment S that is currently absent in the MissMap (segment miss), a new MissMap entry E is allocated for S, and a DRAM row number r is determined in a round-robin manner for core-i and stored in the Seg-Row field of E. After a MissMap segment hit (DRAM row number already assigned), the DRAM row number is provided by the MissMap (determined by the Seg-Row field of the hit MissMap entry). The DRAM bank number (i.e. the row buffer) and the DRAM set number (within a row) are determined by the least significant bits of the main memory block address, as illustrated in Figure 8. Since the LAMOST-SB policy assigns the DRAM row numbers in a round-robin manner for each core, it leads to an improved DRAM cache row utilization (and hence to an improved DRAM cache set utilization).
Figure 9: Row assignment for the LAMOST-SB policy
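The flow of Figure 9 can be sketched as follows (a minimal model: the MissMap is represented as a plain dict and entry eviction is ignored; the names are ours, not from the paper):

```python
R = 4096  # rows per DRAM bank

class RowAssigner:
    """Round-robin Seg-Row assignment per core (Figure 9 flow, sketched)."""
    def __init__(self, num_cores):
        self.next_row = [0] * num_cores  # one round-robin counter per core
        self.missmap = {}                # segment address -> Seg-Row field

    def lookup(self, core, segment):
        if segment in self.missmap:          # MissMap segment hit:
            return self.missmap[segment]     # Seg-Row already assigned
        row = self.next_row[core]            # segment miss: pick next row
        self.next_row[core] = (row + 1) % R  # round robin for this core
        self.missmap[segment] = row          # store in Seg-Row field
        return row

ra = RowAssigner(num_cores=4)
print(ra.lookup(0, segment=0x1000))  # 0 (first segment touched by core 0)
print(ra.lookup(0, segment=0x2000))  # 1 (next row, round robin)
print(ra.lookup(0, segment=0x1000))  # 0 (segment hit: row already assigned)
```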
4.2.2 Overhead
Storing the DRAM row number in the MissMap increases the tag entry size for a memory block by 12 bits (the tag size is now 7.5 bytes instead of 6 bytes), as illustrated in Figure 8. This leads to a tag-storage requirement of 52.5 bytes (7.5 bytes × 7 = 52.5 bytes) for each DRAM cache set (Figure 3-d). Each cache set in the LAMOST-SB policy consists of 1 tag block (64 bytes) and 7 data blocks. The 7 data blocks need 52.5 bytes for their tag entries, with 11.5 bytes (64 − 52.5 = 11.5) still left unused. The increased tag entry size of the LAMOST-SB policy does not incur additional hardware overhead, as it utilizes the unused bytes in the tag block. The DRAM row organization for our proposed LAMOST-SB policy (Figure 3-d) is similar to our LAMOST policy, except that the DRAM row number is stored with each MissMap entry and requires a larger tag entry (7.5 bytes instead of 6 bytes). Storing the DRAM row number in the MissMap increases the size of the MissMap by 240 KB. A 2 MB MissMap cache (as used by us and in the state-of-the-art [5, 9, 10]) can store 163,840 MissMap entries, which results in an additional storage overhead of 240 KB (163,840 × 12 bits = 1,966,080 bits = 245,760 bytes = 240 KB) for the LAMOST-SB policy compared to the other set mapping policies. To stay within the same storage budget, we reduce the L3-SRAM cache associativity by one for the LAMOST-SB policy, as illustrated in Table 1. This slightly increases the L3-SRAM cache miss rate and the DRAM cache access rate. Storing the DRAM row number in the MissMap may slightly increase the MissMap access latency. However, for all set mapping policies the MissMap is accessed in parallel to the L3-SRAM cache. As the MissMap access latency is much smaller than the L3-SRAM cache access latency, this does not incur any additional latency overhead for our LAMOST-SB policy. The only hardware overhead of LAMOST-SB is the 12-bit round-robin row selection logic for each core.
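The storage numbers above follow directly from the entry sizes; a quick check (values from Section 4.2.2):

```python
# Per-set tag storage with the 12-bit Seg-Row extension:
tag_entry_bytes = 6 + 12 / 8  # 6-byte tag + 12-bit row number = 7.5 B
per_set = 7 * tag_entry_bytes # 7 ways per LAMOST-SB set
print(per_set)                # 52.5 -> fits in the 64-byte tag block

# Extra MissMap storage for LAMOST-SB:
entries = 163_840             # MissMap entries in a 2 MB MissMap
extra_bits = entries * 12     # one 12-bit Seg-Row field per entry
print(extra_bits // 8 // 1024)  # 240 (KB)
```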
5. LATENCY BREAKDOWN
In this section, we analyze the row buffer hit latencies (i.e. the accessed data is in the row buffer) and row buffer miss latencies (i.e. the accessed data is not in the row buffer) for the different set mapping policies. Figure 10 shows the latency breakdown (in processor clock cycles) for the different set mapping policies. Note that the breakdown does not include the latency of the DRAM cache controller (the time spent in the DRAM cache controller before a DRAM bank is accessed). We assume identical latency values for all DRAM cache parameters, which are listed in Table 1. The DRAM cache has a row activation latency (ACT) and a column access latency (CAS) of 18 clock cycles each. The tag check latency is assumed to be 2 clock cycles. Similar to the state-of-the-art [5, 9, 10, 12], the DRAM cache bus width per DRAM channel is assumed to be 128 bits (16 bytes).
Figure 10: Latency breakdown for (a) Miss rate optimized set mapping policy (MOST) [5, 9, 10] (b) Latency optimized set mapping policy (LOST) [12] (c) Latency and miss rate optimized set mapping policy (LAMOST) [Ours]. The breakdown does not include the DRAM cache controller latency.

Table 1: Core, cache and main memory parameters

Core Parameters
  ROB/RS/LDQ/STQ size: 128/32/32/24
  Decode/commit width: 4/4
  Core frequency: 3.2 GHz
SRAM Cache Parameters
  L1 cache: 32 KB, 8-way, 2 cycles
  L2 cache: 256 KB, 8-way, 5 cycles
  Shared L3-SRAM size: 4/8/16 MB for 4/8/16 cores
  Shared L3-SRAM associativity: 16/32/32 for 4/8/16 cores
  Shared L3-SRAM associativity for LAMOST-SB policy: 15/31/31 for 4/8/16 cores
  Shared L3-SRAM size for LAMOST-SB policy: 3.75/7.75/15.5 MB for 4/8/16 cores
  Shared L3-SRAM latency: 15/20/30 cycles for 4/8/16 cores
DRAM Cache Parameters
  MissMap size: 2 MB
  DRAM cache size: 256 MB
  Number of DRAM banks: 32
  Number of channels: 4
  Bus width: 128 bits per channel
  Bus frequency: 1.6 GHz
  tRAS-tRCD-tRP-tCAS (nsec): 20-5.5-5.5-5.5
Main Memory Parameters
  Number of channels: 2
  Bus width: 64 bits per channel
  Bus frequency: 800 MHz
  tRAS-tRCD-tRP-tCAS (nsec): 36-9-9-9
If the data is already in the row buffer (i.e. a row buffer hit), the MOST policy requires 18 clock cycles for CAS (to access the tags from the row buffer), 12 cycles to transfer the three tag lines over the bus (64 × 3 = 192 bytes at 16 bytes per bus cycle, i.e. 192/16 = 12 cycles), 2 clock cycles for the tag check, 18 clock cycles for the second CAS (to access the data from the row buffer), and 4 clock cycles to transfer the data line. If the data is not in the row buffer (i.e. a row buffer miss), an additional 18 ACT clock cycles are required for row activation. For the MOST policy, the row buffer hit latency is thus 54 clock cycles and the row buffer miss latency is 72 clock cycles, as illustrated in Figure 10-(a). For the LOST policy, the row buffer hit latency is 25 clock cycles (18 clock cycles for CAS + 5 bus cycles for the 72-byte TAD entry + 2 clock cycles for the tag check) and the row buffer miss latency is 43 clock cycles (18 additional cycles for row activation), as illustrated in Figure 10-(b). For our LAMOST and LAMOST-SB policies, the access latency of a row buffer hit comprises the time to access the tags (18 clock cycles for CAS), the time to read the tag block over the DRAM bus (4 bus cycles), the tag check (2 clock cycles), the time to access the data (18 clock cycles for CAS), and the time to read the data over the DRAM bus (4 bus cycles). Thus, the row buffer hit latency is 46 clock cycles and the row buffer miss latency is 64 clock cycles (18 additional cycles for row activation) for the LAMOST and LAMOST-SB policies, as illustrated in Figure 10-(c).
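The cycle counts of Figure 10 can be reproduced from the per-component latencies (CAS = ACT = 18 cycles, tag check = 2 cycles, 16-byte bus transfers); a sketch:

```python
CAS = ACT = 18  # column access / row activation (processor cycles)
TAG_CHECK = 2
BUS_BYTES = 16  # 128-bit DRAM cache bus per channel

def bus_cycles(nbytes):
    # ceil(nbytes / 16) bus transfers
    return -(-nbytes // BUS_BYTES)

# MOST: CAS + 3 tag lines on the bus + tag check + CAS + 1 data line
most_hit = CAS + bus_cycles(3 * 64) + TAG_CHECK + CAS + bus_cycles(64)
# LOST: one CAS for the 72-byte TAD entry + tag check
lost_hit = CAS + bus_cycles(72) + TAG_CHECK
# LAMOST: CAS + 1 tag line + tag check + CAS + 1 data line
lamost_hit = CAS + bus_cycles(64) + TAG_CHECK + CAS + bus_cycles(64)

for hit in (most_hit, lost_hit, lamost_hit):
    print(hit, hit + ACT)  # row buffer hit / miss latency
# 54 72
# 25 43
# 46 64
```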
6. EVALUATION AND DISCUSSION

6.1 Experimental Set-up
We use the x86 version of SimpleScalar (zesto) [11] to simulate 4, 8, and 16 core systems. The parameters for the cores, caches, and off-chip memory are listed in Table 1. Our performance evaluation uses various multi-programmed workloads from SPEC2006 [1], as shown in Table 2. For each benchmark, we used the Simpoint tool [6] to select a representative sample. Our evaluation is based on the assumptions described in Section 2 (see Figure 2). For each application, we fast-forward 500 million instructions to warm up the caches and gather simulation statistics for the following 250 million instructions. When a faster application terminates early by completing its first 250 million instructions, it is restarted to keep contending for the cache resources; however, the simulation statistics are recorded only for the first 250 million instructions of each application. For the evaluation, we compare our LAMOST and LAMOST-SB policies with the state-of-the-art set mapping policies MOST [5, 9, 10] and LOST [12], which are discussed in detail in Section 3. The main drawback of these set mapping policies is that they are optimized for a single parameter (DRAM cache hit
462.libquantum 470.lbm 471.omnetpp 473.astar.ref 473.astar.train
429.mcf 433.milc 437.leslie3d.ref 437.leslie3d.train 445.gobmk 450.soplex
401.bzip2
Table 2: Application mixes (entries with the number “2” indicate that two instances of that application are used)
Mix_02 Mix_03 Mix_04 Mix_05 Mix_06 Mix_07 Mix_08 Mix_09 Mix_10 Mix_11 Mix_13
2
Mix_14
2
2
2
2
2
2
Mix_15
2
Mix_16
2
Mix_17 Mix_18 16 cores
2
Mix_19 Mix_20 Mix_21
LOST[12] LAMOST-SB [Our]
30% 20% 10% 0%
4 cores 8 cores 16 cores Average
Figure 11: Row buffer hit rate for different set mapping policies averaged over all application mixes
6.3 DRAM cache hit latency
Mix_12 8 cores
40%
MOST[5,9,10] LAMOST [Our]
Our LAMOST and LAMOST-SB policies exploit the row buffer locality by mapping four spatially close memory blocks to the same row buffer (Figure 6). These four spatially close memory blocks are mapped to different cache sets as illustrated in Figure 3-(c-d). The least significant two bits of the memory block address are used to identify the set number within the row (Figure 6 and Figure 8). The LAMOST and LAMOST-SB policies have high row buffer hit rates (16.4% for LAMOST and 16% for LAMOST-SB) compared to the MOST policy (1.3%). The LAMOST and LAMOST-SB policies have a reduced row buffer hit rate (as they map 4 spatially close memory blocks to the same DRAM row) compared to the LOST policy (maps 28 spatially close memory blocks to the same DRAM row).
Mix_01
4 cores
DRAM row buffer hit rate
latency or DRAM cache miss rate). In contrast, our proposed set mapping policies (LAMOST and LAMOST-SB policies) simultaneously reduces DRAM cache hit latency and DRAM cache miss rate at the same time.
2 2 2 2 2 2
2
2 3
2
2
2 2 3
2 2 2
2
2
3 3 3
Figure 12 shows the DRAM cache hit latency for different set mapping policies for a 4/8/16 core system respectively. The DRAM cache hit latency highly depends on whether an access leads to a row buffer hit or a row buffer miss. For this reason, the set mapping policies with high row buffer hit rates (LOST, LAMOST, and LAMOST-SB) have a reduced DRAM cache hit latency (Figure 12) compared to the set mapping policy with a reduced row buffer hit rate (MOST policy). The MOST policy incurs a higher DRAM cache hit latency due to an increased tag serialization latency (reading three tag lines), and the reduced row buffer hit rate (1.3%) further worsens the DRAM cache hit latency.
The DRAM cache hit latency highly depends on whether an access leads to a row buffer hit or a row buffer miss (details in Section 5). Row buffer hits have a reduced access latency compared to row buffer misses for all set mapping policies, as illustrated in Figure 10. In this section, we analyze the row buffer hit rates (i.e. the fraction of accesses for which the requested row is already in the row buffer) for the different set mapping policies. Figure 11 shows the row buffer hit rates for 4-, 8-, and 16-core systems. In the MOST policy, the row buffer hit rate is very low (1.3%, see Figure 11) because spatially close memory blocks are mapped to different row buffers, as illustrated in Figure 3. The MOST policy provides a highly associative DRAM cache to reduce conflict misses at the cost of a reduced row buffer hit rate. The LOST policy maps 28 consecutive memory blocks to the same row buffer (Figure 5), so the probability of a row buffer hit is very high, and thus the LOST policy has the highest row buffer hit rate (27.4%) of all set mapping policies, as illustrated in Figure 11. As the number of cores increases, the row buffer hit rate decreases due to increased interleaving of cache requests from multiple applications.
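The effect described above (placing more consecutive blocks in the same row raises the row buffer hit rate on a spatially local access stream) can be sketched with a toy open-row model. This is a hypothetical one-row-buffer simplification, not the simulator used in the paper:

```python
# Toy open-row model (hypothetical simplification, not the paper's
# simulator): a single row buffer holds the most recently opened row,
# and an access hits iff it targets the currently open row.

def row_buffer_hit_rate(access_stream, blocks_per_row):
    """Fraction of accesses that hit in the (single) row buffer."""
    open_row = None
    hits = 0
    for block in access_stream:
        row = block // blocks_per_row
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(access_stream)

stream = list(range(16))  # a spatially local access stream
# MOST-like placement (consecutive blocks in different rows): no hits.
assert row_buffer_hit_rate(stream, blocks_per_row=1) == 0.0
# LAMOST-like placement (4 blocks per row): 3 of every 4 accesses hit.
assert row_buffer_hit_rate(stream, blocks_per_row=4) == 0.75
```

The same model also explains why interleaved requests from more cores lower the hit rate: interleaving breaks the spatial locality of the per-core streams.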
6.2 Row buffer hit rate
Figure 12: DRAM cache hit latency for different set mapping policies averaged over all application mixes
The LOST policy is optimized for DRAM cache hit latency (Figure 12) by providing tag and data in a single access (Figure 10-b) and by achieving the highest row buffer hit rate (27.4%). Our proposed LAMOST and LAMOST-SB policies have a reduced DRAM cache hit latency compared to the MOST policy, but an increased DRAM cache hit latency compared to the LOST policy. This increase is compensated by a reduction in DRAM cache miss rate and DRAM cache miss latency, as illustrated in Figure 13-(a-b).
Figure 13 shows the average DRAM cache miss rate and DRAM cache miss latency for the different set mapping policies for 4-, 8-, and 16-core systems. An increased DRAM cache miss rate results in a higher DRAM cache miss latency (due to increased contention in the memory controller), as illustrated in Figure 13. The MOST policy is optimized for DRAM cache miss rate (Figure 13-a) via a highly associative DRAM cache (29-way) that reduces the number of conflict misses: it reduces the DRAM cache miss rate by 20.6% and 2.4% compared to the LOST and LAMOST (7-way associative) policies, respectively. However, our LAMOST-SB policy reduces the DRAM cache miss rate by 3.5% even compared to the highly associative MOST policy, due to set balancing (details in Section 4.2). As a result, our proposed LAMOST-SB policy reduces the average DRAM cache miss rate by 3.5%, 20%, and 5.7% compared to the MOST, LOST, and our LAMOST policies, respectively.
Figure 14: Average DRAM cache access latency for different set mapping policies averaged over all application mixes
Figure 15 shows the average harmonic mean instructions per cycle (HMIPC) throughput for all set mapping policies relative to the MOST policy for 4-, 8-, and 16-core systems. On average, our proposed LAMOST-SB policy improves the HMIPC throughput by 7.6% and 11.8% compared to the MOST [5, 9, 10] and LOST [12] policies, respectively. For a 16-core system, it improves the HMIPC throughput by 7.6% and 25.7% compared to the MOST and LOST policies, respectively.
6.4 DRAM cache miss rate and DRAM cache miss latency
Figure 15: Average HMIPC throughput relative to the MOST policy averaged over all application mixes
6.6 Detailed results
Figure 13: (a) DRAM cache miss rate (b) DRAM cache miss latency for different set mapping policies averaged over all application mixes
6.5 Average DRAM cache access latency
The overall performance is determined by the average DRAM cache access latency (AL), which depends upon the DRAM cache hit latency (HL), the DRAM cache miss rate (MR), and the DRAM cache miss latency (ML), as determined by Eq. (1) in Section 1. Figure 14 shows the average DRAM cache access latency for the different set mapping policies for 4-, 8-, and 16-core systems. For a 16-core system, our proposed LAMOST-SB policy reduces the average DRAM access latency by 12.1%, 29.3%, and 8.6% compared to the MOST, LOST, and LAMOST policies, respectively. Our proposed LAMOST and LAMOST-SB policies reduce the DRAM cache hit latency (Figure 12) via a high row buffer hit rate (Figure 11) compared to the MOST policy, and they reduce the DRAM cache miss rate and miss latency (Figure 13-(a-b)) via their higher associativity compared to the LOST policy. The LAMOST-SB policy further reduces the DRAM cache miss rate (Figure 13-a) and DRAM cache miss latency (Figure 13-b) compared to the LAMOST policy via set balancing.
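Eq. (1) itself is in Section 1 and not reproduced here; assuming it takes the common form AL = HL + MR × ML, the trade-off this section describes can be sketched as follows. All latency numbers below are illustrative placeholders, not values measured in the paper.

```python
# Hedged sketch: we assume Eq. (1) has the common form AL = HL + MR * ML
# (hit latency plus miss rate times miss penalty). All numbers below are
# illustrative placeholders, not values measured in the paper.

def avg_access_latency(hl_cycles: float, miss_rate: float, ml_cycles: float) -> float:
    """Average DRAM cache access latency (AL) in cycles."""
    return hl_cycles + miss_rate * ml_cycles

# A policy with a lower hit latency but a higher miss rate can still have
# a worse average access latency, which is why both metrics matter:
fast_hit = avg_access_latency(hl_cycles=90, miss_rate=0.45, ml_cycles=400)
low_miss = avg_access_latency(hl_cycles=120, miss_rate=0.30, ml_cycles=350)
assert fast_hit == 270.0 and low_miss == 225.0
```

This is the reason a policy that optimizes only HL (like LOST) or only MR (like MOST) can lose overall: AL couples all three terms.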
The performance of a set mapping policy depends upon the DRAM cache miss rate. Figure 16, Figure 17, and Figure 18 illustrate this observation by showing the harmonic mean instructions per cycle (HMIPC) throughput, the total instructions per cycle (TIPC) throughput, and the DRAM cache miss rate for all set mapping policies, respectively. When the DRAM cache miss rate is low (e.g. Mix_04, Mix_07, Mix_09, Mix_13), the LOST policy [12] performs better than the other set mapping policies because it is optimized for DRAM cache hit latency. When the DRAM cache miss rate is high (e.g. Mix_12, Mix_16, Mix_18, Mix_19, and Mix_20), even the baseline MOST policy performs better than the LOST policy. Figure 19 quantifies this by showing the individual application worst-case performance degradation relative to the MOST policy: for application mixes with a high DRAM cache miss rate (e.g. Mix_12, Mix_16, Mix_18, Mix_19, and Mix_20), the LOST policy significantly degrades the individual application worst-case performance (16.7% on average, 82% at maximum) compared to the baseline MOST policy. For some application mixes running on a 4-core system (e.g. Mix_04, Mix_07, Mix_09), the LOST policy performs better than our LAMOST-SB policy because these mixes have a reduced DRAM cache miss rate. As the number of cores increases, our LAMOST-SB policy performs better than the LOST policy (by 13.1% for an 8-core and 25.7% for a 16-core system) due to reduced conflict misses.
Figure 16: Normalized harmonic mean instruction per cycle (HMIPC) relative to the MOST policy for all application mixes
Figure 17: Normalized total instruction per cycle (TIPC) relative to the MOST policy for all application mixes
Figure 18: DRAM cache miss rate for all application mixes
Figure 19: Normalized individual application worst-case performance degradation relative to the MOST policy for all application mixes
For all application mixes, our proposed LAMOST and LAMOST-SB policies provide better performance than the baseline MOST policy. They improve the performance of applications that are not memory intensive (i.e. applications with a low DRAM cache miss rate) via a reduced DRAM cache hit latency, and they improve the performance of memory-intensive applications (i.e. applications with a high DRAM cache miss rate) via a reduced DRAM cache miss rate.
6.7 Comparison of LAMOST and LAMOST-SB policies
In this section, we compare the performance of our proposed LAMOST and LAMOST-SB policies. On average, the LAMOST-SB policy reduces the DRAM cache miss rate by 5.7% (Figure 13) compared to the LAMOST policy via set balancing (i.e. storing the DRAM row number in the MissMap and assigning it in a round-robin fashion). On average, our proposed LAMOST-SB policy improves the HMIPC throughput by -0.5%, 1.2%, and 3.8% compared to the LAMOST policy for a 4-core, 8-core, and 16-core system, respectively.
The LAMOST-SB policy requires an additional storage overhead of 240 KB to support set balancing (details in Section 4.2). To stay within the same storage budget, the LAMOST-SB policy employs a 3.75 MB, 7.75 MB, and 15.75 MB L3-SRAM cache for a 4-, 8-, and 16-core system, respectively, whereas all other set mapping policies employ 4 MB, 8 MB, and 16 MB L3-SRAM caches. Due to this smaller L3-SRAM cache, the L3-SRAM cache miss rate of the LAMOST-SB policy is increased by 1.9%, 1.1%, and 0.7% compared to the LAMOST policy for a 4-core, 8-core, and 16-core system, respectively.
LAMOST performs better than LAMOST-SB for a 4-core system due to its reduced L3-SRAM cache miss rate (L3-SRAM has a faster access latency than the DRAM cache), despite the fact that the LAMOST-SB policy reduces the DRAM cache miss rate and miss latency via set balancing. As the number of cores increases (and hence the DRAM cache miss rate increases), the slight increase in the L3-SRAM cache miss rate of the LAMOST-SB policy (1.1% for 8-core and 0.7% for 16-core compared to the LAMOST policy) is compensated by a reduction in the DRAM cache miss rate (5.7% for 8-core and 5.9% for 16-core compared to the LAMOST policy). This results in 1.2% and 3.8% improvements in HMIPC throughput compared to the LAMOST policy for 8-core and 16-core systems, respectively.
Since the overall DRAM cache miss rate is higher for a 16-core system (43.6%) than for an 8-core system (33.8%), as shown in Figure 13-a, LAMOST-SB provides a greater performance improvement for a 16-core system (3.8%) than for an 8-core system (1.2%).
7. CONCLUSIONS
We identified that both the DRAM cache hit latency and the DRAM cache miss rate play an important role in reducing the average DRAM cache access latency for multi-core systems. Recently proposed DRAM set mapping policies are optimized either for hit latency or for miss rate. Our novel DRAM set mapping policy, together with set balancing, simultaneously optimizes the DRAM cache hit latency and the DRAM cache miss rate. We evaluated our set mapping policies for various multi-programmed workload mixes and compared them to the state-of-the-art. Our evaluations with a 256 MB DRAM cache for a 16-core system show that our proposed LAMOST-SB policy reduces the average DRAM access latency by 12.1% and 29.3% compared to the MOST [5, 9, 10] and LOST [12] policies, respectively. This leads to a performance improvement (harmonic mean instructions per cycle) of 7.6% and 25.7%, respectively. Our detailed analysis shows that it is the combination of reduced DRAM cache hit latency and reduced DRAM cache miss rate that provides the improved performance compared to state-of-the-art set mapping policies.
8. Acknowledgement This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89).
REFERENCES
[1] Standard Performance Evaluation Corporation. http://www.spec.org.
[2] R. X. Arroyo, R. J. Harrington, S. P. Hartman, and T. Nguyen. IBM POWER7 Systems. IBM Journal of Research and Development, 55(3):2:1–2:13, 2011.
[3] B. Black et al. Die-Stacking (3D) Microarchitecture. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 469–479, December 2006.
[4] Y. Deng and W. Maly. Interconnect Characteristics of 2.5-D System Integration Scheme. In Proceedings of the International Symposium on Physical Design (ISPD), pages 171–175, April 2001.
[5] F. Hameed, L. Bauer, and J. Henkel. Adaptive Cache Management for a Combined SRAM and DRAM Cache Hierarchy for Multi-Cores. In Proceedings of the 15th Conference on Design, Automation and Test in Europe (DATE), pages 77–82, March 2013.
[6] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and More Flexible Program Analysis. Journal of Instruction Level Parallelism, 7, 2005.
[7] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 65–76, December 2010.
[8] G. H. Loh. Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue Policy. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 174–183, 2009.
[9] G. H. Loh and M. D. Hill. Efficiently Enabling Conventional Block Sizes for Very Large Die-Stacked DRAM Caches. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 454–464, 2011.
[10] G. H. Loh and M. D. Hill. Supporting Very Large DRAM Caches with Compound Access Scheduling and MissMaps. IEEE Micro, Special Issue on Top Picks in Computer Architecture Conferences, 2012.
[11] G. H. Loh, S. Subramaniam, and Y. Xie. Zesto: A Cycle-Level Simulator for Highly Detailed Microarchitecture Exploration. In International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[12] M. K. Qureshi and G. H. Loh. Fundamental Latency Trade-offs in Architecting DRAM Caches. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2012.
[13] M. K. Qureshi, D. Thompson, and Y. N. Patt. The V-Way Cache: Demand-Based Associativity via Global Replacement. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA), pages 544–555, June 2005.
[14] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory Access Scheduling. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA), pages 128–138, June 2000.
[15] L. Zhao, R. Iyer, R. Illikkal, and D. Newell. Exploring DRAM Cache Architectures for CMP Server Platforms. In Proceedings of the 25th International Conference on Computer Design (ICCD), pages 55–62, 2007.