SP-NUCA: A Cost Effective Dynamic Non-Uniform Cache Architecture

Javier Merino, Valentín Puente, Pablo Prieto, José Ángel Gregorio
Universidad de Cantabria, Spain
{jmerino, vpuente, pprieto, monaster}@unican.es

Abstract

This paper presents a simple but effective method to reduce on-chip access latency and improve core isolation in CMP Non-Uniform Cache Architectures (NUCA). The paper introduces a feasible way to allocate cache blocks according to the access pattern. Each L2 bank is dynamically partitioned at set level into private and shared content. Simply by adjusting the replacement algorithm, we can place private data closer to its owner processor; in contrast, shared data is always placed in the same position, independently of the accessing processor. This approach reduces on-chip latency without significantly sacrificing hit rates or increasing the implementation cost over a conventional static NUCA. Additionally, most of the unnecessary interference between cores in private accesses is removed. To support the architectural decisions adopted and provide a comparative study, a comprehensive evaluation framework is employed. The workbench is composed of a full system simulator and a representative set of multithreaded and multiprogrammed workloads. With this infrastructure, different alternatives for the coherence protocol, replacement policies, and cache utilization are analyzed to find the optimal proposal. We conclude that the cost of a feasible implementation should be close to that of a conventional static NUCA, and significantly less than that of a dynamic NUCA. Finally, a comparison with static and dynamic NUCA is presented. The simulation results suggest that, on average, the proposed mechanism could improve system performance over a static NUCA and an idealized dynamic NUCA by 16% and 6%, respectively.

1 Introduction

The future evolution in the number of cores per chip of CMP architectures could be jeopardized by the available off-chip bandwidth. Technology trends [8] indicate that off-chip pin bandwidth will grow at a much lower rate than the estimated number of processor cores per CMP chip. In order to minimize this mismatch, a large amount of on-chip cache should be provided. The transistor budget of current and future technologies guarantees a generous supply of these elements. Such a large cache will be wire-dominated, and NUCA seems to be the best candidate for its organization, capable of minimizing on-chip latency while fulfilling bandwidth requirements [10].

The best sharing policy in the cache hierarchy among the cores of a CMP is strongly dependent on the workload running on it. Some applications are characterized by a significant sharing degree whereas others have little to no sharing at all. The on-chip memory hierarchy should be flexible enough to adapt its behavior to these requirements, maximizing hit rates and minimizing unnecessary inter-core conflicts. Otherwise, unfairness [11] or a lack of on-chip quality of service (CQoS) [7] could appear. For a fully shared NUCA two major alternatives exist, which can be denoted as static NUCA (S-NUCA) and dynamic NUCA (D-NUCA). In the first, block placement depends only on the block address and the interleaving used, whereas in the second, block placement is variable and depends on the application access pattern. A pure D-NUCA implementation can be rather complex, especially when the block placement flexibility is high [1]. On the contrary, the S-NUCA approach is more feasible due to its simplicity, but in most cases it cannot take advantage of workload locality, usually incurring higher average on-chip latency.

The architecture presented in this work, denoted Shared/Private-NUCA or SP-NUCA for short, addresses these issues in a combined way. The SP-NUCA architecture is able to dynamically self-adjust the sharing partitioning depending on workload behavior. Each set is divided into private and shared slices. If the replacement policy can flexibly allocate private ways to shared blocks, and shared ways to private blocks, the ratio of shared to private capacity in the cache can be adjusted dynamically. The idea is built over the S-NUCA approach, reducing the average cache latency while keeping the implementation as simple as possible. The original S-NUCA proposal is modified, creating a private addressing function for each processor and holding the private blocks of each processor in its closest NUCA banks. Under these conditions, the average access latency of private accesses can be reduced significantly and unnecessary inter-core interference in private accesses can be avoided. Although other choices could be compatible, the coherence protocol employed is based on token coherence [16], so no directory is required to search for blocks located only in remote L1s or only in remote private banks.


Other modifications to the coherence protocol, compared to a conventional S-NUCA, are minimal. Moreover, our evaluation suggests that a global LRU is the most cost-effective replacement policy. The main penalty introduced by our proposal is the initial search in a private bank required for shared block accesses, which slightly increases the latency of accessing this type of data. SP-NUCA is thoroughly analyzed using a full system simulator and realistic workloads. A comparative study with the major NUCA alternatives is provided, and the results indicate clear benefits for multiprogrammed workloads, with performance improvements of up to 45%, and of around 10% for multithreaded workloads. A plethora of works has been proposed to tackle the same issues, such as [1][2][3][4][5][6][19][25]. The significant complexity of most of them may impede their practical implementation. Notwithstanding, SP-NUCA's main contribution is its feasibility, requiring minimal changes over an S-NUCA while leading to noticeable performance benefits. The rest of the paper is organized as follows: Section 2 introduces SP-NUCA and gives details about the coherence protocol. Section 3 describes the experimental methodology employed. Section 4 explains some architectural decisions taken in the final proposal. Section 5 presents performance results and compares them with other alternatives. Section 6 discusses related research and, finally, Section 7 states the main conclusions of the paper.

2 SP-NUCA Architecture and Coherence Protocol

In this section we present the basis of SP-NUCA, an efficient architecture for the Last Level Cache (LLC). The LLC is organized as a Non-Uniform Cache Architecture, divided into banks whose sets can hold shared or private blocks. A private block is accessed by only one processor, whereas a shared block is used by two or more processors. Therefore, every bank can store shared blocks for any on-chip processor and private blocks for its closest processor. When a request first arrives at the LLC, a private request is sent to the private bank determined by the requesting core and the address. If the requested address is not found in the private LLC bank, the request is forwarded to the shared bank. If the cache line is found in either the private or the shared LLC bank, the corresponding data is sent back to the requestor. On the contrary, if the shared bank does not have the line, it forwards the request to the private banks of the rest of the cores, where the block might reside, and also to memory. If the block is found in another private bank, it is marked as shared and migrated to the corresponding shared bank. Further accesses to this block by other processors will hit in the shared bank, so the extra latency of searching the other private banks is paid only when a block first becomes shared.

When a cache block arrives from memory, by default, it is marked as private and stored in the corresponding private LLC bank of the requesting processor. Therefore, block placement depends on both the address and the requesting processor. If no other core requests it, the cache block remains in a bank close to the processor, so future LLC hits for this block are faster. On the other hand, if another core requests the same cache block, it is migrated to the corresponding shared bank, which depends only on the block address. A block marked as shared (requested by more than one core) remains shared until it is evicted from the cache. In order to write data back to the suitable LLC bank, the next-to-last level cache also needs to keep track of the shared/private status of its blocks.

A static partition between shared and private blocks could lead to suboptimal cache utilization and therefore must be avoided. As each bank is n-way set-associative, different ways of the same set can hold shared or private blocks. We add an additional bit (the private bit) to every block stored in the cache to mark it as shared or private. This bit is also present in cache requests and responses. It is added to the tag comparison when looking for a block in the cache, so private requests only match private blocks and shared requests only match shared blocks. This approach makes the architecture both easy to implement and dynamically partitioned. Nevertheless, one of the key challenges of this proposal is deciding which ways of each set should hold private data and which ones should store shared blocks. In our architecture, reassigning a private way to the shared partition is done by flipping the private bit, so changing the number of private and shared ways dynamically has little cost. For example, when new data arrives at a set, the replacement algorithm must choose which way to replace. If the arriving data is private but the private data in the set is frequently used, the replacement algorithm could choose to evict a shared block and reassign that shared way as a private one. We explore different algorithms to dynamically partition the cache in Section 4.
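To make the lookup sequence concrete, the sketch below walks a request through the requester's private bank, the shared bank, and finally the remaining private banks plus memory, marking a block as shared once a second core touches it. It is a minimal functional model under our own simplifying assumptions, not the hardware: a single dictionary stands in for each core's 2^(n-p) private banks, the helper names are ours, and coherence, timing, and the token protocol races are ignored.

```python
# Minimal functional model of the SP-NUCA lookup path described above.
# Banks are modeled as plain dictionaries; coherence and timing are left out.

class SPNucaModel:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.private_banks = [dict() for _ in range(num_cores)]  # addr -> data
        self.shared_bank = dict()                                # addr -> data

    def access(self, core, addr):
        # 1. Private request to the requester's own private bank.
        if addr in self.private_banks[core]:
            return self.private_banks[core][addr], "private hit"
        # 2. Forward to the shared bank (same position for every core).
        if addr in self.shared_bank:
            return self.shared_bank[addr], "shared hit"
        # 3. Broadcast to the other private banks (and, in parallel, to memory).
        for other in range(self.num_cores):
            if other != core and addr in self.private_banks[other]:
                # Second sharer found: migrate the block to the shared bank.
                data = self.private_banks[other].pop(addr)
                self.shared_bank[addr] = data
                return data, "migrated to shared"
        # 4. Miss everywhere on chip: fetch from memory, allocate as private.
        data = self.fetch_from_memory(addr)
        self.private_banks[core][addr] = data
        return data, "memory fill (private)"

    def fetch_from_memory(self, addr):
        return f"line@{addr:#x}"

model = SPNucaModel(num_cores=8)
print(model.access(0, 0x1000))   # memory fill (private)
print(model.access(0, 0x1000))   # private hit
print(model.access(3, 0x1000))   # migrated to shared
print(model.access(5, 0x1000))   # shared hit
```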

Our system has 2^n LLC banks and 2^p processors. As explained before, all 2^n banks can hold shared data, and each core can store private data in 2^(n-p) banks. To interleave addresses across the chip, we map the bank number onto the lowest bits of the line address. As shown in Figure 1, the address is divided as follows. The lowest B bits select the byte within the cache block. Then, n-p bits select the private bank, or n bits select the shared bank, depending on whether it is a private or a shared request. The next i bits, the index, select the corresponding set in the bank, and the rest of the address is the tag. It is important to note that the address itself is the same; Figure 1 only shows how it is divided and interpreted depending on whether the request is private or shared. The private tag is p bits larger than the shared one, but as both are stored in the same tag array, the array must accommodate the private tag size, slightly increasing the required area of the LLC banks by p bits per line.
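The following sketch makes the field split of Figure 1 explicit. The bit widths (B, n, p, i) follow the text; the helper names are ours, the concrete values plugged in at the bottom correspond to the evaluated configuration of Table 2 (32 banks, 8 cores, 64-byte blocks, 512 sets per bank), and the extended tag simply concatenates the private bit so that private and shared copies never match each other.

```python
# Address interpretation for private vs. shared requests (cf. Figure 1).
# B: byte-offset bits, n: log2(#banks), p: log2(#cores), i: index bits.

def split_address(addr, is_private, B, n, p, i):
    byte = addr & ((1 << B) - 1)
    rest = addr >> B
    bank_bits = (n - p) if is_private else n   # private requests use fewer bank bits
    bank = rest & ((1 << bank_bits) - 1)
    rest >>= bank_bits
    index = rest & ((1 << i) - 1)
    tag = rest >> i
    return {"byte": byte, "bank": bank, "index": index, "tag": tag}

def extended_tag(tag, private_bit):
    # The private bit takes part in the tag comparison, so a private request
    # never matches a shared copy of the same line and vice versa.
    return (tag << 1) | private_bit

# Example: 32 banks (n=5), 8 cores (p=3), 64-byte blocks (B=6),
# 512 sets per bank (i=9, which follows from the bank parameters in Table 2).
addr = 0x12345678
print(split_address(addr, is_private=True,  B=6, n=5, p=3, i=9))
print(split_address(addr, is_private=False, B=6, n=5, p=3, i=9))
```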

Figure 1. Address interpretation for private or shared requests: tag, index, bank and byte-offset (B bits) fields; a private request uses n-p bank bits, whereas a shared request uses n bank bits.

To maintain coherence in the LLC banks we employ a token-based coherence protocol [16] using the MOESI states, with Token broadcast as the performance policy. Inclusiveness is not enforced in the memory hierarchy, so the LLC can evict a block without having to evict it from the next-to-last level cache. Token counting assures the correctness of each operation. The memory controller keeps track of the tokens held by memory, so if it receives a request for a block forwarded from a shared bank and holds no tokens for that block, it can simply drop the request. Although the idea is applicable when an on-chip L3 is present, in this work we particularize the analysis for an architecture where the LLC is the L2. Figure 2 shows a representation of this case with n processors. The resulting architecture is equivalent to having n virtual private L2 caches and one virtual shared L3 cache, where the capacity of each is allocated dynamically on demand and both are completely exclusive.

Figure 2. Application of SP-NUCA: each processor P0 ... Pn-1 has private L1 instruction and data caches; the L2 is split into per-processor private portions (L2P0 ... L2Pn-1) and a shared portion, backed by memory.
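As a rough illustration of the memory-side filtering just described, the sketch below keeps a per-block count of the tokens held by memory and drops forwarded requests when that count is zero, i.e., when all tokens, and therefore all valid copies, are somewhere on chip. It is only a sketch of that single rule under assumptions of ours (the TOTAL_TOKENS constant and all names are hypothetical); the complete token-counting rules of [16], such as the owner token and persistent requests, are not modeled.

```python
# Illustrative memory-side check under token coherence: memory can only
# supply data for a block if it still holds tokens for it.

TOTAL_TOKENS = 8  # assumed: one token per potential on-chip sharer

class MemoryController:
    def __init__(self):
        # Blocks absent from the map are entirely in memory (all tokens here).
        self.tokens = {}

    def tokens_held(self, addr):
        return self.tokens.get(addr, TOTAL_TOKENS)

    def handle_forwarded_request(self, addr):
        if self.tokens_held(addr) == 0:
            # All tokens are on chip: some cache will respond, so drop it.
            return None
        # Otherwise answer with data and hand over the tokens memory holds.
        granted = self.tokens_held(addr)
        self.tokens[addr] = 0
        return {"data": f"line@{addr:#x}", "tokens": granted}

mc = MemoryController()
print(mc.handle_forwarded_request(0x40))  # memory responds with data + tokens
print(mc.handle_forwarded_request(0x40))  # dropped: tokens now live on chip
```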

3 Methodology

Before clarifying the open decisions sketched in the previous section, we describe the methodology employed. We use a full system simulator, and a wide range of benchmarks has been selected in order to cover different sharing degrees and workload types.

3.1 Simulation Framework

We will show the impact of SP-NUCA on system performance under realistic conditions. For this purpose we use a full system simulator based on Simics [14] extended with the GEMS timing infrastructure [15]. GEMS is an event-driven simulator that provides a complete model of the memory system, RUBY, and a state-of-the-art detailed processor model, OPAL. With RUBY, the memory system simulator, we obtain an accurate model of the memory hierarchy under study, including the interconnection network parameters, bank access times, mapping, replacement policies, etc. In RUBY, each cache bank has its own controller, and using a domain-specific language called SLICC we can specify the coherence protocol precisely. OPAL is a detailed processor simulator that uses the timing-first simulation technique [17]. This environment allows us to simulate a complete multiprocessor system capable of running a commercial operating system without any modification.

3.2 Workloads

The applications considered in this study are six multiprogrammed and four multithreaded scientific workloads running on top of the Solaris 9 OS; they are summarized in Table 1.

Table 1. The benchmarks utilized.

Benchmark | Virtual size (MB) | Simulation point (million cycles) | Description
SPECint2000 (homogeneous multiprogrammed)
gcc | 158 | 1150 | C programming language compiler
mcf | 192 | 2000 | Combinatorial optimization
parser | 62.5 | 970 | Natural language processor
twolf | 4.1 | 350 | Place and route simulator
SPECfp2000 (homogeneous multiprogrammed)
art | 5.9 | 150 | Image recognition / neural networks
ammp | 30 | 400 | Computational chemistry
SPEC2000 (heterogeneous multiprogrammed)
twolf-mcf | 4.1 / 192 | 1600 | Mixed multiprogrammed
parser-twolf | 62.5 / 4.1 | 4000 | Mixed multiprogrammed
art-gcc | 5.9 / 158 | 3500 | Mixed multiprogrammed
NAS Parallel Benchmarks (multithreaded)
CG 'class A' | 52 | 496 (4 of 15 iter.) | Conjugate gradient
FT 'class W' | 22 | 48 (4 of 6 iter.) | Fast Fourier transform
LU 'class A' | 44 | 2250 (25 of 250 iter.) | Lower-upper factorization of CFD equation
MG 'class W' | 58 | 130 (2 of 4 iter.) | V-cycle multigrid method


The multiprogrammed workloads are taken from SPEC CPU2000 [20]. They are evaluated in rate mode (one thread per available processor) with reference inputs. In the heterogeneous multiprogrammed workloads, half of the processors run each application. The multithreaded workloads are numerical applications from the NAS Parallel Benchmarks, using the OpenMP implementation, version 3.2.1 [9]. For multithreaded applications we measure the number of cycles spent executing one iteration of the benchmark. In the multiprogrammed case we run 50 million instructions on core 0 and report the arithmetic mean of the CPI of all cores. Benchmarks are run for between 40 and several thousand million cycles before measurement starts. For each case a variable number of runs is performed with pseudo-random perturbation in order to estimate workload variability. All the results we provide have a 95% confidence interval.
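The statistic reported for the multiprogrammed runs can be reproduced in a few lines: average the per-core CPIs within a run, then build a 95% confidence interval across the perturbed runs. The sketch below uses a normal approximation (the 1.96 factor) over a handful of runs, which is our simplification; the numbers in it are placeholders, not measured data.

```python
# Mean CPI across cores, and a 95% confidence interval across repeated
# runs with pseudo-random perturbation. Placeholder numbers, not results.
import statistics

def mean_cpi(cycles_per_core, instructions_per_core):
    cpis = [c / i for c, i in zip(cycles_per_core, instructions_per_core)]
    return statistics.fmean(cpis)

def confidence_interval_95(per_run_values):
    m = statistics.fmean(per_run_values)
    s = statistics.stdev(per_run_values)                   # sample std. deviation
    half_width = 1.96 * s / (len(per_run_values) ** 0.5)   # normal approximation
    return m, half_width

runs = [1.42, 1.39, 1.45, 1.40]   # hypothetical mean-CPI values of 4 runs
mean, hw = confidence_interval_95(runs)
print(f"CPI = {mean:.3f} +/- {hw:.3f} (95% CI)")
```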

3.3 System Configuration

The simulated system is an 8-processor CMP with the layout depicted in Figure 3. The main configuration parameters are shown in Table 2. In the simulated system, the clock frequency ratio between the processor and the memory hierarchy is two. The L2 bank access time has been obtained using CACTI 5.0 [24] with a power-efficient sequential-access cache at 45nm technology. Each processor has its own private L1 instruction and L1 data caches and is connected to a node of the network. The L2 NUCA is distributed in 32 banks, connected four by four to the switch each CPU is attached to. The switches of the central row are connected to memory controllers. The four L2 banks nearest to each processor constitute by themselves an S-NUCA holding the "private portion" of the L2 cache associated to it, as explained in Section 2. For instance, in Figure 3, the highlighted banks store the private blocks for processor 0. Any bank can store shared blocks, placed according to the interleaving and the block address.

Table 2. Main simulation parameters.

Number of cores | 8
Issue width | 4
Window size / outstanding memory requests per CPU | 256 / 16
L1 I/D cache | Private, 64KB, 4-way, 64-byte block, 3 cycles (1-cycle tag)
Direct branch predictor | 768-byte YAGS
Indirect branch predictor | 64 entries (cascaded)
L2 cache | NUCA, 8x4 banks, 4 per router
L2 cache bank | 512KB, 16-way, 64-byte block, sequential access, 5 cycles (2-cycle tag)
Main memory | 4GB, 270 cycles, 320 GB/s
Network topology | Mesh with DOR routing, 128-bit wide links
Network hop latency | 5 cycles (3-cycle router + 2-cycle link latency)

Figure 3. CMP layout of the evaluated system: 8 processors (P0-P7) with private L1 I/D caches, 32 L2 banks attached four per router, and memory controllers connected to the central row of the mesh.
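To tie Figure 3 to the address split of Figure 1: with 32 banks (n = 5) and 8 cores (p = 3), a shared request can map to any of the 32 banks, while a private request selects one of the 2^(n-p) = 4 banks reserved near the requester. The sketch below computes both bank indices and the resulting capacities; the numbering of private banks (banks 4c to 4c+3 belonging to core c) is our own labeling for illustration, not the paper's physical placement.

```python
# Bank selection and capacity breakdown for the evaluated configuration.
# n = 5 (32 banks), p = 3 (8 cores), B = 6 (64-byte blocks). The mapping of
# private-bank indices to physical banks near each core is assumed here.

N_BITS, P_BITS, B_BITS = 5, 3, 6
BANK_KB = 512

def shared_bank(addr):
    return (addr >> B_BITS) & ((1 << N_BITS) - 1)               # any of 32 banks

def private_bank(addr, core):
    local = (addr >> B_BITS) & ((1 << (N_BITS - P_BITS)) - 1)   # one of 4 banks
    return core * (1 << (N_BITS - P_BITS)) + local              # assumed numbering

total_mb = (1 << N_BITS) * BANK_KB / 1024                 # 16 MB of L2 overall
private_mb = (1 << (N_BITS - P_BITS)) * BANK_KB / 1024    # up to 2 MB per core
print(shared_bank(0x12345678), private_bank(0x12345678, core=3))
print(f"total L2 = {total_mb} MB, private portion per core <= {private_mb} MB")
```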

4 Design Alternatives

Using the infrastructure described in the previous section, we now explore how different architectural alternatives perform.

4.1 Dynamic Partitioning Policy

Here we describe the different dynamic partitioning algorithms used. When data arrives at an L2 bank and there is no room for it, the replacement algorithm needs to select which block to evict. In our private/shared scheme, the decision must consider not only the LRU block (or whatever replacement policy is used) but also the partition, shared or private, from which the victim is taken. The different dynamic partitioning policies, described next, try to adapt this choice to the behavior of the executed workload at the lowest possible cost; a sketch of the resulting victim-selection rules is given after the performance comparison below.

- Always steal. In this algorithm, whenever a private or shared block has to be written and there is no available way in its own partition, the LRU block of the opposite partition is stolen. Consequently, if a private block arrives, the LRU block of the shared partition of that set is evicted, and vice versa. This method is slightly more complex than a classic LRU algorithm, as we need to keep track of the LRU blocks of both the private and the shared parts.


- Shadow tags. Suh et al. [22] described a method for dynamic partitioning of caches which was extended by Dybdahl and Stenström [5] to NUCA. However, our implementation of shadow tags has several noticeable differences: we do not use a centralized monitoring circuit, the dynamic adaptation of the cache is done at set level, and the evaluation and enforcement of the policy is not decided every thousand misses but is integrated within the replacement policy. We use a configurable number of shadow tags per set and partition (shared/private) and track the two LRU blocks, as in the previous policy. When a block is evicted, its tag is written into the proper shadow tag, so the shadow tags hold the last blocks evicted from each partition in each set. We also need one counter per set to decide whether to "steal" a way from the opposite partition. On a hit in a shadow tag, the counter is increased; on a hit in the LRU block of the opposite partition, it is decreased. When data arrives and there is no room available in its partition, the LRU block of the opposite partition is evicted if the counter is above a defined threshold; otherwise, the LRU block of the current partition is replaced. Simulations show that the best threshold is 0: larger thresholds restrict the dynamic behaviour, over-restricting the capacity changes between the shared and private partitions. With this threshold, the counter can be replaced by a single bit (the stealing bit), set to 1 on a hit in a shadow tag and cleared on a hit in the opposite partition's LRU block. The implementation cost of this policy could be noticeable, mainly because additional tag arrays are required in each bank to store the shadow tags.

Figure 4. Runtime with different numbers of shadow tags per set, normalized to the 1-shadow-tag implementation.

As can be seen in Figure 4, in most cases the performance improves when more shadow tags are employed. In order to achieve the best performance, at least 8 shadow tags are required; with several shadow tags per set, we obtain a better prediction of the advantage of stealing blocks from each partition. Finally, in the last configuration, denoted "16 noLRU", we try to increase the dynamic behaviour by not considering the LRU: only hits in the shadow tags modify the stealing bit. The performance results appear to be insensitive to this decision. Further comparisons in this section consider only the 8-shadow-tag implementation, because it is the most cost-effective alternative.

- Global LRU. The last policy we consider is the simplest. We use the basic LRU replacement policy: when data arrives and a block must be evicted, no matter whether the incoming block is private or shared, we choose the LRU block of the whole set, private and shared ways together. This method requires no extra implementation cost for the replacement policy over a standard LRU.

Performance comparison of replacement policies. We compare the three dynamic approaches with a static implementation, in which sets have a fixed number of ways for private and shared blocks, similarly to [26]. Given the sharing properties of the workloads, the division chosen was 12 ways for the private partition and 4 for the shared one.

Figure 5. Normalized runtime of the different dynamic partitioning algorithms (always steal, shadow tags, static 12/4, and global LRU).

Figure 5 shows that the global LRU replacement algorithm gives the best overall performance, as it is capable of adapting quickly and correctly to the different benchmark behaviors. The shadow-tag implementation performs correctly in the multiprogrammed benchmarks but is not flexible enough for multithreaded benchmarks like FT; this makes it unsuitable, since the global LRU approach is simpler and more efficient. The "always steal" algorithm, being the most dynamic, often overshoots and moves the shared/private boundary faster than needed, which increases the miss ratio and harms overall performance. With the static partitioning approach, the reduction in cache utilization for multiprogrammed workloads clearly impairs performance.
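The victim-selection rules of the three dynamic policies differ only in how they pick the way to evict once the set is full. The sketch below contrasts them over a toy set model; the Block structure, the LRU ordering by timestamp, and the single per-set stealing_bit flag are our simplifications of the mechanisms described above, not an implementation of the hardware.

```python
# Victim selection for the three dynamic partitioning policies.
# A set is a list of Block objects; a lower last_use value means older (LRU).
from dataclasses import dataclass

@dataclass
class Block:
    tag: int
    private: bool     # the "private bit"
    last_use: int     # timestamp approximating LRU order

def lru_of(blocks, private=None):
    cand = [b for b in blocks if private is None or b.private == private]
    return min(cand, key=lambda b: b.last_use) if cand else None

def victim_global_lru(cache_set, incoming_is_private):
    # Global LRU: ignore partitions, evict the oldest block of the whole set.
    return lru_of(cache_set)

def victim_always_steal(cache_set, incoming_is_private):
    # Always steal: with no free way in the incoming block's partition,
    # take the LRU block of the opposite partition (fall back if it is empty).
    victim = lru_of(cache_set, private=not incoming_is_private)
    return victim or lru_of(cache_set, private=incoming_is_private)

def victim_shadow_tags(cache_set, incoming_is_private, stealing_bit):
    # Shadow tags (threshold 0): steal from the opposite partition only when
    # recent shadow-tag hits suggest the incoming partition needs more space.
    if stealing_bit:
        return victim_always_steal(cache_set, incoming_is_private)
    return lru_of(cache_set, private=incoming_is_private) or lru_of(cache_set)

s = [Block(1, True, 5), Block(2, True, 9), Block(3, False, 2)]
print(victim_global_lru(s, True).tag)          # 3, oldest block overall
print(victim_always_steal(s, True).tag)        # 3, stolen from the shared partition
print(victim_shadow_tags(s, True, False).tag)  # 1, oldest private block
```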


In any case, it is known that the LRU algorithm performs well with workloads characterized by good temporal locality [21]. For applications with poor temporal locality, other replacement policies could be a better choice [12][13]. Besides, although with this simple mechanism we are able to remove a large portion of the inter-core interference at L2, other more complex cache partitioning policies, such as [19], could complement the replacement policy to reduce it even further.

4.2 Victim Cache

In this subsection we analyze the performance advantage obtained when the shared partition of the L2 cache is used as a victim cache for private evictions, an idea proposed by Zhao et al. in [26]. When a block is evicted from the private part of the cache it is sent to the corresponding shared bank, replacing one of its blocks. An additional victim-owner field is needed in order to return the block to a private bank when its owner core requests it again, or to change the block status to shared if a different core requests it. Figure 6 shows that, when using global LRU replacement, the victim cache does not provide a significant advantage. Although in ammp and art-gcc the victim cache improves performance, in the mcf, parser and twolf-mcf benchmarks sending victims to other caches lowers performance, as cores have to compete with each other for their own private L2 banks. The average of all execution times is 1% greater with the victim cache. These results do not justify the cost of using the shared partition as a victim of private evictions. With static partitioning, or with unbalanced heterogeneous workloads, a victim cache could probably be worthwhile.

Figure 6. Normalized runtime of the global LRU policy with and without using the shared partition as a victim cache.

In conclusion, the simplest choice seems to be the most effective in performance-cost ratio, so the scheme chosen is global LRU replacement without using the shared partition as a victim cache.

5 Comparative Study

We compare SP-NUCA with a static NUCA and a dynamic NUCA [1][10], both using a token-based coherence protocol. The S-NUCA uses all banks as a shared L2 cache and maps each cache block to a specific bank, depending on the interleaving and the block address. When an L1 misses, it sends a request to the rest of the L1s and to the corresponding L2 bank. If the block is not found in the L2, the request is forwarded to memory. The D-NUCA implementation [1] tries to bring private data to banks near the processor. To do so, it divides the L2 banks into two groups, center and local banks, each comprising half of the L2 banks. In the simulated system, each processor has two local banks, chosen from the nearest ones, and there are 16 center banks. The simulated D-NUCA uses a perfect search algorithm to know in which L2 bank the requested data resides. The main parameters of the simulated systems, shown in Table 2, are the same in the three cases. As concluded in Section 4, the most cost-effective SP-NUCA configuration uses the global LRU algorithm without using the shared partition as a victim cache.

As seen in Figure 7, the S-NUCA gives an average hit time of 26-29 cycles, as cache blocks are spread uniformly among all the L2 cache banks. However, the SP-NUCA and D-NUCA average latencies can differ remarkably depending on the workload. The D-NUCA latencies show the expected behavior of this design, bringing data close to the processors and reducing global latency. Similarly, in SP-NUCA, when most of the blocks are found in the private bank of the requestor, the average latency is as low as 10 cycles, so the multiprogrammed workloads yield very low hit latencies. On the other hand, when hitting a shared block, SP-NUCA is slower than its S-NUCA counterpart because it needs to send a request first to the private bank and forward it when that bank fails, increasing shared-access latency up to 34 cycles. This worsens the latency in the CG benchmark and increases the average latency in the other NPB benchmarks, as more accesses to shared data are required.

Figure 7. Average L2 hit latencies of S-NUCA, D-NUCA, and SP-NUCA.
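The shape of Figure 7 can be reasoned about with a simple weighted-average model: SP-NUCA's mean hit latency is the private-hit latency weighted by the fraction of L2 hits served by the requester's private bank, plus the higher shared-hit latency weighted by the remaining fraction. The sketch below uses the cycle figures quoted in the text; the hit-fraction values are purely illustrative placeholders of ours, not measured data.

```python
# Back-of-the-envelope model for SP-NUCA's average L2 hit latency.
# Latencies come from the discussion above; the fractions are made up
# solely to illustrate the trend of Figure 7.

PRIVATE_HIT_CYCLES = 10   # hit in the requester's nearby private bank
SHARED_HIT_CYCLES = 34    # private-bank miss followed by a shared-bank hit

def avg_hit_latency(private_hit_fraction):
    shared_fraction = 1.0 - private_hit_fraction
    return (private_hit_fraction * PRIVATE_HIT_CYCLES
            + shared_fraction * SHARED_HIT_CYCLES)

# Mostly-private workload (multiprogrammed-like) vs. sharing-heavy workload.
print(avg_hit_latency(0.95))   # ~11.2 cycles
print(avg_hit_latency(0.40))   # ~24.4 cycles, approaching S-NUCA's 26-29
```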


Figure 8 shows the average performance of the different proposals. We see a significant improvement in the multiprogrammed benchmarks, as most of their requests are private. art, ammp and gcc achieve better performance because they use the L2 more heavily (around 80 hits per thousand instructions); parser makes only 6 L2 hits per thousand instructions, so its runtime remains almost unmodified. In art-gcc there is a noticeable performance degradation due to unbalanced cache utilization. A victim cache would improve this result but, as seen in Figure 6, that simple mechanism does not yield good overall performance. The NAS Parallel Benchmarks confirm that with multithreaded applications that share a significant amount of data, in most cases performance does not suffer or even improves slightly.

Figure 8. Normalized runtime of S-NUCA, D-NUCA, and SP-NUCA.

The D-NUCA has a significantly better latency than the S-NUCA and the SP-NUCA in FT, so its runtime is better there. However, in some multiprogrammed applications the D-NUCA is unable to capture the workload behavior as efficiently as the SP-NUCA: it moves blocks excessively, which leads to worse performance than SP-NUCA. In some workloads, D-NUCA suffers a large hit-rate loss. This behavior could be related to the presence of unnecessary inter-core interference. If we compare hit latencies and runtimes between D-NUCA and SP-NUCA, the differences are clearly dissimilar: capacity misses in several benchmarks affect the performance of other cores. In contrast, SP-NUCA is able to isolate all of the private L2 accesses between cores. Although not for the same reasons as D-NUCA, S-NUCA also suffers from this problem.

6 Related Work

A notable quantity of work has been done on CMP caching strategies. In particular, the NUCA approach represents one of the most relevant alternatives, with many papers addressing issues such as sharing, migration, or replication of data in order to improve on-chip access latency [1][2][3][4][25]. Most of these works provide different methodologies to optimize block placement, block replication, tag usage, etc. in a NUCA.

A significant number of proposals also address interference isolation. Stone et al. [21] evaluate different partitioning schemes between data and instruction streams in single-core chips. Other works have approached this issue in the CMP case: in [23] Suh et al. determine, through online analysis, the marginal gain of selecting different partitions, and in [19] Qureshi et al. use a monitoring circuit that determines each application's utility information in order to allocate cache space to it. Others, such as cooperative caching partitioning [3], try to address both latency reduction and interference minimization. In most cases, cache partitioning operates in repetitive multi-million-cycle measure-partition-enforce intervals. In our proposal the partitioning is straightforward and much simpler, and for the considered workloads a large proportion of the unnecessary interference between cores is removed.

Centering our attention on NUCA architectures, in [6] Huh et al. analyze different sharing degrees with coarse static or dynamic partitions for NUCA. They conclude that static mappings with a sharing degree of 2 or 4 can provide the best latency, and that dynamic mapping can improve performance at the cost of complexity and power consumption. Dybdahl and Stenström [5] propose a cooperative caching strategy based on a NUCA to both avoid inter-thread interference and exploit the locality benefits of private caches, using shadow tags to determine the benefit of having an extra cache way. Their work requires a shared engine to decide the best partition and control the replacement in all shared caches, which limits its scalability to large caches or high numbers of cores per CMP. Also, this partitioning engine takes its decision after a large number of misses, while our scheme reduces it to every write-back, only taking into consideration the already managed LRU block, without monitoring any element beyond the L2 bank.

To our knowledge, the work most similar to our proposal is [26]. Zhao et al. present a work that resembles ours in motivation and objectives, but in an incipient state, with a simpler implementation, a shallow evaluation and limited conclusions. That work assumes private L2 caches and a NUCA for the L3. Only statically assigned shared/private partitions are considered; although briefly mentioned, no dynamic scheme is proposed or evaluated. The authors employ a distributed directory to determine which private partition owns the required block when a miss occurs in the private bank of the requesting processor and in the shared bank. This design decision introduces supplementary evictions due to capacity misses in the directory. Additionally, the directory could require a significant portion of on-chip area to be implemented. To reduce the cost of directory capacity misses, the authors


propose using shared partitions as victim caches. Although our study suggests that a victim cache in the shared partitions can degrade performance, with static partitions it could be of interest. No details about the coherence protocol are provided. The evaluation methodology is based on trace-driven simulation, providing results only in terms of hit and miss rates, with no measure of the proposal's real performance impact.

7 Conclusions and future work

This paper presents an easy-to-implement dynamic cache partitioning scheme. By dividing each L2 bank at set level into private and shared content and adjusting the replacement algorithm, we can dynamically place private data closer to its owner processor. The proposed SP-NUCA architecture reduces the hit latency when accessing private data while almost preserving the S-NUCA latency when accessing data shared between processors. SP-NUCA essentially maintains the simplicity of an S-NUCA while allowing better on-chip locality. We explored different methods to determine the best partition of the cache at runtime. For the workloads evaluated, the best algorithm turns out to be the simplest, requiring almost no modification over a conventional global-LRU policy. The isolation of private accesses brings an additional advantage of SP-NUCA over other alternatives; because of this attribute, we believe that SP-NUCA could be a suitable foundation for supporting some degree of CQoS.

8 Acknowledgements

This work has been supported by the Ministry of Education and Science of Spain, under grant TIN2007-68023-C02-01, and by the HiPEAC European Network of Excellence.

9 References

[1] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor Caches", MICRO-37, 2004.
[2] B. M. Beckmann, M. R. Marty, and D. A. Wood, "ASR: Adaptive Selective Replication for CMP Caches", MICRO, 2006.
[3] J. Chang and G. S. Sohi, "Cooperative Caching for Chip Multiprocessors", ISCA, 2006.
[4] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Optimizing Replication, Communication, and Capacity Allocation in CMPs", ISCA, 2005.
[5] H. Dybdahl and P. Stenström, "An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors", HPCA, 2007.
[6] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler, "A NUCA Substrate for Flexible CMP Cache Sharing", IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 8, pp. 1028-1040, September 2007.
[7] R. Iyer, "CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms", ICS, 2004.
[8] International Technology Roadmap for Semiconductors, ITRS 2005 Update, Semiconductor Industry Association, 2005.
[9] H. Jin, M. Frumkin, and J. Yan, "The OpenMP Implementation of NAS Parallel Benchmarks and its Performance", NAS Technical Report NAS-99-011, NASA Ames Research Center, Moffett Field, CA, 1999.
[10] C. Kim, D. Burger, and S. W. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches", ASPLOS X, pp. 211-222, October 2002.
[11] S. Kim, D. Chandra, and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", PACT, 2004.
[12] D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, "LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies", IEEE Trans. Computers, vol. 50, no. 12, pp. 1352-1361, December 2001.
[13] N. Megiddo and D. S. Modha, "ARC: A Self-Tuning, Low Overhead Replacement Cache", Proc. USENIX Conf. on File and Storage Technologies (FAST 2003), pp. 115-130, 2003.
[14] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, F. Larsson, A. Moestedt, and B. Werner, "Simics: A Full System Simulation Platform", Computer, vol. 35, no. 2, pp. 50-58, February 2002.
[15] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, and D. Wood, "Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset", SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92-99, November 2005.
[16] M. M. K. Martin, M. D. Hill, and D. A. Wood, "Token Coherence: Decoupling Performance and Correctness", ISCA, 2003.
[17] C. J. Mauer, M. D. Hill, and D. A. Wood, "Full-System Timing-First Simulation", SIGMETRICS, pp. 108-116, 2002.
[18] M. R. Marty, J. D. Bingham, M. D. Hill, A. J. Hu, M. M. K. Martin, and D. A. Wood, "Improving Multiple-CMP Systems Using Token Coherence", HPCA-11, pp. 328-339, 2005.
[19] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, "A Case for MLP-Aware Cache Replacement", ISCA, 2006.
[20] SPEC CPU2000, http://www.spec.org/cpu2000/
[21] H. S. Stone, J. Turek, and J. L. Wolf, "Optimal Partitioning of Cache Memory", IEEE Trans. Computers, vol. 41, no. 9, pp. 1054-1068, September 1992.
[22] G. E. Suh, S. Devadas, and L. Rudolph, "Dynamic Cache Partitioning for Simultaneous Multithreading Systems", IASTED Int. Conf. on Parallel and Distributed Computing Systems, 2001.
[23] G. E. Suh, S. Devadas, and L. Rudolph, "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning", HPCA, 2002.
[24] S. Thoziyoor, N. Muralimanohar, and N. P. Jouppi, "CACTI 5.0: An Integrated Cache Timing, Power, and Area Model", Technical Report, HP Laboratories Palo Alto, 2007.
[25] M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors", ISCA, 2005.
[26] L. Zhao, R. Iyer, M. Upton, and D. Newell, "Towards Hybrid Last Level Caches for Chip Multiprocessors", dasCMP, 2007.