
Predicting Cache Space Contention in Utility Computing Servers



Yan Solihin, Fei Guo, and Seongbeom Kim Department of Electrical and Computer Engineering North Carolina State University {solihin,fguo,skim16}@ece.ncsu.edu http://www.cesr.ncsu.edu/solihin

Abstract

The need to provide performance guarantees in high-performance servers has long been neglected. Providing performance guarantees in current and future servers is difficult because fine-grain resources, such as on-chip caches, are shared by multiple processors or thread contexts. Although inter-thread cache sharing generally improves the overall throughput of the system, the impact of cache contention on the threads that share the cache is highly non-uniform: some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Clearly, this situation is not desirable when performance guarantees need to be provided, such as in utility computing servers. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose an inductive probability model to predict the impact of cache sharing on co-scheduled threads. The input to the model is the isolated L2 circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the model is the number of extra L2 cache misses for each thread due to cache sharing. We validate the model against a cycle-accurate simulation that implements a dual-core Chip Multi-Processor (CMP) architecture, on fourteen pairs of mostly SPEC benchmarks. The model achieves an average error of only 3.9%.

1 Introduction

In a typical Chip Multi-Processor (CMP) architecture, the L2 cache and its lower-level memory hierarchy are shared by multiple cores [8].* Sharing the L2 cache allows high cache utilization and avoids duplicating cache hardware resources. However, as will be demonstrated in this paper, cache sharing impacts threads non-uniformly: some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress.

To illustrate the performance problem, Figure 1 shows the number of L2 cache misses (1a) and the IPC (1b) of mcf when it runs alone and when it is co-scheduled with another thread that runs on a different processor core but shares the L2 cache. The bars are normalized to the case where mcf runs alone. The figure shows that when mcf runs together with mst or gzip, it does not suffer many additional misses compared to when it runs alone. However, when it runs together with art or swim, its number of misses grows to roughly 390% and 160% of the stand-alone count, respectively, resulting in IPC reductions of 65% and 25%, respectively.

* This work is primarily supported by the National Science Foundation through grant CNS-0406306. It is also supported in part by Faculty Early Career Development Award CCF-0347425, and by North Carolina State University.
Figure 1: The number of L2 cache misses (a) and IPC (b) for mcf when it runs alone compared to when it is co-scheduled with another thread. Bars are normalized to mcf running alone. The L2 cache is 512 KB, 8-way associative, with a 64-byte line size.
Past studies have investigated profiling techniques that detect, but do not prevent, these problems, and have applied inter-thread cache partitioning schemes to improve fairness [11] or throughput [17, 11]. However, the questions of what factors influence the impact of cache sharing suffered by a thread in a co-schedule, and whether the problems can be predicted (and hence prevented), were not addressed. This paper addresses both questions by presenting an analytical model that predicts the impact of cache sharing. Past performance prediction models only predict the number of cache misses in a uniprocessor system [3, 5, 6, 7, 12, 19, 20], or predict cache contention in a single-processor time-shared system [1, 16, 18], where it was assumed that only one thread runs at any given time. Therefore, interference between threads that share a cache was not modeled.

This paper goes beyond past studies and presents a tractable, analytical inductive probability model (Prob) that predicts the impact of cache space contention between threads that simultaneously share the L2 cache on a Chip Multi-Processor (CMP) architecture. The input to the model is the isolated L2 cache circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the model is the number of extra L2 cache misses of each thread that shares the cache. We validate the model by comparing the predicted number of cache misses under cache sharing for fourteen pairs of benchmarks against a detailed, cycle-accurate CMP architecture simulation. We found that Prob is very accurate, achieving an average absolute error of only 3.9%.

The rest of the paper is organized as follows. Section 2 presents the model. Section 3 details the validation setup. Section 4 presents and discusses the model validation results. Finally, Section 5 summarizes the findings.

2 Cache Miss Prediction Model

2.1 Assumptions

We assume that each thread's temporal behavior can be captured by a single stack distance or circular sequence profile. Although applications change their temporal behavior over time, in practice we find that the average behavior is good enough to produce an accurate prediction of the cache sharing impact. Representing an application with multiple profiles corresponding to different program phases may improve the prediction accuracy further, at the expense of extra complexity due to phase detection and profiling; this is beyond the scope of this paper.

It is also assumed that the profile of a thread is the same with or without sharing the cache with other threads. This assumption ignores the impact of the multi-level cache inclusion property [2]. In such a system, when a cache line is replaced from the L2 cache, the copy of the line in the L1 cache is invalidated. As a result, the L1 cache may suffer extra cache misses. This changes the cache miss stream of the L1 cache, potentially changing the profile seen at the L2 cache level.

Co-scheduled threads are assumed not to share any address space. This is mostly true when the co-scheduled threads come from different applications. Although parallel-program threads may share a large amount of data, such threads are likely to have similar characteristics, which makes the cache sharing prediction easier: the cache is likely to be divided equally among the threads, and each thread is likely to be impacted in the same way. Consequently, we ignore this case. Furthermore, for most of the analyses, the L2 cache only stores data and not instructions. If instructions are stored in the L2 cache along with data, the accuracy of the model decreases only slightly (by 0.8%).

Finally, the L2 cache is assumed to use the Least Recently Used (LRU) replacement policy. Although some implementations use different replacement policies, they are usually approximations of LRU. Therefore, the observations about cache sharing impact made in this paper are likely to apply to many other replacement policies as well.

2.2 Stack Distance Profiling

The input to the model is the isolated L2 cache stack distance or circular sequence profile of each thread, collected without cache sharing. A stack distance profile captures the temporal reuse behavior of an application in a fully- or set-associative cache [13, 3, 12, 14], and is sometimes also referred to as marginal gain counters [16, 17]. For an A-way associative cache with the LRU replacement algorithm, there are A + 1 counters: C_1, C_2, ..., C_A, C_>A. On each cache access, one of the counters is incremented. If the access is to a line in the i-th position of the LRU stack of the set, C_i is incremented. Note that the first line in the stack is the most recently used line in the set, and the last line in the stack is the least recently used one. If the access is a cache miss, the line is not found in the LRU stack, and the miss counter C_>A is incremented. A stack distance profile can be easily obtained statically by the compiler [3], by simulation, or by running the thread alone in the system [17].

For our purpose, since we need to compare stack distance profiles from different applications, it is useful to convert each counter into a frequency by dividing it by the number of processor cycles over which the profile is collected, i.e., Cf_i = C_i / CPUcycles. We refer to Cf_>A as the miss frequency, denoting the frequency of cache misses in CPU cycles, and we call the sum of all other counters, \sum_{i=1}^{A} Cf_i, the reuse frequency. We refer to the sum of the miss and reuse frequencies as the access frequency (Af).
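As a concrete illustration of how the counters are built, the following is a minimal Python sketch (not the authors' profiling tool; the function name and the per-set trace representation are assumptions for illustration) that computes the stack distance counters for a single cache set, given the trace of line addresses that map to that set.

```python
from collections import defaultdict

def stack_distance_profile(set_trace, assoc):
    """Stack distance counters for one set of an A-way LRU cache:
    C[i] counts accesses that hit the i-th most recently used line (i = 1 is
    the MRU position); C['>A'] counts accesses whose reuse distance exceeds
    the associativity, plus first-time (compulsory) accesses."""
    stack = []                                   # full LRU stack, index 0 = MRU
    C = defaultdict(int)
    for addr in set_trace:
        if addr in stack:
            pos = stack.index(addr) + 1          # 1-based stack distance
            C[pos if pos <= assoc else '>A'] += 1
            stack.remove(addr)
        else:
            C['>A'] += 1                         # line never seen before
        stack.insert(0, addr)                    # accessed line becomes MRU
    return C
```

For example, for the set trace A B C D A on a 4-way set, this sketch yields C_4 = 1 (the reuse of A at stack distance 4) and four compulsory accesses counted in C_>A.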

2.3 Inductive Probability Model (Prob)
The Prob model computes the probability of each cache hit turning into a cache miss, for every possible access interleaving by an interfering thread. Before we explain the model, it is useful to define two terms.

Definition 1 A sequence of accesses from thread X, denoted seq_X(d_X, n_X), is a series of n_X cache accesses to d_X distinct line addresses by thread X, where all the accesses map to the same cache set.

Definition 2 A circular sequence of accesses from thread X, denoted cseq_X(d_X, n_X), is a special case of seq_X(d_X, n_X) in which the first and the last accesses are to the same line address, and there are no other accesses to that address.

For a sequence seq_X(d_X, n_X), n_X >= d_X necessarily holds; when n_X = d_X, each access is to a distinct address. We use seq_X(d_X, *) to denote all sequences where n_X >= d_X. For a circular sequence cseq_X(d_X, n_X), n_X >= d_X + 1 necessarily holds; when n_X = d_X + 1, each access is to a distinct address, except the first and the last. We use cseq_X(d_X, *) to denote all circular sequences where n_X >= d_X + 1.

In a sequence, there may be several, possibly overlapping, circular sequences. The relationship between a sequence and its circular sequences is illustrated in Figure 2. In the figure, there are eight accesses to five different line addresses that map to one cache set. The sequence contains three overlapping circular sequences: one that starts and ends with address A (cseq(4, 5)), one that starts and ends with address B (cseq(5, 7)), and one that starts and ends with address E (cseq(1, 2)).

Figure 2: Illustration of the relationship between a sequence and circular sequences. The access trace A B C D A E E B forms seq(5, 8), which contains the circular sequences cseq(4, 5), cseq(5, 7), and cseq(1, 2).

We are interested in determining whether the last access of a circular sequence cseq_X(d_X, n_X) is a cache hit or a cache miss. (Addresses that are accessed only once do not form circular sequences; each such access is a compulsory cache miss, and it remains a miss with or without cache sharing.) To determine this, it is important to consider the following property.

Property 1 In an A-way associative LRU cache, the last access in a circular sequence cseq_X(d_X, n_X) results in a cache miss if, between the first and the last access, there are accesses to at least A distinct addresses (from any threads). Otherwise, the last access is a cache hit.

Explanation: If there are accesses to at least A distinct addresses between the first access and the time right before the last access occurs, the address of the first and last access will have been shifted out of the LRU stack by the other A (or more) addresses, causing the last access to be a cache miss. If there are only a < A distinct addresses between the first and the last access, then right before the last access that address is in the (a + 1)-th position of the LRU stack, resulting in a cache hit.

Corollary 1 When a thread runs alone, the last access in a circular sequence cseq_X(d_X, n_X) results in a cache miss if d_X > A, or a cache hit if d_X <= A. Furthermore, in stack distance profiling, the last access of cseq_X(d_X, n_X) results in an increment of the counter C_>A if d_X > A (a cache miss), or of the counter C_dX if d_X <= A (a cache hit).

The corollary is intuitive: when a thread X runs alone, the distinct addresses in the circular sequence cseq_X(d_X, n_X) are exactly its own d_X addresses (they can only come from thread X). More importantly, the corollary shows the relationship between stack distance profiling and circular sequences. Every time a circular sequence with d_X <= A distinct addresses appears, C_dX is incremented. Hence, if N(cseq_X(d_X, *)) denotes the number of occurrences of circular sequences cseq_X(d_X, *), we have C_dX = N(cseq_X(d_X, *)). This leads to the following corollary.

Corollary 2 The probability of occurrence of circular sequences cseq_X(d_X, *) from thread X is P(cseq_X(d_X, *)) = C_dX / totAccess_X, where totAccess_X denotes the total number of accesses of thread X, and d_X <= A.
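The circular sequence profile can be gathered with one pass over each set's access trace; below is a minimal Python sketch under the same assumptions as the earlier stack-distance sketch (hypothetical function name, per-set trace as a list). Summing the counts over n for a fixed d <= A recovers C_d, and dividing by the total number of accesses gives the P(cseq(d, *)) of Corollary 2.

```python
from collections import defaultdict

def circular_sequence_profile(set_trace):
    """Count N(cseq(d, n)) for one cache set: every reuse of an address closes
    a circular sequence spanning from the previous access to that address up
    to the current access (inclusive); d is the number of distinct addresses
    and n the number of accesses in that window."""
    last = {}                        # line address -> index of its latest access
    N = defaultdict(int)
    for i, addr in enumerate(set_trace):
        if addr in last:
            window = set_trace[last[addr]:i + 1]
            N[(len(set(window)), len(window))] += 1
        last[addr] = i
    return N

# The Figure 2 trace gives one occurrence each of cseq(4,5), cseq(1,2), cseq(5,7):
# circular_sequence_profile(list("ABCDAEEB")) == {(4, 5): 1, (1, 2): 1, (5, 7): 1}
```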

Let us now consider the impact of running a thread X together with another thread Y that shares the L2 cache with it. Figure 3 illustrates the impact, assuming a 4-way associative cache. It shows a circular sequence of thread X ("A B A"). When thread Y runs together with X and shares the cache, many interleavings between the accesses of X and Y are possible; the figure shows two of them. In the first case, the sequence "U V V" from thread Y occurs during the circular sequence. Since there are only three distinct addresses (U, B, and V) between the first and the last access to A, the last access to A is a cache hit. In the second case, the sequence "U V V W" from thread Y occurs during the circular sequence, so there are four distinct addresses (U, B, V, and W) between the accesses to A, which equals the cache associativity. By the time the second access to A occurs, address A is no longer in the LRU stack, since it has been replaced from the cache, resulting in a cache miss for the last access to A. More formally, we can state the condition for a cache miss in the following corollary.

Figure 3: Illustration of how intervening accesses from another thread determine whether the last access of a circular sequence is a cache hit or a miss. X's circular sequence is cseq(2, 3) = "A B A"; Y's sequence is "U V V W". Case 1: A U B V V A W (cache hit). Case 2: A U B V V W A (cache miss). Capital letters represent line addresses; the figure assumes a 4-way associative cache, and all accesses map to a single cache set.

Corollary 3 Suppose a thread X runs together with another thread Y. Also suppose that during the time interval between the first and the last access of X's circular sequence, denoted T(cseq_X(d_X, n_X)), a sequence of addresses from thread Y (i.e., seq_Y(d_Y, n_Y)) occurs. The last access of X's circular sequence results in a cache miss if d_X + d_Y > A, or a cache hit if d_X + d_Y <= A. (For simplicity, we only discuss the case where two threads share a cache; the corollary can easily be extended to more than two threads.)

Every cache miss of thread X remains a cache miss under cache sharing. However, as the corollary implies, some cache hits of thread X may become cache misses under cache sharing. In particular, the probability that the last access of a circular sequence cseq_X(d_X, n_X) with d_X < A becomes a cache miss equals the probability of occurrence of sequences seq_Y(d_Y, *) with d_Y > A - d_X.

Note that we now deal with a probability computation involving four random variables (d_X, n_X, d_Y, and n_Y). To simplify the computation, we represent n_X and n_Y by their expected values, n̄_X and E(n_Y), respectively. Hence, Corollary 3 can be formally stated as:

P_{miss}(cseq_X(d_X, \bar{n}_X)) = \sum_{d_Y = A - d_X + 1}^{E(n_Y)} P(seq_Y(d_Y, E(n_Y)))    (1)

Therefore, computing the extra cache misses suffered by thread X under cache sharing can be accomplished with the following steps:

1. For each possible value of d_X, compute the weighted average of n_X (i.e., n̄_X) from the distribution of cseq_X(d_X, n_X). This requires circular sequence profiling [4]. Then use cseq_X(d_X, n̄_X) in place of cseq_X(d_X, n_X).

2. Compute the expected time interval duration of X's circular sequence, i.e., T(cseq_X(d_X, n̄_X)).

3. Compute the expected number of accesses of Y, i.e., E(n_Y), during the time interval T(cseq_X(d_X, n̄_X)). Then use seq_Y(d_Y, E(n_Y)) to represent seq_Y(d_Y, n_Y).

4. For each possible value of d_Y, compute the probability of occurrence of the sequence seq_Y(d_Y, E(n_Y)), i.e., P(seq_Y(d_Y, E(n_Y))). Then compute the probability that the last access of X's circular sequence becomes a cache miss, using Equation 1.

5. Compute the expected number of extra cache misses by multiplying the cache miss probability of each circular sequence by its number of occurrences.

6. Repeat Steps 1-5 for each co-scheduled thread (e.g., thread Y).

We will now describe how each step is performed.
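To make the flow of Steps 1-5 concrete before detailing each step, here is a compact Python sketch (hypothetical function and parameter names, per cache set, two threads). It assumes the circular sequence counts of X and a function p_seq_Y implementing Theorem 1 below are available, and it uses the per-step formulas given in Sections 2.3.1-2.3.4; it is an illustration, not the authors' implementation.

```python
import math

def expected_extra_misses_per_set(cseq_counts_X, p_seq_Y, Af_X, Af_Y, assoc):
    """cseq_counts_X : dict (d_X, n_X) -> N(cseq_X(d_X, n_X)) for one set.
    p_seq_Y        : function (d, n) -> P(seq_Y(d, n)), e.g. from Theorem 1.
    Af_X, Af_Y     : per-set access frequencies (accesses per cycle).
    Returns the expected number of X's extra misses for this set."""
    extra = 0.0
    for d_X in range(1, assoc + 1):
        occ = {n: c for (d, n), c in cseq_counts_X.items() if d == d_X}
        total = sum(occ.values())                  # equals C_dX (Corollary 1)
        if total == 0:
            continue
        n_bar = sum(n * c for n, c in occ.items()) / total   # Step 1 (Eq. 2)
        T = n_bar / Af_X                                      # Step 2 (Eq. 3)
        E_nY = math.floor(Af_Y * T)                           # Step 3 (Eq. 4)
        # Step 4 (Eq. 1): probability that the closing access becomes a miss
        p_miss = sum(p_seq_Y(d_Y, E_nY)
                     for d_Y in range(assoc - d_X + 1, E_nY + 1))
        extra += p_miss * total                               # Step 5
    return extra
```

Adding the result to thread X's stand-alone misses C_>A gives Equation 5, and repeating the computation with the roles of X and Y swapped covers Step 6.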

2.3.1 Step 1: Computing n̄_X

n̄_X is computed as the average over all possible values of n_X:

\bar{n}_X = \frac{\sum_{n_X=d_X+1}^{\infty} N(cseq_X(d_X, n_X)) \times n_X}{\sum_{n_X=d_X+1}^{\infty} N(cseq_X(d_X, n_X))}    (2)

To obtain N(cseq_X(d_X, n_X)), off-line profiling or simulation can be used. On-line profiling is also possible, using simple hardware support in which a counter is added to each cache line to track n_X and a small table keeps track of N(cseq_X(d_X, n_X)). We found that each counter only needs 7 bits, because very few n_X values are larger than 128.

2.3.2 Steps 2 and 3: Computing T(cseq_X(d_X, n̄_X)) and E(n_Y)

To compute the expected time interval duration of a circular sequence, we divide n̄_X by the access frequency per set of thread X (Af_X):

T(cseq_X(d_X, \bar{n}_X)) = \frac{\bar{n}_X}{Af_X}    (3)

To estimate how many accesses by Y are expected during the time interval T(cseq_X(d_X, n̄_X)), we multiply it by the access frequency per set of thread Y:

E(n_Y) = Af_Y \times T(cseq_X(d_X, \bar{n}_X))    (4)

2.3.3 Step 4: Computing P(seq_Y(d_Y, E(n_Y)))

The problem can be stated as finding the probability that, given E(n_Y) accesses from thread Y, there are d_Y distinct addresses, where d_Y is a random variable. For simplicity, in this step we write P(seq(d, n)) to represent P(seq_Y(d_Y, E(n_Y))). The following theorem uses an inductive probability function to compute P(seq(d, n)).

Theorem 1 For a sequence of n accesses from a given thread, the probability that the sequence contains d distinct addresses can be computed with the recursive relation

P(seq(d, n)) =
  1                                                                    if n = d = 1
  P((d-1)^+) \times P(seq(d-1, d-1))                                   if n = d > 1
  P(1^-) \times P(seq(1, n-1))                                         if n > d = 1
  P(d^-) \times P(seq(d, n-1)) + P((d-1)^+) \times P(seq(d-1, n-1))    if n > d > 1

where P(d^-) = \sum_{i=1}^{d} P(cseq(i, *)) and P(d^+) = 1 - P(d^-).

Proof: The proof proceeds from the most complex case to the least complex one.

Case 1 (n > d > 1): Let the sequence seq(d, n) represent an access sequence Y_1, Y_2, ..., Y_{n-1}, Y_n. The sequence just prior to this one is Y_1, Y_2, ..., Y_{n-1}. There are two possible subcases. The first subcase is that the address accessed by Y_n also appears in the prior sequence, i.e., addr(Y_n) ∈ {addr(Y_1), addr(Y_2), ..., addr(Y_{n-1})}, hence the prior sequence is seq(d, n-1). In this subcase, appending Y_n to the prior sequence creates a new circular sequence cseq(i, *) with i ranging from 1 to d, which happens with probability \sum_{i=1}^{d} P(cseq(i, *)), denoted P(d^-). The second subcase is that the address accessed by Y_n has not appeared in the prior sequence, i.e., addr(Y_n) ∉ {addr(Y_1), addr(Y_2), ..., addr(Y_{n-1})}, hence the prior sequence is seq(d-1, n-1). In this subcase, appending Y_n either creates no new circular sequence at all (i.e., cseq(∞, *)) or creates a circular sequence that is not contained within the sequence (i.e., cseq(i, *) with i > d-1). Therefore, the probability of the second subcase is \sum_{i=d}^{\infty} P(cseq(i, *)) = 1 - \sum_{i=1}^{d-1} P(cseq(i, *)), denoted P((d-1)^+). (Computation-wise, \sum_{i=d}^{\infty} P(cseq(i, *)) is calculated as (C_{>A} + \sum_{i=d}^{A} C_i) / totAccess.) Hence P(seq(d, n)) = P(d^-) \times P(seq(d, n-1)) + P((d-1)^+) \times P(seq(d-1, n-1)).

Case 2 (n > d = 1): since seq(1-1, n-1) cannot occur, P(seq(d-1, n-1)) = 0, and P(seq(1, n)) = P(1^-) \times P(seq(1, n-1)) follows from Case 1.

Case 3 (n = d > 1): since seq(d, n-1) is impossible (there would be more distinct addresses than accesses), P(seq(d, n-1)) = 0, and P(seq(d, n)) = P((d-1)^+) \times P(seq(d-1, d-1)) follows from Case 1.

Case 4 (n = d = 1): P(seq(1, 1)) = 1, because the first address is always distinct. □
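A direct way to evaluate the recursion of Theorem 1 is with memoization over (d, n). The sketch below is an assumption-laden illustration (hypothetical names): it takes a dict p_cseq mapping d to P(cseq(d, *)) for one thread, obtained from its stack distance counters as in Corollary 2, with the > A tail implicitly contained in 1 - P(d^-).

```python
from functools import lru_cache

def make_p_seq(p_cseq):
    """Return a function p_seq(d, n) implementing the recursion of Theorem 1,
    where p_cseq[d] = P(cseq(d, *)) for the interfering thread."""
    def p_minus(d):                      # P(d^-) = sum_{i=1..d} P(cseq(i, *))
        return sum(p_cseq.get(i, 0.0) for i in range(1, d + 1))

    @lru_cache(maxsize=None)
    def p_seq(d, n):
        if d < 1 or n < d:
            return 0.0
        if n == 1 and d == 1:
            return 1.0                                   # first address is distinct
        if n == d:                                       # all accesses distinct
            return (1.0 - p_minus(d - 1)) * p_seq(d - 1, d - 1)
        if d == 1:                                       # all accesses to one address
            return p_minus(1) * p_seq(1, n - 1)
        return (p_minus(d) * p_seq(d, n - 1)
                + (1.0 - p_minus(d - 1)) * p_seq(d - 1, n - 1))
    return p_seq
```

A function built this way can serve as the p_seq_Y argument of the earlier driver sketch, called with n = E(n_Y).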

Corollary 2 and Figure 4 illustrate how P(d^-) and P(d^+) can be computed from the stack distance profile. The figure depicts a case in which a sequence already contains three distinct addresses: the probability that the next address is one that has already been seen is P(3^-); otherwise it is P(3^+).

Figure 4: Example probability distribution function over P(cseq(1, *)) through P(cseq(>8, *)), showing P(3^-) and P(3^+). The function is computed using the formula in Corollary 2.

2.3.4 Step 5: Computing the Number of Extra Misses Under Sharing

Step 4 has computed P(seq_Y(d_Y, E(n_Y))) for all possible values of d_Y. We can then compute P_miss(cseq_X(d_X, n̄_X)) using Equation 1. To find the total number of misses of thread X due to cache contention with thread Y, we multiply the cache miss probability of each circular sequence by the number of occurrences of that circular sequence, sum over all possible values of d_X, and add the result to the original number of cache misses (C_>A):

miss_X = C_{>A} + \sum_{d_X=1}^{A} P_{miss}(cseq_X(d_X, \bar{n}_X)) \times C_{d_X}    (5)
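In code, Equation 5 is a one-line accumulation; the following minimal helper (hypothetical names, per cache set) shows the bookkeeping, with p_miss[d] obtained from Equation 1.

```python
def misses_under_sharing(miss_counter, C, p_miss):
    """Equation 5: miss_counter is C_>A, C maps d_X -> C_dX (d_X = 1..A),
    and p_miss maps d_X -> Pmiss(cseq_X(d_X, n_bar_X))."""
    return miss_counter + sum(p_miss.get(d, 0.0) * C[d] for d in C)
```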

3 Validation Methodology

Simulation Environment. The evaluation is performed using a detailed CMP architecture simulator based on SESC, a cycle-accurate, execution-driven simulator [9]. The CMP cores are out-of-order superscalar processors with private L1 instruction and data caches; the L2 cache and all lower-level memory hierarchy components are shared. Table 1 shows the parameters used for each component of the architecture. The L2 cache uses prime modulo indexing to ensure that cache set utilization is uniform [10]. Unless noted otherwise, the L2 cache only stores data and does not store instructions.

Table 1: Parameters of the simulated architecture. Latencies correspond to contention-free conditions. RT stands for round-trip from the processor.
  CMP: 2 cores, each 4-issue dynamic, 3.2 GHz. Int, fp, ld/st FUs: 3, 2, 2. Branch penalty: 13 cycles. Re-order buffer size: 152.
  L1 inst, data (private): each WB, 32 KB, 4-way, 64-B line, RT: 2 cycles, LRU replacement.
  L2 data (shared): WB, 512 KB, 8-way, 64-B line, RT: 12 cycles, LRU replacement, prime modulo indexed, inclusive.
  RT memory latency: 362 cycles. Memory bus: split-transaction, 8 B, 800 MHz, 6.4 GB/sec peak.

Applications. To evaluate the model, we choose a set of mostly memory-intensive benchmarks: apsi, art, applu, equake, gzip, mcf, perlbmk, and swim from the SPEC2K benchmark suite [15], and mst from the Olden benchmark suite. Table 2 lists the benchmarks, their input sets, and their L2 cache miss rates over the benchmarks' entire execution. These miss rates may differ from those observed under co-scheduling, because the duration of a co-schedule may be shorter than a benchmark's entire execution. The benchmarks are paired and co-scheduled; fourteen benchmark pairs that exhibit a wide spectrum of stack distance profile mixes are evaluated.

Co-scheduling. Benchmark pairs run in a co-schedule until the shorter thread completes. At that point the simulation is stopped, to make sure that the statistics collected are only due to sharing the L2 cache. To obtain accurate stack distance and circular sequence profiles, the profile of the shorter thread is collected over its entire execution without cache sharing, while the profile of the longer thread is collected over the same number of instructions as in the co-schedule.

Table 2: The applications used in our evaluation.

  Benchmark   Input Set     L2 Miss Rate (whole program)
  art         test          99%
  applu       test          68%
  apsi        test          5%
  equake      test          91%
  gzip        test          3%
  mcf         test          9%
  perlbmk     reduced ref   59%
  swim        test          75%
  mst         1024 nodes    63%

4 Validation

Table 3 shows the validation results for the fourteen co-schedules that we evaluate. The first numeric column shows the number of extra L2 cache misses under cache sharing, divided by the L2 cache misses when each benchmark runs alone (e.g., 100% means that the number of cache misses under cache sharing is twice that of the benchmark running alone). These cache misses are collected using simulation. The last column presents the prediction error of the model, computed as the difference between the L2 cache misses predicted by the model and those collected by the simulator under cache sharing, divided by the number of L2 cache misses collected by the simulator under cache sharing. Therefore, a positive number means that the model predicts too many cache misses, while a negative number means that the model predicts too few. The last four rows of the table summarize the errors: they present the minimum, maximum, arithmetic mean, and geometric mean of the errors after each error value is converted to its absolute value.

Table 3: Validation results. For each co-schedule, the table lists each benchmark's extra L2 cache misses due to sharing and the prediction error E_j of Prob.

  Co-schedule    Benchmark   Extra L2 Misses   Prediction Error E_j
  applu+art      applu       29%               2%
                 art         0%                0%
  applu+equake   applu       10%               1%
                 equake      19%               5%
  art+equake     art         0%                0%
                 equake      43%               5%
  gzip+applu     gzip        243%              -25%
                 applu       11%               2%
  gzip+apsi      gzip        180%              -9%
                 apsi        0%                0%
  mcf+art        mcf         296%              7%
                 art         0%                0%
  mcf+equake     mcf         11%               -3%
                 equake      6%                5%
  mcf+gzip       mcf         18%               7%
                 gzip        102%              22%
  mcf+perlbmk    mcf         8%                -3%
                 perlbmk     28%               2%
  mcf+swim       mcf         59%               -7%
                 swim        0%                0%
  mst+art        mst         10%               0%
                 art         0%                0%
  mst+equake     mst         25%               3%
                 equake      3%                0%
  mst+mcf        mst         0%                0%
                 mcf         2%                0%
  swim+art       swim        0%                0%
                 art         0%                0%

  Min. abs. error: min(|E_j|)                        0%
  Max. abs. error: max(|E_j|)                        25%
  Mean abs. error: (Σ|E_j|)/n                        3.9%
  Geom. mean abs. error: (Π(1+|E_j|))^(1/n) − 1      3.7%

The impact of cache sharing is significant. In five co-schedules, one of the benchmarks suffers 59% or more extra cache misses: gzip+applu (243% extra misses in gzip), gzip+apsi (180% extra misses in gzip), mcf+art (296% extra misses in mcf), mcf+gzip (102% extra misses in gzip), and mcf+swim (59% extra misses in mcf).

Analyzing the errors across the co-schedules, Prob's prediction errors are generally small. They exceed 10% in only two cases, both involving a benchmark that suffers a very large increase in cache misses: gzip in gzip+applu (-25% error, 243% extra cache misses) and gzip in mcf+gzip (22% error, 102% extra cache misses). Since the model still correctly identifies a large increase in the number of cache misses, it is less critical to predict such cases very accurately. Elsewhere, Prob achieves very high accuracy, even when there is a large number of extra cache misses. For example, in mcf+art, mcf has 296% extra cache misses, yet the error is only 7%; in mcf+swim, mcf has 59% extra cache misses, yet the error is only -7%; and in gzip+apsi, gzip has 180% extra cache misses, yet the error is only -9%.

Prob's remaining inaccuracy may be due to two assumptions. We assume that the number of accesses in a circular sequence of thread X can be represented accurately by its expected value (n̄_X in Equation 2), and that the number of accesses from an interfering thread Y can be represented accurately by its expected value (E(n_Y) in Equation 4). In addition, the model rounds E(n_Y) down to the nearest integer. Relaxing these assumptions requires treating n_X and n_Y as random variables, which significantly complicates the Prob model.
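For reference, the four summary rows of Table 3 can be computed from the signed per-benchmark errors as follows; this is a small illustrative sketch (hypothetical function name), and applying it to the 28 errors listed above reproduces the 0%, 25%, 3.9%, and 3.7% values up to rounding.

```python
import math

def summarize_errors(errors):
    """errors: signed prediction errors E_j as fractions (e.g. -0.25 for -25%).
    Returns the min, max, arithmetic mean, and geometric mean of |E_j|."""
    abs_e = [abs(e) for e in errors]
    n = len(abs_e)
    return {
        "min": min(abs_e),
        "max": max(abs_e),
        "mean": sum(abs_e) / n,
        "geo_mean": math.prod(1 + e for e in abs_e) ** (1 / n) - 1,
    }
```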

5 Conclusions

This work has studied the impact of inter-thread cache contention on a Chip Multi-Processor (CMP) architecture. We have proposed an analytical model to predict the impact of cache sharing on co-scheduled threads. The input to the model is the isolated L2 cache circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the model is the number of extra L2 cache misses of each thread that shares the cache. We validated the model against a cycle-accurate simulation of a dual-core CMP architecture and found that it produces very accurate predictions. Future work includes using the Prob model as a foundation for job admission and scheduling in utility computing servers.

References

[1] A. Agarwal, M. Horowitz, and J. Hennessy. An Analytical Cache Model. ACM Trans. on Computer Systems, 1989.
[2] J.-L. Baer and W.-H. Wang. On the Inclusion Properties for Multi-Level Cache Hierarchies. In Proc. of the Intl. Symp. on Computer Architecture, 1988.
[3] C. Cascaval, L. DeRose, D. A. Padua, and D. Reed. Compile-Time Based Performance Prediction. In Proc. of the 12th Intl. Workshop on Languages and Compilers for Parallel Computing, 1999.
[4] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting the Impact of Inter-Thread Cache Contention on a Chip Multi-Processor Architecture. In Proc. of the 10th Intl. Symp. on High Performance Computer Architecture, 2005.
[5] S. Chatterjee, E. Parker, P. Hanlon, and A. Lebeck. Exact Analysis of the Cache Behavior of Nested Loops. In Proc. of the ACM SIGPLAN Conf. on Programming Language Design and Implementation, 2001.
[6] B. Fraguela, R. Doallo, and E. Zapata. Automatic Analytical Modeling for the Estimation of Cache Misses. In Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques, 1999.
[7] S. Ghosh, M. Martonosi, and S. Malik. Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior. ACM Trans. on Programming Languages and Systems, 21(4):703-746, 1999.
[8] IBM. IBM Power4 System Architecture White Paper, 2002.
[9] J. Renau et al. SESC. http://sesc.sourceforge.net, 2004.
[10] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses. In Proc. of the Intl. Symp. on High Performance Computer Architecture, 2004.
[11] S. Kim, D. Chandra, and Y. Solihin. Fair Cache Sharing and Partitioning on a Chip Multi-Processor Architecture. In Proc. of the Intl. Conf. on Parallel Architectures and Compilation Techniques, 2004.
[12] J. Lee, Y. Solihin, and J. Torrellas. Automatically Mapping Code on an Intelligent Memory Architecture. In Proc. of the 7th Intl. Symp. on High Performance Computer Architecture, 2001.
[13] R. L. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation Techniques for Storage Hierarchies. IBM Systems Journal, 9(2), 1970.
[14] Y. Solihin, J. Lee, and J. Torrellas. Automatic Code Mapping on an Intelligent Memory Architecture. IEEE Trans. on Computers, special issue on Advances in High Performance Memory Systems, 2001.
[15] Standard Performance Evaluation Corporation. SPEC Benchmarks. http://www.spec.org, 2000.
[16] G. Suh, S. Devadas, and L. Rudolph. Analytical Cache Models with Applications to Cache Partitioning. In Proc. of the Intl. Conf. on Supercomputing, 2001.
[17] G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. In Proc. of the Intl. Symp. on High Performance Computer Architecture, 2002.
[18] D. Thiebaut, H. Stone, and J. Wolf. Footprints in the Cache. ACM Trans. on Computer Systems, 5(4), Nov. 1987.
[19] X. Vera and J. Xue. Let's Study Whole-Program Cache Behaviour Analytically. In Proc. of the Intl. Symp. on High Performance Computer Architecture, 2002.
[20] H. J. Wassermann, O. M. Lubeck, Y. Luo, and F. Bassetti. Performance Evaluation of the SGI Origin2000: A Memory-Centric Characterization of LANL ASCI Applications. In Supercomputing, 1997.
