
Linear-time Modeling of Program Working Set in Shared Cache

Xiaoya Xiang†, Bin Bao†, Chen Ding†, Yaoqing Gao‡
† Computer Science Department, University of Rochester
‡ IBM Toronto Software Lab
{xiang,bao,cding}@cs.rochester.edu, [email protected]

Abstract—Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n^2) windows in an n-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC 2000 and 29 SPEC 2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.

Keywords: Footprint, Cache sharing

I. INTRODUCTION

During a program execution, its working set can be defined as the footprint, which is the volume of data accessed in an execution window. Since the footprint shows the active data usage, it has been used to model resource sharing among concurrent tasks and to improve throughput and enforce fairness, either in memory sharing among multiprogrammed workloads or more recently in cache sharing among multicore workloads.

A trace of n data accesses has (n choose 2) = n(n − 1)/2 distinct windows and therefore n(n − 1)/2 footprints. Early studies measured program footprints in the shared cache of time-sharing systems. Since applications interact between time quanta, it is sufficient to consider just the windows of a single length: the length of a scheduling quantum [20], [22]. On today's multicore systems, however, programs interact continuously. A number of techniques were developed to estimate the footprint in all-length windows, but they did not guarantee the precision of the estimation [2], [3], [6], [18], [19], [21].

A recently published technique called all-window footprint analysis can measure all footprints in O(CK log M) time, where CK is linear in the length of the trace and M is the volume of data accessed in the trace [23]. For each window length, the analysis shows the maximum size, the minimum size, and the size distribution of footprints in all windows of this length [23]. The analysis is not fully accurate but guarantees a relative precision, e.g. 99%. We call the analysis all-footprint analysis, because it measures the size of every footprint.

In this paper, we present average-footprint analysis. For each window length, the analysis shows the average size of the footprints in all windows of this length. While the analysis gives the accurate average, it does not measure the range or the distribution. However, a weaker analysis can often be done faster. Indeed, we show that the average footprint can be measured accurately in linear time O(n) for a trace of length n, regardless of the data size.

The average footprint is a function mapping the length of an execution window to the volume of its data access. Intuitively, the working set increases in larger execution windows. We prove that the average footprint is monotonically non-decreasing. The new analysis precisely quantifies the growth of the average footprint over time.

The previous, all-footprint analysis was the key metric used in the composable models of cache sharing [3], [6], [21], [23]. For P programs, there are 2^P co-run combinations. A composable model makes the 2^P predictions using P single-program runs rather than 2^P parallel runs. As an alternative to all-footprint analysis, the new average-footprint analysis can be used in the composable model to reduce the (footprint) measurement cost asymptotically.

To evaluate the speed and usefulness of average-footprint analysis, we test it on the complete suites of the SPEC 2000 and SPEC 2006 CPU benchmarks and compare the results with the fastest all-footprint analysis [23]. To measure the accuracy of cache sharing prediction, we rank the slowdowns in two- and three-program co-runs on a quad-core machine and compare the predicted ranking with exhaustive testing. Through experiments, we show that average-footprint analysis can predict the effect of cache interference as accurately as all-footprint analysis, yet at only a fraction of its cost. In fact, the cost of all-footprint analysis is too high for it to model the SPEC 2006 benchmarks, which have up to 1.9 trillion accesses to up to 1 GB of data. In comparison, average-footprint analysis can model all SPEC 2006 benchmarks, finishing most of the programs within a few hours.

This study has two limitations. First, we are concerned with parallel workloads consisting only of sequential programs that do not share data. We do not consider parallel programs, although similar footprint metrics have been studied to model multi-threaded workloads [6], [17]. Second, the footprint results are input specific, so they are useful mostly in workload characterization, for example, finding the most and the least interference among a set of benchmark programs.

II. BACKGROUND ON OFF-LINE CACHE MODELS

Off-line cache models do not improve performance directly but can be used to understand the causes of interference and to predict its effect before running the programs (so they may be grouped to reduce interference). Off-line analysis measures the effect of all data accesses, not just cache misses. It characterizes a single program unperturbed by other programs and the analysis itself. Such "clean-room" metrics avoid the chicken-and-egg problem when programs are analyzed together: the interference depends on the miss rate of co-running programs, but their miss rate in turn depends on the interference. Next we describe first the locality model of private cache and then the model for shared cache.

a) Reuse windows and the locality model: For each memory access, the temporal locality is determined by its reuse window, which includes all data accesses between this and the previous access to the same datum. Specifically, whether the access is a cache (capacity) miss depends on the reuse distance, the number of distinct data elements accessed in the reuse window. The relation between reuse distance and the miss rate has been well established [9], [16]. The capacity miss rate can be defined by a probability function involving the reuse distance and the cache size. Let the test program be A.

  P(capacity miss by A alone) = P(A's reuse distance ≥ cache size)

b) Footprint windows and the cache sharing model: Off-line cache sharing models were pioneered by Chandra et al. [3] and Suh et al. [21] for a group of independent programs and extended for multi-threaded code by Ding and Chilimbi [6], Schuff et al. [17], and Jiang et al. [12]. Let A, B be two programs that share the same cache but do not share data; the effect of B on the locality of A is

  P(capacity miss by A when co-running with B) = P((A's reuse distance + B's footprint) ≥ cache size)

Given an execution window in a sequential trace, the footprint is the number of distinct elements accessed in the window. The examples in Figure 1(a) illustrate the interaction between locality and footprint. In the first example, a reuse window in program A concurs with a time window in program B. The reuse distance of A is lengthened by the footprint of B's window. The second example uses two pairings of three traces to show that the shared cache miss rate depends also on the footprint, not just the miss rate of co-run threads.

An implication of the cache sharing model is that cache interference is asymmetric for programs with different locality and footprints. A program with large footprints and short reuse distances may disproportionally slow down other programs while experiencing little or no slowdown itself. This was observed in experiments [23], [24]. In one program pair, the first program shows a near 85% slowdown while the other program shows only a 15% slowdown. Programs B1 and B2 have the same miss rate. However, A-B1 incurs 50% more misses in shared cache than A-B2. The difference is caused not by data reuse but by footprint.

[Figure: (a) In shared cache, the reuse distance in thread A is lengthened by the footprint of thread B: for thread A's trace "abcdefa" (rd = 5) interleaved with thread B's trace "kmmmnon" (ft = 4), the combined trace "akbcmdmemfnona" has rd' = rd + ft = 9. (b) Programs B1 and B2 have the same miss rate, but the co-run of A and B1 incurs 4 misses on a 3-element fully associative LRU cache while the co-run of A and B2 incurs 2. The difference is caused not by data reuse but by data footprint. Fig. 1. Example illustrations of cache sharing.]

[Figure: four algorithms for measuring footprint, classified by the statistics they produce (single-window vs. all-window) and by their accuracy (accurate vs. approximate with constant or relative precision): the accurate O(TN) and O(CKN) algorithms and the approximate O(T log N) and O(CK log N) algorithms, where T is the length of the trace and N the largest footprint.]

III. THE MEASUREMENT OF AVERAGE FOOTPRINT

A. Definitions

Let W be the set of (n choose 2) windows of a length-n trace. Each window w = <t, v> has a length t and a footprint v. Let I(p) be a boolean function returning 1 when p is true and 0 otherwise. The footprint function fp(t) averages over all windows of length t:

  fp(t) = ( Σ_{w_i ∈ W} v_i I(t_i = t) ) / ( Σ_{w_i ∈ W} I(t_i = t) ) = ( Σ_{w_i ∈ W} v_i I(t_i = t) ) / (n − t + 1)

For example, the trace "abbb" has 3 windows of length 2: "ab", "bb", and "bb". The corresponding footprints are 2, 1, and 1, so fp(2) = (2 + 1 + 1)/3 = 4/3.

B. O(n) Algorithm

There is a linear-time algorithm that calculates the precise average footprint for all execution windows of a trace. Let n, m be the length of the trace and the number of distinct data used in the trace. The algorithm first measures the following three quantities:
• the distribution of the time distances of all data reuses (n − m distances)
• the first-access times of all distinct data (m access times)
• the last-access times of all distinct data (exact definition later, m access times)
The three quantities can be measured by a single pass over the trace using a hash table with one entry for each distinct datum. The cost is linear, O(n) in time and O(m) in space.
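As an illustrative sketch (not the authors' implementation; the function names are ours), the single-pass collection of the three inputs and the closed-form evaluation of the average footprint (Equation 7, derived in the rest of this section) can be written as follows, with a brute-force version of the definition as a cross-check:

```python
from collections import defaultdict

def collect_inputs(trace):
    """One pass over the trace: first-access times f_i, reverse last-access
    times l_i = n + 1 - (last position), and the histogram r_t of reuse
    time distances. O(n) time, O(m) space (one hash entry per datum)."""
    n = len(trace)
    first, last = {}, {}
    reuse = defaultdict(int)
    for i, d in enumerate(trace, start=1):      # 1-based positions
        if d in last:
            reuse[i - last[d]] += 1             # a reuse time distance
        else:
            first[d] = i
        last[d] = i
    return list(first.values()), [n + 1 - x for x in last.values()], reuse

def avg_footprint(trace, w):
    """Average footprint over all windows of length w (Equation 7):
    fp(w) = m - (sum (f_i - w) + sum (l_i - w) + sum (t - w) r_t) / (n - w + 1),
    where each sum keeps only terms with f_i, l_i, or t greater than w."""
    n = len(trace)
    f, l, reuse = collect_inputs(trace)
    m = len(f)                                  # number of distinct data
    corr = sum(fi - w for fi in f if fi > w)
    corr += sum(li - w for li in l if li > w)
    corr += sum((t - w) * c for t, c in reuse.items() if t > w)
    return m - corr / (n - w + 1)

def avg_footprint_bruteforce(trace, w):
    """Definition of fp(w): average number of distinct elements per window."""
    wins = [trace[i:i + w] for i in range(len(trace) - w + 1)]
    return sum(len(set(x)) for x in wins) / len(wins)

print(avg_footprint("abbb", 2))   # ≈ 4/3, matching the example in the text
```

The brute-force version is quadratic and serves only to validate the linear-time formula; both agree on every window length of the running example "abbb".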

The three measures are the inputs to a formula fp(w). For any window size w (0 < w ≤ n), fp(w) computes the average footprint for all windows of size w. In other words, the formula computes the average footprint for windows of all sizes without having to inspect the trace again. In the rest of the section, we derive the formula and discuss its complexity.

The main idea of the formula is differential counting, which counts the difference in the footprint between consecutive windows. For any window size w, we start with the footprint in the first window and then compute its increase or decrease as the window moves forward in the trace. The first-access times are sufficient to compute the footprint of the first window. The change in later windows depends on two metrics of each trace element di: the forward time distance fwd(di) and the backward time distance bwd(di). Let datum x be accessed at di. Let the closest accesses of x be dj before di and dk after di. Then fwd(di) = k − i and bwd(di) = i − j. The forward and backward time distances determine the change of footprint between consecutive windows. The relation is shown in Figure 2.

[Figure: Fig. 2. An illustration of how the forward and backward (reuse) time distances influence the change in footprint between consecutive windows. The element di enters between the windows fp(i − w) and fp(i − w + 1) and exits between fp(i) and fp(i + 1); bwd(di) reaches back to di's previous access and fwd(di) forward to dk, its next access.]

Let the footprint of a w-size window starting at i be fp(i). Each element di in the trace affects the footprint of w windows: fp(i − w + 1), fp(i − w + 2), ..., fp(i). In differential counting, we consider only the effect of di on two pairs of windows: the change from fp(i − w) to fp(i − w + 1) when di enters its first window and the change from fp(i) to fp(i + 1) when di exits from its last window of influence. Figure 2 shows di and the two pairs of windows where di enters between the first pair and exits between the second pair.

When di enters, it does not increase the footprint fp(i − w) if the same datum was previously accessed within fp(i − w + 1), which means that its backward time distance is no greater than w (bwd(di) ≤ w). This is the case illustrated in Figure 2. Otherwise, di adds 1 to the footprint fp(i − w). Similarly, when di exits from fp(i), the departure does not change fp(i + 1) if fwd(di) ≤ w; otherwise, it subtracts 1 from fp(i + 1), as in the case illustrated in Figure 2.

The footprint fp(i + 1) depends on three factors: the footprint fp(i), the contribution of the entering di+w, and the detraction of the exiting di. The footprint of all windows is then computed by adding these differences. Next we formulate this computation. We use the following notations.
• n, m, w: the length of the trace, the size of data, and the window size of interest
• di: the i-th trace access
• fp(i): the footprint of the window from di to di+w−1 (including di and di+w−1)
• bwd(di): the backward reuse time distance of di, ∞ if di is the first access
• fwd(di): the forward reuse time distance of di, ∞ if di is the last access
• I(p): a boolean function that returns 1 if p is true and 0 otherwise

For example, I(bwd(di) > w) gives the contribution of di, which is 1 if bwd(di) > w and 0 otherwise. Similarly, I(fwd(di) > w) gives the detraction of di, 1 if fwd(di) > w and 0 otherwise. The total size of the footprints in all windows of length w, when divided by the number of windows n − w + 1, is the average footprint, as shown next in Equation 1.

  fp(w) = ( 1 / (n − w + 1) ) Σ_{i=1}^{n−w+1} fp(i)        (1)

Since the window starting at i + 1 differs from the window starting at i only by the exiting di and the entering di+w,

  fp(i + 1) = fp(i) + I(bwd(di+w) > w) − I(fwd(di) > w)        (2)

Expanding Equation 1 using Equation 2, we have three components in the average footprint:

  fp(w) = fp(1) + ( 1 / (n − w + 1) ) ( Σ_{i=w+1}^{n} (n − i + 1) I(bwd(di) > w)
                                        − Σ_{i=1}^{n−w} (n − i + 1 − w) I(fwd(di) > w) )        (3)

Next we compute each component separately. The footprint of the first window of length w is

  fp(1) = Σ_{i=1}^{w} I(bwd(di) = ∞)        (4)

In the next component, we split the backward time distances into two groups: finite and infinite distances. The summation over the finite distances can run from 1 to n instead of from w + 1 to n, because a finite backward distance at position i ≤ w is at most i − 1 < w.

  Σ_{i=w+1}^{n} (n − i + 1) I(bwd(di) > w)
    = Σ_{i=w+1}^{n} (n − i + 1) I(w < bwd(di) < ∞) + Σ_{i=w+1}^{n} (n − i + 1) I(bwd(di) = ∞)
    = Σ_{i=1}^{n} (n − i + 1) I(w < bwd(di) < ∞) + Σ_{i=w+1}^{n} (n − i + 1) I(bwd(di) = ∞)        (5)

Similarly, we decompose and simplify the forward distances, extending the finite sum to n because a finite forward distance at position i > n − w is at most n − i < w:

  Σ_{i=1}^{n−w} (n − i + 1 − w) I(fwd(di) > w)
    = Σ_{i=1}^{n} (n − i + 1 − w) I(w < fwd(di) < ∞) + Σ_{i=1}^{n−w} (n − i + 1 − w) I(fwd(di) = ∞)        (6)

Combining Equations 4, 5, and 6, we can now expand Equation 3. Instead of using individual accesses, we now use the three inputs, defined as follows:
• fi: the first-access time of the i-th datum
• li: the reverse last-access time of the i-th datum. If the last access is at position x, li = n + 1 − x, that is, the first-access time in the reverse trace.
• rt: the number of accesses with a reuse time distance t

  fp(w) = Σ_{i=1}^{n} I(bwd(di) = ∞)
          + ( 1 / (n − w + 1) ) ( Σ_{i=w+1}^{n} (w − i) I(bwd(di) = ∞)
                                  − Σ_{i=1}^{n−w} (n − i + 1 − w) I(fwd(di) = ∞)
                                  + Σ_{i=1}^{n} (n − i + 1) I(w < bwd(di) < ∞)
                                  − Σ_{i=1}^{n} (n − i + 1 − w) I(w < fwd(di) < ∞) )
        = m + ( 1 / (n − w + 1) ) ( Σ_{i=1}^{m} (w − fi) I(fi > w)
                                    + Σ_{i=1}^{m} (w − li) I(li > w)
                                    + Σ_{t=1}^{n−1} (w − t) I(t > w) rt )
        = m − ( 1 / (n − w + 1) ) ( Σ_{i=1}^{m} (fi − w) I(fi > w)
                                    + Σ_{i=1}^{m} (li − w) I(li > w)
                                    + Σ_{t=w+1}^{n−1} (t − w) rt )        (7)

The formula of Equation 7 passes the sanity check that the average footprint fp(w) is at most the data size m, and the footprint of the whole trace (w = n) is m. Fixing the window length w and ignoring the effect of first and last accesses, we see that the footprint decreases if more reuse time distances (rt) have larger values (t). This suggests that improving locality reduces the average footprint. For example, if we double the length of a trace by repeating each element twice, the length of the long time distances would double, and the average footprint would drop.

For each window length w, Equation 7 can be computed in time O(w). If we consider only window sizes on a logarithmic scale, the formula can be represented and evaluated in O(log w) time.

C. Monotonicity

Theorem 3.1: The average footprint fp(w) is non-decreasing.

Proof: Let w_i^k denote the i-th window whose size is k and f(w_i^k) denote its footprint. We prove that, ∀k, 0 < k ≤ n, fp(k + 1) ≥ fp(k). First, ∀i, 0 < i ≤ n − k, the following holds because w_i^k and w_{i+1}^k are both contained in w_i^{k+1}:
• f(w_i^{k+1}) ≥ f(w_i^k)
• f(w_i^{k+1}) ≥ f(w_{i+1}^k)
In addition, ∀k, 0 < k ≤ n, ∃j, 0 < j ≤ n − k + 1, such that f(w_j^k) ≤ fp(k). Then

  fp(k + 1) = ( 1 / (n − k) ) Σ_{i=1}^{n−k} f(w_i^{k+1})
            = ( 1 / (n − k) ) [ Σ_{i=1}^{j−1} f(w_i^{k+1}) + Σ_{i=j}^{n−k} f(w_i^{k+1}) ]
            ≥ ( 1 / (n − k) ) [ Σ_{i=1}^{j−1} f(w_i^k) + Σ_{i=j}^{n−k} f(w_{i+1}^k) ]
            = ( 1 / (n − k) ) [ Σ_{i=1}^{j−1} f(w_i^k) + Σ_{i=j+1}^{n−k+1} f(w_i^k) ]
            = ( 1 / (n − k) ) [ Σ_{i=1}^{n−k+1} f(w_i^k) − f(w_j^k) ]
            = ( 1 / (n − k) ) [ (n − k + 1) fp(k) − f(w_j^k) ]
            = fp(k) + ( 1 / (n − k) ) [ fp(k) − f(w_j^k) ]
            ≥ fp(k)

IV. AVERAGE FOOTPRINT IN THE COMPOSABLE MODEL

Our previous work used all-footprint analysis in the composable model to predict cache interference [23]. In the composable model, when multiple programs are run together, each reuse distance in a program is lengthened by the aggregate footprint of all peer programs over the same time window. Suppose there are n programs t1, t2, ..., tn running on a shared cache; the miss rate is computed by

  P(capacity miss by ti running with tj, j = 1, ..., n, j ≠ i)
    = P((ti's reuse distance + Σ_{j≠i} tj's footprint) ≥ cache size)

Suppose the distribution of program ti's reuse distance is Drd(ti), and the distribution of program ti's footprint of window size w is Dfp(ti, w). The first distribution is defined as

  Drd(ti) = { <x_k, p_k> | Σ_k p_k = 1 }

where <x_k, p_k> means that the probability that the reuse distance equals x_k is p_k. Similarly, we define

  Dfp(ti, w) = { <y_k^w, q_k^w> | Σ_k q_k^w = 1 }

Given a window size w, we use <y_k^w, q_k^w> to mean that the probability that the footprint equals y_k^w is q_k^w. Consider a 2-program co-run involving t1 and t2. The capacity miss rate of t1 is calculated by Equation 8:

  mr(t1) = Σ_{k1} Σ_{k2} p_{k1} · q_{k2}^{w(x_{k1})} · I( x_{k1} + y_{k2}^{w(x_{k1})} ≥ C )        (8)

where I is the boolean indicator function defined in Section III, and w(x_{k1}) is the size of the reuse window that contains the reuse distance x_{k1}. This is the equation employed by all-footprint based modeling [23]. To use average-footprint analysis instead, we define the average footprint of a window size w for program ti as F(ti, w) = f_i^w. Equation 8 can then be simplified to Equation 9:

  mr(t1) = Σ_{k1} p_{k1} · I( x_{k1} + f_2^{w(x_{k1})} ≥ C )        (9)

To evaluate cache-sharing predictions, we run two experiments:
• 2-program co-runs. We predict all 2-program co-runs and compare the predicted ranking with that of the previous work using the 15 SPEC 2000 benchmarks used in the previous work [23].
• 3-program co-runs. We started with the 10 representative benchmarks in SPEC 2006 as selected by Zhuravlev et al. [29]. Reuse-distance analysis was too slow to measure 2 of the programs. We evaluate the prediction for all program triples of the remaining 8 programs.
In both tests, we also compare with a simple prediction method based on miss rates (by ranking the total miss rate of the programs in the co-run group) [23].

The estimation of the execution time from the miss rate is the same as in [23]. The only difference is that the previous model uses all-footprint analysis and Equation 8, and the new model uses average-footprint analysis and the simpler Equation 9.

V. EVALUATION

A. Experimental Setup

We have implemented the average-footprint analysis algorithm in a profiling tool and tested 26 SPEC 2000 benchmarks, 12 integer and 14 floating-point, and 29 SPEC 2006 benchmarks, 12 integer and 17 floating-point. All benchmarks are instrumented by Pin [15] and profiled on a machine with an Intel Core i5-660 processor and 4GB physical memory. The machine runs Fedora 13 and GCC 4.4.5. The two-program co-run results for SPEC 2000 are collected on an Intel Core 2 Duo machine with two 2.0GHz cores sharing 2MB L2 cache and 2GB memory. To measure 3-program co-runs, we use an Intel quad-core machine with four 2.27GHz cores sharing 8MB L3 cache and 8GB memory.

Except in Section V-G, where we examine the effect of input, we use the reference input in the test. Some programs, especially in SPEC 2006, have multiple reference inputs. We use the first one tested by the auto-runner. In performance comparisons, the base program run time is one without Pin instrumentation or any other analysis.

The length of SPEC 2000 traces ranges from 14 billion in gcc to 425 billion in mgrid. The amount of data ranges from 13 thousand 64-byte cache blocks (1MB) in eon to 3.2 million cache blocks (256MB) in gcc. The SPEC 2006 traces on average are 10 times as long as SPEC 2000 traces and have 5 times as many cache blocks. The trace of bwaves is the longest with 1.9 trillion data accesses and has the most data, 928MB. The individual statistics of the 55 programs are listed in Table II.

B. Efficiency of Average-footprint Analysis

Table I summarizes the analysis cost for the two benchmark suites, and for each suite, the average for integer and for floating-point programs. It divides the 55 tests into four groups: 12 SPEC 2000 integer programs, 14 SPEC 2000 floating-point programs, 12 SPEC 2006 integer programs, and 17 SPEC 2006 floating-point programs. The result of each group is summarized in three rows and three columns. The columns show the trace length, the data size, and the slowdown, i.e., the ratio of the profiling time to the unmodified run time. The rows show the minimum, maximum, and average for all benchmarks of the group.

The minimum slowdowns in the four benchmark groups are all below 10. The maximum slowdowns are 40, 32, 41, and 74. The average slowdowns are between 21 in the SPEC 2006 integer tests and 29 in the SPEC 2000 integer tests. On average across all four groups, average-footprint analysis takes no more than 30 times the original execution time.

The individual results of the 55 programs are shown in Table II. Compared to the summary table, the individual-result table has two additional columns, which show the unmodified execution time and the time of average-footprint analysis. The unmodified time measures the execution of the original program without any instrumentation or analysis. On average, an unmodified SPEC 2000 program takes less than 3 minutes, and an unmodified SPEC 2006 program takes close to 10 minutes. Average-footprint analysis takes 3 to 73 minutes for SPEC 2000 programs and 10 minutes (gcc) to 10 hours (calculix) for SPEC 2006 programs.

C. Comparison with All-footprint Analysis

All-footprint analysis can analyze SPEC 2000 programs but not SPEC 2006 programs. We compare average- and all-footprint analysis on SPEC 2000 programs in Table III. SPEC 2000 has 26 programs in total. The paper on all-footprint analysis reported results for 15 of the programs [23]. The table summarizes the cost of the two analyses in these 15 tests in the last two columns. The slowdowns by average-footprint analysis are between 8.8 and 40. The slowdowns by all-footprint analysis are between 248 and 3160. The average slowdown is 40 for average-footprint analysis and about 1500

benchmarks      stats   trace length    data size (64B lines)   avg-FP slowdown (x)
SPEC2000 INT    min       1.41 E+10       0.13 E+5                8.8
(12 programs)   max      16.05 E+10      32.55 E+5               39.7
                mean      7.52 E+10      11.67 E+5               28.8
SPEC2000 FP     min       3.03 E+10       0.56 E+5                9.7
(14 programs)   max      42.55 E+10      31.28 E+5               32.4
                mean     17.44 E+10      14.16 E+5               21.9
SPEC2006 INT    min       4.88 E+10       3.01 E+5                8.1
(12 programs)   max     151.47 E+10     137.36 E+5               40.9
                mean     47.20 E+10      34.05 E+5               20.7
SPEC2006 FP     min      20.73 E+10       0.40 E+5                9.4
(17 programs)   max     385.15 E+10     145.03 E+5               73.9
                mean    138.85 E+10      58.69 E+5               26.1

TABLE I
THE MIN, MAX, AND AVERAGE COSTS (SLOWDOWNS) OF AVERAGE-FOOTPRINT ANALYSIS FOR 55 SPEC 2000 AND SPEC 2006 BENCHMARK PROGRAMS

for all-footprint analysis. In other words, on average for these 15 programs, average-footprint analysis is 38 times faster than all-footprint analysis. All-footprint analysis takes too long for SPEC 2006 programs. For example, average-footprint analysis takes 10 hours to profile calculix; being 38 times slower, all-footprint analysis would take more than two weeks to measure the all-footprint distribution.

D. Two-Program Co-run Ranking

The prior work showed co-run ranking results for 15 SPEC 2000 programs based on all-footprint analysis and compared them with miss-rate based ranking and exhaustive testing [23]. We now show the ranking results using average-footprint analysis and compare them with the three previous ranking methods.

We show the prediction results in a 2-D plot. The x-axis is the rank of the program co-run groups. In this test, the rank ranges from 1 (the least interfering pair) to 105 (the most interfering pair). The y-axis shows the interference, measured by the quadratic mean of the slowdowns of the programs in the co-run group. The slowdown of a co-run program is the ratio of its co-run time to its time running alone on the same machine (cache). For two programs with slowdowns s1, s2, we have y = sqrt((s1^2 + s2^2)/2).

The three graphs in Figure 3 show the plots for the predictions based on miss rate, all-footprint analysis, and average-footprint analysis. In each plot, the accurate result from exhaustive testing is shown by a monotonically increasing red line as a reference. The simple miss-rate based prediction does not show an increasing trend, suggesting no correlation between the prediction and the actual interference. The two footprint-based predictions show significant correlation: programs predicted to have a high interference tend to actually have a high interference.

Average-footprint analysis ranks several program pairs better than all-footprint analysis. Consider the pair with the highest interference, art,mcf, with a slowdown of 2. The pair is ranked 23 by miss rate, 99 by all-footprint analysis, and 105 by average-footprint analysis. The average-footprint rank is precisely correct. All-footprint ranking has a significant misprediction for the program pair gcc,art. The programs slow down each other by 1.6 times. The pair should be ranked 97 but is ranked 44 by all-footprint analysis, which is worse than the miss-rate rank of 70. The rank by average-footprint analysis is relatively the best at 86.

E. Three-Program Co-run Ranking

Evaluating larger co-run groups is difficult because the number of tests increases exponentially with the size of the co-run group. To test all 3-program co-runs of the SPEC 2006 benchmarks, we would have to run (27 choose 3) = 2925 tests. Even if we ran all the tests, it would be impossible to show the results clearly. Fortunately, Zhuravlev et al. have analyzed the benchmark set based on the cache miss rates and access rates and identified 10 representatives [29]. We had to narrow down further because reuse-distance analysis could finish only for 8 out of the 10 representatives: 403.gcc, 416.gamess, 429.mcf, 444.namd, 445.gobmk, 450.soplex, 453.povray, and 470.lbm. There are 56 different 3-program groups from these 8 benchmarks. We show the prediction results in Figure 4.

The results for 3-program co-runs of SPEC 2006 programs are similar to those of 2-program co-runs of SPEC 2000 programs. As before, the miss-rate based prediction does not show a detectable correlation, while the average-footprint analysis shows a clear correlation with the actual interference. The maximal slowdown increases from 2.0 in 2-program co-runs to 3.3 in 3-program co-runs, confirming the expectation that the interference becomes worse as the cache is shared by more programs. Exhaustive testing is also increasingly infeasible. For both reasons, the composable model is more valuable, and so is the higher efficiency of average-footprint analysis.
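The interference score used on the y-axis of the plots, the quadratic mean of the co-run slowdowns, can be computed as below; the sample slowdowns are illustrative values echoing the asymmetric pair discussed in Section II:

```python
def interference(slowdowns):
    """Quadratic mean of the slowdowns of the programs in a co-run group."""
    return (sum(s * s for s in slowdowns) / len(slowdowns)) ** 0.5

# A symmetric pair slowed down 2x each scores 2.0; an asymmetric pair
# (1.85x and 1.15x, as in the 85%/15% example) scores about 1.54.
print(round(interference([1.85, 1.15]), 2))  # 1.54
```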

F. Rank and Performance Closeness

To quantify the difference between the predicted ranking and the accurate ranking, we define two metrics: the rank closeness and the performance closeness. The rank closeness shows on average how much the predicted rank of a co-run group differs from the actual rank. We number the n co-run groups by their accurate rank i. Let pred(i) be the predicted rank for group i. The rank closeness is defined as

  rank closeness = ( Σ_{i=1}^{n} |pred(i) − i| ) / n

2.5

3.0

3.5

miss rate exhaustive testing

1.5

2.0

2.5

slowdown

2.0 1.5 1.0

slowdown

miss rate exhaustive testing

0

20

40

60

80

100

0

20

30

40

50

ranked program triples (from least interference to most interference)

2.5

ranked program pairs (from least interference to most interference)

10

[Figure 3 here: slowdown vs. ranked program pairs (from least interference to most interference); curves for all-footprint, average footprint, and exhaustive testing.]
Fig. 3. Evaluation of 2-program co-run predictions for 15 SPEC2000 benchmark programs. The prediction quality of average-footprint analysis is similar to that of all-footprint analysis.

[Figure 4 here: slowdown vs. ranked program triples (from least interference to most interference); curves for average footprint and exhaustive testing.]
Fig. 4. Comparison of 3-program co-run predictions for 8 SPEC2006 benchmarks. All-footprint analysis cannot model these programs because of its high cost.

The formula is the Manhattan distance between the two vectors <p(1), p(2), ..., p(n)> and <1, 2, ..., n>, divided by n. The worst possible ranking has a rank-closeness score of n/2 if n is even and (n-1)/2 otherwise.

Next we quantify the error in terms of the mis-predicted slowdown. Let f(i) be the slowdown of the co-run group with the accurate rank i, and f(pred(i)) be the slowdown of the co-run group with the predicted rank i. The difference |f(pred(i)) - f(i)| gives the mis-prediction. The performance closeness is the average mis-prediction over all groups:

    performance closeness = (1/n) Σ_{i=1}^{n} |f(pred(i)) - f(i)|

The two metrics are shown in Table IV. On average for 2-program co-runs, the miss-rate rank errs by 35 positions, while the footprint-based ranks err by 14 and 15 positions. For 3-program co-runs, the miss-rate rank errs by 19 positions, while the average-footprint rank errs by 6. In terms of performance, the miss-rate based ranking mis-predicts twice as badly as the footprint-based ranking.

In search of a closeness metric, we also measured the Levenshtein distance. For two permutations of a set of numbers, the Levenshtein distance is the number of "edits" needed to convert one into the other. For the 2-program co-run test, the distance is 103 for miss rate, 97 for all-footprint, and 96 for average-footprint. For the 3-program co-run test, the distance is 54 for miss rate and 48 for average-footprint. Levenshtein is not a good metric, since it does not distinguish a ranking that shows no correlation from rankings that do.
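The two closeness metrics can be sketched in code. The following is our own minimal illustration, not the paper's implementation; the function names and the 4-group example data are hypothetical:

```python
def rank_closeness(pred):
    # pred[i] is the predicted rank (1-based) of the co-run group
    # whose accurate rank is i+1; the metric is the Manhattan
    # distance between <pred(1),...,pred(n)> and <1,...,n>, over n
    n = len(pred)
    return sum(abs(p - (i + 1)) for i, p in enumerate(pred)) / n

def performance_closeness(slowdown, pred):
    # slowdown[i] is f(i+1), the slowdown of the group with accurate
    # rank i+1; |f(pred(i)) - f(i)| is the mis-prediction of group i
    n = len(slowdown)
    return sum(abs(slowdown[pred[i] - 1] - slowdown[i])
               for i in range(n)) / n

# hypothetical 4-group co-run: the prediction swaps the middle two
pred = [1, 3, 2, 4]
slowdowns = [1.1, 1.4, 1.5, 2.0]
print(rank_closeness(pred))                    # 0.5
print(performance_closeness(slowdowns, pred))  # ~0.05
```

A perfect prediction scores 0 on both metrics; the worst ranking yields a rank closeness of n/2 for even n, matching the bound stated above.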

2-program co-run over 15 SPEC2000 benchmarks
ranking strategy | perf closeness | rank closeness
miss rate        | 0.385          | 35
all-footprint    | 0.194          | 15
avg-footprint    | 0.170          | 14

3-program co-run over 10 SPEC2006 benchmarks
ranking strategy | perf closeness | rank closeness
miss rate        | 0.632          | 19
avg-footprint    | 0.225          | 6

TABLE IV
Compare different ranking strategies

G. The Effect of Input on Average Footprint

The footprint of a program execution is affected by the program input, just as the length of the execution is. An important question for profiling-based techniques is how much the footprints in training runs may differ from those in test runs. In this section, we give a preliminary measure of this difference.

Given a set of k executions of the same program, we quantify the variation between the k footprint functions f_i(w) as follows. First, we compute the average of the average footprints: f̄(w) = (1/k) Σ_{i=1}^{k} f_i(w). Then we compute the Manhattan distance between the i-th execution and the average:

    d_i = (1/|W|) Σ_{j=1}^{|W|} | f_i(w_j)/f̄(w_j) - 1 |

where |W| is the number of different window lengths. A Manhattan distance of x% means that, on average, input i's footprint function differs from the average by x% at each window size.

Table V shows the SPEC2000 and SPEC2006 programs, the number of inputs (provided by the benchmark suite and tested in our experiments), the range of trace lengths and data sizes across these inputs, and the smallest and largest Manhattan distances as just defined. The majority of programs, 20 out of 37, see no more than 30% difference between the footprints of different inputs. The minimal difference is less than 20% in all but 5 programs. Note that the effect of the input may be predicted using model fitting based on input characteristics [26]; this is outside the scope of this paper.

VI. RELATED WORK

Locality models: Locality in private cache can be modeled by reuse distance, which can be measured with a guaranteed precision in time O(n log2 m), where n is the length of the trace and m is the size of data [26]. Reuse distance has found many uses in workload characterization and program optimization [26]. There are a number of recent developments. Chauhan and Shei gave a method for static analysis of locality in MATLAB code [4]. Unlike profiling, whose results are usually input specific, static analysis can identify and model the effect of program parameters. Most previous models targeted program analysis; Ibrahim and Strohmaier instead used synthetic probes to emulate the locality of an application for efficient machine characterization [10]. Zhou studied the random cache replacement policy and gave a one-pass deterministic trace-analysis algorithm to compute the average miss rate, instead of simulating many times and taking the average [28]. Finally, Schuff et al. defined multicore reuse distance analysis and improved its efficiency through sampling and parallelization [17]; the sampling was based on a method developed earlier by Zhong and Chang [25]. These techniques are concerned with only reuse windows and cannot measure the footprint in all execution windows, which is the problem addressed in this paper.

Off-line cache sharing models: The average working-set size in single-length execution windows, such as a scheduling quantum, can be computed in linear time; it has been used in studying multi-programmed systems [7], [22]. In a parallel environment such as today's multicore processors, programs interact constantly. The interference in all-length windows has been considered for memory [21] and for cache [3]. Both used a recursive relation between the working set and the miss rate: as a window of size w is extended to w+1, the growth of the working set depends on whether the new access is a miss. Suh et al. assumed linear growth when window sizes are close [21]; Chandra et al. computed the recursive relation bottom up [3]. The same problem has been solved using statistical inference. Two techniques, by Berg and Hagersten (known as StatCache) [2] and by Shen et al. [19], infer the cache miss rate from the distribution of reuse times. Berg and Hagersten assumed a constant miss rate over time and random cache replacement [2]; Shen et al. assumed a Bernoulli process and LRU cache replacement [18], [19]. The latter method was adapted to predict cache interference [12], and a precise prediction was shown useful in approximately solving the optimal co-scheduling problem [11]. However, none of these methods can bound the approximation error. Our earlier work gave the first precise methods for measuring the footprint [5], [23] and an iterative model for the circular effect of cache interference [23]. The linear-time algorithm in this paper computes the average rather than the full distribution, improving the measurement speed by nearly 40 times while maintaining similar accuracy in shared-cache locality prediction.

On-line models: The miss rate curve has been used for memory partitioning to ensure fairness or maximize throughput in a parallel workload [27]. Similarly, reuse distance has been used for cache partitioning among data objects [14]. Recently, Zhuravlev et al. reviewed four models based on the miss rate and the reuse distance [29]. As on-line models, these techniques did not consider working-set metrics because of the cost; for example, Zhuravlev et al. adopted a less accurate model from Chandra et al. [3] because, for efficiency, it did not require all-window footprints. Zhuravlev et al. showed that cache sharing is one of the factors in interference but not necessarily the major one [29]. Still, an accurate and fast solution may help to quantify the contribution of cache sharing to the overall interference.

Analytical models and streaming analysis: Counting the number of distinct data items has been considered as a streaming analysis problem. Space-efficient (less than O(m)) solutions exist to measure the frequency moments F0 (footprint), F1 (total frequency), F2, F∞ (the most frequent item), and entropy [1], [8], [13]. Instead of counting the F0 moment over the whole trace, we solve the problem of collecting the average F0 over all execution windows and focus on reducing the time complexity from O(n²) to linear. Streaming solutions may be combined with our algorithms to further reduce their space requirement.
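To make the quantity concrete, here is a brute-force sliding-window sketch of the average footprint, i.e. the average F0 moment over all windows of one length. It is our illustration only: the paper's algorithm obtains the averages for all window lengths at once, in linear time, via differential counting of reuse times, whereas this version must be repeated per window length.

```python
from collections import Counter

def avg_footprint_naive(trace, w):
    """Average footprint (number of distinct data items) over all
    length-w windows of the trace, by sliding a window of size w."""
    n = len(trace)
    assert 1 <= w <= n
    counts = Counter(trace[:w])   # multiplicities in current window
    total = len(counts)           # footprint of the first window
    for i in range(w, n):
        counts[trace[i]] += 1     # datum entering the window
        counts[trace[i - w]] -= 1 # datum leaving the window
        if counts[trace[i - w]] == 0:
            del counts[trace[i - w]]
        total += len(counts)
    return total / (n - w + 1)

trace = ['a', 'b', 'a', 'c', 'b', 'a']
# windows of length 3: aba, bac, acb, cba -> footprints 2, 3, 3, 3
print(avg_footprint_naive(trace, 3))  # 2.75
```

Running this for every w costs O(n) per length and O(n²) overall, which is exactly the cost the linear-time algorithm avoids.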

VII. SUMMARY

Complete characterization of the footprint requires measuring data access in all execution windows. In this paper, we have presented the average footprint as a summary metric of the all-window footprint. The footprint function maps from window length to average footprint. We have shown that the average footprint function is monotone and can be used in the composable model to rank cache interference in shared cache without having to test any parallel executions.

We have presented a linear-time algorithm for accurately measuring the average footprint. The algorithm uses differential counting based on the forward and backward reuse times. When tested on the SPEC CPU2000 benchmarks, average-footprint analysis is on average 38 times faster than the previous all-footprint analysis, yet shows comparable accuracy in shared-cache locality prediction. Average-footprint analysis was also efficient enough to measure the newer SPEC CPU2006 benchmarks, which all-footprint analysis could not.

ACKNOWLEDGMENT

We would like to thank Tongxin Bai for providing the histogram mapping libraries. The presentation has been improved by suggestions from Xipeng Shen and the systems group at the University of Rochester. Xiaoya Xiang and Bin Bao are supported by two IBM Center for Advanced Studies Fellowships. The research is also supported by the National Science Foundation (Contract No. CCF-1116104, CCF-0963759, CNS-0834566).

REFERENCES

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the ACM Symposium on Theory of Computing, pages 20-29, 1996.
[2] E. Berg and E. Hagersten. Fast data-locality profiling of native execution. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 169-180, 2005.
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages 340-351, 2005.
[4] A. Chauhan and C.-Y. Shei. Static reuse distances for locality-based optimizations in MATLAB. In International Conference on Supercomputing, pages 295-304, 2010.
[5] C. Ding and T. Chilimbi. All-window profiling of concurrent executions. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. Poster paper.
[6] C. Ding and T. Chilimbi. A composable model for analyzing locality of multi-threaded programs. Technical Report MSR-TR-2009-107, Microsoft Research, August 2009.
[7] B. Falsafi and D. A. Wood. Modeling cost/performance of a parallel computer simulator. ACM Transactions on Modeling and Computer Simulation, 7(1):104-130, 1997.
[8] P. Flajolet and G. Martin. Probabilistic counting. In Proceedings of the Symposium on Foundations of Computer Science, 1983.
[9] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Transactions on Computers, 38(12):1612-1630, 1989.
[10] K. Z. Ibrahim and E. Strohmaier. Characterizing the relation between Apex-Map synthetic probes and reuse distance distributions. In Proceedings of the International Conference on Parallel Processing, pages 353-362, 2010.
[11] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 220-229, 2008.
[12] Y. Jiang, E. Z. Zhang, K. Tian, and X. Shen. Is reuse distance applicable to data locality analysis on chip multiprocessors? In Proceedings of the International Conference on Compiler Construction, pages 264-282, 2010.
[13] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pages 145-156, 2006.
[14] Q. Lu, J. Lin, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Soft-OLP: Improving hardware cache performance through software-controlled object-level partitioning. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 246-257, 2009.
[15] C.-K. Luk et al. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, Illinois, June 2005.
[16] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78-117, 1970.
[17] D. L. Schuff, M. Kulkarni, and V. S. Pai. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, pages 53-64, 2010.
[18] X. Shen and J. Shaw. Scalable implementation of efficient locality approximation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 202-216, 2008.
[19] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 55-61, 2007.
[20] H. S. Stone, J. Turek, and J. L. Wolf. Optimal partitioning of cache memory. IEEE Transactions on Computers, 41(9):1054-1068, 1992.
[21] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In International Conference on Supercomputing, pages 1-12, 2001.
[22] D. Thiébaut and H. S. Stone. Footprints in the cache. ACM Transactions on Computer Systems, 5(4):305-329, 1987.
[23] X. Xiang, B. Bao, T. Bai, C. Ding, and T. M. Chilimbi. All-window profiling and composable models of cache sharing. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 91-102, 2011.
[24] X. Zhang, S. Dwarkadas, and K. Shen. Towards practical page coloring-based multi-core cache management. In Proceedings of the EuroSys Conference, 2009.
[25] Y. Zhong and W. Chang. Sampling-based program locality approximation. In Proceedings of the International Symposium on Memory Management, pages 91-100, 2008.
[26] Y. Zhong, X. Shen, and C. Ding. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems, 31(6):1-39, August 2009.
[27] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 177-188, 2004.
[28] S. Zhou. An efficient simulation algorithm for cache of random replacement policy. In Proceedings of the IFIP International Conference on Network and Parallel Computing, pages 144-154, 2010. Springer Lecture Notes in Computer Science No. 6289.
[29] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 129-142, 2010.


benchmark       | trace length | data size (64B lines) | unmodified time (sec) | avg-FP analysis time (sec) | FP alg cost (×)
164.gzip        | 3.93 E+10    | 14.07 E+5  | 22.7   | 460.0   | 20.3
175.vpr         | 6.26 E+10    | 0.38 E+5   | 33.7   | 809.0   | 24.0
176.gcc         | 1.41 E+10    | 32.55 E+5  | 4.7    | 187.0   | 39.7
181.mcf         | 2.29 E+10    | 12.62 E+5  | 46.6   | 408.0   | 8.8
186.crafty      | 11.36 E+10   | 0.32 E+5   | 37.2   | 1427.0  | 38.4
197.parser      | 16.05 E+10   | 4.11 E+5   | 80.2   | 1926.0  | 24.0
252.eon         | 7.18 E+10    | 0.13 E+5   | 20.1   | 786.0   | 39.2
253.perlbmk     | 4.69 E+10    | 18.14 E+5  | 14.9   | 542.0   | 36.3
254.gap         | 11.16 E+10   | 31.50 E+5  | 39.4   | 1261.0  | 32.0
255.vortex      | 6.45 E+10    | 10.92 E+5  | 19.7   | 770.0   | 39.0
256.bzip2       | 4.72 E+10    | 14.71 E+5  | 23.4   | 548.0   | 23.4
300.twolf       | 14.74 E+10   | 0.60 E+5   | 93.8   | 1917.0  | 20.4
168.wupwise     | 15.68 E+10   | 28.82 E+5  | 112.8  | 1695.0  | 15.0
171.swim        | 9.02 E+10    | 31.21 E+5  | 99.7   | 1141.0  | 11.4
172.mgrid       | 42.55 E+10   | 9.10 E+5   | 175.2  | 4395.0  | 25.1
173.applu       | 18.47 E+10   | 28.57 E+5  | 68.0   | 1999.0  | 29.4
177.mesa        | 15.85 E+10   | 1.37 E+5   | 56.1   | 1799.0  | 32.1
178.galgel      | 39.75 E+10   | 8.47 E+5   | 135.8  | 4397.0  | 32.4
179.art         | 3.03 E+10    | 0.56 E+5   | 48.2   | 470.0   | 9.7
183.equake      | 6.87 E+10    | 6.80 E+5   | 33.5   | 774.0   | 23.1
187.facerec     | 14.62 E+10   | 2.98 E+5   | 63.7   | 1623.0  | 25.5
188.ammp        | 14.26 E+10   | 2.14 E+5   | 84.4   | 1738.0  | 20.6
189.lucas       | 15.07 E+10   | 25.95 E+5  | 63.9   | 1685.0  | 26.4
191.fma3d       | 16.04 E+10   | 17.06 E+5  | 82.6   | 1854.0  | 22.4
200.sixtrack    | 14.20 E+10   | 3.94 E+5   | 133.0  | 1608.0  | 12.1
301.apsi        | 18.69 E+10   | 31.28 E+5  | 102.2  | 2218.0  | 21.7
400.perlbench   | 21.99 E+10   | 52.31 E+5  | 222.3  | 2478.0  | 11.1
401.bzip2       | 9.73 E+10    | 11.98 E+5  | 129.4  | 1051.0  | 8.1
403.gcc         | 4.88 E+10    | 41.74 E+5  | 31.5   | 594.0   | 18.8
429.mcf         | 21.61 E+10   | 137.36 E+5 | 338.3  | 3627.0  | 10.7
445.gobmk       | 12.48 E+10   | 3.06 E+5   | 67.0   | 1458.0  | 21.8
456.hmmer       | 68.40 E+10   | 6.61 E+5   | 178.8  | 6441.0  | 36.0
458.sjeng       | 110.99 E+10  | 28.56 E+5  | 518.8  | 12906.0 | 24.9
462.libquantum  | 151.47 E+10  | 20.99 E+5  | 633.3  | 14930.0 | 23.6
464.h264ref     | 41.10 E+10   | 3.01 E+5   | 99.5   | 4075.0  | 40.9
471.omnetpp     | 40.77 E+10   | 16.44 E+5  | 422.2  | 5797.0  | 13.7
473.astar       | 29.57 E+10   | 35.99 E+5  | 229.1  | 3396.0  | 14.8
483.xalancbmk   | 53.38 E+10   | 50.50 E+5  | 297.9  | 6943.0  | 23.3
410.bwaves      | 190.51 E+10  | 145.03 E+5 | 555.4  | 18551.0 | 33.4
416.gamess      | 144.61 E+10  | 0.52 E+5   | 212.2  | 15682.0 | 73.9
433.milc        | 51.48 E+10   | 113.33 E+5 | 565.8  | 5852.0  | 10.3
434.zeusmp      | 85.62 E+10   | 81.21 E+5  | 593.1  | 9584.0  | 16.2
435.gromacs     | 130.60 E+10  | 2.22 E+5   | 853.8  | 12883.0 | 15.1
436.cactusADM   | 230.01 E+10  | 102.33 E+5 | 1297.5 | 24751.0 | 19.1
437.leslie3d    | 112.12 E+10  | 20.18 E+5  | 578.2  | 11546.0 | 20.0
444.namd        | 117.12 E+10  | 7.36 E+5   | 501.7  | 11799.0 | 23.5
447.dealII      | 109.73 E+10  | 88.48 E+5  | 417.3  | 12542.0 | 30.1
450.soplex      | 20.73 E+10   | 76.33 E+5  | 236.3  | 2219.0  | 9.4
453.povray      | 67.91 E+10   | 0.40 E+5   | 227.9  | 7134.0  | 31.3
454.calculix    | 385.15 E+10  | 27.92 E+5  | 755.2  | 36728.0 | 48.6
459.GemsFDTD    | 155.60 E+10  | 136.10 E+5 | 714.4  | 15719.0 | 22.0
465.tonto       | 146.23 E+10  | 6.45 E+5   | 640.1  | 15280.0 | 23.9
470.lbm         | 79.73 E+10   | 67.03 E+5  | 430.8  | 8645.0  | 20.1
481.wrf         | 200.09 E+10  | 116.40 E+5 | 866.4  | 21419.0 | 24.7
482.sphinx3     | 133.24 E+10  | 6.43 E+5   | 679.6  | 14750.0 | 21.7

TABLE II
Individual statistics of the 55 SPEC2000 and SPEC2006 test programs

15 SPEC2000 benchmarks | trace length | data size (64B lines) | avg-FP slowdown (×) | all-FP slowdown (×)
min                    | 1.41 E+10    | 0.31 E+5              | 8.8                 | 248.2
max                    | 16.05 E+10   | 32.55 E+5             | 39.7                | 3160.5
mean                   | 8.20 E+10    | 10.05 E+5             | 26.1                | 1495.2

TABLE III
Comparison of the min, max, and average slowdowns by average-footprint analysis and by all-footprint analysis. On average, average-footprint analysis is 38 times faster.

benchmark       | inputs | min n (10⁹) | max n (10⁹) | min m (64·10³) | max m (64·10³) | min d_i | max d_i
186.crafty      | 3  | 0.94  | 41.40   | 29.85    | 29.94    | 0.01 | 0.05
188.ammp        | 3  | 0.82  | 32.27   | 176.55   | 176.56   | 0.06 | 0.16
254.gap         | 3  | 0.28  | 67.64   | 587.06   | 3147.45  | 0.09 | 0.18
164.gzip        | 3  | 0.80  | 15.91   | 12.33    | 247.83   | 0.07 | 0.20
177.mesa        | 3  | 0.17  | 31.44   | 99.63    | 110.07   | 0.11 | 0.21
183.equake      | 3  | 0.34  | 108.31  | 158.32   | 666.40   | 0.11 | 0.23
176.gcc         | 3  | 0.26  | 6.24    | 57.68    | 2347.59  | 0.08 | 0.29
179.art         | 3  | 1.02  | 26.99   | 22.20    | 44.86    | 0.10 | 0.32
197.parser      | 3  | 0.93  | 78.04   | 97.69    | 406.01   | 0.10 | 0.33
256.bzip2       | 3  | 3.18  | 33.21   | 241.72   | 1429.20  | 0.19 | 0.36
175.vpr         | 3  | 0.30  | 33.74   | 21.82    | 132.74   | 0.33 | 0.44
181.mcf         | 3  | 0.04  | 13.56   | 21.16    | 1259.02  | 0.09 | 0.49
300.twolf       | 3  | 0.08  | 105.90  | 1.14     | 46.66    | 0.09 | 0.51
444.namd        | 3  | 31.66 | 1171.18 | 734.98   | 735.52   | 0.01 | 0.02
435.gromacs     | 3  | 2.85  | 1306.00 | 221.82   | 222.12   | 0.02 | 0.05
470.lbm         | 3  | 4.28  | 797.31  | 6702.79  | 6702.93  | 0.06 | 0.10
459.GemsFDTD    | 3  | 5.49  | 1556.05 | 10253.37 | 13609.65 | 0.04 | 0.13
453.povray      | 3  | 1.75  | 679.12  | 39.45    | 39.64    | 0.03 | 0.14
454.calculix    | 3  | 0.10  | 3851.47 | 43.13    | 2792.43  | 0.09 | 0.16
482.sphinx3     | 3  | 3.27  | 1332.42 | 455.39   | 643.02   | 0.08 | 0.21
464.h264ref     | 5  | 71.70 | 2876.88 | 213.27   | 1061.36  | 0.11 | 0.22
436.cactusADM   | 3  | 7.83  | 2300.05 | 385.21   | 10232.62 | 0.07 | 0.23
434.zeusmp      | 3  | 22.70 | 856.22  | 1525.06  | 8120.55  | 0.15 | 0.26
410.bwaves      | 3  | 24.07 | 1905.13 | 961.20   | 14502.83 | 0.04 | 0.27
429.mcf         | 3  | 2.43  | 216.12  | 1371.95  | 13736.28 | 0.17 | 0.28
458.sjeng       | 3  | 7.43  | 1109.88 | 2854.29  | 2855.60  | 0.10 | 0.29
401.bzip2       | 11 | 6.47  | 439.71  | 243.89   | 6855.16  | 0.06 | 0.38
400.perlbench   | 88 | 0.00  | 549.55  | 3.90     | 9062.31  | 0.01 | 0.41
437.leslie3d    | 3  | 30.05 | 1121.18 | 227.96   | 2018.42  | 0.15 | 0.42
403.gcc         | 11 | 1.80  | 105.74  | 147.77   | 15040.30 | 0.04 | 0.46
433.milc        | 3  | 14.93 | 514.81  | 144.04   | 11332.96 | 0.21 | 0.51
450.soplex      | 5  | 0.04  | 209.81  | 19.44    | 7633.04  | 0.12 | 0.53
456.hmmer       | 4  | 10.51 | 1433.38 | 8.16     | 661.50   | 0.10 | 0.57
462.libquantum  | 3  | 0.20  | 1514.66 | 3.28     | 2098.58  | 0.32 | 0.59
465.tonto       | 3  | 2.10  | 1462.34 | 36.35    | 644.53   | 0.20 | 0.59
473.astar       | 5  | 18.15 | 579.31  | 104.20   | 3599.33  | 0.25 | 0.66
416.gamess      | 5  | 1.28  | 2305.93 | 48.12    | 100.05   | 0.08 | 0.81
471.omnetpp     | 3  | 1.37  | 407.70  | 105.54   | 1644.20  | 0.39 | 0.84
445.gobmk       | 20 | 0.06  | 328.61  | 254.23   | 310.52   | 0.07 | 1.89

TABLE V
Similarity of the footprint in different executions of the 37 SPEC2K/2006 benchmarks as measured by the max and min Manhattan distance (max d_i, min d_i)
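The variation metric d_i reported in the table can be sketched in a few lines. This is our own minimal illustration of the Section G formula, with hypothetical function names and example data:

```python
def footprint_variation(fps):
    """Given k footprint functions, each a list of average footprints
    f_i(w_j) sampled at the same |W| window lengths, return the
    Manhattan distance d_i of each function from the mean function."""
    k = len(fps)
    W = len(fps[0])
    # mean footprint function, averaged point-wise over the k runs
    mean = [sum(f[j] for f in fps) / k for j in range(W)]
    # d_i = (1/|W|) * sum_j |f_i(w_j)/mean(w_j) - 1|
    return [sum(abs(f[j] / mean[j] - 1) for j in range(W)) / W
            for f in fps]

# two hypothetical runs whose footprints differ by a constant factor
d = footprint_variation([[100, 200, 300], [120, 240, 360]])
print(d)  # each run is 1/11 ≈ 9.1% away from the mean function
```

A d_i of 0.30 corresponds to the 30% variation threshold discussed in Section G.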

