Cache Awareness in Blocking Techniques

O. Temam (Versailles University, France), C. Fricker (INRIA, France), W. Jalby (Versailles University, France)

Abstract. To date, data locality optimizing algorithms mostly aim at providing strategies for blocking and reordering loops, but little research has been devoted to the final step: finding the optimal block size, i.e., a block size that provides the best possible performance. Optimal block sizes are currently computed as if a cache were a local memory, i.e., cache interferences are ignored. Case studies have already shown that cache interferences can greatly affect the optimal block size value. The purpose of this article is to show that analytical modeling of cache interferences can be used to compute near-optimal block sizes for blocked loop nests. First, the method for evaluating cache interferences is presented. Second, the model is validated by correlating the estimated miss ratio with the simulated miss ratio and the execution time of various loop nests. Then, current techniques for computing the optimal block size are analytically and experimentally shown to yield below-optimal performance. Finally, current block size computation techniques are augmented with analytical modeling of cache interferences and TLB misses, and this new technique is shown to yield near-optimal performance and make blocking techniques safe. Reciprocally, it is also shown that even when no capacity miss occurs, finely tuned blocking techniques can be used to significantly reduce the number of cache interferences.

Keywords: blocking, optimal block size, cache interferences, data locality.

1 Introduction

1.1 Motivations and Goals

The purpose of this article is to address the final step of blocking algorithms, i.e., computing an optimal or near-optimal block size value. The optimal block size is the block size value for which the blocked loop nest achieves the lowest execution time; a block size value is near-optimal if the execution time achieved is close to that minimum. Most data locality algorithms [1, 2, 3, 4] do not address this step and focus on loop transformations. However, Lam et al. [5] and Ferrante et al. [6] have shown that the block size value can have a significant impact on the performance of blocked algorithms. Eisenbeis et al. [1] have proposed to find the optimal block size by computing the number of misses as a function of the block size and finding the block size that minimizes this function; however, this study is restricted to capacity misses only. Hill [7] has distinguished three types of misses: compulsory misses, which correspond to the first reference to a data item; capacity misses, which occur when cache space is insufficient to store all data to be reused; and conflict misses, which are due to the mapping function used in caches. Because in most caches a given address can be mapped to a restricted set of lines, two data items can compete for the same cache line, inducing conflict misses. While capacity misses can be computed relatively easily, evaluating the number of conflict misses is a more difficult task. (This work was supported by the Esprit Agency DG XIII under Grant No. APPARC 6634 BRA III.)

In the present article, using a modeling framework presented in [8], we show that both capacity and conflict misses can be computed as a function of the block size. Blocked algorithms are shown to perform significantly better when the optimal block size is computed using a function of both capacity and conflict misses rather than of capacity misses only. Coleman et al. [9] have also proposed a model for computing certain types of cache conflicts and deducing a near-optimal block size value. That study focuses on self-interferences, though it also deals with cross-interferences to a lesser extent. However, a recent study [10] shows that most conflict misses are cross-interference misses and that self-interference misses do not occur as frequently.
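Hill's three-way classification can be reproduced operationally with a standard simulation trick: replay the line-address trace against both the actual (here, direct-mapped) cache and a fully-associative LRU cache of the same capacity; a miss that would have hit in the fully-associative cache is a conflict miss. The sketch below is ours, for illustration only (the paper's model is analytical, not simulation-based):

```python
# Sketch: classifying misses into Hill's three categories by replaying a
# trace of cache-line numbers against (a) a direct-mapped cache and (b) a
# fully-associative LRU cache of the same capacity. Names are illustrative.
from collections import OrderedDict

def classify_misses(trace, num_lines):
    """trace: iterable of cache-line numbers.
    Returns (compulsory, capacity, conflict) miss counts."""
    direct = {}          # set index -> line currently stored (direct-mapped)
    lru = OrderedDict()  # fully-associative LRU cache of num_lines entries
    seen = set()
    compulsory = capacity = conflict = 0
    for line in trace:
        idx = line % num_lines
        hit_dm = direct.get(idx) == line
        hit_fa = line in lru
        if hit_fa:
            lru.move_to_end(line)
        else:
            lru[line] = True
            if len(lru) > num_lines:
                lru.popitem(last=False)  # evict least-recently-used line
        if not hit_dm:
            if line not in seen:
                compulsory += 1
            elif hit_fa:
                conflict += 1   # fits in an ideal cache, lost to mapping only
            else:
                capacity += 1
        seen.add(line)
        direct[idx] = line
    return compulsory, capacity, conflict
```

On the trace [0, 4, 0] with a 4-line cache, the second access to line 0 is a conflict miss: it hits in the fully-associative cache but not in the direct-mapped one.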

1.2 Terminology

Array element: An array element denotes an array entry, such as A(5).

Temporal and spatial locality [11]: A data item exhibits temporal locality if it is reused within a short time interval. A data item exhibits spatial locality if data located at nearby addresses are also referenced.

Self-dependence [3]: There is a self-dependence on an array reference when it reuses the elements it has previously referenced. This reuse can be either temporal or spatial. More precisely, self-dependence spatial reuse occurs when a reference uses more than one data item within a cache line.

Group-dependence [3]: There is a group-dependence between two array references when one reference reuses the elements initially used by the other reference.

Dependence/reuse distance: The existence of a data dependence indicates that the corresponding data is reused. Thus, we use the terms dependence distance and reuse distance interchangeably. These distances are expressed in the number of innermost loop iterations that separate, i.e., that are executed between, the first reference and the second reference to the data.

Reference: In this article, a reference corresponds to a source-code reference to an array, like A(j,i) in a DO i, DO j nest.

Footprint [12]: The cache mapping function maps a virtual memory address to a cache line number. The footprint of a set of array elements is the set of cache lines to which the corresponding virtual addresses are mapped.

Interference [5]: An interference miss, or conflict miss, occurs when a reference cannot reuse a data item previously loaded in cache because it was flushed from cache by another data item. If the flushing data item was loaded by this same reference, it is a self-interference; otherwise it is a cross-interference. Later on, we will further distinguish between internal cross-interferences and external cross-interferences. Throughout this article, cache mapping conflicts are termed cache interferences, because a conflict corresponds to a disruption, i.e., an interference, in the reuse of data.

1.3 Position of the Problem

1.3.1 Computing the Optimal Block Size

For the general problem of computing the optimal block size of any loop nest, the most elaborate treatment can be found in the article by Eisenbeis et al. on Windows [1], so we start from that point. In that article, for each reference, the set of data to be reused is called the reference window, and the optimal block size is the solution of the following optimization problem.

Problem 1.1 Let B be the block size of any blocked loop in the loop nest (for the sake of simplicity, the same block size is used for all loops; further performance improvements might be obtained with a distinct block size for each loop). Let Ncapacity(B) be the number of capacity misses of an n-deep loop nest, and let W(B) be the sum of the window sizes over all references. Find B that minimizes Ncapacity(B), under the constraint that all windows fit in cache, i.e., W(B) ≤ CS (where CS is the cache size). Note that the number of compulsory misses is not considered, because it is independent of B; more generally, any term independent of B is ignored.


Because Ncapacity(B) only accounts for capacity misses, the optimal block sizes obtained are inaccurate. In order to obtain a more accurate solution, the following problem must be solved.

Problem 1.2 Let Ninterference(B) be the number of interference misses. Find B that minimizes Ncapacity(B) + Ninterference(B), under the constraint W(B) ≤ CS.

So the problem comes down to computing the total number of cache misses in a loop nest. Note that the optimization proposed in this article is not a data locality optimizing algorithm in itself; rather, it is a complement to most existing algorithms. However, strategies that rely on the evaluation of cache misses for directing loop reordering can be strongly affected by this enhancement.
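Problem 1.2 is a one-dimensional constrained minimization and can be sketched as an exhaustive search. The cost and window functions below are hypothetical stand-ins for the analytical expressions derived in section 2:

```python
# Sketch of Problem 1.2 as a one-dimensional search. n_capacity,
# n_interference and w are hypothetical stand-ins for the analytical
# miss-count and window-size expressions derived later in the paper.
def optimal_block_size(n_capacity, n_interference, w, cache_size, b_max):
    """Return the block size B minimizing capacity + interference misses,
    subject to the windows fitting in cache: w(B) <= cache_size."""
    feasible = [b for b in range(1, b_max + 1) if w(b) <= cache_size]
    if not feasible:
        raise ValueError("no block size satisfies the window constraint")
    return min(feasible, key=lambda b: n_capacity(b) + n_interference(b))
```

Exhaustive search is affordable here because B ranges over at most a few thousand values.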

1.3.2 Blocking to Reduce Cache Interferences

Cache interferences operate by disrupting the reuse of data. Interferences are diverse and frequent because any type of reuse can be disrupted: temporal or spatial, self-dependence or group-dependence. In general, the larger the reuse distance for a given data item, the higher the probability that it is flushed from cache, because more data are loaded before reuse occurs. Now, the initial goal of blocking is to reduce the amount of data to be reused (i.e., to be kept in cache), but a side-effect of this strategy is to decrease the reuse distance. Consequently, even when no capacity miss occurs, blocking can prove efficient for reducing cache interferences.

1.4 Model Hypotheses and Experiments

1.4.1 Model Hypotheses

For the sake of simplicity, a number of restrictive hypotheses have been adopted on the loop nests considered in this study. First, only uniformly generated dependences with constant dependence distances are considered. This excludes subscripts such as A(j1 + j2), for instance. Note that previous studies [13] show that these hypotheses still encompass a large majority of the subscripts found in numerical codes. Second, the loop boundaries are assumed to be constant. The consequence of both hypotheses is to have constant dependence distances. Extending the model to linear dependence distances (the most usual case after constant distances) and linear loop boundaries does not raise major difficulties, but the simple expressions of section 2.3 would be replaced by sums over the iteration domain. Still, since the bodies of these sums would be linear expressions, they could be computed symbolically, though they would result in more complex formulas. The constant dependence cases are sufficiently numerous that we did not feel considering the non-constant case was necessary to validate this model. Though only perfectly-nested loop nests are discussed in this article, the notions of reuse set and interference set developed in the next section can be extended to non-perfectly nested loop nests. In this article, it is assumed that all necessary information on the loop nest and the data structures is known, i.e., loop boundaries, array subscripts, array dimensions, and array base addresses.

1.4.2 Experiments

All experiments have been performed on an HP/PA-RISC workstation (except for figure 2.3.1f). In the article, all dimensions are expressed in double-precision floating-point data of 64 bits, i.e., the size of an array element. The cache parameters and notations used in the article are the following: cache size CS = 256 kbytes = 32768 array elements; cache line size LS = 64 bytes = 8 array elements; the cache is direct-mapped; TLB size TS = 64 entries; page size PS = 8 kbytes = 1024 array elements. It was experimentally found that a TLB miss is roughly 14 times more expensive than a cache miss (this cost ratio is the parameter introduced in section 3.1). All loops were compiled using optimization level +O2. Note that this ratio is particularly large on the HP; on a SUN Sparcstation 2, it is equal to 2.


In this article, several experiments required that the relative base addresses of the arrays be controlled. This was achieved by having all arrays point into a single large array; by varying the position pointed to by the first element of each array, the relative base addresses of the arrays can be changed as desired. Within the article, three types of statistics can be found: the simulated miss ratio (for a whole loop or for an array), the estimated miss ratio (obtained with the model), and the global execution time of the loop (either measured, simulated, or estimated), timed on an HP/PA-RISC workstation. The purpose of the simulated and estimated miss ratios is to check the model precision. The execution time curves corresponding to simulations and modeling have been obtained with the following technique. Since the purpose of the model is to predict variations of the execution time due to cache interference phenomena, rather than the absolute execution time (which depends on many other features than just the cache), a starting point (x0, t0) is picked from the measured execution time graph (where x is the parameter varied in the graph and t is the execution time). Let M(x) be the number of simulated or estimated cache misses for parameter x; the simulated or estimated time function is then equal to t = t0 + (M(x) − M(x0)) × tlat, where tlat is the miss penalty. Thus, the measured time graph fits the simulated or estimated time graph only if the simulated/estimated number of cache misses corresponds to the real number of cache misses.
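This correlation technique can be sketched as follows (function and parameter names are ours):

```python
# Sketch of the correlation technique described above: anchor the derived
# curve at a measured point (x0, t0) and obtain the rest of the curve from
# miss-count differences. t_lat is the assumed per-miss penalty.
def estimated_time(misses, x0, t0, t_lat):
    """misses: dict mapping parameter value x -> simulated/estimated miss
    count M(x). Returns dict x -> derived execution time t(x)."""
    m0 = misses[x0]
    return {x: t0 + (m - m0) * t_lat for x, m in misses.items()}
```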

1.5 Organization of the article

In section 2, the model used to estimate cache interferences is presented and then validated using experiments. In section 3, the model of cache interferences is used to enhance the optimal block size computation and, conversely, to show that blocking can be used to significantly reduce cache interferences.

2 Modeling Cache Behavior

This section recalls the notions previously developed in [14, 15, 8] and presents the techniques used to estimate the number of interference misses.

Basic Concepts

DO 1 j1=0,N-1
 DO 1 j2=0,N-1
  reg = X(j2,j1)
  DO 1 j3=0,N-1
   Z(j3,j1) += reg * Y(j3,j2)
1 CONTINUE

Evaluating the miss ratio of a loop nest amounts to counting, for each reference, the number of cache misses due to disruptions of (spatial or temporal) reuse. So, for each reference, it is necessary to determine when reuse occurs (i.e., the loop level where reuse occurs for the first time). Consider the loop above. The reuse of reference Y(j3,j2) occurs on loop j1 and the reuse of reference Z(j3,j1) occurs on loop j2. This reuse loop level defines the set of data to be reused. For reference Y(j3,j2), it is equal to {Y(0,0), …, Y(N−1,N−1)}, while it is equal to {Z(0,j1), …, Z(N−1,j1)} for reference Z(j3,j1). Similarly, this loop level also defines the sets of data of all other references that can interfere with the "reuse" set. For reference Y(j3,j2), the "interference" sets are {Z(0,j1), …, Z(N−1,j1)} and {X(0,j1), …, X(N−1,j1)} (Z and X can flush elements of Y from cache before they are reused); see figure 2.3.3a. For reference Z(j3,j1), the "interference" sets are {Y(0,j2), …, Y(N−1,j2)} and {X(j2,j1)}. On each iteration of the reuse loop, the number of cache misses due to disruption of reuse is equal to the size of the intersection (in cache lines) between the "reuse" set and the "interference" sets.

Considering the problem this way implicitly abstracts away time considerations, i.e., when interferences occur. The problem is then equivalent to computing the intersection size between several sets. Summing these intersections over all iterations of the reuse loop provides the total number of cache interference misses for a given reference. The next sections provide a formal treatment of this method.

Overview of the Model Elaboration

Using the intuitions developed in the previous paragraph, the problem of estimating the number of cache interference misses within a loop nest can be decomposed more formally as follows:

1. For each reference R, determine the set of data to be reused (section 2.1).

2. For each other reference R', determine the set of data that can interfere with the reuse set of R (section 2.2). Determine the actual sets, i.e., the mapping of the reuse and interference sets in cache (section A).

3. Deduce the number of interference misses (section 2.3):
 (a) Cross-interference misses are deduced from the computation of the intersection between the actual reuse set and the actual interference sets (sections 2.3.1 and 2.3.2). Two cases are distinguished: internal cross-interferences, where the relative distance between the reuse set and the interference set is constant, and external cross-interferences, where it is not.
 (b) Self-interference misses are deduced from the computation of the actual reuse set (section 2.3.3).

2.1 Reuse Set

Within a set of array elements to be reused, it is possible that two elements map to the same cache location. These elements then alternately flush each other from cache and can never be reused. Therefore, they should not be counted within the set of elements that can be effectively reused. It is thus necessary to distinguish between the theoretical reuse set and the actual reuse set. Intuitively, the theoretical reuse set is the set of elements to be reused, while the actual reuse set is the set of elements that can be effectively reused after they have been mapped into a finite-sized cache (this set is defined in cache lines).

Definition 2.1 Assuming loops j1, …, jn are nested (with jn the innermost loop), the reuse loop level l of a reference R = r0 + r1 j1 + … + rn jn is defined as l = max{k | rk = 0}. Intuitively, the reuse loop level corresponds to the innermost loop which carries reuse; other loops either do not carry reuse or breed less reuse anyway. The ri are the loop index coefficients of the array subscript obtained after linearization.

Definition 2.2 For any array reference R corresponding to the address r0 + r1 j1 + … + rn jn, whose reuse loop is l, the theoretical reuse set is equal to TRS(R) = {r0 + r1 j1 + … + rn jn, (0 ≤ ji ≤ Ni − 1) for i > l}.

The parameters Ni correspond to the loop boundaries. The coefficients ri are modified so that the loop boundaries can be expressed as 0 ≤ ji ≤ Ni. As already mentioned, this condition excludes loops with non-constant boundaries. In the above definition, the theoretical reuse set is expressed in array elements. The actual reuse set size can also be expressed in cache lines by mapping the set of array elements into an infinite-sized cache. In this case, the cache lines are also called data lines; see the definition below.

Definition 2.3 The data line corresponding to an array element is the cache line to which this data is mapped in an infinite-sized cache.

Definition 2.4 The actual reuse set ARS(R) is the cache footprint of the theoretical reuse set TRS(R), excluding all cache lines for which mapping conflicts occur, i.e., all cache lines to which two or more data lines are mapped.

Definition 2.5 ‖TRS(R)‖ is the size of the theoretical reuse set of R, expressed in array elements.

Definition 2.6 ‖ARS(R)‖ is the size of the actual reuse set of R, expressed in cache lines.
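Definitions 2.1 and 2.2 can be sketched in code for a linearized reference with coefficients r1, …, rn and constant loop bounds N1, …, Nn (names are ours; the max() in reuse_loop_level assumes at least one coefficient is zero, i.e., the reference carries reuse):

```python
# Sketch of Definitions 2.1-2.2 for a linearized reference
# r0 + r1*j1 + ... + rn*jn. coeffs = [r1, ..., rn] (loops numbered from 1,
# outermost first), bounds = [N1, ..., Nn].
def reuse_loop_level(coeffs):
    """Innermost loop whose index is absent from the subscript (Def. 2.1).
    Assumes at least one coefficient is zero."""
    return max(k for k, r in enumerate(coeffs, start=1) if r == 0)

def trs_size(coeffs, bounds):
    """Size of the theoretical reuse set in array elements (Def. 2.2):
    the product of the bounds of the loops inside the reuse loop."""
    l = reuse_loop_level(coeffs)
    size = 1
    for n in bounds[l:]:
        size *= n
    return size
```

For loop 2.3.3a with N = 100, reference Y(j3,j2) has coefficients (0, N, 1) over (j1, j2, j3), giving reuse level 1 and a theoretical reuse set of N × N elements, while Z(j3,j1) has coefficients (N, 0, 1), giving reuse level 2 and a set of N elements, matching the example of the previous section.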

Example Consider loop 2.3.3a. The reuse loop level of Z(j3,j1) is 2 (i.e., loop j2). The reuse set of Z(j3,j1) is equal to {z0 + j3 + N·j1, (0 ≤ j3 ≤ N − 1)}, where z0 is the base address of array Z and N is the leading dimension of arrays X, Y, Z. In this case ‖TRS(R)‖ = N array elements. The actual reuse set corresponds to the cache locations {(z0 + j3 + N·j1) mod CS, (0 ≤ j3 ≤ N − 1)}. Assuming a cache size of 4 array elements (8 words) and a 1-element (2-word) cache line, the actual reuse set of Z is equal to N cache lines if N ≤ 4, to 8 − N cache lines if 4 < N < 8, and it is empty if N ≥ 8.
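Definition 2.4 can be checked numerically on this example: map the element addresses into a direct-mapped cache and drop every cache line receiving two or more data lines. The sketch (ours) assumes one-element cache lines, as in the example:

```python
# Sketch of Definition 2.4: size of the actual reuse set, i.e., the cache
# footprint of a set of addresses minus the lines hit by mapping conflicts.
# Units are array elements with a one-element cache line.
from collections import Counter

def ars_size(addresses, cache_size):
    lines = Counter(a % cache_size for a in addresses)
    # keep only cache lines that receive exactly one data line
    return sum(1 for count in lines.values() if count == 1)
```

With z0 = 0 and CS = 4, ars_size(range(N), 4) reproduces the piecewise values N, 8 − N, and 0 given above.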

2.2 Interference Set

The definition of the set of array elements that can interfere with a reuse set is very similar to that of the reuse set. As for the reuse set, two interference sets are defined: the theoretical interference set and the actual interference set. The theoretical interference set is the set of elements that can interfere with the reuse set, while the actual interference set is the cache footprint of the interfering elements (this set is defined in cache lines).

Definition 2.7 For any array reference R = r0 + r1 j1 + … + rn jn that can interfere with a reuse set defined on loop level l, the theoretical interference set is equal to TIS(R) = {r0 + r1 j1 + … + rn jn, (0 ≤ ji ≤ Ni − 1) for i > l}.

As for the reuse set, the theoretical interference set can be expressed in array elements or data lines; by default, it is expressed in array elements. In contrast to the actual reuse set, when two elements of a theoretical interference set map to the same cache line, this cache line can still induce interferences, so it is counted in the actual interference set.

Definition 2.8 The actual interference set AIS(R) is exactly the cache footprint of the theoretical interference set TIS(R).

Definition 2.9 As for the reuse set, ‖TIS(R)‖ and ‖AIS(R)‖ denote respectively the sizes of the theoretical and actual interference sets of R.

Example Consider loop 2.3.3a. The reuse of Y(j3,j2) occurs on loop j1, so the corresponding interference set of Z(j3,j1) is defined on loop j1. The interference set of Z(j3,j1) is equal to {z0 + j3 + M·j1, (0 ≤ j3 ≤ N − 1)}, where M is the leading dimension of array Z. In this case ‖TIS(R)‖ = N array elements. Assuming again a cache size of 4 array elements (8 words) and a 1-element (2-word) cache line, the actual interference set of Z is equal to N cache lines if N < 4, and to 4 cache lines if N ≥ 4.
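In contrast with the actual reuse set, Definition 2.8 keeps conflicting lines, so the actual interference set is simply the footprint (sketch ours, with the same one-element-line assumption as the example):

```python
# Sketch of Definition 2.8: the actual interference set is the plain cache
# footprint; lines receiving several data lines still count once.
def ais_size(addresses, cache_size):
    return len({a % cache_size for a in addresses})
```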


2.3 Cache Interference Misses

In this section, the expressions of the number of interference misses are provided for each type of interference, based on the reuse/interference set framework. Consider the reuse set of a reference R (reuse occurs on loop l) and the corresponding interference set of a reference R'.

Definition 2.10 The cache position pos of a set corresponds to the first element of this set, i.e., if a set is defined on loop l, its cache position is defined by

pos(set(R)) = (r0 + r1 j1 + … + rl jl) mod CS.

Lemma 2.1 The relative cache position of the interference set of R' with respect to the reuse set of R is equal to

pos(AIS(R')) − pos(ARS(R)).

Lemma 2.2 The relative cache distance δ between the interference set of R' and the reuse set of R is equal to

δ = |pos(AIS(R')) − pos(ARS(R))|.

Example Consider loop 2.3.3a. The reuse of Y(j3,j2) occurs on loop j1, so the corresponding interference set of Z(j3,j1) is defined on loop j1. The position of the actual reuse set of Y(j3,j2) is equal to pos(ARS(Y(j3,j2))) = y0 mod CS, and the position of the actual interference set of Z(j3,j1) is equal to pos(AIS(Z(j3,j1))) = (z0 + M·j1) mod CS (where M is the leading dimension of array Z). The relative position of the two sets is then (z0 + M·j1) mod CS − y0 mod CS.

2.3.1 Internal Cross-Interferences

Definition 2.11 If δ is independent of j1, …, jl, the cross-interferences between R and R' are called internal cross-interferences.

Proposition 2.1 If the cross-interferences between R and R' are internal cross-interferences, the size of the intersection (in cache lines), denoted CL(ARS(R), AIS(R'), δ), between ARS(R) and AIS(R') is constant over the iterations of loops j1, …, jl. The number of cache misses due to such cross-interferences is then equal to N1 × … × Nl × CL(ARS(R), AIS(R'), δ).

Proposition 2.2 The intersection between two intervals of sizes S1 and S2, separated by a distance of δ = |pos(S1) − pos(S2)| cache locations, is equal to

CL(S1, S2, δ) = [min((S2 − δ)+, S1) + min((S1 + δ − CS)+, S2)] / LS.

For instance, S1 = ‖ARS(R)‖ and S2 = ‖AIS(R')‖. Floor and ceiling functions have generally been omitted in this article when computing a number of cache lines, because this approximation proved to have very little influence on precision. If the sets correspond to a collection of intervals (instead of a single interval), each interval of the reuse set is compared with each interval of the interference set, and the size of the intersection is obtained by summing over all these sub-cases.

Example In loop 2.3.1a, internal cross-interferences between arrays A and B occur (note that loops j1 and j2 cannot be interchanged because of the stride-one access of reference C(j2,j1)). The relative cache distance between the two arrays is equal to δ = |a0 mod CS − b0 mod CS|. Figures 2.3.1b and 2.3.1c illustrate the model precision and the impact of internal cross-interferences on global performance when


δ is varied from δ < N2 (the reuse sets of A and B overlap) to δ > N2 (no overlapping, i.e., no interference). Figure 2.3.1d shows the layout of arrays A and B in cache for one iteration of loop j1. The line intervals represent the reuse sets of A and B. The cache mapping of the sets is obtained by projecting the array intervals onto the line representing the cache.
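Proposition 2.2 can be checked against a brute-force intersection of two intervals on a circular cache. In the sketch below (ours), sizes and the distance delta are already expressed in cache lines, so the division by LS is dropped; delta is taken as the offset of the reuse interval relative to the interference interval:

```python
# Sketch of Proposition 2.2 with all sizes and the distance delta already
# expressed in cache lines (the paper additionally divides by the line
# size L_S when the inputs are in array elements).
def cl(s1, s2, delta, cache_size):
    """Intersection size of a reuse interval of size s1 placed delta lines
    after an interference interval of size s2, on a circular cache."""
    pos = lambda v: max(v, 0)   # the (x)+ notation of the proposition
    return (min(pos(s2 - delta), s1)               # head overlap
            + min(pos(s1 + delta - cache_size), s2))  # wrap-around overlap
```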

2.3.2 External Cross-Interferences

For external cross-interferences, only an approximate evaluation is used; it proved to be sufficiently precise. The approximation relies on a simple probabilistic argument, assuming δ (the distance between the reuse set of R and the interference set of R') is a random variable with a uniform distribution. Interferences are averaged over all values of δ. Hence the following proposition.

Proposition 2.3 The approximate number of cache misses due to external cross-interferences between R and R' over the execution of the loop nest is equal to

N1 × … × Nl × (1/CS) × Σ_{δ=0}^{CS−1} CL(‖ARS(R)‖, ‖AIS(R')‖, δ)

where CL(‖ARS(R)‖, ‖AIS(R')‖, δ) is the function defined in section 2.3.1.
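Proposition 2.3 then averages the interval intersection over all relative distances. The sketch below (ours) restates the intersection of Proposition 2.2 so as to be self-contained, with all sizes in cache lines:

```python
# Sketch of Proposition 2.3: average the interval intersection of
# Proposition 2.2 over all possible relative distances, assuming delta is
# uniformly distributed over the cache.
def external_misses(n_outer, ars, ais, cache_size):
    """n_outer = N1 * ... * Nl; ars, ais in cache lines."""
    pos = lambda v: max(v, 0)
    def cl(s1, s2, d):
        return min(pos(s2 - d), s1) + min(pos(s1 + d - cache_size), s2)
    avg = sum(cl(ars, ais, d) for d in range(cache_size)) / cache_size
    return n_outer * avg
```

Note that on a circular cache the average intersection of two intervals comes out to exactly ‖ARS‖ × ‖AIS‖ / CS, since every pair of cache lines coincides exactly once over a full rotation of δ.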

Example Consider loop 2.3.2a, which corresponds to the multiplication of an N × N matrix A with a vector X. External cross-interferences occur between array X (reuse set of size N) and array A (interference set of size N). As can be seen in figure 2.3.2c, the time per reference increases with N, because the temporal reuse of array X is disrupted by array A, until a threshold value is reached where the temporal reuse cannot be exploited at all. Figure 2.3.2d shows the layout of arrays A and X in cache after iteration j1 = 2. There are three intervals of A (each corresponding to an interference set) and one interval of X corresponding to the reuse set.

2.3.3 Self-Interferences

Assume the reuse loop level of reference R is l.

Theorem 2.1 The number of cache misses due to self-interferences of R is equal to

N1 × … × Nl × (‖TRS(R)‖ − ‖ARS(R)‖).

Intuitively, each cache line excluded from the actual reuse set because of conflicts generates one cache miss each time the reuse set is referenced; hence the proposition.

Example Consider loop 2.3.3a, corresponding to the multiplication of two N × N sub-matrices of M × M matrices. The self-interference of array Y(j3,j2) is the largest source of cache misses in this loop. The reuse set of Y(j3,j2) is a set of N intervals of size N, spread with a relative distance of M mod CS. Figure 2.3.3b illustrates the model precision and the impact of self-interferences on global performance. As can be seen, for M = 800 the self-interferences of Y are nearly total: temporal reuse cannot be exploited. Spatial reuse can still be exploited, so the miss ratio is close to the theoretical minimum of 1/LS = 0.125. Figure 2.3.3d illustrates the way array Y can distribute in cache after six iterations of loop j2 (for the sake of clarity, the scale has been modified).
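Theorem 2.1 can be sketched for the layout just described, a reuse set of n intervals spread with a constant stride, by computing ‖TRS‖ − ‖ARS‖ directly (sketch ours; one-element cache lines are assumed, so array elements and data lines coincide):

```python
# Sketch of Theorem 2.1 combined with Definition 2.4, for a reuse set made
# of n_intervals intervals of interval_size elements placed with a constant
# stride (the layout of Y(j3,j2) described above). One-element cache lines.
from collections import Counter

def self_interference_misses(n_reuse_iters, n_intervals, interval_size,
                             stride, cache_size):
    lines = Counter((i * stride + k) % cache_size
                    for i in range(n_intervals)
                    for k in range(interval_size))
    trs = n_intervals * interval_size                  # ||TRS||
    ars = sum(1 for c in lines.values() if c == 1)     # ||ARS|| (Def. 2.4)
    return n_reuse_iters * (trs - ars)
```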

2.3.4 Putting it all together

In this section, the number of cache misses of loop 2.3.4a is computed, based on the techniques presented in the previous sections. This loop is a modified version of a loop in ARC2D, a Perfect Club benchmark [16]. In this loop, several interference phenomena occur simultaneously (internal cross-interferences, and external cross-interferences of self-dependences/group-dependences). The leading dimension of arrays F, U is KU. One parameter, δ, is used to characterize the distance between the different arrays, so that all interference phenomena can be visited with a single parameter. As can be seen in figure 2.3.4b, the occurrence of the different interference phenomena determines four intervals, separated by rare but extreme cases corresponding to ping-pong (disruption of spatial reuse). Figure 2.3.4d shows the cache layout of the different arrays after one iteration of loop J.

Loop 2.3.1a (N1=250, N2=1000):

DO j1=0,N1-1
 DO j2=0,N2-1
  C(j2,j1) = A(j2)+B(j2)
 ENDDO
ENDDO

Figure 2.3.1: Internal Cross-Interferences. Figure 2.3.1b shows the estimated and simulated miss ratios (of the entire loop and of arrays A and B) as a function of δ, the relative distance between arrays A and B (in array elements); figure 2.3.1c shows the corresponding execution time for 100 runs (measured time, and correlated curves based on simulation or on estimates); figure 2.3.1d shows the layout of the reuse and interference sets of A and B in cache.

Loop 2.3.2a:

DO j1=0,N-1
 reg = Y(j1)
 DO j2=0,N-1
  reg += A(j2,j1) * X(j2)
 ENDDO
 Y(j1) = reg
ENDDO

Figure 2.3.2: External Cross-Interferences. Figure 2.3.2b shows the estimated and simulated miss ratios (of the entire loop and of array X) as a function of N; figure 2.3.2c shows the average execution time per reference (in 10^-7 seconds); figure 2.3.2d shows the layout in cache of the intervals of A (interference sets) and of X (reuse set).

Loop 2.3.3a (N=100):

DO j1=0,N-1
 DO j2=0,N-1
  reg = X(j2,j1)
  DO j3=0,N-1
   Z(j3,j1) += reg * Y(j3,j2)
  ENDDO
 ENDDO
ENDDO

Figure 2.3.3: Self-Interferences. Figure 2.3.3b shows the estimated and simulated miss ratios (of the entire loop and of array Y) as a function of the matrix leading dimension M of X, Y, Z (from 800 to 850); figure 2.3.3c shows the execution time for 100 runs; figure 2.3.3d shows how array Y distributes in cache.

3 Estimating the Optimal Block Size

Section 3.1 describes a simple heuristic for evaluating TLB misses, a necessary add-on to our techniques for computing the optimal block size. In section 3.2, the process of computing the optimal block size in the presence of cross-interferences is illustrated with an example. Similarly, in section 3.3, the optimal block size is computed in the presence of self-interferences. In section 3.4, the extension to spatial interferences is discussed. In section 3.5, further issues related to the evaluation of interference misses are discussed. A formal algorithm for estimating the optimal block size is provided in section 3.6. In the experiments of this section, all graphs correspond to measured execution times on the HP/PA-RISC (except for figure 3.2f).

3.1 TLB Misses

In the next sections, the computation of cache misses is used to derive the optimal block size. However, most processors also include a second type of cache, a Translation Lookaside Buffer (TLB), which buffers the latest virtual-to-physical page address translations. Such buffers are necessary because processors use virtual addresses while most caches use physical addresses. Usually a TLB is a highly associative or even fully-associative cache, so that there are few or no conflict misses. TLB misses are often more costly (though less frequent) than cache misses and should not be neglected. Loop transformations can increase the number of TLB misses, which can eventually decrease or cancel the expected performance improvements.

TLB misses: TLB misses for a reference can be estimated in the following way. Let R = r_0 + r_1 j_1 + ... + r_n j_n be a reference, with 0 ≤ j_i ≤ N_i. TLB capacity misses occur if there exists a loop j_{i0} for which the number of pages crossed is larger than T_S (the TLB size in number of entries), i.e., such that n_TLB = min(r_{i0}/P_S, 1) × N_{i0} > T_S, where P_S is the page size (in array elements). Then, the number of TLB misses for that reference is approximately equal to N_TLB = n_TLB × N_{i0-1} × ... × N_1 = min(r_{i0}/P_S, 1) × N_{i0} × N_{i0-1} × ... × N_1. The above technique is not accurate but is sufficiently precise for our purposes.

Influence on the optimal block size computation: The general optimization problem to be solved for obtaining the optimal block size is the following:

W(B) ≤ C_S; find B that minimizes N_cache_misses(B).

Assuming one TLB miss is α times more costly than a cache miss, the optimization problem obtained by considering TLB misses is the following:

W(B) ≤ C_S; find B that minimizes N_cache_misses(B) + α × N_TLB(B).
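The heuristic above can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the function name and the parameter values (512-element pages, a 64-entry TLB) are our assumptions.

```python
def estimate_tlb_misses(coeffs, trips, page_size, tlb_entries):
    """Estimate TLB misses for a reference R = r0 + r1*j1 + ... + rn*jn.

    coeffs[i] and trips[i] give (r, N) for one loop; loops are listed
    outermost first, so trips[:i] are the loops enclosing loop i.
    """
    for i, (r, n) in enumerate(zip(coeffs, trips)):
        # pages crossed while iterating this loop alone: min(r/P_S, 1) * N
        n_tlb = min(r / page_size, 1.0) * n
        if n_tlb > tlb_entries:
            # this loop thrashes the TLB; its misses recur once per
            # iteration of every enclosing loop
            outer = 1
            for n_out in trips[:i]:
                outer *= n_out
            return n_tlb * outer
    return 0.0

# Blocked loop of section 3.2 (loops jj2, j1, j2' from outermost to
# innermost) with N1 = 200, N2 = 20000, B = 1000: loop j1 (stride N2)
# thrashes the TLB, and its misses recur for the N2/B = 20 iterations
# of jj2, giving N1 * N2 / B = 4000 misses.
print(estimate_tlb_misses([1000, 20000, 1], [20, 200, 1000], 512, 64))
```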

3.2 Computation of the Optimal Block Size When Cross-Interferences Occur

Consider the original loop in figure 3.2a. Internal cross-interferences between A(j_2) and B(j_2) can occur, as well as external cross-interferences between C(j_2, j_1) and A(j_2) and B(j_2). Let us define δ = |(b_0 mod C_S) − (a_0 mod C_S)|. If N_2 > C_S, then capacity misses occur for both A(j_2) and B(j_2). Therefore, blocking is applied on loop j_2, as shown in figure 3.2a.

Theoretical optimal block size

Footnote 6: This assumption does not influence the complexity of the computations.

Loop 2.3.4a (KU=700, JU=500):

      I = JU-1
      DO 10 J = 0,JU-1
      DO 20 K = 0,KU-1
        LDD = -LDA(K)*U(K,J+2) - LDB(K)*U(K,J+1)
        LDA(K) = LDB(K)
        LDB(K) = LDD
        LDS(K) = LDS(K) + LDD*U(K,J)
        U(K,I) = U(K,I) + LDD*U(K,J)
        F(K,I) = F(K,I) - LDD*F(K,J)
 20   CONTINUE
 10   CONTINUE

[Plots omitted.]
Figure 2.3.4b: Estimated and simulated miss ratio (Simulated = simulated miss ratio of entire loop; Estimated = estimated miss ratio of entire loop).
Figure 2.3.4c: Execution time for 100 runs, in seconds (Real = measured loop execution time; Simulated = correlation based on simulation of all arrays; Estimated = correlation based on estimates of all arrays).
Figure 2.3.4d: Layout in cache, showing the reusable and non-reusable portions of U(k,j), U(k,j+1), U(k,j+2), F(k,j), LDA(k), LDB(k) and LDS(k) over the KU elements of each array.

Figure 2.3.4: Influence of Cache Interferences.


The window size for A(j_2) and B(j_2), i.e., the number of elements to be reused (see section 1.3.1), is equal to B. The theoretical number of cache misses for A(j_2) and B(j_2) in the blocked loop is equal to N_2/L_S. The number of cache misses for C(j_2, j_1) is equal to N_1 N_2/L_S. Consequently, the optimization problem is the following:

L_S ≤ 2B ≤ C_S; find B that minimizes 0 (here, no capacity miss occurs after blocking).

And the solution is any value within [L_S, C_S/2]. The Window strategy would pick B_th = C_S/2, because whenever two block sizes perform identically, the largest possible block size is chosen to decrease the loop overhead.

Estimated optimal block size

Internal cross-interferences: No self-interference occurs, and let us ignore, for the moment, the impact of external cross-interferences. For the sake of simplicity, we now assume δ < C_S − δ. Because ||ARS(A(j_2))|| = ||ARS(B(j_2))|| = B,

CL(ARS(A(j_2)), ARS(B(j_2)), δ) = (min((B − δ)^+, B) + min((B + δ − C_S)^+, B))/L_S = (B − δ)^+/L_S,

since B < C_S and δ < C_S − δ. So, the new optimization problem is the following:

L_S ≤ B ≤ B_th; find B that minimizes (N_1 N_2/(B L_S)) × (B − δ)^+.

At this point, the solution would be B_est^int = δ.

External cross-interferences: It can now be considered that B ≤ B_est^int, i.e., B is chosen so that no internal cross-interference occurs. Now, external cross-interferences can also influence the optimal block size. Consider reference A(j_2): its actual reuse set is an interval of B cache locations, and the corresponding actual interference set of C(j_2, j_1) is also an interval of B cache locations. Using the approximate expression of section 2.3.2, the number of external cross-interferences between A(j_2) and C(j_2, j_1) is equal to N_1 N_2 B/(C_S L_S) (S_1 = B, S_2 = B, reuse loop = j_1). The result is the same for B(j_2). Consequently, the new optimization problem is the following:

L_S ≤ B ≤ B_est^int; find B that minimizes 2 N_1 N_2 B/(C_S L_S).

At this point, the solution would be B_est^cache = L_S.

TLB misses: Now, let us consider TLB misses. A(j_2) and B(j_2) are unlikely to breed TLB misses. But reference C(j_2, j_1) corresponds to j_2 + N_2 j_1 + c_0. It can be normalized to j_2' + N_2 j_1 + jj_2 + c_0 (where j_2' = j_2 − jj_2) so that all loop indices have constant boundaries. The number of pages crossed over loop j_1 is equal to n_TLB = min(N_2/P_S, 1) × N_1. According to section 3.1, if n_TLB > T_S, TLB misses occur. Here, N_2 = 20000 (or 1000) and N_1 = 200, so n_TLB ≈ N_1 > T_S (footnote 7), and the condition is fulfilled. Therefore, the number of TLB misses is equal to N_TLB = n_TLB × N_2/B ≈ N_1 N_2/B. The new optimization problem is the following:

L_S ≤ B ≤ B_est^int; find B that minimizes 2 N_1 N_2 B/(C_S L_S) + α × min(N_2/P_S, 1) × N_1 N_2/B,

where α is the relative cost of a TLB miss (see section 3.1). And the solution is B_est = min(δ, sqrt(α C_S L_S min(N_2/P_S, 1)/2)). Here, sqrt(α C_S L_S/2) ≈ 1352 when min(N_2/P_S, 1) ≈ 1. Figure 3.2c illustrates the necessity of precisely evaluating every type of cache miss and TLB miss. Figure 3.2d shows the performance variations with the block size, for different values of δ.

Using blocking to reduce interferences: When N_2 < C_S/2, no capacity miss occurs, but blocking can still be used to remove internal cross-interferences between A(j_2) and B(j_2). The optimization problems are identical, except that B_th = N_2. As shown by the term 2 N_1 N_2 B/(C_S L_S), blocking also reduces external cross-interferences by decreasing the reuse distance. Figure 3.2e illustrates the impact of blocking for N_2 = 1000.

Footnote 7: For N_2 = 1000, N_2/P_S ≈ 1, so that n_TLB ≈ N_1.
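The closed-form trade-off of this section can be sketched numerically. The cost ratio α and the cache parameters below (C_S = 2048 elements, L_S = 4, 512-element pages) are our illustrative assumptions, not the paper's measured values.

```python
import math

def cross_interf_block(delta, cs, ls, ps, n2, alpha):
    """Block size trading external cross-interference misses (which grow
    with B) against TLB misses (which shrink with B), capped at delta so
    that no internal cross-interference occurs. Sizes in array elements."""
    # unconstrained optimum of 2*N1*N2*B/(CS*LS) + alpha*min(N2/PS,1)*N1*N2/B
    b_tlb = math.sqrt(alpha * cs * ls * min(n2 / ps, 1.0) / 2.0)
    return min(delta, b_tlb)
```

For a large δ the TLB term dominates and the solution is sqrt(α C_S L_S/2); for a small δ the internal cross-interference cap wins, e.g. cross_interf_block(100, 2048, 4, 512, 20000, 50) returns 100.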

3.3 Computation of the Optimal Block Size When Self-Interferences Occur

Consider loop 3.3a. Z(j_3, j_1) and Y(j_3, j_2) breed most memory references. The main interference misses are either cross-interferences between Z(j_3, j_1) and Y(j_3, j_2) or self-interferences of Y(j_3, j_2). Z(j_3, j_1) can at most flush B of the B² elements of the theoretical reuse set of Y(j_3, j_2), while self-interferences can prevent the reuse of all B² elements. So, self-interferences are likely to be the main cause of interference misses.

Theoretical optimal block size

References X(j_2, j_1), Y(j_3, j_2) and Z(j_3, j_1) all exhibit temporal reuse. According to the work of Eisenbeis et al. [1], the window sizes of these references are respectively 1, B² and B. The number of misses corresponding to each reference is respectively N³/(B L_S), N²/L_S and N³/(B L_S). The optimization problem is therefore (terms independent of B are ignored):

B² + B + 1 ≤ C_S; find B that minimizes 2N³/(B L_S).

And the solution is B_th = (−1 + sqrt(1 + 4(C_S − 1)))/2 ≈ sqrt(C_S).
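The capacity constraint is a simple quadratic in B, so the theoretical block size can be computed directly. The function below is our illustrative sketch; the cache size of 1024 elements in the example is an assumption.

```python
import math

def theoretical_block(cs):
    """Largest integer B whose three windows (sizes B**2, B and 1) fit in a
    cache of cs elements, i.e. B**2 + B + 1 <= cs."""
    return int((-1 + math.sqrt(1 + 4 * (cs - 1))) / 2)

# For a 1024-element cache: B_th = 31, close to sqrt(1024) = 32.
print(theoretical_block(1024))
```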

Estimated optimal block size

Self-interferences: Let us first ignore the impact of external cross-interferences and only consider self-interferences (footnote 8). The first goal is to find a block size B such that no self-interference occurs. The problem is the following:

B ≤ B_th; find B such that ||TRS(Y(j_3, j_2))|| = ||ARS(Y(j_3, j_2))||.

According to appendix A, if f_i(n_s, S, δ_s) > δ_{s-1}, then the intervals of size S in one area of size δ_{s-1} overlap with the intervals of the next area, and self-interferences occur. So the first condition is f_i(n_s, S, δ_s) ≤ δ_{s-1}, which implies S ≤ (δ_s + δ_{s-1})/(1 + δ_s δ_{s-1}/C_S). The second condition is that, within one area of size δ_{s-1}, two consecutive intervals should not overlap, i.e., S ≤ δ_s. Since here S = B, these conditions are equivalent to

B ≤ min((δ_s + δ_{s-1})/(1 + δ_s δ_{s-1}/C_S), δ_s) = B_est^self.

External cross-interferences: It is now considered that no self-interference occurs, i.e., B ≤ B_est^self. X has a negligible impact on global performance, so for the sake of simplicity, it is not considered here. According to the approximate evaluation of section 2.3.2, Y induces about N³B/(C_S L_S) misses on Z (S_1 = B, S_2 = B, reuse loop = j_2), and Z induces about N³B/(C_S L_S) misses on Y (S_1 = B², S_2 = B, reuse loop = j_1) (footnote 9). The new optimization problem is the following:

B ≤ B_est^self; find B that minimizes N³/(B L_S) + 2 N³ B/(C_S L_S).

And at this point the solution is B_est^cache = min(sqrt(C_S/2), (δ_s + δ_{s-1})/(1 + δ_s δ_{s-1}/C_S), δ_s), which is here equivalent to B_est^cache = min((δ_s + δ_{s-1})/(1 + δ_s δ_{s-1}/C_S), δ_s), i.e., external cross-interferences have a negligible impact in this example.

Footnote 8: For this paragraph, it is necessary to read appendix A. Otherwise, only the result should be considered and the paragraph should be skipped.
Footnote 9: The number of misses is not symmetrical because the reuse does not occur on the same loop level for Y and Z.


C Original Loop
      DO 1 j1=0,N1-1
      DO 1 j2=0,N2-1
        C(j2,j1) = A(j2)+B(j2)
 1    CONTINUE

C Blocked Loop
      DO 1 jj2=0,N2-1,B
      DO 1 j1=0,N1-1
      DO 1 j2=jj2,min(jj2+B-1,N2-1)
        C(j2,j1) = A(j2)+B(j2)
 1    CONTINUE

Loop 3.2a: Original and Blocked Loop (N1=250, N2=20000).

[Plots omitted; y-axis: execution time per reference, in 10^-7 seconds.]
Figure 3.2c: Influence of optimal block size computation (N2=20000), comparing B = B_th, B = L_S, B = B_est and B = B_act; B_act is the actual optimal block size.
Figure 3.2d: Influence of block size (N2=20000) for δ = 250, 2500 and 5000; a logarithmic scale is used.
Figure 3.2e: Using blocking to reduce interferences (N2 = 1000), comparing B = B_th, B = B_est and B = B_act; B_act is the actual optimal block size.
Figure 3.2f: Ping-Pong phenomena in subroutine HYD of code ADM (average memory access time for original versus shifted base addresses).

TLB misses: The linearized addresses for X, Y and Z are respectively j_2' + N j_1 + B jj_2 + x_0, j_3' + N j_2' + B jj_3 + BN jj_2 + y_0, and j_3' + N j_1 + B jj_3 + z_0 (with 0 ≤ j_2' < B, 0 ≤ j_3' < B).

- On loop j_3', no TLB capacity miss can occur.
- The number of pages crossed over loop j_2', corresponding to X, Y and Z, is equal to n_TLB = min(1/P_S, 1) × B + min(N/P_S, 1) × B ≈ B. So, if B > T_S, n_TLB ≈ B > T_S, and the number of TLB misses is approximately equal to N_TLB = n_TLB × N × N/B × N/B ≈ N³/B.
- Assume now B ≤ T_S. On loop j_1, n_TLB = min(N/P_S, 1) × N + min(N/P_S, 1) × N + min(N/P_S, 1) × N ≈ 3N. Since here N varies between 800 and 850, n_TLB > T_S. So, N_TLB = n_TLB × N/B × N/B ≈ 3N³/B².

The optimization problem is now the following:

B ≤ B_est^cache; find B that minimizes N³/(B L_S) + 2N³B/(C_S L_S) + α × (3N³/B² if B ≤ T_S; N³/B if B > T_S).

And the solution is B_est = min((δ_s + δ_{s-1})/(1 + δ_s δ_{s-1}/C_S), δ_s, T_S). For instance, for N = 800, B_est = δ_s = 32. The actual optimal block size experimentally found is B_act = 40, and the performances obtained for B_est and B_act are close (see figure 3.3c).
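The δ recursion of appendix A and the self-interference-free bound above can be sketched as follows. We assume C_S = 1024 elements (an 8-Kbyte direct-mapped cache with 8-byte elements); the sketch runs the recursion until it stabilizes and ignores the stopping conditions tied to the number of iterations, so it only reproduces the simplest cases, such as N = 800 giving B_est = 32.

```python
def delta_sequence(n, cs, max_levels=16):
    """Appendix A recursion: delta_0 = C_S, delta_1 = N mod C_S,
    delta_k = delta_{k-1} - (delta_{k-2} mod delta_{k-1})."""
    deltas = [cs, n % cs]
    if deltas[-1] == 0:
        return deltas  # degenerate: every column maps to the same lines
    for _ in range(max_levels):
        d = deltas[-1] - (deltas[-2] % deltas[-1])
        if d == deltas[-1]:  # the sequence has stabilized
            break
        deltas.append(d)
    return deltas

def self_free_block(n, cs):
    """Largest block size without self-interference:
    B <= min((delta_s + delta_{s-1}) / (1 + delta_s*delta_{s-1}/C_S), delta_s)."""
    d = delta_sequence(n, cs)
    d_s, d_s1 = d[-1], d[-2]
    return min((d_s + d_s1) / (1.0 + d_s * d_s1 / cs), d_s)

print(delta_sequence(800, 1024))   # [1024, 800, 576, 352, 128, 32]
print(self_free_block(800, 1024))  # 32.0
```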

3.4 Extension to Spatial Reuse

The techniques shown in previous sections can be extended to spatial reuse. Note that for spatial reuse, at most L_S − 1 reuses can be achieved per cache line. Examples of spatial interferences are presented below.

Using blocking to reduce spatial external cross-interferences: Consider loop 3.3e. Neither B(j_2, j_1) nor A(j_1, j_2) exhibits temporal reuse. Both exhibit spatial reuse, but the spatial reuse distance for B(j_2, j_1) is equal to 1 iteration of j_2, while the spatial reuse distance for A(j_1, j_2) is equal to N iterations of j_2. Consequently, blocking on j_2 can reduce the spatial reuse distance for A(j_1, j_2) without increasing that of B(j_2, j_1). It can be assumed that no spatial cross-interference occurs for B(j_2, j_1). For A(j_1, j_2), the reuse set is a set of N cache lines. The interference set of B(j_2, j_1) corresponding to this reuse set is an interval of N cache locations. As suggested in section 2.3, each cache line of the reuse set is considered separately. Each cache line can be reused L_S − 1 times. Using the approximate evaluation of section 2.3.2 (S_1 = L_S, S_2 = N, reuse loop = j_1 and L_S − 1 reuses), the number of spatial external cross-interference misses on A(j_1, j_2) due to B(j_2, j_1) is equal to (N/B)² × (L_S − 1) × B³/(C_S L_S) = N² B (L_S − 1)/(C_S L_S). The optimization problem is then the following:

L_S ≤ B ≤ C_S; find B that minimizes N² B (L_S − 1)/(C_S L_S).

And the solution is B_est^cache = L_S. Let us now consider TLB misses. The number of pages crossed over loop jj_2 is equal to n_TLB = min(BN/P_S, 1) × N/B + min(N/P_S, 1) × N/B ≈ N/B (in the example, 200 ≤ N ≤ 1000). So, if B < T_S, the number of TLB misses is equal to N_TLB ≈ N²/B². The new optimization problem is the following:

L_S ≤ B ≤ C_S; find B that minimizes N² B (L_S − 1)/(C_S L_S) + α × N²/B².

And the solution is B_est = (2 α C_S L_S/(L_S − 1))^{1/3} ≈ 40. The precision of the optimal block size evaluation is illustrated in figure 3.3f. Note that blocking can not only reduce cache interference misses, but also TLB capacity misses.


C Original Loop
      DO 1 j1=0,N-1
      DO 1 j2=0,N-1
        reg = X(j2,j1)
      DO 1 j3=0,N-1
        Z(j3,j1) = Z(j3,j1) + reg*Y(j3,j2)
 1    CONTINUE

C Blocked Loop
      DO 1 jj2=0,N-1,B
      DO 1 jj3=0,N-1,B
      DO 1 j1=0,N-1
      DO 1 j2=jj2,min(jj2+B-1,N-1)
        reg = X(j2,j1)
      DO 1 j3=jj3,min(jj3+B-1,N-1)
        Z(j3,j1) = Z(j3,j1) + reg*Y(j3,j2)
 1    CONTINUE

Loop 3.3a: Original and Blocked Loop (leading dimension = N, 800 ≤ N ≤ 850).

[Plots omitted; y-axis: execution time per reference, in 10^-7 seconds.]
Figure 3.3c: Influence of optimal block size computation (800 ≤ N ≤ 860), comparing B = B_th, B = B_est and B = B_act; B_act is the actual optimal block size.
Figure 3.3d: Influence of block size (N = 800: B_est = 32; N = 801: B_est = 72; N = 802: B_est = 112; N = 803: B_est = 156).

C Original Loop
      DO 1 j1=0,N-1
      DO 1 j2=0,N-1
        B(j2,j1) = A(j1,j2)
 1    CONTINUE

C Blocked Loop
      DO 1 jj1=0,N-1,B
      DO 1 jj2=0,N-1,B
      DO 1 j1=jj1,min(jj1+B-1,N-1)
      DO 1 j2=jj2,min(jj2+B-1,N-1)
        B(j2,j1) = A(j1,j2)
 1    CONTINUE

Loop 3.3e: Original and Blocked Loop.

Figure 3.3f: Using blocking to reduce spatial interferences and TLB misses in loop 3.3e (N = 200, 300, 400, 500, 1000).


Ping-Pong: Ping-pong corresponds to spatial internal cross-interferences. Normally, the probability that two data map to the same cache line is equal to L_S/C_S. So, such phenomena have a very low probability of occurring, but the associated performance degradation might be very difficult to trace back in a program. Therefore they should be detected at compile-time or run-time. For instance, in subroutine HYD (and other subroutines) of the Perfect [16] benchmark ADM, ping-pong phenomena were detected (footnote 10). The average memory access time was divided by 2, simply by slightly varying the base addresses of two arrays P and PS in subroutine HYD (see figure 3.2f) (footnote 11). Spatial self-interferences can also frequently occur because of the widespread use of 2^p as a matrix dimension. This is especially critical for blocked algorithms where the matrix dimension corresponds to the block stride, i.e., the distance between two blocks.
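The ping-pong phenomenon is easy to reproduce with a toy direct-mapped cache simulator. The parameters below (256 lines of 4 elements, i.e., an 8-Kbyte cache with 8-byte elements) mirror the configuration of footnote 11, but the code itself is our illustration.

```python
def count_misses(base_a, base_b, n, num_lines, line_size):
    """Misses in a direct-mapped cache when A(i) and B(i) are read
    alternately, as in C(i) = A(i) + B(i). Addresses in array elements."""
    tags = [None] * num_lines
    misses = 0
    for i in range(n):
        for addr in (base_a + i, base_b + i):
            line = addr // line_size
            idx = line % num_lines          # direct-mapped placement
            if tags[idx] != line:           # cold or conflict miss
                tags[idx] = line
                misses += 1
    return misses

# Bases exactly one cache size apart: the two arrays evict each other on
# every access (ping-pong), so all 2n accesses miss.
print(count_misses(0, 1024, 100, 256, 4))   # 200
# Shifting one base by a single cache line removes the conflict: only the
# 2 * n/line_size cold misses remain.
print(count_misses(0, 1028, 100, 256, 4))   # 50
```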

3.5 Further Issues in the Evaluation of Cache Interferences

3.5.1 Selecting the proper dependence

Assume that in loop 3.2a, reference B(j_2) is replaced with reference A(j_2 + 1). There is a self-dependence for A(j_2) and a group-dependence from A(j_2) to A(j_2 + 1). The reuse distance of the group-dependence (1 iteration of j_2) is much shorter than that of the self-dependence (N_2 iterations of j_2). So, for most elements (excluding A(0)), reference A(j_2) exploits the group-dependence reuse rather than the self-dependence. Therefore, interferences must be considered on the group-dependence instead of the self-dependence. Interferences are evaluated for the dependence corresponding to the smallest dependence distance. In general, array subscripts (i.e., dependences) are simple enough to make the selection of the proper dependence a tractable task.

3.5.2 Cumulating interference sets

It is not possible to simply add the number of cache misses corresponding to each reference and each type of interference, because some interferences can be redundant. Assume that in loop 3.2a, reference B(j_2) is replaced with reference D(j_2, j_1). C(j_2, j_1) and D(j_2, j_1) can both induce external cross-interferences on A(j_2). The relative cache distance between the actual interference sets of these two references is equal to δ_CD = |(c_0 mod C_S) − (d_0 mod C_S)|. If δ_CD < N_2, i.e., if the two actual interference sets overlap, then when they intersect with the actual reuse set of A, C and D can have a redundant impact on A, and cache misses should not be counted twice. In order to avoid such redundancy, interference sets should not be considered individually:

Definition 3.1: Two references R and R' belong to the same translation group if the relative cache distance between R and R' is constant.

The actual interference set of a translation group is the union of the actual interference sets of all references within this translation group. Note that, in general, there are few different translation groups within a loop body.

3.5.3 Redundancies

Globally, most redundancies are avoided for the following reasons. First, determining cross-interferences on the actual reuse set instead of the theoretical reuse set avoids redundant evaluation between self-interferences and cross-interferences. Second, determining internal cross-interferences before evaluating

Footnote 10: Ping-pong recurs on each run because the arrays are generally assigned consecutively in memory, so that their relative base address remains constant across runs.
Footnote 11: These observations were obtained with a cache simulator for an 8-kbyte direct-mapped cache with a 32-byte cache line size, the same parameters as the DEC Alpha [17] and MIPS R4000 [18].


external cross-interferences and then updating the actual reuse set avoids redundant evaluation between internal cross-interferences and external cross-interferences. Third, redundancies within external cross-interferences are ignored, which proved to be a reasonable approximation (footnote 12).

3.6 Global Algorithm

The global algorithm used to compute the optimal block size in a loop is shown in figure 3.6. An optimal algorithm would simply compute all the expressions of the different types of cache misses and TLB misses, and then solve the optimization problem. However, the optimization problem is then too complex to solve. So, as shown in sections 3.2, 3.3 and 3.4, we propose a heuristic which first aims at eliminating self-interferences and internal cross-interferences, and then solves the optimization problem between external cross-interferences and TLB misses. As a consequence, since the optimization problem is solved with fewer constraints, it can be argued that the estimate of the optimal block size can sometimes be an underestimate. However, it appeared that small block sizes seldom induce poor performance.

Footnote 12: It is possible, though, to come up with examples where redundancies between external cross-interferences are significant.


Compute the number of cache capacity misses N_capacity.
Find B_th, the solution of the theoretical optimization problem.
For each reference R in the loop body:
    Determine the shortest dependence distance.
    Determine the reuse set and the actual reuse set.
    Compute the number of cache misses due to self-interferences.
    B_est^self(R) = maximal block size (L_S ≤ B ≤ B_th) without self-interferences.
    For the translation group of reference R:
        Determine the interference set and the actual interference set.
        Compute the number of cache misses due to internal cross-interferences.
        Update the actual reuse set of R by removing elements victim of cross-interferences.
        B_est^int(R) = maximal block size (L_S ≤ B ≤ min(B_th, B_est^self(R))) without internal cross-interferences.
    Compute the number of cache misses due to external cross-interferences:
    For each other translation group:
        Determine the interference set and the actual interference set.
        Compute the total number of external cross-interference misses for reference R: N_interf^ext(R).
Compute the total number of external cross-interferences: N_interf^ext = Σ_R N_interf^ext(R).
Compute the number of TLB capacity misses N_TLB.
Find B_est (L_S ≤ B ≤ min(B_th, B_est^self, B_est^int)) that minimizes N_capacity + N_interf^ext + α × N_TLB.
B_est is the estimated optimal block size.

Figure 3.6: Global Algorithm.

Implementation: As of now, the global algorithm has not been implemented. However, several pieces of code for computing the number of self-interferences, internal cross-interferences and external cross-interferences have been written. Extensive testing is currently limited by the fact that these techniques are not implemented in a compiler. Possible target compilers are Sage++ [19], ParaScope [20] or the Tiny/Omega Calculator [21]. Ideally, the implementation should be combined with a stable data locality optimizer. Few such tools are currently available.

4 Conclusions and Future Directions

Optimal block size computation, as performed up to now, often yields poor performance, and therefore restricts the use of blocking techniques in production compilers. In order to obtain near-optimal performance, it was shown that an accurate evaluation of the optimal block size is necessary. This computation requires an accurate evaluation of cache interference misses. Conversely, it was shown that interferences can severely affect the performance of a loop even when all data to be reused fit in cache, and that blocking can then be used to reduce the number of interferences and significantly improve performance by decreasing the reuse distance.

Several future directions of research must be explored. First, the integration in a compiler of the techniques for evaluating interference misses and computing the optimal block size. Second, the extension of these techniques to non-constant dependence distances. Third, combining optimal blocking and selective copying in a single strategy, in order to minimize cache interferences both in tilable and non-tilable loop nests.

References

[1] Christine Eisenbeis, William Jalby, Daniel Windheiser, and Francois Bodin. A Strategy for Array Management in Local Memory. In Proceedings of the Third Workshop on Programming Languages and Compilers for Parallel Computing, 1990.
[2] S. Carr, K. S. McKinley, and C. Tseng. Compiler Optimizations for Improving Data Locality. In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, October 1994.
[3] Michael Wolf and Monica Lam. A Data Locality Optimizing Algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, volume 26(6), pages 30–44, June 1991.
[4] W. Li and K. Pingali. Access Normalization: Loop Restructuring for NUMA Compilers. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 285–295, October 1992.
[5] Monica Lam, Edward E. Rothberg, and Michael E. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
[6] Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash. On Estimating and Enhancing Cache Effectiveness (Extended Abstract). In Proceedings of the Fourth Workshop on Programming Languages and Compilers for Parallel Computing, 1991.
[7] Mark Hill. Aspects of Cache Memory and Instruction Buffer Performance. PhD thesis, University of California, Berkeley, 1987.
[8] O. Temam, C. Fricker, and W. Jalby. Cache Interference Phenomena. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1994.
[9] Stephanie Coleman and Kathryn S. McKinley. Tile Size Selection Using Cache Organization and Data Layout. In Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, 1995.
[10] K. S. McKinley and O. Temam. A Quantitative Analysis of Loop Nest Locality. In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, October 1996.
[11] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 1990.
[12] Dennis Gannon, William Jalby, and Kyle Gallivan. Strategies for Cache and Local Memory Management by Global Program Transformation. Journal of Parallel and Distributed Computing, 5(5):587–616, October 1988.
[13] Zhiyu Shen, Zhiyuan Li, and Pen-Chung Yew. An Empirical Study on Array Subscripts and Data Dependencies. Technical Report 840, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, August 1989.
[14] O. Temam, C. Fricker, and W. Jalby. Impact of Cache Interferences on Usual Numerical Dense Loop Nests. Proceedings of the IEEE, special issue on computer performance evaluation, 1993.
[15] O. Temam, C. Fricker, and W. Jalby. Evaluating the Impact of Cache Interferences on Numerical Codes. In Proceedings of the International Conference on Parallel Processing, 1993.
[16] George Cybenko, Lyle Kipp, Lynn Pointer, and David Kuck. Supercomputing Performance Evaluation and the Perfect Benchmarks. In Proceedings of the IEEE Supercomputing '90 Conference, pages 254–266, 1990.
[17] Digital Equipment Corporation, Maynard, Massachusetts. Alpha 21164 Microprocessor Hardware Reference Manual, 1994.
[18] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, 1992.
[19] D. Gannon et al. SIGMA II: A Tool Kit for Building Parallelizing Compiler and Performance Analysis Systems. Technical report, Indiana University, 1992.
[20] K. Cooper, M. W. Hall, R. T. Hood, K. Kennedy, K. S. McKinley, J. M. Mellor-Crummey, L. Torczon, and S. K. Warren. The ParaScope Parallel Programming Environment. Proceedings of the IEEE, 81(2):244–263, February 1993.
[21] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The Omega Calculator, version 0.8. Technical report, University of Maryland, August 1994.
[22] R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1989.


A Evaluating the Actual Sets

Consider a reference R = r_0 + r_1 j_1 + ... + r_n j_n, of which we want to compute the cache footprint on loop level l (i.e., the footprint is defined by loops n, n−1, ..., l+1).

Observation A.1: After iterating on any loop i, l < i ≤ n, the data layout in cache is assumed to be a set of intervals of cache locations, characterized by the average size S of an interval and the average distance δ between two consecutive intervals.

Consequently, a recursive process can be applied for each loop level i, l < i ≤ n.

Problem A.1: Assuming an initial regular data layout of intervals of size S separated by a distance of δ cache locations (footnote 13), a layout obtained after iterating on loops n, n−1, ..., i+1, the problem is to determine the final layout, i.e., the average size S' and the average distance δ', after N iterations of loop i.

A.1 Actual interference set

For problem A.1, a recursive process can also be applied. Let δ_0 = C_S and δ_1 = δ. After a number of iterations, the intervals wrap around the cache area of size δ_0. Within each area of δ_1 cache locations, the layout of intervals is identical. So, only one such area of size δ_1 needs to be studied. Within one such area, the spacing between two consecutive intervals is equal to δ_2 = δ_1 − (δ_0 mod δ_1). Let us consider a recursive application of this process.

Observation A.2: On recursion level k, the size of the area considered is equal to δ_{k-1}, and the spacing between two consecutive intervals in one such area is equal to δ_k. The recursion process stops at a level s when all iterations of loop i have been considered, or when overlapping occurs, i.e., if δ_s < S. Starting at this point, it is possible to determine the footprint within an area of size δ_{s-1} and, assuming the layout is regular, the footprint within the whole cache.

Proposition A.1: Let N = n_s ⌊δ_0/δ_{s-1}⌋ + r (with r = N mod ⌊δ_0/δ_{s-1}⌋). In ⌊δ_0/δ_{s-1}⌋ − r areas of size δ_{s-1}, n_s intervals of size S have been brought, and in the remaining r areas n_s + 1 intervals have been brought.

Theorem A.1: Let f_i(x, S, δ_s) = S + (x − 1)^+ δ_s (footnote 14). Then the cache footprint size of all intervals is equal to

max((⌊δ_0/δ_{s-1}⌋ − r) × f_i(n_s, S, δ_s) + r × f_i(n_s + 1, S, δ_s), 0).

For this layout of intervals, the average size of an interval is equal to

S' = max((⌊δ_0/δ_{s-1}⌋ − r) × f_i(n_s, S, δ_s) + r × f_i(n_s + 1, S, δ_s), 0) × δ_{s-1}/δ_0,

and the spacing between intervals is equal to δ' = δ_{s-1}.

Remark: δ_k = δ_{k-1} − (δ_{k-2} mod δ_{k-1}), so that all δ_k can be computed a priori with the cost of one application of Euclid's algorithm [22]. In practice, this fact also makes the process non-recursive, since the level at which the recursion stops can be determined beforehand. After the process has been applied for all loops i, l < i ≤ n, the resulting layout in cache is the actual interference set of the reference, defined on loop l.

Footnote 13: I.e., the distance between any two consecutive intervals is constant.
Footnote 14: x^+ = max(x, 0).


A.2 Actual reuse set

The reasoning is nearly identical for the actual reuse set, except that cache lines where overlapping occurs must be excluded.

Theorem A.2: Let

f_r(x, S, δ_s, δ_{s-1}) =
    S,                                                                    if x = 1;
    S + 2δ_s + (x − 2)^+ (2δ_s − S)^+,                                    if f_i(x) < δ_{s-1};
    max(0, S + 2δ_s + (x − 2)^+ (2δ_s − S)^+ − (f_i(x) − δ_{s-1})),       if f_i(x) ≥ δ_{s-1}.

The number of elements that can be reused is equal to

max((⌊δ_0/δ_{s-1}⌋ − r) × f_r(n_s, S, δ_s, δ_{s-1}) + r × f_r(n_s + 1, S, δ_s, δ_{s-1}), 0).

An example of application of the whole process can be found in section 3.3. Note that the process is inaccurate because the layout of intervals is rarely perfectly regular, so the error can become significant after multiple applications (i.e., for multiple loop levels). Fortunately, a reuse set defined over two loop levels corresponds to a 3-deep loop nest where reuse occurs on the third loop. For such a reuse set, the algorithm only needs to be applied once (except if the access stride is not equal to one). Reuse sets defined over three loops (i.e., 4-deep loop nests) are less common in primitives and real codes [13], or are less likely to be exploited (for instance, loop blocking is rarely performed over more than the two innermost loops).
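Theorem A.1 translates directly into code. The following is a sketch with our own function names; it computes the footprint for a given recursion level, taking δ_0, δ_s and δ_{s-1} as inputs.

```python
def f_i(x, size, d_s):
    """Footprint of x intervals of size `size` spaced d_s apart."""
    return size + max(x - 1, 0) * d_s

def footprint(n_intervals, size, d0, d_s, d_s1):
    """Theorem A.1: the intervals fall into floor(d0/d_s1) areas of size
    d_s1, n_s per area, plus one extra interval in r of the areas."""
    areas = d0 // d_s1
    n_s, r = divmod(n_intervals, areas)
    return max((areas - r) * f_i(n_s, size, d_s)
               + r * f_i(n_s + 1, size, d_s), 0)

# Single area (d_s1 = d0 = 1024): 10 intervals of 4 elements spaced 32
# apart cover 4 + 9*32 = 292 cache locations.
print(footprint(10, 4, 1024, 32, 1024))  # 292
```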
