An Accurate Cost Model for Guiding Data Locality Transformations

XAVIER VERA, Mälardalens Högskola
and
JAUME ABELLA, JOSEP LLOSA, and ANTONIO GONZÁLEZ, Universitat Politècnica de Catalunya-Barcelona
Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality. The performance of the memory hierarchy can be improved by means of data and loop transformations. Tiling is a loop transformation that aims at reducing capacity misses by shortening the reuse distance. Padding is a data layout transformation targeted to reduce conflict misses. This article presents an accurate cost model that describes misses across different hierarchy levels and considers the effects of other hardware components such as branch predictors. The cost model drives the application of tiling and padding transformations. We combine the cost model with a genetic algorithm to compute the tile and pad factors that enhance the program performance. To validate our strategy, we ran experiments for a set of benchmarks on a large set of modern architectures. Our results show that this scheme is useful to optimize programs' performance. When compared to previous approaches, we observe that with a reasonable compile-time overhead, our approach gives significant performance improvements for all studied kernels on all architectures.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Modeling techniques; D.3.4 [Programming Languages]: Processors—Compilers; Optimizations

General Terms: Languages, Performance

Additional Key Words and Phrases: Cache memories, tiling, padding, genetic algorithms
1. INTRODUCTION

With ever-increasing clock rates and the use of new architectural features, processor speeds increase dramatically every year. Unfortunately, memory
latency does not decrease at the same pace, which makes it a key obstacle to achieving high IPC (instructions per cycle). The basic solution that almost all systems rely on is the cache hierarchy. While caches are useful, they are effective only when programs exhibit sufficient data locality in their memory accesses. Numerical applications tend to operate on large data sets and usually present a large amount of reuse. However, this reuse may not translate into locality, since caches can only hold a small fraction of the data accessed.

1.1 Cache Compiler Optimizations

Memory is organized hierarchically in such a way that the lower levels are smaller and faster. In order to fully exploit the memory hierarchy, one has to ensure that most memory references are handled by the lowest levels of the cache hierarchy. Programmers spend a significant amount of time improving locality, which is tedious and error prone. Various hardware and software approaches have been proposed lately for increasing the effectiveness of the memory hierarchy. Software-controlled prefetching [Mowry et al. 1992] hides the memory latency by overlapping a memory access with computation and other accesses. Compilers apply useful loop transformations such as tiling [Carr and Kennedy 1992; Coleman and McKinley 1995; Lam et al. 1991; Wolf and Lam 1991] and data layout transformations [Chatterjee et al. 1999; Kandemir et al. 1999; Rivera and Tseng 1998a, 1999a; Temam et al. 1993]. In all cases, a fast and accurate assessment of a program's cache behavior at compile time is needed to make an appropriate choice of transformation parameters.

Unfortunately, cache memory behavior is very hard to predict. Simulators describe it accurately, but they are very slow and provide little insight into the causes of the misses. Thus, current approaches are based on simple models (heuristics) for estimating locality [Carr et al. 1994; Coleman and McKinley 1995; Lam et al. 1991; Rivera and Tseng 1998a, 1999a]. However, modern architectures have a very complex internal organization, with different levels of cache, branch predictors, etc. Such models provide very rough performance estimates and, in practice, are too simplistic to statically select the best optimizations.

Tiling has been shown to be useful for many algorithms in linear algebra. By restructuring the loop and changing the order in which memory references are executed, it reuses data in the faster levels of the hierarchy, thus reducing the average latency. Nevertheless, finding the optimal tile sizes is a very complex task: the solution space is huge, and exploring all possible solutions is infeasible.

Padding has a significant potential to remove conflict misses. In fact, it can remove most conflict misses by changing the addresses of conflicting data, and some compulsory misses by aligning data with cache lines. However, choosing the optimal data layout is an NP-hard problem [Petrank and Rawitz 2002]. A number of algorithms have been proposed that are based on simple cost models considering only the first-level cache [Rivera and Tseng 1998a].

We introduce an approach to drive program transformations oriented to enhance data locality. The centerpiece of the proposed method is an accurate cost model combined with a genetic algorithm. In particular, we improve the order
of memory accesses via tiling, whereas the conflict misses that tiling cannot eliminate are removed via padding. Moreover, our method chooses the best tile and pad factors at the same time. It makes use of a very precise cost model that allows us to consider all the different levels of the memory hierarchy. Furthermore, we consider the performance cost of mispredicted branches. We present results for a collection of kernels that exhibit a large number of misses. For the sake of concreteness, we report results for a set of modern processors that represent current architectural paradigms: the Pentium-4 (CISC), the Alpha-21264 and UltraSparc-III (RISC), and the Itanium (EPIC). For the Itanium, we do not consider the 16KB L1 cache since it only holds integers.

1.2 An Overview

This article proposes a new method to perform loop tiling combined with padding for numeric codes. We use a static data cache analysis that considers the different levels of cache. Moreover, it considers the cost of branch instructions according to the outcome of the branch predictor. We first describe data reuse using the well-known concept of reuse vectors [Wolf and Lam 1991]. We implemented Ghosh et al.'s [1999] Cache Miss Equations (CMEs) to compute the locality of a program, extending their applicability to deal with multilevel caches. This allows us to have a precise model that describes cache memory behavior across different levels.

Once information about read and write misses for all the different levels is obtained, we set up the cost model function. By considering the relative costs of each memory level, as well as the cost of mispredicted branches, the cost function is tuned for improving execution time. Finally, we use a genetic algorithm (GA) that traverses the solution space in order to determine all tile and pad factors at the same time, thus giving equal importance to both transformations.

Figure 1 depicts an overview of the approach described in this article. We have implemented our system in the SUIF2 compiler. We use SUIF2 to identify high-level information (such as array accesses and loop constructs), which is used to model the cache behavior. The GA generates different possible combinations of tile and pad factors, which are analyzed by the cost model function. Finally, the best parameters are fed back and used to generate the optimized code.

Fig. 1. Optimizing framework.
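To make the flow of Figure 1 concrete, here is a minimal sketch of the search loop in C. Everything named here (Candidate, the latency table, the dummy estimate_* and ga_* bodies) is hypothetical scaffolding, not the paper's implementation; in the real system the miss counts come from the CMEs and the GA operators are those described in Section 4.

```c
#include <stdlib.h>

#define LEVELS 3    /* cache levels modeled (illustrative)   */
#define POP    32   /* GA population size (illustrative)     */
#define GENS   100  /* GA generations (illustrative)         */

typedef struct {
    int tile[2];      /* tile factors T1, T2                  */
    int pad_base[4];  /* intervariable pads, one per array    */
    int pad_dim[4];   /* intravariable pads, one per array    */
} Candidate;

/* Dummy stand-ins for the CME-based analysis: the real system derives
 * per-level misses and branch mispredictions from the equations. */
static void estimate_misses(const Candidate *c, double m[LEVELS])
{
    for (int l = 0; l < LEVELS; l++)
        m[l] = 1e6 / (1.0 + c->tile[0] + c->tile[1] + l);  /* placeholder */
}
static double estimate_mispredictions(const Candidate *c)
{
    return 1e4 / (1.0 + c->tile[0]);                       /* placeholder */
}

/* Cost: misses weighted by each level's miss latency, plus a branch
 * misprediction penalty.  Latencies are per-machine placeholders. */
static double cost(const Candidate *c)
{
    static const double latency[LEVELS] = { 10, 30, 200 }, bp = 10;
    double m[LEVELS], t = 0.0;
    estimate_misses(c, m);
    for (int l = 0; l < LEVELS; l++)
        t += m[l] * latency[l];
    return t + estimate_mispredictions(c) * bp;
}

/* Dummy GA operators: random initialization, and "evolution" that just
 * re-randomizes instead of real selection/crossover/mutation. */
static void ga_init(Candidate p[POP])
{
    for (int i = 0; i < POP; i++) {
        p[i].tile[0] = 1 + rand() % 128;
        p[i].tile[1] = 1 + rand() % 128;
        for (int a = 0; a < 4; a++) {
            p[i].pad_base[a] = rand() % 64;
            p[i].pad_dim[a]  = rand() % 8;
        }
    }
}
static void ga_evolve(Candidate p[POP], const double f[POP])
{
    (void)f;
    ga_init(p);
}

/* Search loop of Figure 1: the GA proposes tile/pad factors, the cost
 * model scores them, and the best candidate is fed back to codegen. */
Candidate search_tile_and_pad(void)
{
    Candidate pop[POP], best = {0};
    double best_cost = 1e300;

    ga_init(pop);
    for (int g = 0; g < GENS; g++) {
        double fit[POP];
        for (int i = 0; i < POP; i++) {
            fit[i] = cost(&pop[i]);
            if (fit[i] < best_cost) { best_cost = fit[i]; best = pop[i]; }
        }
        ga_evolve(pop, fit);
    }
    return best;
}
```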
The rest of the article is organized as follows: Section 2 reviews our method for describing data locality and introduces the tiling and padding techniques. Section 3 describes our cost model for estimating performance. Section 4 explains in detail our decisions in implementing the genetic algorithm. Section 5 presents the experimental framework, and Section 6 compares our results against state-of-the-art techniques. Section 7 discusses related work that aims at optimizing cache behavior. Finally, we conclude and outline a road map for future extensions in Section 8.

2. IMPROVING LOCALITY

In this section, we review the concepts of data reuse and data locality. We first discuss some important concepts related to data cache analysis and how we model different cache levels. Then, we introduce loop tiling and padding, and we explain how they can be used to improve data locality.

Understanding data reuse is essential to predicting cache behavior. Reuse happens whenever the same data item is referenced multiple times. This reuse results in locality if it is actually realized; reuse will result in a cache hit if no intervening reference flushes out the datum. Given that, a static data cache analysis can be split into the following steps:

(1) Reuse Analysis describes the intrinsic data reuse among all different memory references. (We use memory reference to denote a static read or write in the program; a particular execution of that read or write at run time is a memory access.)

(2) Data Locality Analysis describes the subset of reuses that actually results in locality.

We describe each step in detail in the following sections.

2.1 Reuse Analysis

In order to describe data reuse, we use the well-known concept of reuse vectors [Wolf and Lam 1991], which provide a mechanism for summarizing repeated memory accesses within a perfect loop nest. Determining all iterations that use the same data is extremely expensive. Thus, we use a concrete mathematical representation that describes the direction as well as the distance of the reuse in a methodical way. The shape of the set of iterations that uses the same data is represented by a reuse vector space [Wolf and Lam 1991], using the reuse vectors as a basis of that space. Whereas self reuse (both spatial and temporal) and group-temporal reuse are computed in an exact way, group-spatial reuse is only considered among uniformly generated references (UGRs), that is, references whose array subscript expressions differ at most in the constant term [Gannon et al. 1988].

Figure 2(b) presents the reuse vectors for the references in our running example, shown in Figure 2(a). The reference a(i, j) (W) may reuse the same datum (hence, temporal reuse) that a(i, j) (R) (hence, group reuse) accesses in the same iteration. Reference c(k, j) is associated with the self-spatial reuse vector (0, 0, 1), since it may reuse the same cache line (thus, spatial reuse) that it accessed one iteration earlier in the innermost loop. The other reuse vectors can be understood in a similar way.
Fig. 2. The IJK matrix multiplication and the reuse vectors that describe its reuse.
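Figure 2(a) itself is not reproduced in this text, so the sketch below gives a C rendering of the IJK matrix multiplication running example, with the reuse vectors reported in Figure 2(b) recalled in comments. Note that the paper's example uses Fortran-style column-major arrays; in this row-major C version, the self-spatial reuse along the innermost k loop belongs to b[i][k] rather than c(k, j).

```c
#define N 512

double a[N][N], b[N][N], c[N][N];

/* IJK matrix multiplication (the running example of Figure 2(a)).
 * Reuse vectors from Figure 2(b), stated for the paper's column-major
 * layout:
 *   a(i,j) (W): group-temporal reuse from a(i,j) (R), same iteration
 *   c(k,j)    : self-spatial reuse vector (0,0,1), innermost loop
 * In this row-major C rendering, b[i][k] is the reference that walks
 * contiguous memory in the innermost k loop. */
void matmul_ijk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                a[i][j] = a[i][j] + b[i][k] * c[k][j];
}
```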
If a reuse vector is not present, we assume there is no reuse and, hence, no locality.

2.2 Data Locality Analysis

Data locality is the subset of reuse that is realized, that is, reuse where the subsequent use of the data results in a hit in the considered cache level. To discover whether a reuse translates into locality, we need to know all the data brought into the cache between the two accesses (which implies knowledge of loop bounds and memory access addresses) and the particular cache architecture we are analyzing.

CMEs [Ghosh et al. 1999] are mathematical formulas that provide a precise characterization of the cache behavior of set-associative caches with LRU replacement (we plan to incorporate other replacement policies into our model in the future). They consider perfect nested loops consisting of straight-line assignments. Based on the description of reuse given by the reuse vectors, equations are set up that describe those iteration points where the reuse is not realized. Solving them gives information about the number of misses and where they occur. Even though generating the equations is linear in the number of references, solving them can be very time consuming. We use our previous work, a probabilistic method based on sampling [Vera et al. 2000], to solve the equations in a fast and accurate way. Moreover, we use our own polyhedra representation [Bermudo et al. 2000] to further optimize the process of obtaining the number of cache misses from the equations. For results showing the accuracy of our method in predicting cache misses for different cache architectures and real processors, we refer the interested reader to our previous work [Vera et al. 2004].

2.2.1 CMEs for Multi-level Caches. The increasing performance mismatch between memory and processor speeds has required an increased number of cache levels. For instance, the Itanium has three levels of cache. Thus, an accurate model for predicting cache behavior must give a quantitative measurement of cache misses for all levels of cache.
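Anticipating the filtering scheme developed in the rest of this section (see Figure 3 below), here is a minimal sketch of a multilevel analysis in which each level's equations act as a filter: only the accesses that miss at level l are examined at level l + 1. The misses_at_level predicate is a dummy stand-in for solving the level-l CMEs.

```c
#define LEVELS 3   /* illustrative: an Itanium-like 3-level hierarchy */

/* Stand-in for solving the level-l CMEs for one access: returns 1 if
 * the access misses at level l.  The dummy body is arbitrary; in the
 * real analyzer this answer comes from the equations for level l. */
static int misses_at_level(int level, unsigned long iter_point)
{
    return (iter_point >> level) & 1;   /* dummy predicate */
}

/* Count misses per level, treating each level as a filter: an access
 * is analyzed at level l+1 only if it missed at levels 0..l. */
void analyze(const unsigned long *pts, int n, long misses[LEVELS])
{
    for (int l = 0; l < LEVELS; l++)
        misses[l] = 0;

    for (int i = 0; i < n; i++) {
        for (int l = 0; l < LEVELS; l++) {
            if (!misses_at_level(l, pts[i]))
                break;        /* hit at level l: deeper levels unaffected */
            misses[l]++;      /* miss: falls through to level l+1 */
        }
    }
}
```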
A reduced number of misses in the first-level cache may not translate into a reduced execution time due to the presence of other cache levels, which could have high miss ratios. Therefore, we have extended the CMEs to describe cache behavior for modern architectures. Given a memory reference, the equations investigate whether the reuse described by its reuse vectors is realized or not. We now discuss how we extend the analysis to memory hierarchies with more than one level.

For these architectures, we have to analyze memory references differently depending on the cache level they are accessing. For that purpose, a set of equations that precisely describes the relationship among the iteration space, array sizes, and cache parameters is set up for each cache level. Figure 3 shows our approach for a 3-level cache memory hierarchy. When analyzing potential cache set contentions, only memory accesses that miss in lower cache levels are considered. Thus, we can see the equations for each level as filters, where only those memory accesses that miss are analyzed at further levels.

Fig. 3. Our approach for analyzing multilevel caches.

2.3 Tiling and Padding Overview

In addition to the hardware organization, the performance of the memory hierarchy is very sensitive to the particular memory reference patterns of each program. In order to enhance locality, transformations that do not alter the semantics of the program attempt to modify the order in which computations are performed, or simply change the data layout. In this section, we review the transformations implemented in our experimental compiler.

2.3.1 Tiling. Loop tiling combines strip-mining with loop interchange to increase the effectiveness of the memory hierarchy. Figure 2(a) shows the code for the NxN matrix multiplication kernel, which we use as our running example. Loop tiling basically consists of two steps [Wolf and Lam 1991]. The first restructures the code to enable tiling of those loops that carry reuse. The second selects the tile factors that maximize locality; it is this latter step that is sensitive to the characteristics of the cache memory considered. Due to hardware constraints, caches have limited associativity, which may cause cache lines to be flushed out of the cache before they are reused, despite sufficient capacity in the overall cache. We present the tiled version, with tile sizes T1 and T2, in Figure 4(a).
Fig. 4. Matrix multiply algorithm after applying tiling and padding.
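Since Figure 4(a) is not reproduced here, the following sketch shows one plausible shape for the tiled kernel, assuming the j and k loops are tiled with factors T1 and T2; the loops actually tiled in the paper's figure may differ.

```c
#define N  512
#define T1 64   /* tile factor for the j loop (illustrative) */
#define T2 64   /* tile factor for the k loop (illustrative) */

double a[N][N], b[N][N], c[N][N];

static int min(int x, int y) { return x < y ? x : y; }

/* Tiled IJK matrix multiplication: jj/kk step over tiles, and the inner
 * loops stay within one T1 x T2 tile, so the data touched by a tile can
 * remain cached across its reuses.  Assumes a[][] holds the initial
 * accumulation values. */
void matmul_tiled(void)
{
    for (int jj = 0; jj < N; jj += T1)
        for (int kk = 0; kk < N; kk += T2)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + T1, N); j++)
                    for (int k = kk; k < min(kk + T2, N); k++)
                        a[i][j] += b[i][k] * c[k][j];
}
```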
Fig. 5. Example of tiled iteration space.
2.3.1.1 Implementing Tiling. CMEs are defined over convex iteration spaces [Ghosh et al. 1999]. However, the tiled iteration space is not convex. More formally, the iteration space obtained after tiling n dimensions can be expressed as the union of 2^n convex regions. We illustrate this situation in Figure 5. Figure 5(a) shows how a 1-dimensional iteration space becomes a two-convex-region iteration space (see
Figure 5(b)) after tiling (T = 3). The shaded regions correspond to the different convex regions before and after tiling. A naive way to overcome this problem is to use only one convex region that approximates the actual nonconvex region. This convex region can be the smallest parallelepiped that includes all the convex regions (see Figure 5(c)) or, alternatively, the region that does not include the last iteration of every tiled loop when the tile size does not divide the upper bound (see Figure 5(d)). Nevertheless, neither option is accurate: the first includes points outside the iteration space, whereas the second excludes points belonging to the iteration space.

In our aim of having an accurate model, we decided to implement the exact solution. We have modified the CMEs to deal with multiple convex regions by defining a set of equations for every convex region. When analyzing an iteration point, we use the equations corresponding to the convex region that contains it. Let m be the number of convex regions of a loop nest after tiling. Compulsory equations are defined for each convex region, so the number of compulsory equations is increased by a factor of m. For each reuse vector, we generate a set of replacement equations for each convex region [Ghosh et al. 1999]. In addition, we generate a set of equations for every pair of convex regions to reflect the potential reuse between different regions. Thus, the number of replacement equations is increased by a factor of m².

2.3.2 Padding. Unlike loop tiling, padding modifies the data layout to eliminate conflict misses. It changes the data layout in two different ways: intervariable padding modifies the base addresses of the arrays, whereas intravariable padding changes the sizes of array dimensions. We refer to the L1 (primary) cache size as C_s. mem_i is the original base address of variable number i (Var_i), and PBase_i stands for the intervariable padding between Var_i and Var_{i-1}. dim_ij stands for the size of dimension j of Var_i (D_i is the number of dimensions), and S_i is its size. PDim_ij is the intravariable padding applied to dim_ij, and PS_i is the size of Var_i after padding (see Figure 6). We define Δ_i as PS_i − S_i.

2.3.2.1 Intervariable Padding. When intervariable padding is applied, only the base addresses of the variables are changed; thus, padding is performed in a simple way. Memory variable base addresses are initially defined using the values given by the compiler. Then, for each memory variable Var_i, i = 0 ... k, we define a variable PBase_i with 0 ≤ PBase_i ≤ C_s − 1. Note that padding a variable results in modifying the initial addresses of the other variables (see Figure 6). Thus, after padding, the memory variable base addresses are computed as follows:

$$BaseAddr(Var_i) = mem_i + \sum_{k=0}^{k \leq i} PBase_k$$
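A direct transcription of this base-address computation, assuming hypothetical arrays mem[] and pbase[] that hold the compiler-assigned addresses and the chosen intervariable pads:

```c
#define K 4   /* number of memory variables (illustrative) */

/* mem[i]   : original base address of Var_i, as given by the compiler
 * pbase[i] : intervariable padding PBase_i, 0 <= PBase_i <= Cs - 1    */
unsigned long mem[K], pbase[K];

/* BaseAddr(Var_i) = mem_i + sum over k <= i of PBase_k: padding one
 * variable shifts the base addresses of all variables laid out after it. */
unsigned long base_addr(int i)
{
    unsigned long addr = mem[i];
    for (int k = 0; k <= i; k++)
        addr += pbase[k];
    return addr;
}
```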
Fig. 6. Data layout: (a) before intervariable padding, (b) after intervariable padding, (c) before padding, (d) after padding, (e) 2-D array, (f) 2-D array after intravariable padding.
2.3.2.2 Adding Intravariable Padding. When both inter- and intravariable padding are applied, all base addresses and the sizes of every dimension of each memory variable may change. They are initially set according to the values given by the compiler. For each memory variable Var_i, i = 0 ... k, we define a set of variables {PBase_i, PDim_ij}, j = 0 ... D_i, with

$$0 \leq PBase_i, PDim_{ij} \leq C_s - 1$$

After padding, the memory variable base addresses are computed in the following way (see Figure 6):

$$BaseAddr(Var_i) = mem_i + \sum_{k=0}^{k \leq i} PBase_k + \sum_{k=0}^{k < i} \Delta_k$$
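The Δ terms in this closing formula are reconstructed from the definitions above (the equation is truncated in the source text): growing an earlier variable by intravariable padding shifts the base address of every later variable by Δ_k = PS_k − S_k. A sketch combining both forms of padding, again with hypothetical arrays for the compiler-assigned values and with sizes measured in array elements for simplicity:

```c
#define K    4   /* number of memory variables (illustrative)   */
#define MAXD 3   /* maximum number of dimensions (illustrative) */

unsigned long mem[K], pbase[K];
unsigned long dim[K][MAXD], pdim[K][MAXD];  /* dim_ij and PDim_ij */
int ndims[K];                               /* D_i                */

/* S_i: original size of Var_i (product of its dimension sizes). */
unsigned long size(int i)
{
    unsigned long s = 1;
    for (int j = 0; j < ndims[i]; j++)
        s *= dim[i][j];
    return s;
}

/* PS_i: size of Var_i after intravariable padding of each dimension. */
unsigned long padded_size(int i)
{
    unsigned long s = 1;
    for (int j = 0; j < ndims[i]; j++)
        s *= dim[i][j] + pdim[i][j];
    return s;
}

/* BaseAddr(Var_i) = mem_i + sum_{k<=i} PBase_k + sum_{k<i} Delta_k,
 * where Delta_k = PS_k - S_k is the growth of an earlier variable. */
unsigned long base_addr_padded(int i)
{
    unsigned long addr = mem[i];
    for (int k = 0; k <= i; k++)
        addr += pbase[k];
    for (int k = 0; k < i; k++)
        addr += padded_size(k) - size(k);   /* Delta_k */
    return addr;
}
```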