Dynamic Cache Splitting

Toni Juan, Dolors Royo and Juan J. Navarro
Computer Architecture Department, Universitat Politecnica de Catalunya
Gran Capita s/n, Modul D6, E-08034 Barcelona, Spain
e-mail: [email protected]

Abstract

A novel hardware mechanism for cache management that we call split cache is presented. With some help from the compiler, a program can dynamically split the cache into several caches of any size that is a power of two, and choose in which cache each data element is mapped. When properly used, this scheme can reduce the conflict misses to nearly zero. Moreover, as the total capacity of the original cache is used in a better way, capacity misses can also be reduced.

1 Introduction

The increasing gap between processor speed and main memory speed has made cache misses become the performance bottleneck. Thus, it is very important to reduce cache misses as much as possible. Three types of cache misses can be distinguished [2]: compulsory, capacity and conflict. The first two depend on the size of the working set of the code and on the size of the cache memory. The third type depends on the cache organization and, for a given cache capacity, varies with the degree of associativity [1]. Conflict misses arise when different data structures compete for the same cache lines during program execution. Fortunately, in many cases the programmer/compiler knows which data structures are candidates to conflict with others and which of these data structures will be needed in the near future. Therefore, if the compiler could manage the cache as a local memory, the number of conflict misses could be reduced almost to zero.

In this paper we propose a hardware mechanism and a simple extension of the instruction set that allow the compiler to dynamically divide the physical cache into a variable number of logically separated caches of different sizes and to choose in which partition each memory reference is mapped. In this way, conflicts can easily be avoided by assigning different cache partitions to different data structures. Each data structure can be assigned to a cache partition with the size that best fits its needs. When our new cache organization is properly used, conflict misses become nearly zero. Another advantage is that some capacity misses disappear due to the better use of the total cache capacity.

The paper is organized as follows. In section 2 related work is presented and the main differences with our approach are highlighted. The split cache organization is described in section 3 and some design parameters are studied. In section 4 some general ideas about management of the split cache are used to evaluate the performance gain on some examples. We finish with some conclusions and future work in section 5.

* This work was supported by the Ministry of Education and Science of Spain (CICYT TIC-880/92) and by the EU (ESPRIT Project APPARC 6634).


2 Related work

For a long time, there have been many studies about improving the performance of the cache memory. Usually, the contribution of the data cache misses to the execution time can be modeled as

    T_misses = Misses x CPM x Tc        (1)

where CPM (cycles per miss) is the number of cycles needed to service a miss, Misses is the total number of misses and Tc is the processor cycle time. Reducing any of the terms in equation (1) improves performance. For example, [3], [8], [7] and [9] aimed at the reduction of CPM. In [6] and [9] the goal was to reduce the number of misses and, in [5], the objective was to reduce the memory cycle time. It is interesting to note that most of these solutions reduce one term of equation (1) at the expense of increasing one of the other two. Consequently, these solutions are only useful when the reduction of one term surpasses the increase in the other two. As an example of this trade-off, if the original system has a direct mapped cache, the number of misses can be reduced by replacing the direct mapped cache with a set associative cache of the same capacity. However, set associative caches have a higher access time, so the processor cycle time Tc could be increased.

Our proposal aims at the reduction of the Misses term in equation (1), at the expense of a possible (though not always present) increase in Tc. The main difference of our proposal resides in allowing the compiler to allocate data in different cache partitions based on its knowledge of the program data structures. With this approach, we maximize the use of the full cache capacity and obtain a substantial reduction in the number of conflict misses. Our mechanism offers almost the power of a local memory without all the problems associated with its management.
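As a minimal numeric sketch of equation (1), the C fragment below plugs in made-up example values (the miss count, miss penalty and cycle time are assumptions, not figures from the paper) to show how the miss term contributes to execution time:

    /* Minimal numeric sketch of equation (1).  All values are assumed
       examples, not measurements from the paper. */
    #include <stdio.h>

    int main(void) {
        double misses = 2.0e6;   /* total number of data cache misses (assumed) */
        double cpm    = 20.0;    /* cycles needed to service one miss (assumed) */
        double tc     = 5.0e-9;  /* processor cycle time in seconds (assumed)   */

        double t_misses = misses * cpm * tc;   /* T_misses = Misses * CPM * Tc */
        printf("miss stall time = %.4f s\n", t_misses);
        return 0;
    }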

3 Split cache

The cache memory is split into several partitions under software control; the size of each partition is independent of the size of the others as long as it is a power of two. To simplify the explanation, we are going to assume that

1. all cache lines that form a partition are consecutive lines of the original cache, and
2. each cache line resides in only one partition, so there are no intersections between partitions¹.

From the point of view of the program, this is equivalent to having several cache memories. Now each memory reference has to indicate through which partition it wishes to be cached. From the point of view of cache management, each partition behaves like an independent cache, that is, the address mapping in a given partition (direct, set associative, ...) is performed in the same way as in the original cache but assuming its size.

To avoid cache conflicts, all the different data sets that the compiler considers could conflict are going to use different cache partitions. The size of each partition is chosen trying to satisfy the memory needs of the data that is going to be mapped in it. For example, given the ji form of dense matrix by vector multiplication,

    DO j= 1, N
      xr= x(j)
      DO i= 1, N
        y(i)= y(i) + A(i,j)*xr
      ENDDO
    ENDDO

¹ In section 5 this restriction is eliminated.

one way to avoid all the possible conflicts is to assign a different partition to each variable in this program. In figure 1a the data structures, their relationships and the order in which data is accessed are shown. In figure 1b we can see how each variable is mapped in a different partition (each variable has its own cache).

Figure 1: a) data structures of the ji matrix by vector multiplication (A, x and y) and the order in which they are accessed; b) each variable mapped into its own cache partition: the original address (tag, index) is translated, according to the ID assigned to each variable, into a transformed address that falls in that variable's partition of the cache.

A direct access table, the translation table, is needed to store the address transformations associated with each identifier. The number of table entries determines the maximum number of simultaneous partitions; this number is limited to 2^s (in that case, each set of the original cache can be seen as a partition). Each entry of the translation table has two fields that determine the address translation to be applied when a memory reference is issued:

1. the mask, which has b bits associated with the b most significant bits of the index. Each bit that is set to 1 in the mask indicates that the associated bit has to be fixed in the transformed index independently of its original value.
2. the value, which is a vector of b bits indicating the value to which each bit of the transformed index whose corresponding mask bit is 1 has to be fixed. That is, for 1 <= i <= b:

    if (Mask(i) == 1) then
        transformed_index(s-b+i) = Value(i)
    else
        transformed_index(s-b+i) = Original_index(s-b+i)
    endif

and, obviously, b <= s. To obtain the transformed address, the correct entry of the translation table is selected and the mask is used to select between the bits of the original address and the associated bits of the value. This means that we need b two-input, one-bit multiplexers. The cache memory with the new hardware to configure a split cache can be seen in figure 3.

Figure 3: cache memory with the hardware needed to configure a split cache. The ID field of the instruction selects one of the e translation-table entries; its mask and value (b bits each) fix the upper b bits of the s-bit index, the remaining s-b index bits and the tag come from the original address, and the stored tag (t+b bits) is compared to detect a hit.
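The index transformation just described can also be modelled in software. The following C sketch is only an illustration of the mechanism (the cache geometry, eight sets with b = 2, mirrors the example of figure 4a discussed below, and all names are our own assumptions):

    /* Illustrative software model of the split-cache index translation.
       The geometry (s = 3 index bits, eight sets, b = 2) mirrors figure 4a;
       struct and function names are assumptions of this sketch. */
    #include <stdint.h>
    #include <stdio.h>

    #define S 3   /* index bits of the original cache (eight sets) */
    #define B 2   /* index bits controlled by mask and value       */

    struct tt_entry { uint32_t mask, value; };   /* b-bit fields */

    /* Keep the low s-b index bits; wherever a mask bit is 1, replace the
       corresponding one of the top b bits by the value bit. */
    static uint32_t transform_index(uint32_t index, struct tt_entry e) {
        uint32_t top   = index >> (S - B);
        uint32_t fixed = (top & ~e.mask & ((1u << B) - 1u)) | (e.value & e.mask);
        uint32_t low   = index & ((1u << (S - B)) - 1u);
        return (fixed << (S - B)) | low;
    }

    int main(void) {
        /* Four equal partitions, as in figure 4a: mask = 11, values 00..11. */
        struct tt_entry table[4] = { {0x3, 0x0}, {0x3, 0x1}, {0x3, 0x2}, {0x3, 0x3} };
        uint32_t idx = 5;   /* original index 101 */
        for (int id = 0; id < 4; id++)
            printf("entry %d -> set %u\n", id, transform_index(idx, table[id]));
        return 0;
    }

With these entries the original index 101 is redirected to set 001, 011, 101 or 111 depending on the entry used, that is, each entry confines the reference to its own two-set partition.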

An example of two different partitionings that can be obtained, depending on the number of index bits controlled by the mask for a fixed number of entries in the translation table, is now presented. In figure 4a a cache with eight sets is divided into four independent partitions of the same size; since b = 2, each of the partitions has the minimum size allowed. In figure 4b the same cache is again divided into four partitions, but mask and value control one bit more than in the previous case (b = 3). In both cases the full cache capacity is used, but the second example has more resolution to choose the partition sizes. From this example, two conclusions can be drawn: first, the bigger b is, the smaller the partitions can be. Second, all the partitions achievable with a given value of b can also be obtained with any configuration where the number of bits of mask and value is bigger. A more detailed justification is given below.

Figure 4: an eight-set cache divided into four partitions. a) b = 2: translation-table entries 1-4 all have mask 11 and values 00, 01, 10 and 11, giving four equal partitions of two sets each. b) b = 3: entry 1 has mask 100 and value 0XX (sets 000-011), entry 2 has mask 110 and value 10X (sets 100-101), and entries 3 and 4 have mask 111 with values 110 and 111 (sets 110 and 111).

Area requirements

The hardware required to implement the address translation mechanism causes an increase in the area devoted to memory management. The amount of extra memory depends on b and e. It is interesting to have b as big as possible because the partitions can then be smaller (better use of the available capacity, see figure 4). However, for each bit added to the mask and value fields, a new bit has to be added to the tag in order to determine whether a given line is in the cache or not. Moreover, the area devoted to the translation table increases. On the other hand, as the number of entries of the translation table e increases, more different data streams can be kept in the cache without any interference between them, but the area needed for the translation table also increases. Usually e, the number of entries in the translation table, is going to be small because all the memory instructions have to encode the entry to be used and we assume that there is little space left in the instruction encoding. Furthermore, the number of bits for mask and value (b) will be small because the bigger b is, the smaller the partitions are, and with small partitions more partitions are needed to cover the full cache capacity.
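As a rough back-of-the-envelope sketch (the storage layout follows the description above, but the concrete numbers are assumed examples), the extra state amounts to e entries of 2b bits for the translation table plus b additional tag bits per cache line:

    /* Back-of-the-envelope estimate of the extra storage of the split cache.
       The layout (2b bits per table entry, b extra tag bits per line) follows
       the text; the concrete numbers are assumed examples. */
    #include <stdio.h>

    int main(void) {
        int e     = 4;    /* translation-table entries (assumed)      */
        int b     = 2;    /* bits covered by mask and value           */
        int lines = 256;  /* cache lines, e.g. 1K words, 4 words/line */

        int table_bits = e * 2 * b;   /* mask + value per entry    */
        int tag_bits   = b * lines;   /* b extra tag bits per line */
        printf("translation table: %d bits, extra tag bits: %d\n",
               table_bits, tag_bits);
        return 0;
    }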

Effect on the processor cycle time

The original address of any data reference has to be transformed into a new address. This address translation has to be done in a way that does not affect the critical path of the processor. There are two steps to follow when determining the new address. First, for each memory reference it is necessary to determine the address transformation (mask and value) associated with a partition. This is accomplished by an access to the translation table. Second, the transformation has to be applied to the memory address. This is accomplished by the pass through the multiplexers. The first step can be done as soon as the instruction is available, for example in the decode stage. This means that the mask and its associated value can be obtained in parallel with the original address calculation. Usually, in a pipelined processor, the cycle time is determined either by the ALU (address calculation) or by the memory stage (cache access). Then, the application of the address transformation should be placed in the stage that does not determine the cycle time. Since the table access is done in the decode stage, the multiplexers are selected before the cache address has been calculated. Thus, only a two-gate delay has to be placed in the less restrictive stage. If both stages are restrictive, the cycle time of the processor will be increased by the delay introduced by these two levels of gates. However, in many cases, the address calculation stage can be redesigned to anticipate those address bits needed for the translation process, and no modification of the cycle time is produced. In the remainder of this paper we assume that the processor cycle time is not increased by our mechanism.

3.2 Extension of the instruction set

The process of defining a partition is as simple as determining which bits of the index have to be fixed and what concrete value they get. The hardware mechanism presented requires an extension of the instruction set to properly manage the new structures introduced. In particular, one instruction is needed to write the mask and the value into the translation table. Moreover, every memory access has to indicate the translation table entry to be used to translate the memory address. The assembler instructions could be something like

    Partition  Entry, Fixed_mask, Value_mask
    Load       Entry, @, Rd
    Store      Entry, Rf, @

The usual way to use the new instructions is as follows: first, the compiler determines the number of partitions needed and their sizes; then it generates the mask and value that define each partition and loads the translation table. Every memory reference then uses the suitable entry of the translation table (see the sketch below). It is important to note that this scheme is dynamically changeable. Thus, the compiler can use different cache partitionings in different parts of the program as the memory needs change (number and size of the data structures being accessed). This way, the program dynamically adjusts the data distribution over the cache to minimize the number of misses.
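As an illustration only, the sketch below shows the shape of the code a compiler could emit for the ji matrix by vector loop of section 3: one Partition instruction per stream, then every access tagged with its translation-table entry. The function names are hypothetical C stand-ins, not a real ISA or intrinsics API, and the concrete mask/value settings are left as parameters because they depend on the cache geometry.

    /* Hypothetical stand-in for the Partition instruction; it only marks
       where the compiler would emit it. */
    #include <stddef.h>

    static void partition(int entry, unsigned mask, unsigned value) {
        (void)entry; (void)mask; (void)value;
    }

    void matvec_ji(size_t n, const double *a, const double *x, double *y,
                   const unsigned mask[3], const unsigned value[3]) {
        partition(0, mask[0], value[0]);   /* entry 0: stream y(i)   */
        partition(1, mask[1], value[1]);   /* entry 1: stream A(i,j) */
        partition(2, mask[2], value[2]);   /* entry 2: stream x(j)   */

        for (size_t j = 0; j < n; j++) {
            double xr = x[j];                   /* Load 2, x(j), xr           */
            for (size_t i = 0; i < n; i++)      /* Load 1 for A, Load/Store 0 */
                y[i] += a[i + j * n] * xr;      /* for y                      */
        }
    }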

4 Evaluation and examples of use

4.1 Partitioning criteria

Our objective is to reduce conflict misses. These misses can be subdivided into cross-interferences, when data from different data structures compete for the same lines, and auto-interferences, when the same data structure competes for the same cache lines. We define a stream as the sequence of memory references due to a concrete memory access instruction. The main criterion to determine what data is mapped in a given partition could be to separate the memory accesses to each different data structure. This way, the compiler can guarantee that no cross-interferences will happen (see figure 1). However, the same data structure can be accessed by more than one stream. If different streams are assigned to different partitions, all the possible conflicts (auto- and cross-interferences) can be eliminated.

The size of the partition assigned to one stream depends on the locality type of that stream. A stream with spatial locality only needs one cache line to exploit that locality. Thus, the smallest possible partition will be assigned to streams having only spatial locality. When a stream has temporal locality it will be mapped in a partition where all its working set fits (when possible); a sizing rule along these lines is sketched below. Blocking techniques [4] and the well known compilation techniques used to generate code for vector processors can be used to reduce the working set of the program and improve the locality in all the partitions. When there are more active streams than possible partitions, some streams have to coexist in the same partition. The compiler will try to mix those streams that have the least temporal locality. Moreover, when the compiler does not know how to distribute the data in the cache, there is always the possibility of defining only one partition that covers all the cache memory and mapping all the streams onto it (as in a traditional cache).
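A minimal sketch of this sizing rule, with assumed function and type names, could look as follows: streams with only spatial locality get the smallest partition, and streams with temporal locality get the smallest power-of-two partition that covers their working set, capped at the cache size.

    /* Sketch of the partition-sizing rule described above; names and the
       enum are our own illustration, not part of the proposal. */
    #include <stddef.h>

    enum locality { SPATIAL_ONLY, TEMPORAL };

    static size_t partition_lines(enum locality kind, size_t working_set_lines,
                                  size_t min_partition_lines, size_t cache_lines) {
        if (kind == SPATIAL_ONLY)
            return min_partition_lines;    /* one line is enough for spatial reuse */
        /* Round the working set up to the next power of two, capped at the cache. */
        size_t p = min_partition_lines;
        while (p < working_set_lines && p < cache_lines)
            p *= 2;
        return p;
    }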

4.2 Coherence issues

When assigning streams to partitions, the compiler has to be able to determine whether more than one stream references the same memory positions. Clearly, if two different streams that access some common positions are assigned to different partitions, a coherence problem appears. Another coherence problem arises when two different processes share data structures. In both situations, the compiler has to guarantee that the shared data is mapped in an equivalent partition, that is, a partition that begins in the same cache position and has the same size. In any case, when the compiler cannot guarantee that all the references to a given data element go through the same cache position, it can define a specific partition to hold this kind of data.

4.3 Examples

To evaluate the effectiveness of the separation of streams we have performed experiments using matrix by matrix and matrix by vector multiplication. The experiments consist of simulations of the behavior of the cache system using address traces. To obtain these traces, we have instrumented by hand the Fortran code of each of the algorithms. The performance metric used is the number of misses per flop (MPF), because it can easily be translated to execution time³, and we compare the use of the original cache with the use of cache splitting. The main characteristics of the cache system are: 1K words (8K bytes), direct mapped, 4 words per line (l = 4), and translation table entries with two bits for mask and value (b = 2).

4.3.1 Matrix by Matrix multiplication

We use the jki form of matrix multiplication to evaluate the performance that can be obtained when using the split cache.

    DO j= 1, N
      DO k= 1, N
        b_aux= B(k,j)
        DO i= 1, N
          C(i,j)= C(i,j) + A(i,k)*b_aux
        ENDDO
      ENDDO
    ENDDO

When the problem does not fit in the cache, a significant part of the MPF is due to capacity misses. In addition, there are interference misses, which increase with the size of the problem. Furthermore, these interferences produce very large spikes, since for specific values of the matrix size (leading dimension of the matrix) columns of matrix A are mapped onto the same cache lines as those of matrix C, giving a pathological interference pattern. We can eliminate all these interferences and obtain better performance and a flat behavior for any matrix dimension using cache splitting. Depending on the problem size there are two possible cache partitionings that give the best performance. When the whole matrix A fits in one partition of half the cache capacity, all its temporal locality is reused; B and C go through different partitions of any size to avoid conflicts. The second possibility is when A does not fit in half the cache; then this partition is assigned to C, and A and B are mapped in different partitions of any size. If the problem size is unknown at compilation time, the program chooses the best partitioning at execution time, as sketched below. The improvement of the split cache over the traditional cache can be seen in figure 5; the best partitioning has been used at each point for the split cache.
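The run-time choice between these two partitionings can be sketched as a simple size test; the names and the decision function below only restate the rule from the text and are not taken from the paper's implementation.

    /* Sketch of the run-time partitioning choice for the jki matrix product:
       give half of the cache to A if the whole matrix fits there, otherwise
       give that half to C.  Names are our own illustration. */
    enum mm_scheme { A_GETS_HALF_CACHE, C_GETS_HALF_CACHE };

    static enum mm_scheme choose_partitioning(long n, long cache_words) {
        return (n * n <= cache_words / 2) ? A_GETS_HALF_CACHE : C_GETS_HALF_CACHE;
    }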

4.3.2 Matrix by Vector multiplication

The code for the ji form of matrix by vector multiplication has been presented in section 3. Since matrix A has no temporal locality, the compulsory misses are very important. Therefore, it is essential to avoid interferences on the x and y vectors and exploit their spatial and temporal locality. On a conventional cache, the performance of this algorithm depends on how the columns of matrix A are mapped in the cache relative to vector y. This results in a variable performance, as shown in figure 6. This can be avoided using cache splitting in the following way. The y vector can use half of the cache capacity. All references to matrix A can go through one quarter, although one line would have been enough to exploit the spatial locality. The last quarter of the cache is used for vector x, as shown in the sketch below. This partitioning of the cache and assignment of the reference streams obtains very good performance for a large range of problem sizes. If C is the cache capacity in words then, for N = 1 to C/2 there are only compulsory misses, whereas for N = C/2 to C the part of y that is reused diminishes continuously, and finally for N >= C the vector y cannot be reused. However, for any problem size we avoid all pathological behavior. The MPF obtained can be seen in figure 6; the split version always obtains better performance, and in some points the performance improvement is close to 500%.

³ Remember that we assume that our translation hardware does not modify the processor cycle time.
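For the evaluated cache (256 lines, direct mapped, b = 2) the partitioning just described could be encoded as shown below; the entry numbering and the helper code are assumptions of this sketch, not taken from the paper.

    /* Translation-table setup sketch for the ji matrix by vector product on
       the evaluated cache: 1K words, 4 words per line (256 sets), b = 2. */
    #include <stdio.h>

    struct tt_entry { unsigned mask, value; };   /* 2-bit mask and value */

    int main(void) {
        /* Entry 0, stream y: mask 10, value 0x -> top index bit fixed to 0, sets 0..127   */
        /* Entry 1, stream A: mask 11, value 10 -> top bits fixed to 10,     sets 128..191 */
        /* Entry 2, stream x: mask 11, value 11 -> top bits fixed to 11,     sets 192..255 */
        struct tt_entry table[3] = { {0x2, 0x0}, {0x3, 0x2}, {0x3, 0x3} };

        for (int i = 0; i < 3; i++)
            printf("entry %d: mask=%x value=%x\n", i, table[i].mask, table[i].value);
        return 0;
    }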


Figure 5: MPF for the jki form of matrix by matrix multiplication (normal cache vs. split cache)

5 Conclusions and future work

A combined hardware/software mechanism called split cache has been presented. This scheme allows the compiler to dynamically split the cache into several separated caches (partitions), in many cases without increasing the cycle time. Every partition can have a different size. With this new cache organization, the compiler can easily separate each stream from the others, eliminating the conflict misses. Moreover, the full cache memory is used better, meaning that some capacity misses can also be removed. Our mechanism is especially useful for direct mapped caches but can also be used in set associative caches. However, when used in a set associative cache, as the number of conflict misses is smaller, a lower performance improvement is expected.

Some interesting points that are being studied are:

- Determining compiler algorithms to get the best performance out of the new cache while guaranteeing that there are no coherence problems.
- Application and evaluation of cache splitting applied to instruction caches. The main idea is to avoid interferences between subroutines.
- Improving the performance of mixed caches. Depending on the data and instruction working sets of the program, the cache capacity can be dynamically distributed. This means that the best features of split and mixed caches can be obtained.
- From the point of view of the operating system, a split cache could be used to avoid interferences between the system and the user processes. Even a cache-demanding process could get a separate part of the cache so that no other process could flush the data of that process.
- Using the same mechanism, the split cache with the same hardware presented in this paper, to define partitions that can be included in other partitions. Some interferences will still remain, but the capacity will again be used better.
- A selective prefetching mechanism can be implemented. Depending on the locality characteristics of a stream, a partition could automatically prefetch data lines only when it is really useful.


Figure 6: MPF for the ji form of matrix by vector multiplication (normal cache vs. split cache)

- A full hardware management of the cache partitioning and referencing process. With this approach it is not necessary to modify the instruction set of the processor.

References

[1] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.
[2] Mark D. Hill and Alan Jay Smith. Evaluating associativity in CPU caches. IEEE Trans. on Computers, Vol. 38(12):1612-1630, Dec 1989.
[3] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proc. of the Int. Symp. on Computer Architecture, pages 364-373, 1990.
[4] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In ASPLOS, pages 67-74, 1991.
[5] Lishing Liu. Partial address directory for cache access. IEEE Trans. on Very Large Scale Integration Systems, Vol. 2(2):226-239, Jun 1994.
[6] A. Seznec and F. Bodin. Skewed associative caches. Technical Report 1655, INRIA, Mar 1992.
[7] Andre Seznec. DASC cache. In Proc. of the 1st Int. Symp. on High-Performance Computer Architecture, pages 134-143, Jan 1995.
[8] Kimming So and Rudolph N. Rechtschaffen. Cache operations by MRU change. IEEE Trans. on Computers, Vol. 37(6):700-709, Jun 1988.
[9] O. Temam and Y. Jegou. Using virtual lines to enhance locality exploitation. In Proc. of the Int. Conf. on Supercomputing, pages 1-12, 1994.