
Detailed Cache Simulation for Detecting Bottleneck, Miss Reason and Optimization Potentialities Jie Tao and Wolfgang Karl

Institut für Technische Informatik, Universität Karlsruhe (TH), 76128 Karlsruhe, Germany, {tao,karl}@ira.uka.de

ABSTRACT

Cache locality optimization is an efficient way to reduce the time modern processors spend waiting for needed data. This kind of optimization can be achieved either by programmers or compilers through code-level optimization, or at the system level through appropriate schemes such as reconfigurable cache organizations and adequate prefetching or replacement strategies. For the former, users need to know the problem, the reason, and the solution, while for the latter a platform is required for evaluating proposed and novel approaches. As existing simulation systems do not provide such information and platforms, we implemented a cache simulator that models the complete cache hierarchy and associated techniques. More specifically, it analyzes the characteristics of cache misses and provides information about the runtime accesses to data structures and the cache access behavior. Together with a visualization tool, this information enables the user to detect access hotspots and to derive optimization strategies for tackling them. To support the study of different techniques with respect to cache configuration and management, the simulator models a variety of cache line replacement and prefetching policies, and allows the user to specify any cache organization, including cache size, set size, block size, and associativity. The simulator hence forms a research platform for investigating the influence of these techniques on the execution behavior of applications.

Categories and Subject Descriptors B.3 [Memory Structures]: Performance Analysis and Design Aids; C.4 [Performance of Systems]: Modeling techniques

General Terms Performance

Keywords cache simulation, code optimization, performance visualization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. VALUETOOLS ’06 Pisa, Italy Copyright 2006 ACM 1-59593-504-5/06/10 ...$5.00.

1. INTRODUCTION

Memory speed has increased far more slowly than CPU speed, leading to a widening gap between the processor cycle time and the memory access latency. Memory performance has hence become a critical issue for fully exploiting the computational capacity of modern processors and for enhancing the overall performance of a computer system. One approach to tackle this problem is to improve cache locality. As a fast buffer for storing reused data, the cache provides access speeds comparable to the processor. Due to the complexity of applications, however, the cached data often cannot be reused by the running program, so that a large number of cache misses still occur, especially with memory-intensive applications. Programmers therefore usually have to optimize their application code with respect to cache locality. This task requires knowledge about the access pattern of the application and the runtime cache access behavior, which is difficult to acquire through static analysis of the source code. Users hence rely on cache performance data provided by various tools, like hardware counters and simulation systems. Nevertheless, existing tools can only supply information about what happened, such as the number of cache misses. Even though this kind of statistics can be related to specific code regions or data structures, it still does not suffice for understanding the access pattern of an application and detecting cache inefficiency problems. For locality optimization, users need to know not only where the problem lies, to which existing tools are restricted, but also the reason causing the problem and possible solutions. Simulation tools, in principle, could provide this functionality; however, existing cache simulators do not offer it. Additionally, cache locality can also be improved transparently through cache adaptation in terms of cache organization and management, if the system supports such reconfigurability. This kind of adaptation depends on the access pattern of the application, and hence a platform is needed to study the behavior of such adaptation under changing access patterns. We therefore implemented a comprehensive cache simulator

capable of both providing the cache performance data needed for code optimization and establishing a platform for investigating system adaptivity with respect to cache behavior. For the former, the cache simulator delivers statistical information exhibiting the problem and its location as well as information showing the miss reason, the runtime cache operations, and the accesses to data structures. For the latter, it models a set of different caching techniques, such as prefetching, cache line replacement, and a variety of cache coherence protocols. In summary, this cache simulator is distinguished by the following additional functionality:

• allowing various cache models and organizations
• mechanisms for analyzing cache miss types
• a facility for detecting false sharing
• simulating a variety of cache coherence protocols
• supporting various prefetching schemes and cache line replacement strategies

In addition, we also implemented a cache visualizer that presents the acquired cache performance data at a high level in an easy-to-understand way. More specifically, this visualization tool deploys a top-down design to direct the user step by step to the problem and the solution: users first acquire an overview of the cache access behavior, then the access hot spots, and finally the reason for the cache misses. This allows the user to discover appropriate optimization strategies for performing code or data transformations that result in fewer cache misses. The remainder of this paper is organized as follows. Section 2 gives a short background on cache miss types and commonly used optimization techniques. This is followed by a brief description of related work in this research area in Section 3. In Section 4 the implemented cache simulator is described in detail. Initial experimental results are presented in Section 5. The paper concludes in Section 6 with a short summary and some future directions.

2. OVERVIEW OF CACHE MISS TYPES AND OPTIMIZATION TECHNIQUES

Cache misses are usually divided into three categories: first reference (or cold) misses, conflict misses, and capacity misses. First reference misses occur when data is accessed for the first time and still resides in main memory. Conflict misses occur when a data block has to be removed from the cache due to mapping overlaps. Capacity misses occur when the cache size is smaller than the working set size. To distinguish between conflict and capacity misses, the following definition is often applied: if the reuse distance of an access is smaller than the number of cache lines, the resulting miss is regarded as a conflict miss; otherwise, it is a capacity miss. In this definition, the reuse distance is the number of unique addresses referenced since the last access to the requested data, where addresses mapping to the same cache line are not regarded as unique. However, if both the working set and the cache were appropriately organized, most cache misses could be avoided. For example, if required data could be preloaded into the cache, cold misses would be reduced; if intensively reused data blocks could be mapped to different cache sets, conflict misses would be reduced; and if the working set were smaller or the cache larger, capacity misses would not occur.

Hence, researchers have been exploring ways to fulfill these assumptions. As a consequence, a set of optimization techniques has been proposed and applied to improve cache utilization. Well-known approaches include array padding, loop permutation, loop blocking/tiling, and adaptive schemes like cache reconfiguration, remapping, and guided prefetching. In the following, we discuss how these techniques reduce the different kinds of cache misses.

First Reference Miss: Prefetching is an efficient approach for reducing cache misses caused by first references. Commonly used prefetching mechanisms usually fetch the data block succeeding the missing block. This is efficient when arrays are accessed in layout order. However, when accesses have a large stride, the prefetched data potentially cannot be used, leading to further cache misses. The following code illustrates a simple example:

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++)
      y(i, j) += x(k, j);

Examining array x, it can be seen that this array is not accessed in layout order but with a stride of n. If a cache line cannot hold n/2 elements of array x, the next needed array element x(k+1, j) will be neither within the data line holding x(k, j) nor in its succeeding data block, and has to be fetched from main memory. An adequate prefetching scheme capable of deciding which block to prefetch, however, can eliminate this miss by using a fetching stride of n. In this case, the access stride needs to be known. According to the experimental results of a research work in this area [8], such a technique can achieve a performance gain of up to 26%. In addition to prefetching, loop permutation is also an efficient approach to reduce cold misses. For the example above, if loops j and k are interchanged, array x will be accessed row by row, and hence the fetched and prefetched array elements can be used in the following iterations.

Conflict Miss: Conflict misses occur when several data items map to the same cache set, causing reused data to be repeatedly evicted from the cache and reloaded. Padding is a commonly used approach to tackle conflict misses. By inserting pads into the working set, this approach changes the data layout so that conflicting data items no longer map to the same cache set. The pads can be inserted between data structures or within a single array, forming inter-array and intra-array padding, respectively. The following is an example requiring inter-array padding:

int a(32), b(32), c(32);
for (i = 0; i < 31; i++)
  for (j = 0; j < 31; j++)
    for (k = 0; k < 31; k++)
      a(i) += b(i) + c(i);

Assume this code runs on a 32-bit system with a 2-way cache of 8 lines and a cache line size of 16 bytes. As the arrays hold integers (i.e., 4 bytes each), each cache line can hold 4 elements, indicating that each array maps to the complete cache.

Hence, these arrays overlap in the cache, with the first 8 elements in the first cache set, and so on. Examining the computation in the code, it can be observed that it results in an access sequence of the form a(i) (read), b(i) (read), c(i) (read), a(i) (write). Since these elements map to the same cache set and a single set contains only two lines, a(i) will be replaced by c(i). However, a(i) is needed for storing the result and has to be loaded into the cache again; in this case, a conflict miss occurs. Nevertheless, if a buffer of two cache lines (e.g., eight integer elements) is inserted between a, b, and c, this conflict can be avoided and thereby no cache misses occur. For the research work described in [15], this kind of padding reduced the cache miss rate by 16% on average.

Capacity Miss: A capacity miss occurs when, between two accesses to the same data, more references to distinct addresses are performed than the cache can hold. In this case, the second access to the requested data is a miss. This problem can be addressed using loop blocking, where a single loop is divided into several smaller loops. This kind of blocking restructures the access pattern so that accesses to reused data are closer together in the iteration space, resulting in more cache hits.

Overall, a variety of techniques have been proposed for tackling the different kinds of cache misses. Currently, as reconfigurable hardware becomes feasible, cache misses can also be reduced by using an appropriate cache organization and replacement policy with respect to the access pattern of the individual program. For example, application A may benefit from a 2-way cache, while application B shows better performance with a 4-way cache. These hardware-based techniques reduce the overall cache misses.
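As an illustration of the inter-array padding just described, the following sketch packs the three arrays of the example into one structure with pad fields between them; the struct wrapper and the pad sizes are our own illustration (chosen for the 2-way, 8-line, 16-byte-line cache assumed above), not code from the paper.

/* Inter-array padding sketch: the pad fields are never accessed, they only
 * shift the base addresses of b and c so that a(i), b(i) and c(i) no longer
 * fall into the same cache set. The pad sizes would be tuned to the real
 * cache geometry. */
struct padded_arrays {
    int a[32];
    int pad1[4];   /* one 16-byte cache line */
    int b[32];
    int pad2[4];   /* one 16-byte cache line */
    int c[32];
};

void sum(struct padded_arrays *p)
{
    for (int i = 0; i < 32; i++)
        p->a[i] += p->b[i] + p->c[i];
}

With one line of padding after a and another after b, a(i), b(i) and c(i) land in three consecutive cache sets, so the read-read-read-write sequence above no longer evicts a(i).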

3. RELATED WORK

Simulation is a well-established technique for studying computer hardware and predicting system performance. Over the last years, many simulation systems have been developed with the goal of providing a general tool for such studies. Prominent systems include full system-wide simulation tools, like SimOS [5], SIMICS [9], the Wisconsin Wind Tunnel [11], and SimpleScalar [1], and memory- or I/O-specific tools like RSIM [13] and MICA [6]. These larger simulators model the computer hardware in high detail, allowing the complete design of the target architecture to be evaluated. Our simulator, however, aims at supporting research work in the area of cache performance. In this specific area, several simulators have been developed; well-known examples include SIP [2], MemSpy [10], and Cachegrind [18]. SIP (Source Interdependence Profiler) is a profiling tool that provides information about cache usage, such as the spatial and temporal use of floating point and integer loads and stores, cache miss ratios with respect to data structures, and summaries at statement level. It uses SimICS [9], a full-system simulator, to run the applications and collect cache behavior data. MemSpy is a performance monitoring tool designed to help programmers discern memory bottlenecks. It uses cache simulation to gather detailed memory statistics and then shows the frequency and reasons of cache misses. Cachegrind is a cache-miss profiler that performs cache simulation and records cache performance metrics including L1 instruction cache reads and misses, L1 data cache accesses and misses, and L2 unified cache accesses and misses. The input for Cachegrind is extracted from the debugging information.

In summary, cache performance data can be obtained with existing simulators. However, for efficient optimization users need information that shows not only what happened but also why it happened, and ideally hints about how to optimize. This information must also be presented at a high level, e.g. in relation to data structures. Existing cache simulators do not address this issue. As a result, we implemented this simulation system in order to give a deeper insight into the runtime cache activities. As an additional functionality, we also simulate different techniques which aim at improving the runtime cache behavior through hardware or software approaches. This helps researchers to evaluate their strategies and study the influence of these techniques on the cache performance.

4. COMPREHENSIVE CACHE SIMULATION

First of all, the cache simulator models the basic functionality of a cache hierarchy that can contain an arbitrary number of cache levels. It takes memory references as input and then simulates the complete process of data lookup in the whole memory system. Currently, we use the code instrumentor Doctor to acquire memory accesses. Doctor was originally developed as part of Augmint [12], a multiprocessor simulation toolkit for Intel x86 architectures. It is designed to augment assembly code with instrumentation instructions that generate memory access events. For every memory reference, Doctor inserts code to pass the accessed address and its size to the simulation subsystem of Augmint. For this work, we slightly modified Doctor so that it generates a trace file which stores all memory accesses performed by the application at runtime. In order to support the study of the impact of cache configuration and cache speed on performance, we simulate caches with different cache sizes, block sizes, associativities, access latencies, and coherence protocols. Besides this, we also implemented specific algorithms for analyzing cache miss characteristics. In addition, several prefetching and replacement schemes are also simulated.

Cache Miss Analysis. A specific feature of the cache simulator is to report the type of each cache miss. The most tedious part of this process is calculating the reuse distance for each memory access. As defined in Section 2, the reuse distance is the number of unique addresses referenced since the last access to the requested data. Take the following address sequence as an example:

A B C B D C C E

Assuming B, C, D, and E are not in the same data line, the reuse distance of A at the point where E is accessed is 4. This value is obtained by incrementing the reuse distance of A for each unique address accessed since the last access to A. Here, a Reuse flag is needed to specify whether an address is unique. In addition, a first reference flag is required for detecting cold misses. Figure 1 shows the process used to classify a cache miss. First, the first reference flag is examined to determine whether the miss is a cold miss. If so, the number of cold misses is incremented and the first reference flag is set. Otherwise, a capacity miss is checked for: the reuse distance of the missing data block is compared with the number of cache lines. If the reuse distance is not smaller than the number of cache lines, the miss is recorded as a capacity miss; otherwise, as a conflict miss.
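The following sketch shows how this classification and the reuse-distance bookkeeping could look; it is a simplified model of the process in Figure 1, and all names (classify_miss, update_reuse, the per-pair seen flags) are ours rather than identifiers from the simulator's source.

#include <string.h>

#define MAX_BLOCKS 1024   /* assumed upper bound on tracked data blocks */

static int  first_ref[MAX_BLOCKS];          /* block referenced before?            */
static int  reuse_dist[MAX_BLOCKS];         /* unique blocks since its last access */
static char seen[MAX_BLOCKS][MAX_BLOCKS];   /* per-pair reuse flags                */

enum MissType { COLD_MISS, CAPACITY_MISS, CONFLICT_MISS };

/* Called on every reference to block x, hits and misses alike. */
static void update_reuse(int x, int num_blocks)
{
    for (int b = 0; b < num_blocks; b++) {
        if (b == x)
            continue;
        if (!seen[b][x]) {         /* x is unique for b in the current period */
            reuse_dist[b]++;
            seen[b][x] = 1;
        }
    }
    reuse_dist[x] = 0;             /* x starts a new reuse period */
    memset(seen[x], 0, sizeof seen[x]);
}

/* Called only when the reference to block x misses in the cache. */
static enum MissType classify_miss(int x, int num_cache_lines)
{
    if (!first_ref[x]) {           /* first reference: data still in memory */
        first_ref[x] = 1;
        return COLD_MISS;
    }
    /* At least as many unique blocks as the cache has lines: the working set
       did not fit; otherwise the block was evicted by a mapping conflict. */
    return (reuse_dist[x] >= num_cache_lines) ? CAPACITY_MISS : CONFLICT_MISS;
}

On a miss, classify_miss would be called before update_reuse, since the update resets the reuse distance of the accessed block.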


Figure 1: Process to identify the miss type

In the next step, the reuse distance of all data blocks is updated. This has to be done on every cache hit as well. During this process, the reuse distance of the currently accessed block is set to 0, indicating the start of a new access period. For all other data blocks the reuse distance is incremented if their Reuse flag shows that the observed address is unique for them.

False Sharing. On multiprocessor systems, cache misses can additionally be caused by cache line invalidations, which are needed to keep the caches of different processors consistent. Such an invalidation is usually triggered by a write operation: when a processor writes to a cache line, copies of the same data block in other caches are invalidated. In specific cases, however, the invalidation is not necessary, for example when one processor writes a single word in a data block while another processor accesses a different word. This scenario is called false sharing. If the data words could be organized differently, in this example by storing the word required by the second processor in a different data block, false sharing could be avoided. In order to help users reduce cache misses caused by invalidation, our cache simulator has a specific mechanism for detecting false sharing. For implementing this mechanism, each processor maintains a write chain storing information about each shared write performed by other processors; in other words, each shared write performed on a processor is registered in the write chains of the other processors. To check for false sharing, the accesses following such a write are examined to see whether they target the same data block as the write operation but at a different word offset. At the end of the simulation, both the total number of false sharing cases and a histogram recording their details are provided.

Prefetching Schemes. By default, the cache simulator models a conventional prefetching policy, where the succeeding data block is loaded into the cache together with the missing block. In addition, another scheme is modeled which allows the user to specify the number of data blocks to prefetch, so that a set of subsequent data blocks is loaded into the cache. This scheme is only efficient for applications with high spatial locality; if no spatial locality exists, cache performance can even be worsened, because reused data can be evicted from the cache by the prefetched data when the cache capacity is insufficient.
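A minimal sketch of this configurable block prefetching is shown below; the function and parameter names are illustrative, and load_block() is a stub standing in for the simulator's cache-fill routine, not its actual API.

#include <stdio.h>

/* Stub for the simulator's cache-fill routine (assumption). */
static void load_block(unsigned long block_addr)
{
    printf("fill block at 0x%lx\n", block_addr);
}

/* On a miss, fetch the missing block plus prefetch_blocks successive blocks.
 * prefetch_blocks = 0 models no prefetching, 1 the conventional scheme, and
 * larger values the configurable block prefetching described above. */
static void handle_miss(unsigned long miss_addr,
                        unsigned long block_size,
                        int prefetch_blocks)
{
    unsigned long base = miss_addr - (miss_addr % block_size);

    load_block(base);
    for (int i = 1; i <= prefetch_blocks; i++)
        load_block(base + (unsigned long)i * block_size);
}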

The experimental results in Section 5 will show this effect. Prefetching would be more effective if the prefetching scheme knew in advance which data will be requested and how large it is; we are currently working on such strategies.

Replacement Policies. Besides the traditional LRU policy, our cache simulator models two novel replacement policies, Set-LRU and Global-LRU, both of which hold frequently used data and, in case of a conflict, remap the requested data to another place in the cache. The Set-LRU algorithm aims at finding a rarely used cache set for storing the requested data. It first determines the least recently used cache set (called the LRU-set) according to the total number of accesses to each set in the cache. If a threshold is exceeded, for example if the mapping set receives twice as many accesses as the LRU-set, the LRU cache line within the LRU-set is chosen to hold the missing data block. The threshold is used to limit the number of remappings, because remapping introduces overhead for additionally managing the remapped data blocks. For the implementation of this algorithm, we define a set-description table to store the needed information; on a real system this table would be implemented in hardware. For each set, the table holds two fields: one records the number of references to the set, and the other indicates whether a block mapped to this set has been remapped. In case of remapping, where a cache block is found not in its mapping set but in another one, the access time is doubled. In addition, the overhead for maintaining the set-description table is also taken into account. The second algorithm, Global-LRU, is an extension of the traditional LRU policy. With this algorithm, the replacement candidate is searched in the whole cache rather than within a single set, as conventional LRU does. For this, again a set-description table is maintained; unlike the one used for the Set-LRU algorithm, it stores only remapping information, since the other needed information is available from the LRU implementation.

Main Performance Data. In summary, we implemented a flexible cache simulator with various properties, which can generate the following performance data:

• Statistics on single events, like memory references, cache misses, cache misses in the different categories, and false sharing. This information can be given at various granularities, with respect to e.g. the whole program, a specific memory region, or an individual iteration, loop, or function. Hence, this information allows the user to understand the overall performance and to globally locate the access hotspots.

• The distribution of cache misses across the individual data blocks of the complete working set. This allows the user to understand the access pattern within specific data structures and to further narrow the bottleneck down to data blocks within the data structures. In addition, this information can be acquired separately for each program phase, e.g. a single function, enabling the combination of hot data structures and hot code regions.

• A profile of cache operations and the related data addresses. This information can be used to depict the runtime update of the cache contents and the mapping conflicts of data structures in the cache.

The statistics on individual events are directly presented at the end of the program run, while the other information is recorded in trace files for visualization. The implemented visualization tool provides a variety of graphical views helping the user to understand the cache activities and the access pattern of applications, to detect bottlenecks and their reasons, and potentially to discover solutions for tackling the bottlenecks.

5. EXPERIMENTAL RESULTS

As mentioned in the Introduction, the cache simulator is implemented with two goals: providing performance data for cache optimization and establishing a platform for evaluating the impact of different cache organizations and strategies, like prefetching and replacement, on the cache performance of applications. In this section, we use several examples to show how the information can be applied for performing code optimization, and how different cache organizations, prefetching schemes, and replacement policies influence the cache behavior. Most applications for these experiments are chosen from standard benchmark suites, including SPLASH [17], NAS serial [3], and SPEC 2000 [4].

5.1 Code Optimization with a Cache Visualizer

For a better understanding of the acquired cache performance data we implemented a visualization tool, YACO [14]. YACO is specially designed with the goal of efficiently helping users with the task of cache optimization. For this, it provides a variety of graphical views to direct the user step by step to the problem and the solution. Using YACO, users first acquire an overview of the cache access behavior shown by the chosen program. Based on this overview, the user can determine whether an optimization is essential. In the next step, the access hot spots, which are responsible for poor cache performance, can be located with the help of YACO's Variable Miss Overview, Data Structure Information, and 3D Phase Information views. The first view presents the total misses for all data structures, allowing a global location of the access hotspots, while the other two views enable a further localization of the bottleneck to concrete data blocks and functions by showing the miss behavior over all data blocks of a single data structure and of each individual function. For demonstrating the reason for cache misses, YACO provides a Variable Trace and a Cache Set view, which depict the accesses to a data array together with their characteristics, and the updates within a cache set. This information allows the user to find an appropriate optimization scheme to eliminate the detected cache problem. The impact of the optimization can be observed with YACO after running the optimized code. This process can be repeated until an acceptable cache performance is achieved.

The program. We chose a simple matrix multiplication code to demonstrate how the performance information provided by the cache simulator helps to develop cache-efficient applications. The computation is mainly done by the following loop:

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    for (k = 0; k < N; k++)
      c[i*N + j] = c[i*N + j] + a[i*N + k] * b[k*N + j];

First, we use YACO's Performance Overview to examine the total cache performance of this code. We find that less than half of the total memory references hit in the first level cache. A similar result is observed for the L2 cache, where a miss ratio of 54% has been measured. Based on this observation, we decide to optimize the code for better cache performance. For optimization we need to know the target, i.e. the access hotspots. To acquire this knowledge we examine the Variable Miss Overview, which presents the miss behavior of all data structures in the tested program. We see that all three matrices introduce cache misses; most misses with a and b are capacity misses, while with c mapping conflicts are the main miss reason. We first optimize the code with respect to matrices a and b. Examining the computation loop, it can be seen that each run of the k loop calculates a single element of matrix c and for this needs a whole row of a and a whole column of b. More importantly, the row of a is reused for computing the next element. The capacity misses with a mean that these elements are evicted from the cache before being reused. In fact, each row of matrix a contains 64 elements, which form 16 data blocks, exceeding the L1 cache capacity (8 cache lines). We use loop blocking to reduce the amount of data needed within one loop in order to achieve the reuse of a. The following is the optimized code:

for (block = 0; block < N; block += N/2)
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = block; k < block + N/2; k++)
        c[i*N + j] = c[i*N + j] + a[i*N + k] * b[k*N + j];

The difference from the original code is that the innermost loop (the k loop) does not perform the whole work of generating an element of matrix c; rather, it does only half of the work. The additional block loop guarantees that the whole work is covered. In this case only 32 elements of a (8 data blocks) are accessed within the k loop. This helps to keep the elements of a in the cache for reuse, even though the cache also stores elements of c and b.

This optimization raises the cache hit ratio from 46% to 62%. For matrix b, the capacity misses cannot be removed in this example, because a column is reused only after all other columns have been accessed for calculating a single element. For matrix c, however, a further study of the mapping conflicts is needed in order to decide how to optimize the code. For this, we examine the Cache Set view of YACO, which shows the content updates in the cache. We detect that the same element of matrix c is reused but replaced by elements of matrix a every time after the access. This corresponds to the computing process, where first a multiplication is performed and then the result is accumulated into the c element: the multiplication is performed on different elements of a and b, but the accumulation targets the same element of c. Hence, better performance can potentially be achieved if the elements of c are kept in the cache. One approach is to add a pad of one cache line before matrix c so that the elements of c do not map to the same cache set as the needed elements of a and b. The data allocation is as follows:

a = (double*) malloc(sizeof(double) * N * N);
b = (double*) malloc(sizeof(double) * N * N);
d = (double*) malloc(sizeof(double) * 4);
c = (double*) malloc(sizeof(double) * N * N);

This optimization results in a 40% reduction of the L1 misses in comparison with the original version and thereby a 23% increase in the cache hit ratio.

5.2 Additional Optimization Potential on Multiprocessor Systems

On multiprocessor systems there is another reason for cache misses, namely cache line invalidation. Such invalidations are issued by coherence protocols to keep multiple caches consistent. In case of false sharing, however, the invalidation is not necessary and hence causes a performance loss. In order to help users detect false sharing, the cache simulator provides information about the interaction on shared data among the processors.

Figure 2 depicts a sample result for FFT simulated on a 4-node system with the MESI protocol. We choose three access addresses as examples to demonstrate false sharing cases. Take address 25167727 at the top as an example. It can be seen that in all three depicted cases this address is first written by processor 2 and then read by processor 0. However, those writes and reads target different data words. In Case 1, for example, after processor 2 writes the first word, processor 0 reads the second. This indicates false sharing. Similar behavior can also be seen with the other addresses. The knowledge about false sharing can be used to reallocate data structures or reorganize array elements so that falsely shared data words are stored in different cache lines. In this way, required data can be held in the cache even though invalidations are performed. We will study the impact of such optimizations in the near future.
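As a sketch of such a reorganization, the following pads a shared structure so that two words updated by different processors end up in separate cache lines; the 64-byte line size and the field names are assumptions chosen only for illustration.

#define CACHE_LINE 64   /* assumed line size of the target system */

/* Both words share one cache line: every write by processor 2
 * invalidates the copy that processor 0 is reading. */
struct shared_bad {
    long written_by_p2;
    long read_by_p0;
};

/* Padding pushes the second word into its own cache line, so the
 * writes no longer invalidate the data the other processor needs. */
struct shared_padded {
    long written_by_p2;
    char pad[CACHE_LINE - sizeof(long)];
    long read_by_p0;
};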

Figure 2: False sharing between processors with FFT

5.3 Evaluation of Different Cache Organizations and Schemes

Besides providing performance data for cache optimization, the implemented cache simulator can also be used as a platform to study various cache organizations as well as prefetching and replacement schemes. In this subsection, we demonstrate several examples.

Cache Organization. First we examine the impact of different cache associativities and block sizes. For this experiment, we tested a variety of sequential applications from the NAS and SPEC benchmarks. Figures 3 and 4 depict the experimental results.

Figure 3: L1 misses with various associativities

Figure 3 demonstrates the variation of the number of cache misses with the set size, which has been chosen as 1, 2, 4, and 8, representing directly-mapped and traditional set-associative caches. In principle, the number of cache misses should decrease as the set size increases, because more mapping possibilities exist. As shown in Figure 3, most applications follow this trend. However, WATER presents a different behavior, where an 8-way cache shows more misses than a 4-way cache. A similar pattern can be seen with the changes of the block size. As illustrated in Figure 4, which shows L1 misses with different block sizes, for all applications except FFT, more cache misses are caused with a block size of 128 bytes. For FFT and LU, the best performance is achieved with a block size of 64 bytes, while for the others a block size of 32 bytes performs best. This different behavior is caused by the different access patterns of the applications. For caches of the same size and set size, a larger block size implies fewer blocks. Hence, only applications with higher spatial locality benefit from larger blocks. For applications with less spatial locality, data within the same block usually cannot be used; at the same time, with fewer blocks available, more conflicts exist, resulting in more cache misses.

Figure 4: L1 misses with different block sizes

Overall, both diagrams indicate that it depends on the access pattern of the application whether a chosen set or block size behaves well. This also means that fine tuning of the cache configuration is needed in order to achieve a significant performance gain. For the FFT code in Figure 4, for example, only 62% of the cache misses are caused when using a cache block of 64 bytes rather than a block of 16 bytes.

Cache Prefetching. Now we examine the influence of different prefetching schemes. For this we measured cache misses in three cases: no prefetching (N-Prefetching), traditional prefetching (T-Prefetching), which preloads one following block, and block prefetching (B-Prefetching), which preloads a user-specified number of successive blocks (for this experiment, 2 blocks). We tested several SPLASH applications and two self-coded small kernels, and measured the total L1 misses. The experimental results are depicted in Figure 5.

Examining Figure 5, it can be seen that for FFT and LU the number of total misses is reduced by prefetching and further by block prefetching. Similarly, prefetching reduces the cache misses of WATER, but block prefetching performs worse than the traditional scheme. For MATRIX, however, both prefetching schemes introduce more cache misses, and for SOR only block prefetching shows a slight reduction in cache misses. Overall, it depends on the access pattern whether an application benefits from prefetching. For applications with spatial locality, cache misses can be decreased through prefetching, and block prefetching further reduces cache misses if large spatial locality exists. For other applications with less spatial locality, however, prefetching potentially introduces more cache misses, because frequently reused data is evicted from the cache due to cache capacity or mapping conflicts. This result indicates that an adaptive policy, which can dynamically switch between prefetching and no prefetching and change the fetching size, is essential for achieving efficient cache behavior.

Figure 5: L1 misses with different prefetching schemes

Replacement Strategies. As described in the previous section, we have implemented two algorithms for cache line remapping in case of interference: the so-called Set-LRU, which may place a block in a rarely used cache set, and Global-LRU, which applies the traditional LRU policy to the whole cache. We compare these replacement policies with the traditional LRU scheme.
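The following sketch outlines the Set-LRU victim selection described in Section 4; the data structures and the factor-of-two threshold are simplified assumptions, not the simulator's actual implementation.

#define NUM_SETS 4   /* illustrative cache geometry */

static unsigned long set_accesses[NUM_SETS];   /* per-set reference counters
                                                  (the set-description table) */

/* Returns the set that should receive the missing block. */
static int choose_victim_set(int mapping_set)
{
    int lru_set = 0;
    for (int s = 1; s < NUM_SETS; s++)         /* find the least used set */
        if (set_accesses[s] < set_accesses[lru_set])
            lru_set = s;

    /* Remap only when the mapping set is accessed at least twice as often
     * as the coldest set; otherwise behave like conventional LRU. The
     * remapped block is then placed in the LRU line of that set, and its
     * access time is doubled in the simulation. */
    if (set_accesses[mapping_set] >= 2 * set_accesses[lru_set])
        return lru_set;
    return mapping_set;
}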

Figure 6: L1 misses with different replacement schemes

Figure 6 shows the total number of cache misses with the different replacement policies. It can be observed that both novel schemes generally perform better than the traditional LRU (T-LRU), with Set-LRU (S-LRU) slightly outperforming Global-LRU (G-LRU). MATRIX and SOR, however, are exceptions. For MATRIX, G-LRU yields the best performance, with 16% fewer misses than the traditional LRU approach. For SOR, on the other hand, both novel schemes show worse performance, especially the G-LRU scheme. This is probably caused by SOR's working characteristics. SOR iteratively solves partial differential equations. Its main working set is a large dense matrix. It begins by initializing the boundary values and then computes the values inside the matrix with the given boundary conditions. The calculation terminates after a certain number of iterations. This means that all elements in the matrix are needed, sooner or later, during each iteration. Hence, the G-LRU scheme, which evicts data that was used earlier but is reused in the next iteration, causes more frequent replacements and thereby more misses and overhead.

6. CONCLUSIONS

In this paper we present a novel cache simulator specifically designed to support cache optimization at both the code level and the system level. For this, the cache simulator models not only the basic cache activities, but also different cache organizations and management strategies like replacement and prefetching policies. More specifically, it delivers information about the characteristics of cache misses and about the runtime accesses in the cache and to the data structures. With the support of a visualization tool, this information directs the user step by step to the access bottlenecks, their reasons, and the optimization strategies to tackle them. In addition, the cache simulator can also be used as a research platform to study the impact of different prefetching, replacement, and configuration schemes on the cache performance of applications. The optimization of a sample code and the experimental results with these schemes have demonstrated the feasibility of this simulation tool. In the next step of this research work, we will conduct cache optimization on real machines and realistic applications. These applications will first be simulated and then analyzed with the visualization tool. The optimization will be verified through actual execution. For this, performance counters [7, 16] will be used to provide performance metrics like the number of cache misses and CPU cycles.

7. REFERENCES

[1] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35(2):59–67, 2002.
[2] E. Berg and E. Hagersten. SIP: Performance Tuning through Source Code Interdependence. In Proceedings of the 8th International Euro-Par Conference, pages 177–186, August 2002.
[3] D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, March 1994.
[4] J. L. Henning. SPEC CPU2000: Measuring CPU Performance in the new Millennium. IEEE Computer, 33(7):28–35, 2000.
[5] S. A. Herrod. Using Complete Machine Simulation to Understand Computer System Behavior. PhD thesis, Stanford University, February 1998.
[6] H. C. Hsiao and C. T. King. MICA: A Memory and Interconnect Simulation Environment for Cache-based Architectures. In Proceedings of the 33rd IEEE Annual Simulation Symposium (SS 2000), pages 317–325, April 2000.

[7] Intel Corporation. Intel Itanium Architecture Software Developer's Manual, volume 1–3. 2002. Available at http://developer.intel.com/design/itanium/manuals/iiasdmanual.htm.
[8] T. L. Johnson, M. C. Merten, and W. W. Hwu. Run-time Spatial Locality Detection and Optimization. In Proceedings of the International Symposium on Microarchitecture, pages 57–64, December 1997.
[9] P. S. Magnusson and B. Werner. Efficient Memory Simulation in SimICS. In Proceedings of the 28th Annual Simulation Symposium, Phoenix, Arizona, USA, April 1995.
[10] M. Martonosi, A. Gupta, and T. Anderson. Tuning Memory Performance of Sequential and Parallel Programs. Computer, 28(4):32–40, April 1995.
[11] S. S. Mukherjee, S. K. Reinhardt, B. Falsafi, M. Litzkow, S. Huss-Lederman, M. D. Hill, J. R. Larus, and D. A. Wood. Wisconsin Wind Tunnel II: A Fast and Portable Parallel Architecture Simulator. IEEE Concurrency, 8(4):12–20, October 2000.
[12] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 International Conference on Computer Design, October 1996.
[13] V. S. Pai, P. Ranganathan, and S. V. Adve. RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors. In Proceedings of the Third Workshop on Computer Architecture Education, February 1997.
[14] B. Quaing, J. Tao, and W. Karl. YACO: A User Conducted Visualization Tool for Supporting Cache Optimization. In High Performance Computing and Communications: First International Conference, HPCC 2005, Proceedings, volume 3726 of Lecture Notes in Computer Science, pages 694–703, Sorrento, Italy, September 2005.
[15] G. Rivera and C. W. Tseng. Data Transformations for Eliminating Conflict Misses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 38–49, Montreal, Canada, June 1998.
[16] E. Welbon et al. The POWER2 Performance Monitor. IBM Journal of Research and Development, 38(5), 1994.
[17] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.
[18] Cachegrind: a cache-miss profiler. Available at http://developer.kde.org/~sewardj/docs-2.2.0/cg_main.html#cg-top.
