Worst Case Behavior of CPU Caches

Tobias John and Robert Baumgartl
Chemnitz University of Technology
09107 Chemnitz, Germany
{tobias.john,robert.baumgartl}@cs.tu-chemnitz.de
Abstract

CPU caches reduce main memory access times and thereby significantly improve program performance. Various cache architectures with different strategies and mechanisms have been developed. However, caches have a negative impact on the predictability of execution timing, which is crucial in real-time systems. A precise understanding of the available cache architectures is therefore essential. Although IA32 processors share a common instruction set, their CPU cache architectures differ significantly. In this paper we therefore describe a methodology for constructing realistic worst cases for CPU caching. We compare and evaluate several IA32 processors with respect to efficiency and predictability of execution timing. Our methodology builds on RTAI and the performance monitoring registers of the processors and generates very precise results in comparison to indirectly measuring execution times by counting clock cycles. At the same time, our micro benchmarks remain easily extendable and adaptable.
1 Introduction
For satisfying complex and compute-intensive tasks, real-time systems must rely more and more on features of modern processor architectures. Looking at microprocessor evolution, it is obvious that processors used in today's state-of-the-art personal computers can be expected to make their way into future embedded systems soon. As an example, memory management units (MMUs) and caches have been introduced into real-time systems. Unfortunately, these features tend to complicate the estimation of precise worst-case execution times, which is essential for real-time systems. Therefore, it is necessary to analyze and understand the real-time behavior of modern processor architectures. Although processor caches in general have been the focus of several real-time research projects, this is not the case for the caches of the Intel IA32 architecture, because details of their inner workings have not been published. This paper therefore describes how to construct realistic worst cases of cache access operations for this specific processor architecture. We present a measurement methodology for obtaining very accurate access time measurements and apply it to different IA32 processors.

The remainder of this paper is structured as follows: Section 2 describes basic caching principles and defines our terminology. Some specific worst cases for different cache configurations are discussed in section 3. Section 4 describes how these worst cases can be constructed and measured. The results we obtained during our experiments are presented in section 5. The paper closes with some conclusions and a short outlook on future work.
2 Caching Principles
This section gives a short introduction to recent CPU cache architectures and thereby establishes the terms and definitions used in this paper. Data is indexed in caches by its address in main memory. Because caches are much smaller than main memory, only a part of the address is needed as index. Hence, different main memory addresses map to the same cache location. In a direct-mapped cache, every cacheable datum can be stored at exactly one location within the cache. Usually, this leads to many replacement operations, which lowers caching efficiency. Therefore, n-way set-associative caches have been introduced, which provide n potential locations for a datum to be cached (cf. figure 1). The least significant part of the address indexes a set of the cache (a row in figure 1), in which a datum can be stored in different locations (the columns in figure 1).
FIGURE 1: Simplified Structure of a Cache
The basic unit of cache data is the cache line, which comprises a block of bytes for efficiency reasons. We denote cache lines by uppercase letters and an individual datum within a line by the corresponding lowercase letter.
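To make the indexing scheme concrete, the following C sketch decomposes a 32-bit address into tag, set index, and line offset. The cache parameters are illustrative assumptions of our own, not those of any particular processor discussed here.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative parameters only (assumed): 16 KiB cache,
     * 32-byte lines, 4 ways -> 128 sets. */
    #define LINE_SIZE  32u
    #define NUM_WAYS   4u
    #define CACHE_SIZE (16u * 1024u)
    #define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))

    int main(void)
    {
        uint32_t addr = 0x08049a64u;                      /* arbitrary address */

        uint32_t offset = addr % LINE_SIZE;               /* byte within line  */
        uint32_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* selects the row   */
        uint32_t tag    = addr / (LINE_SIZE * NUM_SETS);  /* identifies line   */

        printf("addr=0x%x -> tag=0x%x, set=%u, offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)set, (unsigned)offset);
        return 0;
    }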
Depending on whether a modified datum is written to main memory immediately or only when the cache line is replaced, two different cache write policies are distinguished: write-through and write-back, respectively. If data in the cache is frequently modified, the latter usually performs better.
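As a minimal illustration of the difference (our own toy accounting, not taken from the paper): consider n stores that all hit one cached line. Under write-through every store reaches main memory; under write-back only the eventual eviction does.

    #include <stdio.h>

    /* Toy accounting: main-memory writes caused by n stores
     * that all hit one and the same cached line. */
    int main(void)
    {
        unsigned n  = 1000;           /* stores to the same line          */
        unsigned wt = n;              /* write-through: every store       */
        unsigned wb = 1;              /* write-back: one final write-back */

        printf("%u stores -> write-through: %u writes, write-back: %u\n",
               n, wt, wb);
        return 0;
    }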
To further improve cache efficiency, cache hierarchies exist, consisting of several caches between the CPU and main memory. The nearer a cache is located to the CPU, the smaller and faster it is. We denote the cache levels by L1, L2, and so on. If every line of Ln is also present in Ln+1, the hierarchy is called inclusive. If, on the other hand, Ln and Ln+1 cache different data, the hierarchy is exclusive. This paper focuses solely on two-level hierarchies, but the principles can be applied to deeper hierarchies as well.

3 Worst Case Configurations

3.1 Single Cache

Figure 2 illustrates the worst case for reading a direct-mapped cache with write-back policy.

FIGURE 2: Worst Case for Single Write-Back Cache

The CPU requests datum x, which is not yet in the cache and therefore has to be fetched from main memory. If, however, the location where the corresponding line X can be stored in the cache is already occupied by a modified line Y, then Y has to be written back to memory to free up space before X can be loaded. If the cache is 2-way associative, both possible locations must be occupied (and one of them is written back). A write-through policy is easier to understand: no data has to be written back when a cache miss occurs, because cache contents and the associated main memory are always consistent.
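The following sketch shows one way such a worst case can be provoked in software. It is our own illustration using the standard conflict-miss construction with assumed parameters; the paper's actual benchmarks are described in section 4. All addresses lie a multiple of NUM_SETS * LINE_SIZE apart and therefore compete for the same set.

    #include <stdint.h>

    /* Assumed parameters: 32-byte lines, 128 sets, 4 ways.  Writing to
     * NUM_WAYS + 1 lines that map to one set dirties more lines than the
     * set can hold, so every further access misses and must first write
     * back a modified line. */
    #define LINE_SIZE  32u
    #define NUM_SETS   128u
    #define NUM_WAYS   4u
    #define WAY_STRIDE (NUM_SETS * LINE_SIZE)  /* distance between rivals */

    static uint8_t buf[(NUM_WAYS + 1) * WAY_STRIDE];

    void provoke_conflict_misses(void)
    {
        unsigned i, round;

        /* Dirty NUM_WAYS + 1 lines competing for one set. */
        for (i = 0; i <= NUM_WAYS; i++)
            buf[i * WAY_STRIDE] = 1;     /* write -> line becomes modified */

        /* With LRU replacement, each of these accesses now misses and
         * triggers a write-back of a dirty line first. */
        for (round = 0; round < 100; round++)
            for (i = 0; i <= NUM_WAYS; i++)
                buf[i * WAY_STRIDE]++;
    }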
3.2 Inclusive Cache Hierarchy

The worst case for a two-level cache hierarchy is more complex and more expensive than for a single cache. Löser et al. [1] describe a worst case for a two-level inclusive caching hierarchy in which L1 uses a write-back strategy and is at least two-way associative. Its structure is shown in figure 3 by means of a simplified one-way cache.
FIGURE 3: Worst Case for Inclusive Two-Level Hierarchy
The CPU accesses d, which is stored neither in L1 nor in L2 and has to be fetched from memory. Cache line D, which holds d, maps to entry zi of L1, which is already occupied by line A. Therefore, to cache D in L1, line A has to be written to L2. Unfortunately, the modified line B occupies the corresponding entry in L2, hence A is written through to memory. To load D into L2, entry yn.zi is the only one that can be used. The modified line C has to be written back to memory to free the necessary space. Hence, a single read miss causes two write operations to main memory. This worst case is sometimes called the double-purge configuration. The characteristic property of this configuration is that, although an inclusive caching architecture is used, it is possible to fill the cache levels in such a way that data in L1 is not present in L2. The contents of both caches have to be modified so that lines cannot simply be overwritten but must be written back to main memory first. These conditions are met by the Intel PII and PIII processors, but not by the P4. The latter features a write-through L1 cache, which makes it impossible to store modified data in L1 only.
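Step by step, a single read of d thus unfolds as follows (labels as in figure 3): (1) D maps to L1 entry zi, which holds the modified line A; A's slot ym.zi in L2 is occupied by the modified line B, so A is written directly to main memory (first write). (2) D is loaded into L2 entry yn.zi, evicting the modified line C to main memory (second write). (3) D is loaded into L1 and d is delivered to the CPU. One read miss, two main-memory writes.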
3.3 Exclusive Cache Hierarchy

The AMD Athlon64 uses an exclusive caching architecture, i.e., data is cached either in L1 or in L2. Uncached data can be loaded directly into L1 without being transferred to L2 first. Hence, the worst case is accessing data that is not cached while L1 and L2 are already filled with modified ("dirty") lines. To load a new datum into L1, one cache line has to be moved into L2 to free up space. Because L2 is completely occupied with modified data, a line of L2 is written back to memory. A double-purge configuration does not exist. Interestingly, the Athlon64 provides another cache memory between L1 and L2, the so-called victim buffer, which stores a small number (e.g. eight) of lines that have been evicted from L1 to L2. As long as the victim buffer is not completely filled, it prevents data from being moved into L2 and lines from being written back from L2 into main memory. However, if the victim buffer is full, the next line evicted from L1 causes a full flush of the buffer: all victim buffer entries are transferred into L2, and if no free or invalid lines are available in L2, an equal number of lines is written to main memory. That means, in the worst case, if the victim buffer can hold up to lVB lines, the cache miss of datum x may result in one read from and lVB write operations to main memory.

Obviously, different caching architectures with varying strategies and numbers of cache levels have different worst cases. Even for processors of the same architecture family, the timing of the worst-case memory access may differ significantly.

4 Constructing Worst Cases

The discussion of several worst case configurations demonstrated clearly that detailed knowledge of the underlying caching architecture is necessary, and that besides structural parameters (e.g. cache size or associativity), functional behavior such as the replacement strategy or the caching policy needs to be known in advance. Unfortunately, many of these parameters are not well-documented and often must be inferred from experiments. This analysis has been done in a similar manner as the worst case experiments described here. The methodology basically consists of making guesses about properties of the cache to be analyzed and testing these guesses one after another by crafting well-defined cache contents, performing specific access patterns, and analyzing the resulting cache state. A discussion of these tests is beyond the scope of this paper; we refer the interested reader to [2] instead.

4.1 Experimental Platform

The algorithm described in [1] to achieve a double-purge configuration is quite complex, cannot easily be modified, and has some limitations. Because our aim is to compare worst cases for different IA32 architectures, we needed a more general and less complicated method. We therefore access memory at consecutive addresses to fill the cache line by line. In doing so, address calculation, and thereby the complexity, is reduced to a minimum. This makes the algorithms easy to understand and well adaptable.

An ideal experimental platform for our purposes is the Linux Real-Time Application Interface (RTAI). By implementing our benchmarks as RTAI modules we are able to eliminate any timing interference from user-space applications and the Linux kernel itself. Additionally, because RTAI modules reside in kernel space, no memory access restrictions exist, which eases implementing arbitrary access patterns to manipulate cache contents. For drawing precise conclusions and for verifying the assumptions made about the underlying architecture, we utilize the performance monitoring capabilities of modern IA32 microprocessors.
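The following skeleton sketches, in strongly simplified form, how such a kernel-space measurement can be organized. It is our own minimal example, not the authors' RTAI module: it uses plain Linux module entry points, reads a previously programmed performance counter (cf. the setup sketch in section 4.2), and omits the RTAI specifics and serializing instructions.

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/irqflags.h>
    #include <asm/msr.h>             /* rdmsrl() */

    #define MSR_PERFCTR0 0x0c1       /* P6-family performance counter 0 */

    static volatile unsigned char target[64];

    static int __init wc_bench_init(void)
    {
        unsigned long flags;
        unsigned long long before, after;

        local_irq_save(flags);       /* avoid perturbation by interrupts */
        rdmsrl(MSR_PERFCTR0, before);
        (void)target[0];             /* the access under measurement     */
        rdmsrl(MSR_PERFCTR0, after);
        local_irq_restore(flags);

        printk(KERN_INFO "counted events: %llu\n", after - before);
        return 0;
    }

    static void __exit wc_bench_exit(void) { }

    module_init(wc_bench_init);
    module_exit(wc_bench_exit);
    MODULE_LICENSE("GPL");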
4.2 Implementation Details
The software for measuring the worst case cache access distinguishes three different initial cache configurations:

1. untouched,
2. "filled",
3. "flooded".
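In code, these three start states could be encoded as follows (a hypothetical encoding; the identifiers are ours, not the benchmark's):

    /* Initial cache state selected for a measurement run. */
    enum cache_state {
        CACHE_UNTOUCHED = 0,  /* leave caches as found: best case           */
        CACHE_FILLED    = 1,  /* both levels filled with modified data      */
        CACHE_FLOODED   = 2   /* architecture-specific worst case, sec. 4.2 */
    };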
The first configuration leaves the caches untouched and therefore reflects the best case. The second configuration simply fills both cache levels with modified data, leaving the caches in a state that is likely to occur in everyday scenarios. The third configuration constructs the true, processor-specific worst case.

The "flooded" state depends on the underlying architecture; it is described here on the basis of a two-level inclusive architecture as shown in figure 3. As pointed out before, the key is to have different modified data in L1 and L2. To this end, L1 is filled once (cf. figure 5) and afterwards only one of its cache ways is used to load new data into L2. The other ways are continuously "touched" to keep their data in L1. (Therefore, an L1 associativity greater than one is necessary.) This procedure is repeated until L2 is completely filled. The resulting configuration is shown in figure 6: the rightmost way of L1 has been used to load new data, whilst the leftmost three ways still hold their original content. Up to this point the content of L1 is still available in L2, but if we continue to load new data into L2 through the single L1 cache way, the oldest lines in L2 (those that we are continuously touching to keep them in L1) will be replaced. Finally, L1 and L2 contain different, modified data, except for the one L1 way used to fill L2 (cf. figure 7). But because this way is the most recently used, it will be replaced last, and by that point the other ways of L1 will already have been replaced, leading to the described worst case. Figures 5 to 7 are based on the Intel PII cache architecture.

To quantify the influence of branch prediction, all configurations allow the data references to be performed either in a loop or in a branch-free sequence. Furthermore, it is possible to execute a chunk of instructions prior to the test to fill the instruction cache; those instructions do not have any influence on the measurement. Figure 4 presents the structure of the benchmark, using assembly instructions common to the x86 architecture.
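The flooding procedure can be sketched as follows. The code is our own illustration of the described algorithm with assumed, roughly PII-like parameters (16 KiB four-way L1, 512 KiB L2, 32-byte lines); it relies on an LRU-like replacement so that the continuously touched lines stay resident in L1.

    #include <stdint.h>

    #define LINE     32u
    #define L1_SETS  128u
    #define L1_WAYS  4u
    #define L1_SIZE  (L1_SETS * L1_WAYS * LINE)   /* 16 KiB  */
    #define L2_SIZE  (512u * 1024u)               /* 512 KiB */

    static uint8_t keep[L1_SIZE - L1_SETS * LINE]; /* pinned in L1 ways 0..2  */
    static uint8_t flood[L2_SIZE];                 /* streamed through way 3  */

    void flood_caches(void)
    {
        unsigned i, f = 0;

        /* 1. Fill L1 with modified lines. */
        for (i = 0; i < sizeof(keep); i += LINE)
            keep[i] = 1;

        /* 2. Stream new modified lines through the remaining way while
         *    continuously touching the pinned lines, so that LRU keeps
         *    them in L1; stop once L2 has been filled completely. */
        while (f < sizeof(flood)) {
            for (i = 0; i < sizeof(keep); i += LINE)
                keep[i]++;             /* keep ways 0..2 most recently used */
            flood[f] = 1;              /* new modified line enters via way 3 */
            f += LINE;
        }
    }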
FIGURE 4: Structure of the Benchmark

    rdmsr(msr, ull); ull = ull | ((1UL ...
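The surviving fragment above sets a bit in a model-specific register read through an rdmsr(msr, ull) macro. A self-contained reconstruction of such a counter setup, using the standard Linux rdmsrl()/wrmsrl() helpers, could look as follows; it is our own code, with the MSR addresses and control bits of the P6 family and a placeholder event code (consult the processor manual for the event of interest).

    #include <linux/module.h>
    #include <asm/msr.h>             /* rdmsrl()/wrmsrl() */

    #define MSR_PERFEVTSEL0 0x186    /* event select register, counter 0  */
    #define MSR_PERFCTR0    0x0c1    /* the performance counter itself    */
    #define EVSEL_EN  (1UL << 22)    /* enable counting                   */
    #define EVSEL_OS  (1UL << 17)    /* count in kernel mode              */
    #define EVSEL_USR (1UL << 16)    /* count in user mode                */
    #define EVENT_CODE 0x43UL        /* example event code (assumed)      */

    /* Program counter 0; to be called before a measurement such as the
     * skeleton shown in section 4.1. */
    static void setup_counter(void)
    {
        unsigned long long ull;

        wrmsrl(MSR_PERFCTR0, 0);                   /* clear the counter   */
        rdmsrl(MSR_PERFEVTSEL0, ull);              /* read current config */
        ull |= EVSEL_EN | EVSEL_OS | EVSEL_USR | EVENT_CODE;
        wrmsrl(MSR_PERFEVTSEL0, ull);              /* start counting      */
    }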