Compiler-Directed Physical Address Generation for Reducing dTLB Power

I. Kadayif, P. Nath, M. Kandemir, A. Sivasubramaniam
CSE Department, The Pennsylvania State University
University Park, PA 16802, USA
{kadayif,nath,kandemir,anand}@cse.psu.edu

Abstract

Address translation using the Translation Lookaside Buffer (TLB) consumes as much as 16% of the chip power on some processors because of its high associativity and access frequency. While prior work has looked into optimizing this structure at the circuit and architectural levels, this paper takes a different approach of optimizing its power by reducing the number of data TLB (dTLB) lookups for data references. The main idea is to keep translations in a set of translation registers, and intelligently use them in software to directly generate the physical addresses without going through the dTLB. The software has to work within the confines of the translation registers provided by the hardware, and has to maximize the reuse of such translations to be effective. We propose strategies and code transformations for achieving this in array-based and pointer-based codes, looking to optimize data accesses. Results with a suite of Spec95 array-based and pointer-based codes show dTLB energy savings of up to 73% and 88%, respectively, compared to directly using the dTLB for all references. Despite the small increase in instructions executed with our mechanisms, the approach can in fact provide performance benefits in certain cases.
1. Introduction

Reduction of power consumption is an imperative consideration in chip design across both embedded and high-end systems [2, 3, 5]. Beyond helping to extend battery lifetime, it has important consequences for the layout of circuits and for the cooling/packaging requirements needed to keep them operating within thermal constraints [4]. It also affects the reliability of storage and combinational circuits, since power optimization at the circuit level usually involves lower voltages, which in turn can lead to transient errors. In this paper, our focus is on the Translation Lookaside Buffer (TLB), a cache of recent page table entries that is consulted on every memory system reference to translate a virtual address to a physical address. Because of its criticality in memory system lookup, this structure is very carefully designed at the circuit level to provide low lookup latencies, and is usually a highly associative (fully associative in many cases) structure to reduce miss rates. Such a structure that is referenced
on every memory system lookup (even if the data is in the cache) results in a significant amount of dynamic power consumption (i.e., the switching power consumed when a component is exercised). Further, since this is typically a small structure (usually at most 128 to 256 entries), its power density (which is more critical to cooling), being inversely related to area, is also quite high, making it even more important for power optimizations. Consequently, several studies have attempted dynamic power optimizations for TLBs [7, 9, 13]. In this paper, we investigate how we can generate physical data addresses directly, without going to the data TLB (dTLB), as much as possible under software control. The idea is to keep a small set of virtual-to-physical "translation registers" (TRs) that maintain translations for select pages, which the software directly uses to generate physical addresses. In this approach, the software specifies exactly which translation registers are to be used for which data references, and employs them only when it is certain that the translation is present there. Otherwise, the default mechanism that goes via the dTLB is employed conservatively. By avoiding going through the dTLB whenever possible, we save the dynamic power consumed on those accesses. The dynamic power (due to bit switching) is typically a function of the size of the structure that is accessed: instead of accessing a fully associative array, we selectively use one of our translation registers. Instructions for loading TRs and for directing memory references to go through a specific TR can be exposed by the hardware. These instructions can be employed by the application directly, used by the compiler when the code is statically analyzable, or even exploited in the runtime environment.
In this paper, we demonstrate techniques by which the compiler can effectively use such instructions to optimize the dTLB power of array-based and pointer-based applications. Our detailed simulations of these software optimizations on array-based and pointer-based Spec95 codes show dTLB power savings of as much as 73% and 88%, respectively. The next section gives details of the hardware support we propose for software exploitation. Section 3 explains the software techniques used for array-based and pointer-based codes. Section 4 describes the experimental setup, and Section 5 presents detailed results. Finally, Section 6 summarizes the contributions of this paper.
2. Proposed Hardware Enhancements

2.1. Enhancements

Our extensions to the hardware include the following:

• We propose to have (n − 1) translation registers (TRi) in the hardware, each holding a page table entry (the physical frame number plus protection and book-keeping bits).

• These registers are loaded with a page table entry using the following instruction exported by the Instruction Set Architecture (ISA):

    loadTR <virtual page number>, TRi

When this instruction is issued, the hardware uses the given virtual page number and goes to the dTLB to get the corresponding entry (or to the page table if it is not in the dTLB). It then puts this entry (which consists of the physical frame number, protection, and other book-keeping bits such as modified/referenced) into the specified TRi. The program is not allowed to specify the physical address or the protection bits that get loaded into the TR, as that would compromise protection.

• We also need a modification to the functioning of a normal load/store instruction. When a load/store is issued, it presents the virtual address (for the memory location) in any of the addressing modes. We use some bits of this address to indicate whether to go through the dTLB or to find the translation within a specific TR. Most CPUs are gradually becoming 64-bit architectures, which provide a very large virtual address space, and it will take a while before this space is fully utilized. We suggest using the top log n bits (where n − 1 is the number of TRs) to indicate whether the address should go through the dTLB for a translation (which corresponds to all these log n bits being 0) or whether to take the physical frame number from a TR, and if so, which specific one. In our approach, the compiler is responsible for setting these bits, which the runtime will see. Whenever the compiler knows for sure that a TRi has the translation that is needed, it will force these bits to correspond to that TRi.
Otherwise (i.e., if the compiler is not 100% sure), it will conservatively set them to 0 so that the translation goes through the dTLB. In this latter (default) case, the functionality and power/performance ramifications are no different from an architecture without these enhancements. We also considered the option of associating these log n bits (also called the "translation bits") with the instruction itself, but decided against it for two reasons. First, it increases instruction size (and decode logic complexity). Second, the same load instruction may look up different registers (or sometimes go to the dTLB) in different dynamic invocations; implementing this would require code expansion by the compiler. Consequently, we decided to embed these translation bits in the virtual address specification of the instruction. The important part of the dTLB power optimization process is then to decide how these bits for
the address generation should be set for each load/store instruction (this will be described in Section 3). Let us now look at the dL1 addressing mechanisms in conjunction with our enhancements. Once the memory reference instruction with appropriate bits is encountered, the hardware draws the physical frame number from the corresponding TR for the cache lookup in a PI-PT L1 indexing scheme (without going to the dTLB). This removes the dTLB from the critical path, potentially improving performance in addition to saving power. In a VI-PT L1, the cache indexing is directly done with the virtual address (as without our enhancements), and the physical address that is needed for tag comparison comes from the TR and the page offset. This provides power savings without any performance differences. Finally, in a VI-VT L1, cache lookup is done as without our enhancements, but upon an L1 miss we proceed as in PI-PT without the dTLB getting in the critical path (thus, getting both performance and power savings).
2.2. OS Support It is to be noted that we are not compromising on protection in any way since the program/compiler cannot manipulate the physical address or protection bits of the TRs. In the worst case, when the program/compiler wrongly uses a TR (due to a bug), at most, it would access/corrupt its own memory (and not any other address space). As far as the operating system is concerned, when context switching from one address space to the next, the TRs can be treated as normal registers, and should be saved/restored. While this may present some overhead, since context switches are rare, we do not expect this to be much of a problem. It may happen that a page mapping changes (to another physical location or the page is evicted from memory), between the time the program loaded the translation to a TR and its subsequent usage. In this case, the operating system will effect the corresponding changes in the page table and dTLB (as is the norm), together with making the necessary changes (invalidation or update) in the TRs storing the corresponding virtual page number. When a subsequent load instruction goes through this TR, it would either have the new mapping or will indicate that it is invalid, triggering a page fault. When the OS brings in this page, apart from updating the page table/dTLB, it also updates the corresponding TR.
3. Software Support 3.1. Basic Mechanism Consider a program with three global variables x, y, and z, in memory with the first two being on the same virtual page (statically analyzable) and the third on a different page, and let us assume one TR available in the hardware for simplicity. When the compiler encounters a high-level statement such as: x = x + y + z;
it will generate the following pseudo machine code:

loadTR &x, TR1                          // page table entry of the virtual page
                                        // of x (specified as immediate) is
                                        // loaded into TR1
loadc (&x OR 0x8000000000000000), Rtx   // address of x OR-ed (an assembler
                                        // directive) to indicate that
                                        // translation is through TR1
loadc (&y OR 0x8000000000000000), Rty
loadc (&z), Rtz                         // no TR used for accesses to z
load (Rtx), Rx                          // loads x into Rx; uses TR1 for
                                        // translation, no dTLB access
load (Rty), Ry                          // loads y into Ry; uses TR1 for
                                        // translation, no dTLB access
load (Rtz), Rz                          // loads z into Rz; goes through dTLB
add Rx, Ry, Rx
add Rx, Rz, Rx
store (Rtx), Rx                         // stores into x; uses TR1 for
                                        // translation, no dTLB access
In this code, we use TR1 for translating accesses to x and y, and only z is directly accessed through the dTLB. The first instruction (loadTR) loads the translation (from either the dTLB or the page table) of the page containing x (which also contains y) into TR1. The next instruction (load constant, loadc), which would be there even in a normal compiled code, loads the address of x (specified in immediate format) into register Rtx. The only difference is that we need to indicate that TR1 needs to be used. We do this by setting the most-significant bit of this 64-bit address (since we have only 1 register, only 1 bit is needed to distinguish translation through the TR from translation through dTLB) which is also done by the assembler — “statically” in this case since addresses are known at compile time. In cases where such addresses are determined at run time (e.g., as in the case of stack-resident variables), these bit settings may also need to be effected as an instruction, though the overheads for doing this were not overbearing. The loading for address y works similarly, but z is loaded based on translations from the dTLB. Once the arithmetic operations are done, the store back to x does not need to go through the dTLB again. Based on our discussion above, we classify the data references of interest in our programs into the following four broad categories for appropriate compilation support: • Global scalar variables: The above example captures our transformations for handling references in this category, trying to optimize successive references of global scalar variables (x and y) that can use the same translation. • Global array variables: These data structures are typically addressed in loops, and Section 3.2 will cover high-level transformations (loop restructuring) that enable the use of the TRs. 
• Heap data references: These are references to dynamically created (using malloc() or similar routines) data structures that are typically accessed through pointers. Mechanisms for employing TRs for these structures are discussed in Section 3.3. • Stack data references: Stacks typically exhibit very good locality (especially in Fortran and C codes where arrays are passed by reference/pointer), i.e., references out of the current — top of stack — page are quite low. This suggests that employing very few translation registers (even one) would suffice for stack accesses. In fact,
[Figure 2 appears here. Panel (a) defines two linked-list structures; panel (b) traverses them with pointers p and q in a loop whose recoverable body is:
    s += p->a + q->b; p = p->next; q = q->next;
Panel (c) shows two consecutive iterations: within iteration (i+1), the references ...p->a... and ...p->b... use TR1, and ...p->next... uses TR1 and then loads TR1 for the next instance.]

Figure 2. Using TR for optimizing a pointer-based code fragment.
explained above.

3.2.1. Additional Optimizations

In general, techniques which improve spatial and temporal locality [8, 12] also help improve the reuse of TR contents, i.e., minimize the number of loadTR instructions and/or the number of required TRs. Consequently, techniques such as loop interchange [15] and tiling [15] (if the tile size and shape are chosen appropriately) can increase the effectiveness of our approach. Beyond such techniques, there are other opportunities to provide better TR reuse. Figure 4(a) shows loop distribution [15], wherein an original loop is distributed across the statements in its body. Note that, in this example, while the original code requires two TRs to cover all the references, the loop-distributed code requires only one TR. In our experiments, we also evaluate the impact of loop distribution on the effectiveness of our approach.
3.3. Heap Accesses

The granularity at which we intend to exploit translation reuse for heap accesses is a chunk of memory allocated by one malloc() call, which is then used to implement recursive data structures such as linked lists, trees, and graphs. Several codes allocate such structures dynamically from the heap and link them through pointers. We call the resulting unit of optimization an "instance" (e.g., of an object/structure). We need to track the point when an instance is touched and the point when the locality moves out of it, and reuse the translations within this epoch. The requirement we impose is that an instance smaller than a page is allocated completely within one page (i.e., does not straddle page boundaries). We explain our mechanism using the example shown in Figure 2(b), where two linked-list structures (defined in Figure 2(a)) are traversed by pointers p and q in a loop. Each time we come to an instance, we record the translation in a TR (if the hardware provides two TRs), and as long as we are within the instance, the
loads/stores for those fields can be directed through the TR. In this example, it can be seen that we go to two new instances, one pointed to by p and the other by q, by observing where p and q are updated. If the data pointed to by p or q (i.e., an instance) is contained entirely within a page, then we load the translation for it once (at the beginning of the iteration, by which time the addresses of the new instances are known) and reuse this translation for the three subsequent references to that instance. This is automated by the compiler (or the programmer can even give such directives). Figure 2(c) shows how TR1 is used in two consecutive iterations of the while-loop in Figure 2(b) for references through p. A similar optimization is performed for q. It should be emphasized that we are being very conservative in estimating page references. For instance, successive instances traversed by p may be in the same page, or, within an iteration, the instance pointed to by p and that pointed to by q may be within the same page. We conservatively look up the dTLB even in such cases, though it is conceivable that further runtime optimizations are possible by detecting such behaviors. Only when we are 100% certain that the contents of a pointer can use a TR do we attempt these transformations. In this simple example, even with our conservative approach, we find a 67% reduction in dTLB lookups for the instances pointed to by p and q. Note that our techniques are in a sense immune to aliasing-related problems, which are usually the main obstacle in compiler optimizations. Whenever we are not sure what instance a pointer is pointing to (e.g., as a result of poor aliasing analysis), we conservatively look up the dTLB, and use TRs only when we are sure that the current pointer access is to the same instance as the previous pointer access.
4. Simulation Setup

We used the SimpleScalar [6] simulator to conduct a detailed energy and performance evaluation; the configured parameters are given in Table 1. The configurations for the dTLB (both monolithic and multi-level) and the dL1, together with the per-access energy costs, are also given in Table 1. We conduct experiments with different numbers of TRs, and consider VI-PT, VI-VT and PI-PT dL1 configurations as explained earlier. All high-level code transformations were automated using the SUIF framework [1]. Energy numbers for the cache and TLB were obtained using a modified version of CACTI [14]. Note that one could optimize the TLB structures to get better per-access energy [7, 9], but that does not really affect the results to be presented, since we are more interested in the percentage dTLB energy savings. The application codes in the default case use -O2 compiler optimization.
5. Experimental Results

5.1. Array-Based Codes

Global array accesses are dominant in a lot of programs in the scientific computing and image/media processing domains. We randomly selected four array-intensive applications from Specfp95 (Fortran codes). These codes are shown in the upper part of Table 2, which gives the dL1 hits/misses (for both VI-PT and VI-VT cases), dTLB hits/misses (for a 128-entry monolithic structure), and the dTLB energy consumption (for both VI-PT and VI-VT cases). In these programs, array references dominate most of the memory instructions, compared to scalar variables, which are typically kept in registers [10]. Consequently, we mainly focus on these references, using the code restructuring described earlier. We present results with different numbers of TRs used for array references, and allocate at most one TR for the stack. Since global scalar references are not significant, we do not allocate any registers for them, and they go through the dTLB.

Simulation Parameter: Value

Processor Core
  RUU Size: 64 instructions
  LSQ Size: 32 instructions
  Fetch Queue Size: 8 instructions
  Fetch Width: 4 instructions/cycle
  Decode Width: 4 instructions/cycle
  Issue Width: 4 instructions/cycle (out-of-order)
  Commit Width: 4 instructions/cycle (in-order)
  Functional Units: 4 integer ALUs, 1 integer multiply/divide, 4 FP ALUs, 1 FP multiply/divide
Memory Hierarchy
  L1 Data Cache (dL1): 16KB, 4-way, 32-byte blocks, 1-cycle latency, write-back
  L1 Instruction Cache (iL1): 16KB, 2-way, 32-byte blocks, 1-cycle latency, write-back
  L2: 1MB unified, 4-way, 128-byte blocks, 10-cycle latency, write-back
  Data TLB (dTLB): one level, 128 entries, fully associative, 50-cycle miss penalty; or two levels, 1st level 1/2/4/8 entries, 2nd level 128 entries
  Instruction TLB (iTLB): 32 entries, fully associative, 50-cycle miss penalty
  Data Page Size: 4KB
  Instruction Page Size: 4KB
  Off-Chip Memory (DRAM): 128MB (divided into 32MB banks), 100-cycle latency
Branch Logic
  Predictor: bimodal with 4 states
  BTB: 1024 entries, 2-way
  Misprediction Penalty: 7 cycles
Cache Energy
  dL1: 0.4361 nJ; iL1: 0.2716 nJ; L2: 1.2912 nJ
dTLB Energy
  1-entry: 0.0800 nJ; 2-entry: 0.0844 nJ; 4-entry: 0.0888 nJ; 8-entry: 0.0913 nJ; 128-entry: 0.1492 nJ

Table 1. Default configuration parameters used in our experiments. All dynamic energy values are for a 0.1 micron process technology.

5.1.1. Base Results for VI-PT dL1

Reduction in dTLB Lookups. The first set of results showing the effectiveness of our approach in avoiding dTLB lookups is illustrated in Figure 3. These results are with a VI-PT dL1, and show the reduction in dTLB lookups when different numbers of TRs are made available to the array references (1, 2, 4, and 8) and stack references (0 and 1). On the x-axis, the pair (a,b) indicates "a" TRs for the stack and "b" TRs for the array references (refer to the earlier discussions on how these registers are allocated). This reduction directly manifests in savings in dTLB dynamic energy (i.e., whatever the underlying dTLB structure, we can provide the corresponding savings if we have the translation registers). The savings are shown for four different code versions. v1 is the base optimized version that uses loop strip-mining to force dTLB lookups at specific points (see the example in Figure 1). The v2 version also does loop strip-mining, except that it tries to use the same TR to optimize more than one reference to the same page of an array within a loop. The v3 version uses loop distribution on top of the v2 version to increase benefits, as explained in Section 3.2.1. The v4 version is similar to v3 but is more aggressive in applying loop distribution to enable TR reuse. More specifically, v4 allows loop distributions which can compromise cache locality, while v3 does not allow such distributions (see Figure 4 for examples illustrating the difference between v3 and v4). We first notice that providing just one TR dedicated to the stack, to capture the locality at its current activation, helps cut down dTLB lookups by 30% and 23%, respectively, in tomcatv and mgrid. On the other hand, we do not gain much for swim and hydro2d because the global data references dominate the memory accesses. This suggests that a straightforward TR allocation scheme that dedicates one register to the stack may not be a bad idea, but the global data references are much more important. In fact, in swim, it makes more sense to allocate this register to the global data instead (at least for some of the code versions). Providing even one register for global arrays produces significant savings in all the applications. As we increase the TRs for global arrays, dTLB lookups get lower still. This is particularly evident in tomcatv, swim and hydro2d, where we get over 95% reduction in dTLB lookups by using 8 TRs with some code versions. Even in mgrid, with 8 TRs we are able to get over 75% dTLB lookup reductions. It should be emphasized that such reductions in dTLB lookups directly correspond to energy savings. Comparing the code versions, we see that v2 makes significant improvements over v1, suggesting that reuse of TRs amongst references even within a loop iteration is very important. Further, when we move to v3 and v4, while the savings are a little better (at least for smaller numbers of TRs), they are not significantly different from those for v2. In fact, v3 and v4 could not be applied to mgrid and hydro2d. These results suggest that: (i) complex code transformations or sophisticated TR allocation strategies may not be needed, since even the v2 scheme can get very good energy savings across the applications; and (ii) from the hardware viewpoint, 4 TRs for the global arrays (and one for the stack) seem to suffice, since in most cases they provide savings as good as 8 TRs, cutting down lookups by over 80%.

Energy Comparison with Multi-level dTLBs. One could ask how good these savings are compared to a multi-level dTLB structure where the first level has the same number of entries as the number of TRs used in our scheme. Figure 5 shows the energy savings of our approach and those for multi-level dTLB structures, both normalized with respect to the energy for a monolithic 128-entry dTLB. The bars on the left side of each graph
Benchmark | dL1 Hits | dL1 Misses | dTLB (VI-PT) Hits / Misses | dTLB (VI-VT) Hits / Misses | dTLB Energy (mJ) VI-PT / VI-VT
tomcatv | 1749557115 | 338507867 (16.21%) | 2087233648 / 831334 (0.039%) | 338424649 / 831661 (0.249%) | 311.66 / 50.52
swim    | 1002514807 | 860855136 (46.20%) | 1861779068 / 1590875 (0.085%) | 859266347 / 1590043 (0.185%) | 278.25 / 128.68
mgrid   | 2191978592 | 344432806 (13.58%) | 2535976514 / 434884 (0.017%) | 344000352 / 432454 (0.126%) | 378.50 / 51.45
hydro2d | 1496388802 | 454362985 (23.29%) | 1949411179 / 1340608 (0.069%) | 453022377 / 1340608 (0.296%) | 291.25 / 67.99
vortex  | 1367475742 | 89098623 (6.11%)   | 1493975753 / 6022742 (0.401%) | — | 224.73 / —
m88ksim | 1468504558 | 15143493 (1.02%)   | 1495141348 / 99430 (0.006%)   | — | 223.14 / —
gcc     | 1341502795 | 185134690 (12.12%) | 1642110258 / 213829 (0.013%)  | — | 245.10 / —
perl    | 1368740348 | 106405236 (7.21%)  | 1497269529 / 48 (< 0.001%)    | — | 223.43 / —
Table 2. Benchmarks and cache and dTLB characteristics. Note that the pointer-based benchmarks are evaluated only for the VI-PT cache addressing scheme.
Figure 3. Normalized dTLB lookups for array-based applications (VI-PT). On the x-axis, the pair (a,b) indicates "a" TRs for the stack and "b" TRs for the array references.
show the energy consumption with our mechanism using different numbers of TRs for stack and global arrays, as explained earlier. This includes the energy cost of accessing the translation registers in addition to the dTLB lookups. The bars on the right side show the energy consumption (adding the level-1 and level-2 energy) for a 2-level dTLB structure with 1, 2, 4 and 8 entries in the first level and 128 entries in the second level. All the bars are normalized with respect to the energy consumption of a monolithic 128-entry dTLB without any optimization. In a multi-level structure, energy savings are provided (over a monolithic structure) by satisfying many requests from a lower-energy level-1 (assuming the level-2 is looked up serially after a level-1 miss, or else the energy is expected to be much worse). As a result, with a larger number of entries (as long as it does not get as large as the monolithic dTLB), it provides better energy savings by servicing more requests from the level-1, despite the fact that the energy cost per lookup goes up. On the other hand, even if the overall energy taken by our scheme may not appear as dramatic as the reduction in lookups shown in Figure 3 (since the energy values include the register access costs as well), our scheme still provides much higher energy savings overall than a multi-level structure. In fact, a multi-level structure can also incur higher performance penalties in address translations than our TR-based translation mechanism, as will be shown next.

Performance Characteristics. We have also analyzed the effects of code restructuring on the performance characteristics. Most importantly, we have observed that the increase in execution cycles is marginal, and in the case of monolithic dTLBs, we are even able to cut down on the number of misses.
This reduction in dTLB misses is due to the fact that the software is able to better capture the current locality by pinning translations in the TRs available to it, than letting the hardware manage all
of its dTLB entries. In the interest of space, a detailed analysis is presented in the tech report version of this paper [10].

5.1.2. VI-VT and PI-PT dL1 Results

Having looked at the VI-PT dL1, where the dTLB does not have a performance consequence, we next focus on VI-VT, where the dTLB causes a delay (typically 1 cycle) when it is looked up before an L2 access. However, with our approach, it is conceivable that some of these lookups may be satisfied by one of the TRs directly, thereby not needing a dTLB lookup. In such cases, we get not only energy savings but also performance benefits. Figures 6 and 7 show, respectively, the reduction in dTLB lookups and the normalized energy consumption with our schemes for VI-VT, as in the case of VI-PT described above. We find that most of the observations made with VI-PT hold here as well, showing that our software-based TR management for address translation provides very good energy savings, even better than a multi-level dTLB structure. One advantage of a VI-VT dL1 (instead of a VI-PT dL1) is that the dTLB lookup is needed only upon a dL1 miss, and not on every memory reference. This can lower TLB energy considerably, especially with high L1 hit rates. While this is true for instruction streams or programs with high data locality, we find that in the applications considered here, the dL1 miss rates are high enough that the dTLB energy continues to be a problem despite moving it beyond the dL1 access. Consequently, the results of this paper are uniformly important across both VI-PT and VI-VT designs. Another interesting observation is to compare a VI-VT based dTLB configuration (either monolithic or multi-level with different sizes for level-1), which has lower energy consumption than the original VI-PT dTLB, to a VI-PT dTLB with our TR enhancement (details in the
for(i=0;i