Workload and Implementation Considerations For Dynamic Base Register Caching

Matthew Farrens
Division of Computer Science, University of California, Davis, CA 95616
tel: (916) 752-9678  fax: (916) 752-4767  email: [email protected]

Arvin Park
Division of Computer Science, University of California, Davis, CA 95616
tel: (916) 752-5183  fax: (916) 752-4767  email: [email protected]

Abstract

Dynamic Base Register Caching (DBRC) [Farrens and Park 1990] [Farrens and Park 1991] has been shown to be a useful technique for significantly reducing processor to memory address bandwidth. By caching the higher order portions of memory addresses in a set of dynamically allocated base registers, only small register indices need to be transmitted between the processor and memory instead of the high order address bits themselves. In this paper we present the results of trace driven simulations which indicate that DBRC can facilitate the provision of separate paths for instructions and data by reducing the number of address lines required for parallel address channels. In fact, tailoring DBRC for separate instruction and data streams results in superior address compression. We also show that the effectiveness of DBRC is not significantly degraded by a multiprogramming workload, even for large SPEC benchmark traces. Additionally, we suggest two methods to optimize the DBRC implementation: (1) a processor's translation lookaside buffer hardware can be modified to implement DBRC in addition to its normal address translation functions, and (2) DBRC latency can be hidden by properly synchronizing it with memory chip address pin multiplexing.

This work was supported by the National Science Foundation under Grants CCR-90-11535 and CCR-90-122800, by University of California MICRO Grants 91-118 and 91-033, and by the Hewlett-Packard Corporation through gift number 1701080.

1. Introduction

A study by Hammerstrom and Davidson [Hammerstrom and Davidson 1977] suggests that the information content of address reference streams tends to be very low. They derive estimates from traces of numerical IBM 360 programs that indicate the average information content of an address word is often less than two bits.
Therefore, as address spaces and their corresponding address words have grown in size from 16 to 24 to 32 (to 64) bits, so has the percentage of a given address word containing redundant information. Dynamic Base Register Caching (DBRC) [Farrens and Park 1990] [Farrens and Park 1991] is a useful technique for reducing this redundancy. By exploiting both spatial and temporal locality of reference to more compactly encode address words, redundant I/O pins and bus lines involved in transferring address information from processor to memory can be eliminated.

DBRC operates by caching high order portions of memory addresses in a set of dynamically allocated base registers located at both the processor and the memory. This makes it possible to transmit small register indices between processor and memory instead of the high order address portions themselves. Low order address bits are transmitted directly from the processor to memory without interference from the DBRC system. These low order bits tend to change randomly between successive addresses; consequently the information content of the low order bits is quite high, and there is little to be gained by encoding them.

Dynamic Base Register Caching is very close in functionality to a classical Translation Lookaside Buffer (TLB). A TLB caches common translations from virtual to real page numbers, which are merely the high order components of virtual and real address words; the DBRC likewise caches commonly occurring high order address components. In addition, the page offsets, or lower order address components, are the same for corresponding real and virtual addresses, so they are sent directly to the memory without passing through the TLB. This similarity will be investigated in more detail later in this paper.
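To make the encoding concrete, consider the following sketch; the field widths and the 63-register capacity are illustrative assumptions rather than parameters fixed by DBRC itself. A 32 bit address is split into a 10 bit low order component, which is sent to memory unmodified, and a 22 bit high order component, which is replaced by a 6 bit register index whenever it is found in the base register cache, so only 16 bits cross the processor-memory interface on a hit.

    # Minimal sketch of DBRC address encoding (illustrative widths, not a full simulator).
    LOW_BITS = 10                        # low order component, sent directly to memory
    INDEX_BITS = 6                       # 6 index bits -> up to 2**6 - 1 = 63 base registers
    MISS_INDEX = (1 << INDEX_BITS) - 1   # one index pattern is reserved to signal a miss

    base_registers = {}                  # high order component -> register index (processor copy)

    def encode(address):
        """Return (index, low_component); index == MISS_INDEX signals a miss."""
        low = address & ((1 << LOW_BITS) - 1)
        high = address >> LOW_BITS
        if high in base_registers:                    # hit: 6 + 10 = 16 bits cross the interface
            return base_registers[high], low
        if len(base_registers) < MISS_INDEX:          # allocate a free register (replacement omitted)
            base_registers[high] = len(base_registers)
        return MISS_INDEX, low                        # the full high order bits must follow on a miss

The memory holds an identical register file, so on a hit it recovers the high order component with a single table read and concatenates it with the low order bits; no addition is required.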


DBRC can be viewed as a form of address segmentation. However, DBRC differs from classical segmentation in several ways: (1) Classical segmentation is specified in machine instructions, while base register caching occurs dynamically and is transparent to the programmer. (2) Classical segmentation provides protection and allows an architecture with a limited data word width to access a larger address space, whereas DBRC only reduces the number of address lines between processor and memory. (3) Since there is no bit overlap between base registers and low order components, no addition hardware is necessary; base registers are simply concatenated with low order components to form complete addresses.

Previous simulations [Farrens and Park 1990] [Farrens and Park 1991] have shown that DBRC can reduce address bandwidth by more than a factor of two without significant increases in memory latency or decreases in processor performance. These results apply to both set-associative and direct-mapped DBRC cache organizations. DBRC has even been shown to operate well in the presence of an on-chip cache that filters out much of the locality that DBRC presumably exploits to compactly encode addresses. In fact, the on-chip cache can be used to hide the DBRC lookup time, thereby eliminating it from the critical path of a memory reference.

In this paper we extend the previous DBRC work by performing trace driven simulations to evaluate the effectiveness of the technique on multiprogramming workloads, and by significantly increasing the scope of the traces studied. We also propose two methods to more efficiently integrate DBRC into a computing system: (1) existing translation lookaside buffer hardware can be modified to implement DBRC, and (2) DBRC latency can be hidden by synchronizing the operation of DBRC with address multiplexing at the memory chip.

2. The Simulations

Our simulations were performed on a wide range of traces taken from a set of C and Fortran codes for both MIPS and SPUR architectures. These simulations investigated two main issues: (1) the effect of a multiprogramming workload on the effectiveness of DBRC, and (2) the feasibility of using base register caching to provide two separate address paths from processor to memory, one for instruction references and one for data references. Such a configuration is commonly called a Harvard architecture. By using DBRC, the two different address busses can be compressed to fit into the 32 bits that a normal uncompressed address bus requires.

All simulations in this study were performed for the same set of DBRC configurations and parameters. These simulations measured DBRC hit rate as a function of the number of base registers and the reduced address size (which is the combination of the index and the low order component). The number of base registers in a given configuration is determined by the number of index bits; for example, k index bits can address up to 2^k - 1 base registers. The subtraction of one accounts for the fact that one index bit pattern is reserved to indicate a DBRC miss. Because of the difficulty of performing large n-way associative searches, two different DBRC configurations were simulated. One assumes a fully associative DBRC

using an LRU replacement scheme, and the other assumes a direct mapped DBRC. For the fully associative scheme, systems with 2 to 6 index bits and reduced address sizes varying from 4 to 24 bits were simulated. (No more than 6 index bits were simulated because of the difficulty of implementing a fully associative search over more than 2^6 - 1 = 63 registers.) For the direct mapped DBRC configuration, the number of index bits ranged from 3 to 8, and the reduced address size varied from 5 to 25 bits.

2.1. The Traces

The address stream traces used in this study were taken from several different sources. Five of the traces were taken from three LISP programs compiled for the SPUR architecture by Hill [Hill 1987]. Hill selected these samples to exhibit different but somewhat pessimistic locality profiles. The programs were Slc (the SPUR LISP compiler), Rsim (a circuit simulator simulating a counter), and Weaver (a production system for VLSI chip routing). Two 500,000-address samples were taken from each of the runs of the Slc and Weaver programs, and one 500,000-address trace was taken from the Rsim program. The five traces were concatenated and treated as one large composite trace file. A total of 3,100,330 memory references (both data and instruction) are generated by this composite trace.

In addition to the SPUR traces, the ATUM address traces (gathered using microcode patches on a VAX 8200 [Agarwal et al. 1986]) were also used. The ATUM traces consist of snapshots of address referencing activity from each of nine different programs running on the Ultrix and VMS operating systems. These traces are of interest primarily because they include operating system references. To examine the effects of system references on Dynamic Base Register Caching performance, we simulated DBRC using the ATUM traces both with and without system calls. The ATUM traces contain a total of 8,397,122 references (6,230,280 references when the operating system references are removed).

We also took traces from the SPEC benchmark suite compiled for the MIPS architecture. We used traces from the tomcatv, xlisp, dnasa7, matrix100, spice2g6, and doduc benchmarks, which were 1,700,009, 2,000,038, 3,000,091, 3,000,036, 3,000,071, and 72,535,842 references in length, respectively. Some of these traces do not contain a complete address trace for each program, because the complete traces were much too large to store and took too long to simulate. Instead, we randomly selected blocks of 100,000 references until a trace of two to three million references was created. Because of the sheer volume of results generated, not all results can be reported; the following sections present a representative subset of the total runs performed.
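The measurements described above can be sketched as follows; the routine and its trace format (a plain sequence of integer addresses) are our own simplifications for illustration, not the simulator actually used to produce the results in this paper. The same routine also models the periodic invalidation used in Section 2.2.

    # Sketch of the trace driven hit rate measurement (fully associative DBRC, LRU replacement).
    from collections import OrderedDict

    def dbrc_hit_rate(trace, index_bits, reduced_bits, flush_interval=None):
        """trace: iterable of integer addresses.
        index_bits:     k index bits give 2**k - 1 usable base registers.
        reduced_bits:   reduced address size = index bits + low order bits.
        flush_interval: invalidate all entries every N references (emulates time slicing)."""
        low_bits = reduced_bits - index_bits
        capacity = (1 << index_bits) - 1              # one index pattern is reserved for misses
        registers = OrderedDict()                     # high order component -> placeholder, LRU ordered
        hits = refs = 0
        for address in trace:
            refs += 1
            if flush_interval and refs % flush_interval == 0:
                registers.clear()                     # periodic invalidation (multiprogramming)
            high = address >> low_bits
            if high in registers:
                registers.move_to_end(high)           # refresh LRU position
                hits += 1
            else:
                if len(registers) >= capacity:
                    registers.popitem(last=False)     # evict the least recently used entry
                registers[high] = True
        return hits / refs if refs else 0.0

    # Example: 6 index bits and a 16 bit reduced address, flushed every 40,000 references.
    # rate = dbrc_hit_rate(addresses, index_bits=6, reduced_bits=16, flush_interval=40000)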


2.2. Effect of Multi-Programming Workload on DBRC

The impact of a multi-programming workload on DBRC was evaluated by periodically invalidating all of the DBRC entries to emulate the effect of time-slicing. We examined the effect of invalidating the DBRC every 20,000, 40,000, 80,000, 160,000 and 320,000 address references. These invalidation intervals were chosen to reflect a reasonable range of processor speeds and time slicing quantum sizes.

The invalidations had almost no discernible impact on DBRC performance across the entire range of invalidation intervals. This is illustrated in Figure 1, which compares DBRC hit rates for the 40,000 reference invalidation interval against hit rates for DBRC without invalidations for the doduc trace. (This trace was chosen for display because it is the longest single trace we simulated.)

The reason multi-programming does not significantly affect the performance of the DBRC is the relatively small size of the base register cache. Even the largest direct-mapped base register cache has only 255 entries. The number of extra DBRC misses incurred to completely fill an empty base register cache is at most equal to the number of entries in the cache. For example, a typical DBRC of 63 entries will generate at most about 0.16% (63/40,000) additional cache-filling misses for a 40,000 reference invalidation interval. Note that since a full base register cache may contain a number of entries which will not be referenced before they are replaced, the actual percentage of additional misses will probably be even lower. This is especially true for a direct-mapped base register cache that stores only one entry for each set of high order address components.

2.3. Using DBRC to Support a Harvard Architecture

DBRC would appear to provide a significant opportunity to implement a Harvard architecture on a single chip processor. The reduction in address pins achieved by DBRC makes it practical to fit two separate address paths, one for instructions and one for data, into the limited number of pins of a processor chip package. In fact, using separate base register caches for instructions and data makes it possible to tailor base register caching for the different locality profiles present in the instruction and data streams.

These locality profiles can be quite different. Instruction references tend to access consecutive memory locations, and even the jumps that occur are typically restricted to a limited locality. DBRC hit rates for instruction references are therefore quite good for small reduced address sizes. Data references, on the other hand, exhibit less locality, so a relatively large reduced address size is required to achieve a high hit rate. By segregating instruction and data references onto separate instruction and data address busses, it is possible to provide a small reduced address bus for instruction references and a larger reduced address bus for data references. By doing this, it is actually possible to achieve a higher hit rate for a given number of address pins than if two equal medium-sized reduced addresses are used for instructions and data. A medium-sized reduced address for instruction references provides more bits than are necessary to achieve a high hit rate, while a medium-sized reduced address for data references provides too few bits to achieve a high hit rate.

This effect can be seen in the simulation results presented in Figure 2. By providing a 12 bit reduced address bus for instructions and a 20 bit reduced address bus for data, for example, one can see that an overall hit rate of over 99% can be achieved. This is much higher than is possible using identical 16 bit reduced busses for both instruction and data addresses. It is also interesting to note in these figures that for a direct-mapped DBRC, there is quite often a significant drop-off in performance as the reduced address size increases. This is because as the number of bits used for the offset increases in a reduced address, the bits used to select the line in the DBRC change. It should be noted that the effect disappears when the DBRC is fully associative.
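The advantage of the asymmetric split can be seen with a simple reference-weighted average; the hit rates and the instruction/data mix below are assumed values chosen only to illustrate the calculation, not the measured results plotted in Figure 2.

    # Illustrative only: combined hit rate for separate instruction and data address busses.
    def combined_hit_rate(instr_fraction, instr_hit, data_hit):
        return instr_fraction * instr_hit + (1.0 - instr_fraction) * data_hit

    # Same 32 pin budget split two ways (hypothetical per-stream hit rates):
    asymmetric = combined_hit_rate(0.75, instr_hit=0.995, data_hit=0.990)  # 12 bit I bus + 20 bit D bus
    symmetric  = combined_hit_rate(0.75, instr_hit=0.999, data_hit=0.900)  # 16 bit I bus + 16 bit D bus
    print(asymmetric, symmetric)  # extra instruction bits buy little, while the data bus is starved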


[Figure 1. DBRC hit rate versus reduced address size (Number of Bits) for the doduc trace, 72,535,842 references in total: (a) direct mapped BRC reset every 40K references, (b) direct mapped BRC not reset, (c) fully associative BRC reset every 40K references, (d) fully associative BRC not reset. Curves are plotted for 7 to 255 base registers (direct mapped) and 3 to 63 base registers (fully associative).]

3. Combining DBRC with a TLB

It may be possible to combine the functions of the virtual memory system's translation lookaside buffer (TLB) with the dynamic base register cache. In this way only a small amount of additional hardware, and hence chip real estate, must be incorporated into a computing system to implement DBRC. In fact, by combining the two functions, it may actually be possible to reduce the size of the on-chip portion of the translation lookaside buffer.

Upon careful comparison one can see that the DBRC and TLB systems perform similar functions. The TLB caches common translations from virtual to real page numbers. These page numbers are merely the high order components of virtual and real address words. The page offsets, or lower order address components, are the same for corresponding real and virtual addresses, so they are sent directly to the memory without passing through the TLB. Similarly, the DBRC caches commonly occurring high order address components, and the DBRC system also sends the low order address components directly to the memory without passing through the DBRC system. In addition, the dividing line between high and low address components for DBRC is similar to the dividing line between the page number and the page offset in the virtual memory system: a typical page offset in a TLB is 10 bits in size, which corresponds to a typical DBRC low order address size of 10 bits.

DBRC and TLB systems are similar because they attempt to exploit the same memory reference locality for similar purposes. In a TLB, the page offset (and hence page size) is chosen to be large enough that the page table size is minimized, but not so large that a referenced page contains a lot of untouched data. In the DBRC, the low order component is chosen to be large enough that high order components are not continually changing with successive addresses. This is balanced against a motivation to minimize low order address component size, and thereby minimize the total reduced address size.

The number of entries in a TLB and a DBRC is similar because the DBRC and TLB are both designed to be a minimal size that guarantees a high hit rate. If the base register low order address component is the same size as the page offset, the hit rates for DBRC and the TLB will be identical for a given number of DBRC or TLB entries. In fact, the hit rates measured in our trace driven simulations apply equally well to both DBRCs and TLBs. The same desired hit rate for both the DBRC and TLB will lead to an identical number of DBRC and TLB entries.

The optimal DBRC may not be the same size as an optimal TLB, however. This is because a TLB miss may cause a page fault, so the miss penalty is potentially greater for a TLB than for a DBRC. The optimal hit rate, which determines the number of entries, may therefore be larger for a TLB than for a DBRC. Our preliminary estimates indicate that 128 is a good number of entries for a TLB, while 32 to 64 is a good number of entries for a DBRC. We anticipate that a combined DBRC/TLB would include 128 entries so that TLB performance will not suffer. This will increase the DBRC reduced address size by only one or two bits.

The structure of a combined TLB and DBRC can most easily be thought of as a conventional TLB that is divided into two separate pieces. One piece, located at the processor, stores the virtual page numbers (VPNs); we will refer to this as the processor TLB. The other piece, located at the memory, stores the real page numbers (RPNs), and will be referred to as the memory TLB. Recall that a conventional TLB stores virtual page number and real page number pairs, so this division of a TLB creates two equal sized arrays with corresponding entries. Note that the processor TLB is organized as a cache so that associative lookups can be performed, while the memory TLB is a simple register file. Dividing the TLB in this way decreases the amount of TLB hardware on the processor chip. Of course, adding DBRC functionality will require some additional hardware, so the total amount of on-chip hardware may only be slightly less than for a conventional on-chip TLB.

The integrated TLB/DBRC operates in the following way. Each time the processor issues an address, the address is divided into a VPN and an offset. The offset is sent directly to memory over dedicated address lines. The VPN is then compared with the VPNs stored in the processor TLB. If the VPN matches one of the VPNs contained in the processor TLB, the location index of the matching VPN is sent to the memory and is used to look up the associated RPN in the memory TLB. The RPN is then concatenated with the offset to produce the real address. If the VPN does not match any of the VPNs contained in the processor TLB, a TLB fault is initiated. This causes the virtual memory system to fetch an RPN from the process page table. One of the processor TLB entries is then replaced by the new VPN, and the associated RPN is stored in the corresponding location in the memory TLB. Since the RPN may be larger than the reduced address, it may take multiple cycles to transmit the RPN over the reduced address lines to the memory TLB.

Modifying a TLB to implement both address translation and address compression appears to be a very efficient way to implement DBRC. This idea holds great promise, and we are working on a more thorough study involving a detailed design in order to provide an accurate evaluation of this idea.
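A minimal sketch of this split lookup follows; the offset width, the 128-entry size, and the FIFO replacement choice are assumptions made for illustration, since no concrete design is specified here.

    # Sketch of the split processor-TLB / memory-TLB operation (assumed widths and policies).
    OFFSET_BITS = 10
    ENTRIES = 128                      # combined DBRC/TLB sized so that TLB performance does not suffer

    processor_tlb = {}                 # VPN -> entry index (associatively searched at the processor)
    memory_tlb = [None] * ENTRIES      # entry index -> RPN (simple register file at the memory)
    slot_owner = [None] * ENTRIES      # which VPN currently occupies each slot
    next_victim = 0                    # FIFO replacement pointer (illustrative policy)

    def issue(virtual_address, page_table):
        global next_victim
        offset = virtual_address & ((1 << OFFSET_BITS) - 1)   # sent to memory on dedicated lines
        vpn = virtual_address >> OFFSET_BITS
        if vpn not in processor_tlb:                           # TLB fault
            index = next_victim
            next_victim = (next_victim + 1) % ENTRIES
            if slot_owner[index] is not None:
                del processor_tlb[slot_owner[index]]           # evict the slot's previous owner
            slot_owner[index] = vpn
            processor_tlb[vpn] = index
            memory_tlb[index] = page_table[vpn]                # installing the RPN may take extra cycles
        index = processor_tlb[vpn]                             # on a hit only this index crosses the bus
        return (memory_tlb[index] << OFFSET_BITS) | offset     # memory concatenates RPN and offset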

3.1. Synchronization of DBRC with Address Pin Multiplexing

The DBRC can also be synchronized with memory chip address pin multiplexing to hide the DBRC lookup latency. Previous work [Farrens and Park 1991] suggests that the DBRC lookup latency can be hidden if a processor has an on-chip cache: the DBRC lookup is simply performed in parallel with the cache lookup. By the time an on-chip cache miss occurs, the processor DBRC lookup has already been accomplished, and a reduced address can be immediately sent off-chip.

The DBRC lookup may also be synchronized with memory chip address multiplexing to hide the lookup time. Memory chip address pins are commonly multiplexed between high and low order address bits so that a large address can be transferred through a smaller number of address pins. For example, a 1 megabit memory chip requires 20 address bits, but packaging constraints do not allow the memory chip to have 20 address pins. Therefore, the necessary 20 address bits are divided into 10 row address bits and 10 column address bits, and the package provides a total of 10 address pins. During a memory cycle, the 10 row address bits are first gated onto the address pins and latched inside the memory chip. Then, on the following subcycle, the 10 column address bits are gated onto the same 10 address pins. Internally the two are combined to create the requisite 20 bit address.

The DBRC lookup latency can be hidden by properly synchronizing the lookup with this address pin multiplexing. Under DBRC, the low order address bits are sent directly to the memory, while the high order bits are used to perform a DBRC lookup. The low order bits can be gated onto the row address pins immediately, while the DBRC lookup is taking place. By the time the DBRC lookup has produced the high order bits, the address pins are ready to accept them in the column address cycle. In this way the DBRC lookup latency can be effectively eliminated from the critical path of a memory access.
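A rough sketch of the intended overlap follows, with single-cycle row, column, and lookup times assumed purely for illustration (no concrete latencies are quoted here).

    # Illustrative timing only: the row address transfer hides the DBRC lookup.
    ROW_CYCLE = 1        # gate and latch the row (low order) bits
    COL_CYCLE = 1        # gate the column (high order) bits
    DBRC_LOOKUP = 1      # assumed processor-side lookup latency

    def address_transfer_time(overlapped):
        if overlapped:
            # Low order bits are driven as the row address while the DBRC lookup proceeds in parallel.
            return max(ROW_CYCLE, DBRC_LOOKUP) + COL_CYCLE
        # Serialized: finish the lookup first, then drive the row and column addresses.
        return DBRC_LOOKUP + ROW_CYCLE + COL_CYCLE

    print(address_transfer_time(True), address_transfer_time(False))   # 2 versus 3 cycles here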


4. Conclusions

This paper has presented a number of workload and implementation issues related to Dynamic Base Register Caching (DBRC). Our simulations indicate that DBRC can facilitate the provision of separate address channels for instructions and data by reducing the number of pins required to transmit address information. In fact, tailoring DBRC for separate instruction and data address streams results in superior address compression. Additional results imply that the effectiveness of DBRC is not significantly affected by a multi-programming workload. We also conclude that DBRC implementation cost may be greatly reduced by incorporating the DBRC functions into a processor's translation lookaside buffer, and that DBRC lookup time can be hidden through proper synchronization with memory chip address pin multiplexing.

5. Acknowledgements

We would like to thank Andrew Pleszkun for his seminal ideas on this subject. We also thank Richard Lipton for his helpful comments on synchronizing DBRC with memory chip address multiplexing. Finally, we thank Jeffrey Becker, Ron Maeder, and various reviewers for their suggested revisions of this paper.

References
