An Advanced Filtering TLB for Low Power Consumption

Jin-Hyuck Choi, Jung-Hoon Lee, Shin-Dug Kim
CS, Yonsei University, 134 Shinchon-dong, Seoul, 120-749, Korea
82-2-2123-2718, [email protected]

Gi-Ho Park
SOC Center, Samsung Electronics, 416, Suwon-City, 442-742, Korea
82-31-279-7744, [email protected]

Abstract

This research designs a new two-level TLB (translation look-aside buffer) architecture that integrates a 2-way banked filter TLB with a 2-way banked main TLB. One of the main objectives is to reduce power consumption in embedded processors by distributing accesses to the TLB entries across several banks in a balanced manner. An advanced filtering technique is devised to reduce power dissipation by adopting a sub-bank structure in the filter TLB, and a bank-associative structure is applied to each level of the TLB hierarchy. Simulation results show that the miss ratio and Energy*Delay product can be improved by 59.26% and 24.9%, respectively, compared with a micro TLB with 4-32 entries, and by 40.81% and 12.18%, compared with a micro TLB with 16-32 entries.

Keywords Bank associative structure, filter mechanism, low power design, and translation look-aside buffer.

1. Introduction

Power consumption has become one of the most important issues in designing high performance embedded processors. In general, architectural approaches can offer more significant improvements in reducing power consumption than techniques at the gate or circuit level. Specifically, the power dissipated by on-chip memory is a significant portion of the total power dissipated by the entire processor. For example, in the case of the StrongARM SA-110 [1], one of the modern RISC-type embedded processors, the power dissipated by the instruction cache, data cache, and translation look-aside buffer (TLB) corresponds to 27%, 16%, and 17% of the overall power consumption, respectively. The TLB is a small on-chip cache for recently used virtual-to-physical address translations, and it is generally designed as a fully associative structure.

The reasons why on-chip memory systems consume a significant amount of power can be explained as follows. First, the tag and data arrays of these on-chip memory systems are implemented with power-hungry static RAMs to support an access time fast enough to cope with the processor clock rate. Moreover, the content addressable memory (CAM) used in a fully associative implementation dissipates much more energy than SRAM because of its internal comparison logic and additional match lines. Second, these on-chip memory systems are frequently accessed; in particular, the instruction cache and the TLB are accessed on every clock cycle. Third, a miss in these memory systems entails another access to a larger memory system, where extra power is dissipated to drive the I/O pads for off-chip memory accesses. Because the capacitance of the I/O pads is much larger than on-chip capacitance, reducing misses in on-chip memory systems is itself an approach to designing low power memory systems. Finally, a significant number of transistors are devoted to the on-chip memory systems, and these densely packed transistors usually occupy a large portion of the die area. Noticeably, in spite of its much smaller area, the TLB consumes more power than the data cache in the above example. Thus, this research focuses on designing a low power TLB structure for high performance embedded processors. Simulation results show that the miss ratio and Energy*Delay product can be improved by 59.26% and 24.9%, respectively, compared with a micro TLB with 4-32 entries, and by 40.81% and 12.18%, compared with a micro TLB with 16-32 entries.

2. Related work

Current TLBs are typically implemented with static RAM cells for the data and CAM cells for the tags. For low power TLB design, memory cells have been redesigned using techniques such as modifying the CAM cell structure [2] and reducing the supply voltage [3]. The work by Juan et al. [2] proposes modifying the CAM cell by adding another transistor in the discharge path. With the modified cell, the control line can be used to precharge the match line without pulling the bit lines to zero. However, this method suffers from higher hardware cost and complexity, and from performance degradation. As [3] shows, supply voltage is an important parameter controlling CMOS power consumption.

For a fully associative TLB structure, power consumption tends to increase abruptly as the number of entries increases [2,4], so the total number of TLB entries accessed at once should be kept below 64 or 128. However, higher performance can be achieved if more TLB entries are provided. Thus, one method is to use a micro or filter TLB with a small number of entries, holding the recently referenced translations, above the conventional level-1 TLB [5]; this has been used to reduce the number of entries accessed at once and to support faster access. However, because the micro or filter TLB forms a hierarchical structure, it achieves lower power dissipation without any performance gain, even though it uses more entries and requires additional sense-amplifiers and output drivers; the additional cost buys lower power consumption rather than higher performance. If a filter TLB access misses, an additional cycle is needed to access the main level-1 TLB, a 2-cycle overhead that degrades the total average memory access time of the system. Therefore, simply reducing power in this way trades off against performance.

Another approach to low power consumption is to divide the entire TLB space into two banked TLBs so that the number of entries accessed at once is reduced to less than 32 or 64 [6,7]. This structure consumes less power than a fully associative TLB because only half of the CAM entries are looked up on each access to the TLB. But a major drawback is performance degradation due to the tendency to encounter more capacity misses in a banked system. Thus, we have devised a two-level dual TLB structure that integrates a 2-way banked filter TLB with a 2-way banked main TLB to achieve low power consumption and high performance. Experimental results show that our TLB design can reduce the power consumption and the miss ratio significantly compared with other approaches.

3. Low power TLB design

The proposed two-level dual TLB structure is presented in this section. For low power consumption, it combines two previously proposed techniques with a new heuristic operational process and optimization of the configurable parameters for an effective structure. Its operational model is also described for several cases.

3.1 Filtering the fast access

Using a fast lookup buffer at address translation time is a very common technique among embedded processor manufacturers, and its hit ratio is approximately 50%. If a miss occurs at this buffer, another cycle is incurred to access the main TLB. One approach to reducing internal power dissipation is to avoid a large number of accesses to the main TLB by using an extended filtering technique, which is borrowed from the filter cache design [5]. It also provides high bandwidth and low latency. Fig.1 depicts the conceptual filter TLB organization, where a small TLB is placed above the main TLB to filter main TLB accesses. The main TLB maintains high performance by improving the miss ratio. However, simply adapting the filtering scheme to the TLB can be problematic. In particular, the effective TLB space can be reduced by the inclusion property, and this in turn tends to increase the overall number of TLB misses, eventually increasing power consumption. Thus, a method to improve the filter TLB hit ratio is necessary.

Figure 1. Organization of the filter TLB: the required PPN is first looked up in the filter TLB (low power consumption), which filters main TLB accesses; on a filter miss, the main TLB (high power consumption and high performance) supplies the missed PPN through a MUX.

One method is to enlarge the filter TLB capacity per page entry, i.e., the filter TLB is designed to hold more entry slots by maintaining more PPN (physical page number) slots per VPN (virtual page number) entry; this is called an extended sub-bank structure. Consider the k sub-bank case. The same tag can be used for k PPNs of a given VPN entry, and the number of tag bits for the k sub-banks is reduced by log2(k) bits. This enables the filter to hold more entries, thereby increasing the filter hit ratio. Also, power reduction is achieved by activating only one sub-bank entry at a time, using the log2(k) selection bits. To determine the optimal number of sub-banks, we performed several experiments, and a two-way extended sub-bank structure was chosen as the best compromise from the experimental results. The details of the experiments are described in Section 4.
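To make the extended sub-bank idea concrete, the following C sketch (an illustration under assumed parameters, not the authors' implementation) models one filter TLB entry with 4KB pages and k = 2 sub-banks: a single shortened tag is shared by two PPN slots, and a low-order VPN bit selects which slot is read. In the combined design of Section 3.2, the lowest VPN bit is used for bank selection and the next bit for sub-bank selection; this standalone sketch simply uses the lowest VPN bit.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12          /* 4 KB pages: 12-bit page offset      */
#define SUB_BANKS   2          /* k = 2 extended sub-banks per entry  */
#define SUB_SHIFT   1          /* log2(k) bits taken from the VPN     */

/* One filter-TLB entry: a single (shortened) tag shared by k PPN slots. */
typedef struct {
    uint32_t tag;              /* VPN >> SUB_SHIFT (tag shortened by log2(k) bits) */
    uint32_t ppn[SUB_BANKS];   /* one physical page number per sub-bank            */
    bool     valid[SUB_BANKS]; /* per-sub-bank valid bits                          */
} filter_entry_t;

/* Look up a virtual address in one entry; only the selected sub-bank is read. */
static bool filter_entry_lookup(const filter_entry_t *e, uint32_t vaddr, uint32_t *ppn)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t sub = vpn & (SUB_BANKS - 1);   /* sub-bank select bit            */
    uint32_t tag = vpn >> SUB_SHIFT;        /* remaining bits form the tag    */

    if (e->valid[sub] && e->tag == tag) {
        *ppn = e->ppn[sub];                 /* hit: only this sub-bank driven */
        return true;
    }
    return false;                           /* miss: fall through to main TLB */
}
```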

3.2 Bank associative structure with filter TLB

Another approach to reducing power consumption is a bank-associative structure. The x-way banked TLB structure partitions the entire TLB into x banks by modulo x, so the least significant log2(x) tag bits of a given virtual address are used as selection bits to activate one specific bank at a time. With this scheme, the power dissipated in accessing the TLB is reduced by a factor of approximately x.
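The field split implied by Fig.2 (a 12-bit page offset for 4KB pages, one bank-select bit, one sub-bank-select bit, and an 18-bit tag out of a 32-bit virtual address) can be sketched as follows. The exact ordering of the bank and sub-bank select bits is an assumption based on the text, which uses the least significant bit of the tag field as the bank select bit.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed field split from Fig.2: |18-bit tag|sub|bank|12-bit page offset|. */
#define PAGE_OFFSET_BITS 12

static void split_vaddr(uint32_t va, uint32_t *tag, uint32_t *sub,
                        uint32_t *bank, uint32_t *offset)
{
    *offset = va & ((1u << PAGE_OFFSET_BITS) - 1);  /* bits 11..0                 */
    uint32_t vpn = va >> PAGE_OFFSET_BITS;          /* bits 31..12                */
    *bank = vpn & 0x1;                              /* x = 2 banks: 1 select bit  */
    *sub  = (vpn >> 1) & 0x1;                       /* k = 2 sub-banks            */
    *tag  = vpn >> 2;                               /* remaining 18 tag bits      */
}

int main(void)
{
    uint32_t tag, sub, bank, off;
    split_vaddr(0x1234ABCDu, &tag, &sub, &bank, &off);
    printf("tag=0x%X sub=%u bank=%u offset=0x%X\n",
           (unsigned)tag, (unsigned)sub, (unsigned)bank, (unsigned)off);
    return 0;
}
```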

Figure 2. Organization for a combination of filter and main TLBs: two banks (bank 0 and bank 1), each holding CAM-based VPN tags and SRAM-based PPN arrays for the filter TLB and the main TLB; a virtual address is split into an 18-bit tag, a sub-bank select bit, a bank select bit, and a 12-bit page offset (4KB pages), and the physical page number is delivered on a filter TLB hit or a main TLB hit.

We have adopted a 2-way banked structure to minimize the additional cost of the sense-amplifiers and output drivers necessitated by increasing the bank associativity. Thus, our proposed TLB is constructed in two parts, a 2-way banked filter TLB with its extended sub-bank structure and a 2-way banked main TLB, as shown in Fig.2. There are two sub-banks in each bank of the filter TLB in this example. The operational flow is based on a modified two-level TLB model. When the CPU generates a particular virtual address, the filter TLB is searched first. To select one of the two banks, we use the least significant bit of the tag field as a bank select bit, i.e., 0 or 1. Three possible cases can be considered as follows.

(1) Hit in a selected bank of the filter TLB. Only one bank, selected by the least significant bit of the tag part of the generated virtual address, is activated to find the required physical page number; the sub-bank select bit then enables one sub-bank to locate the correct page number. If the page is found in the activated bank of the filter TLB, the actions are the same as for a conventional TLB hit: the requested physical address is sent to the cache and compared with its tag bits. If a miss occurs at the filter TLB, the main TLB is activated during the next cycle for a possible match.

(2) Hit in a selected bank of the main TLB. The main TLB is enabled for further page searching at the next level. If the requested page is found in the main TLB, the requested physical page is sent to the cache and compared with its tag bits. At the same time, the corresponding entry is moved into the filter TLB for reuse and invalidated in the main TLB to prevent entry duplication. If an entry is evicted from the filter TLB, it is stored back into an invalidated entry of the main TLB for temporal reuse. Using this reuse scheme, we deliberately break the inclusion property of our proposed TLB for spatial effectiveness.

(3) Miss in both TLBs. When misses occur at both TLBs, a new page entry is loaded directly into the filter TLB while the memory management unit is handling the miss. Only when a filter TLB entry is evicted according to the LRU (least recently used) policy is the evicted page entry moved into the main TLB for temporal reuse. With this scheme, as mentioned above, the total number of effective VPNs essentially forms one large TLB space consisting of the filter TLB space plus the main TLB space. In contrast, the conventional micro TLB is a hierarchical structure based on the inclusion property. Thus, we effectively increase the TLB space by avoiding the inclusion property.
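The three cases above can be summarized in a small behavioral model. The C sketch below is a simplified illustration under assumed sizes (a 4-4-16-16-like configuration), ignores the sub-bank detail, and uses a hypothetical ppn_from_walk parameter to stand in for the memory management unit's miss handling; it captures the promote-on-main-hit and fill-on-miss behavior that avoids the inclusion property.

```c
#include <stdint.h>
#include <stdbool.h>

#define BANKS      2
#define F_ENTRIES  4     /* filter TLB entries per bank (4+4)  */
#define M_ENTRIES 16     /* main TLB entries per bank (16+16)  */

typedef struct { bool valid; uint32_t vpn, ppn; uint64_t last_used; } entry_t;

static entry_t  filter_tlb[BANKS][F_ENTRIES];
static entry_t  main_tlb[BANKS][M_ENTRIES];
static uint64_t now;    /* logical time used to approximate LRU */

static int find(entry_t *set, int n, uint32_t vpn)
{
    for (int i = 0; i < n; i++)
        if (set[i].valid && set[i].vpn == vpn) return i;
    return -1;
}

static int lru_victim(entry_t *set, int n)
{
    int v = 0;
    for (int i = 0; i < n; i++) {
        if (!set[i].valid) return i;                 /* prefer an empty slot */
        if (set[i].last_used < set[v].last_used) v = i;
    }
    return v;
}

/* Translate one VPN. ppn_from_walk is supplied by the MMU on a double miss. */
uint32_t translate(uint32_t vpn, uint32_t ppn_from_walk, int *cycles)
{
    int b = vpn & (BANKS - 1);                       /* bank select bit */
    now++;

    int i = find(filter_tlb[b], F_ENTRIES, vpn);
    if (i >= 0) {                                    /* case 1: filter TLB hit */
        *cycles = 1;
        filter_tlb[b][i].last_used = now;
        return filter_tlb[b][i].ppn;
    }

    *cycles = 2;                                     /* main TLB searched next cycle */
    int v = lru_victim(filter_tlb[b], F_ENTRIES);
    entry_t evicted = filter_tlb[b][v];

    int m = find(main_tlb[b], M_ENTRIES, vpn);
    if (m >= 0) {                                    /* case 2: main TLB hit          */
        filter_tlb[b][v] = main_tlb[b][m];           /* promote into the filter TLB   */
        main_tlb[b][m].valid = false;                /* invalidate to avoid duplicates */
        if (evicted.valid) main_tlb[b][m] = evicted; /* store evicted entry back       */
    } else {                                         /* case 3: miss in both TLBs     */
        filter_tlb[b][v] = (entry_t){ true, vpn, ppn_from_walk, now };
        if (evicted.valid)                           /* demote the LRU filter entry   */
            main_tlb[b][lru_victim(main_tlb[b], M_ENTRIES)] = evicted;
    }
    filter_tlb[b][v].last_used = now;
    return filter_tlb[b][v].ppn;
}

int main(void)
{
    int cycles;
    uint32_t ppn = translate(0x12345, 0xABCDE, &cycles); /* double miss: filled from walk */
    ppn = translate(0x12345, 0, &cycles);                /* now a 1-cycle filter hit      */
    return (ppn == 0xABCDE && cycles == 1) ? 0 : 1;
}
```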

4. Performance evaluation

We use four essential performance metrics, i.e., miss ratio, average memory access time (AMAT), power consumption, and Energy*Delay product, to evaluate and compare the proposed TLB system with other approaches. For this comparison, we focus on the two most commonly used TLB architectures: a fully associative TLB and a micro TLB.


4.1. Optimal sub-bank structure

We performed several experiments to find the optimal number of sub-banks. This value is an important factor in increasing the filter hit ratio. In addition, this structure does not require any additional sense-amplifiers or output drivers compared with other bank-associative structures.


To estimate the energy dissipation of the TLB systems, this research used the CACTI tool [8,10], a simulator that estimates transistor-level cache access time, area, and power dissipation. A recently modified version of CACTI introduced new features, including a cache banking model with fully independent bank structures. In particular, the CACTI model can change the organization of the internal memory array by breaking it into smaller subarrays (bit-line and word-line divisions) or by mapping more than two sets to a single word line, which allows the overall access time and power consumption to be optimized for a given cache configuration. The CACTI simulator was modified for TLB power evaluation in several respects. First, the block size of a TLB is not variable but fixed by the TLB entry size; throughout the simulations in this research, the TLB entry size was assumed to be 4 bytes. Second, in a cache, the length of the offset field within an address is determined by the size of a cache block, but in a TLB, a predefined page size determines the length of the page offset field within a virtual address. Finally, the original CACTI cannot simulate small caches with fewer than eight sets because its decoder architecture is based on a 3-to-8 decoder block; thus, we modified the decoder architecture to use a 2-to-4 decoder block for the 4-entry TLB.


Firstly, we examined the utilization of each sub-bank for the four sub-banked filter TLB, which shows how many entries are actually used in each sub-bank. For a particular number of sub-banks, e.g., four sub-banks, experiments were performed to check how effectively each sub-bank is utilized by examining the evicted filter TLB entries. In these experiments, approximately 79% of the total number of evicted TLB entries turned out to occupy only one sub-bank, and 17% occupied two sub-banks; only 2% of the evicted entries occupied three or four sub-banks. Thus, the optimal number of sub-banks to improve the filter TLB hit ratio is less than three or four. To obtain a more accurate conclusion, we performed simulations to determine the exact value for various sub-bank specifications. Fig.3 and Fig.4 show the average miss ratio and the average TLB access times for the following TLB structures, assuming a 4KB page size.

Figure 3. Miss ratios for various sub-bank configurations.
Figure 4. Average memory access time for various sub-bank configurations.




The notation "(1sub) 4-4-16-16" denotes the proposed TLB configured as a 2-way banked filter TLB with 4+4 entries and a 2-way banked main TLB with 16+16 entries, where the sub-bank size is the same as the conventional one. "(1sub) 8-8-16-16" is constructed as a 2-way banked filter TLB with 8+8 entries, with one sub-bank per entry in the same way as above, and a 2-way banked main TLB with 16+16 entries. The notation "(2sub) 4-4-16-16" denotes the proposed TLB configured as a 2-way banked filter TLB with 4+4 entries, each with 2 extended sub-banks, and a 2-way banked main TLB with 16+16 entries. Finally, the notation "(4sub) 2-2-16-16" denotes the proposed TLB configured as a 2-way banked filter TLB with 2+2 entries, each with 4 extended sub-banks, and a 2-way banked main TLB with 16+16 entries. Except for the "(1sub) 4-4-16-16" configuration, the other three structures are configured to hold the same number of physical pages.

In the "(4sub) 2-2-16-16" configuration, although each entry can hold 4 PPNs, it shows the lowest performance, which indicates that greedily increasing the number of sub-banks can degrade performance. Thus, the sub-bank structure is only applied to the filter TLB to improve the 1-cycle filter TLB hit ratio and power consumption. Fig.5 and Fig.6 present the average energy dissipation and Energy*Delay product for the various sub-bank filter TLB configurations. As shown in our experiments, the number of indexed entries should be less than four, especially in terms of power consumption and Energy*Delay product, and the 2-way banked structure should be used to minimize additional hardware cost, e.g., additional comparators, sense-amplifiers, and output drivers. The power estimation methodology is described in the next section.

Figure 5. Average energy dissipation for the various sub-bank configurations.
Figure 6. Energy*Delay products for the various sub-bank configurations.




4.2. Performance improvement

Fig.7 and Fig.8 show the average miss ratios and the average TLB access times for the following TLB structures, assuming a 4KB page size. Fully associative (FA) TLBs with 32 entries and 48 entries, denoted FA32 and FA48, are used because they hold the same number of PPNs as our main TLB and the same combined resources as our entire TLB structure, respectively. A micro (filter) TLB of 4 entries with a main TLB of 32 entries, denoted micro TLB (4-32), and a micro (filter) TLB of 16 entries with a main TLB of 32 entries, denoted micro TLB (16-32), are also used for comparison. The former is a typical configuration and the latter is more comparable in terms of the number of PPNs provided.




Figure 7. Miss ratios for the proposed TLB and various TLB organizations.


Figure 8. Average memory access time for the proposed TLB and various TLB organizations.

The time to access the filter TLB is assumed to be one cycle. If a miss occurs, then the main TLB is searched during the next cycle. The notation "4-4-16-16 entries" denotes the proposed TLB configured as a 2-way banked filter TLB with 4+4 entries and a 2-way banked main TLB with 16+16 entries. Every filter (micro) TLB in this experiment uses LRU (least recently used) replacement. The time needed for a miss handling interrupt routine is assumed to be 15 cycles, which is based on the value for common 32-bit embedded processors (e.g., the Hitachi SH-4 or ARM920T).
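Under these timing assumptions (1 cycle for a filter TLB hit, one extra cycle to search the main TLB, and a 15-cycle miss handling routine when both TLBs miss), a simple AMAT calculation can be sketched as follows. The hit-ratio values used in main() are placeholders for illustration, not measured results.

```c
#include <stdio.h>

/* Average TLB access time under the stated timing assumptions:
 * filter hit = 1 cycle, main TLB searched in the next cycle (2 cycles total),
 * plus a 15-cycle miss handling routine when both TLBs miss.                */
static double tlb_amat(double filter_hit, double main_hit)
{
    double both_miss = 1.0 - filter_hit - main_hit;
    return filter_hit * 1.0
         + main_hit   * 2.0
         + both_miss  * (2.0 + 15.0);
}

int main(void)
{
    /* Placeholder ratios: ~95% one-cycle filter hits as reported in the text,
     * 4% main TLB hits, 1% double misses (illustrative only).                */
    printf("AMAT = %.3f cycles\n", tlb_amat(0.95, 0.04));
    return 0;
}
```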




To evaluate power consumption in our proposed TLB, the extra decoding power needed to select a specific bank is accumulated separately. The overall power consumption is then obtained by summing the power consumption of the proposed TLBs and the additional power dissipated at the multiplexer. In the simulator, only the one selected bank is activated to drive the tag onto the bit lines of the CAM and to enable the SRAM word lines and sense-amplifiers to read out the data. There is no additional delay due to the multiplexer because the bank selection bit can be applied to the multiplexer while the tag is being compared in the CAM. Benchmark analysis shows that the percentage of one-cycle hits at the filter TLB is almost 95% in the proposed TLB. Fig.9 shows the amount of power consumed for each event in a single TLB access. The significant difference in power consumption between the TLB with 128 entries and that with 64 entries, as shown in Fig.9, comes from the fact that the power consumed at the match line and bit lines of the CAM grows considerably when the number of entries increases beyond 64. Thus, we used 32 entries in our experiments to show the reduction in power consumption that we can realistically expect.


Figure 10. Average energy dissipation of the proposed TLB and various TLB organizations.


Figure 9. Power dissipation per access for various TLB sizes.

The power dissipation can be estimated for a fully associative TLB as follows [9]:

  P_TLB  = R_HIT * P_HIT + R_MISS * P_MISS                                          (1)
  P_HIT  = P_AccCAM + P_DOdriver + P_VSdriver                                       (2)
  P_MISS = P_AccCAM + P_Write + ( P_AccCache + R_CacheMiss * (P_CacheWrite + P_pad) )   (3)

where R_HIT and R_MISS are the ratios of hits and misses in the TLB, respectively, and P_HIT and P_MISS are the power consumption required to process a hit and a miss, respectively. In equation (2), P_AccCAM is the power dissipated at all the entries when the tag part of the TLB is accessed, and P_DOdriver and P_VSdriver are the power consumed at the data output driver and the valid signal driver when a TLB hit occurs. In equation (3), P_Write is the power dissipated in the data area and tag area in order to update an entry in the case of a miss. To account for the external power consumption, the following terms are required: P_AccCache is the power used to access a cache block, R_CacheMiss is the cache miss ratio, P_CacheWrite is the power consumed when a cache write operation occurs on a cache miss, and P_pad is the power dissipated at the on-chip pad slot. A 32KB 2-way set-associative data cache with a 32-byte block size is assumed as the processor memory hierarchy, and the supply voltage is assumed to be 4.5 V with 0.80um technology. The values of R_CacheMiss, P_AccCache, P_CacheWrite, and P_pad are 5%, 21.291nJ, 10.145nJ, and 6.48nJ, respectively [9].
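To make equations (1)-(3) concrete, here is a minimal numerical sketch. The cache-side constants (21.291nJ, 10.145nJ, 6.48nJ, 5%) and the driver energies (0.2nJ, 0.01nJ) are the values quoted in the text; P_AccCAM and P_Write are placeholder values for illustration only, not the paper's measurements.

```c
#include <stdio.h>

/* Per-access energies in nJ. The cache-side constants and driver energies are
 * taken from the text [9]; the CAM access and write energies are placeholders. */
static const double P_AccCAM     = 1.0;      /* placeholder: CAM tag search       */
static const double P_DOdriver   = 0.2;      /* data output driver (from text)    */
static const double P_VSdriver   = 0.01;     /* valid signal driver (from text)   */
static const double P_Write      = 0.5;      /* placeholder: entry update on miss */
static const double P_AccCache   = 21.291;   /* cache block access [9]            */
static const double P_CacheWrite = 10.145;   /* cache write on a cache miss [9]   */
static const double P_pad        = 6.48;     /* on-chip pad slot [9]              */
static const double R_CacheMiss  = 0.05;     /* 5% cache miss ratio [9]           */

/* Equations (1)-(3): average TLB power per access for a given TLB hit ratio. */
static double p_tlb(double r_hit)
{
    double p_hit  = P_AccCAM + P_DOdriver + P_VSdriver;                      /* (2) */
    double p_miss = P_AccCAM + P_Write
                  + (P_AccCache + R_CacheMiss * (P_CacheWrite + P_pad));     /* (3) */
    return r_hit * p_hit + (1.0 - r_hit) * p_miss;                           /* (1) */
}

int main(void)
{
    printf("P_TLB at 99%% hit ratio: %.3f nJ/access\n", p_tlb(0.99));
    return 0;
}
```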

Finally, the logic components required to construct an additional TLB entry are a comparator, a valid signal driver, and a data output driver, whose power consumption values are assumed to be 0.03nJ, 0.01nJ, and 0.2nJ, respectively. These values were obtained using the modified DineroIV and CACTI simulators as mentioned before. Fig.10 and Fig.11 present the average energy dissipation and Energy*Delay product for the different TLB structures; the delay value is the average TLB access time. Clearly, the proposed TLB system is the best structure in terms of power consumption, and its performance improvement tends to be more significant than that of the other structures. It should be noted that our 4-4-16-16 configuration is smaller than the 16-32 micro TLB and the 48-entry fully associative TLB.





Figure 11. Energy*Delay products of the proposed TLB and various TLB organizations.

Table 1 provides simulation results for performance evaluation. In order to compare the proposed TLB with other TLB organizations, we used a metric called IR (Improvement Ratio), IR = ((a - b)/b) * 100%. As shown in Table 1, the miss ratio and Energy*Delay product can be improved by 59.26% and 24.9%, respectively, compared with the micro TLB with 4-32 entries.

Table 1. Improvement ratio over various TLB structures.

                      IR of miss ratio   IR of AMAT   IR of power   IR of E*D product
  FA 32               59.34%             -8.41%       36.94%        31.51%
  FA 48               22.07%             -13.41%      48.03%        40.90%
  Micro TLB (4-32)    59.26%             20.56%       11.06%        24.90%
  Micro TLB (16-32)   40.81%             -8.91%       19.52%        12.18%
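As a small worked example of the IR metric, the sketch below computes IR = ((a - b)/b) * 100% for a pair of values. The interpretation of a as the compared TLB's value and b as the proposed TLB's value is an assumption for illustration, since the text does not spell out the assignment; the input numbers are hypothetical.

```c
#include <stdio.h>

/* Improvement Ratio as defined in the text: IR = ((a - b) / b) * 100%.
 * Assumption: a is the compared TLB's value, b is the proposed TLB's value. */
static double improvement_ratio(double a, double b)
{
    return (a - b) / b * 100.0;
}

int main(void)
{
    /* Hypothetical miss ratios (%): compared TLB vs. proposed TLB. */
    double compared = 1.2, proposed = 0.8;
    printf("IR of miss ratio: %.2f%%\n", improvement_ratio(compared, proposed));
    return 0;
}
```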

5. Conclusion

We have devised a TLB system for low power consumption in embedded processors. The proposed TLB is configured as a 2-way banked filter TLB and a 2-way banked main TLB. This minimizes power consumption by reducing the overall number of entries accessed at a time, and enhances the effective TLB space by avoiding the inclusion property. As shown in our experiments, our TLB design can reduce power consumption by about 48.03% compared with a fully associative TLB that holds the same number of PPNs, and it can also reduce the miss ratio and Energy*Delay product by 59.26% and 24.9%, respectively, compared with a micro TLB with 4-32 entries, and by 40.81% and 12.18%, compared with a micro TLB with 16-32 entries.

References

[1] S. Santhanam, "StrongARM SA110: A 160MHz 32b 0.5W CMOS ARM Processor," In Hot Chips 8, Aug. 1996.
[2] T. Juan, T. Lang, and J. Navarro, "Reducing TLB Power Requirements," In Int'l Symp. on Low Power Electronics and Design, 1997.
[3] D. Liu and C. Svensson, "Trading Speed for Low Power by Choice of Supply and Threshold Voltages," IEEE Journal of Solid-State Circuits, Vol. 28, No. 1, 1993, pp. 10-17.
[4] A. Borg, J. B. Chen, and N. P. Jouppi, "A Simulation Based Study of TLB Performance," In Proceedings of the Int'l Symp. on Computer Architecture, 1992, pp. 114-123.
[5] J. Kin, M. Gupta, and W. H. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure," In Int'l Symp. on Microarchitecture, 1997, pp. 184-193.
[6] S. Manne, A. Klauser, D. Grunwald, and F. Somenzi, "Low Power TLB Design for High Performance Microprocessors," Univ. of Colorado Technical Report, 1997.
[7] K. Ghose and M. B. Kamble, "Reducing Power in Superscalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation," In Proc. Int'l Symp. on Low Power Electronics and Design (ISLPED'99), Aug. 1999, pp. 70-75.
[8] G. Reinman and N. Jouppi, "An Integrated Cache Timing and Power Model," Compaq WRL Report, 1999.
[9] J. H. Lee, J. S. Lee, and S. D. Kim, "A Selective Temporal and Aggressive Spatial Cache System Based on Time Interval," In Proceedings of the Int'l Conference on Computer Design, Sep. 2000, pp. 287-293.
[10] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An Integrated Cache Timing, Power, and Area Model," Compaq WRL Research Report, Aug. 2001.

