A Selectively Accessing TLB for High Performance and Lower Power Consumption

Jung-Hi Min†, Jung-Hoon Lee†, Seh-Woong Jeong‡, and Shin-Dug Kim†

†Parallel Processing Laboratory, Dept. of Computer Science, Yonsei University, Seoul 120-749, Korea {jhmn,ljh,sdkim}@cs.yonsei.ac.kr
‡Media SOC Team, System LSI Business, Samsung Electronics Co., Ltd., Kihueng, Korea swjung@samsung.co.kr

Abstract

This paper presents a TLB (translation lookaside buffer) structure for low power consumption and high performance. The proposed TLB is constructed as a combination of one block buffer and two-way banked TLBs. The processor selectively accesses either the block buffer or one of the two banked TLBs. This feature is quite different from that used in the traditional block buffering technique. Simulation results show its effectiveness in terms of power consumption and Energy*Delay product. The proposed TLB can reduce power consumption by about 40%, 10%, 23%, and 23%, compared with an FA (fully associative)-TLB, a micro-TLB, a victim-TLB, and a banked-TLB, respectively. It can also reduce the Energy*Delay product by about 38%, 28%, 21%, and 21%, compared with an FA-TLB, a micro-TLB, a victim-TLB, and a banked-TLB, respectively. Therefore the proposed TLB can achieve low power consumption and high performance with a simple architecture.

Keywords: TLB, block buffer, memory hierarchy, virtual memory, and simulation.

1. Introduction

Low-power design techniques have recently become very important in designing embedded processors for devices such as PDAs (personal digital assistants), laptops, and cellular phones. To achieve this goal, a selectively accessing TLB structure is designed for both low power consumption and high performance. Most modern processors have some form of TLB, which is a cache structure for fast address translation from VPN (Virtual Page Number) to PPN (Physical Page Number) [1]. The role of the TLB is to translate a particular virtual address into its corresponding physical address. Thus, the TLB should be designed such that its access latency is minimized to guarantee high performance. In this research, a new TLB structure is proposed to support low power consumption without any loss of performance.

The proposed TLB is constructed as a block buffer and two-way banked TLBs. Power consumption can be minimized by reducing the number of TLB entries accessed at once. Thus only one of the three structures, i.e., the block buffer or one of the two banked TLBs, is accessed at a time. Most TLBs designed for modern processors show high performance, i.e., a 1% miss ratio. Thus this work focuses on the power consumption of the TLB. To verify the proposed TLB, we use SPEC95 benchmarks and CACTI-II [2], modified for TLBs. The proposed TLB is superior to conventional TLBs in power consumption while delivering similar performance. It can reduce power consumption by more than 40% compared with the FA-TLB at the same miss ratio. The benchmarks also show the power effect of the proposed TLB structure. The Energy*Delay product is compared with that of conventional TLBs. The results show that the proposed TLB reduces this overhead by 38% and 27% compared with the FA-TLB and the Micro-TLB, respectively. Consequently, the proposed TLB shows the best performance and power effect when compared with other conventional TLBs.

2. Related works

The block buffering technique [3] is already well known to system developers. This technique exploits spatial locality to support low-power systems. If an application program has high spatial locality, the probability of hits in the block buffer rises, and the power consumed by memory references is reduced. The banked TLB [4] is one of the best power-saving TLB structures. This technique subdivides the main TLB into several sub-TLBs, and only one sub-TLB is accessed for a particular reference. If the main TLB is subdivided into two sub-TLBs, each access needs only half the power, because only one of the two sub-banks is accessed.

The victim-TLB is a TLB structure that adopts the traditional victim buffer mechanism [5].

3. Selectively accessing TLB

To support low power consumption, we propose a TLB that exploits spatial locality as much as possible. We use a block buffer, but not in the same way as the conventional block buffering mechanism: each access goes selectively to either the block buffer or the banked TLB.

3-1. Motivation

This research aims to design a TLB structure for both low power and high performance. Low power consumption can be achieved by reducing the number of entries that are accessed at once, and high performance can be obtained by supporting a fast access time and a high TLB hit ratio. Unfortunately, these characteristics conflict with each other. We consider both AMAT (Average Memory Access Time) and power consumption to find the most efficient TLB. Generally, PTEs (Page Table Entries) show high spatial locality, and we use a block buffer to exploit this locality effectively. The main TLB is subdivided into banks to reduce power consumption by reducing the number of TLB entries accessed at once.

3-2. The proposed TLB structure

In general, conventional TLBs with a block buffer access the block buffer first and then access the main TLB. In contrast, our TLB accesses either the block buffer or one of the two banked TLBs selectively. The least significant six bits of the referenced tag are compared with the corresponding bits stored in the block buffer. If they match, the block buffer is accessed. Otherwise, one of the banked TLBs is accessed, depending on the bank select bit. If a hit occurs in the block buffer, the access takes just one cycle and consumes very little power, i.e., only that required to access one entry. If the six bits do not match, the selected banked TLB is accessed in just one cycle, without any unnecessary block buffer access. Thus power consumption is reduced without any degradation of performance. Two-cycle references take place only when the block buffer is selected and then misses. This mechanism can reduce power consumption by more than 40% while keeping a miss ratio similar to that of the FA-TLB.
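The selection step can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' hardware. The 12-bit page offset follows the bit labels in Fig. 1; treating the whole VPN as the tag, taking the bank select bit from the lowest VPN bit, and the names `select_structure`, `PARTIAL_MASK`, and `buffer_vpn` are assumptions made for this sketch.

```python
PAGE_OFFSET_BITS = 12   # assumption: page offset occupies bits 0..11, as labelled in Fig. 1
PARTIAL_TAG_BITS = 6    # least significant six tag bits, as stated in the text
PARTIAL_MASK = (1 << PARTIAL_TAG_BITS) - 1


def select_structure(virtual_address, buffer_vpn):
    """Return the structure probed in the first cycle.

    buffer_vpn is the VPN held in the one-entry block buffer, or None if empty.
    """
    vpn = virtual_address >> PAGE_OFFSET_BITS
    if buffer_vpn is not None and (vpn & PARTIAL_MASK) == (buffer_vpn & PARTIAL_MASK):
        # Partial match: only the single block buffer entry is driven.
        return "block_buffer"
    # Assumption: the bank select bit is the lowest VPN bit.
    return f"bank{vpn & 1}"


if __name__ == "__main__":
    for address in (0x00012345, 0x00013345, 0x01012345):
        print(hex(address), select_structure(address, buffer_vpn=0x12))
```

The third address shares its low six tag bits with the buffered VPN but has a different full tag, so the block buffer is selected and then misses; this is the case that costs a second cycle.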

3-3. Operational model

There are two cases, depending on the result of comparing the six bits of the buffer entry with those of the referenced address. If the bits match, the block buffer is accessed. Otherwise, one of the banked TLBs is accessed. The block buffer is always accessed within the first cycle. The banked TLB is accessed either in the first cycle or in the next cycle.

(1) The block buffer is selected

If the tag of the referenced address matches that stored in the block buffer, the access proceeds as a conventional hit: the corresponding PPN is sent to the cache with its page offset, and the corresponding entry is written in the block buffer. If a miss takes place in the block buffer, the banked TLB is accessed in the next cycle by using the bank select bit.

(2) The banked TLB is selected

If the six bits of the referenced address do not match those of the block buffer, the banked TLB is accessed in the first cycle. If the tag of the referenced address is found in the banked TLB, a banked TLB hit takes place. In this case, the requested physical page number is sent to the cache with its page offset, and the corresponding entry is written into the block buffer. If the tag of the referenced address is not found in the banked TLB, a banked TLB miss takes place. In this case, the missed page entry is brought from the lower memory and written into both the block buffer and the corresponding banked TLB. The incoming entry overwrites the block buffer, and the banked TLB replaces an entry according to a FIFO replacement policy. Whether it is accessed in the first cycle or in the next cycle, the banked TLB behaves the same way.
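The two cases can be traced with a small behavioural model. The following Python sketch is a toy under stated assumptions, not the simulator used in the paper: the class name `SelectiveTLB`, the dictionary `page_table` standing in for the lower memory, the choice of the lowest VPN bit as bank select bit, and keeping an entry in its bank after it is copied to the block buffer are all assumptions.

```python
from collections import deque

PARTIAL_MASK = 0x3F  # least significant six tag bits


class SelectiveTLB:
    """Toy model: a one-entry block buffer plus two FIFO-replaced banks."""

    def __init__(self, entries_per_bank=16):
        self.buffer = None                  # (vpn, ppn) or None
        self.banks = [deque(), deque()]     # oldest entry on the left
        self.entries_per_bank = entries_per_bank

    def _bank_lookup(self, vpn):
        # Assumption: the lowest VPN bit is the bank select bit.
        for entry_vpn, ppn in self.banks[vpn & 1]:
            if entry_vpn == vpn:
                return ppn
        return None

    def translate(self, vpn, page_table):
        """Return (ppn, cycles, outcome); cycles excludes the miss handling time."""
        cycles = 1
        if self.buffer is not None and (self.buffer[0] & PARTIAL_MASK) == (vpn & PARTIAL_MASK):
            if self.buffer[0] == vpn:
                return self.buffer[1], cycles, "buffer_hit"
            cycles = 2                      # buffer miss: the bank is probed in the next cycle
        ppn = self._bank_lookup(vpn)
        outcome = "bank_hit"
        if ppn is None:
            outcome = "miss"
            ppn = page_table[vpn]           # bring the translation from lower memory
            bank = self.banks[vpn & 1]
            if len(bank) == self.entries_per_bank:
                bank.popleft()              # FIFO replacement
            bank.append((vpn, ppn))
        self.buffer = (vpn, ppn)            # bank hit or miss: entry is written to the buffer
        return ppn, cycles, outcome


if __name__ == "__main__":
    page_table = {vpn: vpn + 0x100 for vpn in range(0x100)}  # hypothetical mapping
    tlb = SelectiveTLB()
    for vpn in (0x12, 0x12, 0x13, 0x12, 0x52):
        print(hex(vpn), tlb.translate(vpn, page_table))
```

In this trace the last reference, `0x52`, shares its low six bits with the buffered `0x12`, so it is the only one that takes two cycles before the bank is probed.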

[Figure: virtual address (TAG, PageOffset in bits 0 to 11); six tag bits and one bank select bit; block buffer with Buffer Hit; two banks (0 and 1), each with a CAM (VPN) and an SRAM (PPN), producing TLB Hit and the physical page number.]

Fig. 1 Organization of the proposed TLBs.

4. Performance Evaluation

There are four criteria to evaluate TLB performance, i.e., miss ratio, AMAT (average memory access time), power consumption, and Energy * Delay. We have compared conventional TLBs with the proposed TLB based on these criteria. We selected four conventional TLBs to compare with the proposed TLB, i.e., FA-TLB (32 entries), Micro-TLB (4-32 entries), Banked-TLB (16-16 entries), and Victim-TLB (16-16 entries). We used trace-driven simulation to estimate power consumption and performance. During the experiments, all combinations of partial tag bits for fast tag comparison were evaluated to find the optimal combination. We conclude that the optimal combination is to use the least significant six bits; the second best is a four-bit combination. With the six-bit comparison, power consumption is reduced by about 42% at the same miss ratio as an FA-TLB, and the Energy * Delay product is reduced by about 39%. This means that the overhead of the proposed TLB structure is negligible compared with any conventional TLB. Fig. 2 and Fig. 3 show the miss ratio comparison and the AMAT comparison, respectively. The percentage of one-cycle hits is more than 98%. The miss handling time is assumed to be 15 cycles, as in common 32-bit embedded processors. Such figures are used in processors including the Hitachi SH7750 series, UltraSPARC-3, and ARM920T.

4.1 Miss ratio and average memory access time

According to the simulation results, the miss ratio of the proposed TLB is similar to that of the conventional TLBs, i.e., FA-TLB, Micro-TLB, Victim-TLB, and Bank-TLB. Most of these TLBs show a miss ratio below 1% [7].

[Figure: miss ratio (%) for applu, compress, gcc, go, ijpeg, li, m88ksim, perl, tomcatv, vortex, and AVG; series: FA-TLB (32 entries), Micro-TLB (4-32 entries), Victim-TLB (16-16 entries), Bank-TLB (16-16 entries), Proposed TLB (1-16-16 entries).]

Fig. 2 Miss ratios of conventional TLB and the proposed TLB.

[Figure: average memory access time (cycles) for the same benchmarks and the same five TLB configurations.]

Fig. 3 Average memory access time of conventional TLB and the proposed TLB.

4.2 Power and Energy * Delay product

The power consumption of each TLB is evaluated with the CACTI-II simulator [2], modified for the TLB structure. The modified CACTI-II simulator calculates access time (ns), cycle time (ns), power (nJ), tag comparison speed (ns), tag comparison power consumption (nJ), data output speed (ns), data output power (nJ), and so on. We use a 0.8 micron process technology and a 4.5 V supply voltage for simulation. Power consumption for various TLB structures, i.e., a direct-mapped structure, a set-associative structure, and a fully-associative structure, is evaluated and analyzed in this work. For the FA-TLB with 16 entries, a read hit consumes 2.9209 nJ of energy, a read miss consumes 1.1021 nJ, and a write access consumes 0.5526 nJ [6]. In general, a one-entry block buffer consumes less power than any other structure. Also, the proposed block buffer shows a hit ratio of more than 53%. This result provides a significant advantage for low power consumption.

The power consumption for the fully-associative TLB is calculated as follows [6].

FA_Power = hit ratio * power dissipated by a hit + miss ratio * power dissipated by a miss

The power dissipated by a hit is the power consumed when a read hit takes place. The power dissipated by a miss is the sum of three components, i.e., the energy dissipated to drive all TLB entries when the TLB is accessed, the energy dissipated to update the tag memory and the data memory when a miss takes place, and the energy dissipated in the cache and pads when a miss occurs. In particular, the third component is the sum of two sub-components, i.e., the energy spent to access a cache block and the energy spent at the on-chip pad slots. Fig. 4 shows the comparison result for the conventional TLBs and the proposed TLB. Traditionally, the Micro-TLB shows the best power behavior among conventional TLBs. But the proposed TLB shows a 46% reduction and a 14% reduction in power consumption when compared with the FA-TLB and the Micro-TLB, respectively. This power advantage leads to the best result in the Energy * Delay product, which is a criterion of both performance and power. Fig. 5 compares the Energy * Delay product of the conventional TLBs with that of the proposed TLB. In this comparison, reductions of 38% and 27% in this overhead are achieved when the proposed TLB is compared with the FA-TLB and the Micro-TLB, respectively. Consequently, the proposed TLB shows the best performance and power effect when compared with other conventional TLBs.
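As a rough worked example of how these metrics combine, the Python sketch below evaluates the FA_Power expression with the per-access energies quoted from [6], together with an AMAT and an Energy * Delay figure. The paper does not spell out its AMAT or Energy * Delay formulas, so the forms used here (one probe cycle, plus one extra cycle for two-cycle references, plus miss ratio times miss handling time; energy per access multiplied by AMAT) are assumptions, as are the function names and the 1% miss ratio and 2% two-cycle fraction used as inputs.

```python
def fa_power(hit_ratio, hit_energy_nj, miss_energy_nj):
    """FA_Power = hit ratio * energy of a hit + miss ratio * energy of a miss."""
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * hit_energy_nj + miss_ratio * miss_energy_nj


def amat_cycles(two_cycle_fraction, miss_ratio, miss_penalty_cycles=15):
    """Assumed AMAT model: every access takes one cycle, two-cycle references
    add one more, and misses add the miss handling time."""
    return 1.0 + two_cycle_fraction + miss_ratio * miss_penalty_cycles


def energy_delay(energy_nj, delay_cycles):
    """Assumed definition: energy per access multiplied by average access time."""
    return energy_nj * delay_cycles


if __name__ == "__main__":
    # 2.9209 nJ and 1.1021 nJ are the 16-entry FA-TLB read hit and read miss
    # energies quoted from [6]; the ratios below are illustrative inputs only.
    energy = fa_power(hit_ratio=0.99, hit_energy_nj=2.9209, miss_energy_nj=1.1021)
    delay = amat_cycles(two_cycle_fraction=0.02, miss_ratio=0.01)
    print(f"energy per access: {energy:.4f} nJ")
    print(f"AMAT: {delay:.2f} cycles")
    print(f"Energy * Delay: {energy_delay(energy, delay):.4f}")
```

Because Energy * Delay multiplies the two terms, a structure that lowers energy per access while keeping AMAT flat improves the product by roughly the same percentage as the energy saving.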

[Figure: power (nJ) for applu, compress, gcc, go, ijpeg, li, m88ksim, perl, tomcatv, vortex, and AVG; series: FA-TLB (32 entries), Micro-TLB (4-32 entries), Victim-TLB (16-16 entries), Bank-TLB (16-16 entries), Proposed TLB (1-16-16 entries).]

Fig. 4 Power comparison of conventional TLBs with the proposed TLB.

[Figure: Energy * Delay for the same benchmarks and the same five TLB configurations.]

Fig. 5 Energy * Delay product of conventional TLBs and the proposed TLB.

5. Conclusions

We proposed a modified TLB structure for low power consumption and high performance. To meet the low power consumption requirement of recent design trends, the proposed TLB embodies a technique to minimize power consumption. The proposed TLB is simply constructed with one block buffer and two-way banked TLBs. For fast comparison, only the six least significant bits of the referenced tag are compared with those of the block buffer. This mechanism selectively accesses the block buffer or one of the two banked TLBs on every access. If a hit takes place in the block buffer, the access takes just one cycle and very little power, i.e., only that of one entry. If the comparison does not match, the banked TLB is accessed. In this case, the access takes just one cycle, without an unnecessary access to the block buffer. This reduces both the AMAT and power consumption. Two-cycle references occur only when the block buffer misses. Simulation results show significant improvements in power consumption and Energy*Delay product. The proposed TLB can reduce the Energy*Delay product by about 38%, 28%, 21%, and 21%, compared with an FA-TLB, a micro-TLB, a victim-TLB, and a banked-TLB, respectively. Thus the proposed TLB supports low power consumption and high performance with a simple architecture and low hardware cost.

6. References

[1] Todd M. Austin and Gurindar S. Sohi, "High-bandwidth address translation for multiple-issue processors," In Proceedings of the 23rd ACM Int'l Symp. on Computer Architecture, pp. 158-167, May 1996.
[2] G. Reinman and N. Jouppi, "An Integrated Cache Timing and Power Model," Compaq WRL Report, 1999.
[3] J. Kin, M. Gupta, and W.H. Mangione-Smith, "The Filter Cache: An Energy Efficient Memory Structure," In International Symposium on Microarchitecture, 1997.
[4] S. Manne, A. Klauser, D. Grunwald, and F. Somenzi, "Low power TLB Design for High Performance Microprocessors," Univ. of Colorado Technical Report, 1997.
[5] Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully Associative Cache and Prefetch Buffers," In 17th ISCA, pp. 364-373, May 1990.
[6] J.H. Lee, J.S. Lee, S.W. Jeong, and S.D. Kim, "A Banked-Promotion TLB For High Performance and Low Power," In ICCD, pp. 118-123, 2001.
[7] Anita Borg, J. Bradley Chen, and Norman P. Jouppi, "A Simulation Based Study of TLB Performance," In ISCA, 1991.
