Design tradeoffs for software-managed TLBs - CiteSeerX

Design Tradeoffs for Software-Managed TLBs RICHARD

UHLIG,

STUART University

An

increasing

and

number

TLBs.

performance

and

memorzes:

cache

Design

TLB

design

systems.

versions

Subject

TREVOR

running

MUDGE,

D.4.2

[Operating

General

Terms:

Design,

Experimentation,

Words

and

(TLB),

Systems]:

trap-driven

interaction

through

and

R2000-based

software-

penalties

that

memory.

with

monitoring

[Memory B.3.3.

(3.4 [Computer

D.4.8

Key

support

considerable

its use of virtual

and their

on a MIPS

B.3.2

memory;

Additional

and

hardware

memory;

technzqaes;

memory impose

a range

This

of monolithic

simulation,

workstation

are work

we explore

running

Ultrix,

3.0.

vzrtual

ulatzon;

can

tradeoffs

Descriptors:

Aids-szm

virtual

structure

Through

of Mach

memorzes;

tems—measurement

provide

system’s

for benchmarks

and three

buffer

STANLEY, BROWN

management

operating

operating

Categories and

architectures

software-managed

OSF\l,

TIM

RICHARD

software

on the

microkernel

TLB

of

However,

dependent

explores

NAGLE,

and

of Michigan

managed highly

DAVID

SECHREST,

Structures]: [Memory

Systems

[Operating

Design

structures]:

Styles—assoczatwe Performance

Organization]:

Systems]:

Analysis

Performance

Storage

of Sys-

Management—~irtual

Performance—measurements Performance

Phrases:

Hardware

monitoring,

simulation,

translation

lookaside

simulation

1. INTRODUCTION Many

computers

translation 1988

support

lookaside

[Smith

et al.

including the [Hewlett-Packard

virtual

buffers 1988],

version

Symposzum This

work

contract grant

CDA-9121887,

Michigan, Permission not made

Ann

number

and

a National

MI

to copy without

with

of computer

and

its

specific permission. @ 0734-2071/94/0800-0175 ACM

Proceedings May

Advanced

the

Science

in

the HP-PA 19921, have

operating system. and provide greater

of the 20th

Projects

Foundation

Foundation

ZS-1

architectures,

Annual

These flexi-

International

1993.

Research

Science

of Electrical

Agency

CISE

Graduate

Engineering

and

under

Research

DARPA/ARO Instrumentation

Fellowship.

Computer

Science,

University

of

48109-2122. fee all or part

for direct

for Computing

in the

San Diegoj

by National

Department

Arbor,

appeared

by Defense

or distributed

publication

Association

hardware-managed

beginning

responsibility into the simplify hardware design

article

supported

address:

an increasing

Architecture,

DAAL03-90-C-O028,

Authors’

of the

of this

on Computer was

by providing

However,

AMD 29050 [Advanced Micro Devices 19911, 1990], and the MIPS RISC [Kane and Heinrich

shifted TLB management software-managed TLBs

A shorter

memory

(TLBs’).

date

of this

commercial

appear,

Machinery.

and

material

is granted

advantage,

the ACM

notice

is given

To copy otherwise,

that

provided copyright

copying

that notice

the copies

is by permission

or to republish,

requires

of the

a fee and/or

$0350 TransactIons

on Computer

Systems,

Vol.

12, No

3, August

1994,

are

and the title

Pages

175-205

176

.

bility

Richard

in

page

Uhlig

table

et al.

structure,

but

typically

have

slower

refill

times

hardware-managed TLBs [DeMoney et al. 1986]. At the same time, operating systems such as Mach

3.0 [Accetta

are moving functionality into user processes and virtual memory for mapping kernel data structures

making greater held within the

These

and related

hence, This

decrease overall system article is a case study,

mance

of application

based

operating

system

and

workstation.

trends

can increase

performance. exploring the

operating

system

In particular,

impact

TLB

we examine

the

system

code.

Through

simulation,

configurations. The results illustrate software-managed TLB in conjunction plementations. As a case study, this

rates

on the

differences

we

use of kernel.

R2000-

in performance

operating construction

explore

and,

perfor-

on a MIPS

seen when running the same applications under different showing how these performance differences reflect the operating

et al. 1986]

miss

of TLBs

code running

than

systems, of the

alternative

the design tradeoffs for with particular operating work seeks to illuminate

TLB

a particular system imthe broader

problems of interaction between operating system software and architectural features. This work is based on measurement and simulation of running systems. To examine issues that cannot developed a system analysis running

machines.

Tapeworm,

We

which

be adequately modeled with simulation, we have tool called Monster, which enables us to monitor

have

also

is compiled

intercept all TLB misses caused references. The information that is used to obtain tions. The remainder

TLB

miss

of this

previous

TLB

and

analysis

tools,

Monster

OS research

under

software-based

5. Section

2. RELATED

and

Ultrix,

6 summarizes

a novel

TLB

the operating

to simulate

is organized related

simulator

system

OSF/

work.

The MIPS

1, and Mach

performance

different

as follows:

to this

and Tapeworm.

HardwareSection

into

TLB

Section

Section

R2000

it can

memory system configura-

2 examines

3 describes

our

structure

and

TLB

3.0 are explored

improvements

called

so that

by both user process and OS kernel Tapeworm extracts from the running

counts

article

its performance and

developed

directly

in Section

are presented

4. in

our conclusions.

WORK

By

caching page table entries, TLBs greatly speed up virtual-to-physical address translations. However, memory references that require mappings not in the TLB result in misses that must be serviced either by hardware or by software. Clark and Emer [ 1985] examined the cost of hardware TLB management by monitoring a VAX- I 1/780. program’s run-time was spent handling More recent papers have investigated

For their workloads, TLB misses. the TLB’s impact

5 to 8% of a user on user

program

performance. Using traces generated from the SPEC benchmarks, Chen, et al. [ 1992] showed that, for a reasonable range of page sizes, the amount of the address space that could be mapped was more important in determining the TLB miss rate than was the specific choice of page size. Talluri et al. ACM

Transactions

on

Computer

Systems,

Vol

12,

No

3, August

1994

So ftware-Managed [1992]

showed

mapped including from

that

large

while

regions

MIPS

RISC,

Clark

system

and

could

(such

[1985]

as that

memory,

TLBs

showed

significantly

references

Emer

TLBs

do not. They

4 to 32 Kbytes

CPI. Operating

older

of physical

also have

showed

that

in

in

also that

the

a strong

impact

the page size

TLB’s

only

11/780)

architectures,

increasing

the

177

.

VAX

newer

decrease

although

TLBs

contribution

on TLB 18%

to

miss

of

all

rates.

memory

references in the system they examined were made by the operating system, these references were responsible for 70% of the TLB misses. Several recent papers [Anderson et al. 1991; Ousterhout 1989; Welch 1991] have noted changes in the structure of operating systems are altering the utilization the

TLB.

For

example,

Anderson

implementation

of an operating

implementation

of that

machine, found

which

a 600%

misses.

increase

This

article

system mance

differs

TLBs

and

technology difficulties

here,

managed

there TLBs

focus

that

3100 running

The different

service

different dependent

TOOLS

To monitor

and analyze operating

system

called

of this

section

which 3.1

Monster

misses its

impact

was far

et

al.

of TLB

and

away

3.0 kernel.

focus

on

of changing

cost

Mach

when,

refill

penalty.

penalties,

of handling

3.0 can vary

reflect

as in the

in the refill

small

system AND

softwareoperating

the varying

with

are used.

System

Monitoring monitoring

EXPERIMENTAL

TLB

behavior we

TLB

and a TLB these

than

and

our on

a

400 cycles.

of the code paths

that

miss types is We, therefore,

ENVIRONMENT

developed

simulator tools

miss

and discussion.

for benchmark have

exam-

hardware-

low variance,

20 to more

lengths

systems

While

a single

from

in our analysis

systems,

describes

they

The Monster

category

through

of the

variance the

times

variety

of

expensive

types of misses. The particular mix of TLB on the construction of the operating system.

on the operating

3. ANALYSIS

Anderson

for the Mach

work

R3000-based

of TLBs.

expensive

can be complex,

relatively

show

DECstation handle highly

TLBs

is significant have

measurements

previous

examination

monolithic microkernel

3.0) on a MIPS

of the most primitive

old-style a newer

on TLB design. We trace the specific sources of perforand the options for eliminating them. The design tradeoffs

for software-managed ined

(Mach

of these

system

from its

an

2.5) and

management

in the incidence

invoked

compared

(Mach

system

software

the handling

frequently

managed

operating

requires

Moreover,

the most

et al. system

that of

called the

programs

running

on a

a hardware-monitoring Tapeworm.

experimental

The remainder environment

in

with Monster system

(Figure

1) enables

comprehensive

analyses

of

the interaction between operating systems and architectures. Monster is comprised of a monitored DE Citation 3100, an attached logic analyzer, and a controlling workstation. The logic analyzer component of Monster contains a programmable hardware state machine and a 128 K-entry trace buffer. The state machine includes pattern recognition hardware that can sense the ACM

Transactions

on Computer

Systems,

Vol

12, No

3, August

1994

178

.

Richard

Uhllg

et al

l——.——.—

I FIR

1

m

The Monster

a Tektromx Ultnx,

DAS

OSF/1,

to the CPU momtur systems

monitormg

9200

Logic

and Mach

pros,

which

vu-tually with

such

system

Monster and

marker

the lo&c

of a TLB

instructions

runmng

has been

and the cache

To enable

servlcmg NOP

3100

motherboard

the processor

act]vlty.

is a hardware-momtonng

a DECstatlon

3.0. The DECstatlon

he between as the

special

Logic Anal yzer

Analyzer

all system

events,

mented

DECstation 3100

miss,

that

This

analyzer each

md]cate

system three modified

allows

operating

exit

access

logic analyzer

on certain

system and

of

systems

to provide

the

to trigger

the entry

conslstmg

operat]ng

has

to

operating

been

points

instru-

of various

routines

processor’s

address,

machine

data,

and control

can be programmed

signals

to trigger

on every

clock

on predefine

cycle.

patterns

This

the CPU resolution)

bus and then store these signals and a timestamp (with into the trace buffer. Monster’s capabilities are described

completely In this

in Nagle et al. [1992]. article, we used Monster

instrumenting exit

points

each OS kernel of various

monitored

with

the marker

trace

marker

code segments.

the logic

instructions

analyzer’s

to measure

with

analyzer,

The

whose

and a nanosecond

buffer.

Once

filled,

the

TLB

miss-handling

instructions machine

resolution trace

buffer

kernels detected

timestamp was

at 1 ns more

costs

denoting

instrumented state

state

appearing

entry were and into

by and

then stored

the logic

postprocessed

to

obtain a histogram of time spent in the different invocations of the TLB miss handlers. This technique allowed us to time single executions of code paths with

far

greater

accuracy

than

can be obtained

using

the

coarser-resolution

system clock. It also avoids the problems inherent in the common method of improving system clock resolution by taking averages over repeated invocations [Clapp et al. 1986]. Note that this measurement technique is not limited to processors with off-chip caches. Externally observable markers can be implemented on a processor with on-chip caches through noncached memory accesses. 3.2 TLB Slmulatlon

with Tapeworm

Many

previous

TLB

design

tradeoffs However,

[Alexander there are

1992].

simulation. tools like ACM

studies

have

used

First, it is difficult to obtain pixie [Smith 199 1] or AE [Larus

Tran+act]ons

on

Pomputer

trace-driven

simulation

to explore

et al. 1985; Clark and Emer 1985; Talluri a number of difficulties with trace-driven

Systems,

Vol

12,

No

et al. TLB

accurate traces. Code annotation 1990] generate user-level address 3, August

1994

Software-Managed traces

for

monitors [Chen

a single

task.

[Agarwal 1993],

account

However,

et al. 1988;

are required

for

both

more

Clark

to obtain

multiprocess

complex

and

Emer

realistic

and

the

address

operating

kernels

traces

that

system

trace-driven

tion

technique

periods

at regular

traces

requires

of time intervals.

are invariant

policies

while Third,

misses

are

TLB.

serviced

software-managed TLBs the stream of instruction Because

the code that

interaction

operation

is processed,

While

this

for

introducing

assumes

be true

state

a TLB

a change

that

address

or management

for

cache

machines),

simulation

it is not

true

in

miss

TLB

can itself

structure

induce

and

by compiling

a TLB

the

3100

are handled

kernel

TLB

this

misses

information

possible TLB A simulated Tapeworm

by software.

our TLB

simulated actual physical the

the

its

own

that

the

For example,

slots

(Figure mappings

to determine

data

TLB

to simulate checks

each

if it would

than

with

the

128 slots

TLB. in the

using

reference the

only

64

that

misses

larger,

simu-

property

the actual

and replacement It can simulate

actual

of 128 virtual-to-

also missed

controls

uses other

available

an array

inclusion

TLB

to simulate

entries

memory

have

a strict

Tapeworm

policies by supplying placement operating system miss handlers.

a TLB maintains

maintains

TLBs.

holds

systems’

Tapeworm

and

or smaller only

Tapekernels. DECsta-

“hooks” after every about all user and

simulator.

structures

larger

actual

and

Tapeworm

and simulated

Tapeworm

2), Tapeworm

TLB

one. Thus,

actual

to

address

actual

lated

directly

the

system

simulator,

the operating

code via procedural relevant information

configurations. TLB can be either

TLB.

TLB

We modified

the Tapeworm passes the

to maintain

ensures

for

miss,

resulting

worm, directly into the OSF/ 1 and Mach 3.0 operating system Tapeworm relies on the fact that all TLB misses in an R2000-based miss handlers to call miss. This mechanism

ex-

distor-

where a miss (or absence thereof) directly changes and data addresses flowing through the processor.

address trace can be quite complex. We have overcome these problems

tion

and

be suspended

parameters

may

by hardware

processing

thus

simulation

in the structural

services

between

system

trace

trace-driven

to changes

of a simulated

(where

that the

considerable

itself.

resources. Some researchers have overcome the storage resource by consuming traces on-the-fly [Agarwal et al. 1988; Chen et al. This

consume

as hardware

or annotated

Second,

1992].

can

such

179

.

storage problem tended

simulation

tools, 1985]

system-wide

workloads

TLBs

TLB

functions TLBs with

between

the

management called fewer

by the entries

than the actual TLB by providing a placement function that never utilizes certain slots in the actual TLB. Tapeworm uses this same technique to restrict the associativity of the actual TLB.1 By combining these policy functions with adherence to the inclusion property, Tapeworm can simulate the performance of a wide range degrees of associativity and a variety

of different-sized TLBs with of placement and replacement

different policies.

— 1 The by

actual

using

which

R2000

certain the

mapping

TLB bits

of may

is fully

associative,

a mapping’s be

virtual

but page

varying number

degrees to

of

associativity

restrict

the

slot

can

be

set

of

(or

emulated slots)

mto

placed. ACM

TransactIons

on Computer

Systems,

Vol.

12, No.

3, August

1994

180

.

Richard

Software .—

Uhlig

et al.

Trap on TLB MISS /

-——

Kernel Code (Unmapped

Space)

F;:~=~F; (128 Slots)

Actual TLB

Page Tables (Mapped

Fig.

2.

Space)

Tapeworm.

revoked

The

whenever

there

Tapeworm

TLB

TLB

M a real

Its own TLB

configuration(s),

Because

captures

dynamic

of the

driven

the

by static

nature

simulator

is

miss

The

the

simulator

system

btnlt

.wmulator

and

mto

resides

avoids

design

cited

completely

operating

TLB

problems

system

misses

m the operating

the

avoids

above.

many

Because

associated

of the problems

Tapeworm

is

avoided.

Because

miss-handling

Tapeworm

code, it considers

impact

by

traces large

is invoked the

with

driven

within the OS kernel, it does not require address difficulties with extracting, storing, and processing TLB

the

uses the real

and

IS

to simulate

system,

Tapeworm

with

simulators

traces

The Tapeworm simulation

—

(84 slots)

trace-driven

TLB

procedure

calls

at all; the various address traces are

by the

machine’s

of all TLB

misses

actual whether

they are caused by user-level tasks or the kernel itself. The Tapeworm code and data structures are placed in unmapped memory and therefore do not distort simulation results by causing additional Tapeworm changes the structural parameters the actual avoiding

TLB,

the behavior

the distortion

3.3

Experimental

All

experiments

of the system

inherent

in fixed

itself

TLB and

misses. Finally, because management policies of

changes

automatically,

thus

traces.

Environment were

performed

on

a DECstation

31002

running

three

different base operating systems (Table I): Ultrix, OSF/ 1, Mach 3.0. Each of these systems includes a standard Unix file system (UFS) [McKusick et al. 1984]. Two additional versions of Mach 3.0 include the Andrew file system (AFS) cache manager [Satyanarayanan 1990]. One version places the AFS cache manager m the Mach Unix Server @FSin) while the other migrates the AFS cache manager into a separate server task (AFSout). To obtain measurements, all of the operating systems were instrumented with counters and markers. For TLB the OSF/ 1 and Mach 3.0 kernels.

2The

DECstation

3100

contains

an

R2000

simulation,

microprocessor

Tapeworm

was embedded

(16,67

and

memory. ACM

Transactions

on

Computer

Systems.

Vol

1’2, No

3, August

1994

MHz)

16 Megabytes

in

of

Software-Managed Table

I.

Benchmarks

and Operating

Benchmark

TLBs

.

181

Systems.

Description

compress

Uncompressed and compresses a 7.7 Megabyte video clip.

IOzone

Asequential file l/O benchmark that writes and then readsa Megabyte file. Written by Bill Norcott.

jpeg~lay

Thexloadimage

10

program written by Jim Frost. Displays

four

JPEG images. mab

John Ousterhout’s Modified Andrew Benchmark [1 8].

mpeg~lay

mpeg_play

ousterhout

John Ousterhout’s benchmark suite from [1 8].

video~lay

A modified version of mpeg~lay an uncompressed video file.

V2.O from the Berkeley Plateau Research Group. Displays 610 frames from a compressed video file [1 9].

Operating System

that displays 610 frames from

Description

Lfltrix

Version 3.1 from Dlgtal Equipment Corporation.

OSFI1

OSF/1 10 is the Open Software Foundation’s 2.5,

Mach 3.0

Carnagle Mellon University’s version mk77 of the kernel and uk38 of the UNIX server.

Mach3+AFSin

Same as Mach 3.0, but with the AFS cache manager (CM) running in the UNIX server.

Mach3+AFSout

Same as Mach 3.0, but with the AFS cache manager running as a separate task outside of the UNIX server. Not all of the CM functlonalky has been moved into this server task.

Benchmarks

were

were

so that

tuned

( 100–200

seconds

Throughout benchmark ment

with

each

under

Mach

this

article

3.0)

article

ON

C compiler

takes All

version

approximately

measurements

2.1 (level-2 the

cited

used

on all the

is the average

same

are the

we use the benchmarks

were

in this

IMPACT

the Ultrlx

benchmark

binaries

cited

4. OS

compded

of three

SOFTWARE-MANAGED

optimization).

amount

average

listed

operating

version of Mach

of three

in Table

systems.

Inputs

of time

to

run

runs.

I. The same

Each

measure-

trials.

TLBS

Operating system references have a strong influence on TLB performance. Yet, few studies have examined these effects, with most confined to a single operating operating

system systems

[Clark and Emer can be substantial.

1985]. However, differences between To illustrate this point, we ran our

benchmark suite on each of the operating systems listed in Table I. The results (Table II) show that although the same application binaries were run on each

system,

and total

TLB

the

functionality

there service

is significant time.

between

variance

Some of these operating

in the

increases

systems

(i.e.,

number

of TLB

UFS

vs. AFS).

creases are due to the structure of the operating systems. For monolithic Ultrix kernel spends only 11.82 seconds handling while the microkernel-based Mach 3.0 spends 80.01 seconds. ACM

TransactIons

on Computer

misses

are due to differences

Systems,

Vol

12, No

Other

in in-

example, the TLB misses,

3, August

1994

182

Richard

.

Uhlig Table

et al

11.

Total

Run Time (see)

Operating System

TLB

Misses

across

the Benchmarks.

Ratio to Ultrix TLB Service Time

Total TLB Service Time (sec~

Total Number of TLB Misses

Ultrix 3.1

583

9,177,401

11.82

1.0

OSFII

892

11,691,398

51.85

4.39 6.77

975

24,349,121

80.01

Mach3+AFSin

1,371

33,933,413

106.56

9.02

Mach3+AFSout

1,517

36,649,834

134.71

11.40

Mach 3.0

*Time The

based

total

on

measured

run-time

Although

the

substantial

and

same

that

of TLB binaries

m the number

while

the total

types

of software-managed reason,

ship to TLB misses.

4.1

miss

Page Tables

OSF/1

task

has

own

machine-independent

seven

of the

operating

and their

of TLB

misses,

programs

systems, g service

increases

four

rise

to a 3-level

(PTE) types. The R2000 used

to

address

cache

m a

fold (from

each with

its own associated

page table

structure,

cost. For

its relation-

and costs of different

implement

linear

(Ll)

page table

page

code [Rashid

table,

structures

which

et al. 1988].

is

types

of

Because

(Figure

3).

maintained

by

the user

page

tables can require several megabytes of space, they are themselves the virtual address space. This is supported through level-2 (L2 page tables, which relatively large and

there times

Hardware

level-l

pmap

benchmark

correspond]n

misses

and the frequencies

3.0 both its

nmmcs

by the

AFSout), the total time spent servicing TLB This is due to the fact that there are different

and Translation

and Mach

Each

on each

to understand

handling,

miss.

recurred

run

number

TLB

it is important

TLB

misses were

of TLB

9,177,401 to 36,639,834 for misses increases 11.4 times. this

to service

time

number

application

difference

Notice

median

stored in or kernel)

also map other kernel data. Because kernel data is sparse, the L2 page tables are also mapped. This gives

page

table

processor

contains

recently

to a physical

hierarchy

used address,

and

four

a 64-slot,

PTEs.

fully

When

the relevant

the PTE

different

page

associative R2000 must

table

TLB,

translates be held

entry

which

is

a virtual

by the TLB.

If

the PTE is absent, the hardware invokes a trap to a software TLB miss-handling routine that finds and inserts the missing PTE into the TLB. The R2000 supports two different types of TLB miss vectors. The fh-st, called the user (uTLB) vector, is used to trap on missing translations for LIU pages. TLB This vector is justified by the fact that TLB misses on LIU PTEs are typically the most frequent [DeMoney et al. 1986]. All other TLB miss types (such as those caused by references to kernel pages, invalid pages, or read-only pages) and

all

other

interrupts

ACM

Tranwct]o.s

and

exceptions

trap

to a second

vector.

generic-exception on

Computer

Systems,

Vol

12,

No

.3 August

1994

vector,

called

the

Software-Managed

n User Data Page

L1

183

.1 L1 K PTE Each PTE maps one,4K page of kernel text or data.

L2 PTE Each L2 PTE maps one, 1,024 entry user page table page.

L2

.

Kernel Data Page

L1 U PTE Each PTE maps one, 4K page of user text or data.

@

TLBs

Vw’tualAddress Space Physical Address Space L3

L3 PTE Each L3 PTE maps 1 page of either L2 PTEs or LIK PTEs.

m

u Fig,

3.

Page

structure

with

structure

holds

are stored

the

storage, virtual data

address

stored

For the purposes correspond

to the

3.0.3 In

kernel

TLB

misses

miss

vectors

3 Other

page

conclusions.

hashed

page

of LIU

to LIU

TLB

L2,

table

they

Hays

structures

are

[1993]

ACM

(LIU)

PTEs.

are stored the

page

tab] e

These

LIU

PTEs

L2 page

tables

tables,

in tbe

L3 page

page

a 3-level

of the

L2 pages

the

to the

form

top

set of L1 page

time,

table

are m appeal table

by

is fixed

hierarchy

in

because

TLB. page hold

size. 1,024

can directly

Each

PTE

LIU

map

requires

PTEs,

either

we define

TLB

structure

implemented

misses,

L3, modify,

4 bytes

of

or 4 Megabytes

4 Megabytes

to handle is primarily

of

of kernel

possible

and

have

examined

could

mms-handhng Transactions

types

LIU

lead

to

111) to 1 and

subcategories

Table

a

PTEs tuned

require

the impact

(Table

by OSF\ III

the different due to the

by a highly

miss

types

five

invalid).

the OS uses them. other

miss

we define

and

are serviced

on TLB

boot

tables

The

data.

table

However,

structures and

can

of the time required differential in costs

vector.

table

the

page

case study,

(LIK,

At

4 KByte

page

and the way that

Huck

table

user

In turn,

anchor

the L2 page tables

of this

16 cycles because

at the uTLB

table.

page

space.

They

PTEs.

as an

a fixed

page

4 GBytes

addition

our measurements misses. The wide within

has

L1

linear

serves

Mach

its own

PTEs.

( LIK)

do not go through

architecture

map

kernel

The

(mapped)

by level-1

(L2)

L3 page

This

3.0.

having

level-2

level-1

space. Likewise,

or indirectly

each task

in the

a single

Mach

in virtual

are mapped

with

table

and

residing

are the

and

L3 page

Therefore,

Mach

table

memory.

R2000

OSF/1

which

tables

L2 PTEs

to the

MIPS

in levels

pages

L1 page

physical

references The

user

(L3 ) PTEs

unmapped

two

L1 page

both

level-3

first

the

the

hold

structure

the

in the

Mapping which

table

of also

types of TLB two different

can be retrieved handler

anywhere

different

shows

set

of

of forward-mapped,

inserted

from

definitions

inverted,

about

and

and

times. on Computer

Systems.

Vol

12, No

3, August

1994.

184

.

Richard

Uhlig

et al,

Table

III

Costs

for Different

TLB

Ultrix

TLB Miss Type

table

types

shows

of TLB

histogram

the

LI U

20

20

LI K

333

355

294

L2

494

511

407

354

286

To

of timings

described

below.

Modify

375

436

499

Invalld

336

277

267

number

misses

of machme

determine

for each

Note

that

Mach 3.0

OSFI1

16

L3

This

Mms Types

type

Ultrix

cycles

these

(at

costs.

of miss,

60 ns/cycle)

Monster

We separate

does not have

reqmred

was

TLB

L3 misses

used

miss

because

to service

to

types

collect mto

different

a 128 K-entry

the

SIX categories

It Implements

a 2-level

page

table. LILT

TLB

mms on a level-l

user

LIK LZ

TLB

miss

on a level-l

kernel

TLB

miss

on level-2

L3

TLB

miss

on a level-3

Modify

A page protection

Invahd

An access to a page marked

kernel

PTE

PTE

PTE

This

PTE

can only

occur

Can occur

after

as invalid

(page

after

a miss

elthcr

on a level-l

a level-2

user

PTE.

mms or a level-l

miss. vlolatlon fault)

300 to over 500 cycles because they are serviced by the generic handler residing at the generic-exception vector. The R2000 TLB hardware supports partitioning of the TLB into two sets of slots. The lower the upper

partition

partition

be refetched

quickly

The TLB

hardware

partition

through

the

range

is intended

is intended (e.g.,

LIU)

random

OS Influence

high

frequently

register

effectively,

fixes

that the

O through

retrieval used

referenced

replacement

index

the lower partition consists of slots consists of slots 8 through 63. 4.2

with

more

or infrequently

also supports a hardware

8 to 63. This,

for PTEs

to hold

PTEs

of PTEs

returns TLB

costs, while

PTEs

7, while

the

can L3).

in the upper

random

partition

that (e.g.,

numbers at

in

8, so that

upper

partition

on TLB Performance

In the versions

of the operating

which

for

account

the

systems

variation

in

studied,

the

number

there

are three

of TLB

basic

misses

factors

and

their

associated costs (Table IV and Figure 4). They are: (1) the use of mapped memory by the kernel (both for page tables and other kernel data structures); (2) the placement of functionality within the kernel, with a user-level server process (service migration), or divided among several server processes (OS decomposition); (additional relationship 4.2.1 address therefore, ACM

and

(3) the

range

of functionality

provided

by the

system

OS services). The rest of this section uses our data to examine between these OS characteristics and TLB performance.

the

Kernel Data Structures. Unmapped portions of the kernel’s space do not rely on the TLB. Mapping kernel data structures, adds a new category of TLB misses: LIK misses. For the operating

Mapping

Transactmn~

on

Computer

Systems.

Vol

12,

No

3, August

1994

So ftware-Managed Table

0s

IV.

Characteristics

of the Operating

Mapped Kernel Data Structures

Service Migration

Few

None

Ultrix

Systems

TLBs

.

185

Studied.

Service Decomposition

Additional OS Services

None

X Server

OSF/1

Many

None

None

X Server

Mach 3.0

Some

Some

Some

X Server

Mach3+AFSin

Some

Some

Some

X Server&

AFS CM

Mach3+AFSout

Some

Some

Many

X Server&

AFS CM

systems

we examined,

substantial

impact

several

hundred

Ultrix

places

unmapped

to service.

most

of its that

flexibility,

in the

data

by

can draw

In contrast,

OSF/

1 and

Mach

3.0L

upper

slots.

handling

This

an

contention

produces

miss

result

LIK

can

a small,

the much

place

most

can have

miss

fixed time.

larger

Table

portion

of

However,

to

virtual

V shows

of their

a

requires

space if it

that

kernel

few data

LIK struc-

to rely heavily on the TLB. Both and LIU PTEs in the TLB’s 56

a large in

L lK

OS at boot

memory.

tures in mapped virtual place, forcing them OSF/1 and Mach 3.0 mix the LIK PTEs

misses

each

in

the

upon

exhausts this fixed-size unmapped misses occur under Ultrix.

of LIK

because

structures

is reserved

Ultrix

number

performance

cycles

memory

maintain

an increase on TLB

an

number

L3

of LIK

miss.5

In

misses.

our

Further,

measurements,

OSF/1 and Mach 3.0 both incur more than 1.5 million LIK misses. OSF/1 must spend 6270 of its TLB handing time servicing these misses while Mach 3.0 spends 4.2.2

3770 of its TLB

Service

handling

time

servicing

In a traditional

Migration.

LIK

operating

misses. system

kernel

such as

Ultrix or OSF/ 1 (Figure 4), all OS services reside within only the kernel’s data structures mapped into the virtual these services, however, can be moved into separate server

the kernel, with space. Many of tasks, increasing

the

[Anderson

modularity

1991].

and

For this

been developed 3.0 [Accetta

extensibility

reason, in recent

1986],

By migrating

of the

numerous years

(e.g., Chorus

and V [Cheriton

these

services

operating

system

microkernel-based [Dean

et al.

operating

systems

have

and Armand

1991],

Mach

1984]).

into

separate

user-level

tasks,

operating

sys-

tems like Mach 3.0, fundamentally, change the behavior of the system for two reasons. First, moving OS services into user space requires, on the MIPS RISC architecture, that both their program text and their data structures be mapped. On other architectures that provide a small number of entries to

4 Like

Ultrix,

structures. to allocate between 5L1K

Mach

3.0 reserves

a portion

of unmapped

space

for

However, it appears that Mach 3.0 uses this unmapped mapped memory Once Mach 30 has allocated mapped mapped

PTEs

and unmapped

are stored

space

in the mapped, ACM

despite L2 page

Transactions

their tables

differing (Figure

on Computer

dynamic

allocation

of data

space quickly and must begin space, it does not distinguish

costs. 3). Systems,

Vol.

12, No.

3, August

1994

186

ACM

.

Transactmns

Richard

Uhlig

on Computer

et al.

Systems,

VOI

12,

No

3, August

Ig94

So ftware-Managed

ACM

Transactions

on Computer

Systems,

Vol

TLBs

12, N0

.

3. August

187

1994.

188

Richard

.

map

large

Uhlig

blocks,

et al.

such

as the entire

kernel,

these

entries

may

be exhausted

[Digital 1992; Hewlett-Packard 1990; Kane 1990; 1993]. In either case, migrated services

and Heinrich 1992; Motorola will have to share more of the

TLB

the user tasks’

with

user

Comparing

tasks,

the

2.2-fold

increase

moving

OS services

moving

OS data

In

user

handled same

from

9.8

into

million

structures

by the

fast

uTLB

to

4.2.3

Operating

in kernel

server

and can provide Unfortunately, performance monolithic

This

is

footprints.

3.0, we see a directly

change

by

LIU

for Mach

are mapped

due

to

comes

from

user

space.

space to mapped

which

are

3.0). In contrast,

the

PTEs,

by LIK

PTEs,

which

are

handler. Moving

Decomposition.

achieve

the

full

OS functionality potential

into

a

of a microkernel-

Operating system functionality can be further deserver tasks. The resulting system is more flexible

a higher degree of fault tolerance. experience with fully decomposed

systems

has shown

severe

problems. Anderson et al. [1991] compared the performance of a Mach 2.5 and a microkernel Mach 3.0 operating system with a

substantial portion of the file system functionality ~unning cache manager task. Their results demonstrate a significant between

TLB

Mach

second

mapped

(20 cycles

space

does not

based operating system. composed into individual

and

million.

kernel

are

handler

System

Unix

21.5

OSF/1

space. The

mapped

structures

by the general-exception

monolithic

in

user

from

data

with

misses

mapped

the

structures

conflicting

of LIU

space,

data

serviced

possibly

number

the two

systems

with

Mach

2.5 running

as a separate performance

36’% faster

than

AFS gap

Mach

3.0,

despite the fact that only a single additional server task is used. Later versions of Mach 3.0 have overcome this performance gap by integrating the AFS cache manager into the Unix server. We compared our benchmarks running on the Mach3 + AFSin system against only

the

same

structural

cache number

benchmarks

difference

manager.

The

of both

running

between

results

(Table

L2 and L3 misses.

on the

the V)

L2 slots required is proportional service concurrently. As a result,

show

Many

mappings needed to service L2 misses. The L2 PTEs compete for the R2000’s

Mach3

systems

+ AFSout

is the

a substantial

of the L3 misses 8 lower

location

TLB

system.

The

of the

AFS

increase

in

the

are due to missing

slots. Yet, the number

to the number of tasks providing adding just a single, tightly coupled

of

an OS service

task overloads the TLB’s ability to map L2 page tables. Thrashing results. This increase in L2 misses will grow ever more costly as systems continue to decompose services into separate tasks. 4.2.4

Additional

OS

In addition

Functionality.

to OS decomposition

and

migration, many systems provide supplemental services (e.g., X, AFS, NFS, Quicklime). Each of these services, when interacting with an application, can change the operating system behavior and how it interacts with the TLB hardware. For example, adding a distributed file service (in the form of an AFS cache manager) to the LIU TLB miss-handling

Mach 3.0 Unix time (Table VI).

ACM

Systems,

Transactions

on

Computer

Vol

12,

No

server adds 10.39 seconds to the This is due solely to the increased 3, August

1994

Software-Managed

ACM

Transactions

on Computer

Systems,

Vol.

TLBs

12, No.

.

3, August

189

1994

190

.

Richard

functionality

Uhlig

residing

et al

in the Unix

server.

adding 14.3 seconds. These misses ment the Mach 3.0 kernel must creased

functionality

support

operating

crease

and decompose

5. IMPROVING This

will

section

improving helpful to

have

systems

TLB

However,

an important

and

LIK

to what

impact

degree

vector.

Our

64 TLB requires

data).

with

slots

three

L2 PTEs

(Table

a small

penalty.

However,

3. O-based systems.

for the

majority

substantially The rest

of time

frequency

5.1

to accommodate

L2 PTEs)

and

types

these

assumptions systems,

handling

of TLB

Reducing

misses Finally,

data

that

these

break

down

the costly TLB

misses.

increases the cost of software-TLB of this section proposes and explores

of most

Unix

two

much

slower

in the slots

partiand

a traditional kernel

56

Unix

(2 PTEs

for

segments. design

choices

make

system such as Ultrix. For Ultrix, The remaining miss types impose

In these

are added to the architecture. size and associativity.

the

sets of 8 lower

left for additional

managed TLBs can be improved. First, can be reduced by modifying the TLB L2 misses can be reduced by increasing

by providing

and

are also reflected

V) demonstrate

spent

vector

disjoint

are intended

at least

assumptions

uTLB

the two

sense for a traditional Unix operating misses constitute 98.3% of all misses. Mach

these

fast

assumptions

into

principal assumptions to be the most frequent

all OS text and most of the OS data page tables) are assumed to be un-

reflects

the

These

slots

three

measurements

Second, of user

design

vectors:

The 8 lower

(which

kernel

TLB

miss

of the slots.

task

in-

PERFORMANCE

The R2000

general-exception upper

can

examines both hardwareand software-based techniques for before we suggest changes, it is TLB performance. However, review the motivations behind the design of the R2000 TLB,

of TLB

tioning

architectures

systems

functionality.

( > 9592) of all TLB miss types. structures (with the exception mapped.

also increase,

on how

operating

described in Section 4.1. The MIPS R2000 TLB design is based on two [DeMoney et al. 1986]. First, LIU misses are assumed

types

misses

are due to the additional memory manageprovide for the AFS cache manager. In-

for the OSF/

non-LIU

misses

Handling

LIU only 1- and

account

these

misses

management (Table VI). four ways in which software-

the cost of certain types of TLB misses vector scheme. Second, the number of the number of lower slots.G Third, the can be reduced we examine

if more

the tradeoffs

total

TLB

slots

between

TLB

Miss Costs

The data in Table V show a significant increase in LIK misses for OSF/1 and Mach 3.0 when compared against Ultrix. This increase is due to both systems’ reliance on dynamic allocation of kernel mapped memory. The R2000’s TLB performance suffers because LIK misses must be handled by the costly generic-exception vector, which requires 294 cycles (Mach 3.0).

6The newer ACM

MIPS

Transactmns

R4000

processor

on Computer

[Kane

Systems,

and Heinrlch Vol

12. No

1992]

3. August

Implements 1994

both

of these

changes

Soitware-Managed Table

VII.

Benefits

of Reduced

TLB

Miss

New Miss costs

Table 6 (see)

20

20

30,123,212

36,15

36.15

0.00

294

20

330,803

8.08

0.79

7.29

L1K

407

40

2,493,283

43.98

2.99

40.99

L3

286

286

690,441

11.85

11.85

0.00

Modify

499

499

127,245

3.81

3.81

0.00

267

267

Invahd

Total table

shows

of adding LIK

These

time

miss

Based

support on our

vectoring

the LIK

misses

from

also estimate

that

decrease

their

hardware miss

could

thelrcost

misses 294

handler.

For

miss-handling

from

could

implement

the remaining

and

1993].

Such

(for

407

cycles

hardware

steps

of the

a scheme

is reduced.

misses

at the

Alternatively,

3.0)

and L2 misses. reduce

to under

implement and

algorithm

retains

the

and reduced hardware),

solution

40 cycles.

Better

step of a TLB handler

necessary

of software

hardware but the

We

would

hardware/software

when

benefits

generic-exception

the cost of

a software

only

that

20 cycles.

the first

invoke

or

example,

for L2 misses

of a hybrid

a software

of the

beginning

would

miss vector

(Mach

support

we estimate

3.0) to approximately TLB

contribution respectively.

For

for LIK

handler

to servme

Their

hardware

handlers,

approach

handler

handlers.

vectors

some systems in

better

TLB

the uTLB Mach

uTLB

respectively.

miss

also come in the form

(e.g., flexible page table structure to systems with full table-walking misses

through

of the

hardware-based

the

to 0.79 and 2.99 seconds,

software

a separate

example,

algorithm

20cycles, down

either

the

allowing

of additional

through

dedicating

and

seconds

analysis

cycles

costs through

to40and

of the

consist

and

cost

support

Hays

miss misses

coding

timing

0.00 48.28

L2

8.08 and 4398

careful

2.70 58.29

TLB

for

costs can be reduced

more

hardware

from

2.70 106,56

vector

reduces

drops

168,429 33,933,413

of reducing

interrupt

Thlschange

miss

through

LIK

the benefits

a separate

misses

to TLB

Time Saved (see)

L2

L1 U

This

New Total cost (see)

To:ymost

Counts

191

.

Costs

Previous Previous Miss costs

Type of PTE Miss

TLBs

to

[Huck

handling

complexity relative average cost of TLB

could

vector

test

before

for

L2 PTE

invocation

of

the code that saves register state and allocates a kernel stack. Table VII shows the benefits of adding additional vectors. It uses the same data

for Mach3

cost estimates these

two

+ AFSin

as shown

resulting

from

modifications

106.56

seconds

service

time

down

drops

is that to 58.29

90%.

More

in Table

V, but is recomputed

the refinements the

total

seconds. importantly,

above.

TLB

LIK

miss

service the LIK

The result service drops

with

time 93%,

the new

of combining drops and

and L2 misses

from

L2 miss no longer

contribute substantially to overall TLB service time. This minor design modification enables the TLB to support much more effectively a microkernel-style operating system with multiple servers in separate address spaces. ACM

TransactIons

on Computer

Systems,

Vol

12, No 3, August

1994

192

-

Richard Uhlig et al.

Multiple TLB handler, dozens

miss vectors provide additional benefits. In the generic trap of load and store instructions are used to save and restore a

task’s context. Many the processor to stall.

of these loads and stores cause cache misses requiring As processor speeds continue to outstrip memory access

times, the CPI in this save/restore region will grow, increasing the number of wasted cycles and making non-uTLB misses much more expensive. TLBspecific they

miss

handlers

contain

only

memory-resident 5.2 The

not suffer data

the same performance

reference

to load

problems

the missed

because

PTE

from

the

page tables.

Partitioning MIPS

will

the single

the TLB

R2000

TLB

design

mandates

a partition

of the

TLB

fixes its position between the 8 lower slots and the 56 upper architectures, it is possible to lock entries in the TLB, which

entries

and

slots. On other can have much

the same effect as a partition, but with more flexibility in choosing the number of slots to be locked [Milenkovic 1990; Motorola 1990]. The MIPS R2000 partitioning is appropriate for an operating system like Ultrix [DeMoney

et

functionality comes

1986].

reside

PTE

mappings

In the structure,

However,

separate

insufficient.

that

the

al. into

This

is because

in different from

as OS

user-space

user-level

designs

tasks,

migrate

having

in a decomposed

tasks

compete

and

decompose

only

8 lower

system

the

by displacing

slots

be-

OS services

each other’s

L2

the TLB.

following sections, we examine the influences of operating system workload, PTE placement policy, and PTE replacement policy on

optimal

mechanism

for

on the Optimal Partition Point. To better understand 5.2.1 Influences impact of the position of the partition point, we measured how L2 miss

the rates

adjusting

partition

vary

depending

used

to vary

number

point.

dynamically

We

on the number the

of TLB

then

the partition

number slots

of lower

of lower

was

kept

propose

fixed

TLB

TLB

slots

slots

adaptive

available.

from

at 64. OSF/

Mach 3.0 ran the mab benchmark over total number of L2 misses was recorded For region

an

point.

Tapeworm

4 to 16, while

1 and

all three

the

was total

versions

of

and

the

the range of configurations, (Figure 5).

each operating system, two distinct regions can be identified. The left exhibits a steep decline which levels off near zero seconds. This shows

a significant performance improvement for every extra lower TLB slot made available to the system, up to a certain point. For example, simply moving from 4 to 5 lower slots decreases OSF\ 1 L2 miss-handling time by almost 507.. fiter 6 lower slots, the improvement slows because the TLB can hold most of the L2 PTEs required by OSF/ 1.7 In contrast, the Mach 3.0 system continues to show significant improvement up to 8 lower performance in line

7Tw0

L2

PTEs

for

slots. The with OSF/

kernel

data

additional 3 slots needed 1 are due to the migration

structures

and

one

each

for

a task’s

segments. ACM

TransactIons

on Computer

Sy~tems,

Vol

12!, No

3, August

1994

to bring Mach of OS services

text,

data,

and

3.0’s from

stack

Software-Managed

~5— E E1 F ~4< ‘E

AFSout

+-

AFSin

-A

Mach3

+—

OSFI

Fig,

\

\

5 p3 L N 12

+

5.

lower

U

: ~h.y.

miss under

for

L2

number time

different

the

total

becomes

of

for the

operating

reserves

PTEs,

L2 misses

vs.

L2 miss

TLB

As the

193

.

cost

total

mab benchmark

servicing

L

PTE The

systems. slots

1

L2

slots,

TLBs

more

lower

time

spent

neghgible.

‘k

o 4567891011121314

1516

Number of Lower Slots

the

kernel

makes

to the

Unix

a system

share

call

the TLB’s

server

in user

to the Unix

lower

slots,

space.

server,

In other

the

words,

Mach3

+ AFSin’s

cache

However,

when

user-level Figure PTEs

behavior

manager the

server,

how

can reside

TLB

hold

+ AFSout

three

additional

continues

the

size of the lower

tion

Different

in Figure

have

Figure

operating

different

7 shows

the optimal various partition requires additional

L2

PTEs until

6. Coupling

the decreasing

(11 total). all

11 L2

systems

with

for the

varying

an optimal

partition

point

of 8 for Mach

+ AFSout, when running level of interaction with The lower

graph

partition

particular

degrees

For example,

of LIU,

with point,

operating

of service the upper

of

LIK,

L2 misses

an optimal

points.

point.

L2 PTEs.

at the expense

the number

partition

partition

additional a separate

to improve

optimal

AFSin, and 12 for Mach3 Applications and their

the into

partition

of increasing

the increasing LIU, LIK, and L3 misses yields shown in Figure 6. This partition point, however, is only optimal system.

slot require-

server’s

decomposed

must

L2 PTEs

in the TLB.

has the side-effect

as shown

is

the lower

Unix

a task

server

three

3.0 because

by the

manager

must

increasing

partition

and L3 misses

cache

Mach3

to Mach

the Unix server’s

increase

is mapped

simultaneously

Unfortunately, the upper

AFS

the

5 shows

is similar

functionality

3.0, whenever

and

the Unix

(text segment, data segment, stack segment) ment, for the system as a whole, to 8. AFS

In Mach task

migragraph

in

3.0, 10 for Mach3

+

the Ousterhout benchmark. OS services also influence

in Figure

7 shows

the results

for

applications running under Mach 3.0. compress has an optimal point at 8 slots. However, video_ play requires 14, and mpeg _ play 18 lower slots to achieve optimal TLB performance. The need for lower

slots

is due to video_

action with services like the BSD the importance of understanding how applications

interact

with ACM

play

and

mpeg _ play’s increased

server and the X server. both the decomposition

the various TransactIons

inter-

It also underscores of the system and

OS services.

on Computer

Systems,

Vol. 12, No. 3, August

1994.

194

.

Richard Uhlig et al

25

1

!%

Fig.

6

Total

number cost

cost

of lower

of TLB

against

mms

the

of

of lower

4 to 32, while entries

LIK,

LIK

L2

+

L3

1s plotted L2,

and

L3 The

TLB

slots

varies

from

number

constant

+

-~

vs.

time.

the total

Total

total

total

m video_

der Mach

The

this

remams

benchmark

misses

slots.

serwcmg

LIU,

components number

of TLB

TLB

e

at

of TLB 64

The

play runmng

un-

30.

24 –

22 ~

G20– ~ al E 18– F ~

14– 12 46810121416182022 Number

Fig.

7

slots

Optimal

optimal

partltlon

shows

the

systems. Mach

ACM

average

The

of Lower

partition

are allocated,

right

upper which

I

I

6

81012141618202224262830

I

for various

operating

are avadable

varies

with

of the

shows

the

the

operating of 3 runs

for

and LIK,

system

benchmark

ousterhout

average

systems

for the L lU,

on Computer

I

run

Systems,

Vol

12,

No

3.

August

I

I

of Lower

I

benchmark under

I

I

I

1

Slots

benchmarks.

3 different

1994

I

As more

and L3 PTEs

and

3.0.

TransactIons

[

Number

slots

of 3 runs graph

4

Slots

points

fewer point

24

This The

3 different benchmarks

lower

yields left

an

graph

operating run

under

Sollware-Managed The TLB PTEs

partition

with

of our operating for all other

types

LIU

PTEs.

if mixing

it is unclear

To determine miss handlers The

operating

with

slots

However,

are not placed

Further,

to allow PTEs

use the lower

of PTEs.

and L3 PTEs

acceptable,

costs from

systems

LIK

cies.

was implemented

low retrieval

higher

it is unclear

costly-to-retrieve

if PTE

is required

the effects of other PTE placement of the baseline systems to implement

results

are

shown

in

Table

the

but

VIII.

four slots

costly-to-retrieve

are mixed

mappings

partitioning

to separate

costs. All

and the upper

why

slots,

195

.

systems

retrieval

for L2 PTEs

in the lower

TLBs

with

with

LIU

the

PTEs

is

at all.

choices, other

Policy

A

we modified meaningful

is

identical

the poli-

to

that

implemented by the baseline systems. The importance of some sort of partitioning is shown by Policy D, where all PTEs are mixed together, and which demonstrates very poor performance. At first glance, the appears to be the most desirable. However, note that with the lower

partition

was not permitted

to grow

beyond

partition,

Figure

we varied

8 shows

policies

the results

A, B, and

has different

the

for this

C. Only

optimal

partition

the

but

curves

improve

from

experiment

total

points,

point

its

optimal

with

fixed

performed are shown.

at these

policy B and

A C,

8 slots to accommodate

the additional PTE types allocated to this partition. To see if the performance of policies B and C would lower

baseline policies

for PTE Note

points

a larger

location

that

the

at 8.

placement each policy

performance

is

roughly the same for each system. From this we can conclude that the most important PTE placement policy is the separation of LIU and L2 PTEs (to avoid

the

between vided

very the

that

performance

PTE

the partition

Careful separate

poor

other

software services

point

in separate

management

user-level this

more

policy

policies

is tuned

can coexist

unacceptable level. This kernel system structuring illustrate

of

placement

D).

However,

A, B, and

to the optimal

of the

TLB

in a system

tasks

that

communicate

we constructed

interaction between servers in a multiserver load, a collection of user-level tasks mimics

the

degree

performance

is important because a key is that logically independent

clearly,

can be varied.

emulator partitioning

on the point

Figure

Mach

9 shows

3.0 kernel.

moves

farther

the

With

through

message

a workload

that

to

results

the

A

to an

passing.

To

emulates

the

OS. In this workof communicating

of running additional

right.

to which

characteristic of microservices should reside

microkernel the behavior

each

pro-

degrades

OS servers by passing a token between each other. The number and the number of pages that each server touches before passing along

differences

region.

can extend

before

the

C are negligible,

this server,

system

that

of servers the token multiserver the

optimal

leaves

the

partition point fixed at 8 will quickly encounter a performance bottleneck due to the addition of servers. However, if the TLB partition point is adjusted to account for the number of interacting servers in the system, a much greater number of servers can be accommodated. Nevertheless, note that as more servers are added, the optimal point number of tightly coupled servers bottleneck is best dealt with through of larger

TLBs

or miss

vectors ACM

still tends to shift upward, limiting the that can coexist in the system. This additional hardware support in the form

dedicated TransactIons

to level-2 on Computer

PTE Systems,

misses. Vol

12, No 3, August

1994

196

Richard Uhlig

.

et al.

Table

VIII

Alternate

table

shows

Placement

Pohcies.

cost

PTE Placement

Policy

This

PTE

A

Level 2 PTEs r lower partition. All other PTEs m upper partition.

1.91

B

Level 2 and 3 PTEs in lower parhtlon. Level 1 user and kernel PTEs In upper parhtlon.

392

c

All PTEs m lower part[tlon, except for level 1 user PTEs which are placed m upper parht]on

2.46

D

No partitioning at all

the time

11.92

affect

of alternate

PTE

spent

(m seconds)

to handle

cost M the

total

ousterhout

The partition

point

m fixed

placement TLB

policles mmses

at 8 sloth for the lower

on TLB for the

partltlon

management

Mach

cost

3.0 system

and 56 slots

The

running

for the upper

partltlon

f!

Policy A

c

Policy B

a

Pollcy

c

o ! 4

12

8

16

20

24

28

32

36

40

ParWlOn Point

Fig

8

that

of Table

costs

TLB are

partlt]onlng VIII,

shown

for different

except The

PTE

wlthdlffcrent

total

cost

for

placement

PTEplacement

pobcles.

This

1s the

pollcles

Only

thetotal

TLB

scale

of the

POIICY D (no partltlon)

]s off the

same

experiment

as

management plot

at

11.92

seconds

The baseline

systems

use a FIFO

replacement

policy

for the lower

partition

and a random replacement policy for the upper partition when selecting a PTE to evict from the TLB after a miss. To explore the effects of the replacement policy in these two partitions, we modified the miss handlers to try other combinations of FIFO and random replacements in the upper and lower partitions. The results of one such experiment are shown in Figure 10. For these workloads, differences between the replacement policies are negligible over the full range of TLB partition points, indicating that the choice between these two replacement policies is not very important. Our trap-driven simulation method makes it difficult to simulate nonrecently used and other ACM

TransactIons

on Computer

Systems,

Vol

12, No

3, August

1994

Software-Managed

TLBs

.

197

25.

&lC43

02X4

11 5

9

13

17

21

ParWlOn

Fig.

9.

TLB

management passing

partitioning costs

a token

for

among

under Mach different

multiserver

operating

3.0 running

a workload

numbers

Replacement (random

33

This

emulates

graph

shows

a multlserver

total

TLB

system

by

tasks.

A

POIICYA Random

D

Policy A FIFO

20

2s

Parhtlon

10.

systems. that

ofuser-level

12

4

Fig.

29

!

15

policies

25

Point

policies.

or FIFO)

for

This the

graph

Mach

shows

3.0 system

36

Point

the

performance

implementing

of different PTE

placement

replacement policy

A on

August

1994

ousterhout.

ACM

Transactions

on Computer

Systems,

Vol

12, No.3,

198

.

Richard

Uhllg

et al,

Table

IX

Different

Fixed Partitioning (see)

Workload

table

mg, The

to the PTE

1.27(18)

1 11

161

147(18)

143

IOzone

1 30(8)

O 43 (32)

043

the

static

total

TLB

optimal

placement

pseudo-LRU

Dynamic Partitioning (see)

Static Partitioning (see)

392 (8)

terhout

compares

settmglt

Schemes.

vldeo~lay

ous

This

TLBPartltlonlng

policies,

management

point

pohcy

(8)

(shown

costs

when

m parentheses),

fixing

the

and when

partltlon

using

at

8, when

dynamic

partltlon-

1s B,

so currently

we cannot

extend

this

statement

to other

policies. 5.2.2 Dynamic TLB partition

point

workload,

PTE

and

knowledge optimal

varies

partition

factors

point

run.

depending

placement

of these

processing

We have

Partitioning.

and

However,

administrators

might

then

but

flx

that

the best place to set the

operating the

it

system

replacement

is possible

it

statically

for

an

operating

system

knowledge workloads

have

not

of time,

although

placement policy and have predict the nature of future typically run ally are often

policy, ahead

shown on the

knowledge

of the

sort that

the

duration can

control

of workloads must

that

optimal

algorithm

point.

The

function

is computed

At

invocation,

each

change

resulted

costs when adjusted

is invoked

after

from the

the two most algorithm

in a significant

compared

against

appropriately.

tests increase

the previous

This

algorithm

some

PTE

be tuned

problems, we have designed and implemented that self-tunes the TLB partition dynamically

at which time it decides to move the at all. It is based on a hill-climbing

the

of some

it can do little to Although system

To address these adaptive algorithm misses or not

Given

to determine

of its own structure, that it must service.

at a given installation, parameters left untouched or are set incorrectly.

structure,

policy.

fixed

are

manu-

a simple, to the

number

of TLB

partition point either up, down, approach, where the objective

recent

settings

to see if the or decrease setting. tends

of the partition most in TLB

recent

miss-handling

If so, the partition to

home

in

point. partition

on

the

point

is

optimal

partition point and tracks this point as it changes with time. We tested this algorithm on each of the benchmarks in our suite and compared the resultant miss-handling costs against runs that flx the partition at 8 and at the static optimal point for the given benchmark. The results are shown in Table IX. Note that dynamic partitioning yields results that at times are slightly performed another

better than experiment

the static optimal. To explain this effect, we that records the movement of the TLB parti-

tion point during the run of gee, results (see Figure 11) show that ACM

TransactIons

on Computer

Systems,

Vol

a component the optimal 12, No

3, August

of the partition 1994

mab benchmark. point changes

The with

Software-Managed

20

TLBs

199

.

T

-z 15 .— 2 ~ 10 .— + .— c 25 0 Time Fig.

11.

Dynamic

partition

with

partitioning

time

TLB

time

partitioning

while

during

running

gcc

gee,

on

This

a system

graph that

shows

as a benchmark

its

overhead

additional compute PTE

among

the optimal point with time, it has the potential than the static optimal which remains fixed at

In

is minimal.

L2 and L3 PTE miss

handler.8 could

Increasing this

upper

TLB

adaptive

moves

its

working

It should

cost for maintaining the objective function.

L1 misses 5.3

of the

dynamic,

tracks results

sets.

some “good average point” for the duration of a run. The invocation period for the dynamic partitioning

costly

movement the

algorithm.

partitioning algorithm to give slightly better

that

the

implements

it is more support

to reduce

this

substantial

we

examine

The tradeoffs

are used to hold

minimize Figure

three

the here

there

is an

are required to for the already

of a register

tuned that

L1

counts

cost.

benefits

of building

can be more

different

types

TLB configurations configurations was L2 misses. 12 shows TLB

For smaller

can be set so that

for the highly

in the form

complex

of mappings

TLBs,

with

slots,

additional

the upper

LIK,

we used Tapeworm

for all seven benchmarks

significant

components

slots

and L3 PTEs),

ranging from 32 to 512 upper slots. fully associative and had 16 lower

performance

the most

TLBs because

(LIU,

whereas the lower slots only hold L2 PTEs. To better understand the requirements for upper to simulate these TLB

dynamic

TLB Size

section slots.

misses,

the

algorithm

however,

the TLB miss counts that Although this cost is negligible

Hardware

help

be noted,

Because

are LIK

under misses;

Each slots

of to

OSF/1. LIU

and

8Maintaining a memory-resident store sequence On the R2000,

counter this can

in the level-l miss handler requires a load-incrementrequire anywhere from 4 to 10 cycles, depending

whether

hits

data

20-cycle

the

memory

average

reference

currently

required ACM

the by the

cache.

level-l

TransactIons

miss

This

is a 20@+ to 507(

increase

over

on the

handler.

on Computer

Systems,

Vol. 12, No. 3, August

1994.

200

.

Richard

Uhlig

et al,

100 90 LI U %0

FUZ. 12.

TLB

service

time

vs

number

70

of

upper TLB slots servlcmg for all

The total cost of TLB miss seven benchmarks run un-

~ &

der

number

2 *

OSF/1

varied

The

from

lower

8 to 512,

slots

tions,

was fixed

Other

modify,

of upper

and

is

whd e the

slots

was

number

of

sum

L2 miss

of

the

\

for less than misses

structures

in the OSF\

ing the hardware misses

recomputed cycles)

for the OSF/

With

misses.

service

time

time

In

After

128 slots,

figures).

Because

size,

any

system,

and

decreases

invalid

further

may

However,

this

a modest discussion

workload. of TLB

article

+ AFSout

misses

in TLB

systems

to support is limited

like

large to TLB

service

(Figure

Mach

3.0.

moving

by over 5070. as “other”

support

in the

with

respect

to

a negligible

effect

on

Of

TLB may be and OSF/ 1

course,

such

even

as CAD

for operating

The reader is referred to Chen support for large applications.

13).

is dominated in TLB

a 128- or 256-entry systems like Ultrix

applications

(20

For example,

(listed

have

we

time

improvement

are constant

size will

LIK

Therefore,

time

time

dominate

misses

5.1, modify-

systems

handling

This suggests that monolithic operating

operating

be needed

miss

32 to 128 slots.

and modify

increases

LIK

The data

to service

20 cycles.

miss-handling

3.0 TLB

and modify

handler

is a noticeable

from

Mach

the invalid

microkernel

there

time.

of mapped

in Section

the uTLB

TLB

Slots

miss-handling

as outlined

lower-cost

reduced,

of Upper

number

to an estimated

the

sizes increase

overall TLB performance. sufficient to support both TLBs

time

TLB

large

3.0, and Mach3

misses

each

as TLB

the

to allow

using

1, Mach

64 to 128 slots

TLB

service

the cost of LIK

by LIU from

total

to

However,

mechanism

the LIK

the

of the total

due

1 kernel.

trap

reduces

357.

is

512

256

128

64 Number

LIK

Other

\

32

of

W

30

20

account

L3

i

*40;

costs

prominence

,.,

50

invalid,

L3 misses

LIK

60

at 16 for all configura-

the

&

systems

et al. [ 1992]

larger

programs. running

for a detailed

5.4 TLB Associativlty Large, consume capacity

fully

associative

TLBs

nearly twice the chip [Mulder et al. 199 I].

(128+

entries)

are

difficult

architects could implement larger TLBs with The following section explores the effectiveness

lesser degrees of TLBs with

of associativity. ACM

Transac.tlons

on Computer

Systems,

to build

area of a direct mapped structure To achieve high TLB performance,

Vol

12, No

3, August

1!994

and

can

of the same computer

of associativity. varying degrees

So ftware-Managed 70--

OSFI1 35

30 iTL-

u

LIU

H

LIK

‘-T

TLBs

–-—-

.

201

Mach 3.0

60

U

LIU

H

LIK

B

L3

.—

50 III=T

— 20

10

0 32

128

64

256

512

32

126

64

Number oi Upper Slots

256

512

Number of Upper Slots

100

‘-Mach3+ AFSout

❑- Lli

90 80

E

LIK

n

L3

—

❑

Other

-

70 ~ g ~ ~ +

60 50 40 [

:’

‘--

‘:

30 20 10 0 64

32

126

5t2

256


Fig.

13.

servicing

Modified (for

in 20 cycles and

service

for Mach3

from

vs. number assuming

are handled + AFSout.

current-generation

sizes ranging

time

benchmarks)

and L2 misses

the bottom

Many

TLB

all seven

of upper L lK

in 40 cycles. Note

that

the

processors

32 to more

than

TLB

misses

slots.

The

total

can be handled

The top graphs scale varies

implement 100 entries

cost of TLB

by the uTLB

are for OSF/1

for each

fully

and Mach

3.0,

graph.

associative

(Table

miss

handler

X). However,

TLBs

with

technol-

ogy limitations may force designers to begin building larger TLBs which are not fully associative. To explore the performance impact of limiting TLB associativity, we used Tapeworm to simulate TLBs with varying degrees of associativit y. The top two graphs in Figure 14 show the total TLB mpeg _ play benchmark under Mach3 + AFSout the benchmark

under

Mach

3.0. Throughout

the

range

miss-handling time for and the video– play of TLB

sizes,

increasing

associativity reduces the total TLB handling time. These figures illustrate the general “rule of thumb” that doubling the size of a caching structure will yield about the same performance as doubling the degree of associativity [Patterson

and Hennessy

19901. ACM

TransactIons

on Computer

Systems,

Vol

12, No 3, August

1994

202

Richard Uhlig et al.

.

Table

X.

Number

of TLB

Slots

for Current

Processors

Number

Processor

Associativity

TI Viking

MIPS R2000

8+4

32

I

2-way

I

32

128

I

full

I

64 unrfred

I

—

I

full

I

64 unlfled

I

—

full

HP 9000 Series 700

full

96+4

4-way

32 urmfled

that

R4000

I

MIPS R4000

Intel 486 Note

page

sizes

actually

mappings

has

vary

from

48 double

are to consecutive

of

Data slots

full

DEC Alpha 21064 IBM RS/6000

Number

of

Instruction slots

4K

to 16 Meg

slots

pages

Two

PTEs

m the virtual

48 urvf!ed

and

are variable

can

reside

address

96+4

m many

m

space

one

processors

double

[Dl~tal

slot

The

If them

JIIPS virtual

1992]

25-

E+

20 ~ {

E3

~

\

; 1=

–a

,

.s15—

x

5 lo:

2-way

3

4-way

— —

4-way

8-way

_.

8-way

Full

x

Full

[

2.way

0 G ,5 x

5

Elti A

~

x

G 7 $?

x 0 I

I 64

1 128

I 256 Number of Upper Slots

32

x o I

I 512

I

32

64

I 256

128


under

mpeg~lay

Mach3+AFSout

vldeo~lay

under

Mach 3.0

140 1.

32

64

128 256 Number of Upper Slots

compress

Fig. ACM

14

Transactions

Total

TLB

on Computer

service

time

Systems,

under

for TLBs

Vol

12, No

512

OSFI1

of different 3, August

sizes and assoclatmtles 1994

I 512

So ftware-Managed Some degree

benchmarks,

the total For

TLB

a 2-way

Even

miss-handling

time TLB,

2-way

32-entry,

These

three

4-way

graphs

of larger

can

TLBs

perform

For example,

set-associative

a 512-entry,

smaller tion

however,

of set associativity.

compress

set-associative

with

a small

in Figure

benchmark

displays TLB

TLBs

203

.

14 shows

under

pathological

OSF/

1.

behavior.

is outperformed

by

a much

TLB.

reducing

is an effective

for graph

for the compress

set-associative

show that

badly

the bottom

TLBs

associativity

technique

to enable

for reducing

the construc-

TLB

misses.

6. SUMMARY This

article

seeks to demonstrate

to architects

ers the importance

of understanding

software

and

features,

simulation

of these

performance ported

to

attributable magnified

a new

platform.

data

structures,

dependent

when

A

utilization on overall

including

the

also on the division

the

page

of the TLB. performance

Software because

themselves.

machine,

together

with

simulations

was was

management of the large performance.

to map TLB

between

its own

behavior

is

the kernel

approaches difficulties

modest degree of service decomposition. In our study, we have presented measurements

numerous

problem

TLB

memory

functionality

and and

system

of the

separate user tasks. Currently popular microkernel ple server tasks, but here fell prey to performance

current

here,

operating

component

tables

of service

hardware

presented

use of virtual

design-

measurement

times that can exist. can significantly influence

on the kernel’s

system

between for

a microkernel

significant

miss service of the TLB

depends

of tools

In the case study

resulted

to an increased the TLB’s impact

behavior

the interactions

importance

interactions.

problems

variations in TLB The construction TLB

the

and operating

and

rely on multiwith even a

of actual

of architectural

systems

on a

problems,

and

have related the results to the differences between operating systems. We have outlined four architectural solutions to the problems experienced by microkernel-based systems: changes in the vectoring of TLB misses, flexible partitioning

of the

associativity

to

implemented

at little

applicable

enable

The

design,

larger

construction

more their

providing

TLBs,

and

of larger

TLBs.

cost, as is done in the R4000.

to the MIPS

performed. system

TLB,

RISC general

potential

architectural lessons

family

The

with

and the utility applicable.

two

in

architectural

of

can

be

are most

the studies

trends

of tools

degree

solutions

on which the

the first

These

regarding

for interaction

machines to which they are ported, impacts are, we feel, more broadly

limiting

were

operating features

for assessing

on these

REFERENCES ACCETTA, M., new

kernel

BARON, R., GOLUB, D., RASHID, R., TEVANIAN, foundation

for

UNIX

development.

USENIX Assoc., Berkeley, Calif. AGARWAL, A., HENNESSY, J., AND HOROWITZ, M. and multiprogramming

workloads. ACM

ACM

Trans.

Transactions

In 1988.

the

A., AND YOUNG, M. Summer

Cache

Comput. on Computer

1986

performance

Syst.

1986.

USENIX

Mach:

A

Conference.

of operating

system

6, 4, 393–431.

Systems,

Vol

12, No 3, August

1994

204

Richard

.

et al

C A , KESHLEAR,

ALEXANDER,

UNIX

envu-onment

AM~RICAN

MICRO

Devices,

Inc.,

ANDERSON, T Support

W

1991

E, LF,\T,

1993

Calif,

on

Neu,s

1985

Translation

buffer

performance

In a

1.3, 5, 2-14

Am29050

Mwruproce.wor

User’s

Manual.

American

Micro

Cahf

H

M,,

B~RSHAD, sy%em

for Programmmg

Workshop

Arch

DWTCES

and operating

B.

M , AND BRJGGS, F.

C’omput

Sunnyvale,

architecture CHEN,

Uhlig

B

Languages

Software Workstation

In the 4th Internatlonul

and

methods

for

Operatzng

1991

N , AND LAZOWSKA, E. D

des~gn

Operatzng system

Sy~tem~

address

Systems

IEEE

The

Conference AC’M,

tracing.

New

In

Computer

InteractIon

of

on Arrhztectura[

York

Proceedings

Society

Press,

of Los

the

4th

Alamltos,

178-185

CHEN, J

B., BORG, A., AND JOtJPPI, N, P

In The 19th

Annual

CHERITON, D, R.

Internatmnal

1984,

1992

A simulation

Sjnlposzurn

The V Kernel:

based

on Computer

A software

study

of TLB

Architecture

base for dmtrlbuted

performance

IEEE,

New

systems

York

IEEE

Softu,

1,

2, 19-42 CLAPP, R

M , DUCHESNEAU,

real-time CLAI?h,

performance

D.

W , AND EM~R,

.%mulatlon D~AN,

I J

Other

Kernel

DEMoNE~,

M.,

DIGITAL.

IEEE,

1992.

Performance

Data

Assoc J

29, 8 (Aug

of the

C’omput

Syst

movement

USENIX

Architecture

Toward

translation

buffer:

1,31-62

m kernehzed

1992.

1986

), 760–778

VAX-11/780

3,

, Berkeley,

Handbook

PA-RISC

system~

In

Mtcr[)-Kernels

Cahf

Operating

system

supports

on a RISC

In

11

Dlgltal

Eqmpment

Architecture

and

Carp

, Bedford,

Instruction

Mass

Set Reference

Manual.

Inc. 1993

space machines.

G

ACM

York

1990

HLTC~, J AND HAYES, J tzwe, IEEE,

1991

N , AND SCH[JLTZE, T.

Commun

Trans

J , AND MASHEY,

Alpha

Hewlett-Packard,

KANE,

F.

New

HEWLIZ rT-P~UC4RD

address

1985 ACM

Architectures

MOORE,

C’OMPCON

for Ada

S

and measurement

R. W. AND ARMAND,

and

J , VULZ, R. A , MLT~G~, T

benchmarks

New

Architectural

In The 20th

support

Annual

for translation

International

table

management

Sympuszum

in large

on Cornpater

Archztec-

York.

.mm HEINRIC’H,

1992.

J

MIPS

RISC

Architecture

Prentice-Hall,

Englewood

Cllffs,

NJ LARUS, J

1990.

R

Abstract

execution:

A techmque

for

efficiently

tracing

programs

Univ

of

Wlsconsm-Madison MC KUSIC,

M.

UNIX.

ACM

K, Joy,

W.

Trans.

N,,

LEFFL~R,

Cornpat

MIL17Nh0V1C,

M.

MOTOROLA.

1993

1990

PowerPC

1990

MC88200

Syst

S, J., AND FA~RAY,

R. S

1984

A fast

file

system

for

2, 3, 181-197

Microprocessor

memory

601 RISC

management

Microprocessor

units

Users’

IEEE

Manual

Mzcro

10, 2, 75-85

Motorola,

Inc , Phoemx,

Ariz, MOTORULA. tice-Hall,

Englewood

Chffs,

MULIYER, J. M., QUACH, N Its application NAGLE,

IEEE

D , UHLI~,

between

R.,

operating

of Mlchlgan,

P.A.TEL, K., decoder

SMITH, Univ

B.

gan Kaufmann,

processor Comput ACM

Why C.,

T

1992

computer

aren’t

1991.

An area

Monster.

model

A tool

architectures.

operating

AND ROWE,

2nd

ed

Pren-

for on-chip

memories

and

Tech

for

analyzing

the

Rep. CSE-TR-147-92

interaction The

Umv

L.

systems

A

1992

getting

faster

Performance

J.

1990,

Computer

Architecture.

A , YOUNG, M., GOLUB, D., BARON, R., BLACh,

architectures.

vu-tual ZEEE

1990

Trans.

Scalable,

memory Comput, secure,

management

of

as hardware?

a software

A Quantatzve

MPEG

WRL video

Approach

Mor-

Systems,

Vol

12,

D., BOLOShY, W., AND Grew,

for paged

umprocessor

and

multi-

file access,

IEEE

37, 8, 896–908 and highly

available

23, 5, 9-21, on Computer

as fast

Cahf

Machine-independent

Transactions

Manual

Berkeley

HENNESSY,

SATYANARAYAN.AN, M.

User’s

Mlch

San Mateo,

RASHID, R, T~VANWN, 1988.

and

of California,

PATTERSON, D. mu

UnLt

Cu-c , 1, 2, 98–106.

AND MUDGE,

Arbor,

OUSTWRHOUT, J. 1989. Tech Note, (TN-11)

Management

T., AND FLYNN, M, J

J. Solzd-State

systems

Ann

Cache/Memory N J

No

3, August

1994

distributed

J,

Software-Managed SMITH,

J.

E.,

memory. SMITH, M.D. tory,

M.,

page

sizes.

New ings. WILRES,

Tracing

Univ.,

Palo

The

19th

with Alto,

KONG S., HILL, In

Corporation Pixie

Tech.

1988.

Computer

of America

Patent

system

205

.

employing

virtual

Systems

Labora-

No. 4,774,659.

Rep. CSL-TR-91-497,

Computer

Calif.

M.

D., AND PATTERSON, D, A.

Annual

Internat~onal

1992.

Symposlurn

Tradeoffs

on Computer

in supporting Archztectzme.

two IEEE,

York. B.

1991.

USENIX

The file system Assoc.,

Berkeley,

J. AND SEARS, B.

protection

Received

Astronautics

1991.

Stanford

TALLURI,

WELCH,

G. E., AND GOLDSMITH, M.A.

DERMER,

Assignee:

TLBs

architecture.

October

1993;

1992.

belongs

in the kernel.

In USENIX

Mach

Symposzum

Proceed-

Calif. A comparison

of protection

lookaside

buffers

and the PA-RISC

HP Laboratories.

revised

May

ACM

1994;

accepted

Transactions

July

1994

on Computer

Systems,

Vol

12, No. 3, August

1994