Reference History, Page Size, and Migration Daemons in Local/Remote Architectures*

Mark A. Holliday
Department of Computer Science
Duke University
Durham, NC 27706

* This work was supported in part by the National Science Foundation (Grant CCR-8721781).

Abstract

We address the problem of paged main memory management in the local/remote architecture subclass of shared memory multiprocessors. We consider the case where the operating system has primary responsibility and uses page migration as its main tool. We identify some of the key issues with respect to architectural support (reference history maintenance, and page size) and operating system mechanism (duration between daemon passes, and number of migration daemons). The experiments were conducted using software implemented page tables on a 32-node BBN Butterfly Plus™. Several numerical programs with both synthetic and real data were used as the workload. The primary conclusion is that for the cases considered migration was at best marginally effective. On the other hand, practical migration mechanisms were robust and never significantly degraded performance. The specific results include: 1) Referenced bits with aging can closely approximate Usage fields, 2) larger page sizes are beneficial except when the page is large enough to include the locality sets of two processes, and 3) multiple migration daemons can be useful. Only small regions of the space of architectural, system, and workload parameters were explored; further investigation of other parameter combinations is clearly warranted.

1 Introduction

One of the primary hardware architectures for supporting large scale, parallel computations is shared memory. Effective management of the memory is a major factor in the performance of shared memory multiprocessors. In this paper, we identify some of the issues in managing a paged memory for the local/remote architecture subclass, and experimentally evaluate some of the alternatives.

1.1 Local/Remote Architectures

A local/remote architecture has the main memory partitioned into a local memory on each processor's node. Each processor can access the memories of the other nodes so the shared memory paradigm can be supported. From each processor's viewpoint, the local memories of the other processors are remote memories. Though secondary memory could certainly be attached to one or more nodes, we focus on the case of only the main memory being used. The nodes are tightly-coupled in that the interconnection network is high bandwidth and low latency. Consequently, the remote memory access time is a small multiple of the local memory access time. The BBN Butterfly Plus™ [1], represented in Figure 1, is an example of a local/remote architecture.

Assume the local memories that form the main memory are divided into page frames. The assumed workload is a single parallel application (consisting of user processes) executing in a paged virtual address space. Address references cause the hardware to use the page number to determine which frame in local or remote memory contains the needed page. If the frame is in a remote memory, the word at that address (not the entire page) is fetched.

The memory manager is responsible for assigning pages of the computation to local memories so as to minimize the effective memory access time, as part of the goal of minimizing the computation's completion time. Since the reference pattern (that is, what page is referenced by which processor) can change, the memory manager may dynamically migrate pages during the computation; migrating pages can often reduce the effective memory access time. The memory manager's policy for determining when to migrate a page uses the reference information maintained in each page table entry by the hardware.

We assume that the memory manager implements the policy by a migration daemon that periodically unblocks itself, applies the policy, and migrates the chosen pages from one node's memory to another's. Concurrency control must be provided since address translations and migrations update the page table entries. We assume that locks at the page table entry level of granularity are used.

Figure 1: A local/remote architecture. (Reprinted with permission from Inside the Butterfly Plus, BBN Advanced Computers Inc., 1987.)

Within the environment described above our general goal is to evaluate operating system mechanisms and policies for memory management. A companion paper compares methods of organizing page tables [6]. The specific goal addressed in this paper is to determine the effect on performance of certain key issues with respect to architectural support (reference history maintenance and page size) and operating system mechanisms (the blocking duration and number of migration daemons). In this paper we experimentally evaluate alternative approaches to these issues.

1.2 Related Work

The vast literature on uniprocessor virtual memory is related in that the objects (pages) are the same. In addition, the time constraints on the operating system and the amount of information available to it (for example, Referenced bits) are similar. The differences, however, are substantial. The main memory/secondary memory access time ratio is dramatically larger than the local/remote access time ratio. Also, a reference to a word only in secondary memory requires trapping to the operating system and fetching the entire page.

Another related field of literature is data placement through caching in multiple processor caches [4] and file systems [9]. Those discussions are in the context of a backup complete memory (main memory or file server, respectively). Adapting caching to the context of no backup memory and comparing to the approach here would be interesting.

Relatively little attention has been given to page migration in a tightly coupled architecture with a distributed shared memory. C. Scheurich and M. Dubois [10] discuss page migration in the context of point-to-point interconnections and, in particular, mesh interconnections. In loosely-coupled architectures (for example, a local area network of workstations) the work by Li and Hudak [8] is relevant, but their focus is on replication and solutions to the consistency problem.

1.3 Remainder of the Paper

In the next section we discuss the architectural issues. In section three we discuss the migration daemon issues. In section four we describe the method used in our experiments. In section five we present the results of the experiments. We conclude in section six.

2 Architectural Issues

With respect to architectural support for operating system memory management two important issues are: reference history maintenance and page size.

2.1 Reference History Maintenance

For reference history maintenance, the two main alternatives are a Referenced bit per page for each processor's references and a Usage field per page for each processor's references. The Referenced bit would be set on each reference. The Usage field would be incremented on each reference. Either would be periodically cleared by the migration daemon. Although a hardware implementation of a Usage field is certainly feasible, the policy daemon could approximate the Usage field in a software copy of the page table. The daemon would update this second copy as it cycles through the hardware copy. One method of updating (known as aging when used to simulate LRU [12]) would be to shift each approximate Usage field one bit in the low order direction, discard the least significant bit, and insert the Referenced bit as the most significant bit.

The tradeoff of complexity versus reference history resolution on the spectrum between a single Referenced bit, the aging approximation of a Usage field, and a Usage field is evaluated in our experiments.
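To make the aging update concrete, the following C sketch shows one aging step as the daemon would apply it to its software copy of an entry. It is our own minimal rendering, not the paper's code; the 8-bit field width, the NPROCS constant, and the names are assumptions.

    #include <stdint.h>

    #define NPROCS 32   /* assumed machine size (32-node Butterfly Plus) */

    /* Software copy of one page table entry: one approximate Usage
       field per processor, built from a history of Referenced bits. */
    typedef struct {
        uint8_t usage[NPROCS];
    } soft_pte;

    /* One aging step for one page: shift each approximate Usage field
       one bit toward the low order end (discarding the least
       significant bit) and insert that processor's current Referenced
       bit as the most significant bit.  The daemon would then clear
       the hardware Referenced bits. */
    void age_entry(soft_pte *spte, const uint8_t referenced[NPROCS])
    {
        for (int p = 0; p < NPROCS; p++)
            spte->usage[p] = (uint8_t)((spte->usage[p] >> 1) |
                                       (referenced[p] ? 0x80u : 0u));
    }

Under this scheme a page referenced on every recent pass accumulates a large field value, while aging with a single bit degenerates to a plain Referenced bit.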

2.2 Page Size

In the case of a main memory/secondary memory hierarchy the page size is influenced by the disk sector size since that is the minimum transfer unit. For example, the VAX page size is 512 bytes [2]. Often the page size is substantially larger than the disk sector size (SPUR has a four kilobyte page size [13]). However, an environment of main memory partitioned into local memories with no secondary memory is quite different. The disk sector size is not a lower bound. If a processor will only reference a small percentage of a large page in the near future, then a smaller page size would be advantageous because it would reduce the page transfer delay and network traffic. At least as importantly, if another process will reference a different section of that large page, then contention, the number of remote references, and the potential for thrashing will be dramatically reduced by having two smaller pages, one in each local memory.

On the other hand, the page size can become too small because a smaller page increases the overhead of page table entries. In addition, if there is significant spatial locality over a larger region, then the implicit prefetching gain due to a particular larger page size is lost.

The best page size is likely to be workload dependent. One possible compromise is to simulate an adaptive page size by using a small page size in conjunction with prefetching of adjacent pages. The migration daemon can adjust the degree of prefetching at runtime in response to the workload. The effect of different page sizes is evaluated in our experiments.

3 Migration Daemons

Our approach assumes that there is a copy of the page table in each local memory and one or more migration daemons. Each migration daemon executing on its own processor periodically unblocks itself and steps through all of the copies of the page table applying the migration policy.

The purpose of migration daemons is to improve performance. However, the completion time of the application may instead increase due to daemon overhead. This overhead is primarily due to lock contention and is dependent on the locking protocol, blocking duration, and number of migration daemons used. Of course, the effect on performance of the migration daemons is also dependent on the migration policies used.

3.1 Locking Protocol

When a daemon runs, its passes consist of blocking, an examination phase (in which it examines each copy of each page table entry), and a migration phase (in which each chosen page is migrated and each page table entry of each chosen page is updated).

Since each page table entry is a shared resource between a daemon and the local processor, mutual exclusion must be provided. To maximize concurrency we assume that the daemon's locking protocol uses locks at the granularity of page table entries, and that individual locks are held for the minimum possible duration. In particular, during the examination phase the daemon only locks a particular copy of a page table entry while examining that copy. During the migration phase locks on all the copies of a page table entry must be held simultaneously. The locking protocol for address translation simply locks the local copy of the appropriate page table entry.

If more than one daemon is used, then each daemon examines the copies of the page table entries corresponding to its partition of the pages of the address space.
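The two phases and their different locking disciplines can be summarized in a C sketch of one daemon pass. This is our rendering of the protocol as described, not the testbed's code; pte_copy, lock, unlock, and the other helpers are hypothetical primitives.

    /* Assumed primitives, declared so the sketch is self-contained. */
    typedef struct pte pte;
    extern pte *pte_copy(int node, int page); /* node's copy of an entry */
    extern void lock(pte *), unlock(pte *);
    extern void examine(int node, int page);  /* read/clear reference info */
    extern int  chosen(int page);             /* picked by the policy?     */
    extern void migrate(int page);            /* move frame, update entries */

    void daemon_pass(int nnodes, int npages)
    {
        /* Examination phase: hold only the lock on the copy being
           examined, and only while examining it. */
        for (int n = 0; n < nnodes; n++)
            for (int i = 0; i < npages; i++) {
                lock(pte_copy(n, i));
                examine(n, i);
                unlock(pte_copy(n, i));
            }

        /* Migration phase: for each chosen page, hold the locks on all
           copies of its entry simultaneously while the page is moved
           and every copy is updated. */
        for (int i = 0; i < npages; i++) {
            if (!chosen(i))
                continue;
            for (int n = 0; n < nnodes; n++)
                lock(pte_copy(n, i));
            migrate(i);
            for (int n = 0; n < nnodes; n++)
                unlock(pte_copy(n, i));
        }
    }

Address translation, by contrast, takes only the single lock on the local copy, so a daemon and the user processes contend only on individual entries.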

3.2 Blocking Duration

An issue is the appropriate duration of the period the daemon is blocked. If the duration is too long much of the performance gain due to page migration may be lost. If the duration is too short, performance may degrade due to excessive daemon overhead. An interesting point is that in certain situations, a blocking duration of zero may still be too long; the daemon may not be able to cycle around the page tables rapidly enough. In such a situation, multiple migration daemons may be appropriate, where each daemon is responsible for a different partition of the page table. This possibility is considered in our experiments.

3.3 Migration Policies

It is not the intent of this paper to compare migration policies. However, some policy must be used. We chose a simple, but plausible, pull-based policy. Specifically, each daemon on each pass moves a page to the local memory of the processor that referenced it the most since the last pass. Most of the time no transfer occurs since either no processor referenced the page or the processor in whose local memory the page resides has referenced the page as often as any other processor.

In a sense this is the greedy policy in that (under the assumption that the recent past accurately predicts the near future) it does what is best in the short run with the hope that it is also the best action in the long run. This policy can be myopic in many ways. By not taking transfer cost into account its action may not even be best in the short run. By not accounting for factors such as this page's activity in earlier passes, the number of other pages being migrated in this pass, and the destinations of the other pages being migrated in this pass, it can easily generate poor decisions. It does however provide a reference point and it allows us to concentrate on the other issues.
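In C, the per-page decision might look like the following sketch; the array of per-processor counts would come from the Usage fields (or the aged Referenced bits), and the names are illustrative.

    #define NPROCS 32

    /* Greedy pull-based choice for one page.  Returns the processor
       whose local memory should receive the page, or -1 for the common
       case of no transfer: either no processor referenced the page, or
       the home processor referenced it as often as any other. */
    int pick_destination(const unsigned usage[NPROCS], int home)
    {
        int best = home;                 /* ties favor the current home */
        for (int p = 0; p < NPROCS; p++)
            if (usage[p] > usage[best])
                best = p;
        if (best == home || usage[best] == 0)
            return -1;
        return best;
    }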

4 Method

We used three programs in the experiments: multiplication of a matrix by a vector, a version of Jacobi's method for updating a grid with a five-point stencil [7], and the progression of a two-dimensional surface wave [5]. Besides being realistic numerical applications, each was chosen to represent a class of behavior. Vector matrix multiplication (with repeated references) exemplifies situations where migration is advantageous. Jacobi's method (with optimal initial placement) exemplifies situations where migration is not advantageous and where thrashing (due to interleaved references to shared pages) might occur. The surface wave problem exemplifies a mixture of the two features.

In all three cases we evaluated the effects of reference history, page size, blocking duration, and number of migration daemons. We evaluated these programs on our testbed on a 32-node BBN Butterfly Plus. Our testbed is a version of the Uniform System [11] environment that we have modified.

In the Uniform System each processor node has one process executing on it. Data to be shared between processes (shared means it is accessible by all the processes; it need not imply that more than one process actually accesses it) are mapped into each process's address space. Each process's address space also has mapped in that process's code and private data. A process's code and private data reside in its local memory. Shared data are partitioned among the local memories. In our modification to the Uniform System, accesses to the data mapped into all the address spaces go through a software runtime address translation. Consequently, our page tables are used only for shared data.

4.1 Software Address Translation

The software address translation replaces an access to a shared variable on a given page by a start macro, an access, and an end macro. The start macro converts the page number to an address. The start macro also locks the page table entry and updates the page table entry reference information. The reference information in each page table entry is a Referenced bit for each processor, an approximate Usage field per processor using a history of Referenced bits, or a Usage field counting the number of references by each processor. The start macro sets the Referenced bit or updates the Usage field. The end macro unlocks the page table entry. The migration daemon examines the Usage fields or Referenced bits and clears them. The aging of the approximate Usage field by shifting occurs at the migration daemon examination.

A concern is the possible perturbation of results due to the implementation of address translation in software instead of in hardware. On a 32-node BBN Butterfly Plus without software address translation, an assignment to a scalar variable in local memory takes approximately 1.75 microseconds, and an assignment to a scalar variable in remote memory takes approximately 8.75 microseconds, so the ratio is five. On a 32-node BBN Butterfly Plus with software address translation, an assignment to a scalar variable in local memory with a local page table takes approximately 44 microseconds. We have placed a busy wait in the start macro used by remote accesses so that an assignment to a scalar variable in remote memory takes five times the local accesses (that is, approximately 220 microseconds). Even with the added delay the absolute magnitude is still small enough to be relevant.

The main significance of this added delay is that it introduces a disparity in the relative speeds of address translation and daemon passes. In particular, the relative speed of the daemon is increased and so it becomes more responsive to reference pattern changes. Thus, for workloads where migration is advantageous, the experimental results may be somewhat optimistic. For workloads where migration is disadvantageous, the experimental results may be somewhat conservative.
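The bracketing could be rendered as C macros along the following lines. This is our sketch of the scheme described above, not the modified Uniform System source; pt_entry, frame_base, note_reference, and my_proc are assumed primitives.

    /* Assumed primitives: */
    typedef struct pte pte;
    extern pte *pt_entry(int page);         /* local copy of the entry   */
    extern void lock(pte *), unlock(pte *);
    extern char *frame_base(pte *);         /* frame address for page    */
    extern void note_reference(pte *, int); /* set Referenced bit or     */
    extern int  my_proc(void);              /*   bump this Usage field   */

    /* START locks the entry and records the reference; ADDR yields the
       translated address; END releases the entry. */
    #define START(pg)     (lock(pt_entry(pg)), \
                           note_reference(pt_entry(pg), my_proc()))
    #define ADDR(pg, off) (frame_base(pt_entry(pg)) + (off))
    #define END(pg)       unlock(pt_entry(pg))

    /* A shared read of page pg at offset off would then expand to:
           START(pg);  v = *(float *)ADDR(pg, off);  END(pg);        */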

4.2 Vector Matrix Multiplication

As illustrated in Figure 2, in the vector matrix multiplication workload a square matrix, N x N, is multiplied by a column vector, N x 1. There are M processors. The rows are divided into M/2 contiguous sequences of rows. Each sequence of rows has an even-numbered and an odd-numbered processor assigned to it. The two processors start at opposite ends of their row sequence and move toward each other as they process rows. This method of processor allocation was selected to allow a controlled experiment for measuring under what conditions migration by a daemon is effective. In this experiment each row starts a new page and is a whole number of pages. The initial page placement divides the pages to be referenced by a pair of processors in half. The pages of the rows closest to a processor are placed in that processor's local memory. Thus, if each processor processes its rows at the same speed they will meet in the middle without ever having to reference the pages in the other processor's local memory.

Figure 2: Multiplying a vector and a matrix.

We then introduced a need for migration through nonuniform sparsity and repeat references. Every even entry in the half of the rows closest to the even-numbered processor was zeroed. For non-zero entries we varied the number of references from one (the default needed in vector matrix multiplication) to ten and twenty. Thus, the even-numbered processor will finish its half of the rows first and then start processing the rows whose pages are in the other processor's local memory. Increasing the repeat references slows down both processors, but also increases the opportunity for migration (the odd-numbered processor will have processed fewer rows by the time the even-numbered processor reaches the midpoint). In this manner we can investigate the different parameters in a controlled manner. When the number of references to each non-zero entry is one, it is slightly disadvantageous to migrate the page from the remote memory. As the number of repeat references is increased, it becomes increasingly advantageous to move the page.

Unfortunately, the above experiment (which uses a matrix with 75% non-zero entries) has a disadvantage in that matrices in typical workloads tend to be essentially full (all non-zero entries) or quite sparse. Thus, as a complementary experiment we used four real sparse matrices. Duff, Grimes, Lewis, and Poole [3] developed a suite of representative real sparse matrices to be used by the numerical analysis community. We chose four medium size square matrices from this suite: 003 (147 rows and 1298 nonzero entries), 009 (207 rows and 572 nonzero entries), 014 (238 rows and 1128 nonzero entries), and 021 (900 rows and 4322 nonzero entries). The same program was used as before except that these matrices are stored in and accessed through a sparse matrix representation. That representation has three vectors: a vector of row pointers, a vector of column indices, and a vector of nonzero entries. Thus, the amount of storage needed equals the number of rows plus two times the number of nonzero entries.
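The three-vector layout is the familiar compressed sparse row form. The sketch below shows the vectors and the row-wise product the workload performs; the field names are ours, and we use the conventional nrows+1 row pointers.

    typedef struct {
        int    nrows;
        int   *rowptr;  /* nrows+1 entries: where each row starts    */
        int   *colidx;  /* one entry per nonzero: its column index   */
        float *val;     /* one entry per nonzero: its value          */
    } sparse;

    /* y = A * x for a matrix stored in the three-vector representation. */
    void spmv(const sparse *a, const float *x, float *y)
    {
        for (int r = 0; r < a->nrows; r++) {
            float sum = 0.0f;
            for (int k = a->rowptr[r]; k < a->rowptr[r + 1]; k++)
                sum += a->val[k] * x[a->colidx[k]];
            y[r] = sum;
        }
    }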

Jacobi’s

As illustrated

Method in Figure 3, in our implementation

of Jacobi’s

method, there are two copies of the square matrix. is partitioned

Each copy

by a grid with each processor responsible for up-

dating one partition.

There are ten iterations.

In each iteration

one of the copies is treated as the old copy by all the processors and the other copy is the new copy. In a single iteration each processor updates each of its entries in the new copy, row by row, and then waits for all the other processors to finish Figure 2: Multiplying

a vector and a matrix.

their updates.

A processor updates each entry in its partition

of the new copy by reading the four adjacent entries (the two We then introduced form sparsity

a need for migration

and repeat references.

through

old copy and assigning the average.

Every even entry in the

half of the rows closest to the even-numbered zeroed.

adjacent in its row and the two adjacent in its column) in the

nonuni-

The pages of a processor’s partition

processor was

memory.

For non-zero entries we varied the number of refer-

ences from one (the default needed in vector matrix

multipli-

cation) to ten and twenty. Thus, the even-numbered

processor

tition

While updating

entries on the boundary

of its par-

a processor reads some of the values on the boundary

of adjacent partitions. a partition

will finish its half of the rows first and then start processing

are placed in its local

However, the processor responsible

references each point in that partition

for

more than

the rows whose pages are on the other processor’s local memory.

Increasing

the repeat references slows down both pro-

cessors, but also increases the opportunity odd-numbered

for migration

(the

processor will have processed fewer rows by the

time the even-numbered

processor reaches the midpoint).

In this manner we can investigate

the different parameters

in a controlled manner. When the number of references to each non-zero entry is one, it is slightly

disadvantageous

the page from the remote memory.

As the number of repeat

references is increased, to migrate

it becomes increasingly

to move New Grid

Old Grid

advantageous

the page.

Unfortunately,

the above experiment

Figure 3: Jacobi’s method. the new grid.

(which uses a matrix

with 75% non-zero entries) has a disadvantage

in that matridoes any other processor.

ces in typical workloads tend to be essentially full (all non-zero entries) or quite sparse. Thus, as a complementary we used four real sparse matrices. ces to be used by the numerical

Thus,

experiment

Duff, Grimes, Lewis, and

Poole [3] developed a suite of representative

Reading the old grid and updating

workload,

real sparse matri-

unlike

the vector

there is significant

is not advantageous

analysis community.

We want to evaluate

108
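The per-entry update is the usual five-point averaging step. The sketch below covers one iteration over one partition, assuming the partition bounds keep all four neighbors in range and omitting the end-of-iteration wait; the grid size N and the names are illustrative.

    #define N 512

    /* Update one processor's partition of the new copy from the old
       copy: each entry becomes the average of its four neighbors
       (two in its row, two in its column). */
    void update_partition(float newg[N][N], const float oldg[N][N],
                          int r0, int r1, int c0, int c1)
    {
        for (int i = r0; i < r1; i++)
            for (int j = c0; j < c1; j++)
                newg[i][j] = 0.25f * (oldg[i - 1][j] + oldg[i + 1][j] +
                                      oldg[i][j - 1] + oldg[i][j + 1]);
    }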

4.4 Surface Wave

Recently [5] an algorithm for calculating the progress of a two-dimensional surface wave was implemented on a 1024-process hypercube with impressive speedup results. A member of our department implemented a version of that algorithm on our Butterfly machine. We converted that implementation to use software address translation. The version implemented assumes periodic boundary conditions and no barriers. Having periodic boundary conditions is equivalent to the surface being that of a torus since a wave, when it reaches one boundary, reappears on the opposite boundary. At time 0 nonzero values of magnitude one are placed randomly across the surface. The progression of the wave is computed every δt milliseconds for a total of T milliseconds.

5 Experiments

5.1 Vector Matrix Multiplication: Synthetic

In the vector matrix multiplication synthetic workload we used a 1 megabyte address space containing one copy of a matrix with 512 rows and 512 columns (each entry is a four-byte floating point number). Each processor had a copy in its local memory of the read-only variables (such as the vector in the vector matrix multiplication) and various index variables and flags. We used 16 processors as user processes in the application and one or two processors for migration daemons. Thus, each pair of processors handled 64 rows, with 32 rows initially in each of their local memories.

To account for experimental error we measured the completion time for each processor with a user process and display below the average. The way the Uniform System generates tasks causes the timings for processor 0 to be abnormal, so the timings for processors 0 and 1 were discarded. The times listed take into account only the multiplication itself and not time spent initializing the matrix. In addition, the daemons did not run during the initialization.

The numbers of references to each non-zero entry (recall that the rows closest to the even-numbered processors have only the odd-numbered entries nonzero) considered were 1, 10, and 20, with a page size of 1 kilobyte (that is, 2 pages per row). To provide reference points we also conducted experiments where no migrations are made and where perfect foresight exists (all the pages in a row are migrated just before starting to reference the row).

In Table 1 we show the times (in seconds) as we vary the blocking duration (0, 50, and 10,000 microseconds) and the number of daemons. As expected, Foresight does better than None when there is more than one reference to an entry. Why it does slightly better in the one reference case is not clear.

    Type     Page Size (Reference Count)
             1K(1)   1K(10)  1K(20)  512(10)  2K(10)
    None      6.2     12.6    22.2    14.4     14.4

Table 1: Vector Matrix Multiplication (in seconds). Synthetic workload, 16 processors, Usage field; iD is i daemons; B0, B50, B100 mean block for 0, 50, and 10,000 microseconds, respectively.

Increasing the blocking duration for a single daemon influences two competing factors. An increased blocking duration tends to decrease the completion time because it decreases the lock contention between the daemon and the user processes. An increased blocking duration tends to increase the completion time because it causes the daemon to respond more slowly to advantageous migration opportunities. The competition between these two factors can be seen in the results. For the one reference case the contention factor dominates. For the ten and twenty reference cases the responsiveness factor dominates.

The presence of two daemons accentuates the two competing factors. Both lock contention and responsiveness increase. Again this is reflected in the results, with the contention factor dominating in the one and ten reference cases and the responsiveness factor dominating in the twenty reference case.

To determine the effect of page size we considered page sizes of 512 bytes and 2 kilobytes for the reference count of 10 case. Apparently, the overhead of migrating multiple smaller pages makes Foresight actually worse than no migration with a page size of 512 bytes. In the 512 byte results the contention factor clearly is dominating, both with respect to blocking duration and number of daemons. In the 2 kilobyte results, however, the responsiveness factor appears more important. Note that the 2 kilobyte results are the only case where daemon migration results in a lower completion time than no migration.

Table 2 shows the results of replacing the Usage field by either a Referenced bit (equivalent to aging with one bit) or aging using the last two or last three Referenced bits. The times in all three cases are approximately 7% more than that of a Usage field.

    Aging=1  Aging=2  Aging=3
     14.2     14.0     14.1

Table 2: Vector Matrix Multiplication (in seconds). Synthetic workload, 16 processors, 1 daemon, blocking duration = 0.

In Table 3 we repeat some of the runs but only with four processors. Here migration is much more successful, probably because the daemon can react more rapidly since there are fewer page tables to examine during its pass.

    1D(B50)  2D(B0)
      8.8      9.3
     47.7     48.9
     81.0     85.9

Table 3: Vector Matrix Multiplication (in seconds). Synthetic workload, 4 processors, Usage field.

In summary, either a single Referenced bit or aging appears to be a good approximation of a Usage field. Page size significantly influences daemon performance. In most cases the migration daemon did not improve completion times. However, migration did not substantially degrade performance, and in some cases it did help. Multiple daemons are attractive when reducing the pass time is more important than the increased lock contention.

5.2 Vector Matrix Multiplication: Sparse

For the sparse matrix test suite we looked at a single parameter combination. We used 16 user processors, a Usage field, and 1 daemon with a blocking duration of zero; the reference count is one. We used much smaller pages: 64 bytes. Since matrix 021 has somewhat larger storage demands we also consider a page size of 256 bytes for it. As indicated in Table 4, daemon migration never significantly improved performance. That performance is not significantly improved is not surprising since the reference count is one.

    Type     M003   M009   M014   M021    M021(256)
    None     1.09   2.19   2.74   11.41   11.41
    Daemon   1.39   2.18   2.70   13.72   14.63

Table 4: Vector Matrix Multiplication (in seconds). Sparse Matrix suite, 16 processors, 64 byte pages except for M021(256), Usage field, 1 daemon, blocking duration = 0, reference count is 1.

5.3 Jacobi's Method

In the Jacobi's method workload we used a 2 megabyte address space containing two copies of a matrix with 512 rows and 512 columns (each entry is a four-byte floating point number). We used 17 processors as user processes in the application (one of these did only convergence checking) and one or two processors as migration daemons. Each matrix copy was partitioned by a 4x4 grid with each processor responsible for one partition of 128 rows by 128 columns. We measured the completion time for each processor with a user process and display below the average (again the timings for processors 0 and 1 were discarded).

For page sizes we considered 512 bytes, 1 kilobyte, and 2 kilobytes. The length of the row segment in a grid partition is 512 bytes. Thus, for the larger two page sizes an issue is how the page is placed. A horizontal placement means that the next one or more segments of the current row (which are in other grid partitions) are mapped onto the page. A vertical placement means that the next one or more segments in this grid partition (which are in the next one or more rows) are mapped onto the page. For each page size case we considered the same experiments. As Table 5 indicates, larger page sizes are preferable, as expected, with an advantageous (vertical) placement.

    1K(H)   346.2   329.9   320.6   317.2   328.0
    2K(V)   146.4   179.8   162.7   147.1   178.0
    2K(H)   345.2   347.7   350.3   347.5   349.5

Table 5: Jacobi's Method (in seconds). 16 processors, Usage field. V is vertical, H is horizontal.
