Reference History, Page Size, and Migration Daemons in Local/Remote Architectures*

Mark A. Holliday
Department of Computer Science
Duke University
Durham, NC 27706

*This work was supported in part by the National Science Foundation (Grant CCR-8721781).

Abstract

We address the problem of paged main memory management in the local/remote architecture subclass of shared memory multiprocessors. We consider the case where the operating system has primary responsibility and uses page migration as its main tool. We identify some of the key issues with respect to architectural support (reference history maintenance, and page size), and operating system mechanism (duration between daemon passes, and number of migration daemons). The experiments were conducted using software implemented page tables on a 32-node BBN Butterfly Plus™. Several numerical programs with both synthetic and real data were used as the workload.

The primary conclusion is that for the cases considered migration was at best marginally effective. On the other hand, practical migration mechanisms were robust and never significantly degraded performance. The specific results include: 1) Referenced bits with aging can closely approximate Usage fields, 2) larger page sizes are beneficial except when the page is large enough to include locality sets of two processes, and 3) multiple migration daemons can be useful. Only small regions of the space of architectural, system, and workload parameters were explored. Further investigation of other parameter combinations is clearly warranted.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1989 ACM 0-89791-300-0/89/0004 $1.50

1 Introduction

Effective memory management is a major factor in the performance of shared memory multiprocessors. In this paper, we identify some of the issues and experimentally evaluate some of the alternatives in managing a paged memory for the local/remote architecture subclass, one of the primary hardware architectures for supporting the shared memory paradigm for large scale, parallel computations.

1.1 Local/Remote Architectures

A local/remote architecture has the main memory partitioned into a local memory on each processor's node. Each processor can access the memories of the other nodes, so the shared memory paradigm can be supported. From each processor's viewpoint, the local memories of the other processors are remote memories. Though secondary memory could certainly be attached to one or more nodes, we focus on the case of only the main memory being used. The nodes are tightly-coupled in that the interconnection network is high bandwidth and low latency. Consequently, the remote access time is a small multiple of the local memory access time. The BBN Butterfly Plus™ [1], represented in Figure 1, is an example of a local/remote architecture.

Assume the local memories that form the main memory are divided into page frames. Address references cause the hardware to use the page number to determine which frame in local or remote memory contains the needed page. If the frame is in a remote memory, the word at that address (not the entire page) is fetched.

The assumed workload is a single parallel application (consisting of user processes) executing in a shared virtual address space. The operating system memory manager is responsible for assigning pages of the computation to local memories so as to minimize the effective memory access time. As part of the goal of minimizing the computation's completion time, the memory manager can often reduce the effective memory access time by migrating pages. Since the reference pattern may dynamically change, the memory manager may migrate pages during the computation. The memory manager's policy for determining which page to migrate to which local memory, and when, uses the reference information maintained in each page table entry by the hardware.
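The page-number lookup described above can be sketched in C. The entry layout, field names, and 1-kilobyte page size here are illustrative assumptions, not the Butterfly Plus hardware's actual format; the sketch only shows how a reference resolves to a frame on some node and how a local reference differs from a remote one.

```c
#include <stdint.h>

/* Hypothetical page-table entry for a local/remote architecture:
 * each entry records which node's local memory holds the frame.
 * Names and sizes are illustrative, not from the paper. */
#define PAGE_SHIFT 10                       /* assume 1-kilobyte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

typedef struct {
    uint16_t node;   /* node whose local memory holds the frame */
    uint32_t frame;  /* frame number within that node's memory  */
} pte_t;

/* Translate a shared virtual address to (frame, offset) and report
 * whether the reference is local: it is local exactly when the frame's
 * node is the referencing processor's own node; otherwise the single
 * word (not the entire page) would be fetched remotely. */
static int is_local(const pte_t *table, uint32_t vaddr, uint16_t my_node,
                    uint32_t *frame, uint32_t *offset)
{
    uint32_t page = vaddr >> PAGE_SHIFT;
    *frame  = table[page].frame;
    *offset = vaddr & PAGE_MASK;
    return table[page].node == my_node;
}
```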
Figure 1: A local/remote architecture (Reprinted with permission from Inside the Butterfly Plus, BBN Advanced Computers Inc., 1987.)

We assume that the memory manager implements the policy by a migration daemon that periodically unblocks itself, applies the policy, and migrates the chosen pages from one node's memory to another's. Concurrency control must be provided since address translations and migrations update the page table entries. We assume that locks at the page table entry level of granularity are used.

Within the environment described above our general goal is to evaluate operating system mechanisms and policies for memory management. A companion paper compares methods of organizing the page tables [6]. The specific goal addressed in this paper is to determine the effect on performance of certain key issues with respect to architectural support (reference history maintenance and page size) and operating system mechanisms (the blocking duration and number of migration daemons). In this paper we experimentally evaluate alternative approaches to these issues.

1.2 Related Work

Relatively little attention has been given to page migration in a tightly coupled architecture with a distributed shared memory. C. Scheurich and M. Dubois [10] discuss page migration for point-to-point interconnections and, in particular, mesh interconnections. In loosely-coupled architectures (a local area network of workstations) the work by Li and Hudak [8] is relevant, but their focus is on replication and solutions to the consistency problem.

The vast literature on uniprocessor virtual memory is related in that the objects (pages) are the same. In addition, the time constraints on the operating system and the amount of information available to it (for example, Referenced bits) are similar. The differences, however, are substantial. The main memory/secondary memory access time ratio is dramatically larger than the local/remote access time ratio. Also, a reference to a word only in secondary memory requires trapping to the operating system and fetching the entire page.

Another related field of literature is data placement through caching in multiple processor caches [4] and file systems [9]. Those discussions are in the context of a backup complete memory (main memory or file server, respectively). Adapting caching to the context of no backup memory and comparing to the approach here would be interesting.

1.3 Remainder of Paper

In the next section we discuss the architectural issues. In section three we discuss the migration daemon issues. In section four we describe the method used in our experiments. In section five we present the results of the experiments. We conclude in section six.
2 Architectural Issues

With respect to architectural support for operating system memory management two important issues are: reference history maintenance and page size.

2.1 Reference History Maintenance

For reference history maintenance, the two main alternatives are a Referenced bit per page for each processor's references and a Usage field per page for each processor's references. The Referenced bit would be set on each reference. The Usage field would be incremented on each reference. Either would be cleared by the migration daemon. Although a hardware implementation of a Usage field is certainly feasible, the policy daemon could approximate the Usage field in a software copy of the page table. The daemon would update this second copy as it cycles through the hardware copy. One method of updating (known as aging when used to simulate LRU [12]) would be to shift each approximate Usage field one bit in the low order direction, discard the least significant bit, and insert the Referenced bit as the most significant bit.

The tradeoff of complexity versus reference history resolution on the spectrum between a single Referenced bit, the aging approximation of a Usage field, and a Usage field is evaluated in our experiments.
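The aging update just described can be sketched in a few lines of C. The 8-bit field width is an assumption for illustration (the experiments use histories of one to three bits); the shift direction and insertion of the Referenced bit as the new most significant bit follow the text.

```c
#include <stdint.h>

/* Aging sketch: on each daemon pass, shift the approximate Usage field
 * one bit toward the low order, discarding the least significant bit,
 * and insert the current Referenced bit as the new most significant bit.
 * The 8-bit width is an illustrative assumption. */
static uint8_t age(uint8_t usage, int referenced_bit)
{
    return (uint8_t)((usage >> 1) | ((referenced_bit ? 1u : 0u) << 7));
}
```

Recently referenced pages accumulate high-order ones, so a larger field value indicates more recent use, exactly as when aging is used to simulate LRU.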
2.2 Page Size

In the case of a main memory/secondary memory hierarchy the page size is influenced by the disk sector size since that is the minimum transfer unit. For example, the VAX page size is 512 bytes [2]. Often the page size is substantially larger than the disk sector size (SPUR has a four kilobyte page size [13]). However, an environment of main memory partitioned into local memories with no secondary memory is quite different. The disk sector size is not a lower bound. If a processor will only reference a small percentage of a large page in the near future, then a smaller page size would be advantageous because it would reduce the page transfer delay and network traffic. At least as importantly, if another process will reference a different section of that large page, then contention, the number of remote references, and the potential for thrashing will be dramatically reduced by having two smaller pages, one in each local memory.

On the other hand, the page size can become too small because a smaller page increases the overhead of page table entries and fragmentation. In addition, if there is significant spatial locality over a larger region, then the implicit prefetching gain due to a particular larger page size is lost.

The best page size is likely to be workload dependent. One possible compromise is to simulate an adaptive page size by using a small page size in conjunction with prefetching of adjacent pages. The migration daemon can adjust the degree of prefetching at runtime in response to the workload. The effect of different page sizes is evaluated in our experiments.

3 Migration Daemon Issues

Our approach assumes that there is a copy of the page table in each local memory and one or more migration daemons. Each migration daemon, executing on its own processor, periodically unblocks itself and steps through all of the copies of the page table applying the migration policy.

The purpose of migration daemons is to improve performance. However, the completion time of the application may instead increase due to daemon overhead. This overhead is primarily due to lock contention and is dependent on the locking protocol, blocking duration, and number of migration daemons used. Of course, the effect on performance of the migration daemons is also dependent on the migration policies used.

3.1 Locking Protocol

When a daemon runs, its passes consist of an examination phase (in which it examines each copy of each page table entry) and a migration phase (in which each chosen page is migrated and each copy of the page table entry of each chosen page is updated). Since each page table entry is a shared resource between a daemon and the local processor, mutual exclusion must be provided. To maximize concurrency we assume that the daemon's locking protocol uses locks at the granularity of page table entries, and that individual locks are held for the minimum possible duration. In particular, during the examination phase the daemon only locks a particular copy of a page table entry while examining that copy. During the migration phase locks on all the copies of a page table entry must be held simultaneously. The locking protocol for address translation simply locks the local copy of the appropriate page table entry.

If more than one daemon is used, then each daemon examines the copies of the page table entries corresponding to its partition of the pages of the address space.

3.2 Blocking Duration

An issue is the appropriate duration of the period the daemon is blocked. If the duration is too long much of the performance gain due to page migration may be lost. If the duration is too short, performance may degrade due to excessive daemon overhead. An interesting point is that in certain situations, a blocking duration of zero may still be too long; the daemon may not be able to cycle around the page tables rapidly enough. In such a situation, multiple migration daemons may be appropriate, where each daemon is responsible for a different partition of the page table. This possibility is considered in our experiments.
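The two-phase locking discipline above can be sketched as follows. Everything here is an assumption for illustration: a fixed node and page count, per-processor reference counts as the history, and POSIX mutexes standing in for the testbed's own locks. The point is only the locking granularity: one copy at a time during examination, all copies simultaneously during migration.

```c
#include <pthread.h>

#define NODES 4
#define PAGES 64

/* One copy of the page table per node; each entry individually locked.
 * Structure and names are illustrative, not the paper's implementation. */
typedef struct {
    pthread_mutex_t lock;
    unsigned usage[NODES];  /* per-processor reference counts */
    int node;               /* node currently holding the frame */
} pte_t;

static pte_t table[NODES][PAGES];

static void init_tables(void)
{
    for (int n = 0; n < NODES; n++)
        for (int p = 0; p < PAGES; p++)
            pthread_mutex_init(&table[n][p].lock, NULL);
}

/* Examination phase: lock one copy of one entry at a time, for the
 * minimum possible duration, while reading and clearing its history.
 * Returns the node recorded as holding the frame. */
static int examine(int node, int page, unsigned counts[NODES])
{
    pte_t *e = &table[node][page];
    pthread_mutex_lock(&e->lock);
    for (int p = 0; p < NODES; p++) {
        counts[p] = e->usage[p];
        e->usage[p] = 0;
    }
    int holder = e->node;
    pthread_mutex_unlock(&e->lock);
    return holder;
}

/* Migration phase: locks on all copies of the chosen entry are held
 * simultaneously while the frame moves and every copy is updated.
 * Locking in a fixed node order avoids deadlock among daemons. */
static void migrate(int page, int dest)
{
    for (int n = 0; n < NODES; n++)
        pthread_mutex_lock(&table[n][page].lock);
    /* ...the frame would be copied to dest's local memory here... */
    for (int n = 0; n < NODES; n++)
        table[n][page].node = dest;
    for (int n = NODES - 1; n >= 0; n--)
        pthread_mutex_unlock(&table[n][page].lock);
}
```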
3.3 Migration Policies

It is not the intent of this paper to compare migration policies. However, some policy must be used. We chose a simple, but plausible, pull-based policy. Specifically, each daemon on each pass moves a page to the local memory of the processor that referenced it the most since the last pass. Most of the time no transfer occurs since either no processor referenced the page or the processor in whose local memory the page resides has referenced the page as often as any other processor.

In a sense this is the greedy policy in that (under the assumption that the recent past accurately predicts the near future) it does what is best in the short run with the hope that it is also the best action in the long run. This policy can be myopic in many ways. By not taking transfer cost into account its action may not even be best in the short run. By not accounting for factors such as this page's activity in earlier passes, the number of other pages being migrated in this pass, and the destinations of the other pages being migrated in this pass, it can easily generate poor decisions. It does however provide a reference point and it allows us to concentrate on the other issues.
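The destination choice made by this greedy pull-based policy can be sketched as a small decision function. The function name and the fixed node count are assumptions for illustration; the tie-breaking follows the text, where no transfer occurs when the current holder referenced the page as often as any other processor.

```c
#define NODES 4

/* Greedy pull-based policy sketch: move the page to the local memory of
 * the processor that referenced it most since the last pass.  Returns
 * the destination node, or -1 when no transfer should occur (nobody
 * referenced the page, or the current holder referenced it at least as
 * often as any other processor). */
static int choose_destination(const unsigned counts[NODES], int holder)
{
    int best = holder;
    for (int p = 0; p < NODES; p++)
        if (counts[p] > counts[best])   /* strict: ties favor the holder */
            best = p;
    if (best == holder || counts[best] == 0)
        return -1;
    return best;
}
```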
4 Method

We used three programs in the experiments: multiplication of a matrix by a vector, a version of Jacobi's method for updating a grid with a five-point stencil [7], and the progression of a two-dimensional surface wave [5]. Besides being realistic numerical applications, each was chosen to represent a class of behavior. Vector matrix multiplication (with repeated references) exemplifies situations where migration is advantageous. Jacobi's method (with optimal initial placement) exemplifies situations where migration is not advantageous and where thrashing (due to interleaved references to shared pages) might occur. The surface wave problem exemplifies a mixture of the two features. In all three cases we evaluated the effects of reference history, page size, blocking duration, and number of migration daemons.

We evaluated these programs on our testbed on a 32-node BBN Butterfly Plus. Our testbed is a version of the Uniform System [11] environment that we have modified.

A concern is the possible perturbation of results due to the implementation of address translation in software instead of in hardware. On a 32-node BBN Butterfly Plus without software address translation, an assignment to a scalar variable in local memory takes approximately 1.75 microseconds and an assignment to a scalar variable in remote memory takes approximately 8.75 microseconds, so the ratio is five. On a 32-node BBN Butterfly Plus with software address translation, an assignment to a scalar variable in local memory with a local page table takes approximately 44 microseconds. We have placed a busy wait in the start macro used by remote accesses so that an assignment to a scalar variable in remote memory takes approximately five times the local access (that is, approximately 220 microseconds). Even with the added delay the absolute magnitude is still small enough to be relevant. The main significance of this added delay is that it introduces a disparity in the relative speeds of address translation and daemon passes. In particular, the relative speed of the daemon is increased and so it becomes more responsive to reference pattern changes. Thus, for workloads where migration is advantageous, the experimental results may be somewhat optimistic. For workloads where migration is disadvantageous, the experimental results may be somewhat conservative.

In the Uniform System each processor node has one process executing on it. Data to be shared between processes (shared means it is accessible by all the processes; it need not imply that more than one process actually accesses it) are mapped into each process's address space. Each process's address space also has mapped in that process's code and private data. A process's code and private data reside in its local memory. Shared data are partitioned among the local memories. In our modification to the Uniform System accesses to the data mapped into all the address spaces go through a software runtime address translation. Consequently, our page tables are only for shared data.

4.1 Software Address Translation

The software address translation replaces an access to a shared variable on a given page by a start macro, an access, and an end macro. The start macro converts the page number to an address. The start macro also locks the page table entry and updates the page table entry reference information. The reference information in each page table entry is a Referenced bit for each processor, an approximate Usage field of Referenced bits, or a Usage field counting the number of references by each processor. The start macro sets the Referenced bit or updates the Usage field. The migration daemon examines the Usage fields or Referenced bits and clears them. The aging of the approximate Usage field by shifting occurs at the migration daemon examination. The end macro unlocks the page table entry.
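A start/end macro pair of this shape might look like the following C sketch. The entry layout and function names are assumptions, not the Uniform System testbed's actual macros; a Usage-field increment stands in for the reference-recording step (a Referenced-bit version would set a bit instead), and a POSIX mutex stands in for the testbed's lock.

```c
#include <pthread.h>
#include <stdint.h>

#define NODES 16

/* Illustrative page-table entry for the software translation. */
typedef struct {
    pthread_mutex_t lock;
    char    *base;          /* address of the frame holding this page */
    unsigned usage[NODES];  /* per-processor Usage fields */
} pte_t;

/* START: convert the page number to an address, lock the entry, and
 * record the reference for processor `me`. */
static char *start_access(pte_t *table, uint32_t page, uint32_t offset, int me)
{
    pte_t *e = &table[page];
    pthread_mutex_lock(&e->lock);
    e->usage[me]++;               /* a Referenced bit would be set here */
    return e->base + offset;
}

/* END: unlock the page table entry. */
static void end_access(pte_t *table, uint32_t page)
{
    pthread_mutex_unlock(&table[page].lock);
}
```

Holding the entry lock across the access is what lets a daemon's migration phase safely move the frame and rewrite every copy of the entry without racing a translation in progress.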
4.2 Vector Matrix Multiplication

As illustrated in Figure 2, in the vector matrix multiplication workload, a square matrix, N x N, is multiplied by a column vector, N x 1. There are M processors. The rows are divided into contiguous sequences of rows. Each sequence of rows has an even-numbered and an odd-numbered processor assigned to it. The two processors start at opposite ends of their row sequence and move toward each other as they process rows. This method of processor allocation was selected to allow a controlled experiment for measuring under what conditions migration by a daemon is effective.

In this experiment each row starts a new page and is a multiple number of pages. The initial page placement divides the pages to be referenced by a pair of processors in half. The pages of the rows closest to a processor are placed in that processor's local memory. Thus, if each processor processes its rows at the same speed they will meet in the middle without ever having to reference the pages in the other processor's local memory.

Figure 2: Multiplying a vector and a matrix.

We then introduced a need for migration through nonuniform sparsity and repeat references. Every even entry in the half of the rows closest to the even-numbered processor was zeroed. For non-zero entries we varied the number of references from one (the default needed in vector matrix multiplication) to ten and twenty. Thus, the even-numbered processor will finish its half of the rows first and then start processing the rows whose pages are in the other processor's local memory. Increasing the repeat references slows down both processors, but also increases the opportunity for migration (the odd-numbered processor will have processed fewer rows by the time the even-numbered processor reaches the midpoint). In this manner we can investigate the different parameters in a controlled manner. When the number of references to each non-zero entry is one, it is slightly disadvantageous to migrate the page from the remote memory. As the number of repeat references is increased, it becomes increasingly advantageous to move the page.

Unfortunately, the above experiment (which uses a matrix with 75% non-zero entries) has a disadvantage in that matrices in typical workloads tend to be essentially full (all non-zero entries) or quite sparse. Thus, as a complementary experiment we used four real sparse matrices. Duff, Grimes, Lewis, and Poole [3] developed a suite of representative real sparse matrices to be used by the numerical analysis community. We chose four medium size square matrices from this suite: 003 (147 rows and 1298 nonzero entries), 009 (207 rows and 572 nonzero entries), 014 (238 rows and 1128 nonzero entries), and 021 (900 rows and 4322 nonzero entries). The same program was used as before except that these matrices are stored in and accessed through a sparse matrix representation. That representation has three vectors: a vector of row pointers, a vector of column indices, and a vector of nonzero entries. Thus, the amount of storage needed equals the number of rows plus two times the number of nonzero entries.
4.3 Jacobi's Method

As illustrated in Figure 3, in our implementation of Jacobi's method, there are two copies of the square matrix. Each copy is partitioned by a grid with each processor responsible for updating one partition. There are ten iterations. In each iteration one of the copies is treated as the old copy by all the processors and the other copy is the new copy. In a single iteration each processor updates each of its entries in the new copy, row by row, and then waits for all the other processors to finish their updates. A processor updates each entry in its partition of the new copy by reading the four adjacent entries (the two adjacent in its row and the two adjacent in its column) in the old copy and assigning the average.

Figure 3: Jacobi's method. Reading the old grid and updating the new grid.

The pages of a processor's partition are placed in its local memory. While updating entries on the boundary of its partition a processor reads some of the values on the boundary of adjacent partitions. However, the processor responsible for a partition references each point in that partition more than does any other processor. Thus, unlike the vector matrix multiplication there is significant sharing of data, and migration is not advantageous (the initial page placement is the best). We want to evaluate the penalty due to migration daemon overhead and the susceptibility to thrashing under the different parameter settings.
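The per-partition five-point update described above can be sketched as follows. The grid size, function name, and half-open partition bounds are assumptions for illustration; the outer-boundary handling and the barrier between iterations are omitted.

```c
#define N 8

/* One Jacobi sweep over a partition [r0,r1) x [c0,c1) of the new copy:
 * each interior entry becomes the average of its four neighbors (two in
 * its row, two in its column) read from the old copy. */
static void jacobi_partition(double old[N][N], double new_[N][N],
                             int r0, int r1, int c0, int c1)
{
    for (int i = r0; i < r1; i++)
        for (int j = c0; j < c1; j++)
            if (i > 0 && i < N - 1 && j > 0 && j < N - 1)
                new_[i][j] = (old[i - 1][j] + old[i + 1][j] +
                              old[i][j - 1] + old[i][j + 1]) / 4.0;
}
```

Reading neighbor values across the partition edge is exactly the cross-partition sharing that makes migration unattractive here: a page on the boundary is read by two processors every iteration.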
4.4 Surface Wave

Recently [5] an algorithm for calculating the progress of a two-dimensional surface wave was implemented on a 1024-processor hypercube with impressive speedup results. A member of our department implemented a version of that algorithm on our Butterfly machine. We converted that implementation to use software address translation. The version implemented assumes periodic boundary conditions and no barriers. Having periodic boundary conditions is equivalent to the surface being that of a torus since a wave when it reaches one boundary reappears on the opposite boundary. At time 0 nonzero values of magnitude one are placed randomly across the surface. The progression of the wave is computed every δt milliseconds for a total of T milliseconds.

5 Experiments

5.1 Vector Matrix Multiplication: Synthetic

In the vector matrix multiplication synthetic workload we used a 1 megabyte address space containing one copy of a matrix with 512 rows and 512 columns (each entry is a four-byte floating point number).
Each processor had a copy in its local memory of the read-only variables (such as the vector in the vector matrix multiplication) and various index variables and flags. We used 16 processors as user processes in the application and one or two processors for migration daemons. Thus, each pair of processors handled 64 rows with 32 rows initially in each of their local memories.

To account for experimental error we measured the completion time for each processor with a user process and display below the average. The way the Uniform System generates tasks causes the timings for processor 0 to be abnormal, so the timings for processors 0 and 1 were discarded. The times listed take into account only the multiplication itself and not time spent initializing the matrix. In addition, the daemons did not run during the initialization.

The number of references to each non-zero entry (recall the rows initially in the local memories of the even-numbered processors have only the odd-numbered entries nonzero) considered were 1, 10, and 20 with a page size of 1 kilobyte (that is, 2 pages per row). To provide reference points we also conducted experiments where no migrations are made and where perfect foresight exists (all the pages in a row are migrated just before starting to reference the row). In Table 1 we show the times (in seconds) as we vary the blocking duration (0, 50, and 10,000 microseconds) and the number of daemons.

Table 1: Vector Matrix Multiplication (in seconds). Synthetic workload, 16 processors, Usage field, iD is i daemons, B0, B50, B10000 mean block for 0, 50, and 10,000 microseconds, respectively.

  Type  | 1K(1) | 1K(10) | 1K(20) | 512(10) | 2K(10)
  None  |  6.2  |  12.6  |  22.2  |  14.4   |  14.4

As expected, Foresight does better than None when there is more than one reference to an entry. Why it does slightly better in the one reference case is not clear.

Increasing the blocking duration for a single daemon influences two competing factors. An increased blocking duration tends to decrease the completion time because it decreases the lock contention between the daemon and the user processes. An increased blocking duration tends to increase the completion time because it causes the daemon to respond more slowly to advantageous migration opportunities. The competition between these two factors can be seen in the results. For the one reference case the contention factor dominates. For the ten and twenty reference cases the responsiveness factor dominates.

The presence of two daemons accentuates the two competing factors. Both lock contention and responsiveness increase. Again this is reflected in the results with the contention factor dominating in the one and ten reference cases and the responsiveness factor dominating in the twenty reference case.

To determine the effect of page size we considered page sizes of 512 bytes and 2 kilobytes for the reference count of 10 case. Apparently, the overhead of migrating multiple smaller pages makes Foresight actually worse than no migration with a page size of 512 bytes. In the 512 byte results the contention factor clearly is dominating both with respect to blocking duration and number of daemons.

Table 2: Vector Matrix Multiplication (in seconds). Synthetic workload, 16 processors, 1 daemon, blocking duration = 0.

  Aging=1 | 14.2
  Aging=2 | 14.0
  Aging=3 | 14.1
In the 2 kilobyte results, however, the responsiveness factor appears more important. Note that the 2 kilobyte results are the only case where daemon migration results in a lower completion time than no migration.

Table 2 shows the results of replacing the Usage field by either a Referenced bit (equivalent to aging with one bit) or aging using the last two or last three Referenced bits. The times in all three cases are approximately 7% more than that of a Usage field.

In Table 3 we repeat some of the runs but only with four processors. Here migration is much more successful, probably because the daemon can react more rapidly since there are fewer page tables to examine during its pass.

Table 3: Vector Matrix Multiplication (in seconds). Synthetic workload, 4 processors, Usage field.

  Matrix/Page Size | 1D(B50) | 2D(B0)
                   |   8.8   |   9.3
                   |  47.7   |  48.9
                   |  81.0   |  85.9

In summary, either a single Referenced bit or aging appears to be a good approximation of a Usage field. Page size significantly influences daemon performance. Larger page sizes are preferable, as expected, with advantageous placement. In most cases the migration daemon did not improve completion times substantially. However, migration did not degrade performance and in some cases it did help. Multiple daemons are attractive when reducing the pass time is more important than the increased lock contention.

5.2 Vector Matrix Multiplication: Sparse

For the sparse matrix test suite we looked at a single parameter setting. We used 16 user processors, a Usage field, 1 daemon with blocking duration 0, and a reference count of one. Since the matrices are much smaller we used much smaller pages: 64 bytes. Since matrix 021 has somewhat larger storage demands we also consider a page size of 256 bytes for it. As indicated in Table 4, daemon migration never significantly improved performance; the None and Daemon times are comparable. That performance is not surprising since the reference count is one.

Table 4: Vector Matrix Multiplication (in seconds). Sparse Matrix suite, 16 processors, 64 byte pages except for M021(256), Usage field, 1 daemon, blocking duration = 0, reference count is 1.

  Type   | M003 | M009 | M014 | M021  | M021(256)
  None   | 1.09 | 2.19 | 2.74 | 11.41 | 11.41
  Daemon | 1.39 | 2.18 | 2.70 | 13.72 | 14.63

5.3 Jacobi's Method

In the Jacobi's method workload we used a 2 megabyte address space containing two copies of a matrix with 512 rows and 512 columns (each entry is a four-byte floating point number). We used 17 processors as user processes in the application (one of these did only convergence checking) and one or two processors as migration daemons. Each matrix copy was partitioned by a 4x4 grid with each processor responsible for one partition of 128 rows by 128 columns. We measured the completion time for each processor with a user process and display below the average (again the timings for processors 0 and 1 were discarded).

For page sizes we considered 512 bytes, 1 kilobyte, and 2 kilobytes. The length of the row segment in a grid partition is 512 bytes. Thus, for the larger two page sizes an issue is how the page is placed. A horizontal placement means that the next one or more segments of the current row (which are in other grid partitions) are mapped onto the page. A vertical placement means that the next one or more segments in this grid partition (which are in the next one or more rows) are mapped onto the page. For each page size case we considered the same experiments.

Table 5: Jacobi's Method (in seconds). 16 processors, Usage field. V is vertical, H is horizontal.

  1K(H) | 346.2 | 329.9 | 320.6 | 317.2 | 328.0
  2K(V) | 146.4 | 179.8 | 162.7 | 147.1 | 178.0
  2K(H) | 345.2 | 347.7 | 350.3 | 347.5 | 349.5