NASA Contractor Report 194936
ICASE Report No. 94-50

PARALLELIZED DIRECT EXECUTION SIMULATION OF MESSAGE-PASSING PARALLEL PROGRAMS

Phillip M. Dickens
Philip Heidelberger
David M. Nicol

Contract NAS1-19480
June 1994

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association
Parallelized Direct Execution Simulation of Message-Passing Parallel Programs*

Phillip M. Dickens
ICASE
NASA Langley Research Center
Hampton, VA 23681

Philip Heidelberger†
IBM T.J. Watson Research Center
P.O. Box 704
Yorktown Heights, NY 10598

David M. Nicol‡
Department of Computer Science
The College of William and Mary
Williamsburg, VA 23185
Abstract

As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), that we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
*Research supported by NASA contract number NAS1-19480, while the authors were in residence at the Institute for Computer Applications in Science & Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681.
†This work was performed while this author was on sabbatical at ICASE.
‡This work was performed while this author was on sabbatical at ICASE. It is also supported in part by NSF grant CCR-9201195.
1 Introduction
Performance prediction of parallel programs is an important and growing research area, especially as massively parallel computers move into the high performance computing arena. Interest in performance prediction arises in diverse contexts. Writers of parallel applications are interested in it as an aid towards generating efficient, highly parallel codes; users may be interested in predicting the performance of their codes for large numbers of processors, or on large problem sizes, when the resources needed to measure that performance directly are only infrequently available. Designers of new parallel computers would like to be able to predict how well an as-yet-unbuilt machine will perform under realistic workloads, in order to evaluate candidate designs. Developers of new communication networks and of parallelizing compilers face similar problems. General tools for performance prediction of parallel programs are therefore of wide interest. Direct measurement is not always a solution: instrumentation of a code may perturb its temporal behavior, and the machine or configuration of interest may simply not be available.

Direct execution simulation [4, 5, 9, 11, 16] offers a solution for accurate performance prediction. Under this approach, the application code executes directly on a host machine, while all references to system activity of interest—calls such as "what is the wallclock time now", or, "is there a message of type T available"—are trapped and serviced by a discrete-event simulator of the presumed virtual machine. Because the application itself is executed, one avoids the problem of capturing the behavior of a large code in a workload model; because the system activity is simulated, one can predict performance on configurations that do not physically exist.

A detailed direct execution simulation is computationally expensive, however, and one easily envisions situations where the simulator is a bottleneck: the application may be designed for many processors while the simulation executes on one. In situations where the simulator is a bottleneck, parallelization of the simulator itself offers a solution. This paper considers the problem of parallelizing a direct execution simulation of message-passing programs. The synchronization algorithm we describe is applicable to general message-passing systems, though some details of our implementation are specific to the Intel Paragon.

We describe these methods in the context of a tool named LAPSE (Large Application Parallel Simulation Environment), which we have built on the Intel Paragon. LAPSE accepts as input the application code, written in C, or Fortran, or a mixture of the two; only a "makefile" describing how to build the application; and a LAPSE file describing the size and characteristics of the virtual machine to be simulated. LAPSE automatically instruments the application code and links it with the simulator. From the point of view of the application, it is running on the virtual machine: all references to the machine's size, to timing, and to message-passing are trapped and serviced by the simulator, which assigns virtual times to the application's activity. From the point of view of the simulator, the application processes are a source of computation and communication demands. The simulator itself executes in parallel, with the processes of a large virtual machine mapped onto a pool of fewer physical processors.

This paper describes the design of LAPSE and its synchronization algorithm, and reports on LAPSE's accuracy and performance, measured
on four parallel applications: two linear system solvers (one indirect, one direct), a continuous time Markov chain simulator, and a graphics library driver. Measuring speedups (relative to simulations using only one processor) and slowdowns (relative to natively executing code) we observe a wide range of performance characteristics, depending in large part on the nature of the code being simulated. LAPSE has proven to accurately predict performance, typically within 10%. While we have simulated as many as 512 virtual processors using only 64 actual processors, the main limitation we've encountered is limited physical memory. This problem is specific to the Paragon, and should not be an issue on parallel architectures that better support virtual memory.

Several other projects use direct execution simulation of multiprocessor systems. Among these we find two pertinent characteristics, (i) the type of network being simulated, and (ii) whether the simulation is itself parallelized. Table 1 uses these attributes to categorize relevant existing work, and LAPSE.
Tool        Network simulated                         Simulator
HASE[12]    cache-coherent shared memory              parallel
LAPSE       message-passing network                   parallel
MaxPar[3]   shared memory (no cacheing)               serial
Proteus[2]  cache-coherent shared memory,
            message-passing network                   serial
RPPT[4]     message-passing network                   serial
SimOn[9]    message-passing network                   serial
Tango[7]    cache-coherent shared memory              serial
WWT[24]     cache-coherent shared memory              parallel

Table 1: Direct Execution Simulation Tools.

Most existing direct execution simulators are serial. The Wisconsin Wind Tunnel (WWT) [24] is, to our knowledge, the only working tool that uses a multiprocessor (the CM-5) to accelerate the simulation. (HASE is intended to run in parallel on top of an optimistic simulator, but was not operational at the time [12] was published.)

A facet of the problem that most of these tools must deal with is context-switching among the virtual processors hosted on one physical processor. Most tools available to us use their own light-weight thread packages, with their own mechanisms for scheduling, to give each virtual processor the appearance of an independent address space; such mechanisms have been observed to improve performance by an order of magnitude (for small grain sizes) over ordinary operating system processes. LAPSE is implemented on the Intel Paragon[13], whose OSF-1 operating system does not support the necessary threads, nor are we able (in the context of a government lab) to modify the operating system kernel. LAPSE therefore uses ordinary operating system processes, and is by necessity subject to the operating system's costs for context switching, which affects its performance considerably. Context-switching overhead is thus a key performance concern for us.

The differences between LAPSE and the WWT deserve closer consideration. The first is a matter of purpose. The WWT's intent is to use a commercially available machine to evaluate cache-coherency protocols under realistic memory reference behavior; the WWT ignores network contention. LAPSE's goal is to support performance prediction and scalability analysis of message-passing application codes, with the communication network an explicit part of the model.

The second difference is a matter of lookahead. In a cache-coherent shared memory system, a processor may interact with another processor at any time: it may suffer a miss in the simulated cache on any memory reference, or may have its cache affected by a write issued by another processor. The elements of the simulation are tightly coupled, and the WWT deals with this by keeping all processors in close synchrony: by assuming that any network message requires at least B cycles, every processor can be simulated for B cycles between global barrier synchronizations without missing the effect of a remote event. (This is a special case of the conservative YAWNS synchronization protocol [19, 22].) The lookahead exploited by the WWT is fixed, and synchronization occurs every B cycles regardless of application behavior. By contrast, in a message-passing system a processor can influence another only by sending it a message; communication occurs only at explicit send or receive calls, and applications commonly alternate between long periods of purely local computation and communication-intensive phases. Because the elements of the simulation are less tightly coupled, the communication a processor might generate can sometimes be anticipated well in advance of its occurrence. LAPSE exploits this observation: where the application computes for long periods with no communication, large portions of its execution can be simulated with the assurance that no remote event will affect them, and the simulator can avoid synchronization for long stretches of simulated time. This lookahead is not fixed—it is governed by the dynamic behavior of the application—and a communication-intensive code can still force tight synchronization. Finally, the WWT cleverly exploits idiosyncrasies of the CM-5 to synchronize cheaply, whereas LAPSE runs purely as an application on the Paragon, with no special hardware support assumed.

Parallelized simulations of multiprocessor communication networks, using both conservative and optimistic (Time Warp) methods, have been proposed before; LAPSE differs in that its simulation is driven directly by executing application programs. The primary contributions of LAPSE are to show how to effectively parallelize direct execution simulation of message-passing programs, and to demonstrate the feasibility and performance of the method on non-trivial numerical codes. This combination of features—direct execution, parallelized simulation, and message-passing application codes—makes LAPSE unique among its peers.
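The fixed-quantum scheme attributed to the WWT above can be made concrete with a toy calculation; the function name and the driver are ours, not WWT's.

```c
#include <assert.h>

/* Quantum synchronization in the style described above: if every network
   message takes at least B cycles, processors may be simulated
   independently for B cycles between global barriers. This toy counts the
   barrier episodes needed to cover a run of `cycles` simulated cycles. */
static long barrier_episodes(long cycles, long B) {
    return (cycles + B - 1) / B;   /* ceiling division */
}
```

The count is independent of the application's communication behavior, which is precisely the property LAPSE's variable lookahead is designed to improve upon.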
Section 2 gives an overview of the LAPSE system. Section 3 describes how LAPSE transforms a massively parallel code into a direct execution simulation, and Section 4 details our synchronization strategy. Section 5 describes our experiments and their results with respect to validation, slowdown, and speedup, while Section 6 presents our conclusions.
2 An Overview
A parallel program for a distributed memory machine is comprised of N application processes, distributed among n ≤ N processors. We presently assume that n = N, so that there is an equivalence between application processes and processors of the virtual machine, and that the mapping is static. Application processes communicate through calls to system message-passing routines; we assume the program is constructed using explicit message passing, such as that provided by the Intel nx library. For example, an application sends a message using the csend call. Its arguments are the message type (a user-defined integer), the base address and length of the memory area occupied by the message, and the id of the recipient processor. A message is received with a call to crecv, whose arguments specify the anticipated message type and the base address and maximum length of a buffer into which the message is to be copied; control is not returned to the application until the message has been received. By contrast, irecv is an asynchronous receive: control returns to the application immediately, and once the message is available the system will transfer it directly into the specified user buffer. The application may call msgdone to ask whether the message has arrived, or msgwait to block until it has.
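As an illustration of the type-matched send/receive semantics just described, here is a toy, single-process mailbox. Unlike the real blocking crecv, this stand-in simply returns -1 when no matching message is present; all names are ours, not the nx library's.

```c
#include <assert.h>
#include <string.h>

/* Toy mailbox mimicking csend/crecv type matching. Messages carry a
   user-defined integer type; a receive names a type and a buffer.
   Purely illustrative: single process, no blocking, no network. */
#define MAXMSG 8
typedef struct { int type; char data[64]; int len; int used; } slot;
static slot box[MAXMSG];

static void toy_csend(int type, const void *buf, int len) {
    for (int i = 0; i < MAXMSG; i++)
        if (!box[i].used) {
            box[i].type = type; box[i].len = len; box[i].used = 1;
            memcpy(box[i].data, buf, (size_t)len);
            return;
        }
}

/* returns the message length, or -1 if no message of that type is present
   (the real crecv would block instead of returning -1) */
static int toy_crecv(int type, void *buf, int maxlen) {
    for (int i = 0; i < MAXMSG; i++)
        if (box[i].used && box[i].type == type && box[i].len <= maxlen) {
            memcpy(buf, box[i].data, (size_t)box[i].len);
            box[i].used = 0;
            return box[i].len;
        }
    return -1;
}
```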
Figure 1(a) illustrates how an application views a message transaction: a message is sent from process 1 to process 2, process 1 continues executing, and process 2 eventually receives the message. Figure 1(b) illustrates the system activity hidden from the application. At time a, control passes from application process 1 to the operating system, which prepares the message for transmission, copying it into a communications buffer; the message is given to the network at time b, and process 1 continues executing, not interrupted again by this transaction. When the message arrives at process 2's processor, the operating system gains control to service the interrupt, and an additional overhead is incurred; process 2, which was in the middle of an execution burst, is suspended while the interrupt is handled. Process 2 then resumes execution (at time c), and upon reaching its receive statement calls the operating system again (at time d), which copies the message into the user's buffer before control is returned to the application. Had the receive been posted before the message arrived (e.g., by an earlier irecv), the message would have moved directly into the user's buffer as soon as it was received.

The application processes are unaware of these overheads: the time the system spends preparing a message, handling an interrupt, or copying a message to user space is invisible to the application. If we can assume that such durations (e.g., a - 0, c - b, and e - d) are determined by the operating system and message-passing software rather than by the user, and are independent of other network activity, then we can exploit the assumption: these durations are analogous to service times in a queueing network, and give us the information needed to predict the overheads of an execution on the virtual machine. It falls to the simulator to assign virtual times a, b, c, and so on, as a function of the execution durations reported by the application processes, the assumed system overheads, and a model of how message-passing activity affects the timing.
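A minimal sketch of virtual-time assignment under the independence assumption above; the constants and structure names are illustrative assumptions, not LAPSE's actual parameters.

```c
#include <assert.h>

/* Hypothetical fixed overheads (virtual time units), standing in for the
   measured service-time durations discussed above. */
#define SEND_OVERHEAD   5   /* OS prepares the message (duration a - 0)  */
#define RECV_INTERRUPT  3   /* interrupt handling on arrival (c - b)     */
#define COPY_OVERHEAD   2   /* copy into the user buffer (e - d)         */
#define NET_LATENCY     10  /* free-network transit time                 */

/* Virtual times for one transaction, given the sender's virtual time at
   the csend and the time the receiver is ready to accept the message. */
typedef struct { long a, b, c, d, e; } transaction_times;

static transaction_times assign_times(long send_call_time, long recv_ready_time) {
    transaction_times tt;
    tt.a = send_call_time;                       /* control enters the OS   */
    tt.b = tt.a + SEND_OVERHEAD + NET_LATENCY;   /* message reaches target  */
    tt.c = tt.b + RECV_INTERRUPT;                /* interrupt handled       */
    /* copying cannot start before the receiver has posted its receive */
    tt.d = (tt.c > recv_ready_time) ? tt.c : recv_ready_time;
    tt.e = tt.d + COPY_OVERHEAD;                 /* message in user buffer  */
    return tt;
}
```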
Figure 1: Application view (a) and system view (b) of message-passing activity between two application processes.
Figure 2 gives an overview of LAPSE's structure. Application code is recompiled with macros that redirect all calls to message-passing, timing, and operating system routines into calls to corresponding interface routines; the set of all such routines is known as the LAPSE interface, and the application is linked with it. An application call to send a message is trapped by an interface routine, which sends the message—typically to a matching receive routine in another application process—and also communicates with the simulator, submitting an event describing the message-passing activity. The simulator is essentially a discrete-event simulation built on future-events lists, although it is conceptually separated into an application simulator and a network simulator. For each application process, the application simulator maintains a data structure called a time-line: a version of Figure 1(b) reflecting LAPSE's view of how that process's execution, message-passing, and interrupt handling activity progresses. As the simulation progresses, events on a time-line are assigned time-stamps reflecting the virtual times at which they occur. The application simulator resides in the same address space as the set of application processes it manages, and communicates with a network simulator that models the communication network and assigns network delays. An event on one time-line may trigger activity on another—for instance, a message becoming resident at its destination may unblock the receiving process—and, depending on the simulation, a single network event may affect the time-lines of many application processes.

Figure 2: Overview of LAPSE's structure: application processes, application simulators, and network simulators.
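A time-line of the sort described above can be sketched as a list of events kept sorted by time-stamp; the types and names below are ours, not LAPSE's.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal time-line sketch: a VP's future events in time-stamp order. */
typedef enum { EV_SEND, EV_RECV, EV_EXEC } ev_type;
typedef struct event { long ts; ev_type type; struct event *next; } event;

/* Insert an event, keeping the list sorted by time-stamp. */
static event *timeline_insert(event *head, long ts, ev_type type) {
    event *e = malloc(sizeof *e);
    if (!e) return head;             /* toy: ignore allocation failure */
    e->ts = ts; e->type = type;
    if (!head || ts < head->ts) { e->next = head; return e; }
    event *p = head;
    while (p->next && p->next->ts <= ts) p = p->next;
    e->next = p->next;
    p->next = e;
    return head;
}
```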
The simulators must synchronize with one another, for the activity of one may affect another: an application send event generates work for a network simulator, and the network simulator determines when the message arrives at the application simulator of its destination. The only interaction between one application simulator and another is through the network simulators. Synchronization is managed through appointments and windows [21].

An appointment made by one simulator to another is a lower bound on the time-stamp of any event the first may ever schedule for the second. For example, when an application simulator reports a remote send, the network simulator transforms it into an appointment for the destination: a promise about the earliest time at which the associated arrival event can be scheduled. Appointments are initially computed at startup; as the simulation progresses, each appointment either disappears (when the associated event is finally scheduled) or strictly increases. The appointments held by a simulator define its halting time h': the minimum of its incoming appointment times, and hence a bound past which it cannot safely simulate, because an event with a smaller time-stamp may yet arrive.

Simulation advances through a sequence of windows. Given the current lower window edge t, the value w(t) is computed—as the least halting time among all simulators—to ensure that every message send event whose time-stamp falls in [t, w(t)) exists already on its application's time-line. Every event whose time-stamp falls in [t, w(t)) is then simulated; the lower edge becomes t' = w(t), the window [t', w(t')) is simulated, and this process continues until the simulation terminates. The protocol will not deadlock: if a simulator's halting time is h' < w(t), the simulator responsible for the constraining appointment must eventually either schedule the associated event or increase the appointment, so every halting time, and with it the window edge, eventually increases.
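One step of the window protocol above can be sketched as follows; the array of pending time-stamps stands in for the simulator's event machinery, and all names are illustrative.

```c
#include <assert.h>

/* One window step: given lower edge t and an upper edge w computed from
   appointments, count the events in [t, w) (a stand-in for simulating
   them) and return the new lower edge t' = w(t). */
static long advance_window(long t, long w, const long *pending, int n,
                           int *nprocessed) {
    int cnt = 0;
    for (int i = 0; i < n; i++)
        if (pending[i] >= t && pending[i] < w)
            cnt++;               /* safe to simulate without synchronization */
    *nprocessed = cnt;
    return w;                    /* becomes the next window's lower edge */
}
```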
We now elaborate in two steps. First we discuss how appointments are computed, how they are used, and when they are updated; second we discuss the construction of windows.

4.1 Appointments
LAPSE exploits the observation that the execution paths of message-passing codes tend to be largely insensitive to timing: one can execute the application processes as on-line trace generators, and let the simulator (as a whole) establish the timing. A virtual processor's (VP's) time-line therefore records a potentially long list of future events, and that list is the raw material from which appointments are computed. In this subsection we describe the two types of appointments: those passed from application simulators to network simulators, and those passed between network simulators.

Consider Figure 3, illustrating a time-line and a situation where we wish to compute a lower bound on the instant at which the VP will next schedule an event on the network simulator—an appointment. The time-line has advanced to time t, within an execution burst of length A that began at time s; beyond it lie further execution bursts with lengths B, C, and D, separated by message-passing calls (a crecv, a local csend, another crecv, and finally a remote csend). The earliest the remote send can be scheduled is the residual of the current burst, plus all intervening execution, plus lower bounds on the startup costs of the intervening calls:

    Appointment = (A - (t - s)) + B + C + D + 2(crecv startup) + 2(csend startup).

This is a lower bound because each call may cost more than its startup—a crecv, for example, may block.

Figure 3: Appointment computation from a VP's time-line.

Appointments must be updated as the simulation advances, and this happens in two cases. First, a VP may become suspended, for example at the point of a crecv whose message has not yet arrived. The period of suspension is not included in the appointment as computed above; if simulation advances from t to t' with the VP suspended throughout, then every outgoing appointment of that VP must be increased by d = t' - t. Rather than update on every advance, the simulator periodically schedules a "VP update" event, amortizing the update cost. Secondly, when a suspended VP is released from suspension—usually by the arrival of a message—the additional cost of receiving that message (for example, the cost of copying it into the user's buffer) is not always included in the appointment as computed, and the appointments of the affected VP may increase accordingly.
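The appointment formula above can be computed directly from the burst lengths on a time-line. This sketch uses a single lower bound for all call startups, and its names are ours, not LAPSE's. Note that within one burst the absolute appointment time stays constant as t advances, since the residual shrinks by exactly the amount t grows.

```c
#include <assert.h>

/* Appointment (as an absolute virtual time): residual of the current
   execution burst, plus the remaining burst lengths, plus a lower bound
   on the startup cost of each intervening message-passing call.
   bursts[0] is the current burst (length A), started at time s. */
static long appointment(long t, long s, const long bursts[], int nbursts,
                        int ncalls, long startup_lb) {
    long sum = bursts[0] - (t - s);          /* residual of burst A      */
    for (int i = 1; i < nbursts; i++)
        sum += bursts[i];                    /* B + C + D + ...          */
    return t + sum + (long)ncalls * startup_lb;
}
```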
Individual network simulators synchronize using network appointments. A network appointment from network simulator i to network simulator j, for message m, is a promise that the event reporting m's arrival at j will not be scheduled before the appointment time. Given a send at time t', i computes the appointment using the free-network latency of the message (which depends on the distance the message travels), and passes it to j; upon receipt, j enters it into its event list as a new or updated appointment. Once the message has been completely received, no further updating of that appointment is needed. Because LAPSE presently uses a contention-free network model, in which transmission time depends only on the message's volume and the distance traveled, no further updating of network appointments is ever required. Currently every network simulator receives an appointment for every message (this is overkill, and we are developing a version that very much reduces the sends associated with appointments); a more sophisticated network model, one that captures contention, would also require updating appointments as the network state evolves.
4.2 Window Construction

The window computation depends on the state of every VP, so we first describe the ways in which an application process changes state. An application process executes until it makes a call to an interface routine; depending on the call and the state of the simulation, the interface routine responds immediately, or the process becomes blocked. A process blocks in two cases. First, the call may be temporally sensitive, such as a crecv: the response depends on virtual time (for instance, on which message will be received), and the process must wait until the simulation has advanced far enough for the response to be determined. Secondly, even a temporally insensitive process may be blocked temporarily for flow-control reasons: the application processes can run ahead of the simulator, and to bound the number of buffered events (and the demand on the Paragon's message buffers), a process is blocked once it runs ahead of the simulation time by more than a user-specified parameter. A blocked process becomes unblocked when the simulator is able to respond to the call it made, or when the flow-control condition clears.

Each simulator maintains a single halting time covering both its application and network components, and may execute any event with a time-stamp less than the upper edge w(t) of the current window. Eventually all such events are executed, and the simulator must wait until the upper edge of the next window is computed. That computation is a global reduction, whose cost is amortized over all the events executed within a window. Each simulator contributes a 3-tuple, over which a global component-wise minimum is computed. The first element L is a lower bound on the time-stamp of the last time-line event of any unsuspended VP; the second element S is a lower bound on the time-stamp of the last time-line event of a suspended
VP, computed under the assumption that it will be released from suspension immediately. The third component R is a lower bound on the time at which a suspended VP will be released. R is itself a minimum
of three values, among them the minimum appointment offered on incoming network channels, and the minimum time at which a message send event already on a time-line (this includes local sends) can release a suspended VP. Given the global minima Lmin, Smin, and Rmin, each simulator computes

    w(t) = B + min{Lmin, Rmin + Smin},

where B is a lower bound on the startup cost of a send event. This value ensures that every remote send event with a time-stamp in [t, w(t)) is already resident on a time-line at the point the window is constructed: an unsuspended VP cannot schedule a new send before Lmin + B, and a suspended VP cannot be released before Rmin, nor schedule a new send before Rmin + Smin + B. By construction, all appointments needed within the window have already been established, we are assured the simulators will not deadlock, and no further synchronization is required while the window [t, w(t)) is simulated.

The effect of temporally sensitive calls on window construction deserves remark. A temporally sensitive call will always be the last event on its VP's time-line, because the execution path beyond it is unknown while the VP is blocked; the values of L and R offered by its simulator are correspondingly small. In extreme cases, where most calls are temporally sensitive, the net effect is to serialize the simulation: the difference between w(t) and t will be small, and each window will contain few events. Fortunately, in the codes we have examined, such calls tend not to occur simultaneously on all VPs, and the presence of other communication events farther along the time-lines keeps the windows from collapsing.
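The reduction just described can be sketched as follows, with each simulator contributing an (L, S, R) offer; the names and the driver are illustrative, not LAPSE's.

```c
#include <assert.h>

/* Global reduction for the window edge: each simulator offers (L, S, R);
   the edge is w(t) = B + min(Lmin, Rmin + Smin), with the minima taken
   component-wise over all offers. */
typedef struct { long L, S, R; } offer;

static long window_edge(const offer *offers, int n, long B) {
    long Lmin = offers[0].L, Smin = offers[0].S, Rmin = offers[0].R;
    for (int i = 1; i < n; i++) {
        if (offers[i].L < Lmin) Lmin = offers[i].L;
        if (offers[i].S < Smin) Smin = offers[i].S;
        if (offers[i].R < Rmin) Rmin = offers[i].R;
    }
    long m = (Lmin < Rmin + Smin) ? Lmin : (Rmin + Smin);
    return B + m;
}
```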
5 Experiments

In this section we report on experiments that quantify the accuracy and performance of LAPSE. After describing the set of four scientific applications we used for experimentation, we first address the issue of validation, i.e., quantifying the accuracy of LAPSE timing predictions, and then characterize LAPSE overheads. The slowdown of LAPSE is defined to be the time it takes LAPSE to execute an application with N virtual processors on N physical processors, divided by the time it takes the application to execute natively on N physical processors. The relative speedup of LAPSE is defined to be the time it takes LAPSE to execute an application with N virtual processors on one physical processor, divided by the time it takes LAPSE to execute the same problem on n physical processors. We believe relative speedup to be representative of the absolute speedup that would be measured against a hypothetical optimized serial simulator. The justification for this belief is that, when running on one processor, LAPSE avoids most unnecessary parallelization overheads; essentially the only overhead it does not avoid is the maintenance of the appointment and time-line data structures, which would have to be done on any serial direct execution simulator as well, and which is small compared with the cost of executing the application and the simulation itself.
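The two metrics defined above are simple ratios of measured execution times; a sketch (the names are ours):

```c
#include <assert.h>

/* Slowdown: LAPSE with N VPs on N physical processors vs. native on N. */
static double slowdown(double lapse_N_on_N, double native_on_N) {
    return lapse_N_on_N / native_on_N;      /* >= 1 in practice */
}

/* Relative speedup: LAPSE with N VPs on one processor vs. on n. */
static double rel_speedup(double lapse_N_on_1, double lapse_N_on_n) {
    return lapse_N_on_1 / lapse_N_on_n;
}
```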
In fact, for one of our applications (SOR with the high computation-to-communication ratio, and with instruction counting), LAPSE executing a four-VP problem was only 1.8 times slower than the native
applicationon four processors. We will by the amount
of lookahead
to communication timing
ratio.
simulator
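The two measures just defined reduce to simple ratios of wall-clock times. A minimal sketch (the function names and all timings are hypothetical; 18.0/10.0 = 1.8 is chosen only to match the SOR figure quoted above):

```python
# Relative slowdown and relative speedup as defined in the text.
# All timings below are hypothetical illustrations, not measurements.

def relative_slowdown(t_lapse_N, t_native_N):
    """LAPSE time on N processors divided by native time on N processors."""
    return t_lapse_N / t_native_N

def relative_speedup(t_lapse_1, t_lapse_n):
    """LAPSE time on one processor divided by LAPSE time on n processors."""
    return t_lapse_1 / t_lapse_n

native_4 = 10.0   # hypothetical: native application on 4 processors (seconds)
lapse_4 = 18.0    # hypothetical: LAPSE, 4 VP problem on 4 processors
lapse_1 = 60.0    # hypothetical: LAPSE, same problem on 1 processor

print(relative_slowdown(lapse_4, native_4))  # 1.8
print(relative_speedup(lapse_1, lapse_4))
```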
5.1 Applications
We experimented with a set of four scientific and engineering codes. These applications are representative of a variety of computational workloads that arise in physics, fluid dynamics, parallel systems modeling, and graphics, although we make no claim that they comprehensively span all possible workloads.

The first application, SOR, solves Poisson's equation in two dimensions using successive over-relaxation (see Chapter 17 in [23]). The partial differential equation (PDE) is discretized on a K x K grid, resulting in a system of K^2 linear equations; the value x(i,j) at (an interior) grid point (i,j) is updated using x(i-1,j), x(i+1,j), x(i,j-1), and x(i,j+1), with the grid points swept in odd-even ordering. If there are N processors arranged in a square grid, each processor is assigned a sub-grid of size G x G, where G = K/sqrt(N). Each iteration requires a simultaneous exchange with the NEWS (North East West South) neighbors in the processor grid. The number of instructions executed per iteration is of order G^2, while the size of each pairwise exchange is of order G bytes; thus by varying G we can adjust the computation to communication ratio of the application. We call this ratio "very low" when G = 25, "low" when G = 50, and "high" when G = 250. The message passing calls are all synchronous (csend), the message passing path is timing independent, and the application thereby has good lookahead.
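The structure of one relaxation sweep on a single processor's sub-grid can be sketched as follows. The halo arrays stand in for the values obtained in the NEWS exchange; the function name, the relaxation parameter omega, and the 4 x 4 grid are illustrative assumptions, and the O(G^2)-compute versus O(G)-exchange balance discussed above is visible in the loop bounds.

```python
# One odd-even relaxation half-sweep on a G-by-G sub-grid.
# north/south/west/east each carry only G halo values (the NEWS exchange),
# while the doubly nested loop does O(G*G) work.

def sor_sweep(x, north, south, west, east, omega=1.0, parity=0):
    """Update points whose (i+j) parity matches `parity` (odd-even ordering).
    x is a G-by-G list of lists; returns the number of points updated."""
    G = len(x)
    def val(i, j):
        if i < 0:
            return north[j]
        if i >= G:
            return south[j]
        if j < 0:
            return west[i]
        if j >= G:
            return east[i]
        return x[i][j]
    updated = 0
    for i in range(G):
        for j in range(G):
            if (i + j) % 2 != parity:
                continue
            gs = 0.25 * (val(i-1, j) + val(i+1, j) + val(i, j-1) + val(i, j+1))
            x[i][j] = (1 - omega) * x[i][j] + omega * gs
            updated += 1
    return updated

G = 4
x = [[0.0] * G for _ in range(G)]
ones = [1.0] * G
n = sor_sweep(x, ones, ones, ones, ones, parity=0)
print(n)  # half of the G*G points are updated per half-sweep
```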
The second application, BPS, is a domain decomposition solver for finite difference or finite element discretizations of two-dimensional elliptic partial differential equations, a type of problem that arises frequently in computational fluid dynamics. The method is of Bramble-Pasciak-Schatz substructuring type [14], and is highly parallelizable. The domain of the PDE is partitioned into nonoverlapping subdomains, which are separated by a "wire basket" consisting of edges and vertices. Once values for the unknowns on the wire basket are specified, independent problems may be solved on each subdomain interior. The algorithm is a preconditioned conjugate gradient iteration applied to a sparse matrix operator defined on the wire basket, where the preconditioner is an approximation to the inverse of that operator. Communication arises in the exchange of values on each edge at each iteration, in matrix-vector products, and in the formation of inner products, which require a global reduction at each iteration; the solution on the subdomain interiors is formed only in the final iteration. The objective is to arrive at sufficiently accurate values for the unknowns on the edges and vertices, from which the global solution of the boundary value problem (in our case, defined on a rectangle) is obtained. The specific test case is Poisson's equation, and advantage is taken of the existence of fast Poisson solvers for the subdomain interiors. The code uses synchronous calls such as csend, cprobe, and crecv, as well as global reductions, and thus exhibits good application lookahead. The code runs on numbers of processors that are powers of 4, and two levels of computation to communication ratio were obtained by varying the size of the G x G subdomain assigned to each processor: the "high" ratio has G = 256, while the "low" ratio has G = 64. The code, written in a combination of C and Fortran at ICASE and Old Dominion University, was provided to us by David Keyes.
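The conjugate gradient skeleton below marks the two inner products that become global reductions in a parallel implementation such as BPS. The 2 x 2 system, the iteration limit, and the tolerance are illustrative assumptions, and no preconditioner is shown.

```python
# Conjugate gradient skeleton. The two dot products per iteration are the
# operations that become global sum-reductions in a parallel code, and the
# matrix-vector product is where nearest-neighbor exchange would occur.

def dot(u, v):
    # In a parallel code: local partial sum followed by a global reduction.
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    # In a parallel code: requires nearest-neighbor data exchange.
    return [dot(row, v) for row in A]

def cg(A, b, iters=25):
    x = [0.0] * len(b)
    r = b[:]                      # residual for the zero initial guess
    p = r[:]
    rr = dot(r, r)
    for _ in range(iters):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)   # reduction #1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = dot(r, r)        # reduction #2
        if rr_new < 1e-20:
            break
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]      # small symmetric positive definite example
b = [1.0, 2.0]
x = cg(A, b)
print(x)  # close to the exact solution [1/11, 7/11]
```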
The third application, APUCS, is a parallel discrete event simulator of a large continuous time Markov Chain model of a queueing network, using the algorithm described in [18]. The specific parameter settings were those of "Set I" of [18], with the "high" computation to communication ratio achieved by setting mu_1 = 1 and the "low" ratio by setting mu_1 = 64. This application has highly irregular any-to-any pairwise message passing patterns, involving asynchronous calls (csend, irecv, and msgwaits) as well as global reductions. The code, written in C, was developed by David Nicol and Ion Stoica.
The fourth application, PGL, is a parallel graphics library and display driver [6] supporting the visualization of scientific data. The sample program generates F frames of a scene consisting of randomly placed triangles, which are rotated, shaded, and colored for each frame. Scan lines of the image are distributed over the processors in an interleaved fashion. The endpoints of the intersection of a triangle with a scan line, called a span, are computed by interpolating between the triangle's vertices, and each span is then sent to the processor responsible for that scan line. For each scan line, the pixels between the endpoints of a span are colored by interpolating between the endpoint values, and a Z-buffer algorithm is used for hidden surface removal. This process is repeated for each of the F frames. The code uses any-to-any asynchronous pairwise communications (csend, irecv, and msgdone, the last being a query about the status of a particular message), as well as global reductions. The message passing path is timing dependent, making this code temporally sensitive, and the application has very little lookahead. Two workloads were obtained by assigning approximately 1000 and 500 triangles per processor; however, it is harder to adjust the computation to communication ratio of this code, and the designations "high" and "low" are used mainly to distinguish between the two workloads. The code, consisting of approximately 14,000 lines of C, was written by Thomas Crockett of ICASE.
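The span-and-Z-buffer step described above can be sketched for a single scan line as follows; the span tuple format, the scene values, and the eight-pixel line are illustrative assumptions, not PGL data structures.

```python
# Hidden-surface removal for one scan line: each span carries interpolated
# depth and a color across a pixel range, and the Z-buffer keeps the
# nearest value seen so far at each pixel.

WIDTH = 8
INF = float("inf")

def draw_spans(spans):
    zbuf = [INF] * WIDTH
    color = [None] * WIDTH
    for x0, x1, z0, z1, c in spans:          # span covers pixels x0..x1
        for px in range(x0, x1 + 1):
            t = 0.0 if x1 == x0 else (px - x0) / (x1 - x0)
            z = z0 + t * (z1 - z0)           # interpolate depth along span
            if z < zbuf[px]:                 # nearer than the stored value?
                zbuf[px] = z
                color[px] = c
    return color

# Two overlapping spans on one scan line; "red" recedes to the right while
# "blue" sits at constant depth 2, so blue wins wherever they overlap.
spans = [(0, 5, 1.0, 6.0, "red"), (3, 7, 2.0, 2.0, "blue")]
line = draw_spans(spans)
print(line)
```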
5.2 Validation
In this section we describe the process by which we validated LAPSE predictions, and the results of our validation experiments. In order to make predictions, LAPSE must know the overheads of the message passing system calls (e.g., csend) as well as the message transit times of the interconnection network. Overheads for calls that depend on message length were modeled as a + b x L, where L is the message length and a and b are constants estimated by regression; timing test programs were written to estimate these overheads by measuring the time between system calls over a range of message lengths, and message transit times were estimated similarly. On the Paragon, the software overheads are far greater than the hardware transit times except for very long messages, so reasonable predictions can be expected from our simple network model, even though it does not model link contention in the interconnection network.
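Fitting the a + b x L overhead model is an ordinary least-squares problem in one variable. A sketch (the sample overheads below are synthetic, not Paragon measurements):

```python
# Least-squares fit of the overhead model time = a + b*L from
# (length, time) samples. The timings here are synthetic: a fixed cost of
# 50 time units plus 0.01 units per byte, so the fit should recover them.

def fit_affine(samples):
    n = len(samples)
    sx = sum(L for L, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(L * L for L, _ in samples)
    sxy = sum(L * t for L, t in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

samples = [(L, 50.0 + 0.01 * L) for L in (0, 64, 256, 1024, 4096)]
a, b = fit_affine(samples)
print(a, b)
```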
To obtain estimates of the time to execute user instructions natively, we proceeded as follows. We first ran the application natively on a small number of nodes and used the Unix time command to measure the total amount of user time, U, consumed. We then ran the same application under LAPSE with instruction counting but without the simulator, which reports the total number of user instructions executed, I. From these we compute a "conversion factor" c = U/I, which gives us an estimate of the average time per (user) instruction; LAPSE then predicts the time to execute J application instructions between two message passing calls as J x c. For a given application and data size per processor (e.g., the value of G in SOR, or the number of triangles per node for PGL), the conversion factor is computed on a small number of nodes (typically 2 x 2) and then used in subsequent LAPSE runs on larger numbers of nodes. The conversion factor incorporates, in an average sense, the effects of the instruction mix, the cache hit ratio, and the data access patterns of the application. For example, in SOR with a fixed value of G, the number of instructions executed per node and the data access patterns are approximately the same on a 2 x 2 grid of processors as on an 8 x 8 grid of processors; thus, for a fixed G, we expect the conversion factor measured on the small number of nodes to be (and these estimates are) accurate. However, the conversion factor may be different for a different data size per node, and so a separate factor must be computed for each value of G (or number of triangles per node).
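The prediction path is then a pair of one-line computations; the figures below are hypothetical, chosen only to make the arithmetic visible.

```python
# Conversion factor c = U/I (measured user time over counted instructions)
# and the resulting prediction J*c for a block of J instructions.
# All numbers are hypothetical illustrations.

def conversion_factor(user_seconds, instructions):
    return user_seconds / instructions

def predict_seconds(J, c):
    return J * c

U = 12.0               # user time from the Unix time command, seconds
I = 600_000_000        # instruction count reported by LAPSE
c = conversion_factor(U, I)
t = predict_seconds(1_500_000_000, c)  # predicted time for 1.5e9 instructions
print(t)
```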
We then ran each application both natively and under LAPSE, and compared the observed and predicted execution times. Table 2 presents the percentage errors in predicted execution time for the four applications on 4, 16, 32, and 64 nodes. As can be seen in the table, the maximum error is 6%; LAPSE provides accurate predictions over a range of applications, numbers of nodes, computation to communication ratios, and efficiencies. The runs also span a range of instruction-execution fractions: for example, SOR on 16 processors spends 47% of its time executing user instructions when G = 25, as opposed to roughly 90% when G = 250. We have run several other validations as well, and larger errors can occur for very short runs in which initialization effects are relatively more important: the +3% error reported in Table 2 for PGL-High on 64 nodes is for a run generating 20 frames, whereas the error is -12% when only 10 frames are generated. Moreover, because the PGL message passing path is timing dependent, differences between the LAPSE and native runs may be present (e.g., a different number of message passing calls may be executed), so somewhat larger errors are to be expected for such applications. These results show that LAPSE prediction errors are small for timing independent applications and, to a slightly lesser degree, for timing dependent applications.
5.3 Slowdowns
We next investigated the slowdowns incurred in running an application under LAPSE. There are several types of overhead involved. First, there is the overhead of instruction counting, i.e., of executing the counting code inserted into the application. This overhead can be captured by direct measurement, by running the instrumented application without the simulator; such measurements have indicated that the instruction counting overhead typically ranges from 30% to 80% of the native execution time. Second, there is operating system overhead: the native application has only one process per node, while under the Paragon's OSF-1 operating system the simulation runs multiple processes per node (one or more application processes and one simulator process), with the attendant costs of process management and scheduling. Third, there is the overhead of the simulation proper: executing simulation events, managing messages and appointments, and the synchronization involved in constructing windows. It is harder to separate these latter overheads by direct measurement, and we do not attempt to do so here. All of these overheads are captured in the slowdown, defined to be the time it takes LAPSE to execute the application on N nodes divided by the time it takes the application to execute natively on N nodes.

We begin by demonstrating the effect that application lookahead has on simulation overhead. As explained earlier, the flow-control parameter F limits the number of messages a process may send ahead of the simulator, and so determines how far the application is permitted to run in advance of the simulator; by varying F we can adjust the lookahead in a controlled manner. Table 3 presents the slowdown for APUCS running on 8 processors (one VP per processor) as a function of F, together with A, the average number of application events per window per VP. (An application event here is a message passing call such as a send or a receive.) For the low computation to communication ratio, the slowdown decreases from 30.4 down to 9.3 as F increases from 2 to 16, while A increases from 0.3 to 8.0; for the high ratio, the slowdown decreases from 12.0 to 3.7 and A increases from 0.3 to 4.6 as F increases from 2 to 16. This demonstrates that when the application is permitted to run well in advance of the simulator, more events are captured per window and the synchronization cost per event decreases. However, increasing F is not always beneficial: for the low ratio the slowdown increases again for large F, from 9.3 at F = 16 to 12.1 at F = 40. The reason for this behavior is not entirely clear; contributing factors include an effect of the flow-control protocol described in [8], operating system scheduling of the application and simulator processes, and the cost of managing the larger number of messages and appointments outstanding in a single window. For the high ratio, A simply levels off as F increases, since there are few additional application events to capture; see APUCS-High in Table 3.

We next consider the slowdowns as a function of the number of processors, simulating N VPs on N processors; the results for runs on 4, 16, 32, and 64 processors are presented in Table 4. (For these experiments F was set at moderate values so as to obtain good lookahead for APUCS and for PGL.) Observe first that the slowdowns for SOR, BPS, and APUCS are modest (between 1.8 and 28.4), but that the slowdown generally increases as the number of processors increases, and is strongly dependent on the computation to communication ratio: as the ratio decreases, the slowdown increases considerably. These trends are the combined effect of a number of interacting factors. Recall that the window construction algorithm computes a global reduction (a minimum over all VPs); the cost of this reduction grows (logarithmically) with the number of processors, and windows are constructed more frequently when there is little computation between communications. With a high ratio, most of the time is spent executing application instructions and the simulator's synchronization costs are less important. For PGL the slowdowns are significantly higher, reaching 139 on 64 nodes. This occurs because PGL has so little lookahead: the application continually makes temporally sensitive msgdone calls, and each such call blocks the calling process until the simulation has advanced up to the time at which the question was asked. The simulator is thus forced to construct many windows in which there are few application events, so the synchronization cost per event becomes large. This severely slows the simulation; it can be counteracted (to some extent) by placing multiple application processes per node, which increases the number of events available per window.
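The window-edge rule w(t) = B + min{L_min, R_min + S_min}, followed by the global minimum that the window construction algorithm computes, can be sketched as follows. The per-VP decomposition and all bound values here are illustrative assumptions, not LAPSE internals.

```python
# Window-edge rule w(t) = B + min(L_min, R_min + S_min), followed by the
# global min-reduction over VPs performed by window construction.
# All (B, L_min, R_min, S_min) values are hypothetical.

def window_edge(B, L_min, R_min, S_min):
    return B + min(L_min, R_min + S_min)

# Bounds proposed by each VP; the simulator may advance every VP only up
# to the smallest proposed edge.
vp_bounds = [(100, 40, 30, 20), (100, 25, 30, 20), (100, 80, 30, 20)]
edges = [window_edge(*b) for b in vp_bounds]
w = min(edges)  # the global (min) reduction; its cost grows with node count
print(w)
```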
                Comp./Comm.       Number of Processors
Application     Ratio             4      16     32     64
SOR             Very Low         -4%    +3%    +2%    +1%
SOR             Low               0%    +4%    +4%    +4%
SOR             High             +1%    +1%    +1%    +2%
BPS             Low              -6%    +1%     -     -4%
BPS             High             -1%    -2%     -     -3%
APUCS           Low              -1%    -2%    -3%    -1%
APUCS           High             -2%    -2%    -1%     0%
PGL             Low              +1%    -2%    +1%    +5%
PGL             High             -1%    -2%    -1%    +3%

Table 2: LAPSE prediction accuracy: percentage error in predicted execution time, by application and number of processors.
We next measure relative speedups, obtained by simulating N virtual processors on n physical processors with N >= n, so that multiple application processes are resident on each node. We first consider simulating a small number of VPs. Table 5 shows the relative speedups for LAPSE running an 8 VP problem on 1, 2, 4, and 8 processors (a 4 VP problem for BPS, which runs only on power-of-4 numbers of processors); here relative speedup is defined to be the LAPSE execution time on one processor divided by the LAPSE time on n processors, and relative efficiency is the relative speedup divided by the number of processors. For the applications with good lookahead, the relative speedups on 8 nodes are between 3.5 and 4.7 (relative efficiencies between 0.43 and 0.60), while for BPS on 4 nodes the speedups are between 2.4 and 3.7 (relative efficiencies between 0.60 and 0.92). For PGL, which has little lookahead, the maximum relative speedup is 1.7.

We next consider the case of simulating a large number of VPs, which is the intended use of LAPSE. Table 6 shows the relative speedups for LAPSE simulating a 64 VP problem on from 8 to 64 nodes, i.e., with from 8 down to 1 VPs per processor; here relative speedup is defined to be the LAPSE time on 8 nodes divided by the LAPSE time on n nodes. (For BPS-High, memory size per node limited the runs to at most 4 VPs per node, so its times are stated relative to 16 nodes with 4 VPs per node.) Again the codes with good lookahead, SOR, BPS, and APUCS, behave well: their relative speedups increase monotonically, and on 64 nodes they execute 3.9 to 7.0 times faster than on 8 nodes. The maximum PGL speedup is 2.5. These speedups are higher than the slowdowns in Table 4 might suggest, because a node simulating multiple VPs has more application work per window, so that the fixed costs of window construction are better amortized.

Finally, we were also able to run a 512 VP case of SOR-Very Low on 64 nodes. Since we did not have access to 512 nodes, a prediction error for this run could not be computed. However, each node has 8 VPs, and hence at least 8 times as much work to do as a processor in a native 512-processor run; LAPSE ran only about 100 times slower than it would take the application to run on 512 processors. This effectively demonstrates that LAPSE can be used to study the behavior of an application on a machine larger than the one physically available.
Comp./Comm.   Maximum Application   LAPSE      Avg. Events per
Ratio         Lookahead (F)         Slowdown   Window per VP
Low            2                    30.4        0.3
Low            4                    14.2        1.1
Low            8                    10.2        3.3
Low           16                     9.3        8.0
Low           24                     9.9       12.8
Low           32                    10.9       17.5
Low           40                    12.1       21.6
High           2                    12.0        0.3
High           4                     6.1        1.3
High           8                     4.3        2.8
High          16                     3.7        4.6
High          24                     3.5        5.5
High          32                     3.5        6.0
High          40                     3.5        6.2

Table 3: LAPSE slowdowns on APUCS as a function of application lookahead: LAPSE execution time divided by native execution time using 8 processors.
                Comp./Comm.       Number of Processors
Application     Ratio             4      16     32     64
SOR             Very Low         17.5   28.4   24.0   28.0
SOR             Low               6.5   11.5   13.2   16.7
SOR             High              1.8    4.0    4.1    6.1
BPS             Low               3.0    7.5     -    11.9
BPS             High              1.9    2.4     -     3.0
APUCS           Low               7.5   11.1   13.5   18.8
APUCS           High              3.0    5.9    7.6   12.9
PGL             Low              15.1   61.6   99.2   148
PGL             High             17.4   70.2   116    139

Table 4: LAPSE slowdowns: LAPSE execution time divided by native execution time using the same number of processors.
                Comp./Comm.       Number of Processors
Application     Ratio             1      2      4      8
SOR             Very Low         1.0    1.8    3.1    4.7
SOR             Low              1.0    1.8    3.0    4.5
SOR             High             1.0    2.0    3.4    4.6
BPS             Low              1.0    1.7    2.4     -
BPS             High             1.0    2.0    3.7     -
APUCS           Low              1.0    1.7    2.5    3.8
APUCS           High             1.0    1.6    2.4    3.5
PGL             Low              1.0    1.5    1.6    1.7
PGL             High             1.0    1.4    1.4    1.6

Table 5: LAPSE relative speedups on an 8 VP problem (on a 4 VP problem for BPS). Times are relative to one processor.
                Comp./Comm.       Number of Processors
Application     Ratio             8      16     32     64
                (VPs/processor)   8       4      2      1
SOR             Very Low         1.0    2.0    3.5    5.9
SOR             Low              1.0    2.0    3.5    6.2
SOR             High             1.0    1.7    4.2    7.0
BPS             Low              1.0    1.9    2.9    4.4
BPS             High              -     1.0    1.7    2.7
APUCS           Low              1.0    1.9    3.0    4.1
APUCS           High             1.0    1.8    2.8    3.9
PGL             Low              1.0    1.9    2.2    2.5
PGL             High             1.0    1.8    2.1    2.3

Table 6: LAPSE relative speedups on a 64 VP problem. Times are relative to 8 processors with 8 VPs per processor. (For BPS High, times are relative to 16 processors with 4 virtual processors per processor.)
6 Conclusions

This paper describes LAPSE, a tool that supports the parallelized direct-execution simulation of message-passing programs on the Intel Paragon multicomputer. Using LAPSE, an application running on a small number of processors can be simulated as though it were running on a larger number of processors, and performance predictions obtained without access to the larger machine. LAPSE's predictions were validated for four scientific and engineering codes: the predictions were within 10% of the actual execution times, and typically within 5%. The slowdowns and speedups of the simulation itself were shown to depend on several factors, most importantly the amount of application lookahead and the application's computation to communication ratio. For codes with good lookahead the slowdowns are modest and good speedups are obtained; for codes with little lookahead, such as PGL, the slowdowns can be quite high.

The simulation exploits the observation that the execution paths of many message-passing codes are insensitive to execution timing, so that long timing-insensitive periods can be simulated with little synchronization. Temporally sensitive applications are handled by a synchronization protocol that is conservative: the current implementation blocks the application at a temporally sensitive call until the simulation clock has advanced to the simulation time at which the call was made. Consider first an application that calls clock routines, e.g., "what is the wall-clock time now?". For applications that read clock values only to report timing data, for example instrumentation that records times at certain locations, the values are stored and not used until after the run, so such calls need not pose a problem. For applications whose execution paths do depend on clock values, blocking is correct but can be costly when clock calls are frequent; a more advanced approach would let the clock call return the current simulation clock value without blocking, accept that the value is potentially wrong (the true value being as yet unknown), and provide a mechanism by which the effects of an erroneous value could subsequently be canceled. This would require more elaborate data structures, and we have not pursued it.

The second type of temporally sensitive call is a query about the status of messages, e.g., a probe that asks "is there a message of type T here now?". Suppose the probe is made when the application's clock has advanced to time, say, 100, while the simulation time has advanced only to 50. If the probed message is already there at time 50, then it is guaranteed to be there at time 100, and the simulator can answer the probe immediately; otherwise the answer is unknown until simulation time has advanced to 100. The current protocol is too conservative in such cases, since it blocks the application whenever a temporally sensitive call is made; a protocol that answers a probe as soon as the answer is determined, and blocks only until that point, can be designed, and for applications that probe frequently this should considerably reduce the slowdown.
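The probe-answering rule in the example above can be written as a small decision function. This is an illustrative sketch, not LAPSE code; the queue representation and function name are assumptions, and the queue is assumed already filtered to messages of the probed type.

```python
# Decision rule for a probe made at application time t_app while the
# simulation clock stands at t_sim. `queued` lists (arrival_time, msg_type)
# pairs already delivered to this VP, filtered to the probed type.

def answer_probe(t_app, t_sim, queued):
    """Return True/False when the answer is already safe to give,
    or None when the caller must block until t_sim reaches t_app."""
    if any(t <= t_app for t, _ in queued):
        return True   # already present: guaranteed still present at t_app
    if t_sim >= t_app:
        return False  # simulation has passed t_app with no matching message
    return None       # too early to know: block until t_sim reaches t_app

# Probe at time 100 while simulation time is only 50:
print(answer_probe(100, 50, queued=[(40, "T")]))  # True: message was at 40
print(answer_probe(100, 50, queued=[]))           # None: unknown yet
```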
There are a number of extensions to LAPSE that we are currently pursuing, besides the protocol refinements just described. First, our model of the Paragon's interconnection network is really quite crude: it is a pure delay model that does not represent contention in the Paragon's mesh, nor the way in which the operating system manages message buffers. We plan to investigate integrating more sophisticated (and more complex) models, e.g., a packet-by-packet model of the hardware interconnection network, in order to understand how much network detail is needed for accurate timing estimates. Second, we are pursuing porting LAPSE to other platforms, especially networks of workstations. The nx-lib package [15] provides the Paragon's message-passing functionality on networks of Sun workstations and could be used in conjunction with our simulator, although parts of the nx-lib library would need to be changed to incorporate instruction counting and the LAPSE timing and synchronization calls. In addition, we plan to investigate how LAPSE can support the emerging MPI (Message Passing Interface) standard [17]. Finally, our experience with the PGL code indicates that it is well worth pursuing optimizations that accelerate the handling of temporally sensitive message passing calls such as msgdone (which is really a probe), e.g., by exploiting newly computed lookahead to reduce the frequency with which the application must block.
Acknowledgments

We are grateful to Jeff Earickson of Intel Supercomputers for his generous assistance in helping us to better understand the Intel Paragon system.
References

[1] J.H. Bramble, J.E. Pasciak, and A.H. Schatz. The construction of preconditioners for elliptic problems by substructuring, I. Math. Comp., 47:103-134, 1986.

[2] E.A. Brewer, C.N. Dellarocas, A. Colbrook, and W.E. Weihl. PROTEUS: A high-performance parallel-architecture simulator. Technical Report MIT/LCS/TR-516, Massachusetts Institute of Technology, September 1991.

[3] D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239-248, May 1990.

[4] R. Covington, S. Dwarkadas, J. Jump, S. Madala, and J. Sinclair. Efficient simulation of parallel computer systems. International Journal on Computer Simulation, 1(1):31-58, June 1991.

[5] R.C. Covington, S. Madala, V. Mehta, J.R. Jump, and J.B. Sinclair. The Rice Parallel Processing Testbed. In Proceedings of the 1988 SIGMETRICS Conference, pages 4-11, May 1988.

[6] T.W. Crockett and T. Orloff. A parallel rendering algorithm for MIMD architectures. Technical Report 91-3, ICASE, NASA Langley Research Center, Hampton, Virginia, 1991.

[7] H. Davis, S. Goldschmidt, and J. Hennessy. Multiprocessor simulation and tracing using Tango. In Proceedings of the 1991 International Conference on Parallel Processing, pages II99-II107, August 1991.

[8] P.M. Dickens, P. Heidelberger, and D.M. Nicol. A distributed memory LAPSE: Parallel simulation of message-passing programs. In Proceedings of the 8th Workshop on Parallel and Distributed Simulation (PADS), Edinburgh, Scotland, 1994. To appear.

[9] R.M. Fujimoto. Simon: A simulator of multicomputer networks. Technical Report UCB/CSD 83/137, ERL, University of California, Berkeley, October 1983.

[10] R.M. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30-53, 1990.

[11] R.M. Fujimoto and W.B. Campbell. Efficient instruction level simulation of computers. Transactions of the Society for Computer Simulation, 5(2):109-124, 1988.

[12] F.W. Howell, R. Williams, and R.N. Ibbett. Hierarchical architecture design and simulation environment. In MASCOTS '94: Proceedings of the International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 363-370, Durham, North Carolina, January 1994. IEEE Computer Society Press.

[13] Intel Corporation. Paragon User's Guide. Order Number 312489-002, October 1993.

[14] D.E. Keyes and W.D. Gropp. A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation. SIAM Journal on Scientific and Statistical Computing, 8(2):s166-s202, March 1987.

[15] S. Lamberts, G. Stellner, A. Bode, and T. Ludwig. A parallel programming environment on Sun workstations. In Sun User Group Proceedings, December 1993.

[16] I. Mathieson and R. Francis. A dynamic-trace-driven simulator for evaluating parallelism. In Proceedings of the 21st Hawaii International Conference on System Sciences, pages 158-166, January 1988.

[17] Message Passing Interface Forum. Document for a Standard Message-Passing Interface. The University of Tennessee, Knoxville, Tennessee. Draft: November 2, 1993.

[18] D. Nicol and P. Heidelberger. Parallel simulation of Markovian queueing networks using adaptive uniformization. In Proceedings of the 1993 SIGMETRICS Conference, pages 135-145, Santa Clara, CA, May 1993.

[19] D. Nicol, C. Micheal, and P. Inouye. Efficient aggregation of multiple LP's in distributed memory parallel simulations. In Proceedings of the 1989 Winter Simulation Conference, pages 680-685, Washington, D.C., December 1989.

[20] D.M. Nicol and R.M. Fujimoto. Parallel simulation today. Annals of Operations Research. To appear.

[21] D.M. Nicol. The cost of conservative synchronization in parallel discrete-event simulations. Journal of the ACM, 40(2):304-333, April 1993.

[22] D.M. Nicol. Parallel discrete-event simulation of FCFS stochastic queueing networks. In Proceedings ACM/SIGPLAN PPEALS 1988: Experiences with Applications, Languages and Systems, pages 124-137, 1988.

[23] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 1988.

[24] S. Reinhardt, M. Hill, J. Larus, A. Lebeck, J. Lewis, and D. Wood. The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. In Proceedings of the 1993 ACM SIGMETRICS Conference, pages 48-60, Santa Clara, CA, May 1993.

[25] R. Righter and J.V. Walrand. Distributed simulation of discrete event systems. Proceedings of the IEEE, 77(1):99-113, January 1989.
REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)

1. AGENCY USE ONLY: (blank)
2. REPORT DATE: June 1994
3. REPORT TYPE AND DATES COVERED: Contractor Report
4. TITLE AND SUBTITLE: Parallelized Direct Execution Simulation of Message-Passing Parallel Programs
5. FUNDING NUMBERS: C NAS1-19480; WU 505-90-52-01
6. AUTHOR(S): Phillip M. Dickens, Philip Heidelberger, David M. Nicol
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Institute for Computer Applications in Science and Engineering, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23681-0001
8. PERFORMING ORGANIZATION REPORT NUMBER: ICASE Report No. 94-50
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): National Aeronautics and Space Administration, Langley Research Center, Hampton, VA 23681-0001
10. SPONSORING/MONITORING AGENCY REPORT NUMBER: NASA CR-194936; ICASE Report No. 94-50
11. SUPPLEMENTARY NOTES: Langley Technical Monitor: Michael F. Card. Final Report. Submitted to IEEE Transactions on Parallel and Distributed Systems.
12a. DISTRIBUTION/AVAILABILITY STATEMENT: Unclassified-Unlimited; Subject Category 61, 62
13. ABSTRACT (Maximum 200 words): As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), that we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
14.
SUBJECT
IS.
TERMS
Performance
prediction,
parallel
simulation,
scalability
NUMBER
OF
PAGES
27
analysis 16.
PRICE
CODE
A03 17.
SECURITY OF
CLASSIFICATION
REPORT
Unclassified NSN
18.
SECURITY OF
THIS
19.
CLASSIFICATION
SECURITY OF
PAGE
CLASSIFICATION
ABSTRACT
20.
LIMITATION OF
ABSTRACT
Unclassified Standard Prescribed 298-102
7540-01-280-$$00
1_U.$.
GOVIERNMIPNT
MUN'nNG
omc£:
_
- SZlb.i_g/231Be"/
Form 298(Rev. 2-89) by ANSIStd Z39 18