NASA Contractor Report 198242
ICASE Report No. 95-81

ICASE

TOWARDS A THREAD-BASED PARALLEL DIRECT EXECUTION SIMULATOR

Phillip Dickens
Matthew Haines
Piyush Mehrotra
David Nicol

NASA Contract No. NAS1-19480
November 1995

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001
TOWARDS A THREAD-BASED PARALLEL DIRECT EXECUTION SIMULATOR*

Phillip Dickens†, Matthew Haines†, Piyush Mehrotra†, and David Nicol‡

†Institute for Computer Applications in Science and Engineering
NASA Langley Research Center, Hampton, VA 23681-0001

‡Department of Computer Science
College of William and Mary, Williamsburg, VA 23187
Abstract

Parallel direct execution simulation is an important tool for performance analysis of large message-passing parallel programs executing on top of a virtual computer. However, detailed simulation requires a great deal of computation. We are therefore interested in pursuing implementation techniques which can decrease this cost. One idea is to implement the virtual processes as lightweight threads rather than traditional Unix processes, reducing both context-switching and on-processor communication costs. In this paper we describe an initial implementation of a thread-based parallel direct execution simulator. We discuss the advantages of such an approach over the process-based approach, and present preliminary results that indicate a significant improvement in scalability.
*This work was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.
1 Introduction

Direct execution simulation is an important tool for performance analysis of large message-passing parallel programs [7, 12, 13, 21]. In this approach, the application code is executed directly on a host machine, while references to constructs that the host does not provide (such as the communication network of the machine being simulated) are trapped and handled by the simulator. A detailed simulation of this type requires a great deal of computation, and performance will suffer in situations where the path to the simulator becomes a bottleneck or where the simulator itself is a bottleneck. Therefore, it is important to parallelize both the virtual machine executing the application processes and the simulator itself; such a simulation is a good source of parallelism and a clear candidate for parallel discrete-event simulation techniques.

We have developed LAPSE (Large Application Parallel Simulation Environment) [9], a parallel direct execution simulator for message-passing codes written for the Intel Paragon. One performance measure for a direct execution simulator is slowdown, which is defined as the time it takes the simulator to execute the application code on n processors divided by the time taken to execute the application code natively on N processors. The slowdowns measured for LAPSE running on the Intel Paragon are between 1.8 and 139, depending on the application; this is quite reasonable when compared to the slowdowns reported for other direct execution simulators.

LAPSE has also been ported to a cluster of Sun workstations using nx-lib [33], a library that provides Paragon functionality on a network of workstations and uses Unix sockets as its communication mechanism. The slowdowns measured on the workstation cluster are generally much greater than those measured on the Paragon. This is in large part due to higher communication costs: each Intel Paragon node has a dedicated communication co-processor and is connected to a high-speed mesh interconnection network, while the workstations communicate through Unix sockets over a slower network.

LAPSE assumes that there is one Unix process for each processor in the simulated system. Since each processor in the simulated system has its own address space, it is natural to implement each simulated processor as a heavyweight Unix process. Given the high costs of communicating among heavyweight processes, however, it can become impractical to support large direct execution simulations in this way. One way to reduce these costs is to implement the virtual processors as lightweight threads rather than heavyweight Unix processes. This reduces not only the cost of on-processor communication, but also the cost of context switching among the virtual processors executing on the same physical processor. However, using threads brings up a very important issue: how do we provide a separate address space for each virtual processor when all of the threads on a physical processor share one address space?

In this paper we discuss a thread-based version of LAPSE and compare its performance with that of the process-based version executing on a cluster of workstations. Our preliminary results suggest that the thread-based approach significantly improves the performance and scalability of the simulator, making direct execution simulation practical for larger problems. What makes our simulator unique is the way in which separate address spaces are provided: each thread is given its own private copy of all global variables, without requiring any special operating system support.

The remainder of the paper is organized as follows. In Section 2 we give some background information on direct execution simulation and review related work. In Section 3 we provide an overview of LAPSE. In Section 4 we briefly discuss lightweight threads. In Section 5 we discuss approaches for providing separate address spaces among threads and detail our solution. In Section 6 we present experimental results comparing the performance of the process-based and thread-based versions of LAPSE, and we conclude with a discussion of future directions in Section 7.

2 Background

Massively parallel machines are expensive, and it can be difficult or infeasible for a user to acquire the full resources of such a machine on a regular basis. Users are therefore interested in knowing how an application will perform on a large number of processors when only a smaller number is available. A code that achieves good performance using a small number of processors may not achieve good performance using a larger number of processors, and when this happens a programmer will want to know why. Users may also wish to know how a particular algorithm will perform for a different problem size, a different number of processors, or a different architecture, possibly one that is not yet available; systems researchers are interested in how applications perform with and without various system overheads. Given this interest, it would be useful if the performance of a program running on a large parallel machine could be routinely predicted by executing the code on a small number of processors, emulating the larger system. Similarly, designers of new communication networks, and of other as-yet-unbuilt hardware, could make use of a parallelized simulation driven by the execution of real application codes.

Direct execution simulation is a mechanism to predict the performance of an application on some machine of interest by actually executing the application code [7, 12, 13, 21]. Given an application written for N processors whose scalability behavior is sought, the simulator creates N application processes (virtual processors, or VPs) that execute the application code on n < N physical processors, together with simulator processes that control the advance of simulated time. The computation of the application is emulated by direct execution, while behavior that cannot be executed natively (in the case of LAPSE, the message-passing constructs) is trapped and simulated. Each virtual processor is assigned to some physical processor, and the simulator advances simulated time in such a way that the interactions among the application processes occur in correct simulated-time order. This approach allows a user to obtain performance and scalability predictions as a function of both the application and the machine. In particular, LAPSE allows a user to predict the performance of message-passing codes written for the Intel Paragon, executing them either on a smaller Paragon or on a network of workstations, and promises fairly rapid scalability analysis of large codes.

Several other projects use direct execution of application processes to drive simulations of multiprocessor systems [1, 5, 6, 17, 29]. The Wisconsin Wind Tunnel (WWT) [29] is, to our knowledge, the only other multiprocessor simulator that uses a multiprocessor to execute the simulation, and it is worthwhile to note the differences between LAPSE and WWT. The first difference is the type of system being simulated: WWT is designed primarily to simulate cache-coherency protocols for virtual shared memory, while LAPSE is designed for message-passing codes. Maintaining the fidelity of a cache-coherency protocol complicates the simulation considerably, but it is not a problem we are dealing with here. The second difference is the synchronization mechanism: WWT uses a special-purpose synchronization mechanism [27], while LAPSE uses the YAWNS windowing protocol [26] combined with appointments. The third difference is the host and the operating system support required: WWT is implemented on a machine (the CM-5) whose operating system was modified significantly to support virtual address spaces, while LAPSE currently runs on an Intel Paragon or a cluster of Sun Sparc workstations and requires no special operating system support. We do not elaborate on further details here; the interested reader is directed to [29], and to [9] for a detailed description of the approach taken in LAPSE.

It should also be noted that the thread-based approach is not unique to our project. At least three other direct execution simulators employ lightweight threads: the Wisconsin Wind Tunnel [30], Tango Lite [16], and the RPPT project [7]. However, these simulators are generally implemented such that all threads share one global address space, with only the thread stacks being private (the Wisconsin Wind Tunnel, which relies on operating system support to provide separate address spaces, is the exception). Our approach provides a separate address space for each thread without requiring such support; these issues are discussed in Section 5.
3 LAPSE

In this section we provide a brief description of LAPSE; a full discussion can be found in [9]. To use LAPSE, a programmer compiles the application code with the native compilers, using LAPSE scripts that modify the makefiles and transform the code. Application codes may be written in C or Fortran. At compile-time, LAPSE modifies the assembly code of the application, inserting instruction-counting instrumentation at basic block boundaries. The net instruction count provides a measure of the execution time the code would have taken had it been run on the machine being simulated.

The key observation is that a system call is the only way in which an application process can affect, or be affected by, the simulated system. Whenever an application process makes a system call that interacts with the simulated system (e.g., a send, a receive, or a clock probe), the call is trapped by a LAPSE routine. The LAPSE routine recovers the instruction count accumulated since the last trap, uses it to advance the process's simulated clock, and passes information about the requested action (the call and its parameters) to the simulation processes, which are responsible for simulating the behavior of the communication system. An application process blocks whenever it must wait on simulated activity (as it will when, for instance, it issues a blocking receive). In effect, the application processes serve as on-line trace generators, and the simulation in turn provides the application processes with a correct sense of simulated time. LAPSE also provides capabilities for reporting performance information that can be used, for instance, for performance tuning.

For every node in the simulated system there is one application process; in addition, there is one simulation process on each physical processor. At compile-time the user provides a file specifying the number of simulated nodes, which may be much larger than the number of physical processors; the application and simulation processes are distributed among the physical processors, and LAPSE multi-tasks the processes assigned to each processor.

3.1 nx-lib

As noted before, nx-lib is a software library developed by researchers at the University of Munich [33] that allows the development and execution of codes written for the Intel Paragon multicomputer on a workstation or a network of workstations. Codes run under nx-lib have the functionality of the Paragon system; however, the clocks reflect the workstation's own sense of time, not that of the Paragon being simulated. Thus nx-lib alone cannot be used to provide accurate timing estimates of how long the code would have taken had it been run on the Paragon. Furthermore, owing to temporal sensitivities, it is possible for the execution path of a code to be different on the workstations than it would be on the actual Paragon. LAPSE supplies the missing temporal functionality: simulated time, and hence the execution path of the application, behaves as it would on the Paragon.

There are two primary issues which must be addressed when porting LAPSE, originally developed for an Intel Paragon, to a workstation cluster. The first is the cost of communication, which is very high on a workstation cluster. The second is how to improve the performance of the simulation when each physical processor must execute many virtual processors. In the process-based implementation, each application process is a Unix process with its own address space; processes communicate with one another through the TCP socket mechanism, and multi-tasking the virtual processors on a physical processor requires a Unix context switch for each change of executing process. For these reasons, implementing the virtual processors as lightweight threads rather than heavyweight Unix processes is very attractive. It provides two benefits: first, it reduces on-processor communication costs, since threads sharing an address space can communicate through memory copies rather than through sockets; second, it reduces context-switching costs, since switching between threads is much cheaper than switching between Unix processes.
4 Lightweight Threads

Threads are becoming a popular mechanism for expressing parallelism and asynchronous behavior in many current languages and systems [3, 11, 22]. In response to this demand for threaded systems, many lightweight thread packages have been developed [4, 14, 19, 24, 32]. Additionally, a standard interface for lightweight threads, called pthreads, has been proposed by the POSIX standards committee [18], and a library implementation of this proposed standard already exists [23].

4.1 Definition

A thread represents an independent, sequential unit of computation that executes within the context of a kernel-supported entity, such as a Unix process. Each thread has a context that must be saved when the thread is suspended and restored when the thread is reinstated on a processor (i.e., a context switch), and the "weight" of a thread corresponds to the amount of context that must be saved and restored. The context of a heavyweight process, such as a Unix process, includes the address space, page tables, interrupt vectors, and kernel state, so heavyweight context switches are expensive, often requiring time on the order of thousands of microseconds. Contemporary operating system kernels, such as Mach, decouple the thread of control from the address space, allowing multiple threads to execute within a single address space and reducing the context of each thread. For these kernel-level (or middleweight) threads, however, all thread operations are still controlled by the kernel, and crossing the kernel interface is still required to create and manipulate threads; such operations typically require hundreds of microseconds. By exposing thread operations at the user level, user-level threads can avoid crossing the kernel interface altogether: the context of a user-level thread is minimal (typically the stack and registers), and all thread state is manipulated by a library within the process. As a result, user-level threads can be switched on the order of tens of microseconds, and are thus termed lightweight. For the remainder of this paper, we will use the terms "thread," "user-level thread," and "lightweight thread" synonymously.

4.2 Execution model

All threads execute within the context of a single process, and the thread-level execution is controlled by a thread-level scheduler, whose job is to determine the next available thread that should execute and to switch between threads. Thread scheduling is similar to process-level scheduling in the Unix operating system, and is typically either

• preemptive (typically round-robin), in which each thread is given a time quantum and control returns to the scheduler when the quantum expires or the thread yields; or

• non-preemptive (typically FIFO), in which a thread continues to execute until it completes, explicitly yields the processor, is suspended, or blocks on a synchronization primitive.

It is important to note that since all threads of a process execute within the address space of that process, the protection afforded to processes does not apply to threads. In Unix, each process is assigned a separate address space whose boundaries are enforced by the operating system; this is typically accomplished by routing all memory references indirectly through a base address register that is updated by the operating system when a new process is scheduled [2]. Threads, in contrast, all operate within the single address space of the enclosing process, so the execution of one thread can interleave with, and possibly corrupt, the data of another.

Variables declared locally within a thread function are private to each thread, since each thread has its own stack. However, variables defined outside of the scope of a thread function (file scope) are shared by all threads, and in fact are allocated in a single, global block of memory. This is typically beneficial, as it allows threads to easily share information using common addresses combined with mutual exclusion operations to ensure consistency. However, many applications, such as LAPSE, require that a certain amount of context be maintained on a per-thread basis across procedure calls. This type of information is referred to as thread-specific data. The issue for LAPSE is that, when emulating virtual distributed-memory processors as threads, each virtual processor must have its own private copy of all of the application's global variables.
5 Separate Address Spaces

5.1 Existing approaches

Most lightweight thread packages provide a mechanism for thread-specific data, that is, data that is private to a thread but global within the thread (accessible to all functions executed by that thread). For example, pthreads [18] provides a thread-specific data mechanism based on key/value pairs: a key is shared among all threads, but the value associated with the key can be set differently by each thread, and a function is provided for actually accessing the value. This approach is well-suited for managing a small amount of thread-specific data, such as a copy of errno for each thread. However, it is not a natural mechanism for providing a general separate address space, since it does not give each thread its own copy of all global variables, and it requires the user code to access the data explicitly through the key.

Another approach is to give each thread a true separate address space by relying on special operating system support. The Wisconsin Wind Tunnel [30] takes this approach: the operating system of the host was modified to manage many subservient address spaces (subcontexts) within a process, to manipulate the page mappings of each subcontext, and to handle traps generated during the execution of a subcontext using memory tags [30]. While this approach does provide separate address spaces, it depends upon significant modification of the operating system, and is therefore not portable among machines and operating systems.

A third approach is to keep a separate copy of all global variables for each thread, and to access them indirectly through a base address register that is saved and restored as threads are switched. The idea is that each thread references each global variable through a pointer which is different for each thread, so each thread accesses its own copy of each global variable. While this provides a machine- and system-independent solution, it requires significant modification to the user code so that all global variables are referenced through the base address register rather than being accessed directly. We know of no project which uses this approach.

We wanted an approach that requires no special operating system support, is portable, and gives each thread its own copy of all global variables without significant hand-modification of the user code.
5.2 Our Solution

Our solution is to employ the scoping mechanism of the C++ class to provide each thread with its own copy of all global variables, and thus a separate virtual address space, without special operating system support and without significantly affecting performance. The idea is similar to the base-address-register approach, in that each thread references the global data indirectly through a pointer that is different for each thread; rather than re-writing the user code to provide this indirection, however, we take advantage of the C++ compiler to provide it automatically.

Recall that in C++, variables (data members) defined within a class are visible only through the scoping mechanism of the class, and are maintained separately for each instance of the class. Therefore, the global variable definitions of the user code are surrounded by a single class definition, which means that all global variables become data members of this class, and all functions (including main) are transformed into member functions of the class. Each thread then creates its own instance of this class, giving it a private copy of every global variable, and the underlying mechanism of saving and restoring a base address on a context switch is handled by the compiler through the implicit class pointer. By transforming the user code in this way, we can effectively create a separate address space for each thread with only minor modifications to the code.

To make these ideas more concrete, consider the example shown in Figure 1. The left column of the figure shows the original code, which contains two global functions (foo1 and func1), a main function, and three global variables (x, y, and z), all of which should be private to each thread (virtual processor) executing this code. The right column shows the transformed code necessary to give each thread its own copy of the global variables. The global variables and functions are put into one class (G_Class in this example), and the original main function is renamed (new_main in this example). A new thread entry function is written, which declares an object of type G_Class and calls the renamed main function through this object. Thus, when a thread calls the renamed main function, it accesses all of the global variables (x, y, z) within its own "private" address space, via the implicit class pointer. Note that each global variable is accessed exactly as it was in the original code: although it is actually accessed as a member variable through the implicit class pointer, to the programmer it still appears to be a global variable. At this point, each of the N threads executing new_main has its own private copy of the global variables, and hence its own virtual address space.

One drawback of our approach is that it requires a C++ compiler; interfacing with user code written in Fortran, for instance, could pose some problems. We discuss the proper entry code for threaded-LAPSE below.

Original code:

    #include ...

    int x, y, z ;

    main()
    {
      int a, b, c ;
      double d1 ;
      a = x ; b = y ;
      z = a + b ;
      x = z ;
      d1 = foo1() ;
      .......
    }

    double foo1()
    { ..... }

    int func1()
    { ..... }

Transformed code:

    #include ...

    class G_Class {
    public:
      int x, y, z ;
      double foo1() ;
      int func1() ;
      new_main() ;
    };

    G_Class::new_main()
    {
      int a, b, c ;
      double d1 ;
      a = x ; b = y ;
      z = a + b ;
      x = z ;
      d1 = foo1() ;
      .......
    }

    Thread_Entry_Function()
    {
      G_Class my_class ;
      my_class.new_main() ;
    }

Figure 1: Creating separate address spaces for each virtual processor.
6
of the
tion
costs
primary due
of processes
1,2,8
and
processor).
time
of this
approach).
The
presenting
our
simple
Also,
memory the
employs
processors.
the
communica-
threads
of this using
which
16 virtual
(thus
instead
improvement
a communication
capture
processors,
having
the
the
executes
simulator
using
impact
on
we simulate
the
(threads)
per
discuss
one simulator
process
into
our
process
(at
thread
based
of Sun Sparcstation
20s.
Before
our
of message
a cluster
to briefly
and
16, 8, 2, 1 processes
processor
integrated
conducted
when
a measure
two approaches
lower
ratio.
a constant
be useful
is much time
To get
the
each physical not
execution
a set of experiments
processors
were
it would
virtual
implementation
processors. process-based
of whether version, copy is used data.
Spaces
communication
compare
we present
we have
is for an on processor
to transfer
virtual
16 physical
experiments
regardless
interprocess
communication/computation code
results
above,
Address
of a thread-based
which
Additionally,
In the thread-based sage
to simulate
writing
between
As noted cation
of the
application
using
physical
passing
to a reduction
due to the
/* The
the
benefits
application.
performance code
expected
a set of experiments
intensive
Separate
results
are used
we present
Separate Addess Spaces
1: Creating
Experimental
One
Code
the
version
processes
of LAPSE
reside
on the same
for each communication thread
it must
or an off processor
to transfer
the
data.
uses
thread.
Unix
sockets
or different
physical
be determined If the
If it is off processor,
for all communi-
thread Unix
processors.
whether
the mes-
is on processor sockets
are used
The code used in these experiments goes through several iterations of a compute/communication cycle. In the computation phase the application performs some number of integer operations; in the communication phase each process sends a message to its nearest neighbor in a ring and then receives a message from its neighbor. The metric of performance is the number of seconds required for the application to complete one iteration of the compute/send/receive cycle. We vary the communication/computation ratio by varying the number of integer operations performed in the computation phase: the first experiment performs 10,000 integer operations per iteration (high communication/computation ratio), the second performs 250,000 integer operations (medium ratio), and the final experiment performs a much larger number of integer operations, so that communication accounts for very little of each iteration (low communication/computation ratio).

Figure 2: High Communication/Computation Ratio for 16 Virtual Processors

The results for the first experiment are shown in Figure 2. As can be seen, when there is one process (thread) per physical processor the process-based approach is a little bit faster than the thread-based approach. This can be attributed to the overhead of the lightweight thread system: using threads requires routines and data structures to determine such things as whether the recipient of a particular message resides on a given processor and whether a particular message buffer is free. When there is only one thread per physical processor this machinery is unnecessary and can be avoided, which would improve performance. We have not yet optimized for the one thread per processor configuration.

As the number of processes (threads) per physical processor increases, the time required by the process-based version to complete one iteration rises much more quickly than that of the thread-based version. When the number of processes per physical processor is 16, one iteration of the process-based version takes approximately 10 times as long as one iteration of the thread-based version. This confirms that the communication costs of the process-based approach increase significantly as the number of processes per physical processor grows, while the costs of the thread-based approach grow much more slowly.

Figure 3: Medium Communication/Computation Ratio for 16 Virtual Processors

As shown in Figure 3, the thread-based approach still significantly outperforms the process-based approach at the medium communication/computation ratio. Again the time required by the process-based version rises much more quickly than that of the thread-based version as the number of processes (threads) per physical processor is increased.
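The compute/send/receive cycle described above can be sketched as follows. This is a schematic, sequential illustration of the benchmark structure, not the actual measured code; the function name and message format are assumptions.

```python
# Schematic sketch of one iteration of the synthetic benchmark: each of
# n_procs simulated processes performs `ops` integer operations
# (computation phase), sends a message to its nearest neighbor in a ring,
# and then receives a message (communication phase). Run sequentially
# here purely to illustrate the cycle's structure.

def ring_iteration(n_procs, ops):
    mailboxes = [[] for _ in range(n_procs)]
    for p in range(n_procs):
        acc = 0
        for i in range(ops):      # computation phase: integer operations
            acc += i
        mailboxes[(p + 1) % n_procs].append(("msg", p))   # send to ring neighbor
    # receive phase: each process consumes the message sent to it
    received = [mailboxes[p].pop(0) for p in range(n_procs)]
    return received

# High communication/computation ratio: 10,000 integer ops per iteration.
msgs = ring_iteration(4, 10_000)
```

In the real experiments the per-iteration wall-clock time of this cycle, under the thread-based and process-based simulators, is the quantity plotted in Figures 2 through 4.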
performs
the
most
has
rise much that
to note
approach
for the
thread
a thread
a message
and
would the
message
during
whether
a particular
Thus
each
the scheduler
execution.
give up
message iteration. may
have
The process-based
will be progressing
that
scheduling
of the
to swap
there
process-based
communication/computation process-based approach
processes
as we now
approach
of processes
process-based
four
costs
approach.
the
the
the
the
the
(threads)
grow
(threads)
per
slightly
out-
per processor.
discuss.
threads
rate
which
it issues
experiments, order
for a particular
is time-sliced,
same
is when
is no guaranteed
several
approach the
Again
with
In our
will be available
at roughly
for the
increase
when
processor
available.
Since
message
ratio.
as the number
two and
would
case. outperforms
of the thread-based approach is non-pre-emptive, and have no more useful work to perform. One example
control
is not
than
associated approach
It is interesting to do with
slowly
in performance the costs
Our preliminary implementation thus the threads execute until they of where
more
the thread-based
thread-based
likely
communication/computation
improvement
it is seen
processor This
medium
approach
4 shows
ratio
for the
Processors
each
making mean
thread
of execution thread
in and out before may
a call to receive
finding likely
fewer
context
a
it is unknown
when
it more
requests
it is required. one eligible
that
for
each process
switches
to find
a process eligible for execution. This issue of thread scheduling merits further investigation. All of the experiments reported here make it clear that thread-based implementations offers
the possibility
improvement
of significant
in performance
performance is not
very
improvement.
sensitive
11
to the
Also, our studies the
show that
communication/computation
this
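The scheduling problem described above can be sketched as a search for an eligible thread. This is a schematic illustration only, under the assumption that a thread is eligible exactly when its awaited message has arrived; the name `find_eligible` and the inbox representation are illustrative, not the implementation.

```python
# Schematic sketch of the non-pre-emptive scheduling issue: a thread that
# blocks on a receive yields the processor, and the scheduler must then
# examine threads until it finds one whose message *has* arrived.

def find_eligible(inboxes, start):
    """Return the index of the next thread with a pending message, or None."""
    n = len(inboxes)
    for k in range(n):
        tid = (start + k) % n
        if inboxes[tid]:          # message available: thread is eligible
            return tid
    return None                   # every thread is blocked on a receive

# Threads 0 and 1 are blocked awaiting messages; thread 2 can run,
# but the scheduler examines the two blocked threads first.
inboxes = [[], [], ["msg"]]
eligible = find_eligible(inboxes, 0)
```

The time-sliced process-based approach pays this search cost (as context switches into and out of blocked processes), whereas the non-preemptive threads, progressing at roughly the same rate, rarely block at all.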
Figure 4: Low Communication/Computation Ratio for 16 Virtual Processors

Clearly we need to develop a test suite for real direct execution simulation applications, but our preliminary results are very encouraging. Before leaving this section, it is important to note that these results do not address the important issue of relative speedup due to parallelization of the virtual computer simulator. This issue is discussed in detail in [9].
Conclusions

This paper outlines the idea of using lightweight threads, rather than traditional Unix processes, to support the application virtual processes of a parallel direct execution simulator such as LAPSE. The process-based approach suffers from high interprocess communication and context-switch overheads, overheads that a thread-based approach can much reduce. We have outlined the research issues that must be carefully addressed to support a thread-based simulator, such as a method for providing thread-specific address spaces, something supported by traditional Unix processes but not by lightweight threads.

Preliminary results based on the implementation outlined here show that the thread-based approach offers the potential for significant performance improvement over the process-based approach: we observed up to a ten fold improvement for the ring based message-passing code, with the size of the improvement varying with the computation/communication ratio. Our next step is to modify LAPSE to use the thread-based approach. This would allow us to run existing message-passing application codes on the thread-based simulator and to investigate the effect on performance due to the computation/communication patterns of the applications, something we feel bears further investigation.
References

[1] D. Agrawal, M. Choy, H.V. Leong, and A. Singh. Maya: A simulation platform for distributed shared memories. In Proceedings of the 8th Workshop on Parallel and Distributed Simulation, pages 151-155, 1994.

[2] Maurice J. Bach. The Design of the UNIX Operating System. Software Series. Prentice-Hall, 1986.

[3] Henri E. Bal, M. Frans Kaashoek, and Andrew S. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, 18(3):190-205, March 1992.

[4] Brian N. Bershad. PRESTO user's manual. Technical Report 88-01-04, Department of Computer Science, University of Washington, 1988.

[5] E.A. Brewer, C.N. Dellarocas, A. Colbrook, and W.E. Weihl. PROTEUS: A high-performance parallel-architecture simulator. Technical Report MIT/LCS/TR-516, Massachusetts Institute of Technology, September 1991.

[6] D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Proceedings of the International Symposium on Computer Architecture, pages 239-248, May 1990.

[7] R.C. Covington, S. Dwarkadas, J.R. Jump, J.B. Sinclair, and S. Madala. The efficient simulation of parallel computer systems. International Journal in Computer Simulation, 1(1):31-58, 1991.

[8] H. Davis, S. Goldschmidt, and J. Hennessy. Multiprocessor simulation and tracing using Tango. In Proceedings of the 1991 International Conference on Parallel Processing, pages II99-II107, August 1991.

[9] P.M. Dickens, P. Heidelberger, and D.M. Nicol. Parallelized direct execution simulation of message-passing parallel programs. Technical Report 94-50, ICASE, July 1994.

[10] S. Dwarkadas, J. Jump, and J. Sinclair. Execution-driven simulation of multiprocessors. ACM Transactions on Modeling and Computer Simulation, 4(4):314-338, October 1994.

[11] I. Foster and K.M. Chandy. Fortran M: A language for modular parallel programming. Technical Report MCS-P327-0992 Revision 1, Mathematics and Computer Science Division, Argonne National Laboratory, June 1993.

[12] R.M. Fujimoto. Simon: A simulator of multicomputer networks. Technical Report UCB/CSD 83/137, ERL, University of California, Berkeley, 1983.

[13] R.M. Fujimoto and W.D. Campbell. Efficient instruction level simulation of computers. Transactions of the Society for Computer Simulation, 5(2):109-124, April 1988.

[14] Dirk Grunwald. A users guide to AWESIME: An object oriented parallel programming and simulation system. Technical Report CU-CS-552-91, Department of Computer Science, University of Colorado at Boulder, November 1991.

[15] Carl Hauser, Christian Jacobi, Marvin Theimer, Brent Welch, and Mark Weiser. Using threads in interactive systems: A case study. In Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, pages 94-105, December 1993.

[16] S. Herrod. Tango Lite: A multiprocessor simulation environment. Unpublished; see http://www-flash.stanford.edu/~herrod/Papers.

[17] F.W. Howell, R. Williams, and R.N. Ibbett. Hierarchical architecture design and simulation environment. In MASCOTS '94: Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 363-370, Durham, North Carolina, January 1994.

[18] IEEE. Threads Extension for Portable Operating Systems (Draft 7), February 1992.

[19] David Keppel. Tools and techniques for building fast portable threads packages. Technical Report UWCSE 93-05-06, University of Washington, 1993.

[20] R. Konuru, J. Casas, R. Prouty, S. Otto, and J. Walpole. A user-level process package for PVM. In Proceedings of the Scalable High Performance Computing Conference, 1994.

[21] I. Mathieson and R. Francis. A dynamic trace-driven simulator for evaluating parallelism. In Proceedings of the 21st Annual Hawaii International Conference on System Sciences, pages 158-166, January 1988.

[22] Piyush Mehrotra and Matthew Haines. An overview of the Opus language and runtime system. Technical Report 94-39, ICASE, 1994. Also appears in Proceedings of the 7th Annual Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science 892, pages 346-360, Springer-Verlag, 1994.

[23] Frank Mueller. A library implementation of POSIX threads under UNIX. In Proceedings of the Winter USENIX Conference, pages 29-41, San Diego, CA, January 1993.

[24] Bodhisattwa Mukherjee, Greg Eisenhauer, and Kaushik Ghosh. A machine independent interface for lightweight threads. Technical Report CIT-CC-93/53, College of Computing, Georgia Institute of Technology, Atlanta, Georgia, 1993.

[25] D. Nicol, C. Micheal, and P. Inouye. Efficient aggregation of multiple LP's in distributed memory parallel simulations. In Proceedings of the 1989 Winter Simulation Conference, pages 680-685, December 1989.

[26] D.M. Nicol. Parallel discrete-event simulation of FCFS stochastic queuing networks. In Proceedings of the ACM/SIGPLAN PPEALS 1988: Experiences with Applications, Languages and Systems, pages 124-137. ACM Press, 1988.

[27] D.M. Nicol. The cost of conservative synchronization in parallel discrete-event simulations. Journal of the ACM, 40(2):304-333, April 1993.

[28] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 1988.

[29] S. Reinhardt, M. Hill, J. Larus, A. Lebeck, J. Lewis, and D. Wood. The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. In Proceedings of the 1993 ACM SIGMETRICS Conference, pages 48-60, Santa Clara, CA, May 1993.

[30] S. Reinhardt, B. Falsafi, and D. Wood. Kernel support for the Wisconsin Wind Tunnel. In Proceedings of the Usenix Symposium on Microkernels and Other Kernel Architectures, September 1993.

[31] Carl Schmidtmann, Michael Tao, and Steven Watt. Design and implementation of a multithreaded Xlib. In Proceedings of the Winter USENIX Conference, pages 193-203, San Diego, CA, January 1993.

[32] Sun Microsystems, Inc. Lightweight Process Library, Sun release 4.1 edition, January 1990.

[33] G. Stellner, S. Lamberts, and T. Ludwig. NxLib - a parallel programming environment for workstation clusters. In PARLE'94, Parallel Architectures and Languages Europe, Athens, 1994.
To appear in the Proceedings of the 29th Hawaii International Conference on System Sciences.