NASA Contractor Report 194936
ICASE Report No. 94-50

PARALLELIZED DIRECT EXECUTION SIMULATION OF MESSAGE-PASSING PARALLEL PROGRAMS

Phillip M. Dickens
Philip Heidelberger
David M. Nicol

Contract NAS1-19480
June 1994

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association

Parallelized Direct Execution Simulation of Message-Passing Parallel Programs*

Phillip M. Dickens
ICASE
NASA Langley Research Center
Hampton, VA 23681

Philip Heidelberger†
IBM T.J. Watson Research Center
P.O. Box 704
Yorktown Heights, NY 10598

David M. Nicol‡
Department of Computer Science
The College of William and Mary
Williamsburg, VA 23185

Abstract

As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), that we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

*Research supported by NASA contract number NAS1-19480, while the authors were in residence at the Institute for Computer Applications in Science & Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681.
†This work was performed while this author was on sabbatical at ICASE.
‡This work was performed while this author was on sabbatical at ICASE. It is also supported in part by NSF grant CCR-9201195.

1 Introduction

Performance prediction of parallel programs is an important and growing area of research, especially as massively parallel computers become an important computing resource. Writers of parallel codes would like to be able to predict how well their codes will perform on large numbers of processors while using fewer, more readily available resources; designers of new or as-yet-unbuilt parallel architectures are interested in evaluating candidate designs under realistic workloads; developers of parallelizing compilers and of new parallel algorithms are interested in performance prediction as an aid towards evaluating parallelism and data placement; and developers of new communication networks are interested in predicting how applications will perform on them. General performance analysis tools based on instrumentation and measurement are available, but instrumentation code may perturb the very behavior being measured, and measurement requires that the machine and problem sizes of interest be available.

Direct execution simulation [4, 5, 9, 11, 16] offers a solution. Under this approach the application code executes directly on a host machine, while all references the application makes to the virtual machine being simulated, that is, system calls such as "what is the wallclock time now", or, "is there a message of type T available", are trapped by a discrete-event simulator. The simulator maintains a model of the virtual machine, for example of its operating system and communication network, and determines both the response to each call and the simulated time at which the response is delivered. A detailed simulation of this kind is computationally expensive; when the simulator executes on a single processor it is a bottleneck, and one easily envisions problems whose simulation time comes to dominate the cost of the study. The simulator is thus a clear source of potential parallelism, and parallelizing it offers a solution to the bottleneck.

This paper describes LAPSE (Large Application Parallel Simulation Environment), a tool for parallelized direct execution simulation of message-passing programs, which we have implemented on the Intel Paragon. The application may be written in C, or Fortran, or a mixture of the two. From the point of view of the application program, it is executing on the virtual machine; from the point of view of the simulator, the directly executing application processes drive the discrete-event model of that machine. To build a simulation the user need supply only a "makefile" describing how to build the application code and a LAPSE file describing the size and characteristics of the virtual machine to be simulated and the mapping of the virtual machine onto the physical machine. Some pieces of LAPSE are specific to the Paragon, but the approach is applicable to general message-passing systems. LAPSE accepts the application source code and supports performance prediction

on four parallel applications: two linear system solvers (one indirect, one direct), a continuous-time Markov chain simulator, and a graphics library driver. Measuring speedups (relative to simulations using only one processor) and slowdowns (relative to natively executing code) we observe a wide range of performance characteristics, depending in large part on the nature of the code being simulated. LAPSE has proven to accurately predict performance, typically within 10%. While we have simulated as many as 512 virtual processors using only 64 actual processors, the main limitation we've encountered is limited physical memory. This problem is specific to the Paragon, and should not be an issue on parallel architectures that better support virtual memory.

Several other projects use direct execution simulation of multiprocessor systems. Among these we find two pertinent characteristics, (i) the type of network being simulated, and (ii) whether the simulation is itself parallelized. Table 1 uses these attributes to categorize relevant existing work, and LAPSE.

[Table 1: Direct Execution Simulation Tools. The tools HASE[12], LAPSE, MaxPar[3], Proteus[2], RPPT[4], Simon[9], Tango[7], and WWT[24] are categorized by the type of network simulated (message-passing network, shared memory, or cache-coherent shared memory) and by whether the simulator is serial or parallel.]

Among the tools available to us, most do not support the operating system constructs that LAPSE must deal with. The application processes are OSF-1 Unix processes with independent address spaces; LAPSE therefore relies on the operating system's own mechanisms for scheduling and context switching rather than on a light-weight thread package of its own, and is subject to the operating system's costs for context switching. Context-switching cost affects direct-execution simulation speed considerably, by as much as an order of magnitude for small grain sizes, and is a key performance concern. We are not able, in the context of a government lab and an existing machine, to modify the operating system kernel to accelerate the simulation.

The Wisconsin Wind Tunnel (WWT) [24] is, to our knowledge, the only other working direct-execution simulator that uses a multiprocessor (the CM-5) to execute the simulation. (HASE was not operational at the time [12] was published; its stated intent is to use a commercially available optimistic, Time Warp based, parallel simulator.) There are two principal differences between LAPSE and the WWT. The first is a matter of purpose: the WWT is a tool for the analysis of cache-coherency protocols and of cache-coherent shared memory systems, whereas a primary goal of LAPSE is to predict the performance and scalability of message-passing application codes, and a second goal is to support evaluation of the communication network itself. The second difference is a matter of lookahead and synchronization. In a cache-coherent shared memory system a processor may need to communicate on any cache miss, or may have its cache affected at any time by a write issued by a different processor; the WWT therefore keeps the simulated processors in close synchrony, synchronizing them at a barrier every B cycles under the assumption that any network communication suffers a latency of at least B cycles. The lookahead exploited by the WWT is thus a fixed lower bound on network latency. By contrast, in a message-passing system the points at which one processor can influence another are explicit: a processor's state is altered only when it executes a receive (or probe) call, and a message cannot have an effect before it is solicited. Our approach exploits this observation. Because the execution path of a message-passing code between communication points is largely independent of the timing of message arrivals, application code can be executed well in advance of the simulation clock, sometimes for long periods of simulated time, and the simulators can avoid synchronizing for correspondingly long periods; where the application alternates between a long computation phase and a communication-intensive phase, the synchronization overhead adapts accordingly. LAPSE's window-based synchronization is a conservative method, a special case of the YAWNS protocol [19, 22]; parallelized network simulators based on optimistic methods have also been proposed.

Finally, the primary contributions of this paper are to show how to effectively combine direct execution and distributed-memory parallel simulation, to describe the implementation of LAPSE, and to demonstrate the feasibility and performance of the approach on non-trivial scientific codes. This combination of features makes LAPSE unique among its peers.

Section 2 gives an overview of how LAPSE transforms a massively parallel code into a direct execution simulation, Section 3 describes the system in more detail, and Section 4 details our synchronization strategy. Section 5 describes our experiments and then their results with respect to validation, slowdown, and speedups, while Section 6 presents our conclusions.

2 An Overview

A parallel program for a distributed memory machine is comprised of N application processes that communicate through message passing, using explicit calls to system message-passing routines; we assume the Intel nx message-passing library. We presently assume an equivalence between application processes and virtual processors (VPs), so that the mapping of processes to the N virtual processors is static; under LAPSE the N application processes are distributed among n ≤ N physical processors.

For example, the csend call sends a message: its arguments are the message type (a user-defined integer), the base address and length of the memory area occupied by the message, and the processor id of the recipient. Control is returned to the application as soon as the memory area is available for reuse. The corresponding receive call, crecv, specifies the message type, the base address of a user buffer, and the maximum message length; if a matching message has already arrived, the system copies it directly into the user's buffer and returns, otherwise the calling process blocks until the message is received. irecv is an asynchronous receive that returns immediately, and a subsequent msgdone or msgwait call determines whether the anticipated message has arrived.

Figure 1(a) illustrates how an application views a message transaction: process 1 executes, sends a message to process 2, and continues executing, while process 2 executes, reaches its receive statement, and, once the message is received, continues. Figure 1(b) illustrates the same transaction as it actually occurs, including operating system and message-passing activity. At time a, control passes from application process 1 to the operating system, which runs for a period (until time b) preparing the message for transmission; process 1 then continues and is not interrupted, so these overheads are invisible to it. When the message comes out of the network at the processor hosting process 2, an interrupt is taken: control is wrested from whatever code was executing, the operating system handles the receipt and moves the message into a system buffer, and an additional overhead is incurred later when process 2 finally reaches its receive call and the message is copied from the system buffer into the user buffer. The durations of these periods of system activity (the intervals between the labeled points a, b, c, d and so on in Figure 1(b)) are determined by the operating system and the network and are invisible to the application processes, which are unaware of them and view time as in Figure 1(a). It falls to the simulator to measure or predict these durations as a function of the execution and message-passing arguments, and to assign virtual times to the application's execution bursts and message-passing events.

[Figure 1: (a) the application view of a message transaction between processes 1 and 2; (b) the application and network view of the same transaction, showing operating system overheads, network transit, and interrupt handling.]

LAPSE conceptually separates the simulation into application processes, application simulator processes, and network simulator processes. An application process is constructed by recompiling the application code with macros that redirect all calls to message-passing routines into corresponding LAPSE interface routines; the set of all such routines is known as the LAPSE interface. The application code executes directly on the Paragon, but every message-passing call it makes is trapped by an interface routine, which communicates with the simulator. For each application process the simulator maintains a data structure called a time-line, essentially the Figure 1(b) view of that process, recording its execution bursts and message-passing events and the virtual time-stamps assigned to them. The simulator is itself a discrete-event simulator: as the simulation progresses, events on the time-lines are modified and assigned times reflecting the assumed message-passing overheads, interrupt handling, and network delays, and pending events are kept on a future-events list. An application process typically resides in the same address space as the application simulator that manages it, although the two are logically distinct. The application simulators also communicate with network simulators, which model the communication network. Figure 2 gives an overview of this organization.

[Figure 2: Overview of LAPSE's organization: application processes, application simulators, and network simulators, and the communication among them.]
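To make the trapping mechanism concrete, the following is a minimal sketch of how an interface routine might intercept a send call and record it on the calling process's time-line. The structure names, fields, and the lapse_csend wrapper are hypothetical illustrations introduced for this sketch, not LAPSE's actual interface; only the nx-style call being wrapped (csend) is taken from the text above.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical time-line event kinds (illustration only). */
    typedef enum { EV_EXEC_BURST, EV_SEND, EV_RECV } event_kind;

    typedef struct event {
        event_kind kind;
        double     vtime;      /* assigned virtual time                  */
        long       msg_type;   /* nx message type (user-defined integer) */
        long       msg_len;    /* message length in bytes                */
        long       dest;       /* destination virtual processor          */
        struct event *next;    /* next event on this process's time-line */
    } event;

    /* One time-line per application process, maintained by the simulator. */
    typedef struct { event *head, *tail; } timeline;

    extern timeline *my_timeline;        /* set up at process start (assumed)    */
    extern double    instr_clock(void);  /* virtual time from instruction counts */

    static void append(timeline *t, event *e) {
        e->next = NULL;
        if (t->tail) t->tail->next = e; else t->head = e;
        t->tail = e;
    }

    /* A macro in the recompiled application redirects csend() here. The wrapper
     * closes the current execution burst, records the send event, and leaves it
     * to the application simulator to assign overheads and network delay. */
    void lapse_csend(long type, char *buf, long len, long node, long ptype) {
        (void)buf; (void)ptype;
        event *e = malloc(sizeof *e);
        if (!e) abort();
        e->kind     = EV_SEND;
        e->vtime    = instr_clock();    /* end of the current execution burst */
        e->msg_type = type;
        e->msg_len  = len;
        e->dest     = node;
        append(my_timeline, e);
        /* The real interface would also deliver the message data and honor
         * csend's buffer-reuse semantics; both are omitted in this sketch. */
    }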

4 Synchronization

Synchronization between simulators is managed through appointments and windows [21], and the simulation proceeds as a sequence of windows of simulated time. Given a window whose lower edge is t, the value w(t) is computed so as to ensure that every message send event whose time-stamp falls in [t, w(t)) already exists on some application time-line; every event whose time-stamp falls in [t, w(t)) can therefore be safely simulated. Once every simulator has reached time t' = w(t), another window [t', w(t')) is chosen, and this process continues until the simulation terminates.

The only interaction between one simulator and another is through the messages the simulated application sends: a send event processed by one simulator may ultimately affect the time-line of a destination managed by another. As send events are simulated, each one whose destination is remote is transformed into an appointment associated with that destination. An appointment is a lower bound on the time at which the associated event can be scheduled at its destination; it is computed from the send's time-stamp plus a lower bound on the network delay, which is governed by the network hardware model. Appointments are passed between simulators along with the simulated messages themselves, and the best known appointment for each remote interaction is updated as the simulation progresses; appointment times either disappear (when the corresponding event is finally scheduled) or strictly increase. Each simulator maintains a halting time h', defined as the least appointment time among the appointments it currently holds; h' is a lower bound on the time of the next event that could be imposed on the simulator from outside, and the simulator is free to execute any event on its future-events list whose time-stamp is strictly less than h'. If a simulator's halting time h' is less than the current window edge w(t), the halting time must eventually increase until it is at least as large as w(t). Because appointments either disappear or strictly increase as the simulation progresses, the halting times, and hence the window edges, eventually advance past any fixed point of simulated time, and the protocol will not deadlock.
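The overall cycle can be pictured as the following sketch of a simulator's main loop, written against hypothetical helper functions (compute of the window edge, exchange of appointments, and so on); it illustrates the window-and-appointment discipline described above rather than LAPSE's actual code.

    #include <stdbool.h>

    /* Hypothetical interfaces assumed for this sketch. */
    double  my_lower_bound(void);            /* lower bound on my future remote sends */
    double  global_min(double x);            /* min-reduction across all simulators   */
    bool    have_event_before(double t);     /* any local event with time < t ?       */
    void    simulate_next_event(void);       /* pop and process one local event       */
    void    send_appointments(double bound); /* advertise lower bounds to peers       */
    void    receive_appointments(void);      /* fold incoming bounds into halting time*/
    bool    simulation_finished(void);

    void simulator_main_loop(void) {
        double t = 0.0;                               /* lower edge of the window */
        while (!simulation_finished()) {
            /* 1. Advertise a lower bound on when this simulator can next affect
             *    any other, and learn the corresponding bounds from peers.      */
            send_appointments(my_lower_bound());
            receive_appointments();

            /* 2. Agree on the upper window edge: no send event with time-stamp
             *    in [t, w) is still unknown to any simulator.                   */
            double w = global_min(my_lower_bound());

            /* 3. Safely simulate every local event in [t, w).                   */
            while (have_event_before(w))
                simulate_next_event();

            /* 4. Advance to the next window.                                    */
            t = w;
        }
    }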

We now elaborate on appointments and windows in two steps. First we discuss how appointments are computed, how they are used, and when they are updated. Secondly we discuss the construction of windows.

4.1 Appointments

LAPSE exploits the tendency of message-passing codes' execution paths to be largely insensitive to timing. Because of this, an application process can be executed well in advance of the simulation clock, building a potentially long list of future events on its VP's time-line; the time-lines serve, in effect, as on-line trace generators for the simulators. In this subsection we describe the two types of appointments LAPSE maintains around these time-lines: appointments from application simulators to network simulators, and appointments between network simulators.

An application appointment is a lower bound on the instant at which a VP will next execute a send whose destination is remote, that is, managed by a different simulator. Consider Figure 3, which illustrates a time-line in a situation where simulation time has advanced to t and we wish to compute such a lower bound. The time-line records execution bursts with lengths A, B, C, and D, separated by local csend and crecv events and ending at a remote csend. The appointment is constructed from the residual of the current execution burst, the lengths of the intervening bursts, and lower bounds on the startup costs (including the cost of copying) of the intervening message-passing calls:

    appointment = (A - (t - s)) + B + C + D + 2(crecv startup) + 2(csend startup),

where s is the time at which the current burst began. Any communication not yet appearing on the time-line can only add to this quantity, so the appointment is indeed a lower bound on the time of the next remote send.

[Figure 3: An application time-line used to compute an application appointment. Execution bursts A, B, C, and D are separated by local csend and crecv events and followed by a remote csend; at time t the appointment is (A - (t - s)) + B + C + D + 2(crecv startup) + 2(csend startup).]

Appointments on a VP's time-line may change, and must be updated, in two cases. First, a VP may become suspended, for example when an erecv fails to find the anticipated message; the VP remains suspended until the message is completely received, and the duration of the suspension is not known when the appointment is computed, so the appointment for a suspended VP includes a lower bound on the time at which it will be released but not the additional, unknown blocking. Second, when the VP is finally released from suspension, say at time t', we compute the difference d = t' - t between the actual release time and the time assumed when the appointments were computed, and all of the VP's outgoing appointments must be increased by d units; a "VP update" event is scheduled for this purpose, and its cost is included in the simulation. It may also happen that an execution burst is interrupted by the arrival of a message (an interrupt event is executed shortly after the message appears), in which case the lower bound on the affected burst, and hence the appointments, may increase. As simulation time advances, appointments are periodically updated in these ways.

A network appointment, from network simulator i to network simulator j for a message m, is a promise that the event reporting m's arrival at j will not be scheduled at j before the appointment time. Every message sent from one network simulator to another carries a new or updated appointment, and the receiving simulator folds it into its halting time. LAPSE currently uses a contention-free network model in which the transmission time of a message depends on its volume and on the distance it travels; because the model is contention-free, once a message's appointment has been reported no further updating of that network appointment is needed. (Sending an appointment with every message is overkill for this model, and we are developing a version that very much reduces the volume of appointment traffic.) Given a more sophisticated network model that captures contention, further updating of network appointments would be required.
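As an illustration of the appointment calculation, the sketch below walks a hypothetical time-line representation and accumulates the residual burst, the intervening burst lengths, and per-call startup lower bounds until it reaches a remote send, mirroring the formula above. The event kinds and cost constants are assumptions made for the example, not LAPSE definitions.

    /* Hypothetical per-event record on a VP's time-line (illustration only). */
    typedef enum { BURST, LOCAL_CSEND, LOCAL_CRECV, REMOTE_CSEND } ev_kind;

    typedef struct tl_event {
        ev_kind kind;
        double  length;              /* burst length, in simulated time units */
        struct tl_event *next;
    } tl_event;

    /* Assumed lower bounds on call startup costs (copying, bookkeeping, ...). */
    static const double CSEND_STARTUP_LB = 1.0e-5;
    static const double CRECV_STARTUP_LB = 1.0e-5;

    /* Lower bound on the time of the next remote send, given that simulation
     * time is now t and the current burst (head of the list) began at s.     */
    double application_appointment(const tl_event *head, double t, double s) {
        double bound = 0.0;
        const tl_event *e = head;

        /* Residual of the current execution burst: A - (t - s). */
        if (e && e->kind == BURST) {
            bound += e->length - (t - s);
            e = e->next;
        }
        /* Add intervening bursts and startup lower bounds until a remote send. */
        for (; e; e = e->next) {
            switch (e->kind) {
            case BURST:        bound += e->length;        break;
            case LOCAL_CSEND:  bound += CSEND_STARTUP_LB; break;
            case LOCAL_CRECV:  bound += CRECV_STARTUP_LB; break;
            case REMOTE_CSEND: bound += CSEND_STARTUP_LB; /* the send's own startup */
                               return t + bound;          /* earliest it can occur  */
            }
        }
        /* No remote send on the time-line yet: return the bound over all known events. */
        return t + bound;
    }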

4.2 Window Construction

Each simulator maintains a single "halting time" covering both the application events and the network events it manages; the halting time is the least of the appointment times currently known to it, and it increases as appointments are updated. Any event with a time-stamp less than the current halting time may be safely executed. Eventually a simulator's time-lines are advanced as far as the halting time permits, and the simulator must wait for the construction of the next window before proceeding.

The construction of a window depends on the state of the application processes, and for this purpose we distinguish three states a VP may be in. First, a VP may be runnable. Second, a VP may be blocked at a temporally sensitive call, such as crecv: it executed a call to an interface routine whose response depends on the state of the simulation (for example, on when an incoming message is received), it will not become unblocked until the simulator notifies it of the change, and the time at which it changes from blocked to unblocked is unknown ahead of the window computation. Third, a VP may be blocked temporarily for purposes of flow-control: to amortize the cost of computing windows and to limit how far into the future an application process runs, a process is made to wait once it has placed a user-specified number F of send events on its time-line in advance of the simulator (the effect of F is examined in Section 5). The window edge w(t) is thus a measure of how far into the future the application and network simulators may safely advance.

Towards this end, each simulator computes a 3-tuple that captures the data needed for the window computation, and all simulators combine their tuples with a global, component-wise minimum reduction. The first element, L, is a lower bound on the time-stamp of the last time-line event of an unsuspended VP. The second element, S, is a lower bound on the time-stamp of the last time-line event of a suspended VP, computed under the assumption that it will be released from suspension immediately. The third component, R, is a lower bound on the time at which a suspended VP will be released; R is itself a minimum of three values, including the minimum appointment time of incoming network messages and the minimum time at which local sends already on time-lines could release a blocked receiver. Given the global minima Lmin, Smin, and Rmin, each simulator computes

    w(t) = B + min{Lmin, Rmin + Smin},

where B is a lower bound on the startup cost of a send event. By construction, any remote send event whose time-stamp falls in [t, w(t)) is already resident on some VP's time-line at the point the window is constructed; this property is the key to the correctness of the window, for it ensures that no send event in [t, w(t)) remains unknown to the simulator that must eventually schedule it, and hence that the simulators will not deadlock.

The effect of temporally sensitive calls on window construction deserves a remark. A temporally sensitive call will always be the last event on its VP's time-line, since the VP is blocked and nothing can follow it. In extreme cases, where most calls are temporally sensitive, the difference between w(t) and t will be small and the net effect is to serialize the simulation through many small windows. Where temporally sensitive events tend not to occur simultaneously on many VPs, larger windows are obtained.
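A compact way to picture the reduction is sketched below: each simulator contributes its local (L, S, R) values, a component-wise global minimum is taken, and every simulator evaluates the same window-edge formula. The reduce helper and the way L, S, and R are gathered are assumptions made for the illustration.

    #include <math.h>

    typedef struct { double L, S, R; } bounds3;

    /* Assumed: a component-wise min-reduction across all simulators,
     * e.g. implemented with the machine's global reduction primitive. */
    bounds3 global_min3(bounds3 local);

    /* B: lower bound on the startup cost of a send event (machine dependent). */
    double window_upper_edge(bounds3 local, double B) {
        bounds3 g = global_min3(local);
        /* w(t) = B + min{ Lmin, Rmin + Smin } */
        return B + fmin(g.L, g.R + g.S);
    }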

5 Experiments

In this section, we report on experiments with LAPSE. After describing the set of four scientific and engineering applications we used for experimentation, we first address the issue of validation, i.e., quantifying the accuracy of LAPSE timing predictions. We then quantify LAPSE overheads; these are captured by the slowdown, which is defined to be the time it takes LAPSE to execute an application with N virtual processors on N physical processors, divided by the time it takes the application to execute natively on N physical processors. We next characterize speedup. The relative speedup of LAPSE running on n processors is defined to be the time it takes LAPSE to execute the application (with N virtual processors) on one physical processor, divided by the time it takes LAPSE to execute it on n physical processors. We believe this relative speedup is representative of the absolute speedup that would be obtained relative to a hypothetical optimized serial simulator. The justification for this belief is that essentially the only overhead LAPSE does not avoid is maintenance of the appointment data structures, which would have to be done in some form on any serial simulator as well, and this overhead is small compared with the cost of executing the application code itself. In fact, for one of our applications (SOR with the high computation to communication ratio), LAPSE running on four processors is only 1.8 times slower than the native application on four processors. We will see that the slowdowns and speedups are strongly affected by the amount of lookahead in the application, as well as by the application's computation to communication ratio; in this context, application lookahead refers to how far in advance of the simulator the application is able (or allowed) to run.
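For reference, the two measures just defined can be summarized as follows; T denotes wall-clock execution time, and this notation is introduced here only to restate the definitions above.

    \[
      \mathrm{slowdown} \;=\; \frac{T_{\mathrm{LAPSE}}(N\ \mathrm{VPs\ on}\ N\ \mathrm{nodes})}{T_{\mathrm{native}}(N\ \mathrm{nodes})},
      \qquad
      \mathrm{relative\ speedup}(n) \;=\; \frac{T_{\mathrm{LAPSE}}(N\ \mathrm{VPs\ on\ 1\ node})}{T_{\mathrm{LAPSE}}(N\ \mathrm{VPs\ on}\ n\ \mathrm{nodes})}.
    \]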

5.1 Applications

We experimented with a set of four parallel scientific and engineering applications. The applications are representative of a variety of scientific computing and graphics workloads, although no claim is made that these four codes, or the specific parameter settings used, span all possible parallel workloads.

The first application, SOR, solves a sparse system of K^2 linear equations using successive over relaxation (see Chapter 17 in [23]). The system results from discretizing a partial differential equation (PDE) in two dimensions; equations of this type arise in physics, computational fluid dynamics, and engineering, and in our test case the equation is Poisson's equation on a square. The K x K computational grid is partitioned among the N processors, each processor being assigned a G x G sub-grid, so that G = K/sqrt(N). At each iteration the solution x(i,j) at an interior grid point (i,j) is updated using the values x(i-1,j), x(i+1,j), x(i,j-1), and x(i,j+1), with an odd-even ordering of the updates. Each iteration therefore requires a boundary exchange in a NEWS (North East West South) pattern: each processor exchanges the boundary values of its sub-grid with its neighbors, each pairwise exchange being of order G bytes, while the computation per processor is of order G^2. Thus by varying G we can adjust the ratio of computation to communication; we call the ratio "very low" when G = 25, "low" when G = 50, and "high" when G = 250. The message passing is timing independent, the csend and crecv calls are all synchronous, and the application thereby has good lookahead.

The second application, BPS, is a domain decomposition algorithm of the Bramble-Pasciak-Schatz substructuring type [14]; problems of this type arise frequently in computational fluid dynamics. The objective is to arrive at sufficiently accurate solutions of the sparse linear systems obtained from finite difference or finite element discretizations of elliptic partial differential equations [1]; the specific test case is Poisson's equation on a unit square with Dirichlet boundary conditions. The domain of the PDE is partitioned into nonoverlapping subdomains separated by a "wire basket" consisting of edges and vertices. A preconditioned conjugate gradient method is applied; the preconditioner, an approximation to the inverse of the operator, requires at each iteration the solution of independent problems on the subdomain interiors, problems defined on the edges of the wire basket, and a small global problem defined on the vertices. Once boundary values for these edges and vertices are specified, the subdomain problems may be solved independently and in parallel; the global solution on each subdomain interior is obtained in the final iteration. Each iteration requires the formation of matrix-vector products and inner products, which involve nearest-neighbor communication and global reductions, and the algorithm is highly parallelizable. Advantage is taken of the existence of fast Poisson solvers for the subdomain interiors. The code uses synchronous calls such as csend, cprobe, and crecv, and global reductions, and thus exhibits good application lookahead. The code runs on numbers of processors that are powers of 4, and two levels of computation to communication ratio were obtained by varying the size of the G x G subdomain assigned to each processor: the "high" ratio has G = 256 while the "low" ratio has G = 64. The code, written in a combination of C and Fortran, was provided to us by David Keyes of ICASE and Old Dominion University.
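As an illustration of the synchronous exchange pattern used by SOR (and, on the wire basket, by BPS), the sketch below shows one NEWS boundary exchange written against nx-style csend/crecv calls. The prototypes shown and the neighbor and message-typing conventions are assumptions made for this example, not a statement of the exact nx interface; the boundary packing and unpacking are left as placeholders.

    #include <string.h>

    /* Assumed nx-style prototypes (see lead-in above). */
    void csend(long type, char *buf, long len, long node, long ptype);
    void crecv(long type, char *buf, long len);

    #define G 250                      /* "high" computation/communication ratio */
    enum { NORTH, EAST, WEST, SOUTH }; /* message types, one per direction       */

    void news_exchange(double grid[G][G], const long nbr[4], long ptype) {
        double out[G], in[G];

        for (int dir = NORTH; dir <= SOUTH; dir++) {
            if (nbr[dir] < 0) continue;           /* no neighbor on this side    */

            /* Pack the boundary row/column facing this neighbor (placeholder). */
            memcpy(out, grid[0], sizeof out);

            /* csend returns once the outgoing buffer may be reused, so the same
             * send-then-receive order can be used on every node; both calls are
             * synchronous, matching the pattern that gives SOR good lookahead.  */
            csend(dir, (char *)out, sizeof out, nbr[dir], ptype);
            crecv(dir, (char *)in,  sizeof in);

            /* Unpack 'in' into the ghost cells for this side (placeholder).     */
        }
    }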

The third application, APUCS, is a parallel discrete event simulation of a continuous time Markov chain [6]. The simulated model is the queueing network model of a large, irregular distributed computing system described in [18], and the simulation uses the parallel algorithm also described in [18]; the vertices of the model are distributed among the processors. The communication consists of any-to-any pairwise message passing (csend, irecv, and msgwait calls) involving highly irregular patterns of small messages, as well as global reductions. The parameter settings were those of "Set I" of [18], with the "high" and "low" computation to communication ratios achieved by setting mu_1 = 1 and mu_1 = 64. This code was provided to us by Ion Stoica.

The fourth application, PGL, is a parallel graphics library for visualization of scientific data. This code, consisting of approximately 14,000 lines of C, was written by Thomas Crockett of ICASE, and the provided driver program renders and displays a scene of randomly placed, colored triangles for each of F frames, rotating the scene between frames. Scan lines of the image are distributed in an interleaved fashion over the processors. For each scan line of each triangle, the pixels between the endpoints of the scan line's intersection with the triangle, called a span, are shaded by interpolating between the endpoint colors; the spans are then sent to the processor responsible for that scan line, where a Z-buffer algorithm is used for hidden surface removal. This code uses any-to-any, asynchronous communications and frequently calls msgdone, which is a temporally sensitive call; the application therefore has very little lookahead, and it is harder for the simulator to run the application far ahead. The "low" and "high" computation to communication ratios for this application were achieved by assigning two different numbers of triangles to each processor, approximately 500 and 1000 triangles per processor.

5.2 Validation

In this section we describe the process by which we validated LAPSE predictions, and the results of our validation experiments. In order to make LAPSE predictions we must know the overheads of the system message-passing calls (e.g., csend) as well as the network transit time of each message. We obtained estimates of such overheads by measuring the time the system spends in message-passing calls, using small test programs written specifically to determine these overheads. Overheads for calls that depend on the message length were modeled as a + b x L, where L is the message length and a and b are constants estimated by regression; network transit times were estimated similarly. These estimates are reasonable for a range of message lengths, but the estimated overheads and transit times can be expected to be less accurate for very long messages, and LAPSE does not currently model interconnection network link contention, so some prediction error may be present for communication-intensive runs.

To obtain timing predictions we proceeded as follows. LAPSE counts the number of user (application) instructions I executed between message-passing calls. Separate native runs of the application (i.e., without LAPSE) on a small number of nodes, together with standard Unix timing, give the total user time U consumed; the ratio c = U/I is a "conversion factor" giving the average time per user instruction. The conversion factor incorporates, in an average sense, the effects of the instruction mix, the cache hit ratio, and the data access patterns of the application. LAPSE then predicts the time the application spends computing between two message-passing calls as J x c, where J is the number of application instructions counted between the calls. For a given application and a given value of G (or, more generally, of the data size per processor), the conversion factor is computed on a small number of nodes (typically 2 x 2) and used to predict execution on a large number of nodes; for example, in SOR with a fixed value of G we expect the conversion factor measured on a 2 x 2 grid of processors to be approximately the same as on an 8 x 8 grid of processors, since the per-node data size and instruction mix are the same. If the data size per node changes, a different conversion factor may be required, since the cache behavior may be different.

Each application was then run both natively and under LAPSE, and the timings compared. Table 2 presents the percentage error of the LAPSE-predicted execution times relative to the observed native execution times, for each application, computation to communication ratio, and number of processors. As can be seen in the table, the errors are small, at most a few percent. The prediction accuracy reflects the fraction of time spent in user computation: for example, SOR on 16 processors spends approximately 90% of its time executing user instructions when G = 250, as opposed to executing system calls or waiting for messages, while it spends only 47% of its time executing user instructions when G = 25. Somewhat larger errors have been observed for very short runs, in which initialization effects are relatively more important; the maximum error we have observed is -12%, which occurred for a very short run with G = 25. The +3% error reported for PGL-High on 64 nodes in Table 2 is for a run of 20 frames; for a run of 10 frames the error on 64 nodes is 6%. These results show that LAPSE can provide accurate predictions of execution times for realistic scientific applications.
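The prediction recipe above amounts to a small amount of bookkeeping, restated as code in the sketch below. The function and variable names are ours, introduced only for the illustration; the constants a and b would come from the regression described above, not from values shown here.

    /* Predicting the time of a computation burst and of a message-passing call
     * from calibration measurements (illustrative sketch). */

    /* Calibration on a small number of nodes:
     *   c = U / I, where U = total user time, I = user instructions counted. */
    double conversion_factor(double user_time_U, double instructions_I) {
        return user_time_U / instructions_I;
    }

    /* Predicted duration of a computation burst of J application instructions. */
    double predict_burst(double J, double c) {
        return J * c;
    }

    /* Message-passing overhead modeled as a + b * L (L = message length in
     * bytes); a and b are obtained by regression against timing measurements. */
    double predict_msg_overhead(double a, double b, double L) {
        return a + b * L;
    }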

5.3 Slowdowns

We next investigated LAPSE slowdowns, i.e., the overheads of running an application under LAPSE rather than natively. There are several types of overhead involved. First is the overhead of instruction counting: the application is recompiled so that it counts the instructions it executes, and this counting slows the application itself. Direct measurements, as well as results reported in [8], indicate that the counting overhead typically ranges from 30% to 80% of the time it takes to natively execute the application, depending on the instruction mix. Second, there is operating system overhead: each node hosts the application processes assigned to it as well as a simulator process, so there may be multiple processes per node, and context switching between them and operating system process scheduling are not free. Third, there is the overhead of the simulation itself: every message-passing event executed by the application has an associated event on a time-line that must be processed by the simulator, appointments must be computed and updated, and windows must be constructed. All of these overheads are captured by the slowdown, defined as the time it takes LAPSE to execute the application on N nodes divided by the time it takes the application to execute natively on N nodes. It is harder to separate the individual overheads from one another, and we do not attempt to do so here.

We begin with an experiment that demonstrates the effect that application lookahead has on simulation speed. As explained earlier, an application process is permitted to run in advance of the simulator, but only in a controlled manner: the parameter F determines the maximum number of messages a process may send in advance of the simulator. By running an application with good lookahead and varying F, we can study the effect of lookahead directly. Table 3 presents the slowdown of APUCS on 8 processors (one VP per processor) as a function of F, together with A, the average number of application events processed per window per VP. For the low computation to communication ratio, the slowdown decreases from 30.4 down to 9.3 as F increases from 2 to 16, while A increases from only 0.3 to 8.0; increasing F past 16 increases the slowdown again (to 12.1 at F = 40), in part an effect of the flow-control and window-management work associated with allowing the application to run far ahead. For the high ratio, the slowdown decreases from 12.0 to 3.7 as F increases from 2 to 16, with A increasing from 0.3 to 4.6. An increase in slowdown for large F is not always observed, however; see APUCS-High, where the slowdown levels off at 3.5 and A also levels off as F increases. The reason for this leveling off is not entirely clear; there are a number of interacting factors, which include operating system process scheduling and the interaction of the application and simulator processes on the same node. This experiment demonstrates that by executing the application well in advance of the simulator we obtain larger windows, more events processed per window, and better amortization of the window-construction cost, and hence faster simulation, up to a point.
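A minimal sketch of the run-ahead throttle just described is given below: the application-side wrapper stops the process once it is F sends ahead of the simulator and resumes it when the simulator catches up. The counter names and the blocking primitive are assumptions made for the illustration, not LAPSE's mechanism in detail; in LAPSE the corresponding state would be shared between the application and simulator contexts.

    /* Throttling application run-ahead with a lookahead parameter F (sketch). */

    static long sends_issued;     /* sends the application has placed on its time-line */
    static long sends_simulated;  /* sends the simulator has actually processed        */
    static long F = 16;           /* user-specified lookahead parameter                 */

    extern void block_until_woken(void);   /* assumed blocking primitive       */
    extern void wake_application(void);    /* called from the simulator side   */

    /* Application side: called from the send wrapper before recording a send. */
    void flow_control_before_send(void) {
        while (sends_issued - sends_simulated >= F)
            block_until_woken();           /* wait for the simulator to catch up */
        sends_issued++;
    }

    /* Simulator side: called after a send event from this process is simulated. */
    void flow_control_after_simulated_send(void) {
        sends_simulated++;
        wake_application();                /* the process may now run further ahead */
    }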

We next consider LAPSE slowdowns relative to native application runs as a function of the number of processors, for all four applications. For these runs there are two processes per node, the application process and a simulator process, and the lookahead parameter F was set at moderate values. The results are reported in Table 4. Observe first that, for a given application and computation to communication ratio, the slowdown generally increases as the number of processors increases. This occurs for a number of reasons: as the number of processors increases, the application's own computation to communication ratio decreases, so proportionally more message-passing events must be simulated; the windows are constructed using a global reduction whose cost increases (logarithmically) with the number of processors; and the likelihood that some process is blocked at a temporally sensitive call within a given window increases, which decreases the average amount of work available per window. The slowdowns for SOR, BPS, and APUCS, which all have good lookahead, are modest (between 1.8 and 28.0), and slowdowns reported for other execution-driven simulators are typically higher [5, 24]. For PGL the slowdowns are significantly higher, reaching 139 on 64 nodes, because PGL frequently calls msgdone, a temporally sensitive call; whenever a process is blocked at such a call the simulator cannot advance it past the upper edge of the current window, so the simulation is forced through a large number of small windows, severely limiting the work done per window. This effect can be counteracted, to some extent, by placing multiple VPs on each node and separating the application processing from the simulation processing, since with multiple VPs per node more application events are available per window; for example, such a reconfiguration, using a separate set of nodes for the simulation, reduced the 64-node PGL slowdown from 139 to 117.

We next measure the relative speedups of LAPSE itself: the time it takes LAPSE to simulate N virtual processors on one node divided by the time it takes on n nodes, for various combinations of N and n with N >= n. Again, F was set at moderate values for these experiments. Because each node must hold the state of all the virtual processors assigned to it, the amount of physical memory per node limits the size of problem that can be run on a small number of nodes, and memory constraints are in fact the main limit on these experiments. We ran two sets of experiments: a small problem with 8 virtual processors, run on 1, 2, 4, and 8 nodes, and a larger problem with 64 virtual processors, run on 8, 16, 32, and 64 nodes. (Remember that BPS runs on numbers of processors that are powers of 4, so for BPS the small problem has 4 VPs run on 1 to 4 nodes.)

    Application   Comp./Comm. Ratio      Number of Processors
                                         4       16      32      64
    SOR           Very Low              -4%     +3%     +2%     +1%
    SOR           Low                    0%     +4%     +4%     +4%
    SOR           High                  +1%     +1%     +1%     +2%
    BPS           Low                   -6%     +1%      -      -4%
    BPS           High                  -1%     -2%      -      -3%
    APUCS         Low                   -1%     -2%     -3%     -1%
    APUCS         High                  -2%     -2%     -1%      0%
    PGL           Low                   +1%     -2%     +1%     +5%
    PGL           High                  -1%     -2%     -1%     +3%

    Table 2: LAPSE prediction errors: percentage difference between LAPSE-predicted and native execution times.

Table 5 shows the relative speedups for the small problem, and Table 6 shows the relative speedups for the 64 VP problem. For the small problem, the speedups on 8 nodes for the applications with good lookahead are between 3.5 and 4.7 (relative efficiencies between 0.43 and 0.60), the BPS speedups on 4 nodes are between 2.4 and 3.7 (relative efficiencies between 0.60 and 0.92), and the maximum PGL speedup is 1.7. For the 64 VP problem the speedups increase monotonically with the number of nodes; on 64 nodes the speedups for the applications with good lookahead are between 3.9 and 7.0 (relative to execution on 8 nodes with 8 VPs per node), while the maximum PGL speedup is 2.5. (For BPS-High, LAPSE could run at most 4 VPs per node, so those times are stated relative to 16 nodes.)

We were also able to run a 512 VP case of SOR-Very Low on 64 nodes. Since we did not have access to 512 nodes, an actual speedup for this run could not be computed; note that each LAPSE node held 8 VPs and hence had at least 8 times as much work to do as a node of the 512-processor system being simulated.

    Comp./Comm.   Maximum Application   Slowdown   Avg. Application Events
    Ratio         Lookahead (F)                    per Window per VP
    Low             2                    30.4        0.3
    Low             4                    14.2        1.1
    Low             8                    10.2        3.3
    Low            16                     9.3        8.0
    Low            24                     9.9       12.8
    Low            32                    10.9       17.5
    Low            40                    12.1       21.6
    High            2                    12.0        0.3
    High            4                     6.1        1.3
    High            8                     4.3        2.8
    High           16                     3.7        4.6
    High           24                     3.5        5.5
    High           32                     3.5        6.0
    High           40                     3.5        6.2

    Table 3: LAPSE slowdowns on APUCS as a function of application lookahead. LAPSE execution time divided by native execution time using 8 processors.

    Application   Comp./Comm. Ratio      Number of Processors
                                         4       16      32      64
    SOR           Very Low              17.5    28.4    24.0    28.0
    SOR           Low                    6.5    11.5    13.2    16.7
    SOR           High                   1.8     4.0     4.1     6.1
    BPS           Low                    3.0     7.5      -     11.9
    BPS           High                   1.9     2.4      -      3.0
    APUCS         Low                    7.5    11.1    13.5    18.8
    APUCS         High                   3.0     5.9     7.6    12.9
    PGL           Low                   15.1    61.6    99.2    148
    PGL           High                  17.4    70.2    116     139

    Table 4: LAPSE slowdowns: LAPSE execution time divided by native execution time using the same number of processors.

    Application   Comp./Comm.            Number of Processors
                  Ratio              1         2         4         8
    SOR           Very Low          1.0       1.8       3.1       4.7
    SOR           Low               1.0       1.8       3.0       4.5
    SOR           High              1.0       2.0       3.4       4.6
    BPS           Low               1.0       1.7       2.4        -
    BPS           High              1.0       2.0       3.7        -
    APUCS         Low               1.0       1.7       2.5       3.8
    APUCS         High              1.0       1.6       2.4       3.5
    PGL           Low               1.0       1.5       1.6       1.7
    PGL           High              1.0       1.4       1.4       1.6

    Table 5: LAPSE relative speedups on an 8 VP problem (on a 4 VP problem for
    BPS). Times are relative to 1 processor; the number of VPs per processor is
    8, 4, 2, and 1 (4, 2, and 1 for BPS).

    Application   Comp./Comm.            Number of Processors
                  Ratio              8        16        32        64
    SOR           Very Low          1.0       2.0       3.5       5.9
    SOR           Low               1.0       2.0       3.5       6.2
    SOR           High              1.0       1.7       4.2       7.0
    BPS           Low               1.0       1.9       2.9       4.4
    BPS           High               -        1.0       1.7       2.7
    APUCS         Low               1.0       1.9       3.0       4.1
    APUCS         High              1.0       1.8       2.8       3.9
    PGL           Low               1.0       1.9       2.2       2.5
    PGL           High              1.0       1.8       2.1       2.3

    Table 6: LAPSE relative speedups on a 64 VP problem. The number of VPs per
    processor is 8, 4, 2, and 1. Times are relative to 8 processors with 8 VPs
    per processor. (For BPS High, times are relative to 16 processors with 4
    virtual processors per processor.)

6   Conclusions

This paper describes a tool, LAPSE, that supports parallelized direct-execution simulation of message-passing parallel programs on a message-passing multicomputer, the Intel Paragon. Using LAPSE, performance predictions for a run on a large number of processors can be made by executing the application, together with a simulator of the message-passing system, on a smaller number of processors. The simulation uses a conservative synchronization protocol that exploits the lookahead present in the application's communication behavior, and the timing instrumentation that LAPSE requires is inserted into the application code automatically. LAPSE was validated on four scientific computing codes: the predictions were within 10% of the actual execution times, and typically within 5%. For applications with good lookahead the slowdowns are modest, and good speedups are obtained when the simulation itself is run on an increasing number of processors. The slowdowns can, however, be quite high for applications, such as PGL, whose execution paths are sensitive to message timing; the amount of slowdown depends on several factors, most importantly the application's lookahead and its computation-to-communication ratio.

Two kinds of application calls require special care because their results are temporally sensitive. Consider first an application that calls clock routines. The current approach is to return a clock value immediately and to block the application only when that value is actually used in a way that can affect its execution path, and only if the simulation time has not yet advanced beyond the virtual time at which the clock was read. This should not pose a problem for most applications, since clock values are typically read, stored, and not used until much later (for example, to compute elapsed-time intervals), by which point the simulation has usually caught up. It would, however, be costly for applications that make frequent and immediate use of clock values; for such applications, more elaborate data structures and algorithms can be designed to reduce the number of times the application must block.
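One way to implement the blocking rule just described is to hand clock values back to the application immediately, tagged with whether the simulation clock has already passed them, and to block only when an unresolved value is actually consumed. The sketch below is ours, written under that assumption; the names (lapse_clock, use_clock_value, block_until) are hypothetical and are not part of LAPSE's interface.

    #include <stdio.h>

    /* Sketch (ours, not LAPSE's code) of deferred handling of clock reads.
     * A read returns the VP's virtual time at once, marked "unresolved" if
     * the parallel simulation has not yet advanced that far; the application
     * is blocked only when an unresolved value is about to be used.         */

    static double vp_virtual_time = 120.0;   /* VP's own virtual time (stub)  */
    static double sim_clock       =  50.0;   /* global simulation time (stub) */

    static void block_until(double t)        /* stub: let the simulation
                                                catch up to time t            */
    {
        if (sim_clock < t) sim_clock = t;
    }

    typedef struct {
        double value;      /* virtual time handed back to the application     */
        int    resolved;   /* nonzero once the simulation clock has passed it */
    } clock_value;

    static clock_value lapse_clock(void)
    {
        clock_value c;
        c.value    = vp_virtual_time;
        c.resolved = (sim_clock >= vp_virtual_time);
        return c;          /* no blocking at the point of the read            */
    }

    static double use_clock_value(clock_value *c)
    {
        if (!c->resolved) {          /* value is about to affect behavior     */
            block_until(c->value);   /* wait until it can no longer change    */
            c->resolved = 1;
        }
        return c->value;
    }

    int main(void)
    {
        clock_value t0 = lapse_clock();   /* cheap if the value is only stored */
        printf("value used later: %.1f\n", use_clock_value(&t0));
        return 0;
    }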

The second type of temporally sensitive call is a query probe, for example one that asks "is there a message of a certain type here now?". The protocol currently blocks the calling process until the simulation clock has advanced to the virtual time at which the probe was made. This is correct, but it is too conservative in certain cases. Suppose a probe call is made at virtual time 100, and the simulation has advanced only to time 50 at the time of the probe. If the probed message is already there at time 50, then, since messages in LAPSE cannot be canceled, it is guaranteed to be there at time 100; the simulator could therefore accept the probe and answer it immediately, thereby unblocking the application, and block the application only when the answer is as yet unknown. Supporting this optimization would require the simulator to keep track of the status of the messages destined for each application process.
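The probe optimization sketched above amounts to a simple rule: if a matching message is already present when the probe is issued, and messages are never canceled, the probe can be answered "yes" immediately even though the simulation clock still lags the probe's virtual time; only an absent message forces the conservative wait. The code below is our illustration of that rule, with hypothetical names (probe, block_until) and a stubbed message queue; it is not LAPSE's implementation.

    #include <stdio.h>

    /* Illustration (ours) of answering a message probe early.  A probe issued
     * at virtual time t_probe asks "is a message of type T here now?".  If a
     * matching message is already present while the simulation clock is still
     * at an earlier time, the answer is guaranteed to be "yes" at t_probe as
     * well, because messages are never canceled.                             */

    static double sim_clock = 50.0;                               /* stub      */
    static int    type_present[8] = { 0, 1, 0, 0, 0, 0, 0, 0 };  /* stub queue */

    static void block_until(double t)            /* stub: simulation catches up */
    {
        if (sim_clock < t) sim_clock = t;
    }

    static int probe(int type, double t_probe)
    {
        if (type_present[type])
            return 1;          /* present now and cannot be canceled, so it
                                  will still be there at t_probe: no blocking */
        block_until(t_probe);  /* conservative case: the answer is as yet
                                  unknown, so wait, then look again           */
        return type_present[type];
    }

    int main(void)
    {
        printf("probe(type 1, t=100) = %d\n", probe(1, 100.0)); /* answered early */
        printf("probe(type 3, t=100) = %d\n", probe(3, 100.0)); /* had to block   */
        return 0;
    }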

There are a number of extensions to LAPSE that appear worth pursuing. First, our model of the Paragon's communication network is currently simple: it is a pure software model that provides timing estimates for message passing and for the way the Paragon's operating system manages messages, flow control, and buffers, but it does not simulate the hardware interconnection network itself. A more sophisticated (and more expensive) model, for example one that performs a packet-by-packet simulation of the Paragon's mesh network, could be incorporated into LAPSE; whether the added detail is worth its cost is a question raised by our experience with the PGL code.
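For reference, a pure software timing model of the kind described above often reduces to charging each message a fixed start-up cost plus a per-byte cost, whereas a packet-by-packet model would simulate individual packets traversing the mesh. The linear form and the constants below are illustrative assumptions only, not LAPSE's calibrated Paragon model.

    #include <stdio.h>

    /* Simple linear message-timing model: cost = start-up latency + bytes /
     * bandwidth.  The constants are illustrative, not Paragon measurements. */

    #define STARTUP_SEC    100.0e-6     /* assumed per-message start-up cost */
    #define BYTES_PER_SEC   50.0e6      /* assumed sustained bandwidth       */

    static double message_time(long bytes)
    {
        return STARTUP_SEC + (double)bytes / BYTES_PER_SEC;
    }

    int main(void)
    {
        printf("8 KB message: %.0f microseconds\n", 1.0e6 * message_time(8192));
        return 0;
    }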

Second, we plan to port LAPSE to run on platforms other than the Paragon, especially networks of workstations. For such a port the timing model would need to be changed, but much of the current functionality could be reused; we plan to work in conjunction with the nx-lib package [15], which provides the Paragon's message-passing functionality on networks of Sun workstations. We are also investigating how to augment LAPSE with support for the newly emerging MPI (Message Passing Interface) standard [17]. Finally, we are investigating how more advanced synchronization algorithms and lookahead calculations may be used to accelerate the simulation.

Acknowledgments

We are grateful to Jeff Earickson of Intel Supercomputers for his generous assistance in helping us to better understand the Intel Paragon system.

References

[1] J.H. Bramble, J.E. Pasciak, and A.H. Schatz. The construction of preconditioners for elliptic problems by substructuring, I. Math. Comp., 47:103-134, 1986.

[2] E.A. Brewer, C.N. Dellarocas, A. Colbrook, and W.E. Weihl. PROTEUS: A high-performance parallel-architecture simulator. Technical Report MIT/LCS/TR-516, Massachusetts Institute of Technology, September 1991.

[3] D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239-248, May 1990.

[4] R. Covington, S. Dwarkadas, J. Jump, S. Madala, and J. Sinclair. Efficient simulation of parallel computer systems. International Journal in Computer Simulation, 1(1):31-58, June 1991.

[5] R.C. Covington, S. Madala, V. Mehta, J.R. Jump, and J.B. Sinclair. The Rice Parallel Processing Testbed. In Proceedings of the 1988 SIGMETRICS Conference, pages 4-11, May 1988.

[6] T.W. Crockett and T. Orloff. A parallel rendering algorithm for MIMD architectures. Technical Report 91-3, ICASE, NASA Langley Research Center, Hampton, Virginia, 1991.

[7] H. Davis, S. Goldschmidt, and J. Hennessy. Multiprocessor simulation and tracing using Tango. In Proceedings of the 1991 International Conference on Parallel Processing, pages II-99-II-107, August 1991.

[8] P.M. Dickens, P. Heidelberger, and D.M. Nicol. A distributed memory LAPSE: Parallel simulation of message-passing programs. In Proceedings of the 8th Workshop on Parallel and Distributed Simulation (PADS), Edinburgh, Scotland. Society for Computer Simulation, 1994. To appear.

[9] R.M. Fujimoto. Simon: A simulator of multicomputer networks. Technical Report UCB/CSD 83/137, ERL, University of California, Berkeley, October 1983.

[10] R.M. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30-53, 1990.

[11] R.M. Fujimoto and W.B. Campbell. Efficient instruction level simulation of computers. Transactions of the Society for Computer Simulation, 5(2):109-124, April 1988.

[12] F.W. Howell, R. Williams, and R.N. Ibbett. Hierarchical architecture design and simulation environment. In MASCOTS '94: Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 363-370, Durham, North Carolina, January 1994.

[13] Intel Corporation. Paragon User's Guide. Order Number 312489-002, October 1993.

[14] D.E. Keyes and W.D. Gropp. A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation. SIAM Journal on Scientific and Statistical Computing, 8(2):s166-s202, March 1987.

[15] S. Lamberts, G. Stellner, A. Bode, and T. Ludwig. The Paragon parallel programming environment on Sun workstations. In Sun User Group Proceedings, pages 87-98. Sun User Group, December 1993.

[16] I. Mathieson and R. Francis. A dynamic-trace-driven simulator for evaluating parallelism. In Proceedings of the 21st Hawaii International Conference on System Sciences, pages 158-166, January 1988.

[17] Message Passing Interface Forum. Document for a standard message-passing interface. The University of Tennessee, Knoxville, Tennessee. Draft, November 2, 1993.

[18] D. Nicol and P. Heidelberger. Parallel simulation of Markovian queueing networks using adaptive uniformization. In Proceedings of the 1993 SIGMETRICS Conference, pages 135-145, Santa Clara, CA, May 1993.

[19] D. Nicol, C. Micheal, and P. Inouye. Efficient aggregation of multiple LP's in distributed memory parallel simulations. In Proceedings of the 1989 Winter Simulation Conference, pages 680-685, Washington, D.C., December 1989.

[20] D.M. Nicol and R.M. Fujimoto. Parallel simulation today. Annals of Operations Research. To appear.

[21] D.M. Nicol. Parallel discrete-event simulation of FCFS stochastic queueing networks. In Proceedings of the ACM/SIGPLAN PPEALS 1988: Experiences with Applications, Languages and Systems, pages 124-137, 1988.

[22] D.M. Nicol. The cost of conservative synchronization in parallel discrete-event simulations. Journal of the ACM, 40(2):304-333, April 1993.

[23] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vettering. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 1988.

[24] S. Reinhardt, M. Hill, J. Larus, A. Lebeck, J. Lewis, and D. Wood. The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. In Proceedings of the 1993 ACM SIGMETRICS Conference, pages 48-60, Santa Clara, CA, May 1993.

[25] R. Righter and J.V. Walrand. Distributed simulation of discrete event systems. Proceedings of the IEEE, 77(1):99-113, January 1989.
