NASA Contractor Report 198242
ICASE Report No. 95-81

ICASE

TOWARDS A THREAD-BASED PARALLEL
DIRECT EXECUTION SIMULATOR

Phillip Dickens
Matthew Haines
Piyush Mehrotra
David Nicol

NASA Contract No. NAS1-19480
November 1995

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001

TOWARDS A THREAD-BASED PARALLEL DIRECT EXECUTION SIMULATOR*

Phillip Dickens†, Matthew Haines†, Piyush Mehrotra†, and David Nicol‡

†Institute for Computer Applications in Science and Engineering
NASA Langley Research Center, Hampton, VA 23681-0001

‡Department of Computer Science, College of William and Mary
Williamsburg, VA 23187

Abstract

Parallel direct execution simulation is an important tool for performance analysis of large message-passing parallel programs, in which the application processes execute on top of a virtual computer. However, detailed simulation requires a great deal of computation, and we are therefore interested in pursuing implementation techniques which can decrease this cost. One idea is to implement the virtual processes as lightweight threads rather than traditional Unix processes, reducing both on-processor communication and context-switching costs. In this paper we describe an initial implementation of a thread-based parallel direct execution simulator, discuss the advantages of such an approach, and present preliminary results that indicate a significant improvement in scalability over the process-based approach.

*This work was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.

1 Introduction

Direct execution simulation is an important tool for performance prediction of large message-passing parallel programs. In this approach, the application code is executed directly on the host machine, while references to constructs of the simulated machine that the host does not provide (e.g., the communication network) are trapped and handled by the simulator. A detailed direct execution simulation requires a great deal of computation, but the pool of application processes is a good source of parallelism, so such simulations are natural candidates for parallelization. Parallel direct execution simulation thus offers a clear potential for improving performance, even in situations where the simulator itself would otherwise become a bottleneck.

One performance measure for a direct execution simulator is slowdown, which is defined as the time it takes the simulator to execute the application code on N processors divided by the time taken when the application code executes natively on N processors. LAPSE (Large Application Parallel Simulation Environment) [9] is a parallel direct execution simulator which allows one to predict the performance and scalability of message-passing codes written for the Intel Paragon. LAPSE has reported slowdowns between 1.8 and 139, depending on the application and the number of processors, which is quite reasonable compared to other direct execution simulators. LAPSE has also been ported to a cluster of Sun workstations using nx-lib [33], a library which provides the functionality of the Paragon communication constructs on a network of workstations. The slowdowns measured on the workstation cluster are generally greater than those measured on the Paragon. This is in large part due to communication costs: on the cluster, LAPSE uses Unix sockets as the interprocess communication mechanism, whereas each node of the Paragon has access to a high-speed mesh interconnection network and a dedicated communication co-processor.

LAPSE assumes that there is one application process per simulated (virtual) processor, and it is natural to implement each virtual processor as a heavyweight Unix process residing in its own address space; this is how both versions of LAPSE are implemented. Given the high cost of communicating among Unix processes on a workstation cluster, large simulations using heavyweight processes can become impractical. One way to improve performance is to change the way the virtual processors are implemented: use lightweight threads, rather than heavyweight Unix processes, to execute the virtual processors. This not only reduces Unix context-switching costs, but also the cost of communication among virtual processors executing on the same physical processor. However, using threads brings up a very important issue: how do we provide a separate address space for each virtual processor when the threads executing within a process generally all share one global address space?

In this paper we discuss a thread-based version of LAPSE and compare its performance with the process-based version, both executing on a cluster of workstations. Our preliminary results suggest that the thread-based approach significantly improves scalability, making it practical to perform large direct execution simulations on a workstation cluster. The unique aspect of our work is that we provide a separate address space for each thread without requiring special operating system support; this is in contrast to other thread-based direct execution simulators, such as the Wisconsin Wind Tunnel [30] and Tango Lite [16], which either employ special operating system support or do not provide separate address spaces for their threads.

The remainder of this paper is organized as follows. In Section 2 we give some background on direct execution simulation, and in Section 3 we briefly review LAPSE. In Section 4 we give an overview of lightweight threads. In Section 5 we detail the issues of implementing a thread-based version of LAPSE, in particular the approaches for providing separate address spaces among threads. In Section 6 we present preliminary experimental results comparing the process-based and thread-based versions of LAPSE executing on a workstation cluster. We conclude in Section 7 with a discussion of future directions.

2 Background

Massively parallel hardware is expensive, and acquiring the full resources of such a machine on a regular basis may be difficult or infeasible for a user. At least three questions make scalability prediction an important area of interest in parallel processing research. First, a user may wish to know how a given application will perform on a large number of processors when only a smaller number of processors (or a smaller problem size) is routinely available; a code that achieves good performance on a small number of processors may not achieve good performance on a larger number, and the programmer is interested in knowing if, how, and why this happens. Second, the machine of interest may not be available, or may not yet have been built: it would be useful to predict the performance of a program under a different interconnection network or a different parallel architecture. Third, in practice machine resources are shared among many users (with attendant operating system overheads), so measurement under dedicated conditions is not always practical. Similarly, a network of interconnected workstations could be used to emulate the code of a parallel multicomputer on a larger simulated system. Additionally, designers of operating system software could make use of a parallelized simulation of large systems driven by executing application code, as could designers of new communication networks.

Direct execution simulation is a mechanism to predict the performance of a code's behavior on an actual or as-yet-unbuilt machine [7, 12, 13, 21]. The application program of interest is executed on some number of physical processors, with its execution monitored and controlled by simulator processes. Behavior that the host machine provides directly (e.g., the computation itself) is executed natively, while behavior of the simulated machine that the host cannot provide (e.g., its communication constructs) is trapped and emulated, the simulator advancing simulated time as a function of both the measured execution of the application and a timing model of the simulated machine. Given an application written for N virtual processors (VPs), the N application processes, which together emulate the virtual machine, can be executed on some smaller number n < N of physical processors; each physical processor is assigned some number of application processes and a simulator process that handles their interactions with the rest of the simulated system. In this way, the scalability behavior of an application written for a large machine can be determined using a much smaller cluster of processors. LAPSE provides this capability for codes written for the Intel Paragon, and its parallel simulator allows fairly rapid evaluation of a code's simulated behavior, albeit slower than actual execution.

Several other projects use direct execution to drive simulations of multiprocessor systems [1, 5, 6, 17, 29]. The Wisconsin Wind Tunnel (WWT) [29] is, to our knowledge, the only other multiprocessor simulator that uses a multiprocessor (the CM-5) to execute the simulation. It is worthwhile to note the differences between LAPSE and WWT. The first is the type of machine being simulated: WWT is designed primarily to simulate cache-coherency protocols on shared-memory machines and depends on the fidelity of the protocol simulation, while LAPSE is a tool for performance and scalability analysis of message-passing codes. Coherency protocols complicate the simulation considerably, but since we are not dealing with shared memory, neither of these issues is a facet of our problem. A second difference is the synchronization mechanism used for the parallel discrete-event simulation: WWT uses a different mechanism than LAPSE, which uses the YAWNS protocol [27] combined with appointments [26]. We do not elaborate on the details here; the interested reader is directed to [9] for a detailed discussion. A third difference, discussed further in Section 5, is the approach taken to support virtual processors with light-weight threads: WWT uses a special-purpose mechanism implemented in the operating system of its host, while the approach taken here requires no special operating system support.

3 LAPSE

In this section we provide a brief description of LAPSE; it is discussed fully in [9]. To use LAPSE, a programmer simply modifies the makefile of the application code (which may be in C or Fortran) so that LAPSE scripts are called instead of the native compilers. These scripts transform the code, create the simulator, and set up a file specifying the number of simulated nodes. At compile-time, LAPSE modifies the code's assembly language, inserting instruction counters at basic block boundaries. This provides a measure of the number of instructions executed along the actual execution path, which LAPSE uses to advance each virtual processor's simulated clock.

For every node in the simulated system there is one application process; in addition, simulation processes (one per actual node) are responsible for simulating the communication activity of the application processes assigned to that node, and for synchronization with the other simulation processes. Whenever an application process executes a system call that can affect, or be affected by, temporal activity in the simulated system (e.g., a send, a receive, or a clock probe), the call is trapped by a LAPSE routine. In doing so, the application process provides the simulation process with information about the requested activity (such as the system call parameters), and the simulation process determines the simulated time at which the call would have returned and its net effect in the simulated system. The application processes thus act as on-line trace generators, while the simulation processes perform the synchronization required to keep the overall simulation consistent; LAPSE recovers "lookahead" information from the structure of the application code to make this synchronization efficient. Since, in many codes of practical interest, the number of application processes is much larger than the number of physical processors, each physical processor multi-tasks the application processes assigned to it, and the simulation can be run on far fewer processors than the simulated system.

3.1 LAPSE on a Workstation Cluster

As noted before, nx-lib is a software library developed by researchers at the University of Munich [33], allowing the development and execution of codes for the Intel Paragon multicomputer on a network of workstations. Codes run under nx-lib have the functionality of the Paragon, and we have ported LAPSE to a workstation cluster using this library. However, owing to temporal sensitivities, it is possible for the execution path of a code to be different on the workstations than it would be on the actual Paragon. Furthermore, calls to system clocks reflect the workstation's own sense of time, not time on the Paragon being simulated. Thus nx-lib alone cannot be used to provide accurate timing estimates of how long the code would have taken had it been run on the Paragon.

There are two primary issues which must be addressed when executing simulations developed for an Intel Paragon on a workstation cluster. The first is functionality: each application process must provide the functionality of a Paragon node, which nx-lib largely supplies. The second issue is how to improve the performance of the simulation, and in this paper we are concerned primarily with this second issue. The main concern is communication cost. Since each application process executes in its own address space as a Unix process, processes must communicate with each other through the operating system; on the workstation cluster this is done with TCP sockets, which are far more expensive than the dedicated communication hardware of the Paragon. One approach to reducing this cost is to implement the application processes as lightweight threads rather than Unix processes: on-processor communication can then be performed by a memory copy instead of a write to a TCP socket (each thread can simply access the other's local buffers), and expensive Unix context switching is replaced by lightweight thread switching. For these reasons, implementing the application processes as lightweight threads is very attractive; the approach is discussed in Section 5.

4 Lightweight Threads

Threads are becoming a popular mechanism for expressing parallelism and asynchronous behavior in many current languages and systems [3, 11, 22]. In response to this demand for threaded systems, many lightweight thread packages have been developed [4, 14, 19, 24, 32]. Additionally, the POSIX committee has established a proposed standard interface for lightweight threads called pthreads [18], and a library implementation of this proposed standard already exists [23].

4.1 Definition

A thread represents an independent, sequential unit of computation that executes within the context of a kernel-supported entity, such as a Unix process. The "weight" of a thread corresponds to the amount of context that must be saved when a thread is removed from a processor and restored when the thread is reinstated on a processor (i.e., a context switch), and threads are often classified by this weight. The context of a Unix process, often termed a heavyweight thread, includes all of the process state: the address space, page tables, interrupt vectors, registers, kernel stack, and more; a heavyweight context switch typically takes on the order of hundreds of microseconds. Contemporary operating system kernels, such as Mach, decouple the thread of control from the address space, allowing multiple kernel-level (middleweight) threads to execute within a single address space [2]; their context is smaller, but kernel operations are still required to manipulate them. By exposing thread operations at the user level, and defining a thread by a minimal context (essentially the registers and a stack), user-level threads avoid crossing the kernel interface altogether: such threads can be switched on the order of tens of microseconds, and are thus termed lightweight. For the remainder of this paper, we will use the terms "thread," "lightweight thread," and "user-level thread" synonymously.

4.2 Thread Scheduling

All threads within a process are executed under the control of a thread-level scheduler, whose job is to determine the next available thread that should execute and to switch between threads at context-switching points. Thread scheduling is either:

• preemptive (typically round-robin), in which each thread is given a time quantum; when the quantum expires, the thread is interrupted and control yields to another thread. This is similar to Unix process-level scheduling; or

• non-preemptive (typically FIFO), in which a thread will continue to execute until it completes, explicitly yields control, is suspended, or blocks on a synchronization primitive.

It is important to note that all threads of a process execute within the address space of that process. Unix processes are each assigned a separate address space, so that the operating system can interleave the execution of user programs without the possibility of one process corrupting the context of another; address space boundaries are enforced by routing all memory references indirectly through a base register updated by the operating system [2]. Threads, in contrast, share all of the memory of the enclosing process, and hence variables defined outside of the scope of a thread function (file scope) are shared by all threads, and in fact are allocated in a single, global block of memory. This is typically beneficial, as it allows threads to easily share information using common addresses combined with mutual exclusion operations to ensure consistency. However, many applications, such as LAPSE, require that a certain amount of context be maintained on a per-thread basis across procedure calls. This type of information is referred to as thread-specific data. The issue for LAPSE is that, when emulating virtual distributed-memory processors using threads, each virtual processor requires its own private copy of all of the application's global variables.

5 Providing Separate Address Spaces

5.1 Approaches

There are several approaches to maintaining thread-specific data, i.e., to giving each thread a private copy of the global variables it requires.

Most lightweight thread packages provide a mechanism that allows a thread to manage data that is private to the thread but global within it. For example, pthreads [18] provides a thread-specific data mechanism in which a user creates a key, and a (key, value) pair can then be set and retrieved independently by each thread. This approach is well-suited for managing a small amount of thread-specific data, such as a private copy of errno for each thread, but it is not a natural mechanism when each thread must have its own copy of every global variable, since every access to such a variable must go through the key functions.

Another approach is to provide a separate address space for each thread using support from the operating system. The Wisconsin Wind Tunnel [30] takes this approach, depending upon the operating system of its host to create subservient contexts ("subcontexts") for its threads; the simulator manipulates the page mappings of a subcontext, and a special mechanism is used to handle traps generated during execution and to manage tags in subcontexts [30]. While this does provide separate address spaces, it requires significant operating system support and is therefore not portable among machines and operating systems. We wanted an approach that would require no special operating system support. (Separate address spaces are handled differently again in Tango Lite [16], which does not give each thread its own address space.)

Finally, thread-specific data can be accessed indirectly through a base address pointer (or register) which is different for each thread, with the pointer saved and restored as part of each thread context switch. While this is a machine- and system-independent solution, it requires significant modification to the user code so that all global variables are referenced through the base address register rather than being accessed directly. We know of no project which uses this approach.

5.2 Our Solution

Our solution is to employ the scoping mechanism of the C++ class to provide a separate address space for each thread, taking advantage of the capabilities of both the underlying thread system and the C++ compiler, and without requiring significant modifications to the user code. Recall that in C++, variables (data members) defined within a class are visible only through the member functions of the class, and are maintained separately for each instance of the class. Thus, to provide each thread with its own copy of the global data, the user code is transformed so that all global variable definitions are surrounded by a single class definition, and all functions are transformed into member functions of that class. Each thread then creates its own instance of this class, which means that each thread has its own private copy of all the global variables. Since the member functions reference these variables through the implicit class pointer (this), every reference to a global variable is automatically indirected to the copy belonging to the instance, and therefore to the thread, through which the function was called. By correctly saving and restoring the class pointer as part of a context switch, we can effectively create a separate address space for each thread.

To make these ideas more concrete, consider the example shown in Figure 1. The column on the left side of the figure shows the original code: there are two global functions (foo1 and func1), three global variables (x, y, z), and a main function, all of which should be private to each thread executing this code. The column on the right shows the transformed code that is necessary to give each thread its own copy of these globals. All of the global variables and functions are put into one class (G_Class in this example), and the original main is renamed (new_main() in this example). A new thread entry function is written which declares an object of type G_Class and calls the renamed main function through this object; the thread creation code itself is not shown here. Note that when a thread calls new_main(), it accesses the global variables (x, y, z) as member variables of its own class instance via the implicit this pointer. Thus, when the simulator executes N threads on a physical processor, each thread has its own "private" copy of the global variables within a previously global address space.

One drawback of our approach is that it requires a C++ compiler, so Fortran application codes pose some problems. Also, while the modifications are simple, they could be automated by a compiler or preprocessor. We present the performance results of our thread-based implementation of LAPSE in the next section.

Original code:

    #include ...

    int x, y, z;
    double foo1();
    int func1();

    main() {
        int a, b, c;
        double d1;
        a = x; b = y;
        z = a + b;
        x = z;
        d1 = foo1();
        .......
    }

    double foo1() { ..... }
    int func1() { ..... }

Modified code:

    #include ...

    class G_Class {
    public:
        int x, y, z;
        double foo1();
        int func1();
        void new_main();
    };

    void G_Class::new_main() {
        int a, b, c;
        double d1;
        a = x; b = y;
        z = a + b;
        x = z;
        d1 = foo1();
        .......
    }

    void Thread_Entry_Function() {
        G_Class my_class;
        my_class.new_main();
    }

Figure 1: Creating Separate Address Spaces

Experimental Results

As noted above, one of the primary expected benefits of a thread-based implementation of LAPSE is a reduction in the costs due to interprocess communication and context switching. To get a measure of the improvement offered by this approach, we present the results of a set of experiments which compare the communication costs of the two approaches using a simple, communication intensive message passing application. In these experiments we simulate 16 virtual processors on a cluster of Sun Sparcstation 20s, using 16, 8, 2 and 1 physical processors (thus placing 1, 2, 8 and 16 processes (threads) on each physical processor). At the time of this writing the code modifications described above were not yet integrated into the thread-based version; these experiments are therefore designed to capture the communication costs of the two approaches rather than the full execution time of the simulator.

Before presenting our results, it would be useful to briefly discuss how data is transferred between virtual processors in the two implementations. In the thread-based version of LAPSE, it must be determined for each communication whether the message is for an on-processor thread or an off-processor thread. If the two threads reside on the same physical processor, a memory copy is used to transfer the data; if the message is off processor, Unix sockets are used to transfer the data. The process-based version of LAPSE uses Unix sockets for all communication, regardless of whether the processes reside on the same or different physical processors.

Figure 2: High Communication/Computation Ratio for 16 Virtual Processors. [Plot: seconds per iteration versus number of threads/processes per physical processor, thread-based versus process-based.]

The application code is arranged in a ring and goes through several compute/send/receive cycles. In the computation phase of the cycle each process (thread) performs some number of integer operations, and in the communication phase it sends a message to its nearest neighbor in the ring and then receives a message. We vary the communication/computation ratio by varying the number of integer operations performed in the computation phase. In the first experiment the application performs 10,000 integer operations per iteration (a high communication/computation ratio), in the second experiment 250,000 integer operations (a medium ratio), and in the final experiment a still larger number of integer operations (a low communication/computation ratio). The metric of performance is the number of seconds required to complete one iteration of the compute/communication cycle.

The results for the high communication/computation ratio experiment are shown in Figure 2. As can be seen, the time required to complete one iteration increases very quickly for the process-based approach as the number of processes per physical processor is increased. When the number of processes per physical processor is 16, one iteration of the algorithm takes approximately 10 times as long to complete under the process-based approach as under the thread-based approach. This confirms that the communication costs of the thread-based approach are significantly lower.

Note that when there is one process (thread) per physical processor, the process-based approach is a little bit faster than the thread-based approach, which can be attributed to the overhead of the lightweight thread system. In particular, using threads requires routines and data structures to determine such things as whether the recipient of a particular message resides on a given processor and whether a message buffer is free. When there is only one thread per processor this machinery is unnecessary and can be avoided.

Figure 3: Medium Communication/Computation Ratio for 16 Virtual Processors. [Plot: seconds per iteration versus number of threads/processes per physical processor, thread-based versus process-based.]

We have not yet optimized for the one thread per processor case. As shown in Figure 3, the thread-based approach still significantly outperforms the process-based approach for the medium communication/computation ratio. Again, the costs associated with the process-based approach grow much more quickly than those of the thread-based approach as the number of processes (threads) per physical processor is increased, although the process-based approach slightly outperforms the thread-based approach with two and four processes (threads) per processor. This most likely has to do with the scheduling of threads and processes, as we now discuss.

Our preliminary implementation of the thread-based approach is non-pre-emptive, and thus the threads execute until they give up control of execution because they have no more useful work to perform. One example of where a thread would give up control is when it issues a call to receive a message and the message is not yet available. In our experiments each thread issues its message requests at roughly the same rate, making it more likely that a message will be available when it is required. The process-based approach, on the other hand, is time-sliced, and there is no guaranteed order of execution for the processes; it is unknown when the message required by a particular process will be available, and the scheduler may have to swap a process in and out several times before finding one that is eligible for execution. Thus each iteration likely requires fewer context switches under the thread-based approach than under the process-based approach. This issue of thread scheduling merits further investigation.

Figure 4 shows the results for the low communication/computation ratio. Again it is seen that the costs of the process-based approach rise much more quickly than those of the thread-based approach as the number of processes (threads) per processor is increased, although the improvement in performance for the thread-based approach is smaller at this lower ratio.

All of the experiments reported here make it clear that a thread-based implementation offers the possibility of significant improvement in performance. Also, our studies show that this improvement is relatively insensitive to the communication/computation ratio.

Figure 4: Low Communication/Computation Ratio for 16 Virtual Processors. [Plot: seconds per iteration versus number of threads/processes per physical processor, thread-based versus process-based.]

Clearly we need to develop a test suite of real direct execution simulation applications, but our preliminary results are very encouraging. Before leaving this section it is important to note that these results do not capture the relative speedup due to parallelization of the virtual computer simulation itself; this important issue is discussed in detail in [9].

Conclusions

This paper outlines the idea of using lightweight threads to support a parallel direct execution simulator such as LAPSE. Traditional process-based approaches to direct execution simulation suffer from the context-switch and interprocess communication overheads required to support virtual processors, overheads that can be much reduced by lightweight threads. We presented a method for providing threads with the separate address spaces required to directly execute message-passing application codes, something not commonly supported by lightweight thread systems. Preliminary results, based on a carefully constructed ring application with a varying computation/communication ratio, show that the thread-based implementation of the simulator offers the potential for significant performance improvement over the process-based approach: we observed up to a ten fold improvement for the high communication/computation ratio.

Our next step is to modify existing message-passing codes to run on the thread-based LAPSE simulator. This would allow us to investigate the effect of the thread-based approach on performance under the computation/communication ratios and communication patterns of real applications. Based on the results presented here, we feel that the thread-based approach bears further investigation.

References

[1] D. Agrawal, M. Choy, H.V. Leong, and A. Singh. Maya: A simulation platform for distributed shared memories. In Proceedings of the 8th Workshop on Parallel and Distributed Simulation, pages 151-155, 1994.

[2] Maurice J. Bach. The Design of the UNIX Operating System. Software Series. Prentice-Hall, 1986.

[3] Henri E. Bal, M. Frans Kaashoek, and Andrew S. Tanenbaum. Orca: A language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, 18(3):190-205, March 1992.

[4] Brian N. Bershad. PRESTO: A system for object-oriented parallel programming. Technical Report 88-01-04, Department of Computer Science, University of Washington, January 1988.

[5] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl. PROTEUS: A high-performance parallel-architecture simulator. Technical Report MIT/LCS/TR-516, Massachusetts Institute of Technology, September 1991.

[6] D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Proceedings of the International Symposium on Computer Architecture, pages 239-248, 1990.

[7] R. Covington, S. Dwarkadas, J. Jump, S. Madala, and J. Sinclair. Efficient simulation of parallel computer systems. International Journal on Computer Simulation, 1(1):31-58, 1991.

[8] H. Davis, S. Goldschmidt, and J. Hennessy. Multiprocessor simulation and tracing using Tango. In Proceedings of the 1991 International Conference on Parallel Processing, pages II99-II107, August 1991.

[9] P.M. Dickens, P. Heidelberger, and D.M. Nicol. Parallelized direct execution simulation of message-passing parallel programs. Technical Report 94-50, ICASE, July 1994.

[10] S. Dwarkadas, J. Jump, and J. Sinclair. Execution-driven simulation of multiprocessors. ACM Transactions on Modeling and Computer Simulation, 4(4):314-338, October 1994.

[11] I. Foster and K. M. Chandy. Fortran M: A language for modular parallel programming. Technical Report MCS-P327-0992 Revision 1, Mathematics and Computer Science Division, Argonne National Laboratory, June 1993.

[12] R.M. Fujimoto. Simon: A simulator of multicomputer networks. Technical Report UCB/CSD 83/137, ERL, University of California, Berkeley, 1983.

[13] R.M. Fujimoto and W.D. Campbell. Efficient instruction level simulation of computers. Transactions of the Society for Computer Simulation, 5(2):109-124, April 1988.

[14] Dirk Grunwald. A user's guide to AWESIME: An object oriented parallel programming and simulation system. Technical Report CU-CS-552-91, Department of Computer Science, University of Colorado at Boulder, November 1991.

[15] Carl Hauser, Christian Jacobi, Marvin Theimer, Brent Welch, and Mark Weiser. Using threads in interactive systems: A case study. In Proceedings of the Symposium on Operating Systems Principles, pages 94-105, Asheville, NC, December 1993.

[16] S. Herrod. Tango Lite: A multiprocessor simulation environment. Unpublished; available at http://www-flash.stanford.edu/~herrod/Papers.

[17] F.W. Howell, R. Williams, and R.N. Ibbett. Hierarchical architecture design and simulation environment. In MASCOTS '94: Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 363-370, Durham, NC, January 1994.

[18] IEEE. Threads Extension for Portable Operating Systems (Draft 7). IEEE Computer Society Press, February 1992.

[19] David Keppel. Tools and techniques for building fast portable threads packages. Technical Report UWCSE 93-05-06, University of Washington, 1993.

[20] R. Konuru, J. Casas, R. Prouty, S. Otto, and J. Walpole. A user-level process package for PVM. In Proceedings of the Scalable High Performance Computing Conference, 1994.

[21] I. Mathieson and R. Francis. A dynamic trace-driven simulator for evaluating parallelism. In Proceedings of the 21st Hawaii International Conference on System Sciences, pages 158-166, January 1988.

[22] Piyush Mehrotra and Matthew Haines. An overview of the Opus language and runtime system. Technical Report 94-39, ICASE, 1994. Also appears in Lecture Notes in Computer Science 892, Springer-Verlag.

[23] Frank Mueller. A library implementation of POSIX threads under UNIX. In Winter USENIX, pages 29-41, San Diego, CA, January 1993.

[24] Bodhisattwa Mukherjee, Greg Eisenhauer, and Kaushik Ghosh. A machine independent interface for lightweight threads. Technical Report CIT-CC-93/53, College of Computing, Georgia Institute of Technology, Atlanta, Georgia, 1993.

[25] D. Nicol, C. Michael, and P. Inouye. Efficient aggregation of multiple LP's in distributed memory parallel simulations. In Proceedings of the 1989 Winter Simulation Conference, pages 680-685, December 1989.

[26] D.M. Nicol. Parallel discrete-event simulation of FCFS stochastic queuing networks. In Proceedings of ACM/SIGPLAN PPEALS 1988: Experiences with Applications, Languages and Systems, pages 124-137. ACM Press, 1988.

[27] D.M. Nicol. The cost of conservative synchronization in parallel discrete-event simulations. Journal of the ACM, 40(2):304-333, April 1993.

[28] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 1988.

[29] S. Reinhardt, M. Hill, J. Larus, A. Lebeck, J. Lewis, and D. Wood. The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. In Proceedings of the 1993 ACM SIGMETRICS Conference, pages 48-60, Santa Clara, CA, May 1993.

[30] S. Reinhardt, B. Falsafi, and D. Wood. Kernel support for the Wisconsin Wind Tunnel. In Proceedings of the Usenix Symposium on Microkernels and Other Kernel Architectures, September 1993.

[31] Carl Schmidtmann, Michael Tao, and Steven Watt. Design and implementation of a multithreaded Xlib. In Winter USENIX, pages 193-203, San Diego, CA, January 1993.

[32] Sun Microsystems, Inc. Lightweight Process Library, Sun release 4.1 edition, January 1990.

[33] G. Stellner, S. Lamberts, and T. Ludwig. NxLib - a parallel programming environment for workstation clusters. In PARLE'94: Parallel Architectures and Languages Europe, Athens, 1994.

Final report. Langley Technical Monitor: Dennis M. Bushnell. To appear in the Proceedings of the 29th Hawaii International Conference on System Sciences.
