NASA Contractor Report 201649
ICASE Report No. 97-8

MINIMIZING OVERHEAD IN PARALLEL ALGORITHMS THROUGH
OVERLAPPING COMMUNICATION/COMPUTATION

Arun K. Somani
Allen M. Sansano

NASA Contract No. NAS1-19480
February 1997

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001


Minimizing Overhead in Parallel Algorithms Through
Overlapping Communication/Computation

Arun K. Somani and Allen M. Sansano
{arun, allen}@shasta.ee.washington.edu
Department of Electrical Engineering
University of Washington, Box 352500
Seattle, WA 98195-2500

Abstract

One of the major goals in the design of parallel processing machines and algorithms is to reduce the effects of the overhead introduced when a given problem is parallelized. A key contributor to overhead is communication time. Many architectures try to reduce this overhead by minimizing the actual time for communication, including latency and bandwidth. Another approach is to hide communication by overlapping it with computation. This paper presents the Proteus parallel computer and its effective use of communication hiding through overlapping communication/computation techniques. These techniques are easily extended for use in compiler support of parallel programming. We also address the complexity, or rather the simplicity, of achieving complete exchange on the Proteus machine.

*This research was supported in part by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while this author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681.

1 Introduction

1.1 Background

This paper presents Proteus [12], a parallel machine designed to minimize the overhead introduced when a problem is parallelized by overlapping communication with computation. The Proteus project was started in 1990, and a 32 (46 including spare) processor machine was first available in early 1993.

When a problem is parallelized, overhead that does not exist in the serial version is added to the execution time. Communication is a major contributor to this overhead: time is spent invoking communication protocols, waiting for a connection to be set up, waiting for data to be transmitted, and resolving contention for links and buffers, and a processor that must wait for communication to be completed sits idle. These inefficiencies lead to poor utilization of the processing resources. Many architectures attack the problem by making communication itself faster, through low latency interconnection networks, fast links, and efficient protocols. This reduces communication time, but it cannot drive the overhead to zero. Another technique attempts to hide the communication overhead by overlapping communication with computation, so that time a processor would otherwise spend waiting on communication is spent doing useful work. While this approach places additional demands on the communication hardware and its control, it is a practical way to approach zero effective communication overhead.

Several recent commercial and research machines take advantage of overlapping communication/computation by providing hardware that handles communication on behalf of the computation processors. The Intel Paragon provides nodes with multiple i860 processors, one of which can be used as a dedicated communication processor. The Stanford FLASH project uses the MAGIC chip on each node to handle communication and protocol processing. Other examples include the MANNA machine at GMD in Germany and the PIM/c machine in Japan [2, 4, 6, 7, 14]. In these machines the communication hardware is provided on a per-node basis. Proteus instead concentrates the communication functions in a communication controller and communication hardware that are shared by all the processors of a cluster. This keeps the communication hardware cost per processor low while still freeing the processors to compute while communication is in progress.

1.2 Contributions

The novel architectural feature of Proteus is the use of a dedicated communication controller, shared by the processors of a cluster, that handles scheduling, buffer management, and the bulk of the communication control tasks. This minimizes the communication hardware required per processor and makes overlapping of communication and computation realizable when a problem is parallelized. This paper describes the architecture and the communication techniques built on it and shows how they reduce communication overhead; the same techniques can also serve as a basis for compiler support of parallel programming.

This paper is organized in the following fashion. Section 2 describes the architecture and communication protocol of Proteus. Section 3 discusses the communication functions and techniques used on the machine. Section 4 outlines some examples, such as the complete exchange, that demonstrate the usefulness of overlapping communication/computation. Section 5 presents its performance and compares it with that of other commercial machines such as the Intel Paragon, IBM SP2, and Meiko CS-2. In Section 6, we present some conclusions.

2 The Proteus System Architecture

Proteus [12] is a Multiple Instruction, Multiple Data (MIMD) parallel machine with a hybrid shared memory/message passing design. Processors are grouped into Clusters that share memory, caches, and communication resources. Clusters are connected together through a crossbar network into Groups, and Groups of Clusters can in turn be connected, based upon an Enhanced Hypercube [3], so that the system scales further. Within a cluster, communication is through shared memory; between clusters, communication is by message passing over the crossbar network. This hierarchical design balances the power and convenience of shared memory processing at the cluster level against the scalability of a message passing interconnection, while avoiding the contention and scalability limitations of a single large shared memory system. The machine was developed for large granularity data parallel processing such as image processing and vision applications. A diagram of a Proteus node is shown in Figure 1.

[Figure 1: The Proteus System Architecture. Clusters, with their Cluster Controllers and processor daughterboards, are connected to a VME bus for control and have optical connections to the crossbar for data; a Group Controller oversees the group.]

Interesting features of the Proteus machine include:

• A processing element design that facilitates upgrading and changing of processors.
• Software controlled, hardware assisted caching, including the Cache Write Generate mode [16].
• Shared memory communication between the processors on a cluster.
• Hardware support for synchronization between communication and computation, to support overlapping of the two.
• Circuit-switched, hardware controlled communication over a hierarchical interconnection.
• Roving spare processors [5].
• Fault tolerance techniques supported by extra hardware.

2.1 The Pixel Processor

The processing element, or Pixel Processor (PP), used in Proteus is an Intel i860 [17] microprocessor running at 40 MHz. The processor is supported by a one megabyte, unified, direct-mapped L2 cache with a custom cache controller/bus interface at the next level of hierarchy, a shared memory

bus. This is the largest amount of cache available with any machine even today.

The purpose of the L2 cache is to 1) provide faster memory access and 2) ease the contention on the shared memory bus. The L2 cache can act in various cache accessing modes or as a local memory through an addressing mechanism. The access type is decoded by the cache controller from the upper nibble (4 bits) of the 32-bit address. This addressing nibble can also specify whether the PP's internal cache is enabled.
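The addressing mechanism can be pictured with the short sketch below, in which the upper nibble of a 32-bit address selects the access mode. The macro names and mode codes here are hypothetical and are not the encodings actually used by the cache controller.

#include <stdint.h>

/* Illustrative only: the upper nibble (bits 31-28) of a 32-bit address
   selects the L2 access mode; these mode codes are invented, not the
   real Proteus encodings. */
#define L2_MODE_SHIFT 28u
#define L2_MODE_MASK  (0xFu << L2_MODE_SHIFT)

enum l2_mode {
    L2_CACHED         = 0x0,  /* normal cached access                  */
    L2_CACHE_GENERATE = 0x1,  /* validate the line without a line fill */
    L2_LOCAL_MEMORY   = 0x2,  /* treat the L2 as local memory          */
    L2_UNCACHED       = 0x3   /* bypass the caches                     */
};

/* Compose an address whose upper nibble carries the access mode. */
static inline uint32_t l2_address(uint32_t offset, enum l2_mode mode)
{
    return (offset & ~L2_MODE_MASK) | ((uint32_t)mode << L2_MODE_SHIFT);
}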

Of special interest is the cache generate [16] mode, which has been implemented for research purposes. Cache generate is similar to the write allocate scheme except that the line being written into is validated by the cache without reading from the main memory. This assumes that the corresponding processor will write into each byte of the cache line before using it for further computation. Hence the reading of a line is not necessary, as all data in the line are overwritten. It has been shown that in some cases cache generate is very beneficial. However, care must be taken to avoid destructive overwriting of data in the cache.

Flushing and invalidation mechanisms are provided for in hardware to handle cache coherency issues. A software command loads a control register in the cache controller, signaling it to flush and/or invalidate specified cache lines. Therefore, the programmer (compiler) is responsible for maintaining cache consistency. While this places the burden on the programmer, these techniques reduce costs and complexity related to hardware maintained consistency [8].

In order for the PP to communicate with the Cluster Controller (CC), an interrupt system as well as dedicated mailboxes have been set up. In order to interrupt a CC, the PP first fills in some header type information in a pre-defined shared memory location. Then it sets the appropriate bit in a control register, which triggers an interrupt to the CC. The PP is then free to continue with computations while the CC services the interrupt. A flagging system is set up so that a PP does not corrupt the pre-defined shared memory location on the next interrupt if the previous interrupt has yet to be completed.
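The interrupt hand-off just described can be sketched as follows. The shared memory layout, the control register, and the bit position are hypothetical placeholders; only the sequence (check the flag, fill in the header, set the control register bit, continue computing) follows the description above.

#include <stdint.h>

/* Hypothetical pre-defined shared memory area used by the PP-to-CC
   interrupt protocol; field names and sizes are illustrative. */
typedef struct {
    volatile uint32_t pending;    /* set while a previous interrupt is unserviced */
    volatile uint32_t header[4];  /* header information read by the CC            */
} cc_interrupt_area;

#define CC_INTERRUPT_BIT (1u << 0)   /* illustrative bit position */

/* Post an interrupt to the CC and return immediately so the PP can keep
   computing.  Returns -1 if the previous interrupt has not completed. */
static int pp_interrupt_cc(cc_interrupt_area *area,
                           volatile uint32_t *control_reg,
                           const uint32_t header[4])
{
    if (area->pending)            /* flagging system: do not corrupt the area */
        return -1;

    area->pending = 1;
    for (int i = 0; i < 4; i++)
        area->header[i] = header[i];

    *control_reg |= CC_INTERRUPT_BIT;   /* triggers the interrupt to the CC */
    return 0;
}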

Also useful for the PP are some shared memory mailboxes. These are shared memory locations that are dedicated to certain functions between the CC and the PP. For instance, an end-of-transmission (EOT) counter box is set up to keep track of the current EOT count. The EOT is a system clock used to define communication cycles. The CC increments this counter box and the PP reads it for such things as time stamping messages. Other mailboxes are used for the PP to give information to the CC.

For instance, send requests are posted to shared memory mailboxes and are read by the CC. More details on these mechanisms are given in the next section.
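A sketch of such a mailbox area is shown below. The field names are invented for illustration; the use of the EOT counter as a coarse, cycle-granularity time stamp follows the text.

#include <stdint.h>

/* Hypothetical layout of the CC/PP shared memory mailboxes. */
typedef struct {
    volatile uint32_t eot_count;     /* incremented by the CC once per cycle */
    volatile uint32_t send_request;  /* send requests posted by the PP       */
    volatile uint32_t status;        /* CC-to-PP status information          */
} cc_pp_mailboxes;

/* Time stamp an outgoing message with the current EOT count, which acts
   as a system clock in units of communication cycles. */
static uint32_t timestamp_message(const cc_pp_mailboxes *mb)
{
    return mb->eot_count;
}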

2.2 The Cluster

The main components of each cluster are the PPs, a Cluster Controller (CC), and a communication interface.

The Cluster Controller: The CC is primarily responsible for managing all the hardware resources within that cluster. It is an Intel i960 microprocessor. The CC is attached to the shared memory bus and also has its own local bus; one megabyte of local RAM is provided to the CC on the local bus. The CC's tasks include:

• Scheduling tasks on the PEs.
• Managing the shared and dual port (DP) memory resources.
• Managing cluster communication through the optical links.
• Acting as the PPs' interface to the outside world, setting up I/O-like operations from the shared memory to the VME bus through the Group Controller and Group Host.
• Loading PP code and bringing the PPs out of reset.
• Interfacing with the GCI to coordinate crossbar communication.
• Handling almost all interrupts so the PP does not have to.
• Handling much of the serial communication work of the Communication Interface.

The Communication Interface: Each cluster is equipped with an input and an output optical, serial link. Each is set to run at 250 megabits/second, although they can be set up to run at as much as 1000 megabits/sec. With 4/5 encoding and 40 bits transferred for every 32 bits of data, the deliverable bandwidth is 20 megabytes/sec. These links are used for data movement to/from the cluster from/to another cluster or an external source. Effective bandwidth is actually about 16 megabytes/sec after overhead is accounted for. The latency of a connection set up depends on when a message arrives with respect to a cycle boundary, as explained in Section 3.

The Communication Interface consists of a Dual Port Memory (DP), a DMA controller, and input and output FIFO buffers connected to the input and output optical serial links. This subsystem is shared by all the PPs on a cluster, which avoids the cost of replicating the communication hardware for each PP. Allocation of DP memory is controlled by the CC. The CC also controls the DMA controller through

control registers, telling it what to transfer to/from the DP to the serial links. The DP

can handle any combination of the following: reads from the serial link input FIFO buffer, writes to the serial link output FIFO buffer, and reads/writes from the shared memory bus. The FIFO buffers allow for speed matching between the DP accesses and the serial links data transfer rates.
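The way the CC might drive a DP-to-link transfer is sketched below. The register layout, field names, and bit assignments are invented for illustration; only the idea that the CC writes a source, destination, length, and start bit into the DMA controller's control registers follows the description above.

#include <stdint.h>

/* Hypothetical DMA control register block as seen by the CC. */
typedef struct {
    volatile uint32_t src;      /* DP address to read from                 */
    volatile uint32_t dst;      /* destination (e.g., output FIFO)         */
    volatile uint32_t length;   /* transfer length in bytes                */
    volatile uint32_t control;  /* start / direction bits                  */
} dma_regs;

#define DMA_START      (1u << 0)
#define DMA_DP_TO_LINK (1u << 1)

/* Program a transfer from DP memory to the serial link output FIFO. */
static void cc_start_dp_to_link(dma_regs *dma, uint32_t dp_offset,
                                uint32_t nbytes)
{
    dma->src     = dp_offset;
    dma->dst     = 0;                        /* fixed FIFO target */
    dma->length  = nbytes;
    dma->control = DMA_START | DMA_DP_TO_LINK;
}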


2.3 The Group

At the next level of hierarchy, Clusters are connected together into a Group. The main components of a Group are the Clusters, a Group Controller (GC), a Group Communication Interface (GCI), and the crossbar. Control information passes between the GC, the GCI, and the CCs over a VME bus, while data moves between clusters over the optical links through the crossbar.

The Group Controller: The GC, together with the Group Host, a Sun Sparcstation connected over the VME bus and an ethernet connection, coordinates the activities of the group. It provides the file systems, transfers executable images and data to and from the clusters over the VME bus, and handles job scheduling for the group. Incoming images and data may be placed in frame buffers until they can be processed, and processed data is passed back through the GC.

The Group Communication Interface: The GCI's primary purpose is to arbitrate the connection requests received from the CCs and to set up the crossbar connections for each communication cycle. Requests are forwarded from the CCs to the GCI via VME register mailboxes; the GCI arbitrates them and writes the resulting grants back for the CCs to read, and requests that cannot be granted are queued for a later cycle. Once the connections for a given cycle are set, the GCI loads the crossbar. The processor on the GCI board is an Intel i960, chosen to maintain compatibility with the Cluster Controllers; this made it easier to design and program the CC and GCI controllers. The GCI board also contains 256 kilobytes of DP memory and buffers for rate matching on its links.

The Crossbar: The crossbar is built from sets of 20x20 crossbar chips. Some of its links connect the clusters within the group, one link may be configured as a spare for fault tolerance, and the remaining links are used for group-to-group connections and external sources, so that a system may be configured with more than one group. The crossbar is circuit switched, and its connections are set exactly once per communication cycle by the GCI.

3 Proteus Communication Techniques

3.1 Proteus Communication Functions

Proteus uses a hybrid approach to communications: within a cluster the processors share memory, while the clusters are connected by a message passing network. We call the shared memory communication intra-cluster communication and the message passing communication inter-cluster communication. Figure 2 shows the difference in the paths taken for these two types of communications. While the Cluster has a shared bus, an API to the programmer can give the illusion of a message passing programming style while retaining the speed of a shared memory bus.

[Figure 2: Proteus Communication Paths. Intra-cluster communication travels over the shared memory bus within a cluster; inter-cluster communication goes through the Communication Interface and the crossbar.]

Intra-Cluster Communications: Intra-cluster communication takes place through the shared memory on a cluster. Simply reading and writing shared data, however, raises the classical cache coherency problem. Rather than hardware-based coherency, Proteus uses software-controlled, hardware-assisted mechanisms. Part of the PP subsystem design is a mechanism that, when activated, will selectively flush and/or invalidate certain locations in the cache; this coherency control is activated by a processor through control registers (flush, invalidate, or flush-and-sync). The responsibility of looking after coherency is therefore placed on the programmer (compiler), which can become increasingly difficult for fine grain or irregular applications. With the larger grain data parallel applications Proteus is optimized for, this is a relatively low burden. Alternate methods of consistency maintained by software do exist [15]. The Proteus rules are based on multiple assignment with a single owner per data block: only the owner of a block may modify it, and ownership is released when the data is made available to another processor.

For an intra-cluster send, the sending PP x flushes any potentially cached data to the shared memory of the cluster. Once the data is flushed, PP x sets a synchronization flag to make the data available to the receiving PP y, and it is then free to continue with its own computations. Care must be taken that the sending PP does not modify the data area after the flag is set, since the data is now "owned" by the receiving PP. The receive is a blocking operation built on the same flag: the receiving PP checks in at a synchronization point to see if the flag has been set. If the data has already been sent, the receiving PP can continue with its computations; if not, it is blocked until the sending PP completes its send. Buffering of the data can be used to let the sending PP modify its own copy as it pleases while the receiving PP works from the buffer. The receiving PP also needs to know the address of where the data is. This address is passed back by the synchronization routine: the sending PP places the address in a predefined location, and the receiving PP reads it as part of its check-in operation.

Inter-Cluster Communications: Inter-cluster communication is more involved than the intra-

cluster, flush-and-sync communication. The communication is done on a synchronous, circuit-switched basis and occurs on a defined cycle basis. The CC acts as a controller for all interaction between a cluster and the crossbar network. A PP has to post a communication send request to the CC, which then handles all control while the PP is free to continue with work. The choice of synchronous circuit switching fits well with the original intent of the machine of macro-pipelining and permutation routing as employed in volume visualization and other image processing algorithms. Research is being done to implement an asynchronous protocol on Proteus; the hardware is already in place and the protocol is being defined and refined.

The inter-cluster communication send mechanism is somewhat similar to the intra-cluster communication mechanism. In this case, the PP needs to place data in the DP memory area. Remember, the DP is directly addressable by the PP. This is much like the flushing part of the intra-cluster send. The network transmission part can be thought of as the synchronization between the sending cluster and the receiving cluster: once this transmission is completed, the communication has occurred.

PPs communicate send requests to the CC through shared memory "mailboxes." When a PP wants to send data, it posts a send request in shared memory. This procedure fills a mailbox with communication information such as a destination processor, source address, and a communication ID tag. Once this action is complete, the PP is finished with the tasks it needs to do to communicate. Once again, to maintain multiple assignment rules, the data in DP should not be modified once it is assigned to be communicated; in fact, once a send is requested the sending PP releases ownership of that data block. This is very similar to the case for intra-cluster sends. Once the CC reads the request it handles all further action necessary to complete the send, such as forwarding the request to the GCI and setting up the actual communication. This takes the burden of communication control off the PPs. Most of the time necessary to send data occurs during the period in which the CC is performing its actions. During this time the PP is free to do computations. This is exactly the concept of overlapping communication/computation.
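A minimal sketch of posting such a send request is given below, assuming one shared memory mailbox per PP. The structure layout, field names, and the valid-flag handshake are illustrative; the fields themselves (destination processor, source address in the DP, and a communication ID tag) follow the text.

#include <stdint.h>

/* Hypothetical send-request mailbox in shared memory. */
typedef struct {
    volatile uint32_t valid;      /* set by the PP, cleared by the CC      */
    volatile uint32_t dest_proc;  /* destination processor                 */
    volatile uint32_t src_addr;   /* source address of the data in the DP  */
    volatile uint32_t comm_id;    /* communication ID tag                  */
    volatile uint32_t length;     /* size of the data block                */
} send_request_mailbox;

/* Post a send request and return; the CC does the rest while the PP
   computes.  The data must already be in the DP and, per the multiple
   assignment rules, must not be modified after this call. */
static int pp_post_send(send_request_mailbox *mb, uint32_t dest,
                        uint32_t dp_addr, uint32_t id, uint32_t nbytes)
{
    if (mb->valid)                /* previous request not yet taken by the CC */
        return -1;

    mb->dest_proc = dest;
    mb->src_addr  = dp_addr;
    mb->comm_id   = id;
    mb->length    = nbytes;
    mb->valid     = 1;            /* ownership of the block passes to the CC */
    return 0;
}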

For an inter-cluster receive, a PP interrupts the CC informing it that the PP needs data. An interrupt mechanism is used in order to allow the CC to respond almost immediately to the PP request; if a mailbox were used, the CC could potentially have to wait until the next cycle to begin servicing the request. If the data is already in the DP, the delay is eliminated by the interrupt mechanism because the CC can immediately release the data to the PP. Otherwise, arriving data sits in the destination DP until a receive for it has been issued. If a receive has been requested before the data has been received, the requesting PP has to wait until the data arrives, at which time the data is released to the PP.

Uniform Communication API: To simplify programming, the API is fashioned so that communication calls that are made for local communications will invoke the flush-and-sync mechanisms without the programmer explicitly maintaining the flush-and-sync. The send or receive routines can be fashioned such that only one library function needs to be called for any communication (inter or intra); the library function determines the kind of communication it is and takes the necessary actions. For ease of use, the determination of which type of communication, intra-cluster or inter-cluster, is occurring should be invisible to the programmer (compiler). In this way coherency is maintained by message-passing coherency rules in the same fashion that inter-cluster coherency is maintained. The task of determining data movements has to be done for inter-cluster communications already, so we maintain that having to determine these movements for intra-cluster communication adds no significant burden to the programmer (compiler). The burden of determining what kind of communication is occurring (inter or intra) and maintaining that type of communication is taken off the programmer. In this way a single uniform communication API is presented to the programmer. This uniformity of communication commands is not always the most efficient method, since an extra layer has to be added to intra-cluster communications to make it appear message-passing in nature. Currently, this uniform API is not implemented. However, the structure is present in existing code to easily add this functionality to the library support.
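A sketch of what such a single library entry point could look like is shown below. Since the uniform API is not implemented, everything here is illustrative: the helper routines stand in for the intra-cluster flush-and-sync mechanism and the inter-cluster send-request mechanism, and the cluster mapping is an assumption.

#include <stddef.h>
#include <stdint.h>

/* Placeholder stand-ins for the two underlying mechanisms. */
static int flush_and_sync_send(int dst, const void *buf, size_t n)
{ (void)dst; (void)buf; (void)n; return 0; }

static int post_inter_send(int dst, const void *buf, size_t n, uint32_t id)
{ (void)dst; (void)buf; (void)n; (void)id; return 0; }

/* Assumed mapping of PP number to cluster (four PPs per cluster, as in
   the arbitration example of Section 3.2). */
static int cluster_of(int pp) { return pp / 4; }

/* One call for any communication: the library decides whether the
   destination is on the same cluster and picks the mechanism, so the
   distinction stays invisible to the programmer (compiler). */
int proteus_send(int my_pp, int dst_pp, const void *buf, size_t n, uint32_t id)
{
    if (cluster_of(my_pp) == cluster_of(dst_pp))
        return flush_and_sync_send(dst_pp, buf, n);  /* intra-cluster */
    return post_inter_send(dst_pp, buf, n, id);      /* inter-cluster */
}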

3.2 Inter-Cluster Communication Specifics

While communication on a cluster is controlled between two PPs or by the CC in a uniform way, communication between clusters within the Group is controlled by the GCI. The movement of data among clusters is synchronized, and each transmission is to be completed within a specified communication cycle which is defined before run time. When the term communication cycle is mentioned, it applies only to inter-cluster communications, since intra-cluster communications are asynchronous through shared memory and do not use a defined cycle. The control and data paths for inter-cluster communication are separate: control is handled over the VME backplane while data is transferred via the crossbar during each cycle.

The Communication Cycle: To understand more about how communication occurs on a cycle basis, one needs to understand the actions taken during a cycle. At every synchronization cycle the CC reads transmission and reception grants from the VME register mailboxes that the GCI set up in the last cycle and sets up the communications accordingly. It then reads the requests that PPs have posted and forwards one of them to the GCI. Only one request is forwarded per cycle. The CC and GCI then reach a synchronization point and the CC is through with its preliminary work. The GCI receives the connection requests from the cluster controllers during this cycle and arbitrates the connections for the next cycle. The GCI then writes the VME register mailboxes with grant information for the next cycle. Any remaining requests that could not be serviced are queued by the GCI. The actions taken by the CC and GCI during a cycle are shown in Figure 3.

Four signals are used in defining a cycle. The baseline signal for a cycle is the End-Of-Transmission (EOT) signal; all other signals are defined in relation to this signal. The other signals are the start of reception (RX), the start of transmission (TX), and a signal used to switch the crossbar (SW). Overhead is added by the synchronization time for the TX/RX circuitry of the optical links. The optical links we use have a rather large synchronization time, resulting in a total overhead (latency) per cycle of 60 microseconds. With 64 kilobyte data packets, the effective data rate works out to about 15.98 megabytes/sec per link with a cycle time of 4.1 milliseconds. This represents a 71 percent efficiency of the peak data transfer rate of 20 megabytes/sec. The 4.1 millisecond cycle time is necessary for those times that the communication resources are most busy. The complete exchange example in the next chapter is one of those times.

In this case, two processors on a cluster each complete a send and a receive of 64 kilobytes of data during the cycle, and the DMA controller must transfer the data between the DP and the shared memory in addition to driving the serial links, so the DMA accesses share the memory bus with the PPs. Sample timing tests of this case, with the DMA given priority over the other shared memory accesses and the PEs not otherwise accessing the DP, show transfer times of 3.125 milliseconds, or about 3.325 milliseconds once the communication overhead is included; contention for the shared memory bus results in a longer cycle time. The resulting communication cycle timing, with a cycle of 3325-4100 microseconds, is shown in Figure 4.

[Figure 3: A breakdown of the communication cycle overhead. Cluster Controller actions: check for communication faults, read the PP communication request, allocate resources, write the request and status to the GCI, and release to the GCI. GCI actions: wait for information from the CC, read the requests, check for communication faults, arbitrate the links, write the grants and status for the next cycle, and load the crossbar.]

[Figure 4: The communication cycle timing, showing the EOT, RX, and TX signals over the 3325-4100 microsecond cycle.]

Arbitration: Priority among the PPs for forwarding send requests to the GCI is determined by the CC on a round robin basis. For instance, if the priorities for a given cycle were (PP1, PP2, PP3, PP4) and PP1 had a send request forwarded to the GCI, the priorities for the next cycle would be (PP2, PP3, PP4, PP1); if PP1 had no request, the next PP with a pending request is chosen. Requests arriving at the GCI are placed into a request pool and arbitrated on a first-request/first-serve basis, similar to the CC-PP round robin scheme. The GCI arbitrates the requests for the next cycle, and if a conflict exists for a given connection, the losing request stays in the queue and is arbitrated again in a later cycle.
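The per-cycle division of labor between the CC and the GCI described above, and summarized in Figure 3, can be sketched as follows. Every function name is a placeholder for an action named in the text, not part of a real Proteus API.

/* Placeholder actions; each stands for a step listed in Figure 3. */
static void read_grants_from_vme_mailboxes(void) {}
static void set_up_granted_communications(void) {}
static void forward_one_pp_request_to_gci(void) {}
static void synchronize_with_gci(void)          {}
static void wait_for_cc_info(void)              {}
static void read_requests_from_all_ccs(void)    {}
static void arbitrate_links_for_next_cycle(void){}
static void write_grants_to_vme_mailboxes(void) {}
static void load_crossbar(void)                 {}

/* Per-cycle work of a Cluster Controller. */
void cc_cycle(void)
{
    read_grants_from_vme_mailboxes();  /* grants the GCI wrote last cycle */
    set_up_granted_communications();   /* program the DP/DMA transfers    */
    forward_one_pp_request_to_gci();   /* round robin pick, one per cycle */
    synchronize_with_gci();            /* preliminary work is done        */
}

/* Per-cycle work of the Group Communication Interface. */
void gci_cycle(void)
{
    wait_for_cc_info();
    read_requests_from_all_ccs();
    arbitrate_links_for_next_cycle();  /* losing requests stay queued     */
    write_grants_to_vme_mailboxes();   /* read by the CCs next cycle      */
    load_crossbar();
}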

Optimal Dual Port Performance: The method of involving the DP in the inter-cluster communications can impact efficiency in terms of shared memory access. This shared memory access can result in contention for the shared memory bus, thereby degrading performance. The following code fragments show how to optimize DP access to minimize total memory accesses. "A", "B", and "C" are arrays of elements with size equal to the data packet and reside in shared memory. "DP" is the DP packet residing in DP memory. All operations occur for all elements in the arrays. Note that in the efficient case no unnecessary data movements are performed.

Efficient DP Use For Receive:

A = B + DP
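In C, the efficient receive pattern above might look like the sketch below, in which the DP packet is consumed directly in the computation so that each DP element is read exactly once; the element type, packet size, and function name are arbitrary.

#define N 1024   /* elements per data packet (illustrative) */

/* Efficient receive: A = B + DP, using the DP packet in place.  No copy
   of DP into shared memory is made, so total memory accesses are
   minimized. */
void receive_efficient(float *A, const float *B, const volatile float *DP)
{
    for (int i = 0; i < N; i++)
        A[i] = B[i] + DP[i];
}

/* The inefficient alternative would first copy DP into a shared memory
   temporary and then compute from that copy, doubling the DP-related
   memory traffic. */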