Runtime Compilation Techniques for Data Partitioning and ... - CUCIS

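Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse*

Ravi Ponnusamy†‡, Joel Saltz†, Alok Choudhary‡

†Computer Science Department, University of Maryland, College Park, MD 20742

‡Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244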

ecutor In

this paper,

we describe

compiler

can deal with

The first

mechanism

two new ideas

irregular

invokes

cedure via a set of compiler the user

to use program

tivity,

spatial

load.

The second

many

location

arrays

that

on-processor

buffer

sults for

these

The directives

elements

loop

from

loops

allow

We

it

data

copies

piler

generates

of when re-

with

arrays

These tern

is determined

time.

In these

by variable

movement

programmers

work,

map

sors. The

of data

to carry

can also be generated in a process

we call

per, we present where

compilers D

methods

Fortran

On distributed array

memory

The

inspector

cal memory element

related

loops with

by transforming

of code:

partitions

an inspector

loop

iterations,

off-processor

by a loop,

and builds

the original allocates

mation

distributed

tion.

Ia-

of graph

ments

a communication

needed

to produce connectivity,

and information

computational

produces

represent

this

*This work was sponsored in part by ARPA (NAG-1-1485), NSF (ASC 9213821) and ONR (SC292-1-22913). Author Choudhary was also supported by NSF Young Investigator award (CCR-9357840). The content of the information does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

$1.50

ation

of

standardized

tioned.

produces


© 1993 ACM 0-8186-4340-4/93/0011

load.

code that, the

that

a data

have been de-

support

and compiler

users to specify can consist

spatial

of array

array

information,

elements

also generates

structure

that

elewith

the compiler a standardized

and

a (user

func-

of a descrip

location

generates to

the infor-

distribution

Based on user directives above

parti-

[24, 25, 19, 17, 2, 13].

associates

at runtime,

we

to each pro-

the different

a customized

representation

The compiler

of the mesh.

communication,

with

this information

For incompu-

in such a problem

elements

the runtime to allow

manner.

heuristics

associated

in the to par-

does not have a use-

interprocessor

years promising

aris as-

locations

pattern

array

need

data

of an irregular

structures

arbitrary

arrays

of processors.

advantageous

frequently

the data

needed

In our view,

tion

array

the in-

distributed

in an irregular

We have implemented

and an execu-

time

array

and memory

have been studied

transformations

data

are called

to the connectivity

and tradeoffs methods

large

the nodes

minimizes

In recent

veloped

indirect

any indirection

of distributed

arrays

need to assign

tioning

architectures,

reference

each inspector

memories

It is frequently

we partition

cessor.

coded usto Fortran

[28].

for each unique

accessed

problems

closely

may

for

arrays

processor

machine.

in a way that

implementation

make it possible

irregular

accesses can be handled

loop into two sequences tor.

that

com-

a record

have written

since the last

local

mesh are numbered

When

scheme,

machines,

the way in which

tational

pa-

to see whether

storage

ful correspondence

compiler In this

record

data

term

distributed

stance,

preprocessing [23].

and a prototype

tition

schedule

may

In this

iter-

The

maintains

is used to indirectly

between

to specific

distributed

of proces-

memory

compilation

handle

and

Long

signed

at run-

preproces-

memories

out runtime

extensions

only

loop

off-processor

locations).

array.

memory

partitioned

mys.

access pat-

out

structures the

techniques

to efficiently

[10] or Vienna

known carry

by a distributed run time

we demonstrate

ing a set of language

data

between

code needed

the data

values

cases,

sing to partition the

problems

schedules, associates

buffer

often results

was invoked.

In distributed

In sparse and unstructured

loops). that

computed

or intrinsic

that

to handle

method

at runtime,

90D loop

[26] and KALI

(irregular

previously

on-processor

runtime

arrays

that

code that,

to be partitioned

Introduction

compiler

may have been modified

spector 1

reuse

In the ex-

of transformation

conservative

array

data.

and computation

communication

distributed

checks this

ARF

kind

information

a Fortran

another

compiler

to

(e.g.

to a distributed

implementation.

The

referenced

possible

it is pos-

off-processor

communication

a simple

partitions,

performance 90D

propose inspectors

with

[21].

indirectly

ation

copies

a Fortran

with

from

partitions,

data

out

in

that

required

the actual

[16] used this

makes

inspectors

iteration

We present

mechanisms

that from

off-processor

locations).

pro-

connec.

method

results

schedules,

are carried

Center

13244

to prefetch phase,

compiler

and computational

to recognize

computed

associates

graph

conservative

a compiler

communication

information

mapping

to describe

of array

to reuse previously

(e.g.

directives.

HPF

effectively.

a user specified

is a simple

cases enables

sible

by which

computations

Choudhary‡

Architectures

Syracuse,

schedule

and

Reuse*

parallel

20742

Abstract

Partitioning

Alok

Northeast

Department

of Maryland

Park,

Data

Schedule

Ponnusamy†‡

computer

for

then

specified)

code that,

at

is used to partition

passes parti-

runtime, loop it-

Permission to copy without fee all or part of this material is granted, provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

C Single statement loop L1
FORALL i = 1, N
   y(ia(i)) = x(ib(i)) + ... x(ic(i))
END FORALL

C Sweep over edges: Loop L2
FORALL i = 1, N
   REDUCE (ADD, y(end_pt1(i)), f(x(end_pt1(i)), x(end_pt2(i))))
   REDUCE (ADD, y(end_pt2(i)), g(x(end_pt1(i)), x(end_pt2(i))))
END FORALL

Phase A: Generate GeoCoL Graph, Partition GeoCoL Graph (Data Partition)
Phase B: Generate Iteration Graph, Partition Iteration Graph (Loop Iteration Partition)
Phase C: Remap Arrays
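As an illustration of how an irregular loop is split into an inspector and an executor, the following Python sketch simulates the read side of loop L1, y(i) = x(ib(i)), in a single process. It is a minimal sketch under stated assumptions, not the CHAOS or Fortran 90D implementation: the block distribution, the function names (owner_and_offset, inspector, executor), and the use of Python lists in place of message passing are all hypothetical.

```python
# Single-process simulation of the inspector/executor idea for y(i) = x(ib(i)).
# Assumptions (not from the paper): block distribution, all names, list-based "messages".

def owner_and_offset(g, block):
    """Map a global index to (owning processor, local offset) under a block distribution."""
    return g // block, g % block

def inspector(ib_local, me, block):
    """Scan one processor's indirection values; build a gather schedule and local indices."""
    schedule = []      # distinct off-processor elements to fetch, as (owner, offset)
    slot = {}          # (owner, offset) -> position in the receive buffer
    local_ref = []     # per reference: index into [local block followed by buffer]
    for g in ib_local:
        p, off = owner_and_offset(g, block)
        if p == me:
            local_ref.append(off)                  # on-processor reference
        else:
            if (p, off) not in slot:               # fetch each remote element only once
                slot[(p, off)] = len(schedule)
                schedule.append((p, off))
            local_ref.append(block + slot[(p, off)])
    return schedule, local_ref

def executor(x_parts, me, schedule, local_ref):
    """Gather the copies named by the schedule, then run the loop body on local data."""
    buffer = [x_parts[p][off] for p, off in schedule]   # stands in for communication
    work = x_parts[me] + buffer                         # local block, then buffer copies
    return [work[r] for r in local_ref]
```

The inspector touches only the indirection array, so its output (the schedule and the translated indices) stays valid for as long as ib is unchanged; the executor can then be run repeatedly against new values of x.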

piler

1: Example

Irregular

To our knowledge, is the first

to provide

the Vienna specify

this

Fortran

trol

Loops

kind

mations

describe

and language

statement

pendencies dition,

allowed

accumulation,

irregular

array

indirection

D syntax

pendencies. out

with

loop

second

90D

This time

compiler Our

paper technique

The first

concurrent

ad-

CHAOS; The

being

iteration

partitioners

an overview

transformations

methods

developed

and describe

fluid

as follows.

ture

in Sec-

the

effort.

We describe

extensions

data

demarcated

is called

irregular

(Figure

mapping

project

is called

the CHAOS

of the

earlier

them

first

on

support,

three

library. PARTI

(the

GeoCoL

onto

processors.

sections.

arrays

GeoCoL

graph

data

are

data

partitioned

structure)

with

decomposed

using

a particular

structure

calculates

five

of these steps here, and will

in later

distributed

distributed involves

steps in the figure

and computations

description

in detail

the

problems

our runtime

2). The

data

a brief

associated

the

data

a

access

set of loops.

is passed

how data

in

2, CHAOS data strucThe

to a partitioned.

arrays

should

be d~

In Phase B, the newly calculated array distributions are used to decide how loop iterations are to be partitioned

among

takes into

data

processors. of arrays

and

In Phase

5 we

(1) coordinate the storage

the

involves

we use to con-


calculation In Phase

and loop

D, we carry

out

interprocessor

C we carry

a shared

generating

account out

loop

the actual

iterations. the

preprocessing

data name

communication

needed

movement,

of, and access to, copies

and (3) support

struc-

This

access patterns.

remapping

In Secdata

In Section

standard

of

the run-

schedules.

used to couple

to deal with

tributed.

code

We set the context

support of clearly The

is a superset

using

patterns The

Univertemplates

3, we describe

to compilers.

the language

dy-

of the For-

of the compiler generated parallelized version.

generate

work

known regular manner. In Phase A of Figure procedures can be called to construct a graph

and compiler

on simple

steps

Initially,

We use this

as part

phases.

support

concurrent machines

discuss

to

runtime

of a sequence

library

We provide

we carry

is similar

by Syracuse

results

of our compiler

which

loop codes.

to save communication

loop

in which

procedures

the procedures

related 8.

[21, 26, 23].

concern

de-

efficient

consist

CHAOS

major

For-

sections.

2. In Section

6 we

CHAOS

computational

Solving

directly

without

computational

dynamics

our

discuss

in Section

of

the runtime

library

loop is a single

references

second

our runtime

is organized

in Section

4 we describe

ture

or

that

de-

level of

1, we employ

is a loop

The

implementation

tion

present

that

array

in the following

reveal that the performance is within 10% of the hand the work

problems

We also assume that is indexed

loop

We have implemented sit y [9].

(e.g.

array

Overview

We have developed

irreg-

carried

of a single

in unstructured

to demonstrate

tran

etc).

two loops.

codes and molecular

transformations

Overview

In Section

the performance

the

of a single

loop

side reductions

in Figure

indirect

operations.

those loops found namics

only

that

as a result

shown

to depict The

reduction

2

memory loops

loop

7 and we conclude

2.1

transfor-

index.

In the example statement

hand

a distributed

partitioning. to characterize

We briefly

tion

runtime

to provide

We assume

the

max, min,

methods.

data

in

described

compiler

in the context

accesses occur

with

by the loop tran

are left

The

required

where

runtime

performance

Problems

Fortran.

support,

out

loop

Irregular

a user can also

strategies

above.

accesses are carried

multiple

2: Solving

compiler-linked

of our

com-

We also note that function.

extensions

described

Fortran

definition,

to Vienna

the runtime

new capabilities ular

of support.

transformation

here can also be applied

described

memory

distribution

and compiler

We will

the implementation

distributed

[28] language

a customized

support

Loops

Figure

present in this paper

Loops

E

Execute

FORALL

erations.

Remap

>

D

preprocess

x(end_pt2(i))))

Figure

Iterations

y(end_pt1(i)), Phase

f(x(end_pt1(i)),

and Loop

of off-processor

space.

This

to

(2) manage data,

preprocessing

schedules,

translating

array

indices

to access local

and allocating data.

buffer

distributed

processor

the

Finally,

earlier

of off-processor

to retrieve

data-sets

memories.

from

copies

....

data

space for copies of off-processor

It is also necessary

irregularly tion

local

phases

globally

from

indexed

the numerous

E we use informa-

to carry

out

CHAOS

and PARTI

adaptive

procedures including

necessary

computational

dynamics

codes

distributed

fluid

and

memory

dynamics

a prototype

linear

codes,

compiler

S5 ALIGN

solvers,

S6 . . .

molecular

[23] aimed

set

values

Language

decomposition

lar problems While

will

directives

Sup-

be presented

our work

will

sions could pilers

Fortran

D and HPF

from

a rich set of data decomposition

a definition

of such language

that

These

languages,

users explicitly

Fortran

D can

define

In Figure

attributes

which

size, dimension titioned

array.

processors.

using two declarations. POSITION. ity

declaration statement onto

and specifies

the user with

In addition,

a distribution array

Fortran

D statement

with

ALIGN.

In statement

titioned

equal

to each processor. distribution

between

The

which

Fortran-D

for the user to couple of partitioning from

scratch

separately

constructs process.

heuristics

is no standard

by run-

are not rich of the map ar-

While

available,

can represent

interface

3

dis-

gives the distribution

the generation

compilation

and the application

there

coding

a significant

between

the

are such

effort.

partitioners

codes.

D

Communication

3

called

Schedule

The cost of carrying in Figure duced

is produced

Reuse

The

by the inspector Compile

We propose

second

is to be mapped

analysis

schedules

a simple

once

and

needed

conservative

us to reuse the results an inspector

for loop

pro-

then

used

to reuse inspecupon

method

results

B, C and D

the information

is touched

cases allows from

(phases

when

is computed

time

tor communication

dimensional-

out an inspector

2) can be amortized

repeatedly.

is an executable

onto

a distribution

sized

In statement reg.

Array

irreg

An irregular

using

an integer

array;

ement

i of the

distribution

when

the



The

assigned

map is aligned (in

from

in

[12,

that

7].

in many

inspectors.

L can be reused

The as long

is set equal

compiler

that

L have

generates

of when

intrinsic

may

scheme,

in loop

L have

the inspector

indirection

been

arrays

modified

since

associthe last

invocation.

record

this

referenced

since the last time

and

loop

code that

a Fortran have

used to indirectly

is specified

arrays

is no possibility with

inspector

is to be partitioned

is assigned

there ated

is par-

be used to specify

distribution

map(i)

irreg

reg

of data unchanged

was invoked,

A

S3, of Fig-

one block

S5, array

map will

S7) how distribution processors.

with

using

distributions remained

specify

decompositions

S4, decomposition blocks,



regu-

processors.

In statement

3, two of size N each, one dimensional

statement

map array

in Figure

the irregularly

has to be generated

a partitioned.

There

fixes the

a user can explicitly

is associated

are defined.

with

The

depicted

how to partition

elements.

a Fortran

a choice of several

is to be mapped

specific

into

the declarations

as:

D provides

lar distributions. how

array. of irreg

partitioners

processors.

Fortran

ure

in

is DECOM-

template.

a template

tributed

a wealth

is to be par-

declaration

Distribute how

Distribution

the significant

fixes the name,

is DISTRIBUTE.

with

ray to the program

require

a template

the array

array

D Irregular

it is not obvious

enough

an irregu-

array

A distribution

Decomposition

be found

The distribution

The first

and size of the distributed

Fortran,

specify of such

mapping

irreg

3: Fortran

pattern ning

is to be distributed.

is used to characterize

and way in which

between

may

D, one declares

of a distributed

exten-

specifications;

of distributed

some

irreg(map)

x,y with

difficulty

is that

D and Fortran

specified,

explicitly

an example

In Fortran

a distribution

to

partition

3, we present

declaration.

how data

be used

lar inter-processor

extensions as currently

The

D.

and com-

Vienna

Fortran

using

....

of Fortran

of languages

90) provide [10, 8].

of Fortran

language

and HPF.

(evolved

array

for irregu-

in the context

and analogous

be used for a wide range

such as Vienna

Fortran

in the context

be presented

D, the same optimizations

we employ

map

at

Figure The data

of

S7 DISTRIBUTE

Existing

reg

..

S8 ALIGN

of

reg(block)

map with

method

multiprocessors.

Overview port

2.2

reg(N),irreg(N)

S4 DISTRIBUTE

have been used in a vari-

sparse matrix

map(N)

S3 DECOMPOSITION

computation. et y of applications,

x(N),y(N)

S2 INTEGER

local

in Phase

the

S1 REAL*8

but

written reference

each inspector

at runtime

90D loop’s

to a distributed another checks

maintains

statements array

distributed this

runtime

a

or array that

is

array.

In

record

to

see whether any indirection arrays may have been modified since the last time the inspector was invoked.
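The conservative reuse test described above can be sketched in Python as follows. This is an illustrative sketch only: the class names (DistArray, Runtime), the per-array stamp field, and the schedule_for method are assumptions made for the example, and it tracks only indirection arrays, whereas a full scheme would also have to confirm that the distributions of the data arrays are unchanged. The counter nmod follows the name that appears in the text.

```python
# Sketch of conservative schedule reuse driven by a global modification counter.
# Assumptions (not from the paper): class names, stamp field, dictionary-based bookkeeping.

class DistArray:
    """Stand-in for a distributed array descriptor carrying a modification stamp."""
    def __init__(self, values):
        self.values = list(values)
        self.stamp = 0          # value of nmod when this array was last written

class Runtime:
    def __init__(self):
        self.nmod = 0           # count of writes that may have modified a distributed array
        self.saved = {}         # loop id -> (stamps of indirection arrays, schedule)
        self.inspector_runs = 0

    def write(self, arr, i, v):
        """Every write bumps the global counter and stamps the array, whatever the value."""
        self.nmod += 1
        arr.stamp = self.nmod
        arr.values[i] = v

    def schedule_for(self, loop_id, ind_arrays, build):
        """Reuse the saved schedule unless an indirection array may have changed."""
        stamps = tuple(a.stamp for a in ind_arrays)
        hit = self.saved.get(loop_id)
        if hit is not None and hit[0] == stamps:
            return hit[1]                       # no recorded write since the last inspector
        self.inspector_runs += 1                # otherwise rerun the inspector
        sched = build(ind_arrays)
        self.saved[loop_id] = (stamps, sched)
        return sched
```

The check is conservative in the sense that a write storing the same value still invalidates the saved schedule; it never reuses a stale schedule, at the cost of an occasional unnecessary inspector run.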

to p, el-

to processor

In this

P.
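To make the role of the map array concrete, the Python sketch below shows one way a statement such as DISTRIBUTE irreg(map) could be interpreted at runtime: map(i) = p assigns element i to processor p, and a lookup table records where each global element lives. This is a hedged sketch, not the paper's runtime; the function names (build_translation_table, distribute) and the rule that local offsets are assigned in order of increasing global index are assumptions.

```python
# Sketch: turning a map array into per-processor storage.
# Assumptions (not from the paper): function names and the local-offset ordering rule.

def build_translation_table(map_array):
    """For each global index i, record (owner, local offset) implied by map_array[i]."""
    next_free = {}                 # processor -> next unused local offset
    table = []
    for p in map_array:
        off = next_free.get(p, 0)
        table.append((p, off))
        next_free[p] = off + 1
    return table

def distribute(values, table, nprocs):
    """Place each global element into the local storage of its owning processor."""
    parts = [[] for _ in range(nprocs)]
    for v, (p, off) in zip(values, table):
        assert off == len(parts[p])    # offsets were handed out in increasing order
        parts[p].append(v)
    return parts
```

For instance, with map = [1, 0, 1, 1, 0] on two processors, elements 1 and 4 land on processor 0 and elements 0, 2 and 3 on processor 1.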

presentation,

an inspector


for

we assume

a forall

loop.

that

we are carrying

We also

assume

that

out all

indirect

array

the form loop

references

y(ia(i))

index

associated

A data

with

(among

of the

array

cess to the array’s a global data any array

with

Note

that

number

to the distributed of times

that

writes

array,

data

data

structure

the current current nmod.

modifies

In this

out,

first

perform

L has m data arrays,

ind~,

carried

out,

array.

arrays 1
