Automation of Data Traffic Control on DSM ...

their efficiency on the NAS Parallel Benchmarks. We also present a tool which automates detection of the constructs causing data congestions in Fortran array.
Automation of Data Traffic Control on DSM Architecture

Michael Frumkin, Haoqiang Jin, Jerry Yan¹

Numerical Aerospace Simulation Systems Division
NASA Ames Research Center

Abstract Design

of distributed

distribute

data

for example, of parallel a good paper

having

application

we discuss

Benchmarks.

in Fortran

improving

data

are

in the

computer

OS and

very

the

to reconsider traps and

to avoid

traffic are

with

problems

difficult

factor

such

about

accessing

20%

Several

point

all functional

operations Table 1.

detect

the

1 M/S {frumkin,

Traffic

data

blocking,

congestions.

efficiency

the

the

user

data

data

on

placement,

the

NAS

constructs

on code

In this Parallel

causing

data

transformations

for

code

and

data

location

with

the

computer

a machine

trashing

and

padding

data

Many

performance

the

of the

new

misses

it runs

the

and

on

thread

[13]),

has

simple

other

data

interference

similar

of CFD (see

user

to identify

The

looking

and

The

difficult

privatization.

TLB

location

compiler

and

not

implementations

machine

data

the

architecture. are

constructs

best

on

architecture

variables

excessive

to express

depends

sharing

programming

Even

allow

with

false

and

locality,

cure.

which

the on

as cache

as poor

Control

constructs

varies

in performance.

contribute units

to data code

T27A-2,

to low efficiency

operations

are stalled

Approach

traffic

detection

to achieve

to understand

data

their

advises

user

to

using,

development

However,

the

duty

may

codes

have

achieve

spending

80%

of

data.

factors

of floating

data

dimensions

of peak

including evaluate

the

programs

simplifies

processors. from

to avoid

and

a result,

of his

array

3-4 difference

time

As

such

to diagnose

only

keep

access

a few

automates

codes

greatly

requires

and

programming 2.

performance

on

techniques,

of Data

for accessing

in memory

architecture

from

application.

few explicit

time

parallel

techniques

control

oriented

memory

user

develop

computers

a tool which

in the

Significance

There

size

array

traffic

liberates

performance

DSM

of such

present

computers incrementally

DSM

use various

page

We also

congestions

good

a number and

allows

threads.

on

and

(DSM)

and

Java

scalability

transposition

1

or

programs

program

memory

processors

OpenMP

flow in the data

shared

across

does busy.

not

allow

Second,

and

during

computation

traffic

optimization.

constructs NASA

suffering Ames

codes.

First,

an optimal

a larger

factor

waiting

for data,

The

from Research

of CFD to provide

data

first

step

congestions

Center,

Moffett

the balance mixture

comes

from

the fact

see Example in addressing and

identify

Field,

of the

number

of instructions that

the

1 in Section this the

CA

challenge data

94035-1000;

to many 2 and is to

congestion e-mail:

hjin, yan}@nas.nasa.gov

2 Under explicit constructs we mean such statements as the "register" type qualifier in C or data distribution directives in HPF.

causing

mary

Data

(TLB)

loss Cache

misses,

to resolve

for

for reduction data

where

known

for some

known

use

possible

so far

padding

at compile allow

tools

leave

code

and

control

of the This which

We demonstrate suggest

a cure

and

LU

with

rhsz

average.

of NAS and

zsolve

nor

channel Even

For example,

it is

Parallel

corresponding

fixing

standard

with to

SP of NAS

for

size

architecture

then

way

are

implemented

the

codes

in the

problem

for

reporting

compilers

level optimizations and

others.

however, analysis.

such

On the

cannot

data

for

as loop level

perform

Many

problems

during

events,

the

inter-

compilers

deep

code

types

of analysis

searching identifies

can

not

tool's

and

ability

using

hardware

the

problems The helped

and

to resolve traffic

anal-

are

not

counters

construct.

Perfometer

user

on tool

have

was

to improve

the

data

time

the

data

simulated able

user. solutions

CFD

performance

The

inserts

poor

traffic

to resolve

affinity

traffic. tool

been

code and in nature

and suggests

warnings about

problematic three

to the

Some

data-to-computations

with

compile

receives

them

problems and

problems at the

to identify

and

perfex

affinity

possible

be evaluated

Benchmarks. operators

ways data

data-to-data

about

on

of a program

These tools allow to instrument the However these tools are diagnostic

and

user

relies

execution

including

a tool which

performance

Parallel

this

nest

traffic

[6, 12]. anomalies.

are evaluated time.

the

of the

data

problems

the

code. These statements code constructs at run

are

Compilers,

tool analyses

informs

statements

in BT and performance

facto

prefetching

the

of these counters constructs with

them.

traffic

of events

analyzing

we present

and

misses.

computer

traffic. page

in memory

TLB

have

data

of the

be applied.

zsolve

trans-

at all. statistics

analysis

de

or dependency

to identify

In this paper for resolving

time

and

worse for

to improve

contention

should

code

problems

optimization

target

Buffer remedy

placement/migration these

developed

in the

and

include

privatization.

to collect

built on the top identify the code

data

pipelining,

approach

for collecting

in the

and

or page

invalidations

NPB

be either

of publications

to resolve

reason

can

Pri-

[11].

optimizations

interprocedural

size

been

of 3-4

of that

tools

remedy

optimization,

of rhsz factor

Neither

to improve software

This

transformations

[7].

and

Lookaside a proper

for nonexpert

have

in spite

These

as full

Another which

cf.

Translation

have

of cache

paper:

is to choose

step

as page

of pages

computations version)

optimizations

such

placement

in this

(SC) misses,

such

for cache

metrics

second

In a number

grouping

easy

congestion

code.

a few techniques

reduction

in compilers

loop fusion,

usually

and

that

architecture.

change, ysis

time

in x-direction,

Many

traffic.

it is not

(OpenMP

improvements target

data

for

The

to the

and how the appropriate

Benchmarks been

them

four

Cache

(CI).

environment

misses,

in hand

identify

operators

apply

data

transposition

use

Secondary

program

include

techniques

have

and

of TLB

We

Invalidations

[4, 9, 10] and

techniques

these

misses,

Cache

controlling

addressed

and

(PDC)

and

or changing

Methods been

performance.

congestions

formation, mechanisms.

These

of the

applications traffic

of the

codes

in the

performing

constructs

data

data

and BT,

to SP

problems by 27%

in

2

Automation

of Detection

For controlling

data

puter

hierarchy

memory

and

traffic

cases

offsets, cache

and

however

and

data

data

access

invalidations.

on cache The

the

processor

the

memory

and

typical

such

user

across

can

be complicated

target

computer

in terms

of cache

metrics

shared

to the

awareness

a help

of the

in the

architecture. as cache

the

comIn

parameters,

such

data,

variations

specifics

of a tool detecting

data

data

array

misses

traffic

order

and

depends

of the

with the

code

execution

placement

and

data

could

advising

streaming

for avoiding the

coherence

tool intended

architecture

congestions

for avoiding

size for reducing

by cache

which

traffic

data

page

caused

of the computer

grouping

on initial

an optimal

interference

data

on data

reuse,

on choosing

problem

as accessing

movement

movement

by simple

is sensitive

a tool can advise

thread

in the

be formulated

Problems

on data

of such

expertise

characterized

and

by increasing access,

on reducing The

with

Such

information

Details

can

and

Traffic

threads.

to the

reduced them.

to have

require

movement

protocol

requirements

to resolve

may

strides

by different

be greatly

has

In a few cases,

coherency

of statements

user

in his application.

machine-dependent

many

the

of Data

ways

through

contention

number

of TLB

in

misses

issues.

to advise

is shown

in the following

example. Example version)

are shown

nested

loop

pages the

1. The

right

has and

pane,

execution Figure

time

and

tating

number

loops

of pages

of increase

Placement

in the

such Tool),

code

a tool

see

with

[2].

affinity key

affinity.

Two

data

run.

For a pair

stream the

often

affinity

data

items.

The

and

is the

the program

ability

Grouping improve

relation

affine

geometry

Cache

Miss

curve cache

(serial in the

number

2.

Merging

see

utilization, point

first

of memory

in Figure

expressions,

are

the

of the

Equation

data

Figure

and

the

1, total

instructions,

program that

if both

see

the

self interference

relation possibility lattice

there

of grouping array

the

traffic

with

affine that

exe-

statement

the

same

the

value

of

into a continuous

latency. ways

array

and datacapa-

instruction

groups many

and

control

loop nest

memory

are

ALIGN

affinity

same

referred

organizing

by hiding

of the

at the

anno-

data-to-data

HPF

data

in the same

and and

through

used

Align-

for automatic

data-to-data

elements

together

performance

are

Data

to identify

automatic

used

array

designed

affinities

it with

(Automatic

is able

to extract

of arrays

items

was tool

the

tool

affine

between

to ADAPT

The

express

of the

is a many-to-many

In [3] it is shown

on the

computations

of floating

ADAPT a.

for enabling

is a correspondence

loop index.

lhsz

number

Benchmarks

a large

of the

features

directives

affinity

the

see

improves

total

Originally

HPF

to-computations bilities.

relation

cache,

by adding

directives.

during

of the

it touches

recalculation accessed,

Parallel

curve.

to data-to-computations

affinity

NAS

optimization since

and

DISTRIBUTE

cuted

The

of primary

the

implemented

Data-to-data

of SP from

computations

utilization

in spite

in zsolve

1, left pane. down

nested

FORTRAN

and

nests

second

decreases

We have ment

slows

a poor the

2 lhsz_t

two

in Figure

actually

and first

first

In general,

to group

elements

affine depends

is a set of solutions

of the

[4].

a ADAPT is built on the top of CAPTools [8]. It uses a CAPTools generated data base, CAPTools code analysis and some CAPTools utilities.

lhsz

      do j=1,ny
        do i=1,nx
          do k=1,nz
            cv(k)=ws(i,j,k)
            rhon(k)=SFunction(rho(i,j,k))
          end do
          do k=1,nz
            lhs(i,j,k,1)=0.0d0
            lhs(i,j,k,2)=-dttz2*cv(k-1)-dttz1*rhon(k-1)
            lhs(i,j,k,3)=1.0d0+c2dttz1*rhon(k)
            lhs(i,j,k,4)=dttz2*cv(k+1)-dttz1*rhon(k+1)
            lhs(i,j,k,5)=0.0d0
          end do
        end do
      end do

lhsz_t

      do k=1,nz
        do j=1,ny
          do i=1,nx
            lhs(i,j,k,1)=0.0d0
            lhs(i,j,k,2)=-dttz2*ws(i,j,k-1)-dttz1*SFunction(rho(i,j,k-1))
            lhs(i,j,k,3)=1.0d0+c2dttz1*SFunction(rho(i,j,k))
            lhs(i,j,k,4)=dttz2*ws(i,j,k+1)-dttz1*SFunction(rho(i,j,k+1))
            lhs(i,j,k,5)=0.0d0
          end do
        end do
      end do

1: Data

serial,

saves

a large large

Traffic

number

(right

control

affinity

most arrays

problems

are

profiles

of both

results

from

stencil

affinity

nest

along

over

all directed

relations

In order

ADAPT

relation

each

lists

is one-to-many

set of memory

to a statement affinity

graph data

locations c if the

has C and

affine

to it.

of program

referenced

in spite

nest

in the

affinity grids.

in different

The

relations and

allows

A

between

These

elements)

rule

an array

of

control

relations

used

of u used

a

statement. statement.

flow graph.

q from

vertices

program

program

statements,

in the

at address

D as the Many

data

all elements

We represent

set

datum

chain

time

by the

constant

and

rearranging

involved

discretization for arrays

pages By

we call

statements

to propagate

The

u forms

union the

an

of these

nest

for computation

affinity of each

mapping.

affinity. C be the

arrays

creates

2.

in each

dominated

nest

to an array

of q and Let

in the

leading

element

execution

see [2]. The

relation

u.

memory line.

is one-to-many

affinity

ordering

many

in Figure

with

of NPB2.3-

cache

shown

the

The

q and

graph.

path

loop

improving

between

paths

between

Data-to-computations

the

lhsz.f

are

immediately

rule,

per

of arrays

on structured

the chain

directed

through

by a set of vectors

to deduce

uses

relation

affinity

(i.e.

from Such

pair

applications

operators

taken

word

codes

for each

blocks

in CFD

one

resolved

relations

basic

by a stencil

relations.

of the same

in affinity in the

difference

can be approximated them

be deduced

case we observe

resulted

scan

only

The

can

(left)

in SFunction.

calculations it uses

this

all arrays

common

since

used

pane)

relation

and

since

misses

code

of FPI.

dependence

statement

Original

instructions

misses

of PDC

in number

The

point

of TLB

number

increase

Optimization.

few floating

computations

with

do end

(k+ i)

do

Figure

the

do

do

end end

end

(k)

program.

d is either of the

properties

4

parts can

by a bipartite and

graph

let D be the

We say a memory operand and

or result

location of c.

an arc connecting

be expressed

called

program

program

in terms

data,

d is affine

The each

i.e.

program statement

of the

affinity

Cycles

Figure

2:

The

optimization

Time

effect

of TLB

the

performance

on

Flal

(Table

TLB

PDC

Lookaside

of lhsz

SC

Buffer)

nest.

The

Cl

and

PDC

(Primary

performance

of lhsx

a reference. The horizontal axis shows different types hardware counters. The vertical axis shows a normalized

of events number

FPI

for Secondary

stands

for Floating

stands

for the

graph.

For example,

c1 to c2. The

analysis

nests

data

and

loop

index

In most index

and

FFT

multiple

algorithm grids

points

with

nonlinear

most

nests

function

coefficients knowledge but

of the multiple of the

of the actual free

can

not

set

{-1,0,1},

Checking good

However,

cache temporal

some

traffic

(see

where

with

The of the of the

the

time.

The

core

working

of

with

enumerated

indicates

nests

I.

grids) the

specially

matrix

inside

at iteration

nests

tool

from

I is a vector

include

nest;

CI

connecting

at compile

nests

loop

of the

an exclusion

the

can be deduced thread

values property

be verified

without

unfriendly

necessary

analysis

arcs

on structured

working

coefficients

path

statements

the

known

and

order.

referenced

These

array.

further

with

If the

and

a case.

as

the

nests

at this

point.

In

representing

the

idx

multigrid

methods

where

of 2.

tool inserts the expression in the call such test run time test.

volve

element

a file; nests

the

the case

(I; idx(I))

coefficients

is not

any

function

numerical

tiles).

symbolic

from

is a direct

indexing

applications

in a precomputed

without

data

of ±dx function

of interference form

are

properties

is read

access

(CFD

is given

misses,

in any

In this

j, k) = i + 2 •j • k for kji

stored

functions linear

elements

coefficients Some

function

by

arrays.

of an array

this

Cache

c1 if there

of expressions

domain

where

idx

function

the

address

idx(i,

the

nest

Cache)

measured with the use of of measured events. Here

can be executed

be simplified by

of I with

the

access

with

are

function

and

as a pair

application

where

idx

can used

be expressed

applications

where

use

graph

is a memory

is linear

on a statement

independent

locations

in our

in our

grid

c2 are

affinity

can

cases

function

the

c2 depends

idx(I)

of the

few nests

a statement

memory

statements

SC stands

Invalidations.

c1 and

the

Instructions,

Cache

of the

and

are

the

secondary

Otherwise,

the

Point

Data

access

spatial conditions

condition

of the coefficients

knowing and

can

numerical

user

can

friendly

(see the

traffic obtains

In general,

[5] and

for cache

data the

the

patterns.

locality

only symbolic

noninterference of the

code

using

not

below) subsection

values

of the

warning

cache

friendly

be expressed

others

require

in a symbolic coefficients

at run

time.

computations in simple

can

on the

on generation

be expressed

the

computations

information

terms

be formulated

the We in[4]. and

checked. The first condition is simple: the coefficient at the innermost loop index is 1. Otherwise, nonunit stride in memory access can cause underutilization of data loaded into the cache.
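The unit-stride condition can be illustrated with a minimal sketch. It assumes a column-major array with hypothetical extents nx and ny, so that the linear index function has coefficients 1, nx and nx*ny at i, j and k. The program name, the helper unit_stride and the extents are illustrative assumptions and are not taken from the tool described in this paper.

```fortran
program stride_check
  implicit none
  ! Hypothetical array extents, chosen only for illustration.
  integer, parameter :: nx = 64, ny = 32
  ! Coefficients of the assumed linear index function
  !   idx(i,j,k) = 1*i + nx*j + (nx*ny)*k
  ! for a column-major array u(nx,ny,nz).
  integer, parameter :: coef(3) = (/ 1, nx, nx*ny /)
  character(len=1), parameter :: name(3) = (/ 'i', 'j', 'k' /)
  integer :: inner

  ! Try each loop index in turn as the innermost loop.
  do inner = 1, 3
     if (unit_stride(coef, inner)) then
        print '(a,a,a)', 'innermost ', name(inner), ': unit stride'
     else
        print '(a,a,a,i0)', 'innermost ', name(inner), &
             ': nonunit stride, coefficient = ', coef(inner)
     end if
  end do

contains

  ! True when the coefficient at the innermost loop index is 1.
  logical function unit_stride(c, innermost)
    integer, intent(in) :: c(3)
    integer, intent(in) :: innermost
    unit_stride = (c(innermost) == 1)
  end function unit_stride

end program stride_check
```

Under these assumptions only the nest with i as the innermost index passes the check. With j or k innermost, consecutive iterations are nx or nx*ny elements apart, so most of each loaded cache line goes unused.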

The

other

Detection

of self

represents

the

array

and

sizes

a set the

the

sizes,

a test

Detection affine offset

and

of this

of both

Detection

nxa

requires

same

TLB_SIZE;

loop

exceeds

the

user

can

not

2.

arrays. • nya

block

TLB

the

gets

address

then the

misses.

accessed

high then

in this

a single

by different locality.

would

case)

i, j, k with

is known. symbolic

addrp(i,j, where

nest

p is the

coefficients

thread

condition

c > a(nx-

This read

cross are

when

interference

the

inter

array

represented

k = 1, 2, 3.

An

for example, bigger

misses

(as in Example

by

evaluation

if both

same

address

4. If the

accessed

can

arrays

array. 1) usually

nest.

Otherwise,

time

be formulated

condition

is checked

only

invalidations. copied

and

assume

function

to be true if both

then

conditions

as nonoverlapping

noninterference

are

innermost

test.

at the memory

lines

in the

be proved

be placed

program access

in the can

arrays

as a run

happens

of the

conditions

condition

cache

at

condition of the for

In the

into that

the

be a linear

running

"read/write"

arrays

of "read"

cache

array

anyway

parallelized

function

is satisfied

processor

case

secondary

of

of the

and

loop

(k-loop

nest

indices

a, b, c:

k) = ai + bj + ck + cwp number,

nx, 0 < j < ny, 0