Structuring distributed relation-based computations with SCDRC N. Botta, C. Ionescu, C. Linstead, R. Klein
with contributions of: - J. Gerlach, Fraunhofer Inst. für Rechnerarch. und Softwaretechnik. - A. Priesnitz, Chalmers University of Technology.
Software architecture and SCDRC applications

[Layer diagram: applications (SCDSDP, SCDAFVM, S project) on top of SCDRC, on top of message passing libraries (MPI, OpenMP, ?), on top of hardware (IBM p655, PC cluster, . . . ).]

• SCDRC is a thin layer between message passing libraries and application dependent components.
Why a software layer ?
• Structuring parallel algorithms for adaptive grids, partial differential equations, sequential decision problems is difficult.
• Implementing such algorithms in terms of message passing libraries
  – is time-consuming,
  – is error-prone,
  – requires knowledge of low-level message passing primitives.
Why distributed relation-based computations ?
• Relation-based computations are non-trivially parallelizable generic computational patterns (algorithmic skeletons).
• Many algorithms can be written in terms of relation-based computations.
• If we are able to implement generic distributed relation-based computations, we can express parallel algorithms by combining specializations of DRCs.
Pattern 1: relation-based algorithms

RBA:
  for j in js do
    compute h([ f(i) | i in R(j) ])
  end for

can be specialized to compute triangle areas, centers, etc.:

ta = RBA(area, x, vt):
  for j in [ 0 . . . size(vt) ) do
    compute area([ x(vt(j, 0)), x(vt(j, 1)), x(vt(j, 2)) ])
  end for
... or sparse matrix-vector multiplications:

mvmult:
  for j in [ 0 . . . n ) do
    compute sum([ e(k) ∗ b[c(k)] | k in [ p(j) . . . p(j + 1) ) ])
  end for

[Figure: CRS layout — the row pointer p marks the span [ p(j), p(j + 1) ) of the column index array c and the entry array e belonging to row j; c(k) indexes into b.]
One has:

  [ e(k) ∗ b[c(k)] | k in [ p(j) . . . p(j + 1) ) ]
    = [ e(k) ∗ b[i] | i = c(k), k in [ p(j) . . . p(j + 1) ) ]
    = { let R(j) = [ c(k) | k in [ p(j) . . . p(j + 1) ) ] }
      [ e(k) ∗ b[i] | i in R(j), k = p(j) + pos(i, R(j)) ]
    = { let g(i, j) = p(j) + pos(i, R(j)) }
      [ e(g(i, j)) ∗ b[i] | i in R(j) ]
    = { let f(i, j) = e(g(i, j)) ∗ b[i] }
      [ f(i, j) | i in R(j) ]

and therefore mvmult = RBA(sum, f, R).
Parallel distributed RBAs

• RBAs are non-trivially parallelizable if:
  – f is represented by storing disjoint subsets of its graph f0 . . . fnp−1 on the partitions.
  – For any disjoint splitting R0 . . . Rnp−1 of R there are partitions p, q for which ran(Rp) ∩ ran(Rq) ≠ ∅.

Example: R = [ [ 0, 1, 3 ], [ 1, 2, 3 ] ]

[Figure: two triangles over vertices 0, 1, 2, 3, sharing the edge (1, 3).]
Parallel distributed RBAs
• To parallelize RBAs, one has to answer the questions:
  – How to distribute f and R on the partitions?
  – How to compute, on a given partition, which values of f are needed but not locally owned?
  – How to obtain, on a given partition, such values from the partitions that own them?
Conceptual model of distributed functions and relations
• Functions represented as arrays, relations as arrays of arrays.
• Distributed functions:

  f  = [ (0, 0), (2, 0), (2, 1), (0, 1) ]        f0 = [ (0, 0), (2, 0) ]    of0 = [ 0, 2, 4 ]
  of = [ 0, 2, 4 ]                           →   f1 = [ (2, 1), (0, 1) ]    of1 = [ 0, 2, 4 ]
Conceptual model of distributed functions and relations
• Distributed relations:

  R  = [ [ 0, 1, 3 ], [ 1, 2, 3 ] ]        R0 = [ [ 0, 1, 3 ] ]    oR0 = [ 0, 1, 2 ]
  oR = [ 0, 1, 2 ]                     →   R1 = [ [ 1, 2, 3 ] ]    oR1 = [ 0, 1, 2 ]
Distributed data and local computations

• P0: f0 = [ (0, 0), (2, 0) ], R0 = [ [ 0, 1, 3 ] ] and offsets.
• P1: f1 = [ (2, 1), (0, 1) ], R1 = [ [ 1, 2, 3 ] ] and offsets.

[Figure: the two triangles, each vertex labelled with its global index and, in parentheses, the partition that owns it; vertices referenced but not locally owned are marked (?).]
Local computations and access/exchange tables

Given: (R0 . . . Rnp−1), (ofs . . . ofs) with Rp :: A(A(Nat)), ofs :: A(Nat) such that

  is_offsets(ofs) == true  and  max_elem(Rp(j)) < ofs(np)

find access and exchange tables (at0 . . . atnp−1), (et0 . . . etnp−1) with atp, etp :: A(A(Nat)) such that:

  ∀ (f0 . . . fnp−1) with offsets(map(size, (f0 . . . fnp−1))) == ofs,
  the tuple (f′0 . . . f′np−1) obtained with complete_f satisfies:

  f′p(atp(j)(k)) == f(Rp(j)(k))
Local computations and access/exchange tables

complete_f:

  et′p = map(sf′p, etp)  where  sf′p(a) = map(sfp, a),  sfp(i) = fp(i − ofs(p))

  f′p = concat(fp, breadth(exchange(et′0 . . . et′np−1)(p)))
SCDRC support for distributed RBAs

• SCDRC supports RBA construction, computation of access and exchange tables and data exchange with suitable primitives:

  typedef Triangle_Center H;
  typedef Array F;
  typedef Reg_Rel R;

  H triangle_center;
  RBA tca(triangle_center, x, vt);
  init_access_exch_tables(tca);
  complete_f(tca);
SCDRC support for distributed RBAs

• Distributed RBAs can be evaluated in parallel as common functions in a sequential computational environment:

  for(Nat j = 0; j < n_triangles; j++) {
    const Real area = taa(j);
    const Vector center = tca(j);
    a += area;
    c += area * center;
  }
Distributed RBCs for partitioning and grid computations
• Trivial partitioning ⇒ poor clustering, high comm. volume!

[Figure: 188966 triangles, 43 partitions; total comm. volume = 338495 units.]
Distributed RBCs for partitioning and grid computations
• Partitioning using Metis and SCDRC renumbering algorithms:

[Figure: 188966 triangles, 43 partitions; total comm. volume = 2543 units.]
Distributed RBCs for partitioning and grid computations
• To use graph partitioning algorithms, one needs operations to:
  – converse R,
  – compose R◦ with R (neighbor triangle computation),
  – intersect, join, etc.
SCDRC support for distributed RBCs
• SCDRC supports: sequential relational operations . . .

  CRS_Rel tv;  converse(tv, vt);
  CRS_Rel tvt; compose(tvt, tv, vt);
  CRS_Rel r;   subtract_id(r, tvt);
  partition_kway(vtpart, n_cuts, r, n_p());
SCDRC support for distributed RBCs
• . . . local relational operations in a distributed computational environment:

  CRS_Rel tv;  local::converse(tv, vt);
  CRS_Rel tvt; local::compose(tvt, tv, vt);
  CRS_Rel r;   local::subtract_id(r, tvt);
  local::partition_kway(vtpart, n_cuts, r, n_p());
SCDRC support for distributed RBCs
• . . . and parallel relational operations in a distributed computational environment:

  CRS_Rel tv;  converse(tv, vt, ofs);
  CRS_Rel tvt; compose(tvt, tv, vt);
  CRS_Rel r;   subtract_id(r, tvt);

with a suitable set of primitives and with a seamless notation.
Preliminary results
• A simplistic cost model ...

  C(np) = Ccomp.(np) + Ccomm.(np)
        ≈ k1/np + c1 ∗ np ∗ pbs
        ≈ k1/np + c1 ∗ np ∗ c2 ∗ (pvs)^(1/2)
        ≈ k1/np + c1 ∗ np ∗ c2 ∗ (c3/np)^(1/2)
        = k1/np + k2 ∗ np^(1/2)
Preliminary results
• ... explains quite well low comm. frequency speedup results:

[Figure: speedup vs. np for np = 10 . . . 60, comparing C(2)/C(np) with k1 = 3836.1, k2 = 147.4 against measured elapsed(2)/elapsed(np).]
Preliminary results
• For high frequencies, a more realistic model is required:

[Figure: speedup vs. np for np = 10 . . . 60, comparing C(2)/C(np) with k1 = 3044.6, k2 = 134 against measured elapsed(2)/elapsed(np).]