
Linear Algebraic Primitives for Parallel Computing on Large Graphs

Aydın Buluç, Ph.D. Defense, March 16, 2010


Committee: John R. Gilbert (Chair), Ömer Eğecioğlu, Fred Chong, Shivkumar Chandrasekaran

Sources of Massive Graphs

Graphs naturally arise from the internet and social interactions.

(WWW snapshot, courtesy Y. Hyun)

Many scientific (biological, chemical, cosmological, ecological, etc.) datasets are modeled as graphs.

(Yeast protein interaction network, courtesy H. Jeong)

Types of Graph Computations

(Diagram: a small example graph on vertices 1-7 and a map/reduce dataflow.)

Tightly coupled
• Tool: graph traversal
• Examples: centrality, shortest paths, network flows, strongly connected components, clustering, algebraic multigrid

Filtering based
• Tool: map/reduce
• Examples: fuzzy intersection, loop and multi-edge removal, triangle/rectangle enumeration

Tightly Coupled Computations

A. Computationally intensive graph/data mining algorithms (e.g., graph clustering, centrality, dimensionality reduction).
B. Inherently latency-bound computations (e.g., finding s-t connectivity).

Let the special-purpose architectures handle (B); what can we do for (A) with commodity architectures?

Huge graphs + expensive kernels → disk is not an option, we need memory → high performance and massive parallelism.

Software for Graph Computation

"My main conclusion after spending ten years of my life on the TeX project is that software is hard. It's harder than anything else I've ever had to do." (Donald Knuth)

Dealing with software is hard!
High performance computing (HPC) software is harder!
How do we deal with parallel HPC software?

Parallel Graph Software

• Shared memory does not scale; it pays off to target productivity there.
• Distributed memory should target p > 100 at least.
• Shared and distributed memory complement each other.
• Coarse-grained parallelism for distributed memory.

Outline
• The Case for Primitives
• The Case for Sparse Matrices
• Parallel Sparse Matrix-Matrix Multiplication
• Software Design of the Combinatorial BLAS
• An Application in Social Network Analysis
• Other Work
• Future Directions

All-Pairs Shortest Paths

• Input: directed graph with "costs" on edges.
• Find least-cost paths between all reachable vertex pairs.
• Classical algorithm: Floyd-Warshall.

    for k = 1:n   % the induction sequence
        for i = 1:n
            for j = 1:n
                if (w(i→k) + w(k→j) < w(i→j))
                    w(i→j) := w(i→k) + w(k→j)

(Figure: the k = 1 case on a 5-vertex example.)

• Case study of an implementation on a multicore architecture: the graphics processing unit (GPU).
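A minimal NumPy sketch of the same triple loop; the example graph, its size, and the use of np.inf for missing edges are illustrative assumptions, not from the slides:

import numpy as np

def floyd_warshall(w):
    """Classical Floyd-Warshall: w[i, j] holds the edge cost i->j,
    np.inf where no edge exists. Updates w in place to least costs."""
    n = w.shape[0]
    for k in range(n):          # the induction sequence
        for i in range(n):
            for j in range(n):
                if w[i, k] + w[k, j] < w[i, j]:
                    w[i, j] = w[i, k] + w[k, j]
    return w

# Tiny example on a 4-vertex directed graph with hypothetical costs.
INF = np.inf
w = np.array([[0., 3., INF, 7.],
              [8., 0., 2., INF],
              [5., INF, 0., 1.],
              [2., INF, INF, 0.]])
print(floyd_warshall(w))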

GPU Characteristics

Powerful: two Nvidia 8800s > 1 TFLOPS. Inexpensive: $500 each.

But:
• Difficult programming model: one instruction stream drives 8 arithmetic units.
• Performance is counterintuitive and fragile: the memory access pattern has subtle effects on cost.
• Extremely easy to underutilize the device: doing it wrong easily costs 100x in time.


Recursive All-Pairs Shortest Paths

Based on the R-Kleene algorithm. The cost matrix is split into four blocks

    A  B
    C  D

and the blocks are combined over the semiring where + is "min" and × is "add":

    A = A*;             % recursive call
    B = AB;   C = CA;
    D = D + CB;
    D = D*;             % recursive call
    B = BD;   C = DC;
    A = A + BC;

Well suited for the GPU architecture:
• Fast matrix-multiply kernel.
• In-place computation => low memory bandwidth.
• Few, large MatMul calls => low GPU dispatch overhead.
• Recursion stack on the host CPU, not on the multicore GPU.
• Careful tuning of GPU code.
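A serial Python sketch of the same blocked recursion over the (min, +) semiring; NumPy's dense broadcasting stands in for the tuned GPU matrix-multiply kernel, and the helper names and base-case size are illustrative assumptions:

import numpy as np

def minplus(A, B):
    """Matrix 'multiplication' over the (min, +) semiring:
    C[i, j] = min_k (A[i, k] + B[k, j]). Dense and memory-hungry,
    used here only as a stand-in for the tuned kernel."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def apsp(W):
    """R-Kleene-style recursive all-pairs shortest paths (closure W*).
    W is a square cost matrix with np.inf for missing edges and a
    zero diagonal. Returns the matrix of least path costs."""
    n = W.shape[0]
    if n <= 2:                                     # small base case
        for k in range(n):
            W = np.minimum(W, minplus(W[:, [k]], W[[k], :]))
        return W
    h = n // 2
    A, B = W[:h, :h], W[:h, h:]
    C, D = W[h:, :h], W[h:, h:]
    A = apsp(A)                                    # A = A*
    B = minplus(A, B); C = minplus(C, A)           # B = AB; C = CA
    D = np.minimum(D, minplus(C, B))               # D = D + CB
    D = apsp(D)                                    # D = D*
    B = minplus(B, D); C = minplus(D, C)           # B = BD; C = DC
    A = np.minimum(A, minplus(B, C))               # A = A + BC
    return np.block([[A, B], [C, D]])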

APSP: Experiments and Observations (The Case for Primitives)

(Chart: lifting Floyd-Warshall directly to the GPU vs. the unorthodox R-Kleene algorithm built on the right primitive, a 480x difference in performance.)

Conclusions:
• High performance is achievable but not simple.
• Carefully chosen and optimized primitives will be key.

Outline
• The Case for Primitives
• The Case for Sparse Matrices
• Parallel Sparse Matrix-Matrix Multiplication
• Software Design of the Combinatorial BLAS
• An Application in Social Network Analysis
• Other Work
• Future Directions

Sparse Adjacency Matrix and Graph

(Figure: a directed graph on seven vertices, its sparse adjacency matrix A^T, and a vector x.)

• Every graph is a sparse matrix and vice versa.
• Adjacency matrix: sparse array with nonzeros for graph edges.
• Storage-efficient implementation from sparse data structures.
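A small illustration of the graph-matrix duality using SciPy; the edge list, vertex count, and frontier vector are made-up example data:

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical directed edge list on 7 vertices (0-based).
edges = [(0, 1), (0, 3), (1, 4), (2, 5), (3, 6), (4, 2), (5, 6)]
rows, cols = zip(*edges)
n = 7

# Adjacency matrix: one stored nonzero per edge.
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

# One traversal step from a frontier x: y = A^T x marks every vertex
# reachable by following one edge out of the vertices selected by x.
x = np.zeros(n); x[0] = 1.0
y = A.T @ x
print(np.flatnonzero(y))   # vertices 1 and 3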

The Case for Sparse Matrices

Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.

Traditional graph computations vs. graphs in the language of linear algebra:
• Data driven, unpredictable communication → fixed communication patterns.
• Irregular and unstructured, poor locality of reference → operations on matrix blocks, which exploit the memory hierarchy.
• Fine-grained data accesses, dominated by latency → coarse-grained parallelism, bandwidth limited.

Identification of Linear Algebraic Primitives

• Sparse matrix-matrix multiplication (SpGEMM)
• Sparse matrix-vector multiplication (SpMV)
• Element-wise operations (.*)
• Sparse matrix indexing

All on matrices over semirings, e.g. (×, +), (and, or), (+, min).
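SciPy's sparse types only implement the ordinary (+, ×) ring, so a semiring version has to be written against the stored entries directly. A minimal sketch of sparse matrix-vector multiplication over the (+, min) semiring on a CSC matrix; the cost matrix and input vector are illustrative:

import numpy as np
from scipy.sparse import csc_matrix

def spmv_minplus(A, x):
    """y = A 'times' x over the (min, +) semiring:
    y[i] = min over stored A[i, j] of (A[i, j] + x[j]).
    Entries not stored in A behave as +inf."""
    A = csc_matrix(A)
    y = np.full(A.shape[0], np.inf)
    for j in range(A.shape[1]):                       # one column at a time
        for k in range(A.indptr[j], A.indptr[j + 1]):
            i = A.indices[k]
            y[i] = min(y[i], A.data[k] + x[j])
    return y

# Hypothetical 3x3 cost matrix; zeros mean "no edge" and are not stored.
A = csc_matrix(np.array([[0., 2., 0.],
                         [0., 0., 1.],
                         [4., 0., 0.]]))
print(spmv_minplus(A, np.array([0., 5., 9.])))   # [7. 10. 4.]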

Applications of Sparse GEMM (why focus on SpGEMM?)

• Graph clustering (MCL, peer pressure)
• Subgraph / submatrix indexing
• Shortest path calculations
• Betweenness centrality
• Graph contraction
• Cycle detection
• Multigrid interpolation & restriction
• Colored intersection searching
• Applying constraints in finite element computations
• Context-free parsing
• ...

Example, the MCL clustering iteration:

    Loop until convergence
        A = A * A;            % expand
        A = A .^ 2;           % inflate
        sc = 1 ./ sum(A, 1);
        A = A * diag(sc);     % renormalize
        A(A < 0.0001) = 0;    % prune entries
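The same iteration written with SciPy sparse matrices, so that the expand step is an explicit SpGEMM; the convergence test, tolerance, iteration cap, and function name are illustrative assumptions, and the input is assumed to have no empty columns (e.g., an adjacency matrix with self-loops):

import numpy as np
from scipy.sparse import csc_matrix, diags

def mcl_iterate(A, prune=1e-4, tol=1e-8, max_iter=100):
    """Markov-clustering-style iteration: expand (SpGEMM), inflate
    (entrywise square), renormalize columns, prune small entries."""
    A = csc_matrix(A, dtype=float)
    for _ in range(max_iter):
        A_prev = A.copy()
        A = A @ A                                   # expand (SpGEMM)
        A = A.multiply(A)                           # inflate: A .^ 2
        sc = 1.0 / np.asarray(A.sum(axis=0)).ravel()  # column sums
        A = A @ diags(sc)                           # renormalize columns
        A.data[A.data < prune] = 0.0                # prune small entries
        A.eliminate_zeros()
        if abs(A - A_prev).max() < tol:
            break
    return A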

Outline
• The Case for Primitives
• The Case for Sparse Matrices
• Parallel Sparse Matrix-Matrix Multiplication
• Software Design of the Combinatorial BLAS
• An Application in Social Network Analysis
• Other Work
• Future Directions

Two Versions of Sparse GEMM

1D block-column distribution (illustrated with eight block columns A1..A8, B1..B8, C1..C8): processor i owns block columns Ai, Bi, Ci and computes

    Ci = Ci + A Bi

Checkerboard (2D block) distribution: the matrices are split into a grid of submatrix blocks, and each output block is accumulated as

    Cij += Aik Bkj
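A serial sketch of the checkerboard formulation: the operands are cut into a q × q grid of sparse blocks and each output block accumulates Cij += Aik Bkj, which is the work one processor would own in the parallel algorithm. The grid size, random test matrices, and helper names are illustrative, and no actual communication is modeled:

import scipy.sparse as sp

def blocked_spgemm(A, B, q):
    """C = A @ B computed block-by-block on a q x q grid
    (serial simulation of the 2D algorithm); n must be divisible by q."""
    n = A.shape[0]
    b = n // q
    blk = lambda M, i, j: sp.csr_matrix(M[i*b:(i+1)*b, j*b:(j+1)*b])
    C = [[sp.csr_matrix((b, b)) for _ in range(q)] for _ in range(q)]
    for i in range(q):
        for j in range(q):
            for k in range(q):                  # the SUMMA-like stages
                C[i][j] = C[i][j] + blk(A, i, k) @ blk(B, k, j)
    return sp.bmat(C, format="csr")

# Check against SciPy's own SpGEMM on random sparse test matrices.
A = sp.random(8, 8, density=0.3, format="csr", random_state=0)
B = sp.random(8, 8, density=0.3, format="csr", random_state=1)
print(abs(blocked_spgemm(A, B, q=2) - (A @ B)).max() < 1e-12)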

Projected Speedup of Sparse 1D & 2D Algorithms

(Figure: projected speedup surfaces for the 1D and 2D algorithms as functions of matrix dimension N and processor count P.)

In practice, 2D algorithms have the potential to scale, but not linearly.

Parallel efficiency is modeled as E = W / (p (Tcomp + Tcomm)), with the computation and communication terms expressed in the per-flop cost γ, inverse bandwidth β, latency α, nonzeros per column c, dimension n, and processor count p.

Compressed Sparse Columns (CSC): A Standard Layout

An n×n matrix with nnz nonzeros is stored in three arrays:

    colptr: 0 4 8 10 11 12 13 16 17               (column pointers, length n+1)
    rowind: 0 2 3 4 0 1 5 7 2 3 3 4 5 4 5 6 7     (row indices, length nnz)
    data:   the nnz numerical values

• Stores entries in column-major order.
• Dense collection of "sparse columns".
• Uses O(n + nnz) storage.
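SciPy's csc_matrix keeps exactly these three arrays (indptr for the column pointers, indices for the row indices, data for the values); the small matrix below is an arbitrary example:

import numpy as np
from scipy.sparse import csc_matrix

M = np.array([[4., 0., 0., 1.],
              [0., 0., 3., 0.],
              [5., 2., 0., 0.],
              [0., 0., 0., 6.]])
A = csc_matrix(M)

print(A.indptr)    # column pointers, length n+1: [0 2 3 4 6]
print(A.indices)   # row indices, length nnz:     [0 2 2 1 0 3]
print(A.data)      # values in column-major order: [4. 5. 2. 3. 1. 6.]
# Storage: (n + 1) + nnz indices plus nnz values -> O(n + nnz).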

Node Level Considerations

Submatrices are "hypersparse" (i.e., nnz < n).

Lessons learned:
‣ If communication > computation for your algorithm, overlapping them will NOT make it scale.
‣ Scale your algorithm, not your implementation.
‣ Any sustainable speedup is a success!
‣ Don't get disappointed with poor parallel efficiency.
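A back-of-the-envelope illustration of the hypersparsity claim with made-up numbers: after a 2D decomposition onto p processors, each submatrix has dimension n/√p but only about cn/p nonzeros on average, so once p exceeds c² the submatrices hold fewer nonzeros than rows, i.e. they are hypersparse:

import math

n, c = 10_000_000, 8           # hypothetical: 10M vertices, ~8 edges/vertex
for p in (16, 256, 4096, 65536):
    dim = n / math.isqrt(p)    # submatrix dimension n / sqrt(p)
    nnz = c * n / p            # expected nonzeros per submatrix
    print(f"p={p:6d}  dim={dim:10.0f}  nnz/submatrix={nnz:10.0f}  "
          f"hypersparse={nnz < dim}")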

Related Publications

• Hypersparsity in 2D decomposition, sequential kernel: B., Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IPDPS'08.
• Parallel analysis of sparse GEMM, synchronous implementation: B., Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", ICPP'08.
• The case for primitives, APSP on the GPU: B., Gilbert, Budak, "Solving Path Problems on the GPU", Parallel Computing, 2009.
• SpMV on multicores: B., Fineman, Frigo, Gilbert, Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks", SPAA'09.
• Betweenness centrality results: B., Gilbert, "Parallel Sparse Matrix-Matrix Multiplication and Large Scale Applications".
• Software design of the library: B., Gilbert, "Parallel Combinatorial BLAS: Interface and Reference Implementation".

Acknowledgments

David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, Matteo Frigo, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Kamesh Madduri, Steve Reinhardt, Eric Robinson, Viral Shah, Sivan Toledo

