
Linear Algebraic Primitives for Parallel Computing on Large Graphs

Aydın Buluç, Ph.D. Defense, March 16, 2010


Committee: John R. Gilbert (Chair), Ömer Eğecioğlu, Fred Chong, Shivkumar Chandrasekaran

Sources of Massive Graphs

Graphs naturally arise from the internet and social interactions.

(WWW snapshot, courtesy Y. Hyun)

Many scientific (biological, chemical, cosmological, ecological, etc.) datasets are modeled as graphs.

(Yeast protein interaction network, courtesy H. Jeong)

Types of Graph Computations

(Diagram: a small example graph on vertices 1-7 and a map/reduce dataflow.)

Tightly coupled
• Tool: graph traversal
• Examples: centrality, shortest paths, network flows, strongly connected components, clustering, algebraic multigrid

Filtering based
• Tool: map/reduce
• Examples: fuzzy intersection, loop and multi-edge removal, triangle/rectangle enumeration

Tightly Coupled Computations

A. Computationally intensive graph/data mining algorithms (e.g., graph clustering, centrality, dimensionality reduction).
B. Inherently latency-bound computations (e.g., finding s-t connectivity).

Let the special-purpose architectures handle (B); what can we do for (A) with commodity architectures?

Huge graphs + expensive kernels → disk is not an option, we need memory → high performance and massive parallelism.

Software for Graph Computation

"My main conclusion after spending ten years of my life on the TeX project is that software is hard. It's harder than anything else I've ever had to do." (Donald Knuth)

Dealing with software is hard!
High performance computing (HPC) software is harder!
How do we deal with parallel HPC software?

Parallel Graph Software

• Shared memory does not scale; it pays off to target productivity there.
• Distributed memory should target p > 100 at least.
• Shared and distributed memory complement each other.
• Coarse-grained parallelism for distributed memory.

Outline
• The Case for Primitives
• The Case for Sparse Matrices
• Parallel Sparse Matrix-Matrix Multiplication
• Software Design of the Combinatorial BLAS
• An Application in Social Network Analysis
• Other Work
• Future Directions

All-Pairs Shortest Paths

• Input: directed graph with "costs" on edges.
• Find least-cost paths between all reachable vertex pairs.
• Classical algorithm: Floyd-Warshall.

    for k = 1:n   % the induction sequence
        for i = 1:n
            for j = 1:n
                if (w(i→k) + w(k→j) < w(i→j))
                    w(i→j) := w(i→k) + w(k→j)

(Figure: the k = 1 case on a 5-vertex example.)

• Case study of an implementation on a multicore architecture: the graphics processing unit (GPU).
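A minimal NumPy sketch of the same triple loop; the example graph, its size, and the use of np.inf for missing edges are illustrative assumptions, not from the slides:

import numpy as np

def floyd_warshall(w):
    """Classical Floyd-Warshall: w[i, j] holds the edge cost i->j,
    np.inf where no edge exists. Updates w in place to least costs."""
    n = w.shape[0]
    for k in range(n):          # the induction sequence
        for i in range(n):
            for j in range(n):
                if w[i, k] + w[k, j] < w[i, j]:
                    w[i, j] = w[i, k] + w[k, j]
    return w

# Tiny example on a 4-vertex directed graph with hypothetical costs.
INF = np.inf
w = np.array([[0., 3., INF, 7.],
              [8., 0., 2., INF],
              [5., INF, 0., 1.],
              [2., INF, INF, 0.]])
print(floyd_warshall(w))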

GPU Characteristics

Powerful: two Nvidia 8800s > 1 TFLOPS. Inexpensive: $500 each.

But:
• Difficult programming model: one instruction stream drives 8 arithmetic units.
• Performance is counterintuitive and fragile: the memory access pattern has subtle effects on cost.
• Extremely easy to underutilize the device: doing it wrong easily costs 100x in time.


Recursive All-Pairs Shortest Paths

Based on the R-Kleene algorithm. The cost matrix is split into four blocks

    A  B
    C  D

and the blocks are combined over the semiring where + is "min" and × is "add":

    A = A*;             % recursive call
    B = AB;   C = CA;
    D = D + CB;
    D = D*;             % recursive call
    B = BD;   C = DC;
    A = A + BC;

Well suited for the GPU architecture:
• Fast matrix-multiply kernel.
• In-place computation => low memory bandwidth.
• Few, large MatMul calls => low GPU dispatch overhead.
• Recursion stack on the host CPU, not on the multicore GPU.
• Careful tuning of GPU code.
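A serial Python sketch of the same blocked recursion over the (min, +) semiring; NumPy's dense broadcasting stands in for the tuned GPU matrix-multiply kernel, and the helper names and base-case size are illustrative assumptions:

import numpy as np

def minplus(A, B):
    """Matrix 'multiplication' over the (min, +) semiring:
    C[i, j] = min_k (A[i, k] + B[k, j]). Dense and memory-hungry,
    used here only as a stand-in for the tuned kernel."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def apsp(W):
    """R-Kleene-style recursive all-pairs shortest paths (closure W*).
    W is a square cost matrix with np.inf for missing edges and a
    zero diagonal. Returns the matrix of least path costs."""
    n = W.shape[0]
    if n <= 2:                                     # small base case
        for k in range(n):
            W = np.minimum(W, minplus(W[:, [k]], W[[k], :]))
        return W
    h = n // 2
    A, B = W[:h, :h], W[:h, h:]
    C, D = W[h:, :h], W[h:, h:]
    A = apsp(A)                                    # A = A*
    B = minplus(A, B); C = minplus(C, A)           # B = AB; C = CA
    D = np.minimum(D, minplus(C, B))               # D = D + CB
    D = apsp(D)                                    # D = D*
    B = minplus(B, D); C = minplus(D, C)           # B = BD; C = DC
    A = np.minimum(A, minplus(B, C))               # A = A + BC
    return np.block([[A, B], [C, D]])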

APSP: Experiments and Observations (The Case for Primitives)

(Chart: lifting Floyd-Warshall directly to the GPU vs. the unorthodox R-Kleene algorithm built on the right primitive, a 480x difference in performance.)

Conclusions:
• High performance is achievable but not simple.
• Carefully chosen and optimized primitives will be key.

Outline
• The Case for Primitives
• The Case for Sparse Matrices
• Parallel Sparse Matrix-Matrix Multiplication
• Software Design of the Combinatorial BLAS
• An Application in Social Network Analysis
• Other Work
• Future Directions

Sparse Adjacency Matrix and Graph

(Figure: a directed graph on seven vertices, its sparse adjacency matrix A^T, and a vector x.)

• Every graph is a sparse matrix and vice versa.
• Adjacency matrix: sparse array with nonzeros for graph edges.
• Storage-efficient implementation from sparse data structures.
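A small illustration of the graph-matrix duality using SciPy; the edge list, vertex count, and frontier vector are made-up example data:

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical directed edge list on 7 vertices (0-based).
edges = [(0, 1), (0, 3), (1, 4), (2, 5), (3, 6), (4, 2), (5, 6)]
rows, cols = zip(*edges)
n = 7

# Adjacency matrix: one stored nonzero per edge.
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

# One traversal step from a frontier x: y = A^T x marks every vertex
# reachable by following one edge out of the vertices selected by x.
x = np.zeros(n); x[0] = 1.0
y = A.T @ x
print(np.flatnonzero(y))   # vertices 1 and 3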

The Case for Sparse Matrices

Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.

Traditional graph computations vs. graphs in the language of linear algebra:
• Data driven, unpredictable communication → fixed communication patterns.
• Irregular and unstructured, poor locality of reference → operations on matrix blocks, which exploit the memory hierarchy.
• Fine-grained data accesses, dominated by latency → coarse-grained parallelism, bandwidth limited.

Identification of Linear Algebraic Primitives

• Sparse matrix-matrix multiplication (SpGEMM)
• Sparse matrix-vector multiplication (SpMV)
• Element-wise operations (.*)
• Sparse matrix indexing

All on matrices over semirings, e.g. (×, +), (and, or), (+, min).
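SciPy's sparse types only implement the ordinary (+, ×) ring, so a semiring version has to be written against the stored entries directly. A minimal sketch of sparse matrix-vector multiplication over the (+, min) semiring on a CSC matrix; the cost matrix and input vector are illustrative:

import numpy as np
from scipy.sparse import csc_matrix

def spmv_minplus(A, x):
    """y = A 'times' x over the (min, +) semiring:
    y[i] = min over stored A[i, j] of (A[i, j] + x[j]).
    Entries not stored in A behave as +inf."""
    A = csc_matrix(A)
    y = np.full(A.shape[0], np.inf)
    for j in range(A.shape[1]):                       # one column at a time
        for k in range(A.indptr[j], A.indptr[j + 1]):
            i = A.indices[k]
            y[i] = min(y[i], A.data[k] + x[j])
    return y

# Hypothetical 3x3 cost matrix; zeros mean "no edge" and are not stored.
A = csc_matrix(np.array([[0., 2., 0.],
                         [0., 0., 1.],
                         [4., 0., 0.]]))
print(spmv_minplus(A, np.array([0., 5., 9.])))   # [7. 10. 4.]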

Applications of Sparse GEMM (why focus on SpGEMM?)

• Graph clustering (MCL, peer pressure)
• Subgraph / submatrix indexing
• Shortest path calculations
• Betweenness centrality
• Graph contraction
• Cycle detection
• Multigrid interpolation & restriction
• Colored intersection searching
• Applying constraints in finite element computations
• Context-free parsing
• ...

Example, the MCL clustering iteration:

    Loop until convergence
        A = A * A;            % expand
        A = A .^ 2;           % inflate
        sc = 1 ./ sum(A, 1);
        A = A * diag(sc);     % renormalize
        A(A < 0.0001) = 0;    % prune entries
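The same iteration written with SciPy sparse matrices, so that the expand step is an explicit SpGEMM; the convergence test, tolerance, iteration cap, and function name are illustrative assumptions, and the input is assumed to have no empty columns (e.g., an adjacency matrix with self-loops):

import numpy as np
from scipy.sparse import csc_matrix, diags

def mcl_iterate(A, prune=1e-4, tol=1e-8, max_iter=100):
    """Markov-clustering-style iteration: expand (SpGEMM), inflate
    (entrywise square), renormalize columns, prune small entries."""
    A = csc_matrix(A, dtype=float)
    for _ in range(max_iter):
        A_prev = A.copy()
        A = A @ A                                   # expand (SpGEMM)
        A = A.multiply(A)                           # inflate: A .^ 2
        sc = 1.0 / np.asarray(A.sum(axis=0)).ravel()  # column sums
        A = A @ diags(sc)                           # renormalize columns
        A.data[A.data < prune] = 0.0                # prune small entries
        A.eliminate_zeros()
        if abs(A - A_prev).max() < tol:
            break
    return A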

Outline
• The Case for Primitives
• The Case for Sparse Matrices
• Parallel Sparse Matrix-Matrix Multiplication
• Software Design of the Combinatorial BLAS
• An Application in Social Network Analysis
• Other Work
• Future Directions

Two Versions of Sparse GEMM

1D block-column distribution (illustrated with eight block columns A1..A8, B1..B8, C1..C8): processor i owns block columns Ai, Bi, Ci and computes

    Ci = Ci + A Bi

Checkerboard (2D block) distribution: the matrices are split into a grid of submatrix blocks, and each output block is accumulated as

    Cij += Aik Bkj
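A serial sketch of the checkerboard formulation: the operands are cut into a q × q grid of sparse blocks and each output block accumulates Cij += Aik Bkj, which is the work one processor would own in the parallel algorithm. The grid size, random test matrices, and helper names are illustrative, and no actual communication is modeled:

import scipy.sparse as sp

def blocked_spgemm(A, B, q):
    """C = A @ B computed block-by-block on a q x q grid
    (serial simulation of the 2D algorithm); n must be divisible by q."""
    n = A.shape[0]
    b = n // q
    blk = lambda M, i, j: sp.csr_matrix(M[i*b:(i+1)*b, j*b:(j+1)*b])
    C = [[sp.csr_matrix((b, b)) for _ in range(q)] for _ in range(q)]
    for i in range(q):
        for j in range(q):
            for k in range(q):                  # the SUMMA-like stages
                C[i][j] = C[i][j] + blk(A, i, k) @ blk(B, k, j)
    return sp.bmat(C, format="csr")

# Check against SciPy's own SpGEMM on random sparse test matrices.
A = sp.random(8, 8, density=0.3, format="csr", random_state=0)
B = sp.random(8, 8, density=0.3, format="csr", random_state=1)
print(abs(blocked_spgemm(A, B, q=2) - (A @ B)).max() < 1e-12)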

Projected Speedup of Sparse 1D & 2D Algorithms

(Figure: projected speedup surfaces for the 1D and 2D algorithms as functions of matrix dimension N and processor count P.)

In practice, 2D algorithms have the potential to scale, but not linearly.

Parallel efficiency is modeled as E = W / (p (Tcomp + Tcomm)), with the computation and communication terms expressed in the per-flop cost γ, inverse bandwidth β, latency α, nonzeros per column c, dimension n, and processor count p.

Compressed Sparse Columns (CSC): A Standard Layout

An n×n matrix with nnz nonzeros is stored in three arrays:

    colptr: 0 4 8 10 11 12 13 16 17               (column pointers, length n+1)
    rowind: 0 2 3 4 0 1 5 7 2 3 3 4 5 4 5 6 7     (row indices, length nnz)
    data:   the nnz numerical values

• Stores entries in column-major order.
• Dense collection of "sparse columns".
• Uses O(n + nnz) storage.
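SciPy's csc_matrix keeps exactly these three arrays (indptr for the column pointers, indices for the row indices, data for the values); the small matrix below is an arbitrary example:

import numpy as np
from scipy.sparse import csc_matrix

M = np.array([[4., 0., 0., 1.],
              [0., 0., 3., 0.],
              [5., 2., 0., 0.],
              [0., 0., 0., 6.]])
A = csc_matrix(M)

print(A.indptr)    # column pointers, length n+1: [0 2 3 4 6]
print(A.indices)   # row indices, length nnz:     [0 2 2 1 0 3]
print(A.data)      # values in column-major order: [4. 5. 2. 3. 1. 6.]
# Storage: (n + 1) + nnz indices plus nnz values -> O(n + nnz).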

Node Level Considerations

Submatrices are "hypersparse" (i.e., nnz < n).

Lessons learned:
‣ If communication > computation for your algorithm, overlapping them will NOT make it scale.
‣ Scale your algorithm, not your implementation.
‣ Any sustainable speedup is a success!
‣ Don't get disappointed with poor parallel efficiency.
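A back-of-the-envelope illustration of the hypersparsity claim with made-up numbers: after a 2D decomposition onto p processors, each submatrix has dimension n/√p but only about cn/p nonzeros on average, so once p exceeds c² the submatrices hold fewer nonzeros than rows, i.e. they are hypersparse:

import math

n, c = 10_000_000, 8           # hypothetical: 10M vertices, ~8 edges/vertex
for p in (16, 256, 4096, 65536):
    dim = n / math.isqrt(p)    # submatrix dimension n / sqrt(p)
    nnz = c * n / p            # expected nonzeros per submatrix
    print(f"p={p:6d}  dim={dim:10.0f}  nnz/submatrix={nnz:10.0f}  "
          f"hypersparse={nnz < dim}")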

Related Publications

• Hypersparsity in 2D decomposition, sequential kernel: B., Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IPDPS'08.
• Parallel analysis of sparse GEMM, synchronous implementation: B., Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", ICPP'08.
• The case for primitives, APSP on the GPU: B., Gilbert, Budak, "Solving Path Problems on the GPU", Parallel Computing, 2009.
• SpMV on multicores: B., Fineman, Frigo, Gilbert, Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks", SPAA'09.
• Betweenness centrality results: B., Gilbert, "Parallel Sparse Matrix-Matrix Multiplication and Large Scale Applications".
• Software design of the library: B., Gilbert, "Parallel Combinatorial BLAS: Interface and Reference Implementation".

Acknowledgments

David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, Matteo Frigo, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Kamesh Madduri, Steve Reinhardt, Eric Robinson, Viral Shah, Sivan Toledo

