Linear Algebraic Primitives for Parallel Computing on Large Graphs
Aydın Buluç
Ph.D. Defense, March 16, 2010
Committee: John R. Gilbert (Chair), Ömer Eğecioğlu, Fred Chong, Shivkumar Chandrasekaran
Sources of Massive Graphs
- Graphs naturally arise from the internet and from social interactions. (WWW snapshot, courtesy Y. Hyun)
- Many scientific datasets (biological, chemical, cosmological, ecological, etc.) are modeled as graphs. (Yeast protein interaction network, courtesy H. Jeong)
Types of Graph Computations
[Figure: a small numbered example graph alongside a map/reduce dataflow]
- Tightly coupled. Tool: graph traversal. Examples: centrality, shortest paths, network flows, strongly connected components, clustering, algebraic multigrid.
- Filtering based. Tool: map/reduce. Examples: fuzzy intersection, loop and multi-edge removal, triangle/rectangle enumeration.
Tightly Coupled Computations
A. Computationally intensive graph/data mining algorithms (e.g. graph clustering, centrality, dimensionality reduction)
B. Inherently latency-bound computations (e.g. finding s-t connectivity)
Let special-purpose architectures handle (B); what can we do for (A) with commodity architectures?
Huge graphs + expensive kernels → disk is not an option; we need memory!
→ High performance and massive parallelism
Software for Graph Computation
"My main conclusion after spending ten years of my life on the TeX project is that software is hard. It's harder than anything else I've ever had to do." (Donald Knuth)
- Dealing with software is hard!
- High performance computing (HPC) software is harder!
- How do we deal with parallel HPC software?
Parallel Graph Software
- Shared memory does not scale, but it pays off to target productivity.
- Distributed memory should be targeting p > 100 at least.
- Shared and distributed memory complement each other.
- Coarse-grained parallelism for distributed memory.
Outline
- The Case for Primitives
- The Case for Sparse Matrices
- Parallel Sparse Matrix-Matrix Multiplication
- Software Design of the Combinatorial BLAS
- An Application in Social Network Analysis
- Other Work
- Future Directions
All-Pairs Shortest Paths
- Input: directed graph with "costs" on edges
- Find least-cost paths between all reachable vertex pairs
- Classical algorithm: Floyd-Warshall

for k = 1:n   % the induction sequence
    for i = 1:n
        for j = 1:n
            if (w(i→k) + w(k→j) < w(i→j))
                w(i→j) := w(i→k) + w(k→j)

[Figure: 5×5 cost matrix illustrating the k = 1 case]
- Case study: an implementation on a multicore architecture, the graphics processing unit (GPU)
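
For reference, the triple loop above as a runnable sketch (Python with NumPy; the 4-vertex example graph and its edge costs are made up for illustration and are not from the talk):

# Minimal Floyd-Warshall sketch (illustrative CPU code, not the GPU implementation).
import numpy as np

INF = float("inf")

def floyd_warshall(w):
    """In-place APSP on an n-by-n cost matrix; w[i, j] = INF means no edge i -> j."""
    n = w.shape[0]
    for k in range(n):                      # the induction sequence
        for i in range(n):
            for j in range(n):
                if w[i, k] + w[k, j] < w[i, j]:
                    w[i, j] = w[i, k] + w[k, j]
    return w

# Hypothetical 4-vertex directed graph with edge costs.
w = np.full((4, 4), INF)
np.fill_diagonal(w, 0.0)
w[0, 1], w[1, 2], w[2, 3], w[0, 3] = 3.0, 1.0, 2.0, 10.0
print(floyd_warshall(w))                    # w[0, 3] drops from 10.0 to 6.0 via 0 -> 1 -> 2 -> 3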
GPU Characteristics
- Powerful: two Nvidia 8800s > 1 TFLOPS
- Inexpensive: $500 each
But:
- Difficult programming model: one instruction stream drives 8 arithmetic units
- Performance is counterintuitive and fragile: memory access pattern has subtle effects on cost
- Extremely easy to underutilize the device: doing it wrong easily costs 100x in time
Recursive All-Pairs Shortest Paths
- Based on the R-Kleene algorithm
- Well suited for the GPU architecture:
  - Fast matrix-multiply kernel
  - In-place computation => low memory bandwidth
  - Few, large MatMul calls => low GPU dispatch overhead
  - Recursion stack on the host CPU, not on the GPU
  - Careful tuning of the GPU code

Partition W = [A B; C D] and recurse, where "+" is "min" and "×" is "add":

A = A*;              % recursive call
B = A B;  C = C A;
D = D + C B;
D = D*;              % recursive call
B = B D;  C = D C;
A = A + B C;
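
The blocked recursion above can be sketched in plain Python/NumPy over the (min, +) semiring. This is only a CPU illustration of the recursive structure under assumed conventions (dense blocks, a tiny base case), not the tuned GPU kernel from the talk:

# R-Kleene-style recursive APSP over the (min, +) semiring (illustrative sketch).
import numpy as np

def minplus(a, b):
    """(min, +) matrix product: out[i, j] = min_k a[i, k] + b[k, j] (dense, memory-hungry)."""
    return np.min(a[:, :, None] + b[None, :, :], axis=1)

def apsp(w):
    """Closure w* under (min, +); w is an n-by-n cost matrix with np.inf for missing edges."""
    n = w.shape[0]
    if n <= 2:                                   # small base case: vectorized Floyd-Warshall
        for k in range(n):
            w = np.minimum(w, w[:, [k]] + w[[k], :])
        return w
    h = n // 2                                   # partition W = [A B; C D]
    A, B = w[:h, :h], w[:h, h:]
    C, D = w[h:, :h], w[h:, h:]
    A[:] = apsp(A)                               # A = A*        (recursive call)
    B[:] = minplus(A, B)                         # B = A B
    C[:] = minplus(C, A)                         # C = C A
    D[:] = np.minimum(D, minplus(C, B))          # D = D + C B   ("+" is min)
    D[:] = apsp(D)                               # D = D*        (recursive call)
    B[:] = minplus(B, D)                         # B = B D
    C[:] = minplus(D, C)                         # C = D C
    A[:] = np.minimum(A, minplus(B, C))          # A = A + B C
    return w

INF = float("inf")
W = np.array([[0, 3, INF], [INF, 0, 1], [2, INF, 0]], dtype=float)
print(apsp(W))                                   # W[0, 2] becomes 4.0 via 0 -> 1 -> 2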
APSP: Experiments and Observations (The Case for Primitives)
[Plot: speedup of the two GPU implementations, lifting Floyd-Warshall to the GPU vs. the unorthodox R-Kleene algorithm (480x); the right primitive!]
Conclusions:
- High performance is achievable but not simple
- Carefully chosen and optimized primitives will be key
Outline
- The Case for Primitives
- The Case for Sparse Matrices
- Parallel Sparse Matrix-Matrix Multiplication
- Software Design of the Combinatorial BLAS
- An Application in Social Network Analysis
- Other Work
- Future Directions
Sparse Adjacency Matrix and Graph
[Figure: a 7-vertex directed graph and its sparse adjacency matrix, with the product A^T x shown alongside]
- Every graph is a sparse matrix and vice versa
- Adjacency matrix: sparse array with nonzeros for graph edges
- Storage-efficient implementation from sparse data structures
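
A small sketch of the graph/matrix duality (Python with SciPy); the 7-vertex edge list is hypothetical, and a single breadth-first expansion step is written as the sparse product A^T x:

# A directed graph stored as a sparse adjacency matrix (illustrative).
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (0, 3), (1, 2), (3, 2), (3, 4), (4, 5), (5, 6)]   # hypothetical edges i -> j
rows, cols = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(7, 7))

# One breadth-first expansion step: x selects a frontier of vertices,
# and A^T x marks every vertex reachable by one edge from that frontier.
x = np.zeros(7)
x[0] = 1.0                                  # frontier = {vertex 0}
next_frontier = (A.T @ x) > 0
print(np.nonzero(next_frontier)[0])         # -> [1 3]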
The Case for Sparse Matrices
Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.

Traditional graph computations                        | Graphs in the language of linear algebra
Data driven; unpredictable communication              | Fixed communication patterns
Irregular and unstructured; poor locality of reference | Operations on matrix blocks; exploits memory hierarchy
Fine-grained data accesses; dominated by latency      | Coarse-grained parallelism; bandwidth limited
Identification of Linear Algebraic Primitives
- Sparse matrix-matrix multiplication (SpGEMM)
- Element-wise operations (.*)
- Sparse matrix-vector multiplication (SpMV)
- Sparse matrix indexing
- Matrices on semirings, e.g. (×, +), (and, or), (+, min)
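
To make the semiring view concrete, here is a tiny dictionary-based SpGEMM sketch; the {(i, j): value} storage format and the (min, +) example are purely illustrative and are not the Combinatorial BLAS interface:

# Sparse matrix-matrix multiply over a user-supplied semiring (illustrative sketch).
# A matrix is a dict {(i, j): value}; "add" and "mul" are the semiring operations.

def spgemm(A, B, add, mul):
    """C = A x B over the semiring (add, mul); absent entries are implicit semiring zeros."""
    B_by_row = {}
    for (k, j), b in B.items():             # index B by row so A's nonzeros can be streamed
        B_by_row.setdefault(k, []).append((j, b))
    C = {}
    for (i, k), a in A.items():
        for j, b in B_by_row.get(k, []):
            prod = mul(a, b)
            C[(i, j)] = add(C[(i, j)], prod) if (i, j) in C else prod
    return C

# Example: the (min, +) semiring composes two-hop shortest-path costs.
A = {(0, 1): 3.0, (1, 2): 1.0}
print(spgemm(A, A, add=min, mul=lambda x, y: x + y))     # {(0, 2): 4.0}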
Why Focus on SpGEMM? Applications of Sparse GEMM
- Graph clustering (MCL, peer pressure)
- Subgraph / submatrix indexing
- Shortest path calculations
- Betweenness centrality
- Graph contraction
- Cycle detection
- Multigrid interpolation & restriction
- Colored intersection searching
- Applying constraints in finite element computations
- Context-free parsing
- ...

Loop until convergence:
A = A * A;             % expand
A = A .^ 2;            % inflate
sc = 1 ./ sum(A, 1);
A = A * diag(sc);      % renormalize
A(A < 0.0001) = 0;     % prune entries
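
A runnable SciPy rendering of the MCL-style loop above; the fixed iteration count and the 1e-4 pruning threshold are illustrative choices, and the input is assumed to be a similarity/adjacency matrix:

# Markov Cluster (MCL)-style iteration with SciPy sparse matrices (illustrative).
import numpy as np
from scipy.sparse import csr_matrix, diags

def mcl_iterate(A, iters=10, prune=1e-4):
    """Repeatedly expand (A*A), inflate (elementwise square), renormalize columns, prune."""
    A = csr_matrix(A, dtype=float)
    for _ in range(iters):
        A = A @ A                                 # expand
        A = A.multiply(A).tocsr()                 # inflate (elementwise square)
        col_sums = np.asarray(A.sum(axis=0)).ravel()
        col_sums[col_sums == 0] = 1.0             # guard against empty columns
        A = (A @ diags(1.0 / col_sums)).tocsr()   # renormalize columns
        A.data[A.data < prune] = 0.0              # prune small entries
        A.eliminate_zeros()
    return A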
Outline
- The Case for Primitives
- The Case for Sparse Matrices
- Parallel Sparse Matrix-Matrix Multiplication
- Software Design of the Combinatorial BLAS
- An Application in Social Network Analysis
- Other Work
- Future Directions
Two Versions of Sparse GEMM
1D block-column distribution:
  [A1 A2 ... A8] x [B1 B2 ... B8] = [C1 C2 ... C8], with Ci = Ci + A Bi
Checkerboard (2D block) distribution:
  Cij += Aik Bkj (blocks indexed by i, j, k)
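
A serial simulation of the 1D block-column variant (Python/SciPy): the owner of block column Bi computes Ci = Ci + A * Bi. Matrix size, sparsity, and the number of simulated processors are arbitrary choices:

# Simulating 1D block-column sparse GEMM: "processor" i owns B_i and C_i (illustrative).
import numpy as np
from scipy.sparse import random as sprandom, hstack

n, p = 64, 8                                     # matrix dimension and processor count
A = sprandom(n, n, density=0.05, format="csc", random_state=0)
B = sprandom(n, n, density=0.05, format="csc", random_state=1)

col_blocks = np.array_split(np.arange(n), p)     # block-column ownership
C_blocks = []
for owned in col_blocks:
    B_i = B[:, owned]                            # locally owned block column of B
    C_i = A @ B_i                                # C_i = C_i + A * B_i (C starts empty)
    C_blocks.append(C_i)

C = hstack(C_blocks, format="csc")
assert np.allclose(C.toarray(), (A @ B).toarray())   # matches the unsplit product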
Comparative Speedup of Sparse 1D & 2D (projected performances)
[Plots: projected speedup of the 1D and 2D algorithms as functions of problem size N and processor count P]
In practice, 2D algorithms have the potential to scale, but not linearly:

W = p (T_comp + T_comm) = γcn + (γ + β)cn√p + γc²n lg(c²n/p) + αp√p
Compressed Sparse Columns (CSC): A Standard Layout
n×n matrix with nnz nonzeros
Column pointers: 0 4 8 10 11 12 13 16 17
rowind: 0 2 3 4 0 1 5 7 2 3 3 4 5 4 5 6 7
data: (one value per stored entry)
- Stores entries in column-major order
- Dense collection of "sparse columns"
- Uses O(n + nnz) storage
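
A short sketch of matrix-vector multiplication driven directly by the colptr / rowind / data arrays above (Python/NumPy; the 3-by-3 example values are made up):

# y = A x computed straight from the CSC arrays (illustrative).
import numpy as np

def csc_matvec(n, colptr, rowind, data, x):
    """Column-major SpMV: scatter each sparse column j, scaled by x[j], into y."""
    y = np.zeros(n)
    for j in range(n):
        for k in range(colptr[j], colptr[j + 1]):
            y[rowind[k]] += data[k] * x[j]
    return y

# Hypothetical 3x3 matrix: column 0 holds A[0,0]=1 and A[2,0]=2; column 1 holds A[1,1]=3; column 2 holds A[0,2]=4.
colptr = np.array([0, 2, 3, 4])
rowind = np.array([0, 2, 1, 0])
data   = np.array([1.0, 2.0, 3.0, 4.0])
x = np.ones(3)
print(csc_matvec(3, colptr, rowind, data, x))    # -> [5. 3. 2.]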
Node Level Considerations
- Submatrices are "hypersparse" (i.e. nnz < n)

Conclusions
- If communication costs dominate the computation for your algorithm, overlapping them will NOT make it scale.
- Scale your algorithm, not your implementation.
- Any sustainable speedup is a success!
- Don't get disappointed with poor parallel efficiency.
Related Publications
- Hypersparsity in 2D decomposition, sequential kernel: Buluç and Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IPDPS 2008.
- Parallel analysis of sparse GEMM, synchronous implementation: Buluç and Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", ICPP 2008.
- The case for primitives, APSP on the GPU: Buluç, Gilbert, and Budak, "Solving Path Problems on the GPU", Parallel Computing, 2009.
- SpMV on multicores: Buluç, Fineman, Frigo, Gilbert, and Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks", SPAA 2009.
- Betweenness centrality results: Buluç and Gilbert, "Parallel Sparse Matrix-Matrix Multiplication and Large Scale Applications".
- Software design of the library: Buluç and Gilbert, "Parallel Combinatorial BLAS: Interface and Reference Implementation".
Acknowledgments
David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, Matteo Frigo, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Kamesh Madduri, Steve Reinhardt, Eric Robinson, Viral Shah, Sivan Toledo