A Unified Framework for Numerical and Combinatorial Computing

Combinatorics in Computing

A rich variety of tools helps researchers with high-performance numerical computing, but few tools exist for large-scale combinatorial computing. The authors describe their efforts to build a common infrastructure for numerical and combinatorial computing by using parallel sparse matrices to implement parallel graph algorithms.

John R. Gilbert and Viral B. Shah, University of California, Santa Barbara
Steve Reinhardt, Interactive Supercomputing

Modern scientific applications often mix combinatorial computing and numerical computing. Scientists have found that they can often gain key scientific insights by studying the relationships between a system's individual elements, rather than studying them in isolation. Combinatorial data structures such as graphs help describe such relationships.

Very high-level languages (VHLLs) are already popular among scientists and engineers. These languages provide native support for complex data structures, comprehensive numerical libraries, and visualization, along with an interactive environment for execution, editing, and debugging. Matlab and Python are examples of commonly used VHLLs for scientific computing. In this article, we focus our attention on Matlab and Star-P (a parallel implementation of the Matlab programming language).





Traditionally, numerical algorithms and graph algorithms have been developed and implemented separately. Many VHLLs for scientific computing provide a comprehensive infrastructure for numerical algorithms, but graph data structures and algorithms are often deployed as an afterthought, which restricts their versatility and interoperability with the rest of the system. Fortunately, we can unify the diverse worlds of numerical and combinatorial computing in terms of a common infrastructure for sparse matrices. Sparse matrices have long been thought of as graphs, and many sparse matrix algorithms are built with graph algorithms.1–4 We believe this relationship can be turned around: graph algorithms can be efficiently designed and implemented using methods and systems originally developed for sparse linear algebra. An early example of this approach is John R. Gilbert and Shang-Hua Teng's mesh-partitioning toolbox.5 Sparse matrix computations allow structured representation of irregular data structures and access patterns in parallel applications. In our work, we use the distributed sparse matrix type in Star-P as the basis for an infrastructure for computing with graphs.6–8 This approach has many desirable characteristics: the implementation is written in the VHLL (here, Matlab), making the codes short, simple, and readable, and the VHLL code is data-parallel, with a single thread of control.


This makes it easier to write and debug programs. The efficiency of the graph algorithms depends on the efficiency of the underlying sparse matrix infrastructure; parallelism is derived from operations on parallel sparse matrices. We use the graph primitives described here to implement several graph algorithms in the graph algorithms and pattern discovery toolbox (GAPDT) we developed.8,9 From the outset, we designed the toolbox to run interactively with terascale graphs via Star-P, scaling to tens or hundreds of processors. High performance and interactivity are this toolbox's salient features.

Sparse Matrices and Graphs

A graph consists of a set V of nodes connected by directed or undirected edges E. We can specify a graph with tuples (u, v, w) to indicate a directed edge of weight w from node u to node v; this is the same as a nonzero w at location (u, v) in a sparse matrix. The storage required is Θ(|V| + |E|). A symmetric sparse matrix represents an undirected graph. Every sparse matrix problem is a graph problem, and every graph problem is a sparse matrix problem. Gilbert, Cleve Moler, and Rob Schreiber discuss the basic design principles for a comprehensive sparse matrix infrastructure,10 and Viral B. Shah and Gilbert describe additional considerations for parallel environments.7 Let's reiterate some of the basic design principles for sparse matrix data structures and algorithms:

• The storage for a sparse matrix of size n-by-n with nnz nonzeros should be Θ(max(n, nnz)).
• An operation on sparse matrices should take time approximately proportional to the size of the data accessed and the number of nonzero arithmetic operations on it.

These principles assure efficient operations on graphs as well. Consider, for example, constructing a graph from edge-vertex tuples. The sparse() function accepts three vectors (U, V, W) and constructs a sparse matrix G with a nonzero w at location (u, v) in the sparse matrix. In our terms, G is also a graph with an edge of weight w between nodes u and v. We can recover edge-vertex tuples from a graph by using the dual of sparse(), which is find(). Both these operations take time proportional to the number of nonzeros in the sparse matrix, or the number of edges and nodes in the graph. Similarly, we can express many graph queries as sparse matrix operations.
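As a concrete round trip between the two views, here is a minimal Matlab sketch; the small edge list is made up purely for illustration, and sparse() and find() are the standard Matlab functions named in the text.

% Build a small directed, weighted graph from edge-vertex tuples (made-up data).
U = [1 1 2 3];               % source nodes
V = [2 3 3 1];               % destination nodes
W = [10 20 30 40];           % edge weights
G = sparse(U, V, W, 4, 4);   % G(u, v) = w; the graph has 4 nodes

% find() is the dual of sparse(): recover the edge list from the graph.
[U2, V2, W2] = find(G);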

Table 1. Simple sparse matrix operations that perform basic graph operations.

Sparse matrix operation            Graph operation
G = sparse(U, V, W)                Construct a graph from an edge list
[U, V, W] = find(G)                Obtain the edge list from a graph
vtxdeg = sum(spones(G))            Node degrees for an undirected graph
indeg = sum(spones(G))             In-degrees for a directed graph
outdeg = sum(spones(G), 2)         Out-degrees for a directed graph
N = G(i, :)                        Find all neighbors of node i
Gsub = G(subset, subset)           Extract a subgraph of G
G(i, j) = W                        Add or modify graph edges
G(i, j) = 0                        Delete edges from a graph
G(I, :) = []; G(:, I) = []         Delete nodes from a graph
G = G(perm, perm)                  Permute nodes of a graph
reach = G * start                  Breadth-first search step

The function spones(G) replaces all edge weights with weight 1, which lets us compute in-degrees of nodes as column sums and out-degrees as row sums. For undirected graphs, the in-degrees and out-degrees are the same; the row and column sums of the corresponding sparse matrix are equal because the matrix is symmetric.

Modern scientific languages derive a great deal of their expressive power from indexing operations, and sparse matrix indexing is a powerful notation for manipulating graphs. The neighbors of node i in graph G, for example, are the off-diagonal nonzeros in row i of the sparse matrix representing G. The indexing operation that performs this query is N = G(i, :). We can extract an induced subgraph of G by selecting the rows and columns corresponding to the desired nodes from the corresponding sparse matrix. This looks like Gsub = G(subset, subset), where subset is a list of nodes in the resulting subgraph. Matrix indexing on the right-hand side extracts submatrices, or subgraphs; matrix indexing on the left-hand side results in assignment. We can also relabel graph vertices by permuting the rows and columns symmetrically. If perm is a permutation of 1:n, for example, the operation G = G(perm, perm) relabels the nodes of G according to the permutation. Table 1 lists these and other corresponding matrix and graph operations.
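Pulling a few of these indexing idioms together, here is a small hedged sketch; it assumes G is an n-by-n sparse adjacency matrix, and the node index and node subset are made up for illustration.

% Indexing idioms from Table 1 on a sparse graph G (illustrative only).
n      = length(G);
outdeg = sum(spones(G), 2);    % out-degrees as row sums (degrees, if G is symmetric)
i      = 1;                    % a made-up node of interest
nbrs   = find(G(i, :));        % neighbors of node i: nonzero columns of row i
subset = [1 2 3];              % a made-up list of nodes
Gsub   = G(subset, subset);    % induced subgraph on those nodes
perm   = randperm(n);          % a random relabeling of the nodes
Grel   = G(perm, perm);        % the same graph with nodes relabeled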

Figure 1. Breadth-first search implemented with sparse matrix/sparse vector multiplication. We initialize a sparse vector with a 1 in the position corresponding to the start node. Repeated multiplication yields multiple breadth-first steps on the graph. The graph can be either directed or undirected.

function C = contract (G, labels)
% Contract nodes of a graph: nodes that share a label are merged,
% and the weights of merged edges are added together.
n = length (G);
m = max (labels);
S = sparse (labels, 1:n, 1, m, n);   % S(j, i) = 1 when node i maps to new node j
C = S * G * S';

Figure 2. Parallel graph contraction. We can perform this contraction via sparse matrix multiplication.

GAPDT provides several tools to manipulate graphs, including scalable graph generators for Erdős-Rényi random graphs, several kinds of meshes, and power-law graphs. It also includes several graph algorithms, such as breadth-first search, connected components, strongly connected components, maximal independent set, maximum-weight spanning tree, and graph contraction. The toolbox provides scalable routines for graph partitioning and clustering, including a geometric mesh-partitioning algorithm for partitioning large, well-formed meshes. We've also included a spectral graph-partitioning algorithm for general graphs and non-negative matrix factorization algorithms for clustering. Because visualization is an important component of any interactive tool, we provide a scalable visualization routine to view the structure of large graphs.
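GAPDT's generators themselves aren't reproduced here; as a rough stand-in, the following sketch builds an undirected Erdős-Rényi-style random graph directly from Matlab's sparse primitives. The size and density are made up.

% A rough stand-in for a scalable random-graph generator (not GAPDT's code).
n = 1e4;                    % number of nodes (made-up size)
p = 8 / n;                  % nonzero density, giving an average degree of about 8
A = sprand(n, n, p);        % random sparse matrix with about p*n^2 nonzeros
G = spones(triu(A, 1));     % keep the strict upper triangle; drop weights and self-loops
G = G + G';                 % symmetrize: an undirected, unweighted random graph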

Graph Algorithms

Let's review our implementation of a few common graph algorithms to demonstrate the versatility of the array-based infrastructure: breadth-first search, connected components, and graph contraction.

Breadth-First Search

We can perform a breadth-first search by multiplying a sparse matrix A with a sparse vector x. Consider such a search starting from node i. We take x to be a vector with x(i) = 1 and all other elements zero. The product y = A * x simply picks out column i of A, which contains the neighbors of node i. (If the graph is directed, this produces the in-neighbors of i; to produce the out-neighbors, we would compute x^T A or A^T x.) Repeating the multiplication yields a vector that's a linear combination of all columns of A corresponding to the nonzero elements in vector x, or all nodes at distance at most 2 from node i. Figure 1 shows a breadth-first search on a directed graph.

We can perform several independent breadth-first searches simultaneously by using sparse matrix/matrix multiplication. Instead of multiplying with a vector, we multiply with a matrix, with one column for each starting node. Thus, we compute Y = A * X, after which column j of Y contains the result of performing an independent breadth-first search starting from the node (or nodes) specified by column j of X. The total work to perform a breadth-first search with sparse matrix multiplication is the same as that obtained via other efficient graph data structures.
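A minimal Matlab sketch of this idea follows. It is our illustration rather than GAPDT's routine; it assumes G is a sparse adjacency matrix with G(u, v) nonzero for an edge from u to v, and start is a hypothetical start-node index.

% Breadth-first search by repeated sparse matrix/vector multiplication
% (an illustrative sketch, not the toolbox's BFS routine).
n = length(G);
x = sparse(start, 1, 1, n, 1);        % current frontier: 1 at the start node
dist = inf(n, 1);  dist(start) = 0;   % BFS level of every node
d = 0;
while nnz(x) > 0
    d = d + 1;
    y = G' * x;                       % reaches all out-neighbors of the frontier
    new = find(y);
    new = new(isinf(dist(new)));      % keep only nodes not seen before
    dist(new) = d;
    x = sparse(n, 1);
    x(new) = 1;                       % next frontier
end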

Connected Components

A connected component of an undirected graph is a maximal connected subgraph. Every node in the graph belongs to exactly one connected component. We implement an algorithm from Baruch Awerbuch and Yossi Shiloach to find connected components of a graph in parallel.11 The algorithm works by combining trees of nodes, such that all nodes in a given tree belong to the same connected component; the roots of the trees serve as node labels. The algorithm finishes with one tree per component, labeling each graph node with the root of its component's tree. We store the trees in a parent vector P; the parent of node i is stored in P(i). We then use pointer jumping to find nodes that belong to the same connected component. Pointer jumping replaces a node's label with that of its parent. Repeating this step several times traverses the trees stored in P, replacing a node's label with that of its ancestor. Eventually, all nodes that belong to the same component will have the same label as that of the root of the tree. Because the trees are stored in a vector, pointer jumping can be performed by vector indexing; for example, P = P(P) performs one jump, simultaneously replacing all node labels


with their parent labels. We obtain parallelism from data-parallel operations on large vectors.8
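The following is a heavily simplified, data-parallel sketch of the same general idea (hooking plus pointer jumping on a parent vector). It is our illustration, not the toolbox's Awerbuch-Shiloach implementation, and it converges to the smallest node index in each component as the component label.

% Simplified connected-components sketch via hooking and pointer jumping
% (illustrative only; not the toolbox's Awerbuch-Shiloach code).
% G is a symmetric sparse adjacency matrix with n nodes.
n = length(G);
P = (1:n)';                                        % each node starts as its own root
[U, V] = find(G);                                  % undirected edge list (both directions)
while true
    Pold = P;
    nbrmin = accumarray(U, P(V), [n 1], @min, inf);  % smallest neighbor label per node
    P = min(P, nbrmin);                            % hooking: adopt the smaller label
    P = P(P);                                      % pointer jumping: P(i) <- P(P(i))
    if isequal(P, Pold), break; end
end
% On exit, P(i) is the smallest node index in node i's connected component.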

Graph Contraction

Many graph algorithms proceed by solving the problem in question iteratively on smaller subgraphs. But nodes in a graph are sometimes relabeled during a computation, resulting in nodes sharing labels. Graph contraction combines nodes with the same label, merging edges incident on those nodes as well. We can efficiently implement contraction as multiplication by a strategically chosen sparse matrix. The code fragment in Figure 2 shows how: it creates a sparse matrix S with n nonzeros (all ones). The column index of each nonzero is a node's original label, whereas the row index is its new label in the contracted graph. As a result, nodes to be combined end up sharing the same row in S. Multiplying the graph G with S from the left combines rows that share labels; similarly, multiplying G with the transpose of S from the right combines columns with the same label. When contraction causes edges to merge, this implementation adds their weights together. We can apply different rules for the weight of merged edges by performing matrix multiplication over different semirings.
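For instance, here is a small usage sketch of the routine in Figure 2 on a made-up four-node graph in which nodes 1 and 2 are merged.

% Usage sketch for the contraction routine in Figure 2 (made-up data).
G = sparse([1 2 3 4], [3 3 4 1], [5 7 2 1], 4, 4);   % edges 1->3, 2->3, 3->4, 4->1
labels = [1 1 2 3];         % nodes 1 and 2 share label 1, so they are merged
C = contract(G, labels);    % 3-by-3 contracted graph; C(1,2) = 5 + 7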

Applications

Two applications give a feel for our infrastructure: the first is a purely combinatorial graph-clustering benchmark, and the second is an application in computational ecology that combines numerical and combinatorial methods to model connectivity in heterogeneous landscapes.

Graph Clustering

Our implementation of a graph-clustering benchmark arose from the HPCS Scalable Synthetic Compact Applications project.12 The benchmark consists of multiple kernels that access a single data structure representing a directed multigraph with weighted edges. Additional information appears elsewhere.6,8 The data generator generates an edge list in random order for a multigraph of sparsely connected cliques, as Figure 3 shows. The four kernels are

1. Create a data structure for the later kernels.
2. Search the graph for a maximum weight edge.
3. Perform breadth-first searches from a set of start nodes.
4. Recover the clusters from the undirected graph.


Figure 3. Undirected graph from the graph-clustering benchmark. This visualization is produced by relaxing the graph's Fiedler coordinates projected onto a sphere.
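The Fiedler coordinates mentioned in the caption come from the small eigenvectors of the graph Laplacian. The following is a rough, hedged sketch of that computation (our illustration, not the toolbox's visualization code); the small negative shift passed to eigs keeps the factored matrix nonsingular.

% Rough sketch: Fiedler coordinates of a graph G via its Laplacian (illustrative).
n = length(G);
L = diag(sum(G, 2)) - G;          % graph Laplacian (sparse)
[X, D] = eigs(L, 4, -0.01);       % eigenpairs for the smallest eigenvalues
[~, order] = sort(diag(D));       % ascending: 0 first, then the Fiedler value
coords = X(:, order(2:4));        % drop the constant eigenvector; 3-D coordinates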

Our implementation of these four kernels is a couple of hundred lines of code, and it works on shared-memory as well as distributed-memory architectures (Star-P runs on both). Kernel 1 uses the sparse() function to create a graph from an edge list, Kernel 2 uses the find() function to locate the maximum weight edge, and Kernel 3 uses sparse matrix/sparse matrix multiplication for parallel breadth-first search. For Kernel 4, we experimented with both breadth-first-search-based "seed growing" methods and "peer pressure" algorithms6,8 to recover the clusters (see Figure 4).

We ran our implementation of the graph benchmark in Star-P on a graph generated with 2 million nodes (scale 21). The multigraph has 321 million directed edges; the undirected graph corresponding to the multigraph has 89 million edges. The graph has 32,000 cliques, the largest with 128 nodes. Because the input graph is a collection of sparsely connected cliques, most of the edges lie within cliques; in this case, there are only 212,000 undirected edges between cliques. All kernels except Kernel 3 scale with the number of processors, as Figure 5 shows. Kernel 3 doesn't scale in our implementation (even though sparse matrix multiplication does scale) because of the excessive overhead of client-server communication within Star-P.
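To give a flavor of the kernels, here are two hedged fragments: Kernel 2 as a few lines over the edge list, and a single voting round of a peer-pressure style clustering step. Both are illustrative sketches, not the benchmark implementation itself, and the variable names are ours.

% Kernel 2 sketch: locate a maximum-weight edge from the edge list.
[U, V, W] = find(G);
[maxw, k] = max(W);
heaviest = [U(k), V(k)];             % endpoints of one heaviest edge

% One "peer pressure" voting round (illustrative, not the benchmark's Kernel 4):
% every node adopts the most common cluster label among its neighbors.
n = length(G);
C = (1:n)';                          % current cluster labels, one per node
S = sparse(C, 1:n, 1, n, n);         % S(c, i) = 1 if node i currently has label c
Votes = S * spones(G);               % Votes(c, i) = number of i's neighbors with label c
[~, C] = max(Votes, [], 1);          % winning label per node (isolated nodes get label 1)
C = C(:);                            % column vector of updated labels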


Figure 4. Clusters. (a) A spy plot of the input graph, and (b) the result of clustering in Kernel 4. Clusters are revealed as dense blocks on the diagonal.

Figure 5. Execution times for the graph benchmark in Star-P (data generator and Kernels 1 through 4 versus number of processors). We generated the input graph with scale 21 and ran the benchmark on an SGI Altix with 128 Itanium II processors and 128 Gbytes of RAM.

Computational Ecology

Circuitscape,13 a tool written originally in Java and now in Matlab, uses circuit theory to model animal movement and gene flow in heterogeneous landscapes. Landscapes are modeled as resistive networks; current flow across a model landscape takes into account multiple dispersal pathways, which can be useful in explaining patterns of gene flow and genetic differentiation among animal populations.14 Analyzing a landscape requires first converting the landscape of interest into a graph, with the areas between which connectivity is to be measured (such as nature reserves or animal populations) represented as polygons. To accomplish this, Circuitscape first reads a raster cell map as an m × n conductance matrix, where each nonzero element represents a cell in the landscape. Next, it uses stencil operations to convert the m × n conductance matrix into a graph with mn nodes. Graph edges correspond to connections between cells

in the landscape, typically first- or second-order neighbors. Habitats, which span several cells, are contracted into a single graph node. As a result, all cells neighboring a habitat become neighbors of the contracted node. The connected components algorithm then removes any disconnected parts of the landscape. Finally, Circuitscape constructs the graph Laplacian with elementary matrix operations and uses Kirchhoff's laws to compute current flows by solving a sparse linear system.

These operations use both combinatorial and numerical algorithms. Combinatorial algorithms first preprocess the landscape graph (see Figure 6a), so that numerical algorithms can then compute current flows in the landscape (see Figure 6b). Even at moderate resolutions, the underlying graphs of habitat maps can be quite large, depending on the size of the landscape and the species being modeled. At large scales, combinatorial algorithms are also used to compute a preconditioner15 to accelerate the iterative solution of linear systems.

The original Circuitscape code ran sequentially and typically took several hours to process a landscape with hundreds of thousands of cells; a landscape with a million cells took roughly three days. Our improvements included using GAPDT to speed up the combinatorial computations, using iterative methods to speed up the solution of linear systems (and allow memory use to scale), introducing parallel processing with Star-P, and general vectorization. The new Circuitscape can process landscapes with as many as 40 million cells in an hour.8
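The numerical phase boils down to a Laplacian solve. The following Matlab sketch is our illustration of that step under simple assumptions (a symmetric sparse conductance graph G and hypothetical source and ground nodes s and t); it is not Circuitscape's actual code, which also handles habitat contraction, multiple node pairs, and preconditioned iterative solvers.

% Illustrative Laplacian solve for current flow (not Circuitscape's code).
% G is a symmetric sparse conductance graph; s and t are hypothetical
% source and ground nodes between which one unit of current flows.
n = length(G);
L = diag(sum(G, 2)) - G;                % weighted graph Laplacian (sparse)
b = zeros(n, 1);  b(s) = 1;  b(t) = -1; % inject current at s, extract it at t
keep = setdiff(1:n, t);                 % ground node t so the system is nonsingular
v = zeros(n, 1);
v(keep) = L(keep, keep) \ b(keep);      % node potentials from the sparse solve
Reff = v(s) - v(t);                     % effective resistance between s and t
[U, V, W] = find(tril(G));              % each undirected edge once, with conductance W
Iedge = W .* abs(v(U) - v(V));          % current on each edge, by Ohm's law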

Although high-performance numerical computing is a well-developed field, high-performance combinatorial computing is in its infancy. Our work aims to build on the large body of tools and techniques for numerical computing (in particular, for sparse matrix computation) to create efficient, effective, usable tools for large-scale problems that require both discrete and numerical computation. In ongoing work, we're extending our sparse matrix infrastructure to accommodate a larger variety of graph algorithms. One example is supporting sparse matrix multiplication on arbitrary semirings: graph construction from triples currently only lets us add duplicate edges, and we'd like to allow other user-defined schemes. Another exciting research problem is the choice of sparse structures and algorithms to efficiently manipulate "hypersparse" graphs, which arise in highly parallel settings.


Figure 6. Connectivity modeling for mountain lions in Southern California. In (a), disconnected parts of the landscape are shown in red, and habitats, which are contracted nodes, are shown in green. The result of the combinatorial phase is used as an input to the numerical phase (b), which shows current flow across the landscape.

General submatrix indexing and sparse matrix/matrix multiplication are examples of primitives that are ubiquitous in array-based sparse graph algorithms but haven't been studied extensively by the numerical computing community; developing more efficient algorithms (which could be polyalgorithms) for such primitives is a fertile area for further research.

References
1. G. Birkhoff and A. George, "Elimination by Nested Dissection," SIAM J. Numerical Analysis, vol. 10, no. 2, 1973, pp. 345–363.
2. T.F. Coleman, A. Edenbrandt, and J.R. Gilbert, "Predicting Fill for Sparse Orthogonal Factorization," J. ACM, vol. 33, no. 3, 1986, pp. 517–532.
3. I.S. Duff and J.K. Reid, "Algorithm 529: Permutations to Block Triangular Form [F1]," ACM Trans. Mathematical Software, vol. 4, no. 2, 1978, pp. 189–192.
4. J.R. Gilbert, "Predicting Structure in Sparse Matrix Computations," SIAM J. Matrix Analysis and Applications, vol. 15, no. 1, 1994, pp. 62–79.
5. J.R. Gilbert and S.-H. Teng, "MATLAB Mesh Partitioning and Graph Separator Toolbox," 2002; www.cerfacs.fr/algor/Softs/MESHPART/index.html.
6. J.R. Gilbert, S. Reinhardt, and V.B. Shah, "High Performance Graph Algorithms from Parallel Sparse Matrices," LNCS 4699, B. Kagstrom et al., eds., Springer, 2006, pp. 260–269.
7. V. Shah and J.R. Gilbert, "Sparse Matrices in MATLAB*P: Design and Implementation," LNCS 3296, L. Bouge and V.K. Prasanna, eds., Springer, 2004, pp. 144–155.
8. V.B. Shah, An Interactive System for Combinatorial Scientific Computing with an Emphasis on Programmer Productivity, PhD thesis, Dept. Computer Science, Univ. Calif., Santa Barbara, June 2007.
9. J.R. Gilbert, S. Reinhardt, and V. Shah, "An Interactive Environment to Manipulate Large Graphs," Proc. 2007 IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, IEEE CS Press, 2007, pp. IV-1201–IV-1204.
10. J.R. Gilbert, C. Moler, and R. Schreiber, "Sparse Matrices in MATLAB: Design and Implementation," SIAM J. Matrix Analysis and Applications, vol. 13, no. 1, 1992, pp. 333–356.
11. B. Awerbuch and Y. Shiloach, "New Connectivity and MSF Algorithms for Shuffle-Exchange Network and PRAM," IEEE Trans. Computers, vol. 36, no. 10, 1987, pp. 1258–1263.
12. D. Bader et al., HPCS Scalable Synthetic Compact Applications #2, version 1.1, 2005; www.highproductivity.org/SSCABmks.htm.
13. B. McRae, Circuitscape User Manual, v. 2.2, 2006; www.nceas.ucsb.edu/~mcrae/software/circuitscape.htm.
14. B.H. McRae, "Isolation by Resistance," Evolution, vol. 60, no. 8, 2006, pp. 1551–1561.
15. R.D. Falgout and U.M. Yang, "Hypre: A Library of High Performance Preconditioners," LNCS 2331, P.M.A. Sloot et al., eds., Springer, 2002, pp. 632–641.

John R. Gilbert is a professor at the University of California, Santa Barbara. His technical interests include graph computations, sparse matrix computations, and parallel computing. Gilbert has a PhD in computer science from Stanford University. Contact him at [email protected].

Steve Reinhardt is vice president of joint research at Interactive Supercomputing. His technical interests include parallel computing architectures and compilers, and graph computations. Reinhardt has an MS in biological science with a minor in bioinformatics from the University of Minnesota, Twin Cities. Contact him at sreinhardt@interactivesupercomputing.com.

Viral B. Shah is a visiting scholar at the University of California, Santa Barbara, and a senior research engineer at Interactive Supercomputing. His technical interests include graph computations and high-performance computing. Shah has a PhD in computer science from the University of California, Santa Barbara. Contact him at [email protected].

