Optimizing Parallel Sparse Matrix-Vector Multiplication by Partitioning

Michael M. Wolf (1), Erik G. Boman (2), and Bruce A. Hendrickson (3)

(1) Dept. of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
(2) Scalable Algorithms Dept, Sandia National Laboratories, NM, USA *
(3) Computer Science and Informatics Dept, Sandia National Laboratories, NM, USA

[email protected], [email protected], [email protected]

Abstract. Sparse matrix times vector multiplication is an important kernel in scientific computing. We study how to optimize the performance of this operation in parallel by reducing communication. We review existing approaches and present a new partitioning method for symmetric matrices. Our method is simple and can be implemented using existing software for hypergraph partitioning. Experimental results show that our method produces partitions of better quality than traditional 1-dimensional partitioning methods and is competitive with 2-dimensional methods. It is also fast. Finally, we propose a graph model for an ordering problem to further optimize our approach.

Key words: parallel algorithms, combinatorial scientific computing, partitioning, sparse matrix-vector multiplication
1 Parallel Matrix-Vector Multiplication
Sparse matrix-vector multiplication is a common kernel in many computations, e.g., iterative solvers for linear systems of equations and the PageRank computation (power method) for ranking web pages. Often the same matrix is used for many iterations. An important combinatorial problem in parallel computing is how to distribute the matrix and the vectors among processors so as to minimize the communication cost. Such "communication" is also important on single-processor computers with deep memory hierarchies, where slow memory is typically much slower than fast memory. Since processor speeds increase much more rapidly than memory speeds, we expect memory latency and bandwidth to grow in importance. We focus on minimizing the total communication volume while keeping the computation balanced across processes. Our assumption is that both the matrix and the vectors are distributed among processes and that the process that "owns" a_ij computes the contribution from that matrix entry. We review current one- and two-dimensional data distributions, and propose a new "corner" partitioning method.
* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin company, for the US Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
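To make this cost model concrete, the following sketch (Python; illustrative only and not part of the software used in this work) counts the words communicated for one product y = Ax under the owner-computes rule, given an assignment of nonzeros and vector entries to processes. The function and argument names are ours.

    # Minimal sketch of the communication-volume model described above
    # (illustrative only). The owner of a_ij computes a_ij * x_j; x_j must be
    # sent to every other process holding a nonzero in column j, and every
    # other process holding a nonzero in row i sends one partial sum for y_i.
    def communication_volume(nonzero_owner, x_owner, y_owner):
        """nonzero_owner: dict {(i, j): process} for each nonzero a_ij.
        x_owner[j] / y_owner[i]: process owning vector entry x_j / y_i.
        Returns the total number of words sent for one y = A*x."""
        col_procs, row_procs = {}, {}
        for (i, j), p in nonzero_owner.items():
            col_procs.setdefault(j, set()).add(p)
            row_procs.setdefault(i, set()).add(p)
        expand = sum(len(procs - {x_owner[j]}) for j, procs in col_procs.items())
        fold = sum(len(procs - {y_owner[i]}) for i, procs in row_procs.items())
        return expand + fold

Under this model, the two partitionings of Figure 1 give volumes of 6 and 2, respectively, as discussed in the next section.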
2 One- and Two-Dimensional Partitioning
In one-dimensional sparse matrix partitioning, each partition is assigned the nonzeros belonging to a set of rows (1-D row partitioning) or a set of columns (1-D column partitioning). One-dimensional partitioning is sufficient for many problems, and most applications use matrices distributed in a one-dimensional manner. For particular problems, however, one-dimensional partitioning is potentially disastrous in terms of communication volume. The "arrowhead" matrix shown in Figure 1(a) is an example for which one-dimensional partitioning is inadequate. For the bisection (k = 2) case shown in the figure, any load-balanced one-dimensional partitioning yields a communication volume of approximately (3/4)n for the matrix-vector product. This is far from the minimum communication volume for this problem, and it is unacceptable for the communication volume to scale with n for this matrix. Thus, we need more flexible partitioning than traditional one-dimensional partitioning.

Two-dimensional partitioning is a more flexible alternative to one-dimensional partitioning, in which no partition is assigned an entire row or column. Rather, partitions are assigned to smaller sets of nonzeros. In the most flexible method, each nonzero is assigned to a partition independently. This problem can be modeled as a fine-grain hypergraph [1], where each nonzero is represented by a vertex. Each row is represented by a hyperedge (h1-h8 in Figure 1(b)), and likewise each column is represented by a hyperedge (h9-h16 in Figure 1(b)). The vertices are partitioned into k equal sets (k = 2 in Figure 1(b)) such that the number of cut hyperedges is minimized; this hyperedge cut metric equals the communication volume. Çatalyürek and Aykanat [1] proved that this fine-grain hypergraph model yields a minimum-volume partitioning when solved optimally. Figure 1(b) shows the fine-grain hypergraph partitioning of the 8×8 arrowhead matrix. The resulting communication volume is 2, a significant improvement over the volume of 6 from the optimal one-dimensional partitioning. Solving the fine-grain hypergraph model optimally is NP-hard, but heuristics yield near-optimal solutions in near-linear time. Unfortunately, the fine-grain hypergraph is much larger than the hypergraph for 1-D partitioning, and thus may be too expensive to partition for large matrices. The recent Mondriaan method [5] is 2-D but based on recursive 1-D partitioning, and quite fast. We propose a new, even faster, partitioning method.
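As a concrete illustration of the fine-grain model (our sketch, not code from [1] or [2]), the routine below builds the vertex and hyperedge lists from a sparse matrix; a hypergraph partitioner such as PaToH would consume an equivalent structure in its own input format. SciPy is assumed only for convenience.

    # Sketch of the fine-grain hypergraph construction (illustrative only):
    # one vertex per nonzero, one hyperedge (net) per row and one per column.
    from scipy.sparse import coo_matrix

    def fine_grain_hypergraph(A):
        """Return (num_vertices, nets). Vertex v is the v-th nonzero of A;
        nets[0:m] are the row nets and nets[m:m+n] the column nets,
        each a list of vertex ids."""
        A = coo_matrix(A)
        m, n = A.shape
        row_nets = [[] for _ in range(m)]
        col_nets = [[] for _ in range(n)]
        for v, (i, j) in enumerate(zip(A.row, A.col)):
            row_nets[i].append(v)  # row net, e.g. h1-h8 in Figure 1(b)
            col_nets[j].append(v)  # column net, e.g. h9-h16 in Figure 1(b)
        return A.nnz, row_nets + col_nets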
3 Two-Dimensional Corner Partitioning
An examination of the minimum-cut fine-grain hypergraph partitioning in Figure 1(b) suggests a new partitioning method. We see that each partition consists of a set of "corners" (more easily seen in Figure 1(c)), which are essentially one-dimensional partitions reflected across the diagonal. The hope is that, using these "corners," we can reproduce an optimal fine-grain partitioning for certain matrices at the cost of a less expensive one-dimensional partitioning method.
Fig. 1. Arrowhead matrix partitioned for two processes. (a) Row-wise (1-D) decomposition, volume = 6. (b) 2-D fine-grain hypergraph bisection, volume = 2; cut hyperedges are shaded. (c) The "corners" of the partitioning in (b).
We describe the corner partitioning method for the previously partitioned 8 × 8 arrowhead matrix. We start with a one-dimensional column partitioning of the lower triangular portion of the matrix. We then reflect this one-dimensional partitioning across the diagonal such that row i in the upper triangular portion of the matrix is assigned to the same partition as column i in the lower triangular part of the matrix. For this arrowhead matrix, this corner partitioning method produces the same optimal partitioning as obtained by the fine-grain hypergraph method at a reduced computational cost. This indicates that the corner method can be an effective two-dimensional partitioning method for some matrices.
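A minimal sketch of this reflection step (Python; illustrative only, with names of our choosing) is given below. It assumes the matrix is structurally symmetric and that a 1-D column partition of the lower triangle, for example from a hypergraph partitioner, is supplied as the array part.

    # Sketch of corner partitioning (illustrative only). part[j] is the
    # partition assigned to column j of the lower triangle; the upper
    # triangle is the reflection across the diagonal, so nonzero (i, j)
    # with i < j goes to the partition of column i.
    from scipy.sparse import coo_matrix

    def corner_partition(A, part):
        """Return a dict {(i, j): partition} for every nonzero of the
        symmetric matrix A."""
        A = coo_matrix(A)
        owner = {}
        for i, j in zip(A.row, A.col):
            owner[(i, j)] = part[j] if i >= j else part[i]
        return owner

For the arrowhead matrix of Figure 1, a balanced 1-D column partition of the lower triangle reflected in this way yields the volume-2 partitioning of Figure 1(b)-(c), as noted above.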
4 Results
We used PaToH [2] for hypergraph partitioning in order to implement one-dimensional column, two-dimensional corner, and two-dimensional fine-grain partitioning. We studied the partitioning of five symmetric matrices [3, 4], used as a benchmark set in [5], with these three methods. We partitioned these matrices into 4, 16, 64, and 256 partitions. The resulting communication volumes are reported in Table 1. We see that the corner method yielded partitionings with significantly lower communication volume than the one-dimensional method. With the exception of the lap200 matrix and a few k = 256 data points, the corner method yielded partitionings of similar or slightly better quality than the fine-grain hypergraph method. The corner method was much faster to compute than the fine-grain hypergraph method, and often slightly faster than 1-D.
Table 1. Communication volume for k-way partitioning of symmetric matrices using different partitioning methods.

Name       k     1d hyp.   fine-grain hyp.   corner
cage10     4     5379.0    4063.7            4089.3
           16    12874.5   8865.5            8920.9
           64    23463.3   16334.7           17164.0
           256   40830.9   29239.0           32138.0
lap200     4     1535.1    1538.5            1640.0
           16    3013.9    3017.9            3336.5
           64    5813.0    5786.4            6656.4
           256   11271.8   11061.4           13342.8
finan512   4     295.7     261.2             215.0
           16    1216.7    1027.4            845.0
           64    9986.0    8624.6            8135.2
           256   38985.4   26471.6           42248.6
bcsstk32   4     2111.9    1611.4            1751.3
           16    7893.1    6330.8            7220.0
           64    19905.4   18673.1           19616.4
           256   46399.0   46469.7           47695.8
bcsstk30   4     1794.4    1935.7            1531.0
           16    8624.7    9774.8            7232.2
           64    23308.0   25677.2           20351.4
           256   56100.4   57844.8           50689.4

5 Reordering

Recently, we have been studying symmetric reordering of the matrix rows and columns for the corner partitioning method. Reordering is unnecessary for one-dimensional partitioning since the related graph and hypergraph models are invariant to ordering (although in practice the results may differ when the problem is solved suboptimally using heuristics). However, the hyperedges in the corner partitioning method do depend on the row/column ordering. Thus, ordering can potentially be exploited to decrease the communication volume of a corner partitioning. In the full paper, we will empirically show the effect of reordering and then discuss graph models for optimal ordering of matrices for corner partitioning.
References

1. Ü. Çatalyürek and C. Aykanat. A fine-grain hypergraph model for 2D decomposition of sparse matrices. In Proc. IPDPS 8th Int'l Workshop on Solving Irregularly Structured Problems in Parallel (Irregular 2001), April 2001.
2. Ümit Çatalyürek and Cevdet Aykanat. PaToH home page. http://bmi.osu.edu/~umit/software.html, 2007.
3. Timothy A. Davis. The University of Florida Sparse Matrix Collection, 1994. http://www.cise.ufl.edu/research/sparse/matrices/.
4. Iain S. Duff, R. G. Grimes, and J. G. Lewis. Sparse matrix test problems. ACM Trans. Mathematical Software, 15:1–14, 1989.
5. Brendan Vastenhouw and Rob H. Bisseling. A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review, 47(1):67–95, 2005.