A New Format for the Sparse Matrix-Vector Multiplication

Ivan Šimeček [email protected]

Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic

Abstract

Algorithms for sparse matrix-vector multiplication (SpMV for short) are important building blocks in solvers of sparse systems of linear equations. Due to matrix sparsity, the memory access patterns are irregular and cache utilization suffers from low spatial and temporal locality. Register blocking formats were designed to reduce this effect. This paper introduces a new combined format for storing sparse matrices that extends the possibilities of the diagonal register blocking format.

Introduction

There are several formats for storing sparse matrices, designed mainly with the SpMV in mind. The SpMV for the most common format, compressed sparse rows (CSR for short), suffers from low performance due to indirect addressing. Many studies have been published on increasing the efficiency of the SpMV [1, 2]. Some formats, such as register blocking, eliminate indirect addressing during the SpMV, so vector instructions can be used. These formats are suitable only for matrices with a known structure of nonzero elements. The overhead of reorganizing a matrix from one format to another is often of the order of tens of SpMV executions, so such a reorganization pays off only if the same matrix A is multiplied by multiple different vectors, e.g., in iterative linear solvers.

The register blocking formats

These formats are designed to handle randomly occurring dense blocks in a sparse matrix. We discuss only register blocking (RB for short) [1, 3, 4], in which nonzero elements of matrix A are grouped into dense diagonal blocks whose sizes can differ. Our work differs from the similar project SPARSITY [1] (and its successor OSKI), which is based on different ideas:
1. It uses register blocking formats: it handles randomly occurring dense mini-blocks.
2.
It also uses a cache optimization, but with a completely different approach than ours.
3. The author overlooks the matrix transformation time overhead, in contrast to our algorithm.

Design of the new format for sparse matrices

We add the following new features to our format:
a) Combined format feature: in the classical RB format, all nonzero elements must belong to blocks, so blocks of size 1 can exist. Simply speaking, in our format some elements are stored in the CSR format, while the blocks are stored in the RB format.
b) Cache adaptive feature: we divide matrix A into disjoint nonempty rectangular regions so that the cache can hold all data accessed during the partial SpMV within each region.
c) Partially full blocks feature: to reduce the number of blocks, we drop the assumption that blocks are fully dense and also allow partially full blocks. The maximum sparsity of a block can be defined by a block heuristic; various (even architecture-dependent) block heuristics can be designed. We have designed an algorithm that finds blocks with minimal space complexity. This algorithm uses dynamic programming, so its computational and space complexity for an a × b region is approximately Θ(a·b).

Evaluation of the results

All results were measured on a Pentium Celeron 2.4 GHz with 512 MB RAM @ 266 MHz, running OS Windows XP. SW: Microsoft Visual C++ 6.0 Enterprise Edition and the Intel compiler, version 7.1. All cache events were monitored by the Intel VTune performance analyzer 7.0. For testing purposes, we used 52 real matrices from various technical areas, taken from MatrixMarket and the Harwell sparse matrix test collection. We compare the performance of the CSR format with that of our format, but it is hard to draw general conclusions from our test data set: speedups depend strongly on the structure of the nonzero elements (that is, on the "origin" of the matrix).
1) Significant speedups (sometimes over 200%) were achieved for matrices of PDE origin.
2) No speedups were achieved for matrices with a near-random structure.

Conclusion

In this paper, we described a new format for storing sparse matrices that combines the advantages of the CSR format and the RB format. This format is also adaptive to the parameters of the CPU cache.

References

[1] E. Im: Optimizing the Performance of Sparse Matrix-Vector Multiplication, dissertation thesis, University of California at Berkeley, 2001.
[2] J. Mellor-Crummey, J. Garvin: Optimizing Sparse Matrix-Vector Product Computations Using Unroll and Jam, International Journal of High Performance Computing Applications, 225–236, 2004.
[3] R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. Lee: Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply, Proceedings of Supercomputing, 2002.
[4] K. R. Wadleigh, I. L. Crawford: Software Optimization for High Performance Computing, Hewlett-Packard Professional Books, 2000.

This research has been supported by MŠMT under research program MSM6840770014.
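Appendix: to make the combined format concrete, the following minimal Python sketch (our own illustration, not the code that was evaluated above; the function name, the block representation, and all variable names are assumptions made for exposition) multiplies a vector by a matrix split into a CSR remainder and a list of dense blocks. A production kernel would be written in C and would rely on vector instructions for the block part.

```python
def spmv_combined(values, col_ind, row_ptr, blocks, x):
    """Return y = A*x for the combined format (illustrative sketch).

    values, col_ind, row_ptr : CSR storage of the isolated nonzeros.
    blocks : list of (r0, c0, dense), where dense is a k-by-k row-major
             block anchored at row r0, column c0; a partially full block
             simply stores its zeros explicitly.
    x : the input vector.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    # CSR part: the access x[col_ind[k]] is indirect addressing,
    # the source of the poor locality discussed in the introduction.
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_ind[k]]
    # Block part: direct, unit-stride addressing within each dense
    # block, which is what permits vectorization in the RB format.
    for r0, c0, dense in blocks:
        for i, row in enumerate(dense):
            for j, a in enumerate(row):
                y[r0 + i] += a * x[c0 + j]
    return y
```

For example, a 4 × 4 matrix holding a dense 2 × 2 block [[1, 2], [3, 4]] at the top-left corner plus two isolated CSR entries a[2][3] = 5 and a[3][1] = 6, multiplied by x = (1, 1, 1, 1), yields y = (3, 7, 5, 6); with an empty block list the routine degenerates to a plain CSR SpMV.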