Efficient Assembly of Sparse Matrices Using Hashing

Mats Aspnäs, Artur Signell, and Jan Westerholm

Åbo Akademi University, Faculty of Technology, Department of Information Technologies, Joukahainengatan 3–5, FI-20520 Åbo, Finland
{mats, asignell, jawester}@abo.fi

Abstract. In certain applications the non-zero elements of large sparse matrices are formed by adding several smaller contributions in random order before the final values of the elements are known. For some sparse matrix representations this procedure is laborious. We present an efficient method for assembling large irregular sparse matrices where the nonzero elements have to be assembled by adding together contributions and updating the individual elements in random order. A sparse matrix is stored in a hash table, which allows an efficient method to search for an element. Measurements show that for a sparse matrix with random elements the hash-based representation performs almost 7 times faster than the compressed row format (CRS) used in the PETSc library. Once the sparse matrix has been assembled we transfer the matrix to e.g. CRS for matrix manipulations.

1

Introduction

Sparse matrices play a central role in many types of scientific computations. There exists a large number of different storage formats for sparse matrices [1], such as the compressed row format, compressed column format, block compressed row storage, compressed diagonal storage, jagged diagonal format, transposed jagged diagonal format [2] and skyline storage. The Sparskit [3] library supports 16 different sparse matrix formats. These storage formats are designed to take advantage of the structural properties of sparse matrices, both with respect to the amount of memory needed to store the matrix and the computing time to perform operations on it. The compressed row format, for instance, is a very efficient representation for sparse matrices where rows have identical structure and where elements are accessed in increasing row and column order, e.g. when multiplying a matrix with a vector. However, in applications where a sparse matrix has to be assembled, by adding and updating individual elements in random order, before the matrix can be used, for example for inversion, these storage methods are inefficient. In this paper, we argue that hashing with long hash tables can be used to efficiently assemble individual matrix elements in sparse matrices, improving the time performance of matrix element updating by up to a factor of 50 compared to the compressed row format.

B. Kågström et al. (Eds.): PARA 2006, LNCS 4699, pp. 900–907, 2007. © Springer-Verlag Berlin Heidelberg 2007

We stress that the proposed sparse matrix format is not intended to be used as a general replacement for the traditional storage formats, and that it is not efficient for performing computations on sparse matrices, like computing matrix–vector products or solving linear systems. These algorithms access the matrix elements in a highly regular fashion, normally row-wise or column-wise, and do not exhibit any of the irregular behavior for which the hash-based representation is designed to be efficient. The hash-based representation is efficient for applications where the elements have to be accessed repeatedly and in random order. Thus, once the matrix has been assembled in the hash-based representation, it is transferred to, e.g., CRS for matrix manipulations or matrix–vector calculations.

2

Background

The problem of assembling large irregular sparse matrices originated in a project where the goal was to improve the performance of a parallel fusion plasma simulation, Elmfire [4]. In fusion plasma simulations particles move along orbits circling a torus while simultaneously making small Larmor circles along their main trajectory. Instead of explicitly calculating the long-range particle–particle interactions, a short-time-average electrostatic potential is calculated on a grid spanning the torus, and the particle–particle interaction is replaced by a local interaction with the electrostatic potential. The grid points receive contributions from every particle within a given radius of the grid point, typically several grid units. The problem of calculating the electrostatic potential thus turns into the problem of assembling and inverting a sparse matrix, where the individual elements are accumulated from the contributions of up to 100 particles at irregular moments in time as the particles are simulated forward in time.

In a typical application the electrostatic potential was represented by a 28000 × 28000 sparse matrix of double-precision floating-point values with a 10% fill rate. The plasma simulation program used a collection of sparse matrix routines from the PETSc library [5] to first assemble and then invert the distributed sparse matrix representing the electrostatic potential. Computing the inverse of the sparse matrix turned out to be very efficient, taking only a few seconds. However, the assembly of the matrix was very time consuming (on the order of 100 seconds), and the decision was made to use another sparse matrix representation to first assemble the matrix and then construct the compressed row format used in the PETSc library to invert it. Effectively, we are trading memory for speed, since the sparse matrix is assembled in one data structure (a hash table) and then converted to another data structure (the CRS format) before it is used in the computation. However, the improved efficiency of the program clearly outweighs the drawback of the increased memory consumption.


3


Sparse Matrix Representation Using Hash Tables

Our sparse matrix representation is essentially a coordinate storage format where the non-zero elements are stored in a hash table [6] to allow for an efficient method to update individual elements in random order. An element (x, y) in a two-dimensional sparse matrix with r rows and c columns, x ∈ [0, r − 1], y ∈ [0, c − 1], with value v is stored in a hash table with the key k = x·c + y. The hash function h(k) is simply the value k mod s. The size of the hash table, s, is chosen to be nz/b, where nz is the number of non-zeros in the sparse matrix and b is a small integer value, for instance 20. The number of non-zero elements in the sparse matrix is given as a parameter by the user. The idea is that, assuming the non-zero elements are evenly distributed over the hash keys, on average b elements map to any hash key, and we keep b reasonably small in order to be able to quickly locate an entry with a given hash key value. The size of the hash table is actually chosen to be a prime number s larger than or equal to nz/b. This choice will typically map the non-zero elements evenly among the different hash values.

Encoding the (x, y) coordinates of non-zero matrix elements as a single integer value k = x·c + y saves memory space. Using 32-bit integers, a non-zero value requires only 12 bytes of memory instead of the 16 bytes needed if the (x, y) coordinates and the value are stored as two integers and one double-precision floating-point value. Using 64-bit integers we need 16 bytes for each non-zero element instead of 24 bytes. A minor drawback of this approach is that the row and column indices have to be recovered with integer division and modulo operations. The encoding also puts an upper limit on the size of the sparse matrices that can be represented. The largest unsigned integer that can be represented with 32 bits (the constant UINT_MAX in C) is 2^32 − 1. Thus, the largest square matrix that can be stored has 65535 rows and columns.
The largest unsigned 64-bit integer value (ULONG_MAX in C) is 2^64 − 1, which allows for square matrices with approximately 4.2 × 10^9 rows and columns.

Matrix elements which map to the same position in the hash table are stored in an array of tuples, each consisting of a key k and a value v. The number of elements in an array is stored in the first position (with index zero) of the array, together with the allocated size of the array. All arrays are initialized to size nz/s, i.e. the average number of non-zero elements per hash key. If the non-zero values are evenly distributed over all hash keys, no memory reallocation should be necessary.

Three operations can be performed on the sparse matrix: an insert-operation, which assumes that the element does not already exist in the matrix; a get-operation, which retrieves a value; and an add-operation, which adds a value to an already existing value in the matrix. To insert a value v into position (x, y) in the matrix, we first compute its key k and the hash key h(k) as described earlier. We check whether there is space for the new element in the array given by the hash key h(k) by comparing the number of elements stored in the array with the size of the array. If the array is full, it is enlarged by calling the realloc function and updating the size of


the array. The value is inserted as a tuple (k, v) into the next free position in the array, and the number of elements in the array is incremented by one. Since memory reallocation is a time-consuming procedure, arrays are always enlarged by a factor that depends on the current size of the array (for instance by 10% of its current size). The minimum increase, however, is never less than a pre-defined constant, for instance 10 elements. This avoids small and frequent reallocations at the expense of some unused memory.

The get-operation searches for an element with a given (x, y)-position in the sparse matrix. We first compute the key k and the hash key h(k) and perform a linear search in the array for h(k), looking for an element with the given key. If the key is found we return its value; otherwise the value 0.0 is returned. The linear search needed to locate an element is short and efficient, since it goes through a contiguous array of elements, with an average length of b, stored in consecutive memory locations. This means that the search is fast in practice, even though it is an O(N) operation. The key to the success of using long hash tables is thus the short length of the array needed for each hash value h(k). An alternative approach would be to use a tree-based structure to resolve collisions among hash keys, for instance an AVL tree [6]. This would make the search an O(log N) operation, but the insert-operation would become much more complex. Another drawback would be the increase in memory needed to store the elements, since each element would then need two pointers, for its left and right subtrees, in addition to the key and the value.

The add-operation adds a given value to the value stored in the sparse matrix at position (x, y). It first performs a get-operation, as described above, to locate the element. If the element is found, i.e.
(x, y) already has a non-zero value stored in the sparse matrix representation, the given value is added to it. If the element is not found, it is inserted using the insert-operation.

3.1 Look-Up Table

In some cases there is temporal locality when assembling the elements of a sparse matrix, which implies that we will be accessing the same element repeatedly at small time intervals. To speed up accesses to such matrix elements and avoid linear searches, we store the positions of recently accessed elements in a look-up table. The look-up table is also implemented as a hash table, but with a different hash function l(k) than that used for storing the matrix elements. The look-up table contains the key of an element and the index of its position in the array. To search for an element (x, y) we compute the key k and the look-up table hash function l(k) and compare the key stored at position l(k) in the look-up table against the key k. If these are equal, the element can be found in the array associated with hash value h(k), at the position given by the index in the look-up table. More than one element can be mapped to the same position in the look-up table. No collision handling is used in the look-up table; only the last reference to an element is stored. The size of the look-up table is given as a parameter by the user when the matrix is created. The size is a fraction 1/a of the hash table size s, and is also chosen to be a prime number.


Fig. 1. Hash table structure

A typical size for the look-up table could be 1/10 of the hash table size. The data structure used to represent the sparse matrix is illustrated in Figure 1.

4

Performance

We have compared the times to assemble sparse matrices of size 10000 by 10000 with a fill degree of 10% using both the compressed row format of PETSc and our hash-based representation. The test matrices are chosen to exhibit both efficient and inefficient behavior for the two storage formats. In the test programs, 10 million non-zero elements are inserted into the sparse matrix storage either with an insert-operation or an add-operation, depending on whether the elements are all unique or not. The time to perform the insert- or add-operations is measured with the time function in the C language. The structures of the test matrices (for illustration purposes shown as 100 by 100 matrices) are given in figure 2.

The first test inserts non-zero values into every 10th column of the matrix in order of increasing (x, y)-coordinates. Hence the matrix is perfectly regular

0

0

10

10

10

20

20

20

30

30

30

40

40

40

50

50

50

60

60

60

70

70

70

80

80

80

90

90

90

100

100

100

0

10

20

30

40

50

60

70

80

90

0

10

20

30

40 50 nz = 1000

60

70

80

90

0

10

20

30

40

50 60 nz = 910

70

80

90

100

0

10

20

30

40

50 60 nz = 956

70

80

90

100

0

10

20

30

40

50 nz = 497

60

70

80

90

Fig. 2. Structure of test matrices: every 10th element, every 11th element, random and clustered

Efficient Assembly of Sparse Matrices Using Hashing

905

with identical row structures. The matrix elements are all unique, so we need only perform insert-operations. As the measurements show, the CRS format used in PETSc is very efficient for this case, clearly outperforming the hash-based representation. This is to be expected, since in the CRS format all accesses go to consecutively located addresses in memory (stride one), thus making excellent use of the cache memory.

The second test inserts the same elements as in the first case, i.e. a non-zero value in every 10th position, but in reverse order of (x, y)-coordinates, starting from the largest row and column index. The matrix elements are again all unique and only insert-operations are needed. The measurements show that this case performs considerably slower using the CRS format in PETSc, while the hash-based representation is only slightly slower than in the previous case.

The third test inserts non-zero elements in every 11th position in the matrix. Thus, consecutive rows have different structure, since the row length is not divisible by 11. The matrix elements are all unique and we need only perform insert-operations. Surprisingly, the CRS format in PETSc is extremely inefficient for this case, while the performance of the hash-based representation is similar to the two previous cases.

The fourth test inserts 10 million randomly generated non-zero elements into the sparse matrix storage using an add-operation, since the elements are not necessarily all unique. We can see that the CRS format in PETSc is very inefficient for this case, while the hash-based representation shows only a small performance penalty.

In the fifth test we first generate 100000 randomly located cluster centers, and then for each cluster we generate 100 random elements located within a square area of 9 by 9 around the cluster center. As in the fourth test, we must use an add-operation, since the matrix elements are not all unique.
An example of the type of sparse matrix encountered in the fusion plasma simulation is illustrated in figure 3.

4.1 Execution Time

The results of the execution time measurements are summarized in table 1. The tests were run on an AMD Athlon XP 1800+ processor with 1 GB of memory. The operating system was Red Hat Enterprise Linux 3.4 with gcc 3.4.4, and the test programs were compiled with the -O3 compiler optimization switch. The measurements show that the compressed row format used in PETSc is very inefficient if the sparse matrix lacks structure, while the hash-based representation is insensitive to the structure of the matrix. In our application, the fusion simulation, the matrix structure was close to the fifth test case, consisting of randomly distributed clusters, each containing about 1 to 100 values. In our application a significant speed-up was indeed achieved using the sparse matrix representation with long hash tables. In addition, the logic for assembling the distributed matrix with contributions from all parallel processes was simplified, thanks to the restructuring of the code that had to be done at the same time.


Fig. 3. Example of a sparse matrix in the fusion plasma simulation

Table 1. Measured execution times in seconds for the five test cases

Matrix structure                    PETSc   Hash
Every 10th element, natural order     2.4    6.9
Every 10th element, reverse order    20.6    7.5
Every 11th element                  377.8    7.1
Random                               86.6   12.5
Clusters                             37.8   12.0

4.2 Memory Usage

The sparse matrix representation based on hash tables uses slightly more memory than the compressed row format. The compressed row format uses 12nz + 4(r + 1) bytes of memory to store a matrix with r rows and nz non-zero entries. The hash-based representation uses 12nz + 16nz/b + 8nz/(ab) bytes, where b is the average number of non-zero elements per hash key and a gives the size of the look-up table as a fraction of the hash table size, as explained in section 3.1. For the sparse matrices used in the test cases presented in section 4, all with 10000 rows and columns filled to 10%, that is, with 10 million non-zero elements, the compressed row format needs about 114 MB of memory, whereas the hash-based representation needs about 122 MB.

5

Conclusions

We have presented an efficient solution to the problem of assembling unstructured sparse matrices, where most of the non-zero elements have to be computed by adding together a number of contributions. The hash-based representation


has been shown to be efficient for the particular operation of assembling and updating elements of sparse matrices, for which the compressed row storage format is inefficient. The reason for this inefficiency is that access to individual elements is slow, particularly if the accesses do not exhibit any spatial locality but are randomly distributed. For these types of applications, a representation using a hash table provides a good solution that is insensitive to the spatial locality of the elements. The hash-based representation is not intended to be used as a general-purpose storage format for sparse matrices, for instance in matrix computations. It is useful in applications where unstructured sparse matrices have to be assembled, as described in this paper, before they can be used for computations. Our experience shows that the performance of programs of this kind can be significantly improved by using two separate sparse matrix representations: one during matrix element assembly and another for matrix operations, e.g. inversion. The trade-off in memory consumption is outweighed by the increased performance when assembling the matrix elements using long hash tables.

References

1. Dongarra, J.: Sparse matrix storage formats. In: Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., van der Vorst, H. (eds.) Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM (2000), http://www.cs.utk.edu/~dongarra/etemplates/node372.html
2. Montagne, E., Ekambaram, A.: An optimal storage format for sparse matrices. Information Processing Letters 90, 87–92 (2004)
3. Saad, Y.: SPARSKIT: a basic tool kit for sparse matrix computations, version 2. University of Minnesota, Department of Computer Science and Engineering, http://www-users.cs.umn.edu/~saad/software/SPARSKIT/sparskit.html
4. Heikkinen, J., Henriksson, S., Janhunen, S., Kiviniemi, T., Ogando, F.: Gyrokinetic simulation of particle and heat transport in the presence of wide orbits and strong profile variations in the edge plasma. Contributions to Plasma Physics 46(7–9), 490–495 (2006)
5. Balay, S., Buschelman, K., Eijkhout, V., Gropp, W.D., Kaushik, D., Knepley, M.G., McInnes, L.C., Smith, B.F., Zhang, H.: PETSc users manual. Technical Report ANL-95/11 – Revision 2.1.5, Argonne National Laboratory (2004)
6. Goodrich, M.T., Tamassia, R.: Algorithm Design: Foundations, Analysis, and Internet Examples. John Wiley & Sons, Chichester (2001)