David and Goliath: computing the rank of sparse matrices

Jean-Guillaume Dumas
Laboratoire Informatique et Distribution, B.P. 53 X, 100, rue des Mathématiques, 38041 Grenoble Cedex, France.
e-mail: [email protected]

September 20, 1999

Abstract
We want to achieve efficient exact computations, such as the rank, of sparse matrices over finite fields. We therefore compare the practical behaviors of the deterministic Gaussian elimination technique, even with reordering heuristics, with the probabilistic, matrix-free, Wiedemann algorithm.
Key words: Large sparse matrix, exact computation, Gaussian elimination, minimum degree ordering, iterative technique, BlackBox, recurring sequences, Wiedemann algorithm.
1 Introduction
Many applications of computer algebra require the most effective algorithms for the computation of normal forms of matrices. An essential linear algebra part of these is the computation of the rank of large sparse matrices. We first recall some Gaussian elimination strategies (§2) and propose a new reordering heuristic. Some experimental results with these are presented in §3. Then we produce a fast preconditioning for the Wiedemann iterative method in order to compute the rank of a matrix via a minimum polynomial computation (§4). We will then compare both methods using a discrete logarithm based arithmetic (§5). In §3 and §5 we will also present some results concerning matrices arising in the computation of Gröbner bases for image compression or robotics (GB project [Fau94]) and in the determination of homology groups of simplicial complexes [BLVZ94].
2 Gaussian elimination
2.1 Straightforward algorithm
Consider Gaussian elimination for sparse matrices over a field. In all following implementations we use the classical compressed row format for the matrix storage. The algorithm is then straightforward:
Algorithm: Gauss
Input:  a sparse matrix A in F^{m x n}.
Output: the rank of A over F.

set rank = 0
For k from 1 to m
    if row_k is not empty
        increment rank
        set pivot = row_k[0].value()
        set index = row_k[0].index()
        For each row having a non-zero element in position index do
            permutation of elements k and index
            elimination of this row with row_k
return rank
Theorem 2.1. Let A be a sparse matrix in F^{m x n} of rank r with ω non-zero elements per row. Algorithm 2.1 requires

    2 Σ_{k=0}^{r} (m−k) · min(ω·2^k, n−k)  ≤  (r+1) · (mn + r(2r−3n−3m)/6)

field operations in the worst case.
Proof. We consider that, in the worst case, each elimination step doubles the number of non-zero elements per remaining row, up to the full row length; at step k, each of the at most m−k remaining rows then costs one multiplication and one addition per such element.
We see that the amount of fill-in can be huge. When the field is sufficiently big (more than 100 elements, say), no significant number of zeroes can then result from an x += y·z ("gaxpy") operation. Moreover we will see in section 3 that this bound is realistic when ω is greater than 3. Indeed, in that case the matrix is quickly filled.
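For concreteness, here is a minimal C++ sketch of the straightforward algorithm above, written for this note rather than extracted from our Givaro-based implementation: rows are kept as ordered (column, value) maps, the pivot is simply the leftmost entry of the current row, and the field is Z/pZ for a prime p passed as a parameter (all names are ours).

    // Minimal sketch: rank of a sparse matrix over Z/pZ by row elimination.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    typedef std::map<int, int64_t> SparseRow;   // column index -> non-zero value

    // modular inverse by Fermat's little theorem (p prime, a not divisible by p)
    int64_t inv_mod(int64_t a, int64_t p) {
        int64_t r = 1, e = p - 2;
        for (a %= p; e; e >>= 1, a = a * a % p)
            if (e & 1) r = r * a % p;
        return r;
    }

    int sparse_rank(std::vector<SparseRow> rows, int64_t p) {
        int rank = 0;
        for (size_t k = 0; k < rows.size(); ++k) {
            if (rows[k].empty()) continue;          // empty rows contribute nothing
            ++rank;
            int piv_col = rows[k].begin()->first;   // leftmost non-zero of row k
            int64_t piv_inv = inv_mod(rows[k].begin()->second, p);
            for (size_t i = k + 1; i < rows.size(); ++i) {
                SparseRow::iterator it = rows[i].find(piv_col);
                if (it == rows[i].end()) continue;  // no entry in the pivot column
                int64_t c = it->second * piv_inv % p;
                // "gaxpy" elimination: row_i <- row_i - c * row_k, entry by entry
                for (SparseRow::const_iterator jt = rows[k].begin(); jt != rows[k].end(); ++jt) {
                    int64_t v = ((rows[i][jt->first] - c * jt->second) % p + p) % p;
                    if (v) rows[i][jt->first] = v;
                    else   rows[i].erase(jt->first);   // a zero entry is removed from storage
                }
            }
        }
        return rank;
    }

    int main() {
        std::vector<SparseRow> A(3);                // 3 x 3 example over GF(65521)
        A[0][0] = 1; A[0][2] = 2;
        A[1][0] = 3; A[1][2] = 6;                   // row 1 = 3 * row 0
        A[2][1] = 5;
        std::cout << sparse_rank(A, 65521) << std::endl;   // prints 2
    }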
2.2 Reordering techniques
The classical methods for reducing fill-in in Gaussian elimination are graph-theoretic. There, considering a matrix as an adjacency matrix of a graph G, eliminating a variable in a system of equations corresponds to eliminating the corresponding vertex in G. Very informally, to eliminate a vertex v from G, first add edges to interconnect every two neighbors of v and then remove all edges incident to v (see [BS90] for more details). The trouble is that, for an arbitrary graph, the problem of minimizing the fill-in is NP-complete [Yan81]. Therefore we use heuristics.

In [LO91], a "Structured Gaussian Elimination" is proposed to solve large sparse linear systems arising in integer factorization. Some more general methods are discussed in [Zla92] where, in particular, methods using the Markowitz cost-function are proposed. Another commonly used heuristic is the so-called "minimum degree ordering", which at step k eliminates a vertex of current minimum degree.

We implement a simpler version of this (using a simplified Markowitz cost-function) better suited to our storage format. Throughout the algorithm we maintain a vector of the number of elements per column. We first choose a row having a minimum number of elements, in order to produce the smallest fill-in per row. Then, within this row, we choose as pivot an element whose column has a minimum number of elements, so that the number of rows filled at the next step is minimized. A sketch of this pivot selection is given below; we present experiments with the heuristic in the next section.
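A possible C++ rendering of this pivot choice is the following; it manipulates only the sparsity pattern, the data layout and names are ours, and the per-column counts must of course be refreshed after every elimination step.

    // Sketch of the pivot selection of the reordering heuristic: among the
    // remaining rows, pick one of minimal length, then inside that row pick the
    // entry whose column currently holds the fewest non-zero elements.
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Row i is the list of column indices of its non-zero entries (values omitted here).
    typedef std::vector<std::vector<int> > Pattern;

    // Returns (row, column) of the chosen pivot, or (-1, -1) if all rows are done or empty.
    std::pair<int, int> choose_pivot(const Pattern& rows,
                                     const std::vector<int>& col_count,
                                     const std::vector<bool>& done) {
        int best_row = -1;
        for (size_t i = 0; i < rows.size(); ++i)          // shortest remaining row
            if (!done[i] && !rows[i].empty() &&
                (best_row < 0 || rows[i].size() < rows[best_row].size()))
                best_row = static_cast<int>(i);
        if (best_row < 0) return std::make_pair(-1, -1);
        int best_col = rows[best_row][0];
        for (size_t j = 1; j < rows[best_row].size(); ++j) // sparsest column in that row
            if (col_count[rows[best_row][j]] < col_count[best_col])
                best_col = rows[best_row][j];
        return std::make_pair(best_row, best_col);
    }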
3 Reordering experiments

3.1 Experimental platform
Our experiments were realized on a Sun Microsystems Ultra Enterprise 450 with a 250MHz UltraSPARC-II processor and 512MB of memory. We computed ranks of matrices over finite fields GF(q) where q has half word size. The chosen arithmetic uses discrete logarithms with precomputed tables, as in [SMJ90]. The algorithms were implemented in C++ with the Givaro library, a C++ computer algebra library developed within the APACHE and DESIR projects [Gau96].
We first produce the performances obtained with this machine in Table 1: as the base operation in Gaussian elimination is the "gaxpy" operation, we compared the number of field operations realized per second by our implementation on a simple large dense dot product with classical C++ basic types and using the C++ remainder operator. Those measures were made with half-word size primes. "Mops" here is the number of additions plus the number of multiplications per second.

    float dot    long dot    long % dot    Galois Field dot
    81.25        23.08       6.41          13.04

    Table 1: Millions of field (65521 elements) OPerations per Second.

A dot product of floating point numbers can be computed at a speed of 81.25 Mops. One with machine integers can be computed at a speed of 23.08 Mops. One with machine integers and only one modulus per operation ( x = (x + a*b) % p ) can only reach a speed of 6.41 Mops, whereas our implementation reaches 13.04 Mops for half-word size primes. Furthermore, the arithmetic is even better with very small fields, when cache effects occur; anyway, these results show improvements even without those effects. We foremost remark that those machines are not very well suited for exact arithmetic, since there is a huge difference with floating point. It is unfortunately not possible to implement our Galois Field arithmetic over floats, since the discrete logarithm arithmetic needs lookup tables and some slow conversions are unavoidable. However, we have a good implementation of the finite fields, since its performance is close to twice that of the classical implementation of the integers modulo a prime.
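To illustrate the kind of table-based arithmetic we use, here is a simplified C++ sketch (written for this note; the actual Givaro implementation is more refined): multiplication goes through precomputed logarithm and exponential tables for a generator of GF(p)*, while addition is performed directly on the residues.

    // Simplified sketch of GF(p) arithmetic with precomputed discrete-logarithm
    // tables: a*b is two table lookups, one integer addition and one lookup back.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct ZpzLog {
        int64_t p;                       // prime modulus
        std::vector<int32_t> log_t;      // log_t[x] = i such that g^i = x (x != 0)
        std::vector<int32_t> exp_t;      // exp_t[i] = g^i mod p, for 0 <= i < 2(p-1)

        explicit ZpzLog(int64_t prime) : p(prime), log_t(prime), exp_t(2 * (prime - 1)) {
            int64_t g = 2;
            while (!is_generator(g)) ++g;            // naive search for a generator
            int64_t x = 1;
            for (int64_t i = 0; i < p - 1; ++i) {
                exp_t[i] = exp_t[i + p - 1] = static_cast<int32_t>(x);
                log_t[x] = static_cast<int32_t>(i);
                x = x * g % p;
            }
        }
        bool is_generator(int64_t g) const {         // order of g must be exactly p-1
            int64_t x = 1;
            for (int64_t i = 1; i < p - 1; ++i) { x = x * g % p; if (x == 1) return false; }
            return true;
        }
        int64_t mul(int64_t a, int64_t b) const {    // no modular reduction needed:
            if (a == 0 || b == 0) return 0;          // the exp table is doubled instead
            return exp_t[log_t[a] + log_t[b]];
        }
        int64_t axpy(int64_t y, int64_t a, int64_t x) const {  // y + a*x, the "gaxpy" kernel
            return (y + mul(a, x)) % p;
        }
    };

    int main() {
        ZpzLog F(65521);
        std::cout << F.axpy(3, 12345, 54321) << std::endl;   // 3 + 12345*54321 mod 65521
    }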
3.2 Matrices
We call "Gauss" the algorithm of section 2.1 and "Pivot" the elimination with the reordering heuristic. In this section we present the ratios of time, number of operations, and number of operations per second ("Mops") between the classical algorithm and the reordering heuristic, on three matrix classes: random square matrices of size from 1000x1000 to 5000x5000 with 1 to 25 non-zero elements on average per row, homology matrices of size up to 141120x58800 with 1 to 7 non-zero elements per row, and matrices from Gröbner bases of size up to 4562x5761 with 10% of non-zero elements.

Figure 1: Reordering to non-reordering ratios (time, operations, and Mops; with / without reordering) for random matrices, as a function of the average number of non-zero elements per row.

On very sparse random matrices (with no more than two elements per row) the cost of the Gaussian elimination is dominated by permutations and not by field operations, so, as we can see in Figure 1, the heuristic causes a small slowdown even though the number of operations decreases. On the other hand we observe an extraordinary speed-up as soon as the average number of elements per row is greater than 2.5, as the fill-in is then drastically reduced. This speed-up evidently begins to diminish slowly as the matrices get denser, and we even observe better performances in Mops for the heuristic, as the fill-in requires more accesses to the memory. However, when matrices are much denser, this amazing characteristic seems to reverse to the expected overhead for the heuristic. We observe in Table 2 that even for the much denser Gröbner matrices we can have substantial speed-ups (up to a 43% time reduction!) or no real loss.
    Matrix           Elements    Size          Rank    Time    Operations    Mops
    robot24c1 mat5   15118       404 x 302     262     0.65    0.49          1.32
    rkat7 mat5       38114       694 x 738     611     0.57    0.30          1.9
    f855 mat9        171214      2456 x 2511   2331    1.08    0.80          1.35
    c8 mat11         2462970     4562 x 5761   3903    0.79    0.62          1.27

    Table 2: Reordering to non-reordering ratios for some Gröbner matrices.
The homology matrices are already quasi-triangular, so the normal fill-in is already low. However we observe in Figure 2 a reduction of the number of field operations on every matrix. Moreover, even with a predominance of data structure manipulation, timings are reduced or very close.

Figure 2: Reordering to non-reordering ratios for homology matrices: time, operations, and Mops ratios (with / without reordering) as functions of the number of non-zero elements per row.
3.3 Reordering experiments conclusion
We validate our heuristic by exhibiting reduced fill-in within Gaussian elimination. Timings are consequently reduced in most of the cases and are only slightly worse otherwise. We were even able to compute the rank of two matrices from homology with this method where the classical algorithm suffered from thrashing, because 512 MB of memory is not sufficient to store the intermediate matrices.
4 Diagonal scaling
We present here a method for computing the rank of a matrix using only matrix-vector products. When a matrix is only used this way we call it a "BlackBox". Therefore an algorithm viewing matrices as BlackBoxes will not suffer from any fill-in. We first recall that we have access to the rank of a square matrix via its minimum polynomial when this matrix is well preconditioned [KS91]. We actually use a faster and more generic randomization [EK97]:

Theorem 4.1. Let S be a finite subset of a field F that does not include 0. Let A ∈ F^{m x n} have rank r. Let D1 ∈ S^{n x n} and D2 ∈ S^{m x m} be two random diagonal matrices. Then degree(minpoly(D1 A^t D2 A D1)) = r + 1, with probability at least 1 − 11n²/(2|S|).

Proof. See [EK97]. Informally this follows since, first, for a square matrix A, DA will have its characteristic polynomial equal to a power of X times its minimum polynomial [Sau98]. Then A^t·A will not have zero blocks in its Jordan form, so no power of X can occur in its minimum polynomial. And then A^t·D·A will not have self-orthogonality. To conclude, the probability follows from the Schwartz-Zippel lemma [Zip93].
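Concretely, writing Ã = D1 A^t D2 A D1 for the preconditioned n x n matrix, the rank can be read off its minimal polynomial as follows (our own notation, assuming the preconditioning succeeded):

    \[
      \operatorname{minpoly}_{\tilde A}(X) = X\, g(X), \qquad g(0) \neq 0, \qquad \deg g = r \qquad (\text{case } r < n),
    \]
    \[
      \text{so that} \quad r = \deg\bigl(\operatorname{minpoly}_{\tilde A}\bigr) - \operatorname{valuation}\bigl(\operatorname{minpoly}_{\tilde A}\bigr),
    \]

which is exactly the quantity returned at the last step of the algorithm below; when r = n the valuation is 0 and the same formula still yields r.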
Then we use Wiedemann's algorithm [Wie86] to compute the minimum polynomial and the rank. It is a shift-register synthesis [Mas69] using some projection of the powers of the matrix as a sequence:
Algorithm: Diagonal-scaled Wiedemann
Input:  a sparse matrix A in F^{m x n} of rank r.
Output: the rank of A over F, with probability at least 1 − 11n²/(2(|F| − n)).

Select random diagonal matrices D1 ∈ F^{m x m} and D2 ∈ F^{n x n} as stated above.
(1)   [BlackBox initialization with explicit multiplication by the diagonals]
      Compute B = D1 A D2, the matrix whose entries are D1_ii A_ij D2_jj.
(1')  [BlackBox initialization otherwise]
      Form the true BlackBox composition C = D2 A^t D1 A D2.
(2)   [Wiedemann sequence initialization]
      Set u ∈ F^n a random vector; set S_0 = u^t·u.
(3)   [Berlekamp/Massey initialization]
      Set b = 1; x = 1; L = 0; Λ = 1; Φ = 1.
For k from 0 to 2·min(m, n):
  (k1)  [Berlekamp/Massey shift register]
        d = S_k + Σ_{i=1}^{L} Λ_i S_{k−i}
        if iszero(d) or (2L > k)
            if 2L > k:  Λ = Λ − d·b^{−1}·X^x·Φ
            ++x
        else
            Ψ = Λ − d·b^{−1}·X^x·Φ;  Φ = Λ;  Λ = Ψ
            L = k + 1 − L;  b = d;  x = 1
  (k2)  [If early termination, break]
  (k3)  [Next coefficient, non-symmetric case]
        if k even:  v = B·u;    S_{k+1} = v^t·v
        else:       u = B^t·v;  S_{k+1} = u^t·u
  (k3') [Next coefficient, symmetric case]
        if k even:  v = C·u;    S_{k+1} = u^t·v
        else:       u = v;      S_{k+1} = u^t·u
Return degree(Λ) − valuation(Λ).
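The composition of step (1') never forms the n x n product matrix. A minimal matrix-free C++ sketch of this BlackBox follows (our own simplified storage and naming, not the interface of our actual implementation): applying C to a vector costs one product by A, one by A^t and three diagonal scalings.

    // Matrix-free application of C = D2 * A^t * D1 * A * D2 to a vector over Z/pZ,
    // using only sparse matrix-vector products with A and A^t and diagonal scalings.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Entry { int col; int64_t val; };
    typedef std::vector<std::vector<Entry> > SparseMatrix;   // m rows of (col, val) pairs

    // v <- D v, where the diagonal D is stored as the vector d of its entries
    static void scale(std::vector<int64_t>& v, const std::vector<int64_t>& d, int64_t p) {
        for (size_t i = 0; i < v.size(); ++i) v[i] = v[i] * d[i] % p;
    }

    // y = A * x  (A is m x n, x has length n, y has length m)
    static std::vector<int64_t> apply(const SparseMatrix& A, const std::vector<int64_t>& x, int64_t p) {
        std::vector<int64_t> y(A.size(), 0);
        for (size_t i = 0; i < A.size(); ++i)
            for (size_t k = 0; k < A[i].size(); ++k)
                y[i] = (y[i] + A[i][k].val * x[A[i][k].col]) % p;
        return y;
    }

    // y = A^t * x  (x has length m, y has length n)
    static std::vector<int64_t> apply_transpose(const SparseMatrix& A, int n,
                                                const std::vector<int64_t>& x, int64_t p) {
        std::vector<int64_t> y(n, 0);
        for (size_t i = 0; i < A.size(); ++i)
            for (size_t k = 0; k < A[i].size(); ++k)
                y[A[i][k].col] = (y[A[i][k].col] + A[i][k].val * x[i]) % p;
        return y;
    }

    // u |-> D2 A^t D1 A D2 u, with D1 and D2 stored as vectors of diagonal entries.
    std::vector<int64_t> blackbox_apply(const SparseMatrix& A, int n,
                                        const std::vector<int64_t>& D1,
                                        const std::vector<int64_t>& D2,
                                        std::vector<int64_t> u, int64_t p) {
        scale(u, D2, p);                                         // u <- D2 u
        std::vector<int64_t> v = apply(A, u, p);                 // v <- A D2 u
        scale(v, D1, p);                                         // v <- D1 A D2 u
        std::vector<int64_t> w = apply_transpose(A, n, v, p);    // w <- A^t D1 A D2 u
        scale(w, D2, p);                                         // w <- D2 A^t D1 A D2 u
        return w;
    }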
Remark 4.2. When the matrix has rank r < min(n, m) this algorithm will produce the minimal polynomial after only 2r steps. Therefore one can heuristically stop it when Λ remains the same after a certain number of steps. The number of matrix-vector computations can therefore be drastically reduced in some cases.

Remark 4.3. We present here both the generic algorithm, when explicit multiplication by a diagonal is possible, and one with a true BlackBox. The cost of a matrix-vector product is reduced in the former case, but the size of the set of random choices is divided by 2, since it is equivalent to selecting a random diagonal matrix in the set of squares of F.

Theorem 4.4. Let A be a matrix in F^{m x n} of rank r. Algorithm 4 is correct and requires 2r matrix-vector products and an additional 8r² + 2r(n + m) field operations.

Proof. For correctness of the minimum polynomial see [Mas69, Wie86]. For correctness of the rank use Theorem 4.1. Now for the complexity: the loop ends when the degree of the polynomial reaches r (i.e. when k = 2r) according to Remark 4.2. Loop iteration k has a cost of k "gaxpy" operations for the discrepancy computation, another k for the polynomial update, one matrix-vector product, and an n- or m-dotproduct.

In the case where A is sparse with Ω non-zero elements (and ω = Ω/n is the average number of non-zero elements per row), the total cost of this algorithm is 4Ωr + 8r² + 4rn = O(ωrn). So asymptotically this algorithm is better than Gaussian elimination, which has a worst case arithmetic complexity in O(rn²). However, the constant factor of 4ω + 12 is not negligible when one considers that Gaussian elimination has a low cost in its first steps; this cost then grows up to rn² according to the fill-in.

Remark 4.5. For the symmetric case there are only n-dotproducts. Therefore if n > m it is better to form this other BlackBox: B = D1 A D2 A^t D1.

Remark 4.6. At least for rank computations, this algorithm requires fewer field operations than the similar Lanczos and Conjugate gradient algorithms [GL96, EK97]. In those, the polynomial operations of degree k in the Berlekamp/Massey part of Wiedemann's algorithm are replaced with n-vector operations. Therefore the additional cost of these operations is 2 Σ_{k=0}^{2r} n ≈ 4rn in the Lanczos and Conjugate gradient algorithms, whereas it is only 2 Σ_{k=0}^{2r} k ≈ 4r² in Wiedemann's algorithm.
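For completeness, the shift-register synthesis of step (k1) can also be written in a few lines. The following C++ sketch (ours, simplified, over Z/pZ with p prime) processes a scalar sequence and returns the connection polynomial, whose degree-L reversal is the minimal polynomial of the sequence:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // modular inverse by Fermat's little theorem (p prime, a != 0 mod p)
    static int64_t inv_mod(int64_t a, int64_t p) {
        int64_t r = 1, e = p - 2;
        for (a %= p; e; e >>= 1, a = a * a % p)
            if (e & 1) r = r * a % p;
        return r;
    }

    // Berlekamp/Massey over Z/pZ. Coefficients are stored lowest degree first,
    // Lambda[0] == 1 throughout; L is the length of the synthesized shift register.
    std::vector<int64_t> berlekamp_massey(const std::vector<int64_t>& S, int64_t p) {
        std::vector<int64_t> Lambda(1, 1), B(1, 1);
        int64_t b = 1;          // last non-zero discrepancy
        int L = 0, x = 1;       // register length and shift since last length change
        for (size_t k = 0; k < S.size(); ++k) {
            int64_t d = S[k] % p;                          // discrepancy
            for (size_t i = 1; i < Lambda.size() && (int)i <= L; ++i)
                d = (d + Lambda[i] * S[k - i]) % p;
            if (d == 0) { ++x; continue; }
            std::vector<int64_t> T = Lambda;               // keep the old polynomial
            int64_t c = d * inv_mod(b, p) % p;             // d / b
            if (Lambda.size() < B.size() + x) Lambda.resize(B.size() + x, 0);
            for (size_t i = 0; i < B.size(); ++i)          // Lambda -= (d/b) * X^x * B
                Lambda[i + x] = ((Lambda[i + x] - c * B[i]) % p + p) % p;
            if (2 * L <= (int)k) { L = (int)k + 1 - L; B = T; b = d; x = 1; }
            else ++x;
        }
        Lambda.resize(L + 1, 0);    // degree of the connection polynomial is at most L
        return Lambda;
    }

In the rank algorithm this update is interleaved with the computation of the next sequence element, so that the iteration can be stopped early as in Remark 4.2.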
5 David versus Goliath
In this section we report on some experiments comparing the behavior of Wiedemann's algorithm and of the Gauss algorithm.
5.1 Arithmetic
We first compare the respective implementations of our algorithms. We therefore report the performance obtained for all the matrices of section 3 and compare it in Table 3 to the simple dense dot product.

               Galois Field dot    Gauss    Pivot    Wiedemann
    Minimum    13.04               0        0        7.25
    Average    13.04               1.82     1.38     8.67
    Maximum    13.04               4.57     4.26     12.12

    Table 3: Millions of field (65521 elements) OPerations per Second.

Then, comparing the number of operations per second reached by our two algorithms, we recall that sparse elimination requires many data structure manipulations and that reordering requires a few more of them. The next thing to remark is that we achieve better performances, measured in terms of "Mops", with BlackBox methods than with elimination; we were able to run at 12.12 Mops in the best case. Indeed, in Wiedemann's algorithm there are sparse matrix-vector operations, which need structural manipulations, but also a fairly important proportion of dense dot products and polynomial multiplications.
Remark 5.1. In the following experiments we will present results for matrices of size greater than half the size of the field. The conditions of Theorem 4.4 then no longer hold for the probability of success of Wiedemann's algorithm. However it seems clear that the probability estimates of Theorem 4.1 are too pessimistic for most of the matrices. Indeed, we check our results with the deterministic Gauss algorithm whenever it is possible. It appears that with a field of size 65521, Wiedemann's algorithm sometimes failed, but only on a few very special matrices (namely the higher dimensional boundary matrices of matching and chessboard complexes [BLVZ94]; those are matrices in Z^{n_d x n}, where d is the dimension, with only one element per column). We therefore present timings associated with correct Wiedemann answers whenever the Gaussian elimination result is available. For some other matrices, Gaussian elimination ran out of memory and we do not know the correct answer; some experiments with arbitrary precision integers and larger fields will have to be conducted. We nevertheless present the timings associated with these matrices, with the restriction that the rank might not be correct.
5.2 Matrices
We first produce in Figure 3 the results obtained for random matrices. We see that for very sparse matrices Gaussian elimination is far better since it has nearly nothing to do! Anyway, and even for small matrices, as soon as the fill-in is no longer negligible, Wiedemann's algorithm takes the advantage; that happens when ω is about 5. Moreover, this advantage remains until matrices are nearly dense. Also, for much more triangular matrices, Gaussian elimination has not much to compute and is therefore better than Wiedemann's algorithm. However, we note that for very large matrices Wiedemann's algorithm is able to execute whereas Gaussian elimination fails because of memory thrashing. For these cases, where memory is the limiting factor, even the slightest fill-in can kill elimination.

    Matrix           Elements, size, rank                 Gauss                              Wiedemann
                                                          Time      Operations   Mops       Time       Operations   Mops
    robot24c1 mat5   15118, 404 x 302, 262                0.52      1.10e+06     2.11       1.84       1.68e+07     9.17
    rkat7 mat5       38114, 694 x 738, 611                1.84      4.06e+06     2.20       10.51      9.73e+07     9.26
    f855 mat9        171214, 2456 x 2511, 2331            10.54     1.60e+07     1.51       202.17     1.65e+09     8.14
    c8 mat11         2462970, 4562 x 5761, 3903           671.33    1.70e+09     2.52       4972.99    3.86e+10     7.77
    mk9.b3           3780, 945 x 1260, 875                0.26      3.01e+05     1.14       2.11       2.03e+07     9.62
    mk7-7.b6         35280, 5040 x 35280, 5040            4.67      0            0          119.53     1.17e+09     9.84
    mk7-6.b4         75600, 15120 x 12600, 8989           49.32     1.12e+07     0.23       416.97     3.54e+09     8.50
    mk7-7.b5         211680, 35280 x 52920, 29448         2179.62   5.09e+09     2.33       4283.4     3.36e+10     7.85
    mk8-8.b5         3386880, 564480 x 376320, 279237     Thrashing                         14.9 days

    Table 4: Compared timings of elimination and iterative techniques on homology and Gröbner matrices.
6 Conclusion
We have experimented with the currently known best methods for computing the rank of large sparse matrices. We first validated an efficient heuristic for fill-in reduction in Gaussian elimination. We have then seen that for certain highly structured matrices Gaussian elimination is still the best, since it has relatively little to do! On the other hand we can say that the Wiedemann algorithm is very practical. It has good behavior in general, even for small matrices, and is the only solution for extremely large matrices. We now plan to compare parallel approaches to these two algorithms. In both cases it seems quite hard since the amount of communication is of the same order as the amount of computation (at least before any substantial fill-in for Gaussian elimination). Nonetheless, Coppersmith's block version of the Wiedemann algorithm [Vil97, KL96] seems promising, at least on Symmetric Multi-Processor machines. Moreover, new high-level parallel systems, such as Athapascan-1 [DGRV98, CGR98], are designed to support efficient processing of large irregular problems, and can provide ways of efficiently overlapping communication with computation.
Figure 3: Compared timings of elimination and iterative techniques on random matrices: ratio Elimination / BlackBox as a function of the average number of non-zero elements per row, for matrices of sizes 1Kx1K up to 5Kx5K (one detailed view up to 8 elements per row, one global view up to 35).
7 Acknowledgments
The author is very grateful to Mathias Doreille, Thierry Gautier, Austin Lobo, B. David Saunders and Gilles Villard for their countless pieces of advice and unlimited patience.
References
[BLVZ94] Anders Björner, László Lovász, S. T. Vrećica, and Rade T. Živaljević. Chessboard complexes and matching complexes. Journal of the London Mathematical Society, II. Ser. 49, No. 1, 25-39, 1994.

[BS90] Piotr Berman and Georg Schnitger. On the performance of the minimum degree ordering for Gaussian elimination. SIAM Journal on Matrix Analysis and Applications, 11(1):83-88, January 1990.

[CGR98] Gerson Cavalheiro, François Galilée, and Jean-Louis Roch. Athapascan-1: Parallel programming with asynchronous tasks. In Yale Multithreaded Programming Workshop, Yale, USA, June 1998.

[DGRV98] Jean-Guillaume Dumas, Thierry Gautier, Jean-Louis Roch, and Gilles Villard. Data-flow multithreaded parallelism in computer algebra algorithms. In 1998 IMACS Conference on Applications of Computer Algebra: High Performance Symbolic Computing, August 1998.

[EK97] Wayne Eberly and Erich Kaltofen. On randomized Lanczos algorithms. In Wolfgang W. Küchlin, editor, ISSAC '97: Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, Maui, Hawaii, pages 176-183, New York, NY 10036, USA, July 21-23, 1997. ACM Press.

[Fau94] Jean-Charles Faugère. Parallelization of Gröbner basis. In Hoon Hong, editor, First International Symposium on Parallel Symbolic Computation, PASCO '94, Hagenberg/Linz, Austria, volume 5 of Lecture Notes Series in Computing, pages 124-132, September 26-28, 1994.

[Gau96] Thierry Gautier. Calcul Formel et Parallélisme : Conception du Système GIVARO et Applications au Calcul dans les Extensions Algébriques. PhD thesis, Institut National Polytechnique de Grenoble, 1996. http://wwwapache.imag.fr/software/givaro/.

[GL96] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. The Johns Hopkins University Press, Baltimore, MD, USA, third edition, 1996.

[KL96] Erich Kaltofen and Austin Lobo. Distributed matrix-free solution of large sparse linear systems over finite fields. In A. M. Tentner, editor, Proceedings of High Performance Computing 1996, San Diego, CA, April 1996. Society for Computer Simulation, Simulation Councils, Inc.

[KS91] Erich Kaltofen and B. David Saunders. On Wiedemann's method of solving sparse linear systems. In Harold F. Mattson, Teo Mora, and T. R. N. Rao, editors, Proceedings of Applied Algebra, Algebraic Algorithms and Error-Correcting Codes (AAECC '91), volume 539 of LNCS, pages 29-38, Berlin, Germany, October 1991. Springer.

[LO91] Brian A. LaMacchia and Andrew M. Odlyzko. Solving large sparse linear systems over finite fields. Lecture Notes in Computer Science, 537:109-133, 1991.

[Mas69] James L. Massey. Shift-register synthesis and BCH decoding. IEEE Transactions on Information Theory, IT-15:122-127, 1969.

[Sau98] B. David Saunders. Personal communication, August 1998.

[SMJ90] Ernest Sibert, Harold F. Mattson, and Paul Jackson. Finite field arithmetic using the Connection Machine. In Zippel, editor, Computer Algebra and Parallelism, Proceedings of the Second International Workshop on Parallel Algebraic Computation, pages 51-61, Ithaca, USA, May 1990. LNCS 584, Springer Verlag.

[Vil97] Gilles Villard. Further analysis of Coppersmith's block Wiedemann algorithm for the solution of sparse linear systems. In Wolfgang W. Küchlin, editor, ISSAC '97: Proceedings of the 1997 International Symposium on Symbolic and Algebraic Computation, Maui, Hawaii, pages 32-39, July 21-23, 1997.

[Wie86] Douglas H. Wiedemann. Solving sparse linear equations over finite fields. IEEE Transactions on Information Theory, 32(1):54-62, January 1986.

[Yan81] Mihalis Yannakakis. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic and Discrete Methods, 2(1):77-79, March 1981.

[Zip93] Richard Zippel. Effective Polynomial Computation, chapter Zero Equivalence Testing, pages 189-206. Kluwer Academic Publishers, 1993.

[Zla92] Zahari Zlatev. Computational Methods for General Sparse Matrices, chapter Pivotal Strategies for Gaussian Elimination, pages 67-86. Kluwer Academic Publishers, Norwell, MA, 1992.