LL_1^{-1} Preconditioning for Overdetermined Sparse Least Squares Problems

Gary W. Howell (1) and Marc Baboulin (2)

(1) North Carolina State University, USA, [email protected]
(2) Université Paris-Sud and Inria, France, [email protected]

Abstract. In this paper we are interested in computing the solution of an overdetermined sparse linear least squares problem Ax ≈ b via the normal equations method. For A = LU, the normal equations with L are usually better conditioned than the original normal equations A^T Ax = A^T b. Here we explore a further preconditioning by L_1^{-1}, where L_1 is the n × n upper partition of the m × n matrix L. No additional storage is required. Since the condition number of the iteration matrix can be explicitly computed, we can easily determine whether the iteration will be effective and whether further preconditioning is required. Numerical experiments were performed with the Julia computer language. When the upper triangular matrix U has no near-zero diagonal elements, the algorithm is observed to be reliable, usually requiring less storage than the Cholesky factor of A^T A or the R factor of the QR = A factorization.

Keywords: Sparse linear least squares, iterative methods, preconditioning, conjugate gradient algorithm, lsqr algorithm

1 Introduction

Linear least squares (LLS) problems arise whenever the number of linear equations is not equal to the number of unknown parameters. Here we consider the overdetermined full rank LLS problem

\[ \min_{x \in \mathbb{R}^n} \|Ax - b\|_2, \tag{1} \]

with A ∈ R^{m×n}, m ≥ n, and b ∈ R^m. We have observed that the storage of the LU factorization of a rectangular matrix is often less than that of R from the QR factorization or from the Cholesky factorization of A^T A. In fact, if A is "strong Hall", the storage required by U is always bounded by that of R (George, Gilbert, and Ng, cited on p. 83 of [9]). In iterative conjugate gradient solution of the normal equations A^T Ax = A^T b, the linear rate of convergence is bounded by

\[ K = \frac{\kappa - 1}{\kappa + 1}, \qquad \text{where } \kappa = \mathrm{cond}_2(A) = \sqrt{\mathrm{cond}_2(A^T A)}. \]
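To make the bound concrete (illustrative arithmetic only, not taken from the paper): with κ = 10^3 the factor is K = 999/1001 ≈ 0.998, so reducing the error by a factor of 10^6 can take on the order of ln(10^{-6})/ln(0.998) ≈ 6900 iterations, whereas κ = 10 gives K = 9/11 ≈ 0.818 and roughly 69 iterations suffice. A better-conditioned iteration matrix thus pays off directly in the iteration count, which motivates iterating with L (and, below, with LL_1^{-1}) rather than with A.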

For the case of m equations and n unknowns with m > n, we can perform an LU factorization of an m × n matrix A, obtaining a lower trapezoidal m × n matrix L and an upper triangular n × n matrix U. As L tends to be better conditioned than A, we explored converting the normal equations to

\[ L^T L y = L^T b, \qquad y = U x \]

(equivalent when U is invertible), where U is n × n and L has m rows and n columns [14]. Here we take L from the cholmod factorization LU = PAQ, i.e., where the column pivoting Q avoids fill in L and U and the row pivots P are chosen to get partial pivoting, i.e., to ensure that the largest elements in absolute value in the lower trapezoidal L are the 1's on the diagonal. Since L is usually better conditioned than A, fewer iterations are required for the L iteration. Partition the rectangular matrix L as

\[ L = \begin{bmatrix} L_1 \\ L_2 \end{bmatrix}, \]

where L_1 is square and lower triangular and L_2 is (m − n) × n. Here we explore iterating with the normal equations in terms of

\[ F = L L_1^{-1} = \begin{bmatrix} I_n \\ L_2 L_1^{-1} \end{bmatrix} = \begin{bmatrix} I_n \\ C \end{bmatrix}. \tag{2} \]

Figure 1 compares the condition numbers of A, L, and F for some rectangular matrices small enough that we could explicitly compute the SVD in Matlab, and thus the condition number as the ratio of the largest and smallest singular values. These matrices were created in a similar fashion to those discussed in the next section, but were of size less than five thousand or so. As in Björck and Yuan [12] and Arioli and Duff [5], we can take the normal equations in the form

\[ [\, I_n \mid C^T \,] \begin{bmatrix} I_n \\ C \end{bmatrix} z = [\, I_n \mid C^T \,]\, b, \qquad z = L_1 U x, \]

where L_1 is lower triangular and U is upper triangular, so that given z, forward and backward substitutions can compute x. If we do not explicitly compute C, no additional storage beyond that required for L is needed. The lsqr algorithm requires multiplication both by F and by F^T. We compute

\[ F u = \begin{bmatrix} I_n \\ L_2 L_1^{-1} \end{bmatrix} u = \begin{bmatrix} u \\ L_2 (L_1^{-1} u) \end{bmatrix}, \]

where y = L_1^{-1} u is accomplished by a forward solve L_1 y = u; similarly, a multiplication v^T F entails a backsolve with L_1^T.
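A small Julia sketch of these implicit products (our own illustration, not the paper's code; it assumes the lower trapezoidal L from LU = PAQ is already available as a sparse matrix):

```julia
# Sketch only: apply F = L*inv(L1) and F' without forming C = L2*inv(L1) or inv(L1).
using SparseArrays, LinearAlgebra

function make_F_ops(L::SparseMatrixCSC)
    m, n = size(L)
    L1  = LowerTriangular(L[1:n, 1:n])            # square lower-triangular block: forward solves
    L1t = UpperTriangular(sparse(L[1:n, 1:n]'))   # its transpose: back solves
    L2  = L[n+1:m, :]
    Fmul(u)  = vcat(u, L2 * (L1 \ u))                 # F*u  = [u ; L2*(L1 \ u)]
    Ftmul(v) = v[1:n] + (L1t \ (L2' * v[n+1:end]))    # F'*v = v_1 + L1'^{-1} (L2' * v_2)
    return Fmul, Ftmul
end

# Hypothetical usage inside an lsqr-style iteration:
#   Fmul, Ftmul = make_F_ops(L);  r = Fmul(z);  g = Ftmul(r)
```

Each application of F or F^T thus costs one sparse triangular solve with L_1 plus one sparse matrix-vector product with L_2.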

Fig. 1: Only a few matrices had LL_1^{-1} poorly conditioned. (Plot of the log10 condition numbers of A, L, and L_2 L_1^{-1} for 45 matrices.)

Multiplications by F and F^T take almost the same number of multiplications as multiplication by L, but the back and forward solves are likely to be harder to parallelize than multiplication by L. In our experience (as in Fig. 1), F = LL_1^{-1} is usually better conditioned than L. To see why, consider that

\[ F^T F = [\, I_n \mid C^T \,] \begin{bmatrix} I_n \\ C \end{bmatrix} = I_n + C^T C \]

has singular values bounded below by one. Since the condition number is the ratio of the largest and smallest singular values, the condition number of F is just its largest singular value, computation of which requires only an estimate of the square root of the largest eigenvalue of F^T F. Applying a few iterations of the symmetric power method to F^T F suffices for an estimate good enough to bound the number of required iterations, allowing us to easily decide whether further preconditioning is needed.

One possible way to save storage is to perform the original LU factorization in lower precision (or to use an incomplete factorization of A). The iteration is then with A U^{-1} L_1^{-1}, possibly with a preliminary cheaper iteration with [I_n ; L_2 L_1^{-1}] = LL_1^{-1}. That iteration can be compared to the RIF preconditioner [7] (a sparse Cholesky-like factorization of A^T A). These approaches have been explored by several authors, including Peters and Wilkinson [10], Björck and Duff [11], Björck and Yuan [12], and more recently Arioli and Duff [5], and are worth revisiting because of the recent progress in sparse LU factorization. For example, both Matlab and Octave use fast sparse LU factorizations built on Davis's CSparse package [13]. Also, direct sparse solvers (see, e.g., [15–19]) offer scalable sparse LU factorizations for large problems.
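A small sketch of the condition estimate described above (our illustration, not the paper's code; the function name is hypothetical and the implicit products assume the same partition L = [L_1; L_2] as before):

```julia
# Estimate cond2(F) = sqrt(λmax(F'F)) with a few power iterations,
# applying F'F*u = u + C'*(C*u) implicitly: C*u = L2*(L1 \ u), C'*w = L1' \ (L2'*w).
# Since σmin(F) >= 1, the largest singular value of F is its condition number.
using SparseArrays, LinearAlgebra

function estimate_condF(L::SparseMatrixCSC; iters::Int = 10)
    m, n = size(L)
    L1  = LowerTriangular(L[1:n, 1:n])
    L1t = UpperTriangular(sparse(L[1:n, 1:n]'))
    L2  = L[n+1:m, :]
    FtF(u) = u + (L1t \ (L2' * (L2 * (L1 \ u))))   # (I + C'C) * u
    u = normalize(randn(n))
    λ = 1.0
    for _ in 1:iters                  # plain power iteration on the SPD matrix F'F
        w = FtF(u)
        λ = dot(u, w)                 # Rayleigh quotient estimate of λmax
        u = normalize(w)
    end
    return sqrt(λ)                    # ≈ cond2(F); compare against a threshold such as 400
end
```

A handful of iterations is enough here, since only a rough upper bound on the iteration count is needed, not an accurate eigenvalue.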

2 A Test Set of Matrices, Using the Julia Language

Numerical experiments here are with the F = LL_1^{-1} = [I_n ; C] variant of lsqr. Neither C nor L_1^{-1} is explicitly computed. The LU = PAQ factorization is performed in full precision and with both column and row permutations. We refer to this algorithm as lsqrLinvL.

For a set of test problems we took matrices from the Davis collection, using 235 matrices of up to around thirty thousand rows and columns. Matrices with more columns than rows were transposed. Square matrices were augmented with one hundred additional rows (or, for square matrices with fewer than one thousand rows, the number of rows was increased by ten per cent). The additional rows were perturbed copies of randomly selected rows of the original square matrix. The randomly perturbed entries were the original entries times a factor in the range [0.9, 1.1], i.e., a_ij,new = (1 + 0.1 τ_ij) a_ij, where τ_ij was randomly selected from a uniform distribution on [-1, 1]. A sketch of this row augmentation is given at the end of this section. For the entire list of Davis collection matrices, see the file 7feb2017 at. The plots and numeric results here are from loading the 7feb2017 file as a csv (comma separated) data frame.

We performed the numeric experiments with the language julia. julia requires fairly minimal changes from Matlab or octave codes. Wherever there are loops, julia runs much faster. For example, writing a sparse matrix multiplication Ax in C and julia, the C version was only slightly faster, with octave much slower. Converting code from octave to julia is significantly easier than converting to C or Fortran. Some of the issues we encountered: in some instances, we had to be aware that the statement A = B for julia arrays makes only a "shallow" copy, i.e., no new copy is produced, so changes in A also change B. Though Matlab, octave and julia all use sparse matrix LU and QR factorizations based on Davis's SuiteSparse, the julia lufact and qrfact functions do not offer user options. The julia developers were kind enough to remove the error thrown in LU factorization of non-square A. To get an LU factorization that did not scale rows, we had to trace the lufact function and find which line to change. In octave we used a QR factorization QR = [I_n ; C_drop]. Alas, the sparse qrfact function in julia currently returns a least squares solution but does not offer options to return R or Q. In exact arithmetic, R^T R = R^T Q^T Q R = A^T A, so in this sense storage for R from Cholesky is also storage for R from QR. We could use this equivalence as a work-around for the julia qrfact, so long as A was not so poorly conditioned that the Cholesky factorization failed.
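A sketch of the row augmentation used to build the overdetermined test matrices (our own illustration of the construction described above; the function name and exact wiring of the 100-row / ten-per-cent rule are assumptions):

```julia
# Augment a square sparse matrix with extra rows that are randomly perturbed
# copies of existing rows: a_ij_new = (1 + 0.1*τ_ij)*a_ij, τ_ij uniform on [-1,1].
using SparseArrays

function augment_square(A::SparseMatrixCSC)
    m, n = size(A)
    @assert m == n "intended for square matrices"
    k = m < 1000 ? ceil(Int, 0.10 * m) : 100   # 10% extra rows for small matrices, else 100
    extra = A[rand(1:m, k), :]                 # sparse copies of randomly selected rows
    rows, cols, vals = findnz(extra)
    vals .*= 1 .+ 0.1 .* (2 .* rand(length(vals)) .- 1)   # perturb nonzeros by up to ±10%
    return vcat(A, sparse(rows, cols, vals, k, n))         # (m+k) x n overdetermined matrix
end
```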

3 Numerical experiments on the test set of 235 matrices

We performed julia lufact LU = PAQ decompositions for each of the matrices. LL_1^{-1} tends to be acceptably well conditioned and, as noted above, its condition number is easily computed. 192 of the 235 matrices had cond_2(LL_1^{-1}) < 400. Of the 43 more poorly conditioned matrices, 29 were among the 47 matrices with more than one hundred rows in C, indicating that LL_1^{-1} preconditioning is particularly appropriate for nearly square matrices.

Fig. 2: Only a few matrices had LL_1^{-1} poorly conditioned. (Plot of log10(cond(LL_1^{-1})) over the 233 sparse matrices; 192 had cond(LL_1^{-1}) < 400.)

After lsqrLinvL has converged, a final solve with U is needed. Alas, U can be singular or poorly conditioned. Consider the size of the diagonal entries normalized by column norm: 30 of the 235 matrices had values u_ii/||u_i||_2 < 1.e-10, so for these matrices neither L nor LL_1^{-1} preconditioning was judged to be feasible. As we would expect from the relative condition numbers of LL_1^{-1}, L, and A, lsqr with LL_1^{-1} converged relatively fast.
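A minimal sketch of this diagonal check (an assumed helper, not from the paper's code):

```julia
# Flag matrices whose U has a near-zero diagonal relative to its column norm,
# u_ii / ||u_i||_2 < 1e-10, in which case the final solve U x = y is not trusted.
using SparseArrays, LinearAlgebra

function u_diag_ok(U::SparseMatrixCSC; tol = 1e-10)
    n = size(U, 2)
    return all(abs(U[j, j]) > tol * norm(U[:, j]) for j in 1:n)
end
```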

Fig. 3: Quite a few matrices had upper triangular U with near zeros on the diagonal. (Plot of log10(min_i |u_ii|) of the diagonal of U from PAQ = LU; 26 matrices with consecutive zeros on the diagonal of U were dropped.)

Fig. 4: Only 83 matrices are plotted because these are the ones for which lsqr with L converged in n iterations. (Plot of log10 of iterations to converge for lsqr on A, L, and LL_1^{-1}; the 156 of 235 matrices for which LU-preconditioned lsqr did not converge were discarded.)

4 Numeric experiments with a hybrid method

For each of the 235 matrices, we computed LU = PAQ. For 205 of the matrices, U had large enough diagonal entries to attempt L and LL_1^{-1} preconditioning. Let K = cond_2(LL_1^{-1}).

– For K < 400, we used LL_1^{-1} lsqr (173 of 205 matrices).
– For 400 < K < 10^8, we explicitly compute C and apply a drop tolerance. Denote the maximal entry of the i-th column of C as c_i, and obtain C_drop by zeroing elements with |c_ij| < |c_i|/K^{1/4}. Compute the Cholesky factor R satisfying

  \[ R^T R = [\, I_n \mid C_{drop}^T \,] \begin{bmatrix} I_n \\ C_{drop} \end{bmatrix}, \]

  then use lsqr iteration with LL_1^{-1} R^{-1} (28 of 205 matrices).
– For K > 10^8, we have cond_2(F^T F) = cond_2(F)^2 > 10^{16}, and Cholesky decomposition in double precision arithmetic gives a floating point error. In this case we try lsqr iteration with L (4 of 205 matrices).

A sketch of this hybrid selection is given below.
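The following Julia sketch illustrates the selection logic (hypothetical helper and symbol names; the dense computation of C is for clarity only and is not the paper's code, which keeps C implicit or sparse):

```julia
using SparseArrays, LinearAlgebra

# Zero entries of C that are small relative to the largest entry of their column,
# using the drop tolerance |c_ij| < |c_i| / K^(1/4).
function drop_small(C::SparseMatrixCSC, K)
    rows, cols, vals = findnz(C)
    cmax = zeros(size(C, 2))
    for (j, v) in zip(cols, vals)
        cmax[j] = max(cmax[j], abs(v))
    end
    keep = [abs(v) >= cmax[j] / K^0.25 for (j, v) in zip(cols, vals)]
    return sparse(rows[keep], cols[keep], vals[keep], size(C)...)
end

# Pick the iteration based on K = cond2(L*inv(L1)), as in the list above.
function choose_preconditioner(L1, L2, K)
    n = size(L1, 1)
    if K < 400
        return (:LLinv1, nothing)                      # lsqr with L*inv(L1) alone
    elseif K < 1e8
        C = Matrix(L2) / Matrix(L1)                    # explicit C = L2*inv(L1); dense for clarity
        Cdrop = drop_small(sparse(C), K)
        M = sparse(1.0I, n, n) + Cdrop' * Cdrop        # I + Cdrop'*Cdrop, SPD
        R = cholesky(Symmetric(M))                     # may fail if M is numerically singular
        return (:LLinv1_Rinv, R)                       # lsqr with L*inv(L1)*inv(R)
    else
        return (:L_only, nothing)                      # fall back to lsqr with L
    end
end
```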

We iterated with a tolerance of 1.e-10. For the 173 of 205 cases for which K < 400, lsqr iteration with LL_1^{-1} converged in all cases, requiring only the storage from LU, which averaged 12.6 times the number of nonzero elements of A. The average number of iterations for the tolerance of 1.e-10 was 0.32 n (n the number of columns of A). For 400 < K < 10^8 (28 of 205 matrices), the average storage needed was 165 times the storage of A, and 0.138 n iterations were needed on average (after deleting one outlier). For K > 10^8 (4 of 205 matrices), lsqr on L used an average of 6.52 times the storage of A with an average of 0.6 n iterations.

For each of the 205 matrices we computed a solution x. We checked each solution's ||Ax − b||_2 against the Julia qrfact solution, which uses Davis's sparse QR algorithm. Denote the qrfact solution as x_qr and the solution from the hybrid method as x_LinvL. Then for 152 of 205 matrices,

\[ \frac{\|x_{LinvL} - x_{qr}\|_2}{\|x_{LinvL}\|_2} < 10^{-8}. \]

For the 53 solutions that do not agree, which is better? The residual ||Ax − b||_2 tends to be smaller for lsqrLinvL (38 of 53 cases), but ||x||_2 tends to be larger (42 of 53 cases). Which is more accurate depends on how you regularize. The last figure plots the ratio

\[ \log_{10} \frac{\big\| \big( \|b - A x_{qr}\|_2,\ 10^{-8}\|x_{qr}\|_2 \big) \big\|_2}{\big\| \big( \|b - A x_{LinvL}\|_2,\ 10^{-8}\|x_{LinvL}\|_2 \big) \big\|_2}. \]
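For concreteness, the regularized residual being compared is the 2-norm of a two-entry vector, equal to sqrt(||b − Ax||^2 + 1.e-16 ||x||^2). A minimal sketch (the helper name is ours, not the paper's):

```julia
using LinearAlgebra

# sqrt(||b - A*x||^2 + 1e-16*||x||^2), written as the norm of a two-entry vector
tikhonov_resid(A, b, x) = norm([norm(b - A * x), 1e-8 * norm(x)])

# Ratio plotted in Fig. 5 (log10 scale), for solutions x_qr and x_linvl:
#   log10(tikhonov_resid(A, b, x_qr) / tikhonov_resid(A, b, x_linvl))
```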

Fig. 5: The matrices for which U has small diagonal entries are to the right of the graph. (Plot of log10 of the ratio sqrt(||Ax − b||^2 + 1.e-16 ||x||^2) for QR versus LUinvL, against -log10(min(abs(diag(U)))); 2 outliers chopped.)

5 Conclusion and future work

When the number of equations is not much larger than the number of variables, LU factorization usually allows iterative least squares solves with less storage than QR factorization. A main limitation arises when U has zero diagonal elements.

Acknowledgments We would like to thank Keichi Morikuni, Iain Duff, Michael Saunders and Dominique Orban for advice and encouragement. Thanks to North Carolina State University for use of the Henry2 HPC cluster.

References

1. M. Baboulin, L. Giraud, S. Gratton, J. Langou, Parallel tools for solving incremental dense least squares problems. Application to space geodesy, Journal of Algorithms and Computational Technology 3 (1) (2009) 117–133.
2. J. Nocedal, S. J. Wright, Numerical Optimization, Springer, 1999.
3. M. T. Heath, Numerical methods for large sparse linear least squares problems, SIAM J. Sci. Stat. Computing 5 (4) (1984) 497–513.
4. M. Lourakis, Sparse non-linear least squares optimization for geometric vision, European Conference on Computer Vision 2 (2010) 43–56.
5. M. Arioli, I. S. Duff, Preconditioning linear least-squares problems by identifying a basis matrix, SIAM J. on Scientific Computing 37 (5) (2015) S544–S561.
6. Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
7. M. Benzi, M. Tuma, A robust incomplete factorization preconditioner for positive definite matrices, Numerical Linear Algebra with Applications 10 (2003) 385–400.
8. T. Davis, The University of Florida sparse matrix collection, available from http://www.cise.ufl.edu/research/sparse/matrices/.
9. T. Davis, Direct Methods for Sparse Linear Matrices, SIAM, Philadelphia, 2005.
10. G. Peters, J. H. Wilkinson, The least squares problem and pseudo-inverses, Computing J. 13 (1970) 309–316.
11. A. Björck, I. S. Duff, A direct method for the solution of sparse linear least squares problems, Linear Algebra Appl. 34 (1980) 43–67.
12. A. Björck, J. Yuan, Preconditioners for least squares problems by LU factorization, Electronic Transactions on Numerical Analysis 8 (1999) 26–35.
13. T. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
14. G. W. Howell, M. Baboulin, LU preconditioning for overdetermined sparse least squares problems, Proceedings of the International Conference on Computational Science, Elsevier, 2015.
15. X. S. Li, An overview of SuperLU: Algorithms, implementation, and user interface, ACM Transactions on Mathematical Software 31 (3) (2005) 302–325.
16. X. S. Li, J. W. Demmel, SuperLU_DIST: a scalable distributed-memory sparse direct solver for unsymmetric linear systems, ACM Transactions on Mathematical Software 29 (9) (2003) 110–140.
17. MUMPS: a MUltifrontal Parallel sparse direct Solver, http://mumps.enseeiht.fr/index.php?page=home (2014).
18. O. Schenk, K. Gärtner, PARDISO User Guide, http://www.pardiso-project.org/manual/manual.pdf (2014).
19. T. A. Davis, UMFPACK User Guide, https://www.cise.ufl.edu/research/sparse/umfpack/UMFPACK/Doc/UserGui (2013).
20. G. H. Golub, C. F. Van Loan, Matrix Computations, third edition, The Johns Hopkins University Press, Baltimore, 1996.
21. D. Bateman, A. Adler, Sparse matrix implementation in Octave, available from arxiv.org/pdf/cs/0604006.pdf (2006).
22. A. Jennings, M. Ajiz, Incomplete methods for solving A^T Ax = b, SIAM J. Sci. Stat. Comput. 5 (4) (1984) 978–987.
23. N. Li, Y. Saad, MIQR: a multilevel incomplete QR preconditioner for large sparse least-squares problems, SIAM J. Matrix Anal. and Appl. 28 (2) (2006) 524–550.
24. W. W. Hager, Condition estimates, SIAM J. Sci. Stat. Computing 5 (1984) 311–316.
25. N. J. Higham, F. Tisseur, A block algorithm for matrix 1-norm estimation with an application to 1-norm pseudospectra, SIAM J. Matrix Anal. Appl. 21 (2000) 1185–1201.
26. Y. Saad, Iterative Methods for Sparse Linear Systems, second edition, SIAM, Philadelphia, 2000.
27. C. Paige, M. Saunders, An algorithm for sparse linear equations and sparse least squares, ACM Trans. on Math. Software 8 (1982) 43–71.
28. M. Arioli, M. Baboulin, S. Gratton, A partial condition number for linear least-squares problems, SIAM J. Matrix Anal. Appl. 29 (2) (2007) 413–433.
29. M. Baboulin, S. Gratton, R. Lacroix, A. J. Laub, Statistical estimates for the conditioning of linear least squares problems, in: R. Wyrzykowski et al. (Eds.), 10th International Conference on Parallel Processing and Applied Mathematics (PPAM 2013), Vol. 8384 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 2014, pp. 124–133.
