Matrices with hierarchical low-rank structures Jonas Ballani and Daniel Kressner ANCHP, EPF Lausanne, Station 8, CH-1015 Lausanne, Switzerland e-mail:
[email protected],
[email protected] The first author has been supported by an EPFL fellowship through the European Union’s Seventh Framework Programme under grant agreement no. 291771. Summary. Matrices with low-rank off-diagonal blocks are a versatile tool to perform matrix compression and to speed up various matrix operations, such as the solution of linear systems. Often, the underlying block partitioning is described by a hierarchical partitioning of the row and column indices, thus giving rise to hierarchical low-rank structures. The goal of this chapter is to provide a brief introduction to these techniques, with an emphasis on linear algebra aspects.
1 Introduction 1.1 Sparsity versus data-sparsity An n × n matrix A is called sparse if nnz(A), the number of its nonzero entries, is much smaller than n2 . Sparsity reduces the storage requirements to O(nnz(A)) and, likewise, the complexity of matrix-vector multiplications. Unfortunately, sparsity is fragile; it is easily destroyed by operations with the matrix. For example, the factors of an LU factorization of A usually have (many) more nonzeros than A itself. This so called fill-in can be reduced, sometimes significantly, by reordering the matrix prior to the factorization [25]. Based on existing numerical evidence, it is probably safe to say that such sparse factorizations constitute the methods of choice for (low-order) finite element or finite difference discretizations of two-dimensional partial differential equations (PDEs). They are less effective for threedimensional PDEs. More severely, sparse factorizations are of limited use when the inverse of A is explicitly needed, as for example in inverse covariance matrix estimation and matrix iterations for computing matrix functions. For nearly all sparsity patterns of practical relevance, A−1 is a completely dense matrix. In certain situations, this issue can be addressed effectively by considering approximate sparsity: A−1 is replaced by a sparse matrix M such that kA−1 − Mk ≤ tol for some prescribed tolerance tol and an appropriate matrix norm k · k. To illustrate the concepts mentioned above, let us consider the tridiagonal matrix
2
Jonas Ballani and Daniel Kressner
2 −1
. −1 2 . . + σ (n + 1)2 In , A = (n + 1)2 . . . . . . −1 −1 2
(1)
where the shift σ > 0 controls the condition number κ(A) = kAk2 kA−1 k2 . The intensity plots in Figure 1 reveal the magnitude of the entries of A−1 ; entries of magnitude below 10−15 are white. It can be concluded that A−1 can be very well approximated by a banded matrix if its
(a) σ = 4, κ(A) ≈ 2
(b) σ = 1, κ(A) ≈ 5
(c) σ = 0, κ(A) ≈ 103
Fig. 1. Approximate sparsity of A−1 for the matrix A from (1) with n = 50 and different values of σ . condition number is small. This is an accordance with a classical result [26, Proposition 2.1], which states for any symmetric positive definite tridiagonal matrix A that −1 [A ]i j ≤ C
√ 2|i− j| κ −1 √ , κ +1
p −1 C = max λmin , (2λmax )−1 (1 + κ(A))2 ,
(2)
where λmin , λmax denote the smallest/largest eigenvalues of A; see also [15] for recent extensions. This implies that the entries of A−1 decay exponentially away from the diagonal, but this decay may deteriorate as κ(A) increases. Although the bound (2) can be pessimistic, it correctly predicts that approximate sparsity is of limited use for ill-conditioned matrices as they arise from the discretization of (one-dimensional) PDEs. An n × n matrix is called data-sparse if it can be represented by much less than n2 parameters with respect to a certain format. Data-sparsity is much more general than sparsity. For example, any matrix of rank r n can be represented with 2nr (or even less) parameters by storing its low-rank factors. The inverse of the tridiagonal matrix (1) is obviously not a low-rank matrix, simply because it is invertible. However, any of its off-diagonal blocks has rank 1. One of the many ways to see is to use the Sherman-Morrison formula. Consider the following decomposition of a general symmetric positive definite tridiagonal matrix: T en1 A11 0 en1 A= − an1 ,n1 +1 , A11 ∈ Rn1 ×n1 , A22 ∈ Rn2 ×n2 , −e1 −e1 0 A22 where e j denotes a jth unit vector of appropriate length. Then A
−1
−1 T an1 ,n1 +1 A11 0 w1 w1 , = + T A−1 e + eT A−1 e w w2 0 A−1 2 1 + e 22 n1 11 n1 1 22 1
(3)
Matrices with hierarchical low-rank structures
3
−1 −1 have rank at where w1 = A−1 11 en1 and w2 = −A22 e1 . Hence, the off-diagonal blocks of A most 1. Assuming that n is an integer multiple of 4, let us partition (2) (2) (2) (2) B11 B12 A11 A12 A12 B12 (2) (2) (2) (2) B B A A −1 21 22 21 22 , A = (4) A= (2) (2) (2) (2) , A33 A34 B33 B34 A21 B34 (2) (2) (2) (2) A43 A44 B43 B44 (2)
(2)
such that Ai j , Bi j ∈ Rn/4×n/4 . By storing only the factors of the rank-1 off-diagonal blocks (2)
of A−1 and exploiting symmetry, it follows that – additionally to the 4 diagonal blocks B j j of size n/4 × n/4 – only 2n/2 + 4n/4 = 2n entries are needed to represent A−1 . Assuming that n = 2k and partitioning recursively, it follows that the whole matrix A−1 can be stored with 2n/2 + 4n/4 + · · · + 2k n/2k + n = n(log2 n + 1)
(5)
parameters. The logarithmic factor in (5) can actually be removed by exploiting the nestedness of the (2) low-rank factors. To see this, we again consider the partitioning (4) and let U j ∈ Rn/4×2 , j = 1, . . . , 4, be orthonormal bases such that n o (2) (2) (2) span (A j j )−1 e1 , (A j j )−1 en/4 ⊆ range(U j ). ! (2) (2) A11 A12 Applying the Sherman-Morrison formula (3) to A11 = (2) (2) shows A21 A22 ! ! (2) (2) U1 0 U1 0 −1 −1 A11 e1 ∈ range (2) , A11 en/2 ∈ range (2) . 0 U2 0 U2 Similarly, A−1 22 e1
! (2) U3 0 ∈ range (2) , 0 U4
A−1 22 en/2
! (2) U3 0 ∈ range (2) . 0 U4
If we let U j ∈ Rn/2×2 , j = 1, 2, be orthonormal basis such that n o −1 span A−1 j j e1 , A j j en/2 ⊆ range(U j ), then there exists matrices X j ∈ R4×2 such that (2)
Uj =
U2 j−1
0
0
U2 j
(2)
! Xj.
(6) (2)
Hence, there is no need to store the bases U1 ,U2 ∈ Rn/2×2 explicitly; the availability of U j and the small matrices X1 , X2 suffices. In summary, we can represent A−1 as (2) (2) (2) (2) U1 S12 (U2 )T B11 T U1 S12U2 (2) (2) T (2) T (2) U (S ) (U ) B22 12 1 (7) 2 (2) (2) (2) (2) T B U S (U ) T UT 33 3 34 4 U2 S12 1 (2) (2) (2) (2) U4 (S34 )T (U3 )T B44
4
Jonas Ballani and Daniel Kressner (2)
for some matrices S12 , Si j ∈ R2×2 . The number of entries needed for representing the offdiagonal blocks A−1 is thus 4 × 2n/4 + 2 × 8 + (2 + 1) × 4 . | {z } | {z } | {z } (2)
for U j
for X j
(2)
(2)
for S12 , S12 , S34
Assuming that n = 2k and recursively repeating the described procedure k − 1 times, we find that at most O(n) total storage is needed for representing A−1 . Thus we have removed the logarithmic factor from (5), at the expense of a larger constant due to the fact that we work with rank 2 instead of rank 1. Remark 1. The format (7) together with (6) is a special case of the HSS matrices that will be discussed in Section 4.1. The related but different concepts of semi-separable matrices [65] and sequentially semi-separable matrices [21] allow to represent the inverse of a tridiagonal matrix A with nested rank-1 factors and consequently reduce the required storage to two vectors of length n. For a number of reasons, the situation discussed above is simplistic and not fully representative for the type of applications that can be addressed with the hierarchical low-rank structures covered in this chapter. First, sparsity of the original matrix can be beneficial but it is not needed in the context of low-rank techniques. Second, exact low rank is a rare phenomenon and we will nearly always work with low-rank approximations instead. Third, the recursive partitioning (4) is closely tied to banded matrices and related structures, which arise, for example, from the discretization of one-dimensional PDEs. More general partitionings need to be used when dealing with structures that arise from the discretization of higher-dimensional PDEs.
1.2 Applications of hierarchical low-rank structures The hierarchical low-rank structures discussed in this chapter (HODLR, H , HSS, H 2 ) have been used in a wide variety of applications. The purpose of the following selection is to give a taste for the variety of applications; it is by no means complete: • • • • • • • • • • •
Discretization of (boundary) integral operators [5, 6, 8, 10, 17, 18, 42, 57]; HSS techniques for integral equations [24, 32, 38, 53]; Hierarchical low rank approximation combined with (sparse) direct factorizations [2, 31, 54, 56, 70, 72]; Preconditioning with low-rank Schur complements and HSS techniques [29]; H -matrix based preconditioning for finite element discretization of Maxwell equations [55]; Matrix sign function iteration in H -matrix arithmetic for solving matrix Lyapunov and Riccati equations [4, 37]; Combination of contour integrals and H -matrices for computing matrix functions [30]; Eigenvalue computation [12, 13, 14, 68]; Kernel approximation in machine learning [61, 69]; Sparse covariance matrix estimation [1, 3] and Kalman filtering [50]; Clustered low-rank approximation of graphs [58].
Matrices with hierarchical low-rank structures
5
1.3 Outline The rest of this chapter is structured as follows. In Section 2, we spend considerable attention to finding low-rank approximations of matrices. In particular, we discuss various algorithms used in practice and give a priori error estimates that offer insight into the ability of performing such approximations. Section 3 combines low-rank approximations with block partitionings, significantly extending the construction in (4). Finally, in Section 4, we additionally impose nestedness on the low-rank factors, analogous to (6). There are numerous excellent monographs [6, 17, 42, 66, 67], survey papers, and lecture notes on matrices with hierarchical low-rank structures. This chapter possibly complements the existing literature by serving as a broad but compact introduction to the field.
2 Low-rank approximation 2.1 The SVD and best low-rank approximation The basis of everything in this chapter is, in one way or the other, the singular value decomposition (SVD). Theorem 1 (SVD). Let A ∈ Rm×n with m ≥ n. Then there are orthogonal matrices U ∈ Rm×m and V ∈ Rn×n such that σ1 .. . A = UΣV T , with Σ = ∈ Rm×n σn 0 and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0. The restriction m ≥ n in Theorem 1 is for notational convenience only. An analogous result holds for m < n, which can be seen by applying Theorem 1 to AT . Given an SVD A = UΣV T , the nonnegative scalars σ1 , . . . , σn are called the singular values, the columns v1 , . . . , vn of V right singular vectors, and the columns u1 , . . . , um of U left singular vectors of A. In M ATLAB, the SVD of a matrix A is computed by typing [U,D,V] = svd(A). If A is not square, say m > n, a reduced or “economic” SVD is computed when typing [U,D,V] = svd(A,’econ’). This yields a matrix U ∈ Rm×n with orthonormal columns and an orthogonal matrix V ∈ Rn×n such that σ1 T .. A=U V . . σn For r ≤ n, let Uk := u1 · · · uk ,
Σk := diag(σ1 , . . . , σk ),
Vk := u1 · · · uk .
Then the matrix Tk (A) := Uk ΣkVk clearly has rank at most k. For any class of unitarily invariant norms k · k, we have
(8)
6
Jonas Ballani and Daniel Kressner
kTk (A) − Ak = diag(σk+1 , . . . , σn )
In particular, the spectral norm and the Frobenius norm give q 2 + · · · + σ 2. kA − Tk (A)k2 = σk+1 , kA − Tk (A)kF = σk+1 n
(9)
This turns out to be optimal. Theorem 2 (Schmidt-Mirsky theorem). For A ∈ Rm×n , let Tk (A) be defined as in (8). Then kA − Tk (A)k = min kA − Bk : B ∈ Rm×n has rank at most k holds for any unitarily invariant norm k · k. Proof. Because of its simplicity and beauty, we include the proof for the spectral norm. The proof for the general case, which goes back to Mirsky, can be found in, e.g., [48, Pg. 468]. For any matrix B ∈ Rm×n of rank at most k, the null space kernel(A) has dimension at least n − r. Hence, the intersection between kernel(A) and the (k + 1)-dimensional space range(Vk+1 ) is nontrivial. Let w be a vector in this intersection with kwk2 = 1. Then T T kA − Bk22 ≥ k(A − B)wk22 = kAwk22 = kAVk+1Vk+1 wk22 = kUk+1 Σk+1Vk+1 wk22
=
r+1
r+1
j=1
j=1
∑ σ j |vTj w|2 ≥ σk+1 ∑ |vTj w|2 = σk+1 .
Together with (9), this shows the result. t u Remark 2. The singular values are uniquely determined by the matrix A. However, even when ignoring their sign indeterminacy, the singular vectors are only uniquely determined for simple singular values [48, Thm. 2.6.5]. In particular, the subspaces range(Ur ), range(Vr ), and the operator Tk (A) are uniquely determined if and only if σk > σk+1 . However, as we will see in Section 2.2 below, there is nothing pathological about the situation σk = σk+1 for the purpose of low-rank approximation. A practically important truncation strategy is to choose the rank such that the approximation error is below a certain threshold ε > 0. With slight abuse of notation, we write Tε (A) := Tkε (A) (A), where kε (A) is the ε-rank of A defined as the smallest integer k such that kTk (A)−Ak ≤ ε. By entering type rank into M ATLAB, one observes that the M ATLAB function rank actually computes the ε-rank with respect to the spectral norm and with ε = u max{m, n}kAk2 , where u ≈ 2 × 10−16 is the unit roundoff.
2.2 Stability of SVD and low-rank approximation The singular values of a matrix A are perfectly stable under a perturbation A 7→ A + E. Letting σ j (B) denote the jth singular value of a matrix B, Weyl’s inequality [48, Pg. 454] states that σi+ j−1 (A + E) ≤ σi (A) + σ j (E), Setting j = 1, this implies
1 ≤ i, j, ≤ min{m, n},
i + j ≤ q + 1.
Matrices with hierarchical low-rank structures σi (A + E) ≤ σi (A) + kEk2 ,
i = 1, . . . , min{m, n}.
7 (10)
In view of Remark 2, it is probably not surprising that the superb normwise stability of singular values does not carry over to singular vectors. For example [63], consider 1 0 0 ε A= , E= . 0 1+ε ε −ε 1 0 Then A has right singular vectors , while A + E has the completely different right 0 1 1 √1 1 singular vectors √1 , no matter how small ε > 0 is chosen. Considering the 2 1 2 −1 low-rank approximation (8), it is particularly interesting to study the perturbation behavior of the subspace spanned by the first k singular vectors. Theorem 3 (Wedin [71]). Let r < min{m, n} and assume that δ := σk (A + E) − σk+1 (A) > 0. fk / Vk and V fk denote the subspaces spanned by the first k left/right singular Let Uk and U vectors of A and A + E, respectively. Then q √
fk ) 2 + sinΘ (Vk , V fk ) 2 ≤ 2 kEkF ,
sinΘ (Uk , U F F δ where Θ is a diagonal matrix containing the canonical angles between two subspaces on its diagonal. Theorem 3 predicts that the accuracy of the singular subspaces is determined by the perturbation level in A multiplied with δ −1 ≈ (σk − σk+1 )−1 . This result appears to be rather discouraging because we will primarily work with matrices for which the singular values decay rather quickly and hence σk − σk+1 is very small. Fortunately, a simple argument shows that low-rank approximations are rather robust under perturbations. Lemma 1. Let A ∈ Rm×n have rank at most k. Then kTk (A + E) − Ak ≤ CkEk holds with C = 2 for any unitarily √ invariant norm k · k. For the Frobenius norm, the constant can be improved to C = (1 + 5)/2. Proof. By Theorem 2, kTk (A + E) − (A + E)k ≤ E. Hence, the triangle inequality implies kTk (A + E) − (A + E) + (A + E) − Ak ≤ 2kEk. The proof of the second part can be found in [43, Thm 4.5]. t u For the practically more relevant case that the original matrix A has rank larger than r, Lemma 1 implies
kTk (A + E) − Tk (A)k = Tk Tk (A) + (A − Tk (A)) + E − Tk (A) ≤ Ck(A − Tk (A)) + Ek ≤ C(kA − Tk (A)k + kEk).
(11)
Thus, as long as the perturbation level is not larger than the desired truncation error, a perturbation will at most have a mild effect on our ability to approximate A by a low-rank matrix.
8
Jonas Ballani and Daniel Kressner
Example 1. Consider a partitioned matrix A11 A12 A= , A21 A22
Ai j ∈ Rmi ×n j ,
and a desired rank k ≤ mi , n j . Let ε := kTk (A) − Ak and Ei j := Tk (Ai j ) − Ai j . Because of the min-max characterization of singular values, we have kEi j k ≤ ε. Setting k · k ≡ k · kF and using (11), we obtain
≤ Cε,
Tk Tk (A11 ) Tk (A12 ) − A = Tk A + E11 E12 − A
E E Tk (A21 ) Tk (A22 ) 21 22 F F √ with C = 23 (1+ 5). Consequently, first truncating the matrix blocks and then the entire matrix results in a quasi-optimal approximation with a very reasonable constant.
2.3 Algorithms for low-rank approximation The choice of algorithm for computing a good low-rank matrix approximation to a matrix A ∈ Rm×n critically depends on the size of the matrix A how its entries can be accessed by the algorithm: 1. If all entries of A are readily available or can be cheaply evaluated and min{m, n} is small, we can simply compute Tk (A) according to its definition via the SVD. As we will explain below, the SVD can also be useful when min{m, n} is large but A is given in factorized form. 2. For large m, n, the Lanczos-based method and randomized algorithms are typically the methods of choice, provided that matrix-vector multiplications with A and AT can be performed. 3. When A is not readily available and the computation of its entries is expensive, it becomes mandatory to use an approach that only requires the computation of a selected set of entries, such as adaptive cross approximation.
SVD-based algorithms We refer to [34, Sec. 8.6] and [28] for a description of classical algorithms for computing an (economic) SVD A = UΣV T of a dense matrix A ∈ Rm×n . Assuming m ≥ n, the complexity of these algorithms is O(mn2 ). Unlike the algorithms discussed below, none of these classical algorithms is capable of attaining lower complexity when it is known a priori that the target rank k is much smaller than m, n and only the truncated SVD Uk ΣkVkT needs to be computed to define Tk (A). In practice, one of course never stores Tk (A) explicitly but always in terms of its factors by storing, for example, the (m + n)k entries of Uk Σk and Vk . In principle, this can be reduced further to (m + n)k − k2 + O(k) by expressing Uk ,Vk in terms of k Householder reflectors [34, Sec. 5.1.6], but in practice this is rarely done because it becomes too cumbersome to work with such a representation and the benefit for k m, n is marginal at best. One major advantage of the SVD is that one obtains complete knowledge about the singular values, which conveniently allows for choosing the truncation rank k to attain a certain accuracy. However, care needs to be applied when interpreting small computed singular values. Performing a backward stable algorithms in double-precision arithmetic introduces a
Matrices with hierarchical low-rank structures
9
backward error E with kEk2 ≤ 10−16 × Cσ1 for some constant C mildly growing with m, n. According to (10), one can therefore expect that singular values near or below 10−16 σ1 are completely corrupted by roundoff error. Only in exceptional cases it is possible to compute singular values to high relative accuracy [27]. The stagnation of the singular values observed for the Hilbert matrix in the left plot of Figure 4 below is typical and entirely due to roundoff error. It is well-known that, in exact arithmetic, these singular values decay exponentially, see [11]; they certainly do not stagnate at 10−15 . The SVD is frequently used for recompression. Suppose that A = BCT ,
with B ∈ Rm×K ,C ∈ Rn×K ,
(12)
where K is larger than the target rank k but still (much) smaller than m, n. For example, such a situation arises when summing up J matrices of rank k: J
A=
∑
j=1
T T Bi Ci = B1 · · · BJ C1 · · · CJ . |{z} |{z} | {z }| {z }
∈Rm×k ∈Rn×k
Rm×Jk
(13)
Rm×Jk
In this and similar situations, the matrices B,C in (12) cannot be expected to have orthogonal columns. The following algorithm is then used to recompress A back to rank k: 1. Compute (economic) QR decompositions B = QB RB and C = QC RC . ek ΣkVek . 2. Compute truncated SVD Tk (RB RCT ) = U ek , Vk = QCVek and return Tk (A) := Uk ΣkV T . 3. Set Uk = QBU k This algorithm returns a best rank-k approximation of A and has complexity O((m + n)K 2 ).
Lanczos-based algorithms In the following, we discuss algorithms that makes use of the following Krylov subspaces for extracting a low-rank approximation to A: KK+1 (AAT , u1 ) = span u1 , AAT u1 , . . . , (AAT )K u1 , KK+1 (AT A, v1 ) = span v1 , AT Av1 , . . . , (AT A)K v1 , with a (random) starting vector u1 ∈ Rm and v1 = AT u1 /kAT u1 k2 . The bidiagonal Lanczos process, which is shown in lines 1–5 of Algorithm 1, produces matrices UK+1 ∈ Rm×(K+1) , VK+1 ∈ Rn×(K+1) such that their columns form orthonormal bases of KK+1 (AAT , u1 ) and KK+1 (AT A, v1 ), respectively. The scalars α j , β j generated by the Gram-Schmidt process in Algorithm 1 can be arranged in a bidiagonal matrix α1 β2 α2 BK = . . (14) , .. .. βK αK leading to the (two-sided) Lanczos decomposition AT UK = VK BTK ,
AVK = UK BK + βK+1 uK+1 eTK ,
(15)
where eK denotes the Kth unit vector of length K. In view of (8), it might be tempting to approximate Tk (A) by extracting approximations to the k dominant singular triplets to A
10
Jonas Ballani and Daniel Kressner
from (15); see, e.g., [16]. For this purpose, one could use existing software packages like ARPACK [49], which is behind the M ATLAB functions eigs and svds. Unfortunately, this may turn out to be a rather bad idea, especially if restarting is used. The small gaps between small singular values lead to poor convergence behavior of the Lanczos method for approximating singular vectors, basically because of their ill-conditioning; see Theorem 3. This could mislead the algorithm and trigger unnecessary restarting. Following [62], we propose instead to use the approximation Tk (A) ≈ AK := UK Tk (BK )VKT . This leads to Algorithm 1. Note that, in order to avoid loss orthogonality, one needs to reorthogonalize the vectors ue and ve in lines 3 and 4 versus the previously computed vectors u1 , . . . , u j and v1 , . . . , v j , respectively. The approximation error of Algorithm 1 can be cheaply
Algorithm 1 Lanczos method for low-rank approximation Input: Matrix A ∈ Rm×n , starting vector u1 ∈ Rm with ku1 k2 = 1, integer K ≤ min{m, n}, desired rank k ≤ K. ek ∈ Rm×k , Vek ∈ Rn×k with orthonormal columns and diagonal matrix Output: Matrices U k×k e ek ΣekVe T . Σk ∈ R such that A ≈ AK := U k 1: ve ← AT u1 , α1 ← ke vk2 , v1 ← ve/α1 . 2: for j = 1, . . . , K do 3: ue ← Av j − α j u j , β j+1 ← ke uk2 , u j+1 ← ue/β j+1 . 4: ve ← AT u j+1 − β j+1 v j , α j+1 ← ke vk2 , v j+1 ← ve/α j+1 . 5: end for 6: Set UK ← u1 , . . . , uK and VK ← v1 , . . . , vK . 7: Construct bidiagonal matrix BK according to (14). b ΣbVb T . 8: Compute singular value decomposition BK = U ek ← UK U(:, b 1 : k), Σek ← Σb(1 : k, 1 : k), Vek ← VK Vb (:, 1 : k). 9: U
monitored, assuming that the exact value of kAkF or a highly accurate approximation is known. Lemma 2. The approximation AK returned by Algorithm 1 satisfies q kAK − AkF ≤ σk+1 (BK )2 + · · · + σK (BK )2 + ωK ,
(16)
where ωK2 = kAk2F − α12 ∑Kj=2 (α 2j + β j2 ). Proof. By the triangular inequality
kAK − AkF ≤ UK (Tk (BK ) − BK )VKT +UK BK VKT − A F q
≤ σk+1 (BK )2 + · · · + σK (BK )2 + kUK BK VKT − A F .
2 To bound the second term, we note that kAk2F = kBK k2F + kUK BK VKT − A F holds because of orthogonality. t u Example 2. We apply Algorithm 1 to obtain the low-rank approximation of two 100 × 100 matrices with a qualitatively different singular value decay. Since the error bound (16) is dominated by the second term, we only report on the values of ωK as K increases. We consider two examples with a rather different singular value behavior:
Matrices with hierarchical low-rank structures
11
(a) The Hilbert matrix A defined by A(i, j) = 1/(i + j − 1). (b) The “exponential” matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1.
5
2
10
10 Computed ω (1) Computed ω (2) Best approximation
0
10
Computed ω (1) Computed ω (2) Best approximation 0
10
−5
10
−10
10
−2
10 −15
10
−20
10
0
−4
20
40
60
80
100
10
0
20
40
60
80
100
Fig. 2. Behavior of ωK computed using (1) ωK2 = kAk2F − α12 ∑Kj=2 (α 2j + β j2 ) and (2) ωK = q 2 kA−UK BK VKT kF , compared to the best approximation error σK+1 + · · · + σn2 for the Hilbert matrix (left figure) and the exponential matrix (right figure).
From Figure 2, it can be seen that ωK follows the best approximation error quite closely, until it levels off due to round off error. Due to cancellation, this happens much earlier when we compute ωK using the expression from Lemma 2. Unfortunately, the more accurate formula ωK = kA −UK BK VKT kF is generally too expensive. In practice, one can avoid this by using the T − UK BK VKT kF , which also does not require much cheaper approximation kUK+1 BK+1VK+1 knowledge of kAkF . The excellent behavior of Algorithm 1 observed in Example 2 indicates that not much q 2 +···σ2 more than k iterations are needed to attain an approximation error close to σk+1 , min{m,n} √ say, within a factor of 2. For k min{m, n}, it thus pays off to use Algorithm 1 even for a dense matrix A, requiring a complexity of O(mnk + (m + n)k2 ). Example 3. One situation that calls for the use of iterative methods is the recompression of a matrix A that is represented by a longer sum (13). The algorithm based on QR decompositions described in the beginning of this section, see (13), has a complexity that grows quadratically with J, the number of terms in the sum. Assuming that it does not require more than O(k) iterations, Algorithm 1 scales linearly with J simply because the number of matrix-vector multiplications with A and AT scale linearly with J. In principle, one could alternatively use successive truncation to attain linear scaling in J. However, this comes at the risk of cancellation significantly distorting the results. For example, we have 2ε 0 2ε 0 10 00 2ε − 1 0 = T1 + + = T1 00 01 0 ε −1 0 ε 0 0 for any 0 ≤ ε < 1/4, but successive truncation of this sum leads to
12
Jonas Ballani and Daniel Kressner 10 00 ε −1 0 T1 T1 + + 00 01 0 ε −1 10 2ε − 1 0 0 0 = T1 + = , 00 0 ε −1 0 ε −1
which incurs an error of O(1) instead of O(ε).
Randomized Algorithms Algorithm 1 is based on subsequent matrix-vector multiplications, which perform poorly on a computer because of the need for fetching the whole matrix A from memory for each multiplication. One way to address this problem is to use a block Lanczos method [33]. The multiplication of A with a block of k vectors can be easily arranged such that A needs to be read only once from memory. A simpler and (more popular) alternative to the block Lanczos method is to make use of randomized algorithms. The following discussion of randomized algorithms is very brief; we refer to the excellent survey paper by Halko et al. [45] for more details and references.
Algorithm 2 Randomized algorithm for low-rank approximation Input: Matrix A ∈ Rm×n , desired rank k ≤ min{m, n}, oversampling parameter p ≥ 0. ek ∈ Rm×k , Vek ∈ Rn×k with orthonormal columns and diagonal matrix Output: Matrices U b=U ek ΣekVe T . Σek ∈ Rk×k such that A ≈ A k 1: Choose standard Gaussian random matrix Ω ∈ Rn×(k+p) . 2: Perform block matrix vector-multiplication Y = AΩ . 3: Compute (economic) QR decomposition Y = QR. 4: Form Z = QT A. b ΣbVb T . 5: Compute singular value decomposition Z = U ek ← QU(:, b 1 : k), Σek ← Σb(1 : k, 1 : k), Vek ← Vb (:, 1 : k). 6: U
Algorithm 2 is a simple but effective example for a randomized algorithm. To gain some intuition, let us assume for the moment that A has rank k, that is, it can be factorized as A = BCT with B ∈ Rm×k , C ∈ Rn×k . It then suffices to choose p = 0. With probability one, the matrix CT Ω ∈ Rk×k is invertible, which implies that the matrices Y = AΩ = B(CT Ω ) b = QQT A = QQT BCT = BCT = A and, hence, and have the same column space. In turn, A Algorithm 2 almost surely recovers the low-rank factorization of A exactly. When A does not have rank k (but can be well approximated by a rank-k matrix), it is advisable to choose the oversampling parameter p larger than zero in Algorithm 2, say, p = 5 or p = 10. The following result shows that the resulting error will not be far away from the best approximation error σk+1 in expectation, with respect to the random matrix Ω . Theorem 4 ([45, Thm 1.1]). Let k ≥ 2 and p ≥ 2 be chosen such that k + p ≤ min{m, n}. Then b produced by Algorithm 2 satisfies the rank-k approximation A ! p 4 (k + p) min{m, n} b 2 ≤ 2+ EkA − Ak σk+1 . (17) p−1
Matrices with hierarchical low-rank structures
13
It is shown in [45] that a slightly higher bound than (17) holds with very high probability. The bound (17) improves significantly when performing a few steps of subspace iteration after Step 2 in Algorithm 2, which requires a few additional block matrix-vector multiplications with AT and A. In practice, this may not be needed. As the following example shows, the observed approximation error is much better than predicted by Theorem 4. Example 4. We apply Algorithm 2 to the 100 × 100 matrices from Example 2. (a) For the Hilbert matrix A defined by A(i, j) = 1/(i + j − 1), we choose k = 5 and obtain the following approximation errors as p varies, compared to the exact rank-5 approximation error: Exact p=0 p=1 p=5 1.88 × 10−3 2.82 × 10−3 1.89 × 10−3 1.88 × 10−3 (b) For the exponential matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1, we choose k = 40 and obtain: Exact p = 0 p = 10 p = 40 p = 80 1.45 × 10−3 5 × 10−3 4 × 10−3 1.6 × 10−3 1.45 × 10−3 These results indicate that small values of p suffice to obtain an approximation not far away from σk+1 . Interestingly, the results are worse for the matrix with slower singular value decay. This is not the case for the Lanczos method (see Figure 2), for which equally good results are attained for both matrices.
Adaptive Cross Approximation (ACA) The aim of Adaptive Cross Approximation (ACA) is to construct a low-rank approximation of a matrix A directly from the information provided by well-chosen rows and columns of A. An existence result for such an approach is given in the following theorem. Note that we use M ATLAB notation for denoting submatrices. Theorem 5 ([36]). Let A ∈ Rm×n . Then there exist row indices r ⊂ {1, . . . , m} and column indices c ⊂ {1, . . . , n} and a matrix S ∈ Rk×k such that |c| = |r| = k and √ √ √ kA − A(:, c)SA(r, :)k2 ≤ (1 + 2 k( m + n))σk+1 (A). Theorem 5 shows that it is, in principle, always possible to build a quasi-optimal low-rank approximation from the rows and columns of A. The construction [36] of the low-rank approximation in Theorem 5 proceeds in two steps: 1. Select c and r by selecting k “good” rows from the singular vector matrices Uk ∈ Rm×k and Vk ∈ Rn×k , respectively. 2. Choose a suitable matrix S. Step 1 requires to choose c, r with |c| = |r| = k such that kUk (c, :)−1 k2 and kVk (r, :)−1 k2 are small. This is not practical, primarily because it involves the singular vectors, which we want to avoid computing. Moreover, it is NP hard to choose c and r optimally even when the singular vectors are known [23]. Recently, it was shown [22] that random column and row selections can be expected to result in good approximations, provided that A admits an approximate low-rank factorization with factors satisfying an incoherence condition. However, it is hard to know a priori whether this incoherence condition holds and, in the described
14
Jonas Ballani and Daniel Kressner
scenario of sampling only a few entries of A, it cannot be enforced in a preprocessing step that randomly transforms the columns and rows of A. Step 2 in the construction of [36] involves the full matrix A, which is not available to us. A simpler alternative is given by S = (A(r, c))−1 , provided that A(r, c) is invertible. This has a direct consequence: Letting R := A − A(:, c)(A(r, c))−1 A(r, :)
(18)
denote the remainder of the low-rank approximation, we have R(r, :) = 0,
R(:, c) = 0.
In other words, the rows r and the columns c of A are interpolated exactly, giving rise to the name cross approximation, cf. Figure 3.
1
3
6
1 3 6 2 4 7
2 4
≈ 7
Fig. 3. Low-rank approximation of a matrix by cross approximation
We still need to discuss a cheaper alternative for Step 1. For the approximation (18), the maximum volume principle offers a recipe for choosing c and r. To motivate this principle, consider a (k + 1) × (k + 1) matrix A11 a12 A= , A11 ∈ Rk×k , a12 ∈ Rk×1 , a21 ∈ R1×k , a22 ∈ R, a21 a22 and suppose that A11 is invertible. Then, using the Schur complement, |(A−1 )k+1,k+1 | =
1 |a22 − a12 A−1 11 a21 |
=
| det A| | det A11 |
If | det A11 | is maximal (that is, A11 has maximal volume) among all possible selections of k × k submatrices of A, this shows that |(A−1 )k+1,k+1 | is maximal among all entries of A−1 : |(A−1 )k+1,k+1 | = kA−1 kC := max (A−1 )i j . i, j
On the other hand, using [47, Sec. 6.2], we have σk+1 (A)−1 = kA−1 k2 = max x
≥
kA−1 xk2 kxk2
1 kA−1 xk∞ 1 max = kA−1 kC . x k+1 kxk1 k+1
Matrices with hierarchical low-rank structures
15
When performing cross approximation with the first k rows and columns, the remainder defined in (18) thus satisfies kRkC = |a22 − a12 A−1 11 a21 | =
1 ≤ (k + 1)σk+1 (A). kA−1 kC
By bordering A11 with other rows and columns of A, this result can be extended to m × n matrices. Theorem 6 ([35]). Let A ∈ Rm×n and consider a partitioning A A A = 11 12 A21 A22 such that A11 ∈ Rk×k is non-singular and has maximal volume among all k × k submatrices of A. Then kA22 − A21 A−1 11 A12 kC ≤ (k + 1)σk+1 (A). The quasi-optimality shown in Theorem 6 is good enough for all practical purposes. It requires to choose S = (A(r, c))−1 such that the “volume” | det A(r, c)| is as large as possible. Unfortunately, it turns out, once again, that the determination of a submatrix A(r, c) of maximal volume is NP-hard [20] and thus computationally infeasible. Greedy approaches try to incrementally construct the row and column index sets r, c by maximizing the volume of the submatrix A(r, c) in each step [5, 10, 64]. A basic algorithm for a greedy strategy is given in Algorithm 3.
Algorithm 3 ACA with full pivoting 1: Set R0 := A, r := {}, c := {}, k := 0 2: repeat 3: k := k + 1 4: (i∗ , j∗ ) := arg maxi, j |Rk−1 (i, j)| 5: r := r ∪ {i∗ }, c := c ∪ { j∗ } 6: δk := Rk−1 (i∗ , j∗ ) 7: uk := Rk−1 (:, j∗ ), vk := Rk−1 (i∗ , :)T /δk 8: Rk := Rk−1 − uk vTk 9: until kRk kF ≤ εkAkF
If we had rank(A) = k, Algorithm 3 would stop after exactly k steps with kRk kF = 0 and hence A would be fully recovered. It turns out that the submatrix A(r, c) constructed in Algorithm 3 has a straightforward characterization. Lemma 3 ([5]). Let rk = {i1 , . . . , ik } and ck = { j1 , . . . , jk } be the row and column index sets constructed in step k of Algorithm 3. Then det(A(rk , ck )) = R0 (i1 , j1 ) · · · Rk−1 (ik , jk ). Proof. From lines 7 and 8 in Algorithm 3, we can see that the last column of A(rk , ck ) is a linear combination of the columns of the matrix
16
Jonas Ballani and Daniel Kressner ek := [A(rk , ck−1 ), Rk−1 (rk , jk )] ∈ Rk×k , A
ek ) = det(A(rk , ck )). However, A ek (i, jk ) = 0 for all i = i1 , . . . , ik−1 and hence which implies det(A ek ) = Rk−1 (ik , jk ) det(A(rk−1 , ck−1 )). det(A Noting that det A(r1 , c1 ) = A(i1 , j1 ) = R0 (i1 , j1 ) thus proves the claim by induction. t u In line 4 of Algorithm 3, a maximization step over all entries of the remainder is performed. Clearly, this is only possible if all matrix entries are cheaply available. In this case, it is reasonable to assume that the norm of A can be used in the stopping criterion in line 9. For larger matrices A with a possibly expensive access to the entries A(i, j), the choice of the pivots and the stopping criterion need to be adapted. The inspection of all matrix entries in the pivot search can be avoided by a partial pivoting strategy as described in Algorithm 4. In line 3 of Algorithm 4, the new pivot index is found by fixing a particular row index and optimizing the column index only. Note also that there is no need to form the remainder Rk explicitly. Any entry of Rk can always be computed from k
Rk (i, j) = A(i, j) − ∑ u` (i)v` ( j). `=1
This means that in each step of Algorithm 4 only a particular row and column of the matrix A needs to be accessed. The overall number of accessed entries is hence bounded by k(m + n).
Algorithm 4 ACA with partial pivoting 1: Set R0 := A, r := {}, c := {}, k := 1, i∗ := 1 2: repeat 3: j∗ := arg max j |Rk−1 (i∗ , j)| 4: δk := Rk−1 (i∗ , j∗ ) 5: if δk = 0 then 6: if #r = min{m, n} − 1 then 7: Stop 8: end if 9: else 10: uk := Rk−1 (:, j∗ ), vk := Rk−1 (i∗ , :)T /δk 11: Rk := Rk−1 − uk vTk 12: k := k + 1 13: end if 14: r := r ∪ {i∗ }, c := c ∪ { j∗ } 15: i∗ := arg maxi,i∈r / uk (i) 16: until stopping criterion is satisfied
% choose column index in given row
% the matrix has been recovered
% update low-rank approximation
% choose new row index
If the norm of A is not available, also the stopping criterion from line 9 in Algorithm 3 needs to be adjusted. A possible alternative is to replace kAkF by kAk kF from the already computed low-rank approximation Ak := ∑kj=1 u j vTj : kAk k2F = kAk−1 k2F +
k−1
∑ uTk u j vTj vk + kuk k22 kvk k22 ,
j=1
Matrices with hierarchical low-rank structures
17
see also [10]. Replacing the residual norm kRk kF by kuk k2 kvk k2 , Algorithm 4 is stopped if kuk k2 kvk k2 ≤ εkAk kF . The construction of a low-rank approximation of A from selected knowledge however comes with some cost. Since Algorithm 4 takes only partial information of the matrix A into account, it is known to fail to produce reliable low-rank approximations in certain cases, see [18] for a discussion. Example 5. We apply ACA with full and partial pivoting to the two matrices from Example 2, and compare it to the optimal approximation obtained from the SVD. (a) For the Hilbert matrix, it can be seen from Figure 4 (left) that both ACA strategies yield quasi-optimal low-rank approximations. Recall that the singular values for this example decay very fast. (b) For the other matrix, Figure 4 (right) shows that ACA with full pivoting still yields a quasi-optimal low-rank approximation. In contrast, the partial pivoting strategy suffers a lot from the slow decay of the singular values.
Hilbert matrix
0
Exponential matrix
0
10
10
−5
Full pivoting Partial pivoting SVD e 2 /kAk 2 kA − Ak
e 2 /kAk 2 kA − Ak
Full pivoting Partial pivoting SVD 10
−10
10
−2
10
−4
10
−15
10
−6
10 5
10
15 k
20
25
30
20
40
60
80
100
k
Fig. 4. Approximation of matrices from Example 5 by ACA
The observed relation between quasi-optimality and decay of the best approximation error has been studied in a related context in [52]. The cross approximation technique has a direct correspondence to the process of Gaussian elimination [5]. After a suitable reordering of the rows and columns of A, we can assume that r = {1, . . . , k}, c = {1, . . . , k}. If we denote the kth unit vector by ek , we can write line 8 in Algorithm 3 as Rk = Rk−1 − δk Rk−1 ek eTk Rk−1 = (I − δk Rk−1 ek eTk )Rk−1 = Lk Rk−1 , where Lk ∈ Rm×m is given by
18
Jonas Ballani and Daniel Kressner 1 . .. 1 eT Rk−1 ek 0 Lk = , `i,k = − iT , ek Rk−1 ek `k+1,k 1 .. .. . . `m,k 1
i = k + 1, . . . , m.
The matrix Lk differs only in position (k, k) from the usual lower triangular factor in Gaussian elimination. This means that the cross approximation technique can be considered as a column-pivoted LU decomposition for the approximation of a matrix. The relations between ACA and the Empirical Interpolation Method, which plays an important role in reduced order modeling, are discussed in [9]. If the matrix A happens to be symmetric positive semi-definite, the entry of largest magnitude resides on the diagonal of A. Moreover, one can show that all remainders Rk in Algorithm 3 also remain symmetric positive semi-definite. This means that we can restrict the full pivot search in line 4 to the diagonal without sacrificing the reliability of the low-rank approximation. The resulting algorithm can be interpreted as a pivoted Cholesky decomposition [46, 51] with a computational cost of O(nk2 ) instead of O(n2 k) for the general ACA algorithm with full pivoting. See also [47] for an analysis.
2.4 A priori approximation results None of the techniques discussed in Section 2.3 will produce satisfactory approximations if A cannot be well approximated by a low rank matrix in the first place or, in other words, if the singular values of A do not decay sufficiently fast. It is therefore essential to gain some a priori understanding on properties that promote low-rank approximability. As a rule of thumb, smoothness generally helps. This section illustrates this principle for the simplest setting of matrices associated with bivariate functions; we refer to [17, 42, 57] for detailed technical expositions.
Separability and low rank Let us evaluate a bivariate function f (x, y) : xmin , xmax × ymin , ymax → R, on a tensor grid [x1 , . . . , xn ] × [y1 , . . . , ym ] with xmin ≤ x1 < · · · ≤ xn ≤ xmax , ymin ≤ y1 < · · · ≤ yn ≤ ymax . The obtained function values can be arranged into the m × n matrix f (x1 , y1 ) f (x1 , y2 ) · · · f (x1 , yn ) f (x2 , y1 ) f (x2 , y2 ) · · · f (x2 , yn ) F = (19) . .. .. .. . . . f (xm , y1 ) f (xm , y2 ) · · · f (xm , yn ) A basic but crucial observation: When f is separable, that is, it can be written in the form f (x, y) = g(x)h(y) for some functions g : [xmin , xmax ] → R, f : [ymin , ymax ] → R then
Matrices with hierarchical low-rank structures
19
g(x1 ) g(x1 )h(y1 ) · · · g(x1 )h(yn ) . .. .. F = = .. h(y1 ) · · · h(yn ) . . . g(xm ) g(xm )h(y1 ) · · · g(xm )h(yn )
In other words, separability implies rank at most 1. More generally, let us suppose that f can be well approximated by a sum of separable functions: f (x, y) = fk (x, y) + εk (x, y), (20) where fk (x, y) = g1 (x)h1 (y) + · · · + gk (x)hk (y),
(21)
and the error |εk (x, y)| decays quickly as k increases. Remark 3. Any function in the seemingly more general form ∑ki=1 ∑kj=1 si j gi (x)e h j (y) can be written in the form (21), by simply setting hi (y) := ∑kj=1 si j e h j (y). By (21), the matrix fk (x1 , y1 ) · · · fk (x1 , yn ) .. .. Fk := . . . fk (xm , y1 ) · · · fk (xm , yn )
has rank at most k and, hence, σk+1 (F) ≤ kF − Fk k2 ≤ kF − Fk kF ≤
m,n
∑
εk (xi , ym )2
1/2
≤
√ mnkεk k∞ ,
(22)
i, j=1
where kεk k∞ := max x∈[xmin ,xmax ] | f (x, y) − fk (x, y)|. In other words, a good approximation by a y∈[ymin ,ymax ]
short sum of separable functions implies a good low-rank approximation.
Low rank approximation via semi-separable approximation There are various approaches to obtain a semi-separable approximation (20).
Exponential sums In many cases of practical interest, the function f takes the form f (x, y) = fe(z) with z = ge(x) + e h(y) for certain functions ge, e h. Notably, this is the case for f (x, y) = 1/(x −y). Any exponential sum approximation fe(z) ≈ ∑ki=1 ωi exp(αi z) yields a semi-separable approximation for f (x, y): k
f (x, y) ≈
∑ ωi exp
i=1
αi ge(x) exp αie h(y) ,
√ For specific functions, such as 1/z and 1/ z, (quasi-)best exponential sum approximations are well understood; see [41]. Sinc quadrature is a general approach to obtain (not necessarily optimal) exponential sum approximations [42, Sec. D.5].
20
Jonas Ballani and Daniel Kressner
Taylor expansion The truncated Taylor expansion of f (x, y) in x around x0 , k−1
fk (x, y) :=
1 ∂ i f (x0 , y) (x − x0 )i , ∂ xi
∑ i!
i=0
(23)
immediately gives a semi-separable function. One can of course also use the Taylor expansion in y for the same purpose. The approximation error is governed by the remainder of the Taylor series. To illustrate this construction, let us apply it to the following function, which features prominently in kernel functions of one-dimensional integral operators: log(x − y) if x > y, f (x, y) := log(y − x) if y > x, 0 otherwise. Assuming xmax > xmin > ymax > ymin and setting x0 = (xmin + xmax )/2, the Lagrange representation of the error for the truncated Taylor series (23) yields | f (x, y) − fk (x, y)| ≤
1 ∂ k f (x , y) 1x max − xmin k 0 k (ξ − x ) ≤ . 0 k 2(xmin − y) ∂ xk ξ ∈[xmin ,xmax ] k! max
(24)
For general sets Ωx , Ωy ∈ Rd , let us define diam(Ωx ) := max{kx − yk2 : x, y ∈ Ωx }, dist(Ωx , Ωy ) := min{kx − yk2 : x ∈ Ωx , y ∈ Ωy }. For our particular case, we set Ωx = [xmin , xmax ] and Ωy = [ymin , ymax ] and assume that min diam(Ωx ), diam(Ωy ) ≤ η · dist(Ωx , Ωy ) (25) is satisfied for some 0 < η < 2. Such a condition is called admissibility condition and the particular form of (25) is typical for functions with a diagonal singularity at x = y. We use Taylor expansion in x if diam(Ωx ) ≤ diam(Ωy ) and Taylor expansion in y otherwise. The combination of inequalities (24) and (25) yields k f (x, y) − fk (x, y)k∞ ≤
1 η k . k 2
As a consequence of (22), the singular values of the matrix F decay exponentially, provided that (25) holds. By a refined error analysis of the Taylor remainder, the decay factor can be η [17, 42], which also removes the restriction η < 2. improved to 2+η
Polynomial interpolation Instead of Taylor expansion, Lagrange interpolation in any of the two variables x, y can be used to obtain a semi-separable approximation. This sometimes yields tighter bounds than truncated Taylor expansion, but the error analysis is often harder. For later purposes, it is interesting to observe the result of performing Lagrange interpolation in both variables x, y. Given f (x, y) with x ∈ [xmin , xmax ] and y ∈ [ymin , ymax ], we first
Matrices with hierarchical low-rank structures
21
perform interpolation in y. For arbitrary interpolation nodes ymin ≤ θ1 < · · · < θk ≤ ymax , the Lagrange polynomials Lθ ,1 , . . . , Lθ ,k of degree k − 1 are defined as Lθ , j (y) :=
∏
1≤i≤k i6= j
y − θi , θ j − θi
1 ≤ j ≤ k.
The corresponding Lagrange interpolation of f in y is given by Iy [ f ](x, y) :=
k
∑ f (x, θ j )Lθ , j (y).
j=1
Analogously, we perform interpolation in x for each f (x, θ j ) on arbitrary interpolation nodes xmin ≤ ξ1 < · · · < ξk ≤ xmax . Using the linearity of the interpolation operator, this yields k
fk (x, y) := Ix [Iy [ f ]](x, y) =
∑
f (ξi , θ j )Lξ ,i (x)Lθ , j (y).
i, j=1
By Remark 3, this function is semi-separable. Moreover, in contrast to the result obtained with expanding or interpolating in one variable only, fk is a bivariate polynomial. The approximation error can be bounded by kεk k∞ ≤ k f − Ix [Iy [ f ]]k∞ = k f − Ix [ f ] + Ix [ f ] − Ix [Iy [ f ]]k∞ ≤ k f − Ix [ f ]k∞ + kIx k∞ k f − Iy [ f ]k∞ , where the so called Lebesgue constant kIx k∞ satisfies kIx k∞ ∼ log r when using Chebyshev interpolation nodes. In this case, the one-dimensional interpolation errors k f − Ix [ f ]k∞ and k f − Iy [ f ]k∞ decay exponentially for an analytic function; see [17, 42]. Because the interpolation is performed in both variables, the admissibility condition (25) needs to be replaced by max diam(Ωx ), diam(Ωy ) ≤ η · dist(Ωx , Ωy ).
Remarks on smoothness. The techniques presented above are adequate if f is arbitrarily smooth. It is, however, easy to see that the discontinuous function f : [0, 1] × [0, 1] → R defined by 0 if 1/4 < x < 3/4, 1/4 < y < 3/4, f (x, y) := 1 otherwise, yields a matrix F of rank exactly 2. More generally, low-rank approximability is not severely affected by singularities aligned with the coordinate axes. For the case that f only admits finitely many mixed derivatives, see [59] and the references therein.
3 Partitioned low-rank structures We now proceed to the situation that not the matrix A itself but only certain blocks of A admit good low-rank approximations. The selection of such blocks is usually driven by an admissibility criterion, see (25) for an example.
22
Jonas Ballani and Daniel Kressner
3.1 HODLR matrices Hierarchically Off-Diagonal Low Rank (HODLR) matrices are block matrices with a multilevel structure such that off-diagonal blocks are low-rank. Let A ∈ Rn×n and assume for simplicity that n = 2 p n p for some maximal level p ∈ N with n p ≥ 1. Moreover, we denote by I = {1, . . . , n} the row (and column) index set of A. As a starting point, we first subdivide the matrix A into blocks. One level ` = 0, we partition the index set I10 := I into I10 = I11 ∪ I21 with I11 = {1, . . . , 2n } and I21 = { n2 + 1, . . . , n}. This corresponds to a subdivision of A into four subblocks A(Ii1 , I 1j ) with i, j = 1, 2. On each `+1 level ` = 1, . . . , p − 1, we further partition all index sets Ii` into equally sized sets I2i−1 and `+1 ` ` ` I2i . In accordance to this partition, the diagonal blocks A(Ii , Ii ), i = 1, . . . , 2 , are then subdivided into quadratic blocks of size 2 p−` n p . This results in a block structure for A as shown for p = 2, 3, 4 in Figure 5.
Fig. 5. HODLR matrices with level p = 2, 3, 4 Each off-diagonal block A(Ii` , I `j ), i 6= j, is assumed to have rank at most k and is kept in the low-rank representation (`)
(`)
A(Ii` , I `j ) = Ui (V j )T ,
(`)
(`)
Ui ,V j
∈ Rn` ×k ,
(26)
where n` := #Ii` = #I `j = 2 p−` n p . All diagonal blocks of A are represented on the finest level ` = p as dense n p × n p matrices. From Figure 5, we can easily see that there are 2` off-diagonal blocks on level ` > 0. If we sum up the storage requirements for the off-diagonal blocks A on the different levels, we get p
p
2k ∑ 2` n` = 2kn p `=1
∑ 2` 2 p−` = 2kn p p2 p = 2knp = 2kn log2 (n/n p ).
`=1
The diagonal blocks of A on the finest level require 2 p n2p = nn p units of storage. Assuming that n p ≤ k, the storage complexity for the HODLR format is hence given by O(kn log n).
Matrix-vector multiplication A matrix vector multiplication y = Ax for a HODLR matrix A and a vector x ∈ Rn can be carried out exactly (up to numerical accuracy) by a block recursion. On level ` = 1, we compute y(I11 ) = A(I11 , I11 )x(I11 ) + A(I11 , I21 )x(I21 ), y(I21 ) = A(I21 , I11 )x(I11 ) + A(I21 , I21 )x(I21 ).
Matrices with hierarchical low-rank structures
23
In the off-diagonal blocks A(I11 , I21 ) and A(I21 , I11 ), we need to multiply a low-rank matrix of size n/2 with a vector of size n/2 with cost cLR·v (n/2) where cLR·v (n) = 4nk. In the diagonal blocks, the cost for the matrix-vector multiplication cH·v is the same as for the original matrix with the matrix size reduced to n/2. If we add the cost n for the vector addition, we get the recursive relation cH·v (n) = 2cH·v (n/2) + 4kn + n. from which we conclude, using the Master theorem, that cH·v (n) = (4k + 1) log2 (n)n.
(27)
Matrix addition Let us now consider the computation of C = A + B for HODLR matrices A, B,C. In principle, the addition increases the rank in all off-diagonal blocks of C up to 2k, cf. (13). We apply truncation to ensure that all off-diagonals have again rank at most k, and thus the result of the addition will in general not be exact. The truncated addition is performed blockwise by C(Ii` , I `j ) := Tk (A(Ii` , I `j ) + B(Ii` , I `j )) for an off-diagonal block Ii` × I `j with the truncation operator Tk from (8). The cost for two rank-k matrices of size n is cLR+LR (n) = cSV D (nk2 + k3 ), where the generic constant cSV D depends on the cost for the QR factorization and the SVD. As there are 2` off-diagonal blocks on each level `, the cost for truncated addition sums up to p
p
∑ 2` cLR+LR (n` ) = cSV D ∑ 2` (k3 + n` k2 )
`=1
`=1
p
≤ cSV D (2 p+1 k3 + ∑ 2` 2 p−` n p k2 ) `=1
3
≤ cSV D (2nk + n log2 (n)k2 ). The dense diagonal blocks can be added exactly by 2 p n2p = nn p operations. Hence, the overall complexity for the truncated addition of two HODLR matrices is given by cH+H (n) = cSV D (nk3 + n log(n)k2 ).
Matrix multiplication As for the addition, the matrix-matrix multiplication C = AB of two HODLR matrices A and B is combined with truncation to ensure that all off-diagonal blocks of C have rank at most k. Again, this implies that the result of multiplication will in general not be exact. (`) (`) Let Ai, j = A(Ii` , I `j ), Bi, j = B(Ii` , I `j ), and consider the blockwise multiplication
24
Jonas Ballani and Daniel Kressner " (1) (1) # " (1) (1) # " (1) (1) (1) (1) A1,1 B1,1 + A1,2 B2,1 A A1,2 B1,1 B1,2 AB = 1,1 (1) (1) (1) (1) (1) (1) = (1) (1) A2,1 B1,1 + A2,2 B2,1 A2,1 A2,2 B2,1 B2,2
The block structure of (28) can be illustrated as · + · · = · + · where is again a HODLR matrix of size n/2 and structurally four different types of multiplications in (28):
(1) (1)
(1) (1)
(1) (1)
(1) (1)
A1,1 B1,2 + A1,2 B2,2 A2,1 B1,2 + A2,2 B2,2
·
+
·
·
+
·
# .
(28)
is a low-rank block. There are
1.
·
: the multiplication of two HODLR matrices of size n/2,
2.
·
: the multiplication of two low-rank blocks,
3.
·
: the multiplication of a HODLR matrix with a low-rank block,
4.
·
: the multiplication of a low-rank block with a HODLR matrix.
First note that in the cases 2,3,4 the multiplication is exact and does not require a truncation step. For case 1, we need to recursively apply low-rank truncation. Let cH·H , cLR·LR , cH·LR , cLR·H denote the cost for each of the four cases, respectively. Moreover, let cH+LR (n) be the cost for the truncated addition of a HODLR matrix with a low-rank matrix of size n. From (28) we conclude that cH·H (n) = 2(cH·H (n/2) + cLR·LR (n/2) + cH·LR (n/2) + cLR·H (n/2) + cH+LR (n/2) + cLR+LR (n/2)). If the result is kept in low-rank form, the cost for the multiplication of two low-rank blocks is cLR·LR (n) = 4nk2 . The multiplication of a HODLR matrix of size n with a rank-one matrix requires cH·v (n) operations. According to (27), we thus get cH·LR (n) = cLR·H (n) = kcH·v (n) = k(4k + 1) log2 (n)n. The cost for the truncated addition of a HODLR matrix with a low-rank matrix is the same as the cost for adding two HODLR matrices, that is, cH+LR (n) = cH+H (n) = cSV D (nk3 + n log(n)k2 ). Summing up all terms, we finally obtain cH·H (n) ∈ O(k3 n log n + k2 n log2 n).
Matrix factorization and linear systems A linear system Ax = b can be addressed by first computing the LU factorization A = LU where L is lower triangular with unit diagonal and U is upper triangular. The solution x is then obtained by first solving Ly = b forward substitution and then Ux = y by backward substitution. For a dense n × n matrix A, the LU factorization requires O(n3 ) and constitutes the most expensive step.
Matrices with hierarchical low-rank structures
25
In the following, we discuss how an inexact LU factorization A ≈ LU can be cheaply computed for a HODLR matrix A, such that L,U are again HODLR matrices: ≈
·
Such an inexact factorization can be used as a preconditioner in an iterative method for solving Ax = b. Let us first discuss the solution of Ly = b where L is a lower triangular HODLR matrix. On level ` = 1, L, y and b are partitioned as L11 0 y1 b L= , y= , b= 1 , L21 L22 y2 b2 with a low-rank matrix L21 and HODLR matrices L11 , L22 . The corresponding block forward substitution consists of first solving L11 y1 = b1 , (29) then we computing the vector e b2 := b2 − L21 y1
(30)
and finally solving L22 y2 = e b2 . (31) e The computation of b2 in (30) requires a matrix-vector product with a low-rank matrix and a vector addition with total cost (2k + 1)n. The solutions in (29) and (31) are performed recursively, by forward substitution with the matrix size reduced to n/2. Hence, the cost for forward substitution satisfies cforw (n) = 2cforw (n/2) + (2k + 1)n. On level ` = p, each of the 2 p = n/n p linear systems of size n p is solved directly, requiring O(n3p ) operations. The overall cost for forward substitution is thus given by cforw (n) ∈ O(nn2p + kn log(n)). An analogous estimate holds for the backward substitution cbackw (n) = cforw (n). Due to the properties of the matrix-vector for HODLR matrices, also the forward and backward algorithms can be carried out exactly (up to numerical accuracy). Both algorithms can be extended to the matrix level. Assume we want to solve LY = B where L 0 Y Y B B L = 11 , Y = 11 12 , B = 11 12 . L21 L22 Y21 Y22 B21 B22 To this end, we first solve L11Y11 = B11 ,
L11Y12 = B12
to find Y11 ,Y12 . Afterwards, we get Y21 ,Y22 from L22Y21 = B21 − L21Y11 ,
L22Y22 = B22 − L21Y12 .
We observe that this step involves matrix additions and multiplications in the HODLR format which in general require a rank truncation. The actual computation of the LU factorization can now be performed recursively. On level ` = 1, the matrices A, L,U have the block structure
26
Jonas Ballani and Daniel Kressner A A L 0 A = 11 12 , L = 11 , A21 A22 L21 L22
U=
U11 U12 , 0 U22
which leads to the equations A11 = L11U11 ,
A12 = L11U12 ,
A21 = L21U11 ,
A22 = L21U12 + L22U22 .
We can therefore proceed in the following way: 1. 2. 3. 4.
compute the LU factors L11 ,U11 of A11 , −1 compute U12 = L11 A12 by forward substitution, −1 compute L21 = A21U11 by backward substitution, compute the LU factors L22 ,U22 of A22 − L21U12 .
Each of these four steps can be addressed using LU factorization or forward and backward substitution for matrices of size n/2. On level ℓ = p we use dense routines to solve the subproblems of size n_p. By similar arguments as above, one can show that the cost c_LU for the approximate LU factorization is not larger than the cost for the matrix-matrix multiplication of two HODLR matrices, i.e.

  c_LU(n) ≲ c_{H·H}(n) = O(k^3 n log n + k^2 n log^2 n).

If the matrix A happens to be symmetric positive definite, an approximate Cholesky factorization A ≈ L L^T can be obtained by the same techniques with a reduction of the computational cost.
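The following sketch mimics the four steps above in Python. To keep it short, all blocks are stored densely and no rank truncation is performed, so it reproduces the recursive structure of the HODLR LU but not its complexity; in an actual implementation, A_12, A_21, U_12, L_21 would be kept in low-rank format and all additions and multiplications would be truncated. The function names and the choice of a diagonally dominant test matrix (to avoid pivoting) are our own.

import numpy as np
from scipy.linalg import solve_triangular

def block_lu(A, n_min=64):
    n = A.shape[0]
    if n <= n_min:                                   # dense LU on the smallest diagonal blocks
        L = np.eye(n); U = A.copy()                  # (no pivoting in this sketch)
        for j in range(n - 1):                       # simple Doolittle elimination
            L[j+1:, j] = U[j+1:, j] / U[j, j]
            U[j+1:, j:] -= np.outer(L[j+1:, j], U[j, j:])
        return L, np.triu(U)
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    L11, U11 = block_lu(A11, n_min)                                      # step 1
    U12 = solve_triangular(L11, A12, lower=True, unit_diagonal=True)     # step 2
    L21 = solve_triangular(U11.T, A21.T, lower=True).T                   # step 3
    L22, U22 = block_lu(A22 - L21 @ U12, n_min)                          # step 4
    L = np.block([[L11, np.zeros((m, n - m))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - m, m)), U22]])
    return L, U

# Usage on a diagonally dominant test matrix, so that pivoting is not needed:
n = 256
A = np.random.rand(n, n) + n * np.eye(n)
L, U = block_lu(A)
print(np.linalg.norm(A - L @ U) / np.linalg.norm(A))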
3.2 General clustering strategies

In many applications, the particular block partitioning of HODLR matrices is not appropriate and leads either to a poor approximation or to large ranks. Obtaining a block partitioning that is adapted to the situation at hand is by no means trivial and is referred to as clustering. The first step of clustering typically consists of ordering the index set I = {1, . . . , n} of A ∈ R^{n×n}. The second step consists of recursively subdividing the blocks of the reordered matrix A. Our main goals in the clustering process are the following:
(a) The ranks of all blocks should be small relative to n.
(b) The total number of blocks should be small.
Clearly, we need to balance both goals in some way. The reordering of A and the subdivision of the blocks are often driven by some notion of locality between the indices. This locality could be induced by the geometry of the underlying problem or by the incidence graph of A (if A is sparse). In the following, we illustrate both strategies. First, we consider a geometrically motivated clustering strategy related to the concepts from Section 2.4.
Geometrical clustering

Suppose we aim to find u : Ω → R satisfying a one-dimensional integral equation of the form

  ∫_0^1 log|x − y| u(y) dy = f(x),   x ∈ Ω = [0, 1],

with a given right-hand side f : Ω → R. For n = 2^p, p ∈ N, let the interval [0, 1] be subdivided into subintervals

  τ_i := [(i − 1)h, ih],   1 ≤ i ≤ n,
where h := 1/n. A Galerkin discretization with piecewise constant basis functions

  ϕ_i(x) := 1 for x ∈ τ_i and ϕ_i(x) := 0 otherwise,   i = 1, . . . , n,

then leads to a matrix A ∈ R^{n×n} defined by

  A(i, j) := ∫_0^1 ∫_0^1 ϕ_i(x) log|x − y| ϕ_j(y) dy dx = ∫_{τ_i} ∫_{τ_j} log|x − y| dy dx.
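To make the discretization concrete, the following sketch assembles A with a crude tensorized midpoint rule; this is an illustration only, since the entries with touching intervals contain a (nearly) singular integrand and would require adapted quadrature in a production code. It also exhibits the rapid singular value decay of an off-diagonal block.

import numpy as np

def assemble_logkernel(n, q=4):
    h = 1.0 / n
    ref = (np.arange(q) + 0.5) / q * h                # q midpoint nodes inside [0, h]
    w = h / q                                         # weight per node
    A = np.zeros((n, n))
    for i in range(n):
        xi = i * h + ref
        for j in range(n):
            yj = j * h + ref
            diff = np.abs(xi[:, None] - yj[None, :])
            diff[diff == 0] = 1e-16                   # guard against log(0) on the diagonal
            A[i, j] = w * w * np.sum(np.log(diff))
    return A

A = assemble_logkernel(2**6)
# Off-diagonal blocks of A are numerically low-rank, e.g. the top-right quarter:
s = np.linalg.svd(A[:32, 32:], compute_uv=False)
print(s[:5] / s[0])                                   # rapidly decaying singular values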
This is a variation of the simplified setting discussed in Section 2.4, cf. (19), and the results of Section 2.4 can be easily adjusted. In particular, a semi-separable approximation log|x − y| ≈ ∑_{ℓ=1}^k g_ℓ(x) h_ℓ(y) yields a rank-k approximation UV^T by setting the entries of the matrices U, V to

  U(i, ℓ) = ∫_{τ_i} g_ℓ(x) dx   and   V(j, ℓ) = ∫_{τ_j} h_ℓ(y) dy.

We recall from Section 2.4 that the diagonal singularity of log|x − y| requires us to impose the admissibility condition (25). To translate this into a condition on the blocks of A, let us define

  Ω_t := ∪_{i∈t} τ_i ⊂ Ω

for any index set t ⊂ I = {1, . . . , n}.

Definition 1. Let η > 0 and let s, t ⊂ I. We call a matrix block (s, t), associated with the row indices s and column indices t, admissible if

  min{diam(Ω_s), diam(Ω_t)} ≤ η dist(Ω_s, Ω_t).

Thus, a block is admissible if all its entries stay sufficiently far away from the diagonal, relative to the size of the block. To systematically generate a collection of admissible blocks, we first subdivide the index set I by a binary tree T_I such that neighboring indices are hierarchically grouped together. An example for n = 8 is depicted in Figure 6. Each node t ∈ T_I in the tree is an index set t ⊂ I that is the union of its sons whenever t is not a leaf node. The tree T_I is then used to recursively define a block subdivision of the matrix A, by applying
Algorithm 5 starting with s, t = I.

Fig. 6. Left: Cluster tree T_I for n = 8. Right: Partition of Ω = [0, 1] with respect to T_I.

With η := 1 in the admissibility condition, Algorithm 5 applied to the one-dimensional example from above yields the block structure shown in Figure 7.
Algorithm 5 BuildBlocks(s, t)
1: if (s, t) is admissible or both s and t have no sons then
2:   fix the block (s, t)
3: else
4:   for all sons s' of s and all sons t' of t do
5:     BuildBlocks(s', t')
6:   end for
7: end if
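A compact Python version of this construction might look as follows. The representation of clusters as index ranges, the treatment of the one-dimensional geometry through interval lengths, and the output format are our own choices; only the recursion itself follows Algorithm 5 and Definition 1.

def diam(block, h):                       # block = (lo, hi), 0-based indices lo..hi-1
    lo, hi = block
    return (hi - lo) * h                  # length of the union of the intervals tau_i

def dist(s, t, h):
    s_lo, s_hi = s
    t_lo, t_hi = t
    if s_hi <= t_lo:
        return (t_lo - s_hi) * h
    if t_hi <= s_lo:
        return (s_lo - t_hi) * h
    return 0.0                            # overlapping index sets

def sons(block):
    lo, hi = block
    if hi - lo <= 1:
        return []                         # leaf cluster
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]

def build_blocks(s, t, h, eta, partition):
    admissible = min(diam(s, h), diam(t, h)) <= eta * dist(s, t, h)
    if admissible or (not sons(s) and not sons(t)):
        partition.append((s, t, admissible))
    else:
        for s1 in sons(s) or [s]:
            for t1 in sons(t) or [t]:
                build_blocks(s1, t1, h, eta, partition)

n = 16
P = []
build_blocks((0, n), (0, n), h=1.0 / n, eta=1.0, partition=P)
print(len(P), "blocks,", sum(adm for *_, adm in P), "of them admissible")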
Fig. 7. Block subdivision with admissibility constant η = 1 for p = 2, 3, 4.
The structure is very similar to the HODLR format, except that now also blocks close to the diagonal are stored as dense matrices. In fact, a slightly weaker admissibility condition than in Definition 1 would have resulted in the block structure from Figure 5, see [44]. Clearly, the concentration of dense blocks towards the diagonal is only typical for one-dimensional problems. Fortunately, the construction above easily generalizes to problems in higher dimensions with Ω ⊂ R^d, d > 1. This requires a suitable adaptation of the construction of the cluster tree T_I, from which the block structure can be derived using Algorithm 5, see [19, 42] for details.
Algebraic clustering

When A ∈ R^{n×n} is sparse, a suitable partition of the blocks of A can sometimes successfully be obtained solely from the sparsity pattern of A, i.e., without having to take the geometry of the original problem into account. We briefly illustrate the basic ideas of such an algebraic clustering strategy and refer to [7] for details. Let G = (V, E) be the incidence graph of A, where V = I = {1, . . . , n} denotes the set of vertices and E denotes the set of edges. For simplicity, let us assume that A is symmetric; we thus obtain an undirected graph with

  (i, j) ∈ E  ⇔  A(i, j) ≠ 0.

Any subdivision of the graph G into subgraphs corresponds to a partition of the index set I. By the very basic property that a matrix with k nonzero entries has rank at most k, we aim to determine subgraphs such that the number of connecting edges between the subgraphs remains small. For this purpose, we partition

  I = I_1 ∪̇ I_2
with #I_1 ≈ #I_2 such that #{(i, j) ∈ E : i ∈ I_1, j ∈ I_2} is small. This process is continued recursively by subdividing I_1 and I_2 in the same way. In general, the minimization of the number of connecting edges between subgraphs is an NP-hard problem. However, efficient approximations to this problem are well known and can be found, e.g., in the software packages METIS or Scotch.

The subdivision process can be represented by a cluster tree T_I. We still need to define a suitable admissibility condition in order to turn T_I into a subdivision of the block index set I × I. Given vertices i, j ∈ I, we denote by dist(i, j) the length of a shortest path from i to j in the graph G. Moreover, for s, t ⊂ I let

  diam(s) := max_{i,j∈s} dist(i, j),   dist(s, t) := min_{i∈s, j∈t} dist(i, j).

In analogy to Definition 1, we call the block b = s × t admissible if

  min{diam(s), diam(t)} ≤ η dist(s, t)        (32)

for some constant η > 0. Based on the cluster tree T_I and the algebraic admissibility condition (32), we can then construct the partition P by Algorithm 5.

A variant of this partition strategy is to combine the clustering process with nested dissection. To this end, we subdivide I into

  I = I_1 ∪̇ I_2 ∪̇ S

such that (i, j) ∉ E for all i ∈ I_1, j ∈ I_2. The set S is called a separator of G. This leads to a permuted matrix A of the form

  A = [ A(I_1, I_1)  0            A(I_1, S) ;
        0            A(I_2, I_2)  A(I_2, S) ;
        A(S, I_1)    A(S, I_2)    A(S, S)   ].        (33)

Note that we typically have #S ≪ #I such that we obtain large off-diagonal blocks that completely vanish. As before, we can recursively continue the subdivision process with the index sets I_1 and I_2, which in turn moves a large number of zeros to the off-diagonal blocks. For a matrix A admitting an LU factorization A = LU, a particularly nice feature of this strategy is that the zero blocks in (33) are inherited by L and U. This means that we can incorporate this information a priori in the construction of a block-wise data-sparse approximation of the factors L and U. For an existence result on low-rank approximations to admissible blocks in the factors L and U and in the inverse matrix A^{-1}, see [7].
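The following sketch evaluates the algebraic admissibility condition (32) for a small sparse matrix. Computing all-pairs shortest paths, as done here with SciPy for brevity, is only viable for small n; practical codes exploit the cluster hierarchy to bound diameters and distances cheaply. The example matrix and index sets are arbitrary illustrative choices.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import shortest_path

n = 64
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")   # tridiagonal model matrix

# unweighted graph distances dist(i, j) in the incidence graph of A
G = (abs(A) > 0).astype(int)
D = shortest_path(G, method="D", unweighted=True)

def diam(s):
    return D[np.ix_(s, s)].max()

def dist(s, t):
    return D[np.ix_(s, t)].min()

def admissible(s, t, eta=1.0):
    return min(diam(s), diam(t)) <= eta * dist(s, t)

s = np.arange(0, 16)        # first cluster of indices
t = np.arange(32, 48)       # a cluster well separated in the graph
print(admissible(s, t), admissible(s, np.arange(16, 32)))   # True, False (touching clusters)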
3.3 H-matrices

Hierarchical (H-) matrices represent a generalization of the HODLR matrices from Section 3.1, see [6, 19, 40, 42]. In the HODLR format, the block structure of a matrix A ∈ R^{n×n} was derived from a perfect binary tree subdividing the index set I = {1, . . . , n}. In the H-matrix format, much more general block structures appear. On the one hand, this allows for more freedom and increases the applicability of hierarchical matrices. On the other hand, the design of efficient algorithms can become more complicated.

We start with a general partition P of the block index set I × I of A. In accordance with Algorithm 5, this partition is often constructed from a cluster tree T_I by means of an admissibility condition as in Definition 1 such that each block b ∈ P is of the form b = s × t with s, t ∈ T_I. The partition can be split into two sets P = P_+ ∪̇ P_−, where

  P_+ := {b ∈ P : b is admissible},   P_− := P \ P_+.
For fixed P and a given maximal rank k ∈ N, the set of hierarchical matrices is now defined as

  H(P, k) = {A ∈ R^{I×I} : rank(A(b)) ≤ k for all b ∈ P_+}.

All blocks b ∈ P_+ are stored by a low-rank representation, whereas all blocks b ∈ P_− are kept as dense matrices. As for the HODLR case, each block b = s × t ∈ P_+ of A ∈ H(P, k) is represented independently as

  A(b) = U_b V_b^T,   U_b ∈ R^{s×k_b}, V_b ∈ R^{t×k_b},

with a possibly different rank k_b ≤ k in each block. Since a low-rank block b = s × t has storage complexity k_b(#s + #t), the storage cost for a hierarchical matrix is given by

  stor_H = ∑_{b=s×t∈P_+} k_b(#s + #t) + ∑_{b=s×t∈P_−} #s #t.

If we assume that all non-admissible blocks s × t ∈ P_− are subject to min{#s, #t} ≤ n_min for some minimal cluster size n_min ∈ N, we readily estimate #s #t ≤ n_min(#s + #t), which implies

  stor_H ≤ max{n_min, k} ∑_{s×t∈P} (#s + #t).
To proceed from here, we need to count the number of blocks b ∈ P and weigh them according to their size. For each s ∈ T_I, the number of blocks in the partition P with row index set s is given by #{t ∈ T_I : s × t ∈ P}. Accordingly, the number of blocks in P with column index set t is given by #{s ∈ T_I : s × t ∈ P}. The maximum of both quantities over all s, t ∈ T_I is called the sparsity constant

  C_sp(P) := max{ max_{s∈T_I} #{t ∈ T_I : s × t ∈ P}, max_{t∈T_I} #{s ∈ T_I : s × t ∈ P} }

of the partition P.

Example 6. Consider the block structure from Figure 7 for p = 4. In Figure 8 we can see that for this particular case C_sp(P) = 6. Interestingly, this number does not increase for higher values of p.
Fig. 8. Number of blocks in the partition P for different row index sets s ∈ TI
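The sparsity constant can be computed directly from its definition once a partition is available, for instance the one produced by the build_blocks sketch after Algorithm 5 above (on which the following lines rely):

from collections import Counter

def sparsity_constant(partition):
    # partition is a list of blocks (s, t, ...), with s and t hashable cluster identifiers
    row_count = Counter(s for s, t, *rest in partition)
    col_count = Counter(t for s, t, *rest in partition)
    return max(max(row_count.values()), max(col_count.values()))

# C_sp for the geometric partitions of Section 3.2 at different depths p:
for p in [2, 3, 4, 5, 6]:
    n = 2**p
    P = []
    build_blocks((0, n), (0, n), h=1.0 / n, eta=1.0, partition=P)
    print(p, sparsity_constant(P))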
For each fixed s ∈ T_I, the term #s can hence appear at most C_sp(P) times in the sum ∑_{s×t∈P} #s, i.e.

  ∑_{s×t∈P} #s ≤ C_sp(P) ∑_{s∈T_I} #s.

Moreover, we immediately observe that on each level of the tree T_I an index set s ⊂ I appears at most once, such that

  ∑_{s∈T_I} #s ≤ #I (1 + depth(T_I)).

Assuming n_min ≤ k, the final complexity estimate is hence given by

  stor_H ≤ 2 C_sp(P) k n (1 + depth(T_I)).

For bounded C_sp(P) and depth(T_I) ∈ O(log n), this corresponds to the storage complexity stor_H ∈ O(kn log n) for HODLR matrices.
3.4 Approximation of elliptic boundary value problems by H-matrices

Let Ω ⊂ R^d be a domain and consider the Poisson problem

  Lu = f in Ω,   u = 0 on ∂Ω,

for some given right-hand side f and Lu = −Δu. Using the Green's function G(x, y), we can formally write the solution u as

  u(x) = (L^{-1} f)(x) = ∫_Ω G(x, y) f(y) dy.        (34)

For a sufficiently smooth boundary ∂Ω, one can show that G(x, y) can be well approximated by a polynomial p(x, y) for x ∈ σ ⊂ Ω, y ∈ τ ⊂ Ω whenever σ and τ are sufficiently far apart. If we split the polynomial p into its components depending on x and y, we hence find that

  G(x, y) ≈ p(x, y) = ∑_{j=1}^k u_j(x) v_j(y),   (x, y) ∈ σ × τ,        (35)

for functions u_j, v_j depending only on x and y, respectively. Assume now that (34) is discretized by a Galerkin approximation with finite element basis functions ϕ_i, i = 1, . . . , n. This leads to a matrix M ∈ R^{n×n} defined by

  M(i, j) = ∫_Ω ∫_Ω ϕ_i(x) G(x, y) ϕ_j(y) dy dx.

For s, t ⊂ {1, . . . , n}, let σ_s := ∪_{i∈s} supp ϕ_i and τ_t := ∪_{i∈t} supp ϕ_i. Due to (35), we can approximate the matrix block b = s × t by a low-rank representation

  M(b) ≈ U_b V_b^T,   U_b ∈ R^{s×k}, V_b ∈ R^{t×k},

where

  U_b(i, j) = ∫_Ω ϕ_i(x) u_j(x) dx,   V_b(i, j) = ∫_Ω ϕ_i(y) v_j(y) dy,

whenever the distance of σ_s and τ_t is large enough. Combined with a clustering strategy as in Section 3.2, this ensures the existence of H-matrix approximations to discretizations of the
inverse operator L^{-1}. Note, however, that the Green's function G(x, y) is usually not explicitly available and hence the polynomial approximation (35) remains a theoretical tool. If the boundary ∂Ω is not smooth enough or the operator L depends on non-smooth coefficients, we cannot expect that a good polynomial approximation of G(x, y) exists. Remarkably, for a general uniformly elliptic operator of the form

  Lu = − ∑_{i,j=1}^d ∂_j (c_{ij} ∂_i u)
with c_{ij} ∈ L^∞(Ω), one can still prove the existence of H-matrix approximations to the inverses of finite element stiffness and mass matrices, cf. [8]. For the solution of boundary value problems, an explicit construction of an approximate inverse is often not required. As the next example illustrates, an approximate factorization of the discretized operator suffices and can easily be combined with standard iterative schemes.

Example 7. Let Ω = (0, 1)^3 be the unit cube and consider the Poisson problem

  −Δu = f in Ω,   u = 0 on ∂Ω,

with f ≡ 1. A finite difference discretization with m + 2 points in each spatial direction leads to a linear system

  Ax = b        (36)

with A ∈ R^{n×n}, n := m^3, defined by

  A = B ⊗ E ⊗ E + E ⊗ B ⊗ E + E ⊗ E ⊗ B,

where B = tridiag(−1, 2, −1) ∈ R^{m×m} and E denotes the identity matrix of size m. In order to solve (36) by a direct method, we could compute a Cholesky factorization A = LL^T. As Figure 9 illustrates, the number of non-zeros in L increases rapidly with the matrix size. For the simple three-dimensional setting of this example, one can show nnz(L) = O(m^5) = O(n^{5/3}). Alternatively, we can use an iterative solver such as CG, which immediately calls for a good preconditioner. In the ideal case, this means that the number of preconditioned CG iterations should be bounded independently of the matrix size n. On the other hand, the complexity for the application (or storage) of the preconditioner should not deteriorate with the matrix size. It turns out that an approximate Cholesky factorization A ≈ L_H L_H^T by hierarchical matrices actually fulfills both criteria. In Figure 10 we can see that the singular values in the off-diagonal blocks of the Cholesky factor L_H decay rapidly. We now iteratively compute a solution of (36) by preconditioned CG using different accuracies for the hierarchical matrix factorization. In Table 1 we can observe that the number of CG iterations stays constant for different sizes of the spatial discretization. Moreover, both the storage requirements for L_H and the time for the factorization and for the solution tend to grow like O(n log^* n). However, we also recognize an important drawback: the time for the factorization actually exceeds the solution time by far. Still, the computed preconditioner will be useful if the solution for multiple right-hand sides needs to be found.

Remark 4. Example 7 is for illustration only. There are much more efficient methods to solve problems of this class by, e.g., fast multipole or boundary element methods.
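The setup of Example 7 is easy to reproduce. The sketch below assembles A exactly as in (36); since an H-matrix Cholesky factorization requires a dedicated library, SciPy's incomplete LU factorization is used here merely as a stand-in preconditioner that plays the same structural role (approximate factorization plus triangular solves), without the complexity guarantees of L_H. The right-hand side omits the h^2 scaling for simplicity.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

m = 30                                              # n = 27000, the smallest size in Table 1
n = m**3
B = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m), format="csc")
E = sp.identity(m, format="csc")
A = (sp.kron(sp.kron(B, E), E) + sp.kron(sp.kron(E, B), E)
     + sp.kron(sp.kron(E, E), B)).tocsc()
b = np.ones(n)                                      # right-hand side for f = 1 (h^2 scaling omitted)

ilu = spla.spilu(A, drop_tol=1e-3, fill_factor=10)  # stand-in for A ~ L_H L_H^T
M = spla.LinearOperator((n, n), matvec=ilu.solve)

num_iters = 0
def count(xk):
    global num_iters
    num_iters += 1

x, info = spla.cg(A, b, M=M, callback=count)
print(info, num_iters, np.linalg.norm(A @ x - b) / np.linalg.norm(b))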
Fig. 9. Cholesky factorization of A from Example 7 with n = 1000 (top) and n = 8000 (bottom). Left: A after reordering with symamd. Right: Cholesky factor L with nnz(L) > 33 · 10^3 (top) and nnz(L) > 855 · 10^3 (bottom).

     n     εLU    ||I − (L_H L_H^T)^{-1} A||_2   CG steps   stor(L_H) [MB]   time(chol) [s]   time(solve) [s]
 27000   1e-01    7.82e-01                              8               64             1.93             0.35
         1e-02    5.54e-02                              3               85             3.28             0.21
         1e-03    3.88e-03                              2              107             4.61             0.18
         1e-04    2.98e-04                              2              137             6.65             0.22
         1e-05    2.32e-05                              1              172            10.31             0.17
 64000   1e-01    9.11e-01                              8              174             5.82             0.88
         1e-02    9.66e-02                              4              255            10.22             0.62
         1e-03    6.56e-03                              2              330            15.15             0.48
         1e-04    5.51e-04                              2              428            23.78             0.54
         1e-05    4.53e-05                              1              533            34.81             0.46
125000   1e-01    1.15e+00                              9              373            14.26             2.09
         1e-02    1.57e-01                              4              542            25.21             1.32
         1e-03    1.19e-02                              3              764            44.33             1.33
         1e-04    9.12e-04                              2              991            65.86             1.19
         1e-05    7.37e-05                              1             1210            97.62             1.01

Table 1. Iterative solution of (36) by preconditioned CG with approximate Cholesky factorization A ≈ L_H L_H^T with accuracy ε_LU.
4 Partitioned low-rank structures with nested low-rank representations

4.1 HSS matrices

Hierarchically Semi-Separable (HSS) or Hierarchically Block-Separable (HBS) matrices represent special cases of the HODLR matrices from Section 3.1, see, e.g., [60]. In the HODLR
format, each off-diagonal block is expressed independently by a low-rank representation of the form (26). In the HSS format, the low-rank representations on different levels are nested and depend on each other in a hierarchical way.

Fig. 10. Cholesky factor L_H of A from Example 7 with n = 8000 for ε_LU = 10^{-4}.

Consider an off-diagonal block of a matrix A in HODLR format on level 0 < ℓ < p with the low-rank representation A(I_i^ℓ, I_j^ℓ) = U_i^{(ℓ)} S_{i,j}^{(ℓ)} (V_j^{(ℓ)})^T, where now S_{i,j}^{(ℓ)} ∈ R^{k×k}. On level ℓ + 1, the row index set is subdivided into I_i^ℓ = I_{2i−1}^{ℓ+1} ∪̇ I_{2i}^{ℓ+1} and the column index set into I_j^ℓ = I_{2j−1}^{ℓ+1} ∪̇ I_{2j}^{ℓ+1}. The crucial assumption in the HSS format then states that there exist matrices X_i^{(ℓ)} ∈ R^{2k×k}, Y_j^{(ℓ)} ∈ R^{2k×k} such that

  U_i^{(ℓ)} = [ U_{2i−1}^{(ℓ+1)}  0 ; 0  U_{2i}^{(ℓ+1)} ] X_i^{(ℓ)},   V_j^{(ℓ)} = [ V_{2j−1}^{(ℓ+1)}  0 ; 0  V_{2j}^{(ℓ+1)} ] Y_j^{(ℓ)}.        (37)

Given the nestedness condition (37), we can construct the row and column bases U^{(ℓ)} and V^{(ℓ)} for any ℓ = 1, . . . , p − 1 recursively from the bases U^{(p)} and V^{(p)} at the highest level p using the matrices X^{(ℓ)} and Y^{(ℓ)}.

To estimate the storage requirements for HSS matrices, we need a more refined estimate than the one used for HODLR matrices. As before, we denote by n_ℓ = 2^{p−ℓ} n_p the size of a block on level ℓ. It is clear that whenever n_ℓ ≤ k, all blocks on level ℓ can be represented as dense n_ℓ × n_ℓ matrices. The first level where this happens is

  ℓ_c := min{ℓ : n_ℓ ≤ k} = ⌊p − log_2(k/n_p)⌋.

To represent the bases on level ℓ, we need 2^{ℓ−1} matrices X^{(ℓ)}, Y^{(ℓ)} of size 2k × k, which sums up to

  2 ∑_{ℓ=1}^{ℓ_c−1} 2^{ℓ−1} 2k^2 ≤ 4k^2 2^{ℓ_c} ≤ 8k^2 2^p n_p / k = 8kn

units of storage. As there are 2^ℓ low-rank blocks on level ℓ, an analogous estimate shows that we need 2kn storage units to represent all matrices S^{(ℓ)} for ℓ = 1, . . . , ℓ_c − 1. The cost for storing all off-diagonal blocks on level ℓ = ℓ_c, . . . , p amounts to

  ∑_{ℓ=ℓ_c}^{p} 2^ℓ n_ℓ^2 = ∑_{ℓ=ℓ_c}^{p} 2^ℓ 2^{2p−2ℓ} n_p^2 = 2^{2p} n_p^2 ∑_{ℓ=ℓ_c}^{p} 2^{−ℓ} ≤ 2 · 2^{2p} n_p^2 2^{−ℓ_c} ≤ 2 · 2^{2p} k n_p / 2^p = 2kn.
Note that the storage cost for the diagonal blocks of A is again given by the last term of this sum. As a final result, the storage complexity for a matrix in the HSS format sums up to O(kn).

The matrix-vector multiplication y = Ax for x ∈ R^n in the HSS format is carried out in three steps. In the first step (forward transformation), the restriction of the vector x to an index set I_j^ℓ ⊂ I is transformed via the basis V_j^{(ℓ)}. The resulting vectors x_j^ℓ are then multiplied with the coupling matrices S_{i,j}^{(ℓ)} (multiplication phase). In a third step (backward transformation), the results y_i^ℓ from this multiplication are multiplied with the bases U_i^{(ℓ)} and added up. Due to the nestedness (37), Algorithm 6 has a bottom-to-top-to-bottom structure.
Algorithm 6 Compute y = Ax in HSS format
1: On level ℓ = p compute
     x_i^{(p)} = (V_i^{(p)})^T x(I_i^p),   i = 1, . . . , 2^p        % forward substitution
2: for level ℓ = p − 1, . . . , 1 do
3:   Compute
       x_i^ℓ = (Y_i^{(ℓ)})^T [ x_{2i−1}^{ℓ+1} ; x_{2i}^{ℓ+1} ],   i = 1, . . . , 2^ℓ
4: end for
5: for level ℓ = 1, . . . , p − 1 do
6:   Compute
       [ y_{2i−1}^ℓ ; y_{2i}^ℓ ] = [ 0  S_{2i−1,2i}^{(ℓ)} ; S_{2i,2i−1}^{(ℓ)}  0 ] [ x_{2i−1}^ℓ ; x_{2i}^ℓ ],   i = 1, . . . , 2^{ℓ−1}        % multiplication
7: end for
8: for level ℓ = 1, . . . , p − 1 do
9:   Compute
       [ y_{2i−1}^{ℓ+1} ; y_{2i}^{ℓ+1} ] = [ y_{2i−1}^{ℓ+1} ; y_{2i}^{ℓ+1} ] + X_i^{(ℓ)} y_i^ℓ,   i = 1, . . . , 2^ℓ        % backward substitution
10: end for
11: On level ℓ = p compute
       y(I_i^p) = U_i^{(p)} y_i^p + A(I_i^p, I_i^p) x(I_i^p),   i = 1, . . . , 2^p
Except for level ` = p, only the matrices X (`) , Y (`) , S(`) are used for the matrix-vector multiplication. On level ` = p, the last line in Algorithm 6 takes into account the contributions from the diagonal blocks of A. Since all matrices from the HSS representation are touched
only once, the complexity for the matrix-vector multiplication in the HSS format is the same as the storage complexity, i.e. O(kn). Interestingly, the HSS matrix-vector multiplication has a direct correspondence to the fast multipole method [39], see [73] and the references therein for a discussion. Analogous to HODLR matrices, the HSS format allows for approximate arithmetic operations such as matrix-matrix additions and multiplications, see [60] for an overview. As a consequence, approximate factorizations and an approximate inverse of a matrix are computable fully in the HSS format. This is exploited in the development of efficient solvers and preconditioners that scale linearly in the matrix size n, see [2, 24, 29, 31, 32, 38, 53, 54, 70, 72] for examples.
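To make the three phases of Algorithm 6 concrete, the following sketch applies an HSS matrix, generated from random data, to a vector. The 0-based block indexing and all variable names are our own; for simplicity, the coupling matrices of all levels 1, . . . , p, including the sibling blocks on the leaf level, are handled in a single multiplication loop.

import numpy as np

rng = np.random.default_rng(0)
p, k, n_leaf = 3, 2, 4                 # p levels, rank k, leaf block size
n = n_leaf * 2**p

# randomly generated HSS data (purely illustrative)
U = [rng.standard_normal((n_leaf, k)) for _ in range(2**p)]       # leaf row bases U_i^(p)
V = [rng.standard_normal((n_leaf, k)) for _ in range(2**p)]       # leaf column bases V_i^(p)
D = [rng.standard_normal((n_leaf, n_leaf)) for _ in range(2**p)]  # dense diagonal blocks
X = {l: [rng.standard_normal((2*k, k)) for _ in range(2**l)] for l in range(1, p)}  # transfers X_i^(l)
Y = {l: [rng.standard_normal((2*k, k)) for _ in range(2**l)] for l in range(1, p)}  # transfers Y_j^(l)
S = {l: [(rng.standard_normal((k, k)), rng.standard_normal((k, k)))
         for _ in range(2**(l-1))] for l in range(1, p + 1)}      # coupling pairs per level

def hss_matvec(x):
    # forward transformation: compressed pieces of x on all levels
    xs = {p: [V[i].T @ x[i*n_leaf:(i+1)*n_leaf] for i in range(2**p)]}
    for l in range(p - 1, 0, -1):
        xs[l] = [Y[l][i].T @ np.concatenate((xs[l+1][2*i], xs[l+1][2*i+1]))
                 for i in range(2**l)]
    # multiplication with the coupling matrices of the sibling blocks
    ys = {l: [np.zeros(k) for _ in range(2**l)] for l in range(1, p + 1)}
    for l in range(1, p + 1):
        for i in range(2**(l-1)):
            S1, S2 = S[l][i]
            ys[l][2*i]   += S1 @ xs[l][2*i+1]
            ys[l][2*i+1] += S2 @ xs[l][2*i]
    # backward transformation: propagate contributions down to the leaves
    for l in range(1, p):
        for i in range(2**l):
            z = X[l][i] @ ys[l][i]
            ys[l+1][2*i]   += z[:k]
            ys[l+1][2*i+1] += z[k:]
    y = np.zeros(n)
    for i in range(2**p):
        y[i*n_leaf:(i+1)*n_leaf] = U[i] @ ys[p][i] + D[i] @ x[i*n_leaf:(i+1)*n_leaf]
    return y

x = rng.standard_normal(n)
print(hss_matvec(x)[:4])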
4.2 H^2-matrices

H^2-matrices are a generalization of HSS matrices from Section 4.1 and a special case of H-matrices from Section 3.3, see [17] and the references therein. We start again with a partition P of the block index set I × I such that all b ∈ P have the form b = s × t with s, t ∈ T_I for a cluster tree T_I. For A ∈ H(P, k), we require that each admissible block b = s × t ∈ P_+ possesses a low-rank representation of the form

  A(b) = U_s S_b V_t^T,   U_s ∈ R^{s×k_s}, V_t ∈ R^{t×K_t}, S_b ∈ R^{k_s×K_t},   b ∈ P_+.        (38)

With this condition, the low-rank representations in each block are no longer independent since we have fixed a particular basis (or frame) U_s for each row index set s ∈ T_I and a particular V_t for each column index set t ∈ T_I. H-matrices with the additional property (38) are called uniform H-matrices. For each admissible block b of a uniform H-matrix, we do not store its low-rank representation separately but only the coupling matrix S_b. In addition, the bases U_s, V_t need only be stored once for each s, t ∈ T_I. This leads to a storage cost of

  stor_{H,uniform} = ∑_{s∈T_I} k_s #s + ∑_{t∈T_I} K_t #t + ∑_{b=s×t∈P_+} k_s K_t + ∑_{b=s×t∈P_−} #s #t.
A further reduction in the storage cost can be obtained if we impose an additional restriction on the matrices U_s, V_t. Let L(T_I) denote the set of leaves of the cluster tree T_I. We call the family of matrices (U_s)_{s∈T_I} nested if for each non-leaf index set s ∈ T_I \ L(T_I) with sons s' ∈ T_I there exists a matrix X_{s,s'} ∈ R^{k_{s'}×k_s} such that

  U_s(s', :) = U_{s'} X_{s,s'}.        (39)

Accordingly, the family (V_t)_{t∈T_I} is nested if for t ∈ T_I \ L(T_I) with sons t' ∈ T_I there exists a matrix Y_{t,t'} ∈ R^{K_{t'}×K_t} such that

  V_t(t', :) = V_{t'} Y_{t,t'}.        (40)

A uniform H-matrix with nested families (U_s)_{s∈T_I}, (V_t)_{t∈T_I} is called an H^2-matrix. Due to the nestedness conditions (39), (40), we only have to store the matrices U_s, V_t in the leaves s, t ∈ L(T_I), which requires at most 2kn units of storage. For all s' ∈ T_I \ {I} with father s ∈ T_I, we can store the transfer matrices X_{s,s'} with a complexity of k^2 #T_I. The number of nodes in the tree T_I can be estimated as #T_I ≤ 2 #L(T_I). Including the cost for the transfer matrices Y_{t,t'} and for the coupling matrices S_b for all blocks, we arrive at a storage cost of

  stor_{H^2} ≤ 2kn + 4k^2 #L(T_I) + ∑_{b=s×t∈P_+} k_s K_t + ∑_{b=s×t∈P_−} #s #t.
Note that if #L (TI ) ∼ n/nmin and nmin ∼ k, the second term in this sum is of order O(kn).
Example 8. To illustrate an explicit construction of an H^2-matrix, we recall the one-dimensional problem from Section 3.2 with A ∈ R^{n×n} defined by

  A(i, j) = ∫_{τ_i} ∫_{τ_j} κ(x, y) dy dx

for κ(x, y) := log|x − y|. Given an admissible block b = s × t, we let Ω_s := ∪_{i∈s} τ_i, Ω_t := ∪_{j∈t} τ_j and apply the interpolation technique from Section 2.4 to approximate κ on Ω_s × Ω_t by

  κ_k(x, y) := ∑_{µ=1}^k ∑_{ν=1}^k κ(ξ_µ^{(s)}, θ_ν^{(t)}) L_{ξ^{(s)},µ}(x) L_{θ^{(t)},ν}(y),   x ∈ Ω_s, y ∈ Ω_t,

with Lagrange polynomials L_{ξ^{(s)},µ} : Ω_s → R, L_{θ^{(t)},ν} : Ω_t → R, and interpolation points ξ_µ^{(s)} ∈ Ω_s, θ_ν^{(t)} ∈ Ω_t. Every entry (i, j) ∈ s × t of the submatrix A(s, t) can then be approximated by

  A(i, j) ≈ ∫_{τ_i} ∫_{τ_j} κ_k(x, y) dy dx = ∑_{µ=1}^k ∑_{ν=1}^k κ(ξ_µ^{(s)}, θ_ν^{(t)}) ∫_{τ_i} L_{ξ^{(s)},µ}(x) dx ∫_{τ_j} L_{θ^{(t)},ν}(y) dy = (U_s S_b V_t^T)_{i,j},

where

  U_s(i, µ) := ∫_{τ_i} L_{ξ^{(s)},µ}(x) dx,   V_t(j, ν) := ∫_{τ_j} L_{θ^{(t)},ν}(y) dy,   S_b(µ, ν) := κ(ξ_µ^{(s)}, θ_ν^{(t)}).
Since the matrices U_s and V_t only depend on the clusters s and t, respectively, this approximation already gives rise to a uniform H-matrix. Provided that the interpolation nodes are mutually different, the first k Lagrange polynomials form a basis for the space of polynomials of order at most k − 1. This allows us to express the Lagrange polynomial L_{ξ^{(s)},µ} for the cluster s as a linear combination of the Lagrange polynomials for another cluster s':

  L_{ξ^{(s)},µ} = ∑_{ν=1}^k X_{s,s'}(ν, µ) L_{ξ^{(s')},ν}

with a matrix X_{s,s'} ∈ R^{k×k} defined by X_{s,s'}(ν, µ) = L_{ξ^{(s)},µ}(ξ_ν^{(s')}). In particular, if s' is a son of s ∈ T_I and i ∈ s', we get

  U_s(i, µ) = ∫_{τ_i} L_{ξ^{(s)},µ}(x) dx = ∑_{ν=1}^k X_{s,s'}(ν, µ) ∫_{τ_i} L_{ξ^{(s')},ν}(x) dx = (U_{s'} X_{s,s'})_{i,µ}.

We immediately conclude that the family of matrices (U_s)_{s∈T_I} is nested. Analogously, the same holds for (V_t)_{t∈T_I}. In summary, (tensorized) Lagrange interpolation allows us to obtain an approximation of A in the H^2-matrix format.

Analogous to HSS and H-matrices, the matrix-vector product for an H^2-matrix is carried out within O(nk) operations. However, the implementation of approximate algebraic operations such as matrix-matrix additions and multiplications in the H^2-matrix format is significantly more difficult than for HODLR or H-matrices, as one needs to maintain the nestedness structure (39)–(40). Note that this also makes the computation of approximate factorizations and an approximate inverse in the H^2-matrix format more involved, see [17] for details.
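The nestedness relation (39) underlying Example 8 can be verified numerically. The following sketch uses Chebyshev interpolation nodes (an arbitrary choice; any mutually distinct nodes work) on a parent interval and one of its sons and checks U_s(s', :) = U_{s'} X_{s,s'} up to quadrature accuracy.

import numpy as np

k = 5                                     # interpolation order / rank

def cheb_nodes(a, b, k):
    t = np.cos((2 * np.arange(k) + 1) * np.pi / (2 * k))
    return 0.5 * (a + b) + 0.5 * (b - a) * t

def lagrange_eval(nodes, mu, x):
    # value of the mu-th Lagrange polynomial for the given nodes at the points x
    others = np.delete(nodes, mu)
    return np.prod((x[:, None] - others) / (nodes[mu] - others), axis=1)

gauss_x, gauss_w = np.polynomial.legendre.leggauss(20)    # quadrature on [-1, 1]

def U_matrix(cluster_nodes, taus):
    # U(i, mu) = integral of L_mu over tau_i, computed by Gauss quadrature per interval
    U = np.zeros((len(taus), len(cluster_nodes)))
    for i, (a, b) in enumerate(taus):
        x = 0.5 * (a + b) + 0.5 * (b - a) * gauss_x
        w = 0.5 * (b - a) * gauss_w
        for mu in range(len(cluster_nodes)):
            U[i, mu] = np.sum(w * lagrange_eval(cluster_nodes, mu, x))
    return U

h = 1.0 / 16
taus = [(i * h, (i + 1) * h) for i in range(8)]           # tau_1, ..., tau_8 covering [0, 1/2]
xi_s = cheb_nodes(0.0, 0.5, k)                            # nodes of the father cluster s
xi_s1 = cheb_nodes(0.0, 0.25, k)                          # nodes of the first son s'
U_s = U_matrix(xi_s, taus)
U_s1 = U_matrix(xi_s1, taus[:4])                          # rows belonging to s'

# transfer matrix X_{s,s'}(nu, mu) = L_{xi^(s),mu}(xi_nu^(s'))
X = np.array([[lagrange_eval(xi_s, mu, xi_s1[nu:nu+1])[0] for mu in range(k)]
              for nu in range(k)])

print(np.linalg.norm(U_s[:4, :] - U_s1 @ X))              # ~ machine precision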
References 1. Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D.W., O’Neil, M.: Fast direct methods for Gaussian processes. Preprint arXiv:1403.6015 (2014) 2. Aminfar, A., Darve, E.: A fast sparse solver for finite-element matrices. Preprint arXiv:1410.2697 (2014) 3. Ballani, J., Kressner, D.: Sparse inverse covariance estimation with hierarchical matrices. Tech. rep., EPFL-MATHICSE-ANCHP (2014) 4. Baur, U., Benner, P.: Factorized solution of Lyapunov equations based on hierarchical matrix arithmetic. Computing 78(3), 211–234 (2006) 5. Bebendorf, M.: Approximation of boundary element matrices. Numer. Math. 86(4), 565– 589 (2000) 6. Bebendorf, M.: Hierarchical matrices, Lecture Notes in Computational Science and Engineering, vol. 63. Springer-Verlag, Berlin (2008) 7. Bebendorf, M., Fischer, T.: On the purely algebraic data-sparse approximation of the inverse and the triangular factors of sparse matrices. Numer. Linear Algebra Appl. 18(1), 105–122 (2011) 8. Bebendorf, M., Hackbusch, W.: Existence of H -matrix approximants to the inverse FEmatrix of elliptic operators with L∞ -coefficients. Numer. Math. 95(1), 1–28 (2003) 9. Bebendorf, M., Maday, Y., Stamm, B.: Comparison of some reduced representation approximations. In: Reduced order methods for modeling and computational reduction, MS&A. Model. Simul. Appl., vol. 9, pp. 67–100. Springer, Cham (2014) 10. Bebendorf, M., Rjasanow, S.: Adaptive low-rank approximation of collocation matrices. Computing 70(1), 1–24 (2003) 11. Beckermann, B.: The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85(4), 553–577 (2000) 12. Benner, P., B¨orm, S., Mach, T., Reimer, K.: Computing the eigenvalues of symmetric H 2 -matrices by slicing the spectrum. Comput. Visual. Sci. 16, 271–282 (2013) 13. Benner, P., Mach, T.: Computing all or some eigenvalues of symmetric H` -matrices. SIAM J. Sci. Comput. 34(1), A4850–A496 (2012) 14. Benner, P., Mach, T.: The preconditioned inverse iteration for hierarchical matrices. Numer. Linear Algebra Appl. 20, 150–166 (2013) 15. Benzi, M., Razouk, N.: Decay bounds and O(n) algorithms for approximating functions of sparse matrices. Electron. Trans. Numer. Anal. 28, 16–39 (2007/08) 16. Berry, M.W.: Large scale singular value decomposition. Internat. J. Supercomp. Appl. 6, 13–49 (1992) 17. B¨orm, S.: Efficient numerical methods for non-local operators, EMS Tracts in Mathematics, vol. 14. European Mathematical Society (EMS), Z¨urich (2010) 18. B¨orm, S., Grasedyck, L.: Hybrid cross approximation of integral operators. Numer. Math. 101(2), 221–249 (2005) 19. B¨orm, S., Grasedyck, L., Hackbusch, W.: Hierarchical matrices. Lecture Notes 21, MPI for Mathematics in the Sciences, Leipzig (2003) 20. C ¸ ivril, A., Magdon-Ismail, M.: Exponential inapproximability of selecting a maximum volume sub-matrix. Algorithmica 65(1), 159–176 (2013) 21. Chandrasekaran, S., Dewilde, P., Gu, M., Pals, T., Sun, X., van der Veen, A.J., White, D.: Some fast algorithms for sequentially semiseparable representations. SIAM J. Matrix Anal. Appl. 27(2), 341–364 (2005) 22. Chiu, J., Demanet, L.: Sublinear randomized algorithms for skeleton decompositions. SIAM J. Matrix Anal. Appl. 34(3), 1361–1383 (2013)
23. C ¸ ivril, A., Magdon-Ismail, M.: On selecting a maximum volume sub-matrix of a matrix and related problems. Theoret. Comput. Sci. 410(47-49), 4801–4811 (2009) 24. Corona, E., Martinsson, P.G., Zorin, D.: An O(N) direct solver for integral equations on the plane. Appl. Comput. Harmon. Anal. 38(2), 284–317 (2015) 25. Davis, T.A.: Direct methods for sparse linear systems, Fundamentals of Algorithms, vol. 2. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (2006) 26. Demko, S., Moss, W.F., Smith, P.W.: Decay rates for inverses of band matrices. Math. Comp. 43(168), 491–499 (1984) 27. Demmel, J.W., Gu, M., Eisenstat, S., Slapniˇcar, I., Veseli´c, K., Drmaˇc, Z.: Computing the singular value decomposition with high relative accuracy. Linear Algebra Appl. 299(1-3), 21–80 (1999) 28. Drmaˇc, Z., Veseli´c, K.: New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29(4), 1322–1342 (2007) 29. Gatto, P., Hesthaven, J.S.: A preconditioner based on low-rank approximation of Schur complements. Tech. rep., EPFL-MATHICSE-MCSS (2015) 30. Gavrilyuk, I.P., Hackbusch, W., Khoromskij, B.N.: H -matrix approximation for the operator exponential with applications. Numer. Math. 92(1), 83–111 (2002) 31. Ghysels, P., Li, X.S., Rouet, F., Williams, S., Napov, A.: An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling. Preprint arXiv:1502.07405 (2015) 32. Gillman, A., Young, P.M., Martinsson, P.G.: A direct solver with O(N) complexity for integral equations on one-dimensional domains. Front. Math. China 7(2), 217–247 (2012) 33. Golub, G.H., Luk, F.T., Overton, M.L.: A block L´anczos method for computing the singular values of corresponding singular vectors of a matrix. ACM Trans. Math. Software 7(2), 149–169 (1981) 34. Golub, G.H., Van Loan, C.F.: Matrix computations, fourth edn. Johns Hopkins University Press, Baltimore, MD (2013) 35. Goreinov, S.A., Tyrtyshnikov, E.E.: The maximal-volume concept in approximation by low-rank matrices. In: Structured matrices in mathematics, computer science, and engineering, I (Boulder, CO, 1999), Contemp. Math., vol. 280, pp. 47–51. Amer. Math. Soc., Providence, RI (2001) 36. Goreinov, S.A., Tyrtyshnikov, E.E., Zamarashkin, N.L.: A theory of pseudoskeleton approximations. Linear Algebra Appl. 261, 1–21 (1997) 37. Grasedyck, L., Hackbusch, W., Khoromskij, B.N.: Solution of large scale algebraic matrix Riccati equations by use of hierarchical matrices. Computing 70(2), 121–165 (2003) 38. Greengard, L., Gueyffier, D., Martinsson, P.G., Rokhlin, V.: Fast direct solvers for integral equations in complex three-dimensional domains. Acta Numerica 18, 243–275 (2009) 39. Greengard, L., Rokhlin, V.: A fast algorithm for particle simulations. J. Comput. Phys. 73(2), 325–348 (1987) 40. Hackbusch, W.: A sparse matrix arithmetic based on H -matrices. Part I: Introduction to H -matrices. Computing 62(2), 89–108 (1999) 41. Hackbusch, W.: Entwicklungen nach Exponentialsummen. Technical Report 4/2005, MPI MIS Leipzig (2010). URL http://www.mis.mpg.de/preprints/tr/ report-0405.pdf. Revised version September 2010 42. Hackbusch, W.: Hierarchical Matrices: Algorithms and Analysis. Springer, Berlin (2015) 43. Hackbusch, W.: New estimates for the recursive low-rank truncation of block-structured matrices. Numer. Math. (2015). DOI 10.1007/s00211-015-0716-7 44. Hackbusch, W., Khoromskij, B.N., Kriemann, R.: Hierarchical matrices based on a weak admissibility criterion. Computing 73(3), 207–243 (2004)
45. Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217– 288 (2011) 46. Harbrecht, H., Peters, M., Schneider, R.: On the low-rank approximation by the pivoted Cholesky decomposition. Appl. Numer. Math. 62(4), 428–440 (2012) 47. Higham, N.J.: Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia, PA (1996) 48. Horn, R.A., Johnson, C.R.: Matrix analysis, second edn. Cambridge University Press, Cambridge (2013) 49. Lehoucq, R.B., Sorensen, D.C., Yang, C.: ARPACK users’ guide. SIAM, Philadelphia, PA (1998) 50. Li, J.Y., Ambikasaran, S., Darve, E.F., Kitanidis, P.K.: A Kalman filter powered by H 2 matrices for quasi-continuous data assimilation problems. Preprint arXiv:1404.3816 (2014) 51. Liu, D., Matthies, H.G.: Pivoted Cholesky decomposition by cross approximation for efficient solution of kernel systems. Preprint arXiv:1505.06195 (2015) 52. Maday, Y., Mula, O., Patera, A.T., Yano, M.: The generalized empirical interpolation method: stability theory on Hilbert spaces with an application to the Stokes equation. Comput. Methods Appl. Mech. Engrg. 287, 310–334 (2015) 53. Martinsson, P.G., Rokhlin, V.: A fast direct solver for boundary integral equations in two dimensions. J. Comput. Phys. 205(1), 1–23 (2005) 54. Napov, A., Li, X.S., Gu, M.: An algebraic multifrontal preconditioner that exploits a lowrank property. Preprint (2013) 55. Ostrowski, J., Bebendorf, M., Hiptmair, R., Kr¨amer, F.: H-matrix based operator preconditioning for full Maxwell at low frequencies. IEEE Transactions Magnetics 46(8), 3193–3196 (2010) 56. Pouransari, H., Coulier, P., Darve, E.: Fast hierarchical solvers for sparse matrices. Preprint arXiv:1510.07363 (2015) 57. Sauter, S.A., Schwab, C.: Boundary Element Methods. Springer Series in Computational Mathematics. Springer Berlin Heidelberg (2010) 58. Savas, B., Dhillon, I.S.: Clustered low rank approximation of graphs in information science applications. In: SDM, pp. 164–175 (2011) 59. Schneider, R., Uschmajew, A.: Approximation rates for the hierarchical tensor format in periodic Sobolev spaces. J. Complexity 30(2), 56–71 (2014) 60. Sheng, Z., Dewilde, P., Chandrasekaran, S.: Algorithms to solve hierarchically semiseparable systems. In: D. Alpay, V. Vinnikov (eds.) System Theory, the Schur Algorithm and Multidimensional Analysis, Operator Theory: Advances and Applications, vol. 176, pp. 255–294. Birkh¨auser Basel (2007) 61. Si, S., Hsieh, C.J., Dhillon, I.S.: Memory efficient kernel approximation. In: Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 701–709 (2014) 62. Simon, H.D., Zha, H.: Low-rank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM J. Sci. Comput. 21(6), 2257–2274 (2000) 63. Stewart, G.W.: Perturbation theory for the singular value decomposition. In: R.J. Vaccaro (ed.) SVD and Signal Processing II: Algorithms, Analysis and Applications (1991) 64. Tyrtyshnikov, E.E.: Incomplete cross approximation in the mosaic-skeleton method. Computing 64(4), 367–380 (2000) 65. Vandebril, R., Van Barel, M., Mastronardi, N.: A note on the representation and definition of semiseparable matrices. Numer. Linear Algebra Appl. 12(8), 839–858 (2005) 66. Vandebril, R., Van Barel, M., Mastronardi, N.: Matrix computations and semiseparable matrices. Vol. 1. Johns Hopkins University Press, Baltimore, MD (2008)
67. Vandebril, R., Van Barel, M., Mastronardi, N.: Matrix computations and semiseparable matrices. Vol. II. Johns Hopkins University Press, Baltimore, MD (2008) 68. Vogel, J., Xia, J., Cauley, S., Venkataramanan, B.: Superfast divide-and-conquer method and perturbation analysis for structured eigenvalue solutions. Preprint (2015) 69. Wang, R., Li, R., Mahoney, M.W., Darve, E.: Structured block basis factorization for scalable kernel matrix evaluation. Preprint arXiv:1505.00398 (2015) 70. Wang, S., Li, X.S., Rouet, F., Xia, J., de Hoop, M.: A parallel geometric multifrontal solver using hierarchically semiseparable structure. Preprint (2013) ˚ Perturbation bounds in connection with singular value decomposition. 71. Wedin, P.A.: Nordisk Tidskr. Informationsbehandling (BIT) 12, 99–111 (1972) 72. Xia, J., Chandrasekaran, S., Gu, M., Li, X.S.: Fast algorithms for hierarchically semiseparable matrices. Numer. Linear Algebra Appl. 17(6), 953–976 (2010) 73. Yokota, R., Turkiyyah, G., Keyes, D.: Communication complexity of the fast multipole method and its algebraic variants. Preprint arXiv:1406.1974 (2014)