Approximation of Gram-Schmidt Orthogonalization by Data Matrix

arXiv:1701.00711v1 [math.RA] 31 Dec 2016

Gen Li and Yuantao Gu∗



Submitted December 31, 2016

Abstract

For a matrix $A$ with linearly independent columns, this work studies the use of its normalization $\bar{A}$, and of $A$ itself, to approximate its orthonormalization $V$. We theoretically analyze the order of the approximation errors as $A$ and $\bar{A}$ approach $V$, respectively. Our conclusion is able to explain the fact that a high dimensional Gaussian matrix can well approximate the corresponding truncated Haar matrix. For applications, this work can serve as a foundation of a wide variety of problems in signal processing such as compressed subspace clustering.

Keywords: subspaces, basis, Gram-Schmidt process, projection, Gaussian matrix

1 Introduction

Suppose that $S$ is a $d$-dimensional subspace in $\mathbb{R}^n$. The columns of $A = [a_1, a_2, \ldots, a_d] \in \mathbb{R}^{n\times d}$ constitute a basis of $S$. We can normalize the columns of $A$ and obtain a normal basis of $S$ as the following

$$\bar{A} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_d] = \left[\frac{a_1}{\|a_1\|}, \frac{a_2}{\|a_2\|}, \ldots, \frac{a_d}{\|a_d\|}\right]. \qquad (1)$$

Furthermore, we can apply the Gram-Schmidt process [1, 2] on $\bar{A}$, or directly on $A$, to obtain an orthonormal basis of $S$ as the following

$$v_i = \frac{\tilde{v}_i}{\|\tilde{v}_i\|}, \quad i = 1, 2, \ldots, d, \qquad (2)$$

where

$$\tilde{v}_i = \bar{a}_i - \sum_{m=1}^{i-1}\left(\bar{a}_i^T v_m\right)v_m, \quad i = 1, 2, \ldots, d. \qquad (3)$$

Notice that the index $i$ in (3) should start from 1 and increase to $d$, and if $a_i$ is used instead of $\bar{a}_i$ in (3), then the result remains the same. We denote the matrix $[v_1, v_2, \ldots, v_d]$ as basis $V$. A natural question is how to measure the similarity between $A$ (or $\bar{A}$) and $V$ as matrices of the same subspace.

Consider the case where subspace $S$ is described by certain data points on it. In other words, what we have is a set of linearly independent points $\{a_i\}_i$ on a latent subspace, rather than an orthonormal basis of it. In order to calculate the energy of the projection of a new data point $x$ on $S$, we need to first apply the Gram-Schmidt process on $A$ to obtain $V$; then the energy is $\|V^T x\|^2$. In cases where the amount of data is huge, or the data are acquired and stored in a distributed way, the cost of the Gram-Schmidt process is high.

An intuitive way of approximating $V$ is to normalize $A$ as shown in (1), and then to use the obtained $\bar{A}$ to calculate an approximated projection $\bar{A}^T x$ and its energy $\|\bar{A}^T x\|^2$. In such a way, how accurate can the approximation be? How can we evaluate such an approximation? Furthermore, if we directly use $\|A^T x\|^2$ as an approximation of the energy of the projection, then how large can the error be? The answers must depend on some properties of $A$ or $\bar{A}$, and this work will try to find such answers.

Such problems are fundamental in cases where random matrices are applied [3, 4]. According to the conclusions of this work, if $A$ is a random matrix, we do not have to apply the Gram-Schmidt process to $A$; instead the normalized matrix $\bar{A}$ can be a rather accurate approximation. For a high dimensional random matrix, even normalization is not needed, and the matrix itself is able to be a good approximation.

∗The authors are with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. The corresponding author of this work is Yuantao Gu (e-mail: [email protected]).
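To make the setting concrete, here is a minimal Python/NumPy sketch (our own illustration, not part of the paper; the dimensions and variable names such as `A_bar` are arbitrary) that compares the exact projection energy with the two approximations discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5

A = rng.standard_normal((n, d))           # basis of a random d-dimensional subspace
A_bar = A / np.linalg.norm(A, axis=0)     # column-normalized basis, as in (1)
V, _ = np.linalg.qr(A)                    # orthonormal basis of span(A); same projection energy as Gram-Schmidt

x = rng.standard_normal(n)                # a new data point

exact = np.linalg.norm(V.T @ x) ** 2            # ||V^T x||^2, exact projection energy
approx_norm = np.linalg.norm(A_bar.T @ x) ** 2  # ||A_bar^T x||^2, approximation after normalization
approx_raw = np.linalg.norm(A.T @ x) ** 2       # ||A^T x||^2, approximation without normalization

# For this Gaussian A the raw energy is larger by a factor of roughly n,
# since each column norm is about sqrt(n).
print(exact, approx_norm, approx_raw / n)
```

For $n$ much larger than $d$, the printed values should be close to each other, which is exactly the behavior quantified in the following sections.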

2 Approximation of orthonormal basis by normal basis

We first study using the normalized matrix $\bar{A}$ to approximate the orthonormalized matrix $V$. The similarity between $\bar{A}$ and an orthonormal matrix is measured by $\bar{R} = \bar{A}^T\bar{A} - I$. Based on the decomposition $V - \bar{A} = \bar{A}\bar{U}$, we use $\bar{U}$ to evaluate the similarity between $\bar{A}$ and $V$. The following lemma describes the performance of $\bar{U}$ as $\bar{R} \to 0$.

Lemma 1. Let $V = [v_1, v_2, \ldots, v_d]$ denote the Gram-Schmidt orthogonalization of a column-normalized matrix $\bar{A} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_d]$, where $\|\bar{a}_i\| = 1, \forall i$. Denote $\bar{R} = (\bar{r}_{ji}) = \bar{A}^T\bar{A} - I$. Then when $\bar{r}_{ji} = \bar{a}_j^T\bar{a}_i$ is small enough for $j \neq i$, we can use $\bar{A}$ to approximate $V$ with error $V - \bar{A} = \bar{A}\bar{U}$, where $\bar{U} = [\bar{u}_{ji}] \in \mathbb{R}^{d\times d}$ is an upper triangular matrix satisfying

$$\bar{u}_{ii} = \bar{g}_{ii}(\bar{R})\|\bar{R}\|_F^2, \quad \forall i, \qquad (4)$$

where $\bar{g}_{ii}(\bar{R}) > 0$ and $\lim_{\bar{R}\to 0}\bar{g}_{ii}(\bar{R}) \leq 1/4$, and

$$\bar{u}_{ji} = -\bar{r}_{ji} + \bar{g}_{ji}(\bar{R})\|\bar{R}\|_F, \quad \forall j < i, \qquad (5)$$

where $\lim_{\bar{R}\to 0}\bar{g}_{ji}(\bar{R}) = 0$.

Proof. We define $V$ following the Gram-Schmidt process of (2) and (3). We have $V = \bar{A}\bar{G}$, where $\bar{G}$ is an upper triangular matrix. Accordingly, $\bar{U} = \bar{G} - I$ is also upper triangular and

$$v_i = \bar{a}_i + \sum_{j=1}^{i}\bar{u}_{ji}\bar{a}_j. \qquad (6)$$

Using (6) and (3) in (2), we have

$$v_i = \frac{1}{\|\tilde{v}_i\|}\left[\bar{a}_i - \sum_{m=1}^{i-1}\bar{a}_i^T v_m\left(\bar{a}_m + \sum_{j=1}^{m}\bar{u}_{jm}\bar{a}_j\right)\right]. \qquad (7)$$

By switching the order of the summations, (7) can be reformulated as

$$v_i = \frac{1}{\|\tilde{v}_i\|}\left[\bar{a}_i - \sum_{j=1}^{i-1}\left(\bar{a}_i^T v_j + \sum_{m=j}^{i-1}\bar{a}_i^T v_m\,\bar{u}_{jm}\right)\bar{a}_j\right] = \frac{\bar{a}_i}{\|\tilde{v}_i\|} - \sum_{j=1}^{i-1}\frac{\bar{a}_i^T v_j + \sum_{m=j}^{i-1}\bar{a}_i^T v_m\,\bar{u}_{jm}}{\|\tilde{v}_i\|}\,\bar{a}_j. \qquad (8)$$

Comparing (6) and (8), we readily get

$$\bar{u}_{ii} = \frac{1}{\|\tilde{v}_i\|} - 1, \quad \forall i, \qquad (9)$$

$$\bar{u}_{ji} = -\frac{1}{\|\tilde{v}_i\|}\left(\bar{a}_i^T v_j + \sum_{m=j}^{i-1}\bar{a}_i^T v_m\,\bar{u}_{jm}\right), \quad \forall j < i. \qquad (10)$$

We will first study (9) and then turn to (10). Using (3) in (9) and noticing that both $\bar{a}_i$ and $v_m$ have been normalized, we have

$$\bar{u}_{ii} = \frac{1}{\left\|\bar{a}_i - \sum_{m=1}^{i-1}\left(\bar{a}_i^T v_m\right)v_m\right\|} - 1 = \frac{1}{\sqrt{1 - \sum_{m=1}^{i-1}\left(\bar{a}_i^T v_m\right)^2}} - 1. \qquad (11)$$

According to Taylor's series with the Peano form of the remainder, i.e.,

$$f(x) = \frac{1}{\sqrt{1-x}} = 1 + \frac{x}{2} + h(x)x,$$

where $\lim_{x\to 0} h(x) = 0$, (11) is approximated by

$$\bar{u}_{ii} = \left(\frac{1}{2} + h(\cdot)\right)\sum_{m=1}^{i-1}\left(\bar{a}_i^T v_m\right)^2, \qquad (12)$$

where $h\left(\sum_{m=1}^{i-1}\left(\bar{a}_i^T v_m\right)^2\right)$ is denoted by $h(\cdot)$ for short. Following (6) and using the definition of $\bar{R}$, for $m < i$ we have

$$\bar{a}_i^T v_m = \bar{a}_i^T\bar{a}_m + \sum_{k=1}^{m}\bar{u}_{km}\,\bar{a}_i^T\bar{a}_k = \bar{r}_{mi} + \sum_{k=1}^{m}\bar{u}_{km}\bar{r}_{ki}. \qquad (13)$$

Using (13) in (12), we have

$$\bar{u}_{ii} = \left(\frac{1}{2} + h(\cdot)\right)\sum_{m=1}^{i-1}\left(\bar{r}_{mi} + \sum_{k=1}^{m}\bar{u}_{km}\bar{r}_{ki}\right)^2 = \left(\frac{1}{2} + h(\cdot)\right)\left[\sum_{m=1}^{i-1}\bar{r}_{mi}^2 + \sum_{m=1}^{i-1}\left(\left(\sum_{k=1}^{m}\bar{u}_{km}\bar{r}_{ki}\right)^2 + 2\sum_{k=1}^{m}\bar{u}_{km}\bar{r}_{mi}\bar{r}_{ki}\right)\right]. \qquad (14)$$

Because of the symmetry of $\bar{R}$, the first summation in the RHS of (14) is bounded by $\frac{1}{2}\|\bar{R}\|_F^2$. Furthermore, the second summation, which is composed of squares and products of $\bar{r}_{pq}$, must be bounded by $\epsilon_1\|\bar{R}\|_F^2$, where $\epsilon_1$ is a small quantity. Consequently, we have

$$\bar{u}_{ii} = \bar{g}_{ii}(\bar{R})\|\bar{R}\|_F^2 \leq \left[\frac{1}{2}\left(\frac{1}{2} + h(\cdot)\right) + \epsilon_1\right]\|\bar{R}\|_F^2, \qquad (15)$$

where

$$\lim_{\bar{R}\to 0}\bar{g}_{ii}(\bar{R}) \leq \frac{1}{4}, \qquad (16)$$

because $h(\cdot)$ tends to 0 as $\bar{R}$ approaches 0. We then complete the first part of the lemma.

Next we will study (10). Using (9) and (13) in (10), we have

$$\bar{u}_{ji} = -(1 + \bar{u}_{ii})\left[\bar{r}_{ji} + \sum_{k=1}^{j}\bar{u}_{kj}\bar{r}_{ki} + \sum_{m=j}^{i-1}\left(\bar{r}_{mi} + \sum_{l=1}^{m}\bar{u}_{lm}\bar{r}_{li}\right)\bar{u}_{jm}\right]$$
$$= -(1 + \bar{u}_{ii})\left[\bar{r}_{ji} + \sum_{k=1}^{j}\bar{u}_{kj}\bar{r}_{ki} + \sum_{m=j}^{i-1}\bar{u}_{jm}\bar{r}_{mi} + \sum_{m=j}^{i-1}\sum_{l=1}^{m}\bar{u}_{lm}\bar{u}_{jm}\bar{r}_{li}\right], \quad \forall j < i. \qquad (17)$$

Notice that the summations in (17), which are composed of $\bar{r}_{pq}$, must be bounded by $\epsilon_2\|\bar{R}\|_F$, where $\epsilon_2$ is a small quantity. Plugging (15) and (16) into (17), we have

$$\bar{u}_{ji} = -\left(1 + \bar{g}_{ii}(\bar{R})\|\bar{R}\|_F^2\right)\left(\bar{r}_{ji} + \epsilon_2\|\bar{R}\|_F\right) = -\bar{r}_{ji} + \bar{g}_{ji}(\bar{R})\|\bar{R}\|_F, \qquad (18)$$

where

$$\lim_{\bar{R}\to 0}\bar{g}_{ji}(\bar{R}) = 0. \qquad (19)$$

The second part of the lemma is proved. □

Lemma 1 unveils that, when $\bar{A}$ approaches an orthonormal basis, i.e., $\bar{R}$ approaches 0, the diagonal elements of $\bar{U}$ go to zero, and they are of the same order as $\|\bar{R}\|_F^2$. At the same time, the off-diagonal elements go to $-\bar{R}$, and the differences are of a higher order than $\|\bar{R}\|_F$.
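A quick numerical check of these orders can be carried out as follows. This sketch is our own illustration under the assumption that $\bar{A}$ comes from normalizing a Gaussian matrix; the names `U_bar` and `R_bar` are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 6

A = rng.standard_normal((n, d))
A_bar = A / np.linalg.norm(A, axis=0)        # column-normalized basis
R_bar = A_bar.T @ A_bar - np.eye(d)          # similarity measure R_bar = A_bar^T A_bar - I

# Orthonormalize via QR and fix column signs so that V coincides with
# the classical Gram-Schmidt output (positive diagonal of the triangular factor).
Q, T = np.linalg.qr(A_bar)
V = Q * np.sign(np.diag(T))

# Solve V - A_bar = A_bar @ U_bar; A_bar has full column rank, so the solution is unique.
U_bar, *_ = np.linalg.lstsq(A_bar, V - A_bar, rcond=None)

fro = np.linalg.norm(R_bar, 'fro')
print("max |diag(U_bar)|          :", np.abs(np.diag(U_bar)).max())  # expected to be O(||R_bar||_F^2)
print("||R_bar||_F^2 / 4          :", fro ** 2 / 4)                  # scale predicted by Lemma 1
off_diag = np.triu(U_bar, 1)
print("max |offdiag(U_bar)+R_bar| :", np.abs(off_diag + np.triu(R_bar, 1)).max())
print("||R_bar||_F                :", fro)                           # the mismatch above should be much smaller
```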

Remark 1. Notice that the error that we define is based on $\bar{A}$ rather than $V$, i.e., we define $V - \bar{A} = \bar{A}\bar{U}$ rather than $V - \bar{A} = V\bar{U}$. The reason is that $\bar{A}$ is at hand and can be easily obtained, while $V$ is expensive to calculate. It would contradict our purpose of reducing the computational complexity if $V$ were used here.

For the energy of the projection of a vector onto a subspace, based on Lemma 1, we can obtain the approximation error of using $\bar{A}$ instead of $V$. The following corollary gives an upper bound on this error.

Corollary 1. Following the definition of Lemma 1, we use a column-normalized matrix $\bar{A} = [\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_d]$ to approximate its orthonormal matrix $V = [v_1, v_2, \ldots, v_d]$ gotten through the Gram-Schmidt process. The approximation error is $V - \bar{A} = \bar{A}\bar{U}$. For an arbitrary vector $x \in \mathbb{R}^n$, we conclude that

$$\left|\,\|V^T x\|^2 - \|\bar{A}^T x\|^2\,\right| \leq d\,\|\bar{A}^T x\|^2\max\bar{R} + \epsilon(\bar{R})\|\bar{R}\|_F, \qquad (20)$$

where $\max\bar{R}$ denotes the largest magnitude among the entries of $\bar{R}$ and $\lim_{\bar{R}\to 0}\epsilon(\bar{R}) = 0$.

Proof. Let $b_0 = V^T x$ and $b = \bar{A}^T x$. We have

$$\text{LHS of (20)} = \left|\,\|b_0\|^2 - \|b\|^2\,\right| = \left|\sum_{i=1}^{d}\left(b_{0,i}^2 - b_i^2\right)\right| \leq \sum_{i=1}^{d}\left|b_{0,i}^2 - b_i^2\right|. \qquad (21)$$

By further defining $c = b_0 - b = \bar{U}^T b$, where $c_i = \sum_{m=1}^{i}\bar{u}_{mi}b_m$, we have

$$\text{RHS of (21)} = \sum_{i=1}^{d}\left|c_i(2b_i + c_i)\right| \leq \sum_{i=1}^{d}\sum_{m=1}^{i}|\bar{u}_{mi}|\left(b_i^2 + b_m^2\right) + \sum_{i=1}^{d}c_i^2$$
$$\leq \sum_{i=1}^{d}\sum_{m=1}^{i-1}|\bar{u}_{mi}|\left(b_i^2 + b_m^2\right) + 2\sum_{i=1}^{d}|\bar{u}_{ii}|b_i^2 + \sum_{i=1}^{d}i\sum_{m=1}^{i}\bar{u}_{mi}^2 b_m^2. \qquad (22)$$

Let us first check the third item in the RHS of (22). By using (4) and (5), we have

$$\sum_{i=1}^{d}i\sum_{m=1}^{i}\bar{u}_{mi}^2 b_m^2 = \sum_{i=1}^{d}i\sum_{m=1}^{i-1}\bar{u}_{mi}^2 b_m^2 + \sum_{i=1}^{d}i\,\bar{u}_{ii}^2 b_i^2 = \sum_{i=1}^{d}i\sum_{m=1}^{i-1}\left(-\bar{r}_{mi} + \bar{g}_{mi}(\bar{R})\|\bar{R}\|_F\right)^2 b_m^2 + \sum_{i=1}^{d}i\,\bar{g}_{ii}^2(\bar{R})\|\bar{R}\|_F^4\, b_i^2 \qquad (23)$$
$$= \epsilon'(\bar{R})\|\bar{R}\|_F, \qquad (24)$$

where $\lim_{\bar{R}\to 0}\epsilon'(\bar{R}) = 0$. Equation (24) is derived because all items in (23) are of higher order of $\|\bar{R}\|_F$. Following the similar way, we adopt (4), (5), and (24) in (22) and obtain

$$\text{RHS of (22)} \leq \sum_{i=1}^{d}\sum_{m=1}^{i-1}\left(|\bar{r}_{mi}| + |\bar{g}_{mi}(\bar{R})|\|\bar{R}\|_F\right)\left(b_i^2 + b_m^2\right) + 2\sum_{i=1}^{d}|\bar{g}_{ii}(\bar{R})|\|\bar{R}\|_F^2\, b_i^2 + \epsilon'(\bar{R})\|\bar{R}\|_F$$
$$\leq \left(\sum_{i=1}^{d}\sum_{m=1}^{i-1}\left(b_i^2 + b_m^2\right)\right)\max\bar{R} + \sum_{i=1}^{d}\sum_{m=1}^{i-1}|\bar{g}_{mi}(\bar{R})|\|\bar{R}\|_F\left(b_i^2 + b_m^2\right) + 2\sum_{i=1}^{d}|\bar{g}_{ii}(\bar{R})|\|\bar{R}\|_F^2\, b_i^2 + \epsilon'(\bar{R})\|\bar{R}\|_F \qquad (25)$$
$$\leq (d-1)\|b\|^2\max\bar{R} + \epsilon(\bar{R})\|\bar{R}\|_F \leq \text{RHS of (20)}, \qquad (26)$$

where $\lim_{\bar{R}\to 0}\epsilon(\bar{R}) = 0$. Equation (26) is derived because the last three items in (25) are of higher order of $\|\bar{R}\|_F$. □

¯ = 0. Equation (26) is derived because the last three items in (25) are where limR→0 ǫ(R) ¯ F. of higher order of kRk Remark 2. VT x denotes the projection of a vector x in the subspace of V. Then Corollary ¯ T x to estimate the projected energy is d max R. ¯ 1 shows that the relative error for using A Example 1. Given a random matrix Φ ∈ Rn×k , whose entries are independent standard ¯ through normal random variables, we can estimate the truncated Haar matrix [3, 4] by Φ normalizing the columns of Φ. According to Lemma 2, we can easily find that, with proba ¯ is bility at least 1 − (k(k − 1)/2) exp −nε2 /2 , the inner product of any two columns of Φ less than ε. Then, according to Lemma 1, the Frobenius norm of the estimating error is less p than k(k − 1)/2ε + o(ε). On the other hand, according to Corollary 1, we can consider ¯Φ ¯ T as the projection matrix of Φ. Φ Lemma 2. Since the normalized Gaussian random vector is uniformly distributed on the sphere, according to the concentration of measure on the sphere [5, 6], we have P {| cos θ| > ε} ≤  exp −nε2 /2 , where θ denotes the angle between two independent Gaussian random vectors, whose elements are independent standard normal random variables.

3 Approximation of orthonormal basis by arbitrary basis

Based on the previous section, we discuss the error of using the original matrix $A$ as an approximation of its orthonormalization $V$. To begin with, we define $(R, W)$ to measure the similarity between $A$ and $V$. Then the matrix $U$ is used to describe the similarity between $A$ and $V$, where $V - A = AU$. The following corollary describes the performance of $U$ as $R \to 0$ and $W \to I$.

Corollary 2. Let $V = [v_1, v_2, \ldots, v_d]$ denote the orthonormal matrix of an arbitrary matrix $A = [a_1, a_2, \ldots, a_d]$ gotten through the Gram-Schmidt process. Let $W$ be a diagonal matrix with $w_{ii} = \|a_i\|^2$, and $R = (r_{ij}) = A^T A - W$. When $W$ approaches $I$ and $r_{ji} = a_j^T a_i$ is small enough for $j \neq i$, we can use $A$ to approximate $V$ with error $V - A = AU$, where $U = (u_{ji}) \in \mathbb{R}^{d\times d}$ is an upper triangular matrix satisfying

$$u_{ii} = \frac{1 - w_{ii}}{2} + h(R, W)(1 - w_{ii}) + g_{ii}(R, W)\|R\|_F^2, \quad \forall i, \qquad (27)$$

where $\lim_{R\to 0, W\to I} h(R, W) = 0$ and $\lim_{R\to 0, W\to I} g_{ii}(R, W) \leq 1/4$, and

$$u_{ji} = -r_{ji} + g_{ji}(R, W)\|R\|_F, \quad \forall j < i, \qquad (28)$$

where $\lim_{R\to 0, W\to I} g_{ji}(R, W) = 0$.

Proof. The proof follows a similar routine as that of Lemma 1, where the variables with bar in the proof of Lemma 1 are exactly the counterparts of the variables here. Therefore we will only highlight the differences. Referring to the deduction of (9) and (10) in the proof of Lemma 1, we have

$$u_{ii} = \frac{1}{\|\tilde{v}_i\|} - 1, \quad \forall i, \qquad (29)$$

$$u_{ji} = -\frac{1}{\|\tilde{v}_i\|}\left(a_i^T v_j + \sum_{m=j}^{i-1} a_i^T v_m\, u_{jm}\right), \quad \forall j < i, \qquad (30)$$

where

$$\tilde{v}_i = a_i - \sum_{m=1}^{i-1}\left(a_i^T v_m\right)v_m, \quad \forall i, \qquad (31)$$

$$a_i^T v_m = a_i^T a_m + \sum_{k=1}^{m} u_{km}\, a_i^T a_k = r_{mi} + \sum_{k=1}^{m} u_{km} r_{ki}, \quad \forall m < i. \qquad (32)$$

We will first check (29) and then (30). Noticing that $a_i$ is not normalized, we have

$$u_{ii} = \frac{1}{\left\|a_i - \sum_{m=1}^{i-1}\left(a_i^T v_m\right)v_m\right\|} - 1 = \frac{1}{\sqrt{w_{ii} - \sum_{m=1}^{i-1}\left(a_i^T v_m\right)^2}} - 1. \qquad (33)$$

Using Taylor's series with the Peano form of the remainder in (33), we have

$$u_{ii} = \left(\frac{1}{2} + h(R, W)\right)\left(1 - w_{ii} + \sum_{m=1}^{i-1}\left(a_i^T v_m\right)^2\right), \qquad (34)$$

where, without confusion, $h\left(1 - w_{ii} + \sum_{m=1}^{i-1}\left(a_i^T v_m\right)^2\right)$ is denoted as a function of $R$ and $W$ for better understanding. Using (32) in (34) and referring to the deduction of (14), we have

$$u_{ii} = \frac{1 - w_{ii}}{2} + h(R, W)(1 - w_{ii}) + g_{ii}(R, W)\|R\|_F^2, \qquad (35)$$

where $\lim_{R\to 0, W\to I} h(R, W) = 0$ and $\lim_{R\to 0, W\to I} g_{ii}(R, W) \leq 1/4$.

Now we will study (30). Plugging (29), (32), and (35) into (30) and referring to the deduction of (17) and (18), we have

$$u_{ji} = -(1 + u_{ii})\left(a_i^T v_j + \sum_{m=j}^{i-1} a_i^T v_m\, u_{jm}\right)$$
$$= -\left(1 + \frac{1 - w_{ii}}{2} + h(R, W)(1 - w_{ii}) + g_{ii}(R, W)\|R\|_F^2\right) \cdot \left(r_{ji} + \sum_{k=1}^{j} u_{kj} r_{ki} + \sum_{m=j}^{i-1} u_{jm} r_{mi} + \sum_{m=j}^{i-1}\sum_{l=1}^{m} u_{lm} u_{jm} r_{li}\right)$$
$$= -r_{ji} + g_{ji}(R, W)\|R\|_F, \quad \forall j < i, \qquad (36)$$

where $\lim_{R\to 0, W\to I} g_{ji}(R, W) = 0$. We then complete the proof. □

Example 2. Given a random matrix $\Phi \in \mathbb{R}^{n\times k}$, whose entries are independent standard normal random variables, we can also use $(1/\sqrt{n})\Phi$ to approximate the truncated Haar matrix [3, 4]. According to Corollary 2 and the Law of Large Numbers, with high probability, the error is small enough when $n$ is large enough. Notice that the random matrix here is different from the measurement matrix in Compressed Sensing (CS) [7, 8, 9], since here we need $n \gg k$.

Remark 3. If a Gaussian random matrix is orthonormalized, then its columns (and even entries) are not independent anymore. Therefore, the orthonormal matrix no longer satisfies useful properties of Gaussian matrices. According to the proposed theoretical analysis, a Gaussian matrix can be an approximation of its orthonormalization. Certain error is inevitable, but it can be small enough, and the independence between columns (and even entries) is preserved.

Example 3. As an application, the conclusions of this work can be used to prove the restricted isometry property of random projection of a finite number of subspaces [10], where the detailed proofs can be found in [11].
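As a rough numerical illustration of Example 2 (our own sketch, with arbitrary sizes), the distance between $(1/\sqrt{n})\Phi$ and its orthonormalization shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 10

for n in (100, 1000, 10000, 100000):
    Phi = rng.standard_normal((n, k))
    A = Phi / np.sqrt(n)                  # scaled Gaussian matrix, E||a_i||^2 = 1
    Q, T = np.linalg.qr(A)
    V = Q * np.sign(np.diag(T))           # sign fix so V matches the Gram-Schmidt result
    print(f"n = {n:6d}   ||V - A||_F = {np.linalg.norm(V - A, 'fro'):.4f}")
    # The error should decay roughly like 1/sqrt(n), in line with Corollary 2.
```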

References

[1] Å. Björck, "Numerics of Gram-Schmidt orthogonalization," Linear Algebra and Its Applications, vol. 197, pp. 297–316, 1994.

[2] W. Hoffmann, "Iterative algorithms for Gram-Schmidt orthogonalization," Computing, vol. 41, no. 4, pp. 335–348, 1989.

[3] A. M. Tulino and S. Verdú, Random Matrix Theory and Wireless Communications. Now Publishers Inc, 2004, vol. 1.

[4] D. Petz and J. Réffy, "On asymptotics of large Haar distributed unitary matrices," Periodica Mathematica Hungarica, vol. 49, no. 1, pp. 103–117, 2004.

[5] P. Lévy and F. Pellegrino, Problèmes concrets d'analyse fonctionnelle: avec un complément sur les fonctionnelles analytiques. Gauthier-Villars, 1951.

[6] E. Schmidt, "Die Brunn-Minkowskische Ungleichung und ihr Spiegelbild sowie die isoperimetrische Eigenschaft der Kugel in der euklidischen und nichteuklidischen Geometrie. I," Mathematische Nachrichten, vol. 1, no. 2–3, pp. 81–157, 1948.

[7] E. J. Candès, "The restricted isometry property and its implications for compressed sensing," Comptes Rendus Mathematique, vol. 346, no. 9, pp. 589–592, 2008.

[8] D. L. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.

[9] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz maps into a Hilbert space," Contemporary Mathematics, vol. 26, 1984.

[10] G. Li and Y. Gu, "Distance-preserving property of random projection for subspaces," accepted for presentation at IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.

[11] ——, "Restricted isometry property (RIP) via Gaussian random projection for finite set of subspaces," preparing for submission, available at http://gu.ee.tsinghua.edu.cn/publications, 2016.
