The Random Projection Method∗
Edo Liberty†
September 25, 2007
1 Introduction
We start by giving a short proof of the Johnson-Lindenstrauss lemma due to P. Indyk and R. Motwani. We continue by showing an application of random projections to fast rank-k matrix approximation, due to Papadimitriou, Raghavan, Tamaki, and Vempala.
2 Random Projection
In the next 20 minutes we will prove the Johnson-Lindenstrauss (JL) lemma. We use a construction and proof derivation by Indyk and Motwani which are simpler than the original proof. Let us state the JL lemma:

Lemma 2.1. A set of $n$ points $u_1, \ldots, u_n$ in $R^d$ can be projected down to $v_1, \ldots, v_n$ in $R^k$ such that all pairwise distances are preserved:
\[
(1-\epsilon)\|u_i - u_j\|^2 \le \|v_i - v_j\|^2 \le (1+\epsilon)\|u_i - u_j\|^2
\]
if $k > \frac{9 \ln n}{\epsilon^2 - \epsilon^3}$ and $0 \le \epsilon \le 1/2$.
In order to see this we first need to consider the probability that one vector $u$ is distorted by more than $\epsilon$. Our projecting matrix $R$ will be a random $d \times k$ matrix s.t. $R(i,j)$ are drawn i.i.d. from $N(0,1)$ (see footnote 1). We will see that if we set
\[
v = \frac{1}{\sqrt{k}} R^T u
\]
then
\[
E(\|v\|^2) = \|u\|^2 ,
\]
and with probability $\Pr_{success} \ge 1 - 2e^{-(\epsilon^2-\epsilon^3)k/4}$,
\[
(1-\epsilon)\|u\|^2 \le \|v\|^2 \le (1+\epsilon)\|u\|^2 .
\]
∗ Chosen chapters from DIMACS vol. 65 by Santosh S. Vempala.
† Yale University, New Haven CT ([email protected]). Supported by AFOSR and NGA.
1 Normal distribution with mean 0 and variance 1.
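To make the construction concrete, here is a minimal numpy sketch (not part of the original notes; dimensions are arbitrary) of projecting one vector and measuring its squared-length distortion:

import numpy as np

# Sketch: project a single vector u in R^d down to R^k with
# v = (1/sqrt(k)) R^T u, where R has i.i.d. N(0,1) entries, and
# measure the squared-length distortion ||v||^2 / ||u||^2.

rng = np.random.default_rng(0)
d, k = 10_000, 1_000

u = rng.standard_normal(d)          # an arbitrary vector in R^d
R = rng.standard_normal((d, k))     # random d x k projection matrix
v = (R.T @ u) / np.sqrt(k)          # v = (1/sqrt(k)) R^T u

distortion = np.dot(v, v) / np.dot(u, u)
print(f"||v||^2 / ||u||^2 = {distortion:.4f}")   # close to 1 with high probability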
2.1 Calculating $E(\|v\|^2)$
Since $v = \frac{1}{\sqrt{k}} R^T u$ we can calculate the expectation $E(\|v\|^2)$:
\begin{align*}
E(\|v\|^2) &= E\left[ \sum_{i=1}^{k} \left( \frac{1}{\sqrt{k}} \sum_{j=1}^{d} R(i,j)\, u(j) \right)^2 \right] \\
&= \sum_{i=1}^{k} \frac{1}{k} \, E\left[ \left( \sum_{j=1}^{d} R(i,j)\, u(j) \right)^2 \right] \\
&= \frac{1}{k} \sum_{i=1}^{k} \sum_{j=1}^{d} E\left(R(i,j)^2\right) u(j)^2 \\
&= \frac{1}{k} \sum_{i=1}^{k} \sum_{j=1}^{d} u(j)^2 \\
&= \|u\|^2 .
\end{align*}
The cross terms vanish since the entries $R(i,j)$ are independent with mean zero, and $E(R(i,j)^2) = 1$.
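As a sanity check of the computation above, one can estimate $E(\|v\|^2)$ empirically by averaging over many draws of $R$; a small sketch (illustrative sizes, assumed setup):

import numpy as np

# Sketch: estimate E(||v||^2) by averaging over many independent draws of R
# and compare to ||u||^2.

rng = np.random.default_rng(1)
d, k, trials = 200, 20, 5_000

u = rng.standard_normal(d)
sq_norms = []
for _ in range(trials):
    R = rng.standard_normal((d, k))
    v = (R.T @ u) / np.sqrt(k)
    sq_norms.append(np.dot(v, v))

print("E(||v||^2) estimate:", np.mean(sq_norms))
print("||u||^2            :", np.dot(u, u))   # the two should agree closely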
2.2 Bounding the "stretching" probability
In order to bound the distortion probability we define
\[
x_j = \frac{1}{\|u\|} \langle R_j, u \rangle
\]
and
\[
x = \frac{k\|v\|^2}{\|u\|^2} = \sum_{j=1}^{k} \frac{(R_j^T u)^2}{\|u\|^2} = \sum_{j=1}^{k} x_j^2 .
\]
The convenience of these definitions will become clear in the next few steps. We now turn to calculate the probability
\begin{align*}
\Pr\left[\|v\|^2 \ge (1+\epsilon)\|u\|^2\right] &= \Pr\left[x \ge (1+\epsilon)k\right] \\
&= \Pr\left[e^{\lambda x} \ge e^{\lambda(1+\epsilon)k}\right] \\
&\le \frac{E(e^{\lambda x})}{e^{\lambda(1+\epsilon)k}} \\
&= \frac{\prod_{j=1}^{k} E(e^{\lambda x_j^2})}{e^{\lambda(1+\epsilon)k}} \\
&= \left( \frac{E(e^{\lambda x_1^2})}{e^{\lambda(1+\epsilon)}} \right)^{k} ,
\end{align*}
where the inequality is Markov's inequality and the product form uses the independence of the $x_j$.
Here we are faced with calculating $E(e^{\lambda x_1^2})$. Using the fact that $x_1$ itself is drawn from $N(0,1)$ (it is a sum of independent Gaussians, with variance $\|u\|^2/\|u\|^2 = 1$) we get that:
\begin{align*}
E(e^{\lambda x_1^2}) &= \int_{-\infty}^{\infty} e^{\lambda t^2} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}} \, dt \\
&= \frac{1}{\sqrt{1-2\lambda}} \int_{-\infty}^{\infty} \frac{\sqrt{1-2\lambda}}{\sqrt{2\pi}} e^{-\frac{t^2(1-2\lambda)}{2}} \, dt \\
&= \frac{1}{\sqrt{1-2\lambda}} \qquad \text{for } \lambda < 1/2 .
\end{align*}
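A quick Monte Carlo check of this moment-generating-function identity (a sketch with arbitrary illustrative values of $\lambda$, not from the notes):

import numpy as np

# Sketch: check numerically that E[exp(lambda * X^2)] = 1 / sqrt(1 - 2*lambda)
# for X ~ N(0,1) and lambda < 1/2 (small lambda values keep the variance finite).

rng = np.random.default_rng(2)
x = rng.standard_normal(2_000_000)

for lam in (0.05, 0.1, 0.2):
    mc = np.mean(np.exp(lam * x**2))               # Monte Carlo estimate
    closed_form = 1.0 / np.sqrt(1.0 - 2.0 * lam)
    print(f"lambda={lam:.2f}: MC = {mc:.4f}, closed form = {closed_form:.4f}")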
Inserting $E(e^{\lambda x_1^2}) = \frac{1}{\sqrt{1-2\lambda}}$ into the probability equation we get:
\[
\Pr\left[\|v\|^2 \ge (1+\epsilon)\|u\|^2\right] \le \left( \frac{e^{-2(1+\epsilon)\lambda}}{1-2\lambda} \right)^{k/2} .
\]
Substituting $\lambda = \frac{\epsilon}{2(1+\epsilon)}$ we get:
\[
\Pr\left[\|v\|^2 \ge (1+\epsilon)\|u\|^2\right] \le \left( (1+\epsilon)\, e^{-\epsilon} \right)^{k/2} .
\]
Finally, using that $1+\epsilon < e^{\epsilon - (\epsilon^2-\epsilon^3)/2}$ we obtain:
\[
\Pr\left[\|v\|^2 \ge (1+\epsilon)\|u\|^2\right] \le e^{-(\epsilon^2-\epsilon^3)k/4} .
\]
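To get a feeling for this bound, the following sketch (illustrative values only) tabulates $e^{-(\epsilon^2-\epsilon^3)k/4}$ for a few choices of $\epsilon$ and $k$:

import numpy as np

# Sketch: evaluate the single-vector failure bound exp(-(eps^2 - eps^3) k / 4)
# to see how large k must be before the probability becomes small.

for eps in (0.1, 0.2, 0.5):
    for k in (100, 500, 2000):
        bound = np.exp(-(eps**2 - eps**3) * k / 4)
        print(f"eps={eps:.1f}, k={k:4d}: Pr[||v||^2 > (1+eps)||u||^2] <= {bound:.2e}")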
2.3 Bounding the other direction

\begin{align*}
\Pr\left[\|v\|^2 \le (1-\epsilon)\|u\|^2\right] &= \Pr\left[x \le (1-\epsilon)k\right] \\
&= \Pr\left[e^{-\lambda x} \ge e^{-\lambda(1-\epsilon)k}\right] \\
&\le \left( \frac{E(e^{-\lambda x_1^2})}{e^{-\lambda(1-\epsilon)}} \right)^{k} \\
&= \left( \frac{e^{2(1-\epsilon)\lambda}}{1+2\lambda} \right)^{k/2} , \qquad \lambda = \frac{\epsilon}{2(1-\epsilon)} \\
&\le \left( (1-\epsilon)\, e^{\epsilon} \right)^{k/2} \\
&\le e^{-(\epsilon^2-\epsilon^3)k/4} .
\end{align*}

Achieving for one point:
\[
(1-\epsilon)\|u\|^2 \le \|v\|^2 \le (1+\epsilon)\|u\|^2
\]
with probability $\Pr \ge 1 - 2e^{-(\epsilon^2-\epsilon^3)k/4}$.

2.4 Putting it all together
In the case we have $n$ points we have to preserve $O(n^2)$ distances, which are all the pairwise distances between points. We can fail with probability less than a constant, say $\Pr_{fail} < 1/2$. We use a simple union bound:
\begin{align*}
n^2 \cdot 2e^{-(\epsilon^2-\epsilon^3)k/4} &< 1/2 \\
(\epsilon^2-\epsilon^3)k/4 &> 2\ln(2n) \\
k &> \frac{9\ln(n)}{\epsilon^2-\epsilon^3} \qquad \text{for } n > 16 .
\end{align*}
Finally, since the failure probability is smaller than 1/2, we can repeat until success; the expected number of repetitions is constant.
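The whole lemma can be exercised end-to-end; the sketch below (assumed setup, random test points) picks $k$ just above $9\ln(n)/(\epsilon^2-\epsilon^3)$ and verifies every pairwise distance:

import numpy as np
from itertools import combinations

# Sketch: choose k as in the lemma, project n random points from R^d to R^k,
# and verify that every pairwise squared distance is preserved up to (1 +- eps).

rng = np.random.default_rng(3)
n, d, eps = 50, 5_000, 0.5
k = int(np.ceil(9 * np.log(n) / (eps**2 - eps**3))) + 1

U = rng.standard_normal((n, d))            # n points, one per row
R = rng.standard_normal((d, k))
V = (U @ R) / np.sqrt(k)                   # projected points, rows are v_i

ok = True
for i, j in combinations(range(n), 2):
    du = np.sum((U[i] - U[j])**2)
    dv = np.sum((V[i] - V[j])**2)
    ok &= (1 - eps) * du <= dv <= (1 + eps) * du
print(f"k = {k}, all pairwise distances preserved: {ok}")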
3 Fast low rank approximation
The results from the last section can be applied to accelerating low rank approximation of matrices. An optimal low rank approximation can be computed from the SVD of $A$ in $O(mn^2)$. Using random projections we show how to achieve an "almost optimal" low rank approximation in $O(mn\log(n))$. We will go over a two step algorithm, suggested by Papadimitriou, Raghavan, Tamaki, and Vempala. First, we use random projections to find a matrix $B$ which is "much smaller" than $A$, but still shares most of its right singular space. Then, we compute the SVD of $B$ and project $A$ onto $B$'s $k$ top right singular vectors.
3.1 Introduction
A low rank approximation of an $m \times n$ ($m \ge n$) matrix $A$ is another matrix $A_k$ such that:

1. The rank of $A_k$ is at most $k$.
2. $\|A - A_k\|_{norm}$ is minimized.

It is well known that for both the $l_2$ and Frobenius norms
\[
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T ,
\]
where the singular value decomposition (SVD) of $A$ is
\[
A = USV^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T .
\]
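For concreteness, a small numpy sketch (illustrative shapes, not from the notes) of the optimal $A_k$ obtained by truncating the SVD:

import numpy as np

# Sketch: the optimal rank-k approximation A_k obtained from the SVD of A
# by keeping the k largest singular values and vectors.

rng = np.random.default_rng(4)
m, n, k = 300, 100, 10
A = rng.standard_normal((m, n))

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(S) Vt
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]        # A_k = sum_{i<=k} sigma_i u_i v_i^T

print("||A - A_k||_F^2     =", np.linalg.norm(A - A_k, "fro")**2)
print("sum_{i>k} sigma_i^2 =", np.sum(S[k:]**2))   # the two quantities coincide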
The algorithm: Let $R$ be an $m \times \ell$ matrix such that $R(i,j)$ are drawn i.i.d. from $N(0,1)$, and let $\ell \ge \frac{c\log(n)}{\epsilon^2}$.

1. Compute $B = \frac{1}{\sqrt{\ell}} R^T A$.
2. Compute the SVD of $B$, $B = \sum_{i=1}^{\ell} \lambda_i a_i b_i^T$.
3. Return: $\tilde{A}_k \leftarrow A \left( \sum_{i=1}^{k} b_i b_i^T \right)$.
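A possible numpy rendering of this two-step algorithm is sketched below; the constant $c$, the choice of $\epsilon$, and the test matrix are illustrative assumptions, not the authors' code:

import numpy as np

# Sketch of the two-step algorithm: project A with a random Gaussian matrix,
# take the SVD of the small matrix B, and project A onto the span of B's
# top k right singular vectors.

def randomized_low_rank(A, k, eps=0.5, c=4.0, rng=None):
    rng = np.random.default_rng(rng)
    m, n = A.shape
    ell = max(k, int(np.ceil(c * np.log(n) / eps**2)))   # ell >= c log(n) / eps^2
    R = rng.standard_normal((m, ell))                    # m x ell, i.i.d. N(0,1)
    B = (R.T @ A) / np.sqrt(ell)                         # B = (1/sqrt(ell)) R^T A
    _, _, Bt = np.linalg.svd(B, full_matrices=False)     # rows of Bt are b_i^T
    Bk = Bt[:k, :]                                       # top k right singular vectors of B
    return A @ Bk.T @ Bk                                 # A_tilde_k = A (sum_i b_i b_i^T)

rng = np.random.default_rng(5)
A = rng.standard_normal((500, 200)) @ rng.standard_normal((200, 200))
A_tilde = randomized_low_rank(A, k=10)
print("||A - A_tilde_k||_F =", np.linalg.norm(A - A_tilde, "fro"))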
4 Proof of the algorithm
Since $A_k$ is the optimal solution we cannot hope to do better than $\|A - A_k\|_F^2$, yet we show that we do not do much worse:
\[
\|A - \tilde{A}_k\|_F^2 \le \|A - A_k\|_F^2 + 2\epsilon\|A_k\|_F^2 .
\]
Reminder:
\[
A = \sum_{i=1}^{n} \sigma_i u_i v_i^T , \quad
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T , \quad
B = \sum_{i=1}^{\ell} \lambda_i a_i b_i^T , \quad
\tilde{A}_k = A \left( \sum_{i=1}^{k} b_i b_i^T \right) .
\]
Since the $b_i$'s are orthonormal (extend $b_1, \ldots, b_\ell$ to an orthonormal basis $b_1, \ldots, b_n$ of $R^n$):
\begin{align*}
\|A - \tilde{A}_k\|_F^2 &= \sum_{i=1}^{n} \|(A - \tilde{A}_k) b_i\|^2 \\
&= \sum_{i=1}^{n} \left\| A b_i - A \left( \sum_{j=1}^{k} b_j b_j^T \right) b_i \right\|^2 \\
&= \sum_{i=k+1}^{n} \|A b_i\|^2 \\
&= \|A\|_F^2 - \sum_{i=1}^{k} \|A b_i\|^2 .
\end{align*}
On the other hand:
\[
\|A - A_k\|_F^2 = \|A\|_F^2 - \|A_k\|_F^2 .
\]
We obtain that:
\[
\|A - \tilde{A}_k\|_F^2 = \|A - A_k\|_F^2 + \left( \|A_k\|_F^2 - \sum_{i=1}^{k} \|A b_i\|^2 \right) .
\]
We now need to show that $\|A_k\|_F^2$ is not much larger than $\sum_{i=1}^{k} \|A b_i\|^2$. We start by giving a lower bound on $\sum_{i=1}^{k} \|A b_i\|^2$:
\begin{align*}
\sum_{i=1}^{k} \lambda_i^2 &= \sum_{i=1}^{k} \|B b_i\|^2 \\
&= \sum_{i=1}^{k} \left\| \frac{1}{\sqrt{\ell}} R^T (A b_i) \right\|^2 \\
&\le (1+\epsilon) \sum_{i=1}^{k} \|A b_i\|^2 , \quad \text{with high probability (the length-preservation bound of Section 2 applied to the vectors } A b_i \text{),}
\end{align*}
and so we obtain
\[
\sum_{i=1}^{k} \|A b_i\|^2 \ge \frac{1}{1+\epsilon} \sum_{i=1}^{k} \lambda_i^2 .
\]
We now need to bound the term $\sum_{i=1}^{k} \lambda_i^2$ from below:
\begin{align*}
\sum_{i=1}^{k} \lambda_i^2 &\ge \sum_{i=1}^{k} v_i^T B^T B v_i \quad \text{(the } \lambda_i^2 \text{ are the top eigenvalues of } B^T B \text{ and the } v_i \text{ are orthonormal)} \\
&= \frac{1}{\ell} \sum_{i=1}^{k} v_i^T A^T R R^T A v_i \\
&= \sum_{i=1}^{k} \sigma_i^2 \left\| \frac{1}{\sqrt{\ell}} R^T u_i \right\|^2 \\
&\ge \sum_{i=1}^{k} (1-\epsilon)\sigma_i^2 , \quad \text{again with high probability (the same bound applied to the } u_i \text{).}
\end{align*}
Combining the inequalities with the fact that $\sum_{i=1}^{k} \sigma_i^2 = \|A_k\|_F^2$ we get:
\[
\sum_{i=1}^{k} \|A b_i\|^2 \ \ge\ \frac{1-\epsilon}{1+\epsilon} \|A_k\|_F^2 \ \ge\ (1-2\epsilon)\|A_k\|_F^2 .
\]
Finally:
\[
\|A - \tilde{A}_k\|_F^2 \le \|A - A_k\|_F^2 + 2\epsilon\|A_k\|_F^2 .
\]
Notice that we need to preserve the length of at most $2n$ vectors. This gives us a success probability of $\Pr_{success} \ge 1 - 4n e^{-(\epsilon^2-\epsilon^3)\ell/4}$, which is constant for $\ell = O\left(\frac{\log(n)}{\epsilon^2}\right)$.
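The bound can be checked empirically; the sketch below (illustrative sizes, an assumed test matrix with decaying spectrum, and an arbitrary constant in place of $c$) compares the randomized error against $\|A-A_k\|_F^2 + 2\epsilon\|A_k\|_F^2$:

import numpy as np

# Sketch: compare the randomized approximation error against the bound
# ||A - A_k||_F^2 + 2*eps*||A_k||_F^2 on a test matrix with known SVD.

rng = np.random.default_rng(6)
m, n, k, eps = 400, 150, 10, 0.5
ell = int(np.ceil(4 * np.log(n) / eps**2))            # illustrative choice of ell

# Test matrix with rapidly decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = 2.0 ** -np.arange(n)
A = U @ np.diag(S) @ V.T

# Optimal rank-k approximation A_k (from the SVD of A).
A_k = U[:, :k] @ np.diag(S[:k]) @ V[:, :k].T

# Randomized approximation A_tilde_k.
R = rng.standard_normal((m, ell))
B = (R.T @ A) / np.sqrt(ell)
_, _, Bt = np.linalg.svd(B, full_matrices=False)
A_tilde = A @ Bt[:k, :].T @ Bt[:k, :]

err_rand = np.linalg.norm(A - A_tilde, "fro") ** 2
err_opt = np.linalg.norm(A - A_k, "fro") ** 2
bound = err_opt + 2 * eps * np.linalg.norm(A_k, "fro") ** 2
print(f"randomized error {err_rand:.3e} <= bound {bound:.3e}: {err_rand <= bound}")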
4.1 Computational savings for full matrices

1. Computing the matrix $B$ is $O(mn\ell)$.
2. Computing the SVD of $B$ is $O(m\ell^2)$.
3. Here $\ell \ge \frac{c\log(n)}{\epsilon^2}$.
Hence the total running time is
\[
O\left( \frac{1}{\epsilon^2}\, mn\log(n) \right) ,
\]
as compared to the straightforward SVD which is $O(mn^2)$.
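A rough timing sketch (sizes and $\ell$ chosen arbitrarily; actual speedups depend on the underlying BLAS/LAPACK) comparing a full SVD with the dominant steps of the randomized pipeline:

import time
import numpy as np

# Sketch: time a full SVD of A against forming B = R^T A / sqrt(ell) and
# taking the SVD of the much smaller B.

rng = np.random.default_rng(7)
m, n, ell = 4000, 2000, 300

A = rng.standard_normal((m, n))

t0 = time.perf_counter()
np.linalg.svd(A, full_matrices=False)            # O(m n^2)
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
R = rng.standard_normal((m, ell))
B = (R.T @ A) / np.sqrt(ell)                     # O(m n ell)
np.linalg.svd(B, full_matrices=False)            # SVD of the small ell x n matrix
t_rand = time.perf_counter() - t0

print(f"full SVD: {t_full:.2f}s, randomized: {t_rand:.2f}s")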
5 Summary

• We showed that a random matrix can be used to project points in $R^d$ to $R^k$ in a length preserving way (with high probability), with $k = O\left(\frac{\log(n)}{\epsilon^2}\right)$.
• We used this fact to accelerate low rank approximations of $m \times n$ matrices from $O(mn^2)$ to $O(mn\log(n))$.
6 Important message

Petros Drineas, Rafi Ostrovsky and Yuval Rabani are organizing a research/reading group on algorithmic aspects of k-means clustering. The first meeting will take place this Friday, Sep 28, at 10:30am, in the IPAM lobby. The topics are related and somewhat in the same spirit as this talk. A list of the papers to be discussed during the semester was sent to you by email. The first paper read will be Kanungo et al.