A Note on Spectral Clustering

4 downloads 0 Views 378KB Size Report
Apr 21, 2016 - Our key technical insight is that the matrices BTB and BBT are close to the identity matrix. ... Theorem 2.1 (Matrix BBT is Close to Identity Matrix).
arXiv:1509.09188v4 [cs.DM] 21 Apr 2016

A Note on Spectral Clustering Pavel Kolev∗ Kurt Mehlhorn Max-Planck-Institut f¨ ur Informatik, Saarbr¨ ucken, Germany {pkolev,mehlhorn}@mpi-inf.mpg.de

Abstract Spectral clustering is a popular and successful approach for partitioning the nodes of a graph into clusters for which the ratio of outside connections compared to the volume (sum of degrees) is small. In order to partition into k clusters, one first computes an approximation of the first k eigenvectors of the (normalized) Laplacian of G, uses it to embed the vertices of G into k-dimensional Euclidean space Rk , and then partitions the resulting points via a kmeans clustering algorithm. It is an important task for theory to explain the success of spectral clustering. Peng et al. (COLT, 2015) made an important step in this direction. They showed that spectral clustering provably works if the gap between the (k + 1)-th and the k-th eigenvalue of the normalized Laplacian is sufficiently large. They prove a structural and an algorithmic result. The algorithmic result needs a considerably stronger gap assumption and does not analyze the standard spectral clustering paradigm; it replaces spectral embedding by heat kernel embedding and k-means clustering by locality sensitive hashing. We extend their work in two directions. Structurally, we improve the quality guarantee for spectral clustering by a factor of k and simultaneously weaken the gap assumption. Algorithmically, we show that the standard paradigm for spectral clustering works. Moreover, it even works with the same gap assumption as required for the structural result.

∗ This work has been funded by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government.

Contents 1 Introduction

2

2 Technical Contributions and Structure of the Paper 2.1 Exact Spectral Embedding - Notation . . . . . . . . . 2.2 Exact Spectral Embedding - Our Results . . . . . . . 2.3 Approximate Spectral Embedding - Notation . . . . . 2.4 Approximate Spectral Embedding - Our Results . . .

4 4 5 7 8

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

3 The Proof of Part (a) of The Main Theorem:

9

4 Vectors gbi and fi are Close 10 4.1 Analyzing the Columns of Matrix F . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.2 Analyzing Eigenvectors f in terms of fbj . . . . . . . . . . . . . . . . . . . . . . . . . 12 5 Spectral Properties of Matrix B 13 5.1 Analyzing the Column Space of Matrix B . . . . . . . . . . . . . . . . . . . . . . . . 13 5.2 Analyzing the Row Space of Matrix B . . . . . . . . . . . . . . . . . . . . . . . . . . 14 6 Proof of Lemma 2.4

17

7 The Normalized Spectral Embedding is ε-separated

19

8 An 8.1 8.2 8.3 8.4

21 21 22 24 26

Efficient Spectral Clustering Algorithm Proof of Lemma 2.7 . . . . . . . . . . . . . Proof of Theorem 2.8 . . . . . . . . . . . . . Proof of Theorem 2.9 . . . . . . . . . . . . . Proof of Part (b) of Theorem 1.2 . . . . . .

9 Parameterized Upper Bound on ρbavr (k)

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

27

A Singular Values Bounds of Random Matrices

1

31

1

Introduction

A cluster in an undirected graph G = (V, E) is a set S of nodes whose volume is large comconnections. Formally, we define the conductance of S by φ(S) = pared to the number of outside P E(S, S) /µ(S), where µ(S) = v∈S deg(v) is the volume of S. The k-way partitioning problem for graphs asks to partition the vertices of a graph such that the conductance of each block of the partition is small (formal definition below). This problem arises in many applications, e.g., image segmentation and exploratory data analysis. We refer to the survey [12] for additional information. A popular and very successful approach to clustering [6, 11, 12] is spectral clustering. One first computes an approximation of the first k eigenvectors of the (normalized) Laplacian of G, uses it to embed the vertices of G into k-dimensional Euclidean space Rk , and then partitions the resulting points via a k-means clustering algorithm. It is an important task for theory to explain the success of spectral clustering. Peng et al. [9] made an important step in this direction recently. They showed that spectral clustering provably works if the (k + 1)-th and the k-th eigenvalue of the normalized Laplacian differ sufficiently. In order to explain their result, we need some notation. The order k partition constant ρb(k) of G is defined by ρb(k) ,

min

partition (P1 ,...,Pk ) of V

Φ(P1 , . . . Pk ),

where Φ(Z1 , . . . , Zk ) = max φ(Zi ). i∈[1:k]

Let LG = I − D −1/2 AD −1/2 be the normalized Laplacian matrix of G, where D is the diagonal degree matrix and A is the adjacency matrix, and let fj ∈ RV be the eigenvector corresponding to the j-th smallest eigenvalue λj of LG . The spectral embedding map F : V → Rk is defined by 1 F (u) = √ (f1 (u) , . . . , fk (u))T , du

for all vertices u ∈ V .

(1)

Peng et al. [9] construct a k-means instance XV by inserting du many copies of the vector F (u) into XV , for every vertex u ∈ V . Let X be a set of vectors of the same dimension. Then △k (X ) ,

min

partition (X1 ,...,Xk ) of X

k X X

i=1 x∈Xi

kx − ci k2 ,

where ci =

1 X x, |X| x∈Xi

is the optimal cost of clustering X into k sets. An α-approximate clustering algorithm returns a k-way partition (A1 , . . . , Ak ) and centers c1 , . . . , ck such that Cost({Ai , ci }ki=1 ) ,

k X X

i=1 x∈Ai

kx − ci k2 6 α · △k (X ).

(2)

Theorem 1.1. [9, Theorem 1.2] Let k > 3 and (P1 , . . . , Pk ) be a k-way partition of V with Φ(P1 , . . . , Pk ) = ρb(k). Let G be a graph that satisfies the gap assumption1 Υ=

λk+1 = 2 · 105 · k3 /δ, ρb(k)

(3)

for some δ ∈ (0, 1/2]. Let (A1 , . . . , Ak ) be the k-way partition2 of V returned by an α-approximate k-means algorithm applied to XV . Then the following statements hold (after suitable renumbering 1

Note that λk /2 6 ρb(k), see (6). Thus the assumption implies λk /2 6 ρb(k) 6 δλk+1 /(2 · 105 · k3 ), i.e., there is a substantial gap between the k-th and the (k + 1)-th eigenvalue. 2 The k-means algorithm returns a partition of XV . One may assume w.l.o.g. that all copies of F (u) are put into the same cluster of XV . Thus the algorithm also partitions V .

2

of one of the partitions): 1) µ(Ai △Pi ) 6 αδ · µ(Pi )

and

2) φ(Ai ) 6 (1 + 2αδ) · φ(Pi ) + 2αδ.

Under the stronger gap assumption Υ = 2 · 105 · k5 /δ, they showed how to obtain a partition in time O (m · poly log(n)) with essentially the guarantees stated in Theorem 1.1, where m = |E| is the number of edges in G and n = |V | is the number of nodes. However, their algorithmic result does not analyze the standard spectral clustering paradigm, since it replaces spectral embedding by heat kernel embedding and k-means clustering by locality sensitive hashing. Therefore, their algorithmic result does not explain the success of the standard spectral clustering paradigm. Our Results: We strengthen the approximation guarantees in Theorem 1.1 by a factor of k and simultaneously weaken the gap assumption. As a consequence, the variant of Lloyd’s k-means fV achieves the improved approximation algorithm analyzed by Ostrovsky et al. [7] applied to3 X ln n 2 guarantees in time O(m(k + λk+1 )) with constant probability. Table 1 summarizes these results. Let O be the set of all k-way partitions (P1 , . . . , Pk ) with Φ(P1 , . . . , Pk ) = ρb(k), i.e., the set of all partitions that achieve the order k partition constant. Let ρbavr (k) ,

min

(P1 ,...,Pk )∈O

k 1X φ(Pi ) k i=1

be the minimal average conductance over all k-way partitions in O. Our gap assumption is defined in terms of λk+1 . Ψ, ρbavr (k)

For the remainder of this paper we denote by (P1 , . . . , Pk ) a k-way partition of V that achieves ρbavr (k). We can now state our main result. Theorem 1.2 (Main Theorem). a) (Existence of a Good Clustering) Let G be a graph satisfying Ψ = 204 · k3 /δ

(4)

for some δ ∈ (0, 1/2] and k > 3 and let (A1 , . . . , Ak ) be the k-way partition output by an αapproximate clustering algorithm applied to the spectral embedding XV . Then for every i ∈ [1 : k] the following two statements hold (after suitable renumbering of one of the partitions):   2αδ 2αδ αδ · φ(Pi ) + 3 . 1) µ(Ai △Pi ) 6 3 · µ(Pi ) and 2) φ(Ai ) 6 1 + 3 10 k 10 k 10 k b) (An Efficient Algorithm) If in addition k/δ > 109 and4 △k (XV ) > n−O(1) , then the variant of fV returns in time O(m(k2 + ln n )) Lloyd’s algorithm analyzed by Ostrovsky et al. [7] applied to X λk+1 with constant probability a partition (A1 , . . . , Ak ) such that for every i ∈ [1 : k] the following two statements hold (after suitable renumbering of one of the partitions):   4δ 4δ 2δ · φ(Pi ) + 3 . 3) µ(Ai △Pi ) 6 3 · µ(Pi ) and 4) φ(Ai ) 6 1 + 3 10 k 10 k 10 k f X V is defined as XV but in terms of approximate eigenvectors, see Subsection 2.3. The case △k (XV ) 6 n−O(1) constitutes a trivial clustering problem. For technical reasons, we have to exclude too easy inputs. 3

4

3

Gap Assumption Peng et al. [9] Υ = 2 · 105 · k3 /δ This paper

Ψ = 204 · k3 /δ

Partition Quality

Running Time

µ(Ai △Pi ) 6 αδ · µ(Pi ) φ(Ai ) 6 (1 + 2αδ) φ(Pi ) + 2αδ µ(Ai △Pi ) 6

αδ 103 k

· µ(Pi )  2αδ φ(Ai ) 6 1 + 10 3 k φ(Pi ) +

2αδ 103 k

Existential result

Existential result

2

Peng et al. [9] Υ = 2 · 105 · k5 /δ Ψ = 204 · k3 /δ This paper k/δ >

109

△k (XV ) > n−O(1)

k µ(Ai △Pi ) 6 δ log · µ(Pi ) k2   2 k φ(Pi ) + φ(Ai ) 6 1 + 2δ log 2 k

µ(Ai △Pi ) 6

2δ log2 k k2

2δ 103 k

· µ(Pi )  φ(Ai ) 6 1 + 104δ3 k φ(Pi ) +

4δ 103 k

O (m · poly log(n))   O m k2 +

ln n λk+1



Table 1: A comparison of the results in Peng et al. [9] and our results. The parameter δ ∈ (0, 1/2] relates the approximation guarantees with the gap assumption.

Part (b) of Theorem 1.2 gives theoretical support for the practical success of spectral clustering based on spectral embedding followed by k-means clustering. Previous papers [5, 9] replaced kmeans clustering by other techniques for their algorithmic results. If k 6 poly(log n) and λk+1 > poly(log n), our algorithm works in nearly linear time. The k-means algorithm in [7] is efficient only for inputs X for which some partition into k clusters is much better than any partition into k − 1 clusters; formally, for inputs X satisfying △k (X ) 6 ε2 · △k−1 (X ) for some ε ∈ (0, 6 · 10−7 ]. For the proof of part (b) of Theorem 1.2, we show fV satisfies this assumption. in Section 8 that X The order k conductance constant ρ(k) is defined by ρ(k) =

min

disjoint nonempty Z1 ,...,Zk

Φ(Z1 , . . . , Zk ),

where Φ(Z1 , . . . , Zk ) = max φ(Zi ). i∈[1:k]

(5)

Lee et al. [5] connected ρ(k) and the k-th smallest eigenvalue of the normalized Laplacian matrix LG through the relation p (6) λk /2 6 ρ(k) 6 O(k 2 ) λk , and in a consecutive work Oveis Gharan and Trevisan [8] showed ρ(k) 6 ρb(k) 6 kρ(k).

(7)

In Section 9, we establish an analogous relation for ρbavr (k). We next introduce our technical contributions and we outline the structure of the paper.

2 2.1

Technical Contributions and Structure of the Paper Exact Spectral Embedding - Notation

We use the notation adopted by Peng et al. [9]. We refer to the j-th eigenvalue of matrix LG by λj , λj (LG ). The (unit) eigenvector corresponding to λj is denoted by fj ∈ RV . 4

kfbi − gi k2 6 φ(Pi )/λk+1

P (i) fbi = kj=1 αj fj

fi =

Pk

kfi − gbi k2 6 (1 + 3k/Ψ) · k/Ψ

(i) b j=1 βj fj

1/2 χ

D gi = √

gbi =

Pi

µ(Pi )

=

Pn

(i) j=1 αj fj

Pk

(i) j=1 βj gj

Figure 1: The relation between the vectors fi , fbi , gbi and gi . The vectors {fi }n i=1 are eigenvectors of the normalized Laplacian matrix LG of a graph G satisfying Ψ > 4 · k 3/2 . The vectors {g i }ki=1 are the normalized characteristic vectors of an optimal partition (P1 , . . . , Pk ). For each i ∈ [1 : k] the vector fbi is the projection of vector gi onto span(f1 , . . . , fk ). The vectors 3/2 , and thus we can write fbi and g i are close for i ∈ [1 : k]. It holds span(f1 , . . . , fk ) = span(fb1 , . . . , fc k ) when Ψ > 4 · k Pk Pk (i) b (i) fi = j=1 βj fj . Moreover, the vectors fi and gbi = j=1 βj gj are close for i ∈ [1 : k].

D 1/2 χ

Pi , where χPi is the characteristic vector of the subset Pi ⊆ V . We note that kD1/2 χPi k

2 P gi is the normalized characteristic vector of Pi and D 1/2 χPi = v∈Pi deg(v) = µ(Pi ). The Rayleigh quotient is defined by and satisfies

Let gi =

R (gi ) ,

1 |E(S, S)| gi T LG gi = χT LχPi = = φ(Pi ), µ(Pi ) Pi µ(Pi ) gi T gi

where L = D − A is the graph Laplacian matrix. The eigenvectors {fi }ni=1 form an orthonormal basis of Rn . Thus each characteristic vector gi P (i) can be expressed as gi = nj=1 αj fj for all i ∈ [1 : k]. We define its projection onto the first k P (i) eigenvectors by fbi = k α fj . j=1

j

Peng et al. [9] proved that if the gap parameter Υ is large enough then span({fbi }ki=1 ) = P (i) span({fi }ki=1 ) and the first k eigenvectors can be expressed by fi = kj=1 βj fbj , for all i ∈ [1 : k]. P (i) Moreover, they demonstrated that each vector gbi = kj=1 βj gj approximates the eigenvector fi , for all i ∈ [1 : k]. We show that similar statements hold with substituted gap parameter Ψ. We define the estimation centers induced by the spectral embedding by   1 (1) (k) T p(i) = p . (8) βi , . . . , βi µ(Pi ) Our analysis relies on the spectral properties of the following two matrices. Let F, B ∈ Rk×k be square matrices such that for all indices i, j ∈ [1 : k] we have (i)

Fj,i = αj

2.2

(i)

and Bj,i = βj .

(9)

Exact Spectral Embedding - Our Results

Our proof of Theorem 1.2 (a) follows the proof-structure of [9, Theorem 1.2] in Peng et al., but improves upon it in essential ways. Our key technical insight is that the matrices BT B and BBT are close to the identity matrix. We prove this in two steps. In Section 4, we show that the vectors gbi and fi are close, and then in Section 4 we analyze the column space and row space of matrix B. 5

Theorem 2.1 (Matrix BBT is Close to Identity Matrix). If Ψ > 104 · k3 /ε2 and ε ∈ (0, 1) then for all distinct i, j ∈ [1 : k] it holds √ 1 − ε 6 hBi,: , Bi,: i 6 1 + ε and |hBi,: , Bj,: i| 6 ε. Theorem 2.1 is key for the improved separation (by a factor of k over [9, Lemma 4.3]) of estimation centers and the bound on the norm of the separation centers. Lemma 2.2. If Ψ = 204 · k3 /δ for some δ ∈ (0, 1] then for every i ∈ [1 : k] it holds that

1 µ(Pi )

Proof. By definition p(i) = √

2 h √ i

(i)

p ∈ 1 ± δ/4

1 . µ(Pi )

· Bi,: and Theorem 2.1 yields kBi,: k2 ∈ [1 ±





δ/4].

Lemma 2.3 (Larger Distance Between Estimation Centers). If Ψ = 204 · k3 /δ for some δ ∈ (0, 12 ] then for any distinct i, j ∈ [1 : k] it holds that

2

(i)

−1

p − p(j) > [2 · min {µ(Pi ), µ(Pj )}] .

√ Proof. Since p(i) is a row of matrix B, Theorem 2.1 with ε = δ/4 yields * + √ hBi,: , Bj,: i 2δ1/4 p(i) ε p(j)

,

= 6 = .

p(i) p(j) kBi,: k kBj,: k 1−ε 3

2

2

W.l.o.g. assume that p(i) > p(j) , say p(j) = α · p(i) for some α ∈ (0, 1]. Then by Lemma √

2 2.2 we have p(i) > (1 − δ/4) · [min {µ(Pi ), µ(Pj )}]−1 , and hence +



p(j) p(i) (i) (j)

,



p

p

p(i) p(j) !

2

· α + 1 p(i) > [2 · min {µ(Pi ), µ(Pj )}]−1 .

2

2

2

(i)



p − p(j) = p(i) + p(j) − 2 >

α2 −

4δ1/4 3

*



The observation that Υ can be replaced by Ψ in all statements in [9] is technically easy. However, this is crucial for the part (b) of Theorem 1.2, since it yields an improved version of [9, Lemma 4.5] showing that a weaker by a factor of k assumption is sufficient. We prove Lemma 2.4 in Section 6. Lemma 2.4. Let (P1 , . . . , Pk ) and (A1 , . . . , Ak ) are partitions of the vector set. Suppose for every permutation π : [1 : k] → [1 : k] there is an index i ∈ [1 : k] such that µ(Ai △Pπ(i) ) >

2ε · µ(Pπ(i) ), k

(10)

where ε ∈ (0, 1) is a parameter. If Ψ = 204 · k3 /δ for some δ ∈ (0, 12 ], and ε > 64α · k3 /Ψ then Cost({Ai , ci }ki=1 ) > 6

2k2 α. Ψ

With the above Lemmas in place, the proof of part (a) of Theorem 1.2 is then completed as in [9]. We give more details in Section 3. Before we turn to part (b) of Theorem 1.2, we consider the variant of Lloyd’s algorithm analyzed by Ostrovsky et al. [7] applied to XV . This algorithm is efficient for inputs X satisfying: some partition into k clusters is much better than any partition into k − 1 clusters. Theorem 2.5. [7, Theorem 4.15] Assuming that △k (X ) 6 ε2 △k−1 (X ) for ε ∈ (0, 6 · 10−7 ], there is an algorithm that returns a solution of cost at most [(1 − ε2 )/(1 − 37ε2 )]△k (X ) with probability √ at least 1 − O( ε) in time O(nkd + k3 d). In Section 7, we establish the assumption of Ostrovsky et al. [7] for XV . Theorem 2.6 (Normalized Spectral Embedding is ε-separated). Let G be a graph that satisfies Ψ = 204 · k3 /δ, δ ∈ (0, 1/2] and k/δ > 109 . Then for ε = 6 · 10−7 it holds △k (XV ) 6 ε2 △k−1 (XV ).

(11)

However, Theorem 2.6 is insufficient for part (b) of Theorem 1.2, since we need a similar result fV formed by approximate eigenvectors. To overcome this issue we build upon the recent for the set X work by Boutsidis et al. [2] which shows that running an approximate k-means clustering algorithm on approximate eigenvectors obtained via the power method, yields an additive approximation to solving the k-means clustering problem on exact eigenvectors. In order to state the connection, we need to introduce some of their notation.

2.3

Approximate Spectral Embedding - Notation

Let Z ∈ Rn×k be a matrix whose rows represent n vectors that are to be partitioned into k clusters. p For every k-way partition we associate an indicator matrix X ∈ Rn×k that satisfies Xij = 1/ |Cj | if the i-th row Zi,: belongs to the j-th cluster Cj , and Xij = 0 otherwise. We denote the optimal indicator matrix Xopt by Xopt

k X X

T 2

kZu,: − cj k22 , = arg min Z − XX Z F = arg min X∈Rn×k

X∈Rn×k

(12)

j=1 u∈Xj

P where cj = (1/|Xj |) u∈Xj Zu,: is the center point of cluster Cj . The normalized Laplacian matrix LG ∈ Rn×n of a graph G is define by LG = I − A, where A = D −1/2 AD −1/2 is the normalized adjacency matrix. Let Uk ∈ Rn×k be a matrix composed of the first k orthonormal eigenvectors of LG corresponding to the smallest eigenvalues λ1 , . . . , λk . We define by Y , Uk the canonical spectral embedding. Our approximate spectral embedding is computed by the so called “Power method”. Let S ∈ Rn×k be a matrix whose entries are i.i.d. samples from the standard Gaussian distribution N (0, 1) and p be a positive integer. Then the approximate spectral embedding Ye is defined by the following process: 1) B , I + A;

eΣ e Ve T be the SVD of B p S; 2) Let U

and

e ∈ Rn×k . 3) Ye , U

(13)

We proceed by defining the normalized (approximate) spectral embedding. We construct a matrix Y ′ ∈ Rm×k such that for every vertex u ∈ V we add d(u) many copies of the normalized

7

p f′ ) is row Uk (u, :)/ d(u) to Y ′ . Formally, the normalized (approximate) spectral embedding Y ′ (Y defined by     e (1,:) U 1d(1) √ 1d(1) U√k (1,:)  d(1)  d(1)    f′ =   and Y · · · · · · , (14) Y′ =      e (n,:)  Uk (n,:) U 1d(n) √ 1d(n) √ d(n)

d(n)

m×k

m×k

where 1d(i) is all-one column vector with dimension d(i). f′ ) an indicator matrix X ′ (X f′ ) that satisfies X ′ = Similarly to (12) we associate to Y ′ (Y ij p ′ = 0 otherwise. We may 1/ µ(Cj ) if the i-th row Yi,:′ belongs to the j-th cluster Cj , and Xij ′ assume w.l.o.g. p that a k-means algorithm outputs an indicator matrix X such that all copies of row Uk (v, :)/ d(v) belong to the same cluster, for every vertex v ∈ V . f′ the sets of points XV and X fV respectively. We present now We associate to matrices Y ′ and Y a key connection between the spectral embedding map F (·), the optimal k-means cost △k (XV ) and ′ : matrices Y ′ , Xopt k X

X

2 T ′



2 ′ ′ d(v) F (v) − c⋆j F = △k (XV ),

Y − Xopt Xopt Y = F

where each center satisfies c⋆j = µ(Cj⋆ )−1 ·

2.4

(15)

j=1 v∈Cj⋆

P

v∈Cj⋆

p d(v)F (v) and F (v) = Yv,: / d(v).

Approximate Spectral Embedding - Our Results

Our analysis relies on the proof techniques developed in [1, 2]. By adjusting these techniques (c.f. [2, Lemma 5] and [1, Lemma 7]) to our setting, we prove in Subsection 8.1 the following result for the symmetric positive semi-definite matrix B whose largest k singular values (eigenvalues) correspond to the eigenvectors u1 , . . . , uk of LG . eΣ e Ve T be the SVD of B p S ∈ Rn×k , where p > 1 and S is an n × k matrix Lemma 2.7. Let U 2−λ < 1 and fix δ, ǫ ∈ (0, 1). Then for any p > of i.i.d. standard Gaussians. Let γk = 2−λk+1 k  ln(8nk/ǫδ) ln(1/γk ) with probability at least 1 − 2e−2n − 3δ it holds

eU e T

Uk UkT − U

6 ǫ. F

We establish several technical Lemmas that combined with Lemma 2.7 allow us to apply the proof techniques in [2, Theorem 6]. More precisely, we prove in Subsection 8.2 that running an f′ computed by approximate k-means algorithm on a normalized approximate spectral embedding Y the power method, yields an approximate clustering of the normalized spectral embedding Y ′ .  f′ via the power method with p > ln(8nk/ǫδ) ln(1/γk ), where Theorem 2.8. Compute matrix Y f′ an α-approximate k-means algorithm with γk = (2 − λk+1 )/(2 − λk ) < 1. Run on the rows of Y f′ ∈ Rn×k . Then with failure probability δα . Let the outcome be a clustering indicator matrix X α −2n probability at least 1 − 2e − 3δp − δα it holds

 T

2

′ T ′



2 ′ ′ ′ f′ f′ X

Y − X 6 (1 + 4ε) · α · − X X Y + 4ε2 . Y

Y opt opt α α

F F

8

fV satisfies the assumption Our main technical contribution is to prove, in Subsection 8.3, that X of Ostrovsky et al. [7]. Our analysis builds upon Theorem 2.6 and Theorem 2.8.

Theorem 2.9 (Approximate Normalized Spectral Embedding is ε-separated). Suppose the gap assumption satisfies Ψ = 204 · k3 /δ, k/δ > 109 for some δ ∈ (0, 1/2] and the optimum cost5



′ (X ′ )T Y ′ > n−O(1) . Construct matrix Y f′ via the power method with p > Ω( ln n ).

Y − Xopt opt λk+1 F     −7 2 fV < 5ε · △k−1 X fV . Then for ε = 6 · 10 w.h.p it holds △k X Based on the preceding results, we prove part (b) of Theorem 1.2 in Subsection 8.4.

3

The Proof of Part (a) of The Main Theorem:

The proof of part (a.1) builds upon the following Lemmas. Recall that XV contains du copies of F (u) for each u ∈ V . W.l.o.g. we may restrict attention to clusterings of XV that put all copies of F (u) into the same cluster and hence induce a clustering of V . Let (A1 , . . . , Ak ) with cluster centers c1 to ck be a clustering of V . Its k-means cost is Cost({Ai , ci }ki=1 ) =

k X X

i=1 u∈Ai

du kF (u) − ci k2 .

Lemma 3.1 ((P1 , . . . , Pk ) is a good k-means partition). If Ψ > 4 · k3/2 then there are vectors {p(i) }ki=1 such that  2  k 3k (i) k · . Cost({Pi , p }i=1 ) 6 1 + Ψ Ψ  k Proof. By Theorem 4.1 we have kfi − gbi k2 6 1 + 3k Ψ · Ψ and thus k X X

i=1 u∈Pi

=

k k X X

du kF (u) − c⋆i k2 6

X

j=1 i=1 u∈Pi

k X X

i=1 u∈Pi

(fj (u) − gbj (u))2 =

k X j=1

k X k X

2 X  

(i) 2 du F (u)j − pj du F (u) − p(i) = i=1 j=1 u∈Pi

kfj − gbj k2 6



1+

3k Ψ



·

k2 , Ψ

where the k-way partition (P1 , . . . , Pk ) achieving ρbavr (k) has corresponding centers c⋆1 , . . . , c⋆k .



Lemma 3.2 (Only partitions close to (P1 , . . . , Pk ) are good). Under the hypothesis of Theorem 1.2, the following holds. If for every permutation σ : [1 : k] → [1 : k] there exists an index i ∈ [1 : k] such that 8αδ µ(Ai △Pσ(i) ) > 4 · µ(Pσ(i) ). (16) 10 k Then it holds that 2αk2 . (17) Cost({Ai , ci }ki=1 ) > Ψ

5

′ ′ Y ′ − Xopt (Xopt )T Y ′ F > n−O(1) asserts a multiplicative approximation guarantee in Theorem 2.8.

9

We note that Lemma 3.2 follows directly by applying Lemma 2.4 with ε = 64 · α · k3 /Ψ. Substituting these bounds into (2) yields a contradiction, since   2αk 2 3k αk 2 < Cost({Ai , ci }ki=1 ) 6 α · △k (XV ) 6 α · Cost({Pi , p(i) }ki=1 ) 6 1 + . · Ψ Ψ Ψ Therefore, there exists a permutation π (the identity after suitable renumbering of one of the 8αδ partitions) such that µ(Ai △Pi ) < 10 4 k · µ(Pi ) for all i ∈ [1 : k]. Part (a.2) follows from Part (a.1). Indeed, for δ′ = 8δ/104 we have   αδ′ · µ(Pi ) µ(Ai ) > µ(Pi ∩ Ai ) = µ(Pi ) − µ(Pi \ Ai ) > µ(Pi ) − µ(Ai △Pi ) > 1 − k and |E(Ai , Ai )| 6 |E(Pi , Pi )| + µ(Ai ∆Pi ) since every edge that is counted in |E(Ai , Ai )| but not in |E(Pi , Pi )| must have an endpoint in Ai ∆Pi . Thus ′

|E(Pi , Pi )| + αδk · µ(Pi ) |E(Ai , Ai )| 6 Φ(Ai ) = 6 ′ µ(Ai ) (1 − α·δ k ) · µ(Pi )



2αδ′ 1+ k



· φ(Pi ) +

2αδ′ . k

This completes the proof of part (a) of Theorem 1.2.

4

Vectors gbi and fi are Close

In this section we prove Theorem 4.1. We argue in a similar manner as in [9], but in contrast we use the gap parameter Ψ. For completeness, we show in Subsection 4.1 that the span of the first k eigenvectors of LG equals the span of the projections of Pi ’s characteristic vectors onto the first k eigenvectors. Then in Subsection 4.2 we express the eigenvectors fi in terms of fbi and we conclude the proof of Theorem 4.1. P (i) Theorem 4.1. If Ψ > 4 · k3/2 then the vectors gbi = kj=1 βj gj , i ∈ [1 : k], satisfy 2

4.1

kfi − gbi k 6



3k 1+ Ψ



k . Ψ

·

Analyzing the Columns of Matrix F

We prove in this subsection the following result that depends on gap parameter Ψ. Lemma 4.2. If Ψ > k3/2 then the span({fbi }ki=1 ) = span({fi }ki=1 ) and thus each eigenvector can be P (i) expressed as fi = kj=1 βj · fbj for every i ∈ [1 : k]. To prove Lemma 4.2 we build upon the following result shown by Peng et al. [9].

Lemma 4.3. [9, Theorem 1.1 Part 1] For Pi ⊂ V let gi =

that

D 1/2 χPi

kD1/2 χPi k

. Then any i ∈ [1 : k] it holds

n

2

  X φ(Pi ) R (gi )

(i) 2 b = . 6 = − f α

gi i j λk+1 λk+1 j=k+1

Based on the following two results we prove Lemma 4.2.

10

Lemma 4.4. For every i ∈ [1 : k] and p 6= q ∈ [1 : k] it holds that 1 − φ(Pi )/λk+1

2

2

6 fbi = α(i) 6 1

and

p D E φ(Pp ) · φ(Pq ) b b . fp , fq = |hαp , αq i| 6 λk+1

Proof. The first part follows by Lemma 4.3 and the following chain of inequalities 1−

n  k  n

2 X     X X φ(Pi )

(i) 2 (i) 2 (i) 2 = 1. αj 6 αj = fbi = αj 61− λk+1 j=1

j=1

j=k+1

We show now the second part. Since {fi }ni=1 are orthonormal eigenvectors we have for all p 6= q that n X (q) (p) hfp , fq i = (18) αl · αl = 0. l=1

We combine (18) and Cauchy-Schwarz to obtain k n D X X E b b (p) (q) (p) (q) αl · αl = αl · αl fp , fq = l=1 l=k+1 v v u n  u X 2 pφ(P ) · φ(P ) u n  (p) 2 u X p q (q) . ·t 6 αl αl 6 t λk+1 l=k+1

l=k+1



Lemma 4.5. If Ψ > k3/2 then the columns {F:,i }ki=1 are linearly independent. Proof. We show that the columns of matrix F are almost orthonormal. Consider the symmetric  T T matrix F F. It is known that ker F F = ker(F) and that all eigenvalues of matrix FT F are real numbers. We proceeds by showing that the smallest eigenvalue λmin (FT F) > 0. This would imply that ker(F) = ∅ and hence yields the statement. By combining Gersgorin Circle Theorem, Lemma 4.4 and Cauchy-Schwarz it holds that     k k D   

2 X E  X   T (j) (i)

FT F ii − λmin (FT F) > min F F ij = min α(i) − α ,α  i∈[1:k]   i∈[1:k]  j6=i j6=i v s s s u k k X √ uX φ(Pj ) φ(Pi⋆ ) k3/2 φ(Pj ) φ(Pi⋆ ) > 1− > 1 − kt >1− > 0, λk+1 λk+1 λk+1 λk+1 Ψ j=1

j=1

where i⋆ ∈ [1 : k] is the index that minimizes the expression above.



We present now the proof of Lemma 4.2. Proof of Lemma 4.2. Let λ be an arbitrary non-zero vector. Notice that ! k k k k k k X X X X X X (i) (i) b γj fj , where γj = hFj,: , λi . λi αj fj = λi αj fj = λi · fi = i=1

i=1

j=1

j=1

(19)

j=1

i=1

By Lemma 4.5 the columns {F:,i }ki=1 are linearly independent and since γ = Fλ, it follows at least n ok one component γj 6= 0. Therefore the vectors fbi are linearly independent and span Rk .  i=1

11

4.2

Analyzing Eigenvectors f in terms of fbj

To prove Theorem 4.1 we establish next the following result. Lemma 4.6. If Ψ > k3/2 then for i ∈ [k] it holds 

2k 1+ Ψ

−1

6

k  X j=1

 (i) 2 βj

6



2k 1− Ψ

Proof. We show now the upper bound. By Lemma 4.2 fi = 1

=

kfi k2 =

=

k  X j=1

(⋆)

>



* k X a=1

βa(i) fba ,

 (i) 2 b 2

fj

βj

+

k X

a=1 b6=a

 X k   2k (i) 2 . βj · 1− Ψ

.

Pk

(i) βb fbb

b=1 k X k X

−1

(i) b j=1 βj fj

+

(i)

βa(i) βb

D

fba , fbb

for all i ∈ [1 : k] and thus

E

j=1

To prove the inequality (⋆) we consider the two terms separately.

2 P P P

By Lemma 4.4, fbj > 1 − φ(Pj )/λk+1 . We then apply i ai bi 6 ( i ai )( i bi ) for all non-negative vectors a, b and obtain   X  k k  k  k      X X φ(Pj ) k X  (i) 2 (i) 2 φ(Pj ) (i) 2 (i) 2 βj 1− . βj − βj βj > 1− = λk+1 λk+1 Ψ j=1

j=1

j=1

j=1

D E p Again by Lemma 4.4, we have fba , fbb 6 φ(Pa )φ(Pb )/λk+1 , and by Cauchy-Schwarz it holds k X k X a=1 b6=a

(i) βa(i) βb

D

fba , fbb

E

k X k D E X (i) (i) b b > − βa · βb · fa , fb a=1 b6=a

> −

> −

1

λk+1 1 λk+1

k X k X (i) p (i) p βa φ(Pa ) · βb φ(Pb ) a=1 b6=a

2 k k q X k X  (i) 2 (i)   >− · . βj βj φ(Pj ) Ψ 

j=1

The lower bound follows by analogous arguments.

j=1



We are ready now to prove Theorem 4.1. P P (i) (i) Proof of Theorem 4.1. By Lemma 4.2, we have fi = kj=1 βj fbj and recall that gbi = kj=1 βj gj for all i ∈ [1 : k]. We combine triangle inequality, Cauchy-Schwarz, Lemma 4.3 and Lemma 4.6 to

12

obtain

2 

2

X k k

  X



(i) βji ·  kfi − gbi k2 = βj fbj − gj

fbj − gj 

6

j=1 j=1        −1 k  k k

2 2 X X X 2k 1

b (i)    6  · βj φ(Pj )

fj − gj  6 1 − Ψ λk+1 j=1 j=1 j=1     3k k 2k −1 k · 6 1+ · , = 1− Ψ Ψ Ψ Ψ 

where the last inequality uses Ψ > 4 · k.

5

Spectral Properties of Matrix B

In this section, we present the proof of Theorem 2.1. In Subsection 5.1, we analyzes the column space of B and we prove in Lemma 5.3 that 1 − ε 6 hBi,: , Bi,: i 6 1 + ε for all i. Then in Subsection 5.2, √ we analyze the row space of B and we derive Lemma 5.4 that shows |hBi,: , Bj,: i| 6 ε for all i, j.

5.1

Analyzing the Column Space of Matrix B

We show below that the matrix BT B is close to the identity matrix. Lemma 5.1. (Columns) If Ψ > 4 · k3/2 then for all distinct i, j ∈ [1 : k] it holds r 3k 3k k 6 hB:,i , B:,i i 6 1 + and |hB:,i , B:,j i| 6 4 . 1− Ψ Ψ Ψ Proof. By Lemma 4.6 it holds that X  (i) 2 3k 3k 61+ βj 6 hB:,i , B:,i i = . 1− Ψ Ψ k

j=1

Recall that gbi =

Pk

(i) j=1 βj

· gj . Moreover, since the eigenvectors {fi }ki=1 and the characteristic

vectors {gi }ki=1 are orthonormal by combing Cauchy-Schwarz and by Theorem 4.1 it holds |hB:,i , B:,j i| =

k X l=1

(i) (j) βl βl

=

* k X a=1

βa(i)

· ga ,

k X b=1

= h(gbi − fi ) + fi , (gbj − fj ) + fj i

(j) βb

· gb

+

= hgbi , gbj i

= hgbi − fi , gbj − fj i + hgbi − fi , fj i + hfi , gbj − fj i

6 kgbi − fi k · kgbj − fj k + kgbi − fi k + kgbj − fj k s r    k 3k 3k k k 6 1+ 1+ . · +2 · 64 Ψ Ψ Ψ Ψ Ψ



Using a stronger gap assumption we show that the columns of matrix B are linearly independent. 13

Lemma 5.2. If Ψ > 25 · k3 then the columns {B:,i }ki=1 are linearly independent.  Proof. Since ker (B) = ker BT B and BT B is SPSD6 matrix, it suffices to show that the smallest eigenvalue xT BT Bx λ(BT B) = min > 0. x6=0 xT x By Lemma 5.1, !2 r r k k X k D E X X k k (i) (j) 2 |xi | |xj | β , β |xi | , 6 kxk · 4k 64 Ψ Ψ i=1 j6=i

and

T

T

x B Bx =

i=1

* k X

(i)

xi β ,

>

xj β

j=1

i=1



k X

3k 1− Ψ



kxk2 −

(j)

+

k X

=

i=1

k k X X i=1 j6=i



2

x2i β (i)

+

k k X X i=1 j6=i

D E |xi | |xj | β (i) , β (j) >

E D xi xj β (i) , β (j) 1 − 5k

r

k Ψ

!

· kxk2 .

Therefore λ(BT B) > 0 and the statement follows.

5.2



Analyzing the Row Space of Matrix B

In this subsection we show that the matrix BBT is close to the identity matrix. We bound now the squared L2 norm of the rows in matrix B, i.e. the diagonal entries in matrix BBT . Lemma 5.3. (Rows) If Ψ > 400 · k3 /ε2 and ε ∈ (0, 1) then for all distinct i, j ∈ [1 : k] it holds 1 − ε 6 hBi,: , Bi,: i 6 1 + ε.

Proof. We show that the eigenvalues of matrix BBT are concentrated around 1. This would imply T that χT i BB χi = hBi,: , Bi,: i ≈ 1, where χi is a characteristic vector. By Lemma 5.1 we have     k D

E 3k 2 16k2 23k2 3k 2  (i) T

(i) 4 X (j) (i) 2 T (i) β ,β 6 1+ 6 β · BB · β = β + + 61+ 1− Ψ Ψ Ψ Ψ j6=i

and r X   r k D E E D 2 (i) T 3k k k k (l) (j) T (j) (i) (l) β · BB · β 6 + 16 6 11 . 68 1+ β ,β β ,β Ψ Ψ Ψ Ψ l=1

By Lemma 5.2 every vector x ∈ Rk can be expressed as x = xT BBT x =

k X i=1

=

k X i=1

> 6

k T  X γj β (j) γi β (i) · BBT ·

Pk

i=1 γi β

(i) .

j=1

k X k T  T  X γi γj β (i) · BBT · β (j) γi2 β (i) · BBT · β (i) +

1−

23k2 Ψ

− 11k

r

k Ψ

!

i=1 j6=i

kγk2 >

1 − 14k

We denote by SPSD the class of symmetric positive semi-definite matrices.

14

r

k Ψ

!

kγk2 .

and T

x x=

k k X X i=1 j=1

D

(i)

γi γj β , β

(j)

E

=

k X i=1

2 2 (i) γi β

+

k k X X i=1 j6=i

D E γi γj β (i) , β (j)

q P P

2

k By Lemma 5.1 we have ki=1 kj6=i γi γj β (i) , β (j) 6 kγk2 · 4k Ψ and β (i) 6 1 + holds r ! r ! k k kγk2 6 xT x 6 1 + 5k kγk2 . 1 − 5k Ψ Ψ Therefore 1 − 20k

r

k 6 λ(BBT ) 6 1 + 20k Ψ

r

3k Ψ.

Thus it

k . Ψ 

We have now established the first part of Theorem 2.1. We turn to the second part and restate it in the following Lemma. Lemma 5.4. (Rows) If Ψ > 104 · k3 /ε2 and ε ∈ (0, 1) then for all distinct i, j ∈ [1 : k] it holds √ |hBi,: , Bj,: i| 6 ε. To prove Lemma 5.4 we establish the following three Lemmas. Before stating them we need some notation that is inspired by Lemma 5.1. q k Definition 5.5. Let BT B = I + E, where |Eij | 6 4 Ψ and E is symmetric matrix. Then we have BBT

2

= B (I + E) BT = BBT + BEBT .

Lemma 5.6. If Ψ > 402 · k3 /ε2 and ε ∈ (0, 1) then all eigenvalues of matrix BEBT satisfy λ(BEBT ) 6 ε/5.

Proof. Let z = BT x. We upper bound the quadratic form

r X T k x BEBT x = z T Ez 6 · |Eij | |zi | |zj | 6 4 Ψ ij

k X i=1

!2

|zi |

By Lemma 5.3 we have 1 − ε 6 λ(BBT ) 6 1 + ε and since kzk2 =

2

6 kzk · 4k

xBBT x xT x

r

k . Ψ

· kxk2 it follows that

kzk2 kzk2 6 kxk2 6 , 1+ε 1−ε and hence

r T x BEBT x k λ(BEBT ) 6 max 6 4 (1 + ε) · k 6 ε/5. T x x x Ψ



Lemma 5.7. Suppose {ui }ki=1 is orthonormal basis and the square matrix U has ui as its i-th column. Then UT U = I = UUT . 15

Proof. Notice that by the definition of U it holds UT U = I. Moreover, the matrix U−1 exists and thus UT = U−1 . Therefore, we have UUT = I as claimed.  Lemma 5.8. If Ψ > 402 ·k3 /ε2 and ε ∈ (0, 1) then it holds |(BEBT )ij | 6 ε/5 for every i, j ∈ [1 : k]. Proof. Notice that BEBT is symmetric matrix, since P E is symmetric. By SVD Theorem there is an orthonormal basis {ui }ki=1 such that BEBT = ki=1 λi (BEBT ) · ui uT i . Thus, it suffices to bound the expression k X T |λl (BEBT )| · |(ul uT |(BEB )ij | 6 l )ij |. l=1

By Lemma 5.7 we have

q q k X 2 |(ul )i | · |(ul )j | 6 kUi,: k kUj,: k2 = 1. l=1

We apply now Lemma 5.6 to obtain

k k X ε ε X T T |(ul )i | · |(ul )j | 6 . |λl (BEB )| · |(ul ul )ij | 6 · 5 5 l=1

l=1





We are ready now to prove Lemma 5.4, i.e. |hBi,: , Bj,: i| 6 ε for all i 6= j. 2 Proof of Lemma 5.4. By Definition 5.5 we have BBT = BBT + BEBT . Observe that the (i, j)th entry of matrix BBT is equal to the inner product between the i-th and j-th row of matrix B, i.e. BBT ij = hBi,: , Bj,: i. Moreover, we have h

BBT

2 i

ij

=

k X l=1

BBT



i,l

BBT



l,j

=

k X l=1

hBi,: , Bl,: i hBl,: , Bj,: i .

For the entries on the main diagonal, it holds hBi,: , Bi,: i2 +

k X l6=i

hBi,: , Bl,: i2 = [(BBT )2 ]ii = [BBT + BEBT ]ii = hBi,: , Bi,: i + BEBT



ii

,

and hence by applying Lemma 5.3 with ε′ = ε/5 and Lemma 5.8 with ε′ = ε we obtain hBi,: , Bj,: i2 6

X l6=i

 ε 2 ε ε  + − 1− 6 ε. hBi,: , Bl,: i2 6 1 + 5 5 5 

16

6

Proof of Lemma 2.4

In this section, we prove Lemma 2.4. Our key technical contribution gives in Lemma 6.3 an improved lower bound on the k-means cost in the setting of Lemma 6.2. This yields an improved version of [9, Lemma 4.5] showing that a weaker by a factor of k assumption suffices (c.f. Lemma 2.4). Our analysis combines the proof techniques developed in [9] with our strengthen results that depend on the gap parameter Ψ. We start by establishing a Corollary of Lemma 2.3. 4 3 Corollary 6.1. 20

· k /δ for some δ ∈ (0, 1/2]. Suppose ci is the center of a cluster Ai .

Let Ψ =

(i ) (i ) 2 1



then it holds > ci − p If ci − p

2 1

2

(i1 ) −1 (i1 ) (i2 ) − p > − p

ci

p

> [8 · min {µ(Pi1 ), µ(Pi2 )}] . 4

We restate now [9, Lemma B.2] whose analysis crucially relies on a function σ defined by µ(Al ∩ Pj ) . µ(Pj ) j∈[1:k]

σ(l) = arg max

(20)

Lemma 6.2. [9, Lemma B.2] Let (P1 , . . . , Pk ) and (A1 , . . . , Ak ) be partitions of the vector set. Suppose for every permutation π : [1 : k] → [1 : k] there is an index i ∈ [1 : k] such that µ(Ai △Pπ(i) ) > 2ε · µ(Pπ(i) ),

(21)

where ε ∈ (0, 1/2) is a parameter. Then one of the following three statements holds: 1. If σ is a permutation and µ(Pσ(i) \Ai ) > ε · µ(Pσ(i) ), then for every index j 6= i there is a real εj > 0 such that µ(Aj ∩ Pσ(j) ) > µ(Aj ∩ Pσ(i) ) > εj · min{µ(Pσ(j) ), µ(Pσ(i) )},

P and j6=i εj > ε. 2. If σ is a permutation and µ(Ai \Pσ(i) ) > ε · µ(Pσ(i) ), then for every j 6= i there is a real εj > 0 such that µ(Ai ∩ Pσ(i) ) > εj · µ(Pσ(i) ), µ(Ai ∩ Pσ(j) ) > εj · µ(Pσ(i) ), P and j6=i εj > ε. 3. If σ is not a permutation, then there is an index ℓ 6∈ {σ(1), . . . , σ(k)} and for every index j there is a real εj > 0 such that

and

Pk

j=1 εj

µ(Aj ∩ Pσ(j) ) > µ(Aj ∩ Pℓ ) > εj · min{µ(Pσ(j) ), µ(Pℓ )}, = 1.

We prove below the improved lower bound on the k-means cost. Lemma 6.3. Suppose the hypothesis of Lemma 6.2 is satisfied and Ψ = 204 · k3 /δ for some δ ∈ (0, 1/2]. Then it holds ε 2k2 Cost({Ai , ci }ki=1 ) > − . 16 Ψ

17

Proof. By definition Cost({Ai , ci }ki=1 )

=

k k X X

X

i=1 j=1 u∈Ai ∩Pj

du kF (u) − ci k2 , Λ.

(22)

Since for every vectors x, y, z ∈ Rk it holds   2 kx − yk2 + kz − yk2 > (kx − yk + kz − yk)2 > kx − zk2 ,

we have for all indices i, j ∈ [1 : k] that 2

kF (u) − ci k >

(j)

p − ci 2 2

2

− F (u) − p(j) .

(23)

Our proof proceeds by considering three cases. Let i ∈ [1 : k] be the index from the hypothesis in Lemma 6.2. Case 1. Suppose the first conclusion of Lemma 6.2 holds. For every index j 6= i let (



pσ(j) , if pσ(j) − cj > pσ(i) − cj ; γ(j) p = pσ(i) , otherwise. Then by combining (23), Corollary 6.1 and Lemma 3.1, we have Λ >

1X 2

X

j6=i u∈Aj ∩Pγ(j)

>

2 X

γ(j) − cj − du p

X

j6=i u∈Aj ∩Pγ(j)



µ(Aj ∩ Pγ(j) ) 1 X 3k − 1+ 16 min{µ(Pσ(i) ), µ(Pσ(j) )} Ψ j6=i



·

2

γ(j)

F (u) − p

ε 2k2 k2 > − . Ψ 16 Ψ

Case 2. Suppose the second conclusion of Lemma 6.2 holds. Notice that if µ(Ai ∩ Pσ(i) ) 6 (1 − ε) · µ(Pσ(i) ) then µ(Pσ(i) \Ai ) > ε · µ(Pσ(i) ) and thus we can argue as in Case 1. Hence, we can assume that it holds µ(Ai ∩ Pσ(i) ) > (1 − ε) · µ(Pσ(i) ). (24) We proceed by analyzing subcases.

σ(j)

two

σ(i)

a) If p − ci > p − ci holds for all j 6= i then by combining (23), Corollary 6.1 and Lemma 3.1 it follows

2

2 X

X X 1X



Λ > du pσ(j) − ci −

F (u) − pσ(j) 2 j6=i u∈Ai ∩Pσ(j) j6=i u∈Ai ∩Pσ(j)   2 µ(Ai ∩ Pσ(j) ) 1X 3k ε 2k2 k > − 1+ > − . · 2 min{µ(Pσ(i) ), µ(Pσ(j) )} Ψ Ψ 16 Ψ j6=i



b) Suppose there is an index j 6= i such that pσ(j) − ci < pσ(i) − ci . Then by triangle inequality combined with Corollary 6.1 we have



2 1

−1



σ(i) . − ci > pσ(i) − pσ(j) > 8 · min{µ(Pσ(i) ), µ(Pσ(j) )}

p 4 18

Thus, by combining (23), (24) and Lemma 3.1 we obtain

2

2

X X 1



du F (u) − pσ(i) du pσ(i) − ci − Λ > 2 u∈Ai ∩Pσ(i) u∈Ai ∩Pσ(i)   2 µ(Ai ∩ Pσ(i) ) 1 3k 1 − ε 2k2 k > · − 1+ > − . · 16 min{µ(Pσ(i) ), µ(Pσ(j) )} Ψ Ψ 16 Ψ Case 3. Suppose the third conclusion of Lemma 6.2 holds, i.e., σ is not a permutation. Then there is an index ℓ ∈ [1 : k] \ {σ(1), . . . , σ(k)} and for every index j ∈ [1 : k] let (



pℓ , if pℓ − cj > pσ(j) − cj ; γ(j) p = pσ(j) , otherwise. By combining (23), Corollary 6.1 and Lemma 3.1 it follows that k

Λ >

1X 2

X

j=1 u∈Aj ∩Pγ(j)

>

k

2 X

γ(j) − cj − du p



j=1 u∈Aj ∩Pγ(j)

k µ(Aj ∩ Pγ(j) ) 1 X 3k − 1+ 16 min{µ(Pσ(j) ), µ(Pℓ )} Ψ j=1

X



·

2

γ(j) F (u) − p du

1 2k2 k2 > − . Ψ 16 Ψ



We are now ready to prove Lemma 2.4. Proof of Lemma 2.4. We apply Lemma 6.2 with ε′ = ε/k. Then by Lemma 6.3 we have Cost({Ai , ci }ki=1 ) >

ε 2k2 − , 16k Ψ

and the desired result follows by setting ε > 64α · k3 /Ψ.

7



The Normalized Spectral Embedding is ε-separated

In this section, we prove that the normalized spectral embedding XV is ε-separated. Proof of Theorem 2.6 We establish first a lower bound on △k−1 (XV ). Lemma 7.1. Let G be a graph that satisfies Ψ = 204 · k3 /δ for some δ ∈ (0, 1/2]. Then for δ′ = 2δ/204 it holds 1 δ′ △k−1 (XV ) > − . (25) 12 k Before we present the proof of Lemma 7.1 we show that it implies (11). By Lemma 3.1 we have △k (XV ) 6

2k2 δ′ = . Ψ k

Moreover, by applying Lemma 7.1 with k/δ > 109 and ε = 6 · 10−7 we obtain △k−1 (XV ) >

1 δ′ 1 2 δ 1010 δ 1 δ′ 1 − = − 4· > · = · > 2 · △k (XV ). 5 2 12 k 12 20 k 9·2 k ε k ε 19

Proof of Lemma 7.1 We argue in a similar manner as in Lemma 6.3 (c.f. Case 3). We start by giving some notation, then we establish Lemma 7.2 and apply it in the proof of Lemma 7.1. We redefine the function σ (c.f. (20)) such that for any two partitions (P1 , . . . , Pk ) and (Z1 , . . . , Zk−1 ) of V , we define a mapping σ : [1 : k − 1] 7→ [1 : k] by µ(Zi ∩ Pj ) , µ(Pj ) j∈[1:k]

σ(i) = arg max

for every i ∈ [1 : k − 1].

We lower bound now the clusters overlapping in terms of the volume between any k-way and (k − 1)-way partitions of V . Lemma 7.2. Suppose (P1 , . . . , Pk ) and (Z1 , . . . , Zk−1 ) are partitions of V . Then for any index ℓ ∈ [1 : k] \ {σ(1), . . . , σ(k − 1)} (there is at least one such ℓ) and for every i ∈ [1 : k − 1] it holds   µ(Zi ∩ Pσ(i) ), µ(Zi ∩ Pℓ ) > τi · min µ(Pℓ ), µ(Pσ(i) ) ,

where

Pk−1 i=1

τi = 1 and τi > 0.

Proof. By pigeonhole principle there is an index ℓ ∈ [1 : k] such that ℓ ∈ / {σ(1), . . . , σ(k − 1)}. Thus, for every i ∈ [1 : k − 1] we have σ(i) 6= ℓ and µ(Zi ∩ Pσ(i) ) µ(Zi ∩ Pℓ ) > , τi , µ(Pσ(i) ) µ(Pℓ ) where

Pk−1 i=1

τi = 1 and τi > 0 for all i. Hence, the statement follows.



We present now the proof of Lemma 7.1. Proof of Lemma 7.1. Let (Z1 , . . . , Zk−1 ) be a (k − 1)-way partition of V with centers c′1 , . . . , c′k−1 that achieves △k−1 (XV ), and (P1 , . . . , Pk ) be a k-way partition of V achieving ρbavr (k). Our goal now is to lower bound the optimal (k − 1)-means cost △k−1 (XV ) =

k k−1 X X

X

i=1 j=1 u∈Zi ∩Pj

2 du F (u) − c′i .

(26)

By Lemma 7.2 there is an index ℓ ∈ [1 : k] \ {σ(1), . . . , σ(k − 1)}. For i ∈ [1 : k − 1] let (







p − c′ > pσ(i) − c′ ; p , if i i pγ(i) = pσ(i) , otherwise. Then by combining Corollary 6.1 and Lemma 7.2, we have

2   −1 

γ(i)

− c′i > 8 · min µ(Pℓ ), µ(Pσ(i) ) and µ(Zi ∩ Pγ(i) ) > τi · min µ(Pℓ ), µ(Pσ(i) ) , (27)

p where

Pk−1 i=1

τi = 1. We now lower bound the expression in (26). Since

2

2

γ(i) ′ γ(i)

F (u) − c′ 2 > 1 p − c F (u) − p −



, i i 2 20

it follows for δ′ = 2δ/204 that △k−1 (XV ) =

k k−1 X X

X

i=1 j=1 u∈Zi ∩Pj k−1

>

1X 2

X

i=1 u∈Zi ∩Pγ(i)

> >

1 2

k−1 X i=1

k−1 X

′ 2

du F (u) − ci >

X

i=1 u∈Zi ∩Pγ(i)

k−1

2 X

du pγ(i) − c′i −

X

i=1 u∈Zi ∩Pγ(i)

2 du F (u) − c′i

2

du F (u) − pγ(i)

k X X

2 µ(Zi ∩ Pγ(i) )  − du F (u) − pi 8 · min µ(Pγ(i) ), µ(Pσ(i) ) i=1 u∈P i

δ′

1 − , 16 k

where the last inequality holds due to (27) and Lemma 3.1.

8



An Efficient Spectral Clustering Algorithm

In this section, we apply the proof techniques developed by Boutsidis et al. [1, 2] to our setting. More precisely, we prove that any α-approximate k-means algorithm that runs on an approximate f′ computed by the power method, yields an approximate clustering normalized spectral embedding Y ′ f Xα of the normalized spectral embedding Y ′ . f′ is ε-separated. This allows us to apply Furthermore, we prove under our gap assumption that Y the variant of Lloyd’s k-means algorithm analyzed by Ostrovsky et al. [7] to efficiently compute f′ . Then we use part (a) of Theorem 1.2 to establish the desired statement. X α This Section is organized as follows. In Subsection 8.1, we prove Lemma 2.7. In Subsection 8.2, we present the proof of Theorem 2.8. In Subsection 8.3, we establish Theorem 2.9. Based on the results from the preceding three subsections, we prove part (b) of Theorem 1.2 in Subsection 8.4.

8.1

Proof of Lemma 2.7

We argue in a similar manner as in [1, Lemma 7]. By the eigenvalue decomposition theorem B = U ΣU T , where the columns of U = [Uk Uρ−k ] ∈ Rn×ρ are the orthonormal eigenbasis of LG , Σ ∈ Rρ×ρ is a positive diagonal matrix such that Σii = 2 − λi > 0 for all i, and ρ = rank(B). Since B p = U Σp U T , it follows that ker(B p S) = ker(U T S). By Lemma A.3 with probability at least 1 − e−2n we have rank(U T S) = k and thus matrix B p S has k singular values. Furthermore, eΣ e Ve T of B p S satisfies: U e ∈ Rn×k is a matrix with orthonormal columns, the SVD decomposition U e ∈ Rk×k is a positive diagonal matrix and Ve T ∈ Rk×k is an orthonormal matrix. We denote by Σ and we use the following four facts eR U   eR σi U   eR σi U

e

X U 2

e Ve T ∈ Rk×k , R,Σ

T S = Uk Σpk UkT S + Uρ−k Σpρ−k Uρ−k   > σk Uk Σpk UkT S > (2 − λk )p · σk UkT S

= σi (R)

 

e e , > X U

· σk U 2

21

for any X ∈ Rℓ×k

(28) (29) (30) (31)

(28) follows from the eigenvalue decomposition of B and the fact that B p = U Σp U T ; (29) follows by (28) due to Uk and Uρ−k span orthogonal spaces, and since the minimum singular value of a e TU e = Ik ; Recall product is at least the product of the minimum singular values; (30) holds due to U −2n that with probability at least 1 − e we have σk (R) > 0 and hence (31) follows by kXk2 = max x6=0

kXRxk2 kXRk2 kXRxk2 6 max = . x6=0 σk (R) kxk2 kRxk2 σk (R)

[4, Theorem 2.6.1] shows that for every two m × k orthonormal matrices W, Z with m > k it holds





T ⊥ T ⊥

W W T − ZZ T = W Z Z W =

,

2 2

2

where [Z, Z ⊥ ] ∈ Rm×m is full orthonormal basis. Therefore, we have

 





⊥ T

e T ⊥

T e T T e e e

U = U

Uk Uk − U U = U Uk = Uk

U

, ρ−k

2 2 2

(32)

2

T e e where the last equality

is due to Un−ρ U = 0, since U is in the range of U = Uρ .

T e To upper bound Uρ−k U we establish the next two inequalities: 2







T e

T e

T e p U U · σ(R) > U U > U U R

ρ−k · (2 − λk ) · σk UkT S ,

ρ−k

ρ−k 2 2

2



p

T e p T T S ,

Uρ−k U R = Σρ−k Uρ−k S 6 (2 − λk+1 ) · σ1 Uρ−k 2

2

(33) (34)

where (33) follows by (31), (30) and (29); and (34) is due to (28) and Σ11 > · · · > Σnn > 0. By combining Lemma A.1 and Lemma A.2 with probability at least 1−e−2n −3δ both inequalities hold   √ δ T √ 6 σk UkT S and σ1 Uρ−k (35) S 6 4 n. k Using (32), (33), (34) and (35) we obtain





T e

eU e T U 6 4δ−1 · nk · γkp . (36)

Uk UkT − U

= Uρ−k 2

2

Since kM kF 6

p

eU e T ) 6 2k, it follows rank(M ) · kM k2 for every matrix M and rank(Uk UkT − U



p eU e T eU e T

Uk UkT − U

6 8δ−1 · n1/2 k3/2 · γk 6 ǫ,

6 2k · Uk UkT − U 2

F

where the last two inequalities are due to (36) and the choise of γk .

8.2

Proof of Theorem 2.8

We establish several technical Lemmas that combined with Lemma 2.7 allow us to apply the proof techniques in [2, Theorem 6]. Lemma 8.1. X ′ X ′T is a projection matrix.

p Proof. By construction there are d(v) many copies of row Uk (v, :)/ d(v) in Y ′ , for every vertex ′ v ∈ V . We may assume w.l.o.g. p that a k-means algorithm outputs an indicator matrix X such that all copies of row belong to the same cluster, for every v ∈ V . Moreover, by p Uk (v, :)/ d(v) ′ ′ = 0 otherwise, where ′ definition Xij = 1/ µ(Cj ) if row Yi,: belongs to the j-th cluster Cj and Xij matrix X ′ ∈ Rm×k . Therefore, it follows that X ′T X ′ = Ik×k and thus (X ′ X ′T )2 = X ′ X ′T .  22

f′ T Y f′ . Lemma 8.2. It holds that Y ′T Y ′ = Ik×k = Y

f′ T Y f′ = Ik×k follows similarly. Since Proof. We prove now Y ′T Y ′ = Ik×k , but the equality Y   U (1,j) √k 1d(1) d(1)  U (1,i)    Uk (n,i) T √k √   1T · · · 1 Y ′T Y ′ ij = ···   d(1) d(1) d(n) d(n) Uk (n,j) √ 1d(n) d(n)

= the statement follows.

n X

Uk (ℓ, i) Uk (ℓ, j) p = hUk (:, i), Uk (:, j)i = δij , d(ℓ) p d(ℓ) d(ℓ) ℓ=1



′ ′T f′ f′ T

T T e e Lemma 8.3. It holds that Y Y − Y Y = Y Y − Y Y . F



F

Proof. By definition



Y Y

′T

=

k X

′T ′ Y:,ℓ Y:,ℓ

ℓ=1

and ′ ′T Y:,ℓ Y:,ℓ



d(i)d(j)



 ′ = where Y:,ℓ  =

Uk (1,ℓ) √ 1d(1) d(1)



  ···  Uk (n,ℓ) √ 1d(n) d(n)

m×1

Uk (i, ℓ)Uk (j, ℓ) p · 1d(1) 1T d(j) . d(i)d(j)

The statement follows by establishing the following chain of equalities

′ ′T f′ f′ T 2

Y Y − Y Y

F

2 n  n X  X

′ ′T

T ′ ′ f f

Y Y −Y Y

=

d(i)d(j)

F

i=1 j=1

=

=

=

= =

2 n X n X k 

 X T

′ ′T f′ :,ℓ Y f′ Y:,ℓ Y:,ℓ −Y

:,ℓ

d(i)d(j) i=1 j=1 ℓ=1 F

2

!) ( k n X n X

X e e U (i, ℓ) U (j, ℓ) U (i, ℓ)U (j, ℓ)

k k p − p · 1d(i) 1T

d(j)

d(i)d(j) d(i)d(j) i=1 j=1 ℓ=1 F " k !#2 n n X Uk (i, ℓ)Uk (j, ℓ) U XX e (i, ℓ)U e (j, ℓ) p d(i)d(j) − p d(i)d(j) d(i)d(j) i=1 j=1 ℓ=1 " # k  n n X  2 X X e (i, ℓ)U e (j, ℓ) Uk (i, ℓ)Uk (j, ℓ) − U i=1 j=1 ℓ=1 n  n X X i=1 j=1

eU eT Uk UkT − U

2

ij

2

2



eU e T = Uk UkT − U

= Y Y T − Ye Ye T . F

F

23



Lemma 8.4. For any matrix U with orthonormal columns and every matrix A it holds



U U T − AAT U U T = U − AAT U . F F

(37)

Proof. The statement follows by the Frobenius norm property kM k2F = Tr[M T M ], the cyclic property of trace Tr[U M T M U T ] = Tr[M T M · U T U ] and the orthogonality of matrix U .  We are now ready to prove Theorem 2.8. Proof of Theorem 2.8. Using Lemma 2.7 and Lemma 8.3 with probability at least 1 − 2e−2n − 3δp we have



′ ′T f′ f′ T

Y Y − Y Y = Y Y T − Ye Ye T 6 ε. F

F

T

f′ Y f′ + E such that kEk 6 ε. By combining Lemma 8.2 and Lemma 8.4, (37) holds Let Y ′ Y ′T = Y F f′ . Thus, by Lemma 8.1 and the proof techniques in [2, Theorem 6] it for the matrices Y ′ and Y follows

 T 

′ T ′ √ 



′ ′ ′ ′ f′ f

Y − Xα X 6 Y α · − X X Y + 2ε . (38)

Y

α opt opt

F F



The desired statement follows by simple algebraic manipulations of (38).

8.3

Proof of Theorem 2.9

In this subsection, we show under our gap assumption that the approximate normalized spectral f′ is ε-separated, i.e. △k (X fV ) < 5ε2 ·△k−1 (X fV ). Our analysis builds upon Theorem 2.6, embedding Y Theorem 2.8 and the proof techniques in [2, Theorem 6]. Before we present the proof of Theorem 2.9 we establish two technical Lemmas. Lemma 8.5. If Ψ > 204 · k3 /δ for δ ∈ (0, 1/2] it holds     2 − λk 1 4δ ln > 1 − 4 2 λk+1 . 2 − λk+1 2 20 k Proof. Lee et al. [5] proved that higher order Cheeger’s inequality satisfies p λk /2 6 ρ(k) 6 O(k 2 ) · λk .

(39)

Using the LHS of (39) we have

k3 ρbavr (k) = k2

k X i=1

φ(Pi ) > k2 max φ(Pi ) > k2 · ρ(k) > i∈[1:k]

k 2 λk 2

and thus we can upper bound the k-th smallest eigenvalue of LG by λk 6 2k · ρbavr (k).

Moreover, by the gap assumption we have λk+1 > The statement follows by

204 k2 204 k2 · 2k · ρbavr (k) > · λk . 2δ 2δ

1 − 204δk2 · λk+1 2 − λk > exp > 2 − λk+1 1 − 21 λk+1

24

    1 4δ 1 − 4 2 λk+1 . 2 20 k 

′(k)

′ For our next results we need to introduce some notation. We use interchangeably Xopt with Xopt to denote the optimal indicator matrix for the k-means problem on XV that is induced by the rows ′(k−1) the optimal indicator matrix for the (k − 1)-means of matrix Y ′ . Similarly, we denote by Xopt problem on XV . ′(k) Based on Lemma 3.1 and the definition of Y ′ and Xopt we obtain the following statement.

Corollary 8.6. Let G be a graph that satisfies Ψ = 204 · k3 /δ, δ ∈ (0, 1/2] and k/δ > 109 . Then it holds

T 

2

′ 1 ′(k) ′(k) ′

Y − X X Y opt opt

6 8 · 1013 .

F

We are now ready to prove Theorem 2.9.

Proof of Theorem 2.9. By Theorem 2.6 we have

T  T 





Y − X ′(k) X ′(k) Y ′ 6 ε Y ′ − X ′(k−1) X ′(k−1) Y ′ . opt opt opt opt



(40)

F

F

We set the approximation parameter in Theorem 2.8 to

  1p 1 ′(k) T ′ ′(k) ′ −O(1) ′

△k (XV ) = Y − Xopt Xopt ε , , Y

>n 4 4 F

(41)

and we note that by Theorem 2.6 it holds

ε′ 6

εp △k−1 (XV ). 4

(42)

n We construct now matrix Ye via the power method with p > Ω( λlnk+1 ). By combining Lemma 2.7 and Lemma 8.3 we obtain with high probability that



′ ′T f′ f′ T

Y Y − Y Y = Y Y T − Ye Ye T 6 ε′ . F

f′ Y f′ T + E Let Y ′ Y ′T = Y thus (37) in Lemma 8.4 we have r   fV △k X =

F

f′ T Y f′ and such that kEkF 6 ε′ . By Lemma 8.2 we have Y ′T Y ′ = Ik×k = Y f′ . Therefore, by Lemma 8.1 holds for the orthonormal matrices Y ′ and Y



 T  T



T T ] ] ]

f′ ] ′(k) ′(k) f′ Y f′ − X ′(k) X ′(k) f′ Y f′ f′ = Y Y Y

Y − Xopt Xopt opt opt



F

! F     T T

] ] ] ]

′(k) ′(k) ′(k) ′(k) = Y ′ Y ′T − Xopt Xopt Y ′ Y ′T − I − Xopt Xopt E

F

 T

] ]

′(k) ′(k) 6 kEkF + Y ′ − Xopt Xopt Y ′

F

By Lemma 8.5 we can apply Theorem 2.8 which yields

2

T 

2

 

′ ]

′ ] ′(k) ′(k) ′(k) T ′ ′(k) ′ ′ ′2

Y 6 (1 + 4ε ) · Y − Xopt Xopt Y

Y − Xopt Xopt

+ 4ε .

F F

25

By Corollary 8.6 we derive an upper bound on the optimal k-means cost of XV

T 

2

′ 1

Y − X ′(k) X ′(k) Y ′ 6 opt opt

8 · 1013 F

that combined with the definition of ε′ gives s r

   T

2 ′(k) ′(k) ′ ′ ′ ′ + 4ε′2 fV 6 ε + (1 + 4ε ) △k X Y − X X Y opt opt

F

T  p

′ ′(k) ′(k) 6 2 Y ′

= 2 △k (XV )

Y − Xopt Xopt F p 6 2ε · △k−1 (XV ).

(43)

(44)

Moreover, it holds that p

 T

T 

′ ^ ^

′(k−1) ′(k−1) ′(k−1) ′(k−1) ′ ′ ′

△k−1 (XV ) = Y − Xopt Xopt Y 6 Y − Xopt Xopt Y

F F

 T

^ ^

′(k−1) ′(k−1) = Y ′ Y ′T − Xopt Xopt Y ′ Y ′T

F

T !    T

T T ^ ^ ^

f′ f′ ′(k−1) ′(k−1) ′(k−1) ′(k−1) f′ Y f′ + I − X^ E X Y Xopt Y − Xopt = Y

opt opt

F

T 

^ ^

f′ ′(k−1) ′(k−1) f′ Xopt Y 6 Y − Xopt

+ kEkF

F r   εp fV + △k−1 X △k−1 (XV ) 6 4

and thus

p

r    ε fV . △k−1 (XV ) 6 1 + △k−1 X 2

The statement follows by combining (44) and (45). r r     p f fV . △k XV 6 2ε · △k−1 (XV ) 6 (2 + ε) · ε · △k−1 X

8.4

(45)



Proof of Part (b) of Theorem 1.2

n Let p = Θ( λlnk+1 ). We compute the matrix B p S in time O(mkp) and its singular value decomposition f′ (c.f. (14)). eΣ e Ve T in time O(nk 2 ). Based on it, we construct in time O(mk) matrix Y U     fV is ε-separated for ε = 6 · 10−7 , i.e. △k X fV < 5ε2 · △k−1 X fV . Hence, By Theorem 2.9, X

f′ that has by Theorem 2.5 there is an algorithm that outputs a clustering with indicator matrix X α a cost at most

   T  T

2

2 1 ′ ′ g g ′ ′ f f′ f′ f f′ − X f′ X

Y

· Y − X X Y Y 6 1 + α α opt opt



1010 F F 26

2 4 −10 with constant probability (close to 1) in√time

O(mk + k ), where

α = 1 + 10 .  T ′

′ ′ Xopt Y , where δA ∈ (0, 1) is to be We apply Theorem 2.8 with ε′ = 4δA Y ′ − Xopt F determined soon, and since by Corollary 8.6 we have

T ′ 1



′ ′ Xopt Y < 6,

Y − Xopt 10 F

with constant probability it follows that

 T

2

′ T ′



2 ′ ′ ′ ′ f′ f′ X

Y − X 6 (1 + 4ε )α − X Y X Y + 4ε′2

Y opt opt α α

F F  

 p T ′  δA ′ T ′

2

′ ′ ′ ′ 1 + δA Y ′ − Xopt = · Y − Xopt Xopt Y Xopt Y α+ 4 F F √     

2  δA 1 δA ′ T ′ ′ ′ 6 1+ · 1 + 10 + · Y − Xopt Xopt Y . 106 10 4 F

f′ yields a multiplicative approximation of XV that satisfies for δA = 1/106 The indicator matrix X α

   T

2

′ T ′ 1



2 ′ ′ ′ f′ f′ X

Y − X (46) − X X Y . 6 1 + Y

Y opt opt α α

6 10 F F

The statement follows by part (a) of Theorem 1.2 applied to the partition (A1 , . . . , Ak ) of V that f′ . is induced by the indicator matrix X α

9

Parameterized Upper Bound on ρbavr(k)

A k-disjoint tuple Z is a k-tuple (Z1 , . . . , Zk ) of disjoint subsets of V . A k-way partition (P1 , . . . , Pk ) of V is compatible with a k-disjoint tuple Z if Zi ⊆ Pi for all i. We then define Si = Pi \Zi and use PZ to denote all partitions compatible with Z. We use Zk to denote all k-tuples Z with ρ(k) = Φ(Z) = Φ(Z1 , . . . , Zk ). The elements of Zk are called optimal (k-disjoint) tuples. We denote all partitions compatible with some optimal k-tuple by Pk = ∪Z∈Zk PZ .

(47)

Oveis Gharan and Trevisan [8, Lemma 2.5] proved that for every k-disjoint tuple Z ∈ Zk there is a k-way partition (P1 , . . . , Pk ) ∈ PZ with Φ(P1 , . . . , Pk ) 6 kρ(k).

(48)

Remark 9.1. In this section, we assume that every partition (P1 , . . . , Pk ) ∈ Pk satisfies Φ(P1 , . . . , Pk ) > ρ(k),

(49)

since otherwise ρb(k) = ρ(k).

We refine the analysis in [8] and prove a parameterized upper bound on ρbavr (k) that depends on a natural combinatorial parameter and the average conductance of a k-disjoint tuple Z ∈ Zk . Before we state our results, we need some notation. We define the order k inter-connection constant of a graph G by ρP (k) ,

min

P1 ,...,Pk ∈Pk

27

ΦIC (P1 , . . . , Pk )

(50)

where ΦIC (P1 , . . . , Pk ) , max Si 6=∅

|E(Si , V \Pi )| − |E(Si , Zi )| . |E(Pi , V \Pi )|

(51)

We will prove in Lemma 9.5 that $\rho_{\mathcal{P}}(k) \in (0, 1 - 1/(k-1)]$. Furthermore, let $\mathcal{O}_{\mathcal{P}}$ be the set of all $k$-way partitions $(P_1,\ldots,P_k)\in\mathcal{P}_k$ with $\Phi_{IC}(P_1,\ldots,P_k) = \rho_{\mathcal{P}}(k)$, i.e., the set of all partitions that achieve the order $k$ inter-connection constant. Let
\[
\widetilde{\rho}_{\mathrm{avr}}(k) = \min_{(P_1,\ldots,P_k)\in\mathcal{O}_{\mathcal{P}}} \frac{1}{k}\sum_{i=1}^{k} \phi(P_i) \tag{52}
\]
be the minimal average conductance over all $k$-way partitions in $\mathcal{O}_{\mathcal{P}}$. By construction it holds that
\[
\widehat{\rho}_{\mathrm{avr}}(k) \le \widetilde{\rho}_{\mathrm{avr}}(k). \tag{53}
\]

We now present the main result of this section, which upper bounds $\widetilde{\rho}_{\mathrm{avr}}(k)$.

Theorem 9.2. For any graph $G$ there exists a $k$-way partition $(P_1,\ldots,P_k)\in\mathcal{O}_{\mathcal{P}}$, compatible with a $k$-disjoint tuple $Z$ with $\Phi(Z_1,\ldots,Z_k) = \rho(k)$, such that for $\kappa_{\mathcal{P}} \triangleq [1-\rho_{\mathcal{P}}(k)]^{-1} \in (1, k-1]$ it holds
\[
\widetilde{\rho}_{\mathrm{avr}}(k) \le \frac{\kappa_{\mathcal{P}}}{k}\sum_{i=1}^{k}\phi(Z_i)
\]
and, in addition, for every $i\in[1:k]$
\[
\phi(P_i) \le \kappa_{\mathcal{P}}\cdot\phi(Z_i).
\]

Our goal now is to prove Theorem 9.2. We first establish a few useful lemmas that will be used to prove Lemma 9.5 and Theorem 9.2. Oveis Gharan and Trevisan [8, Algorithm 2 and Fact 2.4] showed the following.

Fact 9.3 ([8]). For any $k$-disjoint tuple $Z$, there is a $k$-way partition $(P_1,\ldots,P_k)\in\mathcal{P}_Z$ such that

1. For every $i\in[1:k]$, $Z_i \subseteq P_i$.

2. For every $i\in[1:k]$ and every subset $\emptyset \ne S \subseteq P_i\setminus Z_i$ it holds
\[
|E(S, P_i\setminus S)| \ge \frac{1}{k}\,|E(S, V\setminus S)|.
\]

Lemma 9.4. For any $k$-disjoint tuple $Z$, there exists a $k$-way partition $(P_1,\ldots,P_k)\in\mathcal{P}_Z$ that satisfies
\[
\max_{S_i\ne\emptyset} \frac{|E(S_i, V\setminus P_i)| - |E(S_i, Z_i)|}{|E(P_i, V\setminus P_i)|} \le 1 - \frac{1}{k-1}.
\]
Proof. By Fact 9.3 there is a $k$-way partition $(P_1,\ldots,P_k)\in\mathcal{P}_Z$ such that for all $i$ it holds
\[
|E(S_i, Z_i)| = |E(S_i, P_i\setminus S_i)| \ge \frac{1}{k}\,|E(S_i, V\setminus S_i)| = \frac{1}{k}\big(|E(S_i, V\setminus P_i)| + |E(S_i, Z_i)|\big)
\]
and hence
\[
|E(S_i, Z_i)| \ge \frac{1}{k-1}\,|E(S_i, V\setminus P_i)|.
\]
Since $|E(S_i, V\setminus P_i)| \le |E(P_i, V\setminus P_i)|$, the claimed bound follows. $\square$

Lemma 9.5. The order $k$ inter-connection constant of a graph $G$ is bounded by
\[
0 < \rho_{\mathcal{P}}(k) \le 1 - \frac{1}{k-1}.
\]

Proof. We prove first the upper bound. By Lemma 9.4 there is a $k$-way partition $(P_1,\ldots,P_k)\in\mathcal{P}_k$ compatible with a $k$-disjoint tuple $Z$ such that
\[
\max_{S_i\ne\emptyset} \frac{|E(S_i, V\setminus P_i)| - |E(S_i, Z_i)|}{|E(P_i, V\setminus P_i)|} \le 1 - \frac{1}{k-1}.
\]
Therefore,
\begin{align*}
\rho_{\mathcal{P}}(k) = \min_{(P_1',\ldots,P_k')\in\mathcal{P}_k} \Phi_{IC}\big(P_1',\ldots,P_k'\big)
&\le \Phi_{IC}(P_1,\ldots,P_k) \\
&= \max_{S_i\ne\emptyset} \frac{|E(S_i, V\setminus P_i)| - |E(S_i, Z_i)|}{|E(P_i, V\setminus P_i)|} \le 1 - \frac{1}{k-1}.
\end{align*}
We prove now the lower bound. Suppose for contradiction that $\rho_{\mathcal{P}}(k) \le 0$. By definition we have
\begin{align*}
\phi(P_i) = \frac{|E(P_i, V\setminus P_i)|}{\mu(P_i)}
&= \frac{|E(Z_i, V\setminus Z_i)| + |E(S_i, V\setminus P_i)| - |E(S_i, Z_i)|}{\mu(P_i)} \\
&\le \phi(Z_i) + \frac{|E(S_i, V\setminus P_i)| - |E(S_i, Z_i)|}{\mu(P_i)}.
\end{align*}
By (50), it holds for any $S_i \ne \emptyset$ that $|E(S_i, V\setminus P_i)| - |E(S_i, Z_i)| \le \rho_{\mathcal{P}}(k)\cdot|E(P_i, V\setminus P_i)|$ and thus
\[
\phi(P_i) \;
\begin{cases}
\le \phi(Z_i) - |\rho_{\mathcal{P}}(k)|\cdot\phi(P_i), & \text{if } S_i \ne \emptyset;\\
= \phi(Z_i), & \text{otherwise.}
\end{cases}
\]
However, this contradicts $\Phi(P_1,\ldots,P_k) > \rho(k)$ and thus the statement follows. $\square$

We are now ready to prove Theorem 9.2.

Proof of Theorem 9.2. Let $(P_1,\ldots,P_k)\in\mathcal{O}_{\mathcal{P}}$ be a $k$-way partition compatible with a $k$-disjoint tuple $Z\in\mathcal{Z}_k$ that satisfies $\Phi(Z_1,\ldots,Z_k) = \rho(k)$. By Lemma 9.5 there is a real number
\[
\kappa_{\mathcal{P}} \triangleq [1 - \rho_{\mathcal{P}}(k)]^{-1} \in (1, k-1]. \tag{54}
\]
We argue in a similar manner as in Lemma 9.5 to obtain
\[
\phi(P_i) \;
\begin{cases}
\le \phi(Z_i) + \rho_{\mathcal{P}}(k)\cdot\phi(P_i), & \text{if } S_i \ne \emptyset;\\
= \phi(Z_i), & \text{otherwise}
\end{cases} \tag{55}
\]
(note that here $\rho_{\mathcal{P}}(k) > 0$ by Lemma 9.5, so the error term enters with a positive sign). By combining (54) and the first conclusion of (55) we have
\[
\phi(P_i) \le [1 - \rho_{\mathcal{P}}(k)]^{-1}\cdot\phi(Z_i) = \kappa_{\mathcal{P}}\cdot\phi(Z_i). \tag{56}
\]
The statement follows by combining (52) and (56), since
\[
\widetilde{\rho}_{\mathrm{avr}}(k) \le \frac{1}{k}\sum_{i=1}^{k}\phi(P_i) \le \frac{\kappa_{\mathcal{P}}}{k}\sum_{i=1}^{k}\phi(Z_i). \qquad\square
\]

References

[1] C. Boutsidis and M. Magdon-Ismail. Faster SVD-truncated regularized least-squares. In 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, June 29 - July 4, 2014, pages 1321-1325, 2014. doi: 10.1109/ISIT.2014.6875047.

[2] C. Boutsidis, P. Kambadur, and A. Gittens. Spectral clustering via the power method - provably. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 40-48, 2015. URL http://jmlr.org/proceedings/papers/v37/boutsidis15.html.

[3] K. R. Davidson and S. J. Szarek. Chapter 8: Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, volume 1, pages 317-366. Elsevier Science B.V., 2001. doi: 10.1016/S1874-5849(01)80010-3. URL http://www.sciencedirect.com/science/article/pii/S1874584901800103.

[4] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. The Johns Hopkins University Press, Baltimore, London, 2012. ISBN 0-8018-5413-X. URL http://opac.inria.fr/record=b1103116.

[5] J. R. Lee, S. Oveis Gharan, and L. Trevisan. Multi-way spectral partitioning and higher-order Cheeger inequalities. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC '12, pages 1117-1130, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1245-5. doi: 10.1145/2213977.2214078.

[6] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849-856. MIT Press, 2002.

[7] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1-28:22, Jan. 2013. ISSN 0004-5411. doi: 10.1145/2395116.2395117.

[8] S. Oveis Gharan and L. Trevisan. Partitioning into expanders. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1256-1266, 2014. doi: 10.1137/1.9781611973402.93.

[9] R. Peng, H. Sun, and L. Zanetti. Partitioning well-clustered graphs: Spectral clustering works! In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 1423-1455, 2015. URL http://jmlr.org/proceedings/papers/v40/Peng15.html; see also http://arxiv.org/abs/1411.2021.

[10] A. Sankar, D. A. Spielman, and S. Teng. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM J. Matrix Analysis Applications, 28(2):446-476, 2006. doi: 10.1137/S0895479803436202.

[11] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000. ISSN 0162-8828. doi: 10.1109/34.868688.

[12] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

A Singular Value Bounds for Random Matrices

Lemma A.1 (Norm of a Gaussian Matrix [3]). Let $M \in \mathbb{R}^{n\times k}$ be a matrix of i.i.d. standard Gaussian random variables, where $n \ge k$. Then, for $t \ge 4$, $\Pr\{\sigma_1(M) \ge t\sqrt{n}\} \le \exp\{-nt^{2}/8\}$.

Lemma A.2 (Invertibility of a Gaussian Matrix [10]). Let $M \in \mathbb{R}^{n\times n}$ be a matrix of i.i.d. standard Gaussian random variables. Then, for any $\delta \in (0,1)$, $\Pr\{\sigma_n(M) \le \delta/(2.35\sqrt{n})\} \le \delta$.

Lemma A.3 (Rectangular Gaussian Matrix). Let $S \in \mathbb{R}^{n\times k}$ be a matrix of i.i.d. standard Gaussian random variables, $V \in \mathbb{R}^{n\times\rho}$ be a matrix with orthonormal columns and $n \ge \rho \ge k$. Then, with probability at least $1 - e^{-2n}$ it holds $\mathrm{rank}(V^{T} S) = k$.

Proof. Let $S' \in \mathbb{R}^{n\times\rho}$ be an extension of $S$ such that $S' = [S\ S'']$, where $S'' \in \mathbb{R}^{n\times(\rho-k)}$ is a matrix of i.i.d. standard Gaussian random variables. Notice that $V^{T} S' \in \mathbb{R}^{\rho\times\rho}$ is a matrix of i.i.d. standard Gaussian random variables. We now apply Lemma A.2 with $\delta = e^{-2n}$, which yields that with probability at least $1 - e^{-2n}$ we have $\sigma_\rho(V^{T} S') > 1/(2.35\, e^{2n}\sqrt{\rho}) > 0$ and thus $\mathrm{rank}(V^{T} S') = \rho$. In particular, $\mathrm{rank}(V^{T} S) = k$ with probability at least $1 - e^{-2n}$. $\square$
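Lemma A.3 is easy to check numerically on random instances. The snippet below is only a sanity check, not a substitute for the proof; the dimensions are chosen arbitrarily.

import numpy as np

rng = np.random.default_rng(1)
n, rho, k = 200, 20, 7                                   # n >= rho >= k
V, _ = np.linalg.qr(rng.standard_normal((n, rho)))       # orthonormal columns
S = rng.standard_normal((n, k))                          # i.i.d. standard Gaussian sketch
print(np.linalg.matrix_rank(V.T @ S) == k)               # True (with overwhelming probability)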
