Deterministic Feature Selection for k-means Clustering

Christos Boutsidis and Malik Magdon-Ismail
Abstract—We study feature selection for k-means clustering. Although the literature contains many methods with good empirical performance, algorithms with provable theoretical behavior have only recently been developed. Unfortunately, these algorithms are randomized and fail with, say, a constant probability. We address this issue by presenting a deterministic feature selection algorithm for k-means with theoretical guarantees. At the heart of our algorithm lies a deterministic method for decompositions of the identity.

Index Terms—clustering, k-means, feature selection, dimension reduction, deterministic algorithms.
C. Boutsidis is with the Mathematical Sciences Department, IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598. E-mail: [email protected]. M. Magdon-Ismail is with the Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, 12180. E-mail: [email protected].

1 INTRODUCTION

This paper is about feature selection for k-means clustering, a topic that has received considerable attention from scientists and engineers. Arguably, k-means is the most widely used clustering algorithm in practice [35]. Its simplicity and effectiveness are remarkable among all the available methods [29]. On the negative side, using k-means to cluster high-dimensional data with, for example, billions of features is not simple and straightforward [15]; the curse of dimensionality makes the algorithm very slow. On top of that, noisy features often lead to overfitting, another undesirable effect. Therefore, reducing the dimensionality of the data by selecting a subset of the features, i.e. feature selection, and optimizing the k-means objective on the low-dimensional representation of the high-dimensional data is an attractive approach that not only makes k-means faster, but also more robust [14], [15]. The natural concern with throwing away potentially useful dimensions is that doing so could lead to high clustering error, so one has to select the features carefully to ensure that comparably good clusters can be recovered from the dimension-reduced data alone. Practitioners have developed numerous feature selection methods that work well empirically [14], [15]. The main focus of this work is on algorithms for feature selection with provable guarantees.

Recently, Boutsidis et al. described a feature selection algorithm that gives a theoretical guarantee on the quality of the clusters that are produced after reducing the dimension [5]. Their algorithm, which employs a technique of Rudelson and Vershynin [31], selects the features randomly with probabilities that are computed via the right singular vectors of the matrix containing the data ([7] describes a randomized algorithm with the same bound but faster running time). Although Boutsidis et al. give a strong theoretical bound for the quality of the resulting clusters (we will discuss this bound in detail later), the bound fails with some non-negligible probability, due to the randomness in how they sample the features. This means that every time the feature selection is performed, the algorithm could (a) fail, and (b) return a different answer. Since practitioners are not comfortable with such randomization and failures, it is rarely the case that a randomized algorithm makes an impact in practice. To better address the applicability of such feature selection algorithms for k-means, there is a need for deterministic, provably accurate feature selection algorithms. We present the first deterministic algorithms of this type. Our contributions can roughly be summarized as follows.
• Deterministic Supervised Feature Selection (Theorem 2). Given any dataset and any k-partition of this dataset, there is a small set of O(k) feature dimensions that can produce a k-partition that is no more than a constant factor worse in quality than the given k-partition. Moreover, this small set of feature dimensions can be computed in deterministic low-order polynomial time. This is the first deterministic algorithm of this type. Prior work [5], [7] offers a randomized algorithm that requires Ω(k log k) features for a comparable clustering quality.

• Existence of a small set of near optimal features (Corollary 3). We prove the existence of a small set of near optimal features. That is, given any dataset and the number of clusters k, there is a set of O(k) feature dimensions that can produce a k-partition that is no more than a constant factor worse in quality than the optimal k-partition of the dataset. The existence of such a small set of features was not known before. Prior work [5], [7] only implies the existence of Ω(k log k) features with comparable performance.

• Deterministic Unsupervised Feature Selection (Theorem 4). Given any dataset and the number of clusters k, it is possible, in deterministic low-order polynomial time, to select r feature dimensions, for any r > k, such that the optimal k-partition of the dimension-reduced data is no more than O(n/r) worse in quality than the optimal k-partition; here n is the number of features of the dataset. This is the first deterministic algorithm of this type. Prior work [5], [7] offers a randomized algorithm with error 3 + O(√(k log(k)/r)), for r = Ω(k log k).

• Unsupervised Feature Selection with a small subset of features (Theorem 5). Finally, given any dataset and the number of clusters k, in randomized low-order polynomial time it is possible to select a small number, r, of feature dimensions, with k < r = o(k log k), such that the optimal k-partition for the dimension-reduced data is no more than O(k log(k)/r) worse in quality than the optimal k-partition. In particular, this is the first (albeit randomized) algorithm of this type that can select a small subset of O(k) feature dimensions and provide an O(log k)-factor guarantee. Prior work [5], [7] is limited to selecting r = Ω(k log k) features. The new algorithm combines ideas from this paper with the technique of [5].
In order to get deterministic algorithms and be able to select O(k) feature dimensions, we use techniques that are completely different from those used in [5], [7]. Our approach is inspired by a recent deterministic result for decompositions of the identity which was introduced in [3] and subsequently extended in [4]. This general approach might also be of interest to the Machine Learning and Pattern Recognition communities, with potential applications to other problems involving subsampling, for example [9], [32].

1.1 Background

We provide the basic background on k-means clustering needed to describe our results in Section 2. Section 3 provides all the necessary background to describe our algorithms (Section 4) and to prove our main results (Section 5). Let us start with the definition of the k-means clustering problem (in Section 3 we provide an alternative definition using matrix notation, which will be useful in proving the main results of this work). Consider m points in an n-dimensional Euclidean space, P = {p_1, p_2, ..., p_m} ∈ R^n, and an integer k denoting the number of clusters. The objective of k-means is to find a k-partition of P such that points that are "close" to each other belong to the same cluster and points that are "far" from each other belong to different clusters. A k-partition of P is a collection S = {S_1, S_2, ..., S_k} of k non-empty pairwise disjoint sets which covers P. Let s_j = |S_j| be the size of S_j. For each set S_j, let µ_j ∈ R^n be its centroid (the mean point),

µ_j = (1/s_j) · Σ_{p_i ∈ S_j} p_i.
The k-means objective function is

F(P, S) = Σ_{i=1}^m ||p_i − µ(p_i)||_2^2,

where µ(p_i) is the centroid of the cluster to which p_i belongs. The goal of k-means is to find a partition S which minimizes F for a given P and k. We will refer to any such optimal clustering as

S_opt = argmin_S F(P, S).
The corresponding objective value is F_opt = F(P, S_opt). The goal of feature selection is to construct points

P̂ = {p̂_1, p̂_2, ..., p̂_m} ∈ R^r

(for some r ≪ n specified in advance) by projecting each p_i onto r of the coordinate dimensions. Consider the optimum k-means partition of the points in P̂,

Ŝ_opt = argmin_S F(P̂, S).

The goal of feature selection is to construct the new set of points P̂ such that

F(P, Ŝ_opt) ≤ α F(P, S_opt).

Here, α is the approximation factor and might depend on m, n, k and r. In words, computing a partition Ŝ_opt by using the low-dimensional data and plugging it back in to cluster the high-dimensional data gives an approximation to the optimal value of the k-means objective function. Notice that we measure the quality of Ŝ_opt by evaluating the k-means objective function in the original space, an approach which is standard ([2], [22], [28]). Comparing Ŝ_opt directly to S_opt, i.e. the identity of the clusters, not just the clustering error, would be much more interesting but at the same time a much harder (combinatorial) problem. A feature selection algorithm is called unsupervised if it computes P̂ by only looking at P and k. Supervised algorithms construct P̂ with respect to a given partition S_in of the data. Finally, an algorithm will be a γ-approximation for k-means (γ ≥ 1) if it finds a clustering S_γ with corresponding value F_γ ≤ γ F_opt. Such algorithms will be useful for stating our results in a more general way.
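To make this evaluation methodology concrete, the following minimal numpy sketch computes the k-means objective F(P, S) from a label assignment and compares a partition obtained on dimension-reduced data against one obtained on the full data, both evaluated in the original space. The use of sklearn's KMeans as a stand-in clustering routine, the random test data, and the crude "first r coordinates" selector are illustration choices, not part of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(P, labels):
    """F(P, S) = sum_i ||p_i - mu(p_i)||_2^2 for the partition encoded by `labels`."""
    cost = 0.0
    for j in np.unique(labels):
        block = P[labels == j]              # points of cluster S_j
        cost += ((block - block.mean(axis=0)) ** 2).sum()
    return cost

rng = np.random.default_rng(0)
m, n, k, r = 200, 50, 3, 10
P = rng.standard_normal((m, n)) + 5 * rng.standard_normal((k, n))[rng.integers(k, size=m)]

# cluster the full data and a crude feature subset (first r coordinates, for illustration only)
S_full = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(P)
S_hat  = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(P[:, :r])

# both partitions are evaluated in the ORIGINAL n-dimensional space
alpha = kmeans_cost(P, S_hat) / kmeans_cost(P, S_full)
print(f"empirical approximation factor alpha ~ {alpha:.3f}")
```

Keeping the first r coordinates is of course a poor feature selector; the algorithms studied below replace it with principled, provably accurate choices of columns.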
Definition 1 [k-means approximation algorithm]. An algorithm is a "γ-approximation" for k-means clustering (γ ≥ 1) if it takes as input the dataset P = {p_1, p_2, ..., p_m} ∈ R^n and the number of clusters k, and returns a clustering S_γ such that

F_γ = F(P, S_γ) ≤ γ F_opt.

The simplest algorithm with γ = 1, but exponential running time, would try all possible k-partitions and return the best. Another example of such an algorithm is in [22], with γ = 1 + ε (0 < ε < 1); the corresponding running time is O(mn · 2^{(k/ε)^{O(1)}}). Also, the work in [2] describes a method with γ = O(log k) and running time O(mnk). The latter two algorithms are randomized. For other γ-approximation algorithms, see [28] as well as the discussion and the references in [2], [22], [28].
2 STATEMENT OF OUR MAIN RESULTS
2.1 Supervised Feature Selection

Our first result is within the context of supervised feature selection. Suppose that we are given points P and some k-partition S_in. The goal is to find the important features of the points in P with respect to the given partition S_in. In other words, for any given partition S_in, we would like to find the features of the points such that by only using these features and some γ-approximation algorithm, we will be able to obtain a partition S_out that has similar performance to S_in.
Theorem 2. Fix P = {p_1, p_2, ..., p_m} ∈ R^n, S_in, k, and the number of features to be selected, k < r < n. There is an O(mn·min{m, n} + rk^2 n) time deterministic feature selection algorithm that constructs P̂ = {p̂_1, p̂_2, ..., p̂_m} ∈ R^r such that S_out, i.e. the partition that is constructed by running some γ-approximation method on P̂, satisfies

F(P, S_out) ≤ (1 + 4γ / (1 − √(k/r))^2) · F(P, S_in) = γ · (5 + O(√(k/r))) · F(P, S_in).
Essentially the clustering S_out is at most a constant factor worse than the original clustering S_in; this means we can compress the feature dimensions and still preserve a specific clustering in the data. This is useful, for example, in privacy preserving applications where one seeks to release minimal information about the data without destroying much of the encoded information [33]. Notice that the feature selection part of the theorem (i.e. the construction of P̂) is deterministic. The γ-approximation algorithm, which can be randomized, is only used to describe the clustering that can be obtained with the features returned by our deterministic feature selection algorithm (the same comment applies to Theorem 4). A complete survey of available approximation algorithms for k-means clustering is beyond the scope of the present work; our focus is on the feature selection. The corresponding algorithm is presented as Algorithm 1 in Section 4 and the proof of the theorem
is given in Section 5. Prior to this result, the best (and in fact the only) method with theoretical guarantees for this supervised setting [5], [7] is randomized (we note that [5] describes the result for the unsupervised setting, but it is easy to verify that the same algorithm and bound apply to the supervised setting as well) and gives

F(P, S_out) ≤ γ · (3 + O(√(k log(k)/r))) · F(P, S_in).
Further, [5], [7] requires r = Ω(k log k), otherwise the analysis breaks. We improve this bound by O(log k); also, we allow the user to select any r > k features. A surprising existential result is a direct corollary of the above theorem, obtained by setting S_in = S_opt and assuming a γ-approximation algorithm with γ = 1 (an algorithm that tries all possible partitions achieves γ = 1).

Corollary 3. For any set of points P = {p_1, p_2, ..., p_m} ∈ R^n, integer k, and ε > 0, there is a set of r = O(k/ε^2) features P̂ = {p̂_1, p̂_2, ..., p̂_m} ∈ R^r such that, if

S_opt = argmin_S F(P, S)  and  Ŝ_opt = argmin_S F(P̂, S),

then

F(P, Ŝ_opt) ≤ (5 + ε) F(P, S_opt).
In words, for any dataset there exists a small subset of O(k/ε^2) features such that the optimal clustering on these features is at most a (5 + ε)-factor worse than the optimal clustering on the original features. Unfortunately, finding these features in an unsupervised manner (without knowledge of the optimal partition S_opt) is not obvious; but even the existence of such a small subset of good features was not known before; for more discussion see [1], [11], [13], [16], [17].

2.2 Unsupervised Feature Selection

Our second result is within the context of unsupervised feature selection. In unsupervised feature selection, the goal is to obtain a k-partition that is as close as possible to the optimal k-partition of the high-dimensional data. The theorem below shows that it is possible to reduce the dimension of any high-dimensional dataset by using a deterministic method and obtain some theoretical guarantees on the quality of the clusters.

Theorem 4. Fix P = {p_1, p_2, ..., p_m} ∈ R^n, k, and the number of features to be selected, k < r < n. There is an O(mn·min{m, n} + rk^2 n) time deterministic algorithm to construct P̂ = {p̂_1, p̂_2, ..., p̂_m} ∈ R^r such that S_out, i.e. the partition that is constructed by running some γ-approximation k-means method on P̂, satisfies

F(P, S_out) ≤ (1 + 4γ (1 + √(n/r))^2 / (1 − √(k/r))^2) · F_opt = γ · O(n/r) · F_opt.

The corresponding algorithm is presented as Algorithm 2 in Section 4 and the proof of the theorem is given in Section 5. Prior to this result, the best method for this task is in [5], [7], which replaces O(n/r) with 3 + O(√(k log(k)/r)) and is randomized, i.e. this bound is achieved only with a constant probability. Further, [5], [7] requires r = Ω(k log k), so one cannot select o(k log k) features. Clearly, our bound is worse, but we achieve it deterministically, and it applies to any r > k.

Finally, it is possible to combine the algorithm of Theorem 4 with the randomized algorithm of [5], [7] and obtain the following result.

Theorem 5. Fix P = {p_1, p_2, ..., p_m} ∈ R^n, k, and the number of features to be selected, k < r < 4k log k. There is an O(mnk + rk^3 log(k) + r log r) time randomized algorithm to construct P̂ = {p̂_1, p̂_2, ..., p̂_m} ∈ R^r such that S_out, i.e. the partition that is constructed by running some γ-approximation k-means method on P̂, satisfies, with probability 0.4,

F(P, S_out) ≤ (1 + 640γ ((1 + √(16k ln(k)/r)) / (1 − √(k/r)))^2) · F_opt = γ · O(k log(k)/r) · F_opt.

The corresponding algorithm is presented as Algorithm 3 in Section 4 and the proof of the theorem is given in Section 5. Comparing Theorem 5 with Theorem 4, we obtain a much better approximation bound at the cost of introducing randomization. Comparing Theorem 5 with the main result of [5], [7], we have been able to break the barrier of having to select r = Ω(k log k) features. So, Theorem 5 is, to the best of our knowledge, the best available algorithm in the literature for unsupervised feature selection in k-means clustering using O(k) features. If r = Ω(k log k), our algorithm cannot improve on [5], [7]. We should note here that the main algorithm of [5] can be obtained from Algorithm 3 in Section 4 by ignoring the 4th and 5th steps and replacing the approximate SVD with the exact SVD in the first step (the algorithm in [7] uses the approximate SVD).

2.3 Discussion

All our algorithms require the computation of the top k (approximate) singular vectors of the matrix containing the data. Then, the selection of the features is done by looking at the structure of these singular vectors and using the deterministic techniques of [3], [4]. We should note that [5], [7] take a similar approach as far as computing the (approximate) singular vectors of the dataset; then [5], [7] employ the randomized technique of [31] to extract the features. In the introduction, we stated our results for unsupervised feature selection assuming the output partition was the optimal one in the reduced-dimension space, S_out = Ŝ_opt. This is the case if we assume a γ-approximation algorithm with γ = 1. Our theorems are actually more general and apply to any γ ≥ 1.
3 PRELIMINARIES

We now provide the necessary background for presenting our algorithms and proving our main theorems.

3.1 Singular Value Decomposition

We start with the description of the Singular Value Decomposition of a matrix, which plays an important role in the description of our algorithms. The Singular Value Decomposition (SVD) of A ∈ R^{m×n} with rank ρ = rank(A) ≤ min{m, n} is

A = U_A Σ_A V_A^T = [U_k  U_{ρ−k}] · [Σ_k  0; 0  Σ_{ρ−k}] · [V_k^T; V_{ρ−k}^T],

with U_A ∈ R^{m×ρ}, Σ_A ∈ R^{ρ×ρ}, V_A^T ∈ R^{ρ×n}, and singular values σ_1 ≥ σ_2 ≥ · · · ≥ σ_ρ > 0 contained in Σ_k ∈ R^{k×k} and Σ_{ρ−k} ∈ R^{(ρ−k)×(ρ−k)}. U_k ∈ R^{m×k} and U_{ρ−k} ∈ R^{m×(ρ−k)} contain the left singular vectors of A. Similarly, V_k ∈ R^{n×k} and V_{ρ−k} ∈ R^{n×(ρ−k)} contain the right singular vectors. We use A^+ = V_A Σ_A^{-1} U_A^T ∈ R^{n×m} to denote the Moore-Penrose pseudo-inverse of A, with Σ_A^{-1} denoting the inverse of Σ_A. We use the Frobenius and the spectral matrix norms: ||A||_F^2 = Σ_{i,j} A_{ij}^2 = Σ_{i=1}^ρ σ_i^2 and ||A||_2 = σ_1. Given A and B of appropriate dimensions, ||AB||_F ≤ ||A||_F ||B||_2. This is a stronger version of the standard submultiplicativity property ||AB||_F ≤ ||A||_F ||B||_F; we will refer to it as spectral submultiplicativity. Let A_k = U_k Σ_k V_k^T = A V_k V_k^T and A_{ρ−k} = A − A_k = U_{ρ−k} Σ_{ρ−k} V_{ρ−k}^T. The SVD gives the best rank-k approximation to A in both the spectral and the Frobenius norms: if rank(Ã) ≤ k then, for ξ = 2, F, ||A − A_k||_ξ ≤ ||A − Ã||_ξ; also, ||A − A_k||_F = ||Σ_{ρ−k}||_F.

3.2 Approximate Singular Value Decomposition

The exact SVD of A, though a deterministic algorithm, takes cubic time. More specifically, for any k ≥ 1, the running time to compute the top k left and/or right singular vectors of A ∈ R^{m×n} is O(mn·min{m, n}). We will use the exact SVD in our deterministic feature selection algorithms in Theorems 2 and 4. To speed up our randomized algorithm in Theorem 5, we will use a factorization which can be computed quickly and approximates the SVD in a well-defined sense. We quote a recent result from [4] for a relative-error Frobenius norm SVD approximation algorithm. The exact description of the algorithm of the following lemma is beyond the scope of the present work.

Lemma 6 (Lemma 13 in [4]). Given A ∈ R^{m×n} of rank ρ, a target rank 2 ≤ k < ρ, and 0 < ε < 1, there exists an O(mnk/ε) time randomized algorithm that computes a matrix Z ∈ R^{n×k} such that Z^T Z = I_k, EZ = 0_{m×k} (for E = A − AZZ^T ∈ R^{m×n}), and

E[||E||_F^2] ≤ (1 + ε) ||A − A_k||_F^2.

We use Z = FastApproximateSVD(A, k, ε) to denote this randomized procedure.
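A minimal numpy sketch of these quantities follows: it computes V_k and A_k from the exact SVD, and pairs them with a generic randomized range-finder as a stand-in for the Z = FastApproximateSVD(A, k, ε) interface of Lemma 6. The range-finder shown here is a standard sketching heuristic, not the specific algorithm of [4]; the oversampling rule is an illustration choice.

```python
import numpy as np

def topk_svd(A, k):
    """Exact SVD pieces used throughout: U_k, Sigma_k, V_k and the best rank-k approximation A_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T          # V_k is n x k
    Ak = Uk @ np.diag(sk) @ Vk.T                       # A_k = U_k Sigma_k V_k^T = A V_k V_k^T
    return Uk, sk, Vk, Ak

def fast_approximate_svd(A, k, eps, rng=np.random.default_rng(0)):
    """Stand-in for Z = FastApproximateSVD(A, k, eps): a randomized range finder.
    Returns Z (n x k) with Z^T Z = I_k such that ||A - A Z Z^T||_F is close to ||A - A_k||_F."""
    p = k + max(2, int(np.ceil(k / eps)))              # oversampling, an illustration choice
    Y = A.T @ (A @ rng.standard_normal((A.shape[1], p)))   # one power iteration on A^T A
    Q, _ = np.linalg.qr(Y)                             # orthonormal basis for an approximate row space
    _, _, Wt = np.linalg.svd(A @ Q, full_matrices=False)
    return Q @ Wt[:k, :].T                             # n x k, orthonormal columns

A = np.random.default_rng(1).standard_normal((100, 40))
k = 5
_, _, Vk, Ak = topk_svd(A, k)
Z = fast_approximate_svd(A, k, 0.5)
print(np.linalg.norm(A - Ak), np.linalg.norm(A - A @ Z @ Z.T))   # exact vs. approximate residual
```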
3.3 Linear Algebraic Definition of k-means

We now give the linear algebraic definition of the k-means problem. Recall that P = {p_1, p_2, ..., p_m} ∈ R^n contains the data points, k is the number of clusters, and S denotes a k-partition of P. Define the data matrix A ∈ R^{m×n} as

A^T = [p_1, p_2, . . . , p_m].

We represent a clustering S by its cluster indicator matrix X ∈ R^{m×k}. Each column j = 1, . . . , k of X represents a cluster. Each row i = 1, . . . , m indicates the cluster membership of the point p_i. So, X_{ij} = 1/√s_j if and only if the data point p_i is in cluster S_j (s_j = |S_j|). Every row of X has exactly one non-zero element, corresponding to the cluster the data point belongs to. There are s_j non-zero elements in column j, which indicate the points belonging to S_j. The two formulations are related:

F(A, X) = ||A − XX^T A||_F^2 = Σ_{i=1}^m ||p_i^T − X_{(i)} X^T A||_2^2 = Σ_{i=1}^m ||p_i^T − µ(p_i)^T||_2^2 = F(P, S),

where X_{(i)} denotes the i-th row of X and we have used the identity X_{(i)} X^T A = µ(p_i)^T, for i = 1, ..., m, which can be verified with some elementary algebra. Using this formulation, the goal of k-means is to find an indicator matrix X which minimizes ||A − XX^T A||_F^2. We will denote the best such matrix by

X_opt = argmin_{X ∈ R^{m×k}} ||A − XX^T A||_F^2;

so,

F_opt = ||A − X_opt X_opt^T A||_F^2.

Since A_k is the best rank-k approximation to A,

||A − A_k||_F^2 ≤ F_opt,

because X_opt X_opt^T A has rank at most k. Using the matrix formulation for k-means, we can restate the goal of a feature selection algorithm in matrix notation. So, the goal of feature selection is to construct a matrix C ∈ R^{m×r}, consisting of r columns of A (possibly rescaled), which represents the m points in the r-dimensional selected feature space. Now consider the optimum k-means partition of the points in C,

X̂_opt = argmin_{X ∈ R^{m×k}} ||C − XX^T C||_F^2.

The goal of feature selection is to construct the new set of points C such that

||A − X̂_opt X̂_opt^T A||_F^2 ≤ α ||A − X_opt X_opt^T A||_F^2.

3.4 Spectral Sparsification and Rudelson's Concentration Lemma

We now present the main tools we use to select the features in the context of k-means clustering. Let Ω ∈ R^{n×r} be a matrix such that AΩ ∈ R^{m×r} contains r columns of A; Ω is a projection operator onto an r-dimensional subset of the features. Let S ∈ R^{r×r} be a diagonal matrix, so AΩS ∈ R^{m×r} rescales the columns of A that are in AΩ. Intuitively, AΩS projects down to the chosen r dimensions and then rescales the data along these dimensions. The following two lemmas describe two deterministic algorithms for constructing such Ω and S.

Lemma 7 (Lemma 11 in [4]). Let V^T ∈ R^{k×n} and B ∈ R^{ℓ1×n} with V^T V = I_k. Let r > k. There is a deterministic O(rk^2 n + ℓ1 n) time algorithm to construct Ω ∈ R^{n×r} and S ∈ R^{r×r} such that

σ_k(V^T ΩS) ≥ 1 − √(k/r);  ||BΩS||_F ≤ ||B||_F.

Lemma 8 (Lemma 10 in [4]). Let V^T ∈ R^{k×n}, Q ∈ R^{ℓ2×n}, V^T V = I_k, and Q^T Q = I_{ℓ2}. Let r > k. There is a deterministic O(rk^2 n + rℓ2^2 n) time algorithm to construct Ω ∈ R^{n×r} and S ∈ R^{r×r} such that

σ_k(V^T ΩS) ≥ 1 − √(k/r);  ||QΩS||_2 ≤ 1 + √(ℓ2/r).

Moreover, if Q = I_n, the running time is O(rk^2 n).

Lemmas 7 and 8 are generalizations of the original work of Batson et al. [3], which presented a deterministic algorithm that operates only on V. Lemmas 7 and 8 are proved in [4] (see Lemmas 10 and 11 in [4]). We will use Lemma 7 in a novel way: we will apply it to a matrix B of the form

B = [B_1; B_2]

(the two blocks stacked on top of each other), so we will be able to control the sum of the Frobenius norms of two different matrices B_1, B_2, which is all we need in our application. The above two lemmas will be used to prove our deterministic results for feature selection, i.e. Theorems 2 and 4. We will also need the following result, which corresponds to the celebrated work of Rudelson and Vershynin [30], [31] and describes a randomized algorithm for constructing matrices Ω and S. The lower bound with the optimal constants 4 and 20 was recently proved as Lemma 15 in [25]. The Frobenius norm bounds are straightforward; a short proof can be found as Eqn. 36 in [10]. This lemma will be used to prove our hybrid randomized result for feature selection, i.e. Theorem 5.

Lemma 9. Let V^T ∈ R^{k×n}, B ∈ R^{ℓ1×n} and Q ∈ R^{ℓ2×n}, with V^T V = I_k. Let r > 4k ln(k). Algorithm 6 constructs, in O(nk + r log(r)) time, matrices Ω ∈ R^{n×r} and S ∈ R^{r×r} such that, with probability 0.9,

σ_k^2(V^T ΩS) ≥ 1 − √(4k ln(20k)/r);  E[||BΩS||_F^2] = ||B||_F^2;  E[||QΩS||_F^2] = ||Q||_F^2.
4 ALGORITHMS

This section gives the details of the algorithms relating to Theorems 2, 4, and 5 (recall that there is no algorithm for Corollary 3, since that result is only existential). The resulting algorithms are presented as Algorithms 1, 2, and 3, respectively.

Algorithm 1: Supervised Feature Selection (Theorem 2)
Input: A ∈ R^{m×n}, X_in ∈ R^{m×k}, number of clusters k, and number of features r > k.
Output: C ∈ R^{m×r} containing r rescaled columns of A.
1: Compute the matrix V_k ∈ R^{n×k} from the SVD of A.
2: Let B = [A − A V_k V_k^T; A − X_in X_in^T A] ∈ R^{2m×n} (the two blocks stacked on top of each other).
3: Let [Ω, S] = DeterministicSamplingI(V_k^T, B, r).
4: return C = AΩS ∈ R^{m×r}.

Algorithm 2: Unsupervised Feature Selection (Theorem 4)
Input: A ∈ R^{m×n}, number of clusters k, and number of features r > k.
Output: C ∈ R^{m×r} containing r rescaled columns of A.
1: Compute the matrix V_k ∈ R^{n×k} from the SVD of A.
2: Let [Ω, S] = DeterministicSamplingII(V_k^T, I_n, r).
3: return C = AΩS ∈ R^{m×r}.

Algorithm 3: Randomized Unsupervised Feature Selection (Theorem 5)
Input: A ∈ R^{m×n}, number of clusters k, and number of features k < r < 4k ln k.
Output: C ∈ R^{m×r} containing r rescaled columns of A.
1: Compute the matrix Z ∈ R^{n×k} from the approximate SVD of A in Lemma 6: Z = FastApproximateSVD(A, k, 1/2).
2: Let c = 16k ln(20k).
3: Let [Ω_1, S_1] = RandomizedSampling(Z^T, c).
4: Compute Ṽ ∈ R^{c×k} with the top k right singular vectors of Z^T Ω_1 S_1 ∈ R^{k×c}.
5: Let [Ω, S] = DeterministicSamplingII(Ṽ^T, I_c, r).
6: return C = A Ω_1 S_1 Ω S ∈ R^{m×r}.

The intuition for Algorithms 1, 2, and 3 stems from the following crucial lemma, which lies at the heart of the analysis of our algorithms and the proofs of Theorems 2, 4, and 5. In the following lemma, the sampling and rescaling matrices Ω ∈ R^{n×r} and S ∈ R^{r×r} are arbitrary, modulo the rank restriction in the lemma.

Lemma 10. Fix A ∈ R^{m×n}, k > 0, and X_in ∈ R^{m×k}. Let Ω ∈ R^{n×r} and S ∈ R^{r×r} be any matrices, so C = AΩS ∈ R^{m×r}. Let X_out be the output of some γ-approximation algorithm on (C, k). Then, if rank(V_k^T ΩS) = k,

||A − X_out X_out^T A||_F^2 ≤ ||A − X_in X_in^T A||_F^2 + ∆,

where

∆ = 2γ · (||(A − X_in X_in^T A)ΩS||_F^2 + ||(A − A V_k V_k^T)ΩS||_F^2) / σ_k^2(V_k^T ΩS).

The message of this lemma is very useful. It tells us that a provably accurate feature selection algorithm need only control three error terms:

||(A − X_in X_in^T A)ΩS||_F^2;  ||(A − A V_k V_k^T)ΩS||_F^2;  σ_k^2(V_k^T ΩS).

Another requirement on the matrices Ω and S is that rank(V_k^T ΩS) = k, which can be achieved if σ_k(V_k^T ΩS) > 0.

Notice that DeterministicSamplingI in Algorithm 1 (Theorem 2) controls the smallest singular value of an orthonormal matrix and the Frobenius norm of some other matrix B. We carefully chose

B = [A − A V_k V_k^T; A − X_in X_in^T A],

so that when we control ||BΩS||_F^2, we control the sum of the squared Frobenius norms of the two submatrices, which is all we need (see the proof of Theorem 2). The term ||(A − X_in X_in^T A)ΩS||_F^2 in the equation of the above lemma indicates why Algorithm 1 requires knowledge of some partition X_in. In order to remove the dependence on X_in and obtain our unsupervised feature selection result, we use spectral submultiplicativity to manipulate this term as follows,

||(A − X_in X_in^T A)ΩS||_F^2 ≤ ||A − X_in X_in^T A||_F^2 ||I_n ΩS||_2^2.

Similarly, we obtain

||(A − A V_k V_k^T)ΩS||_F^2 ≤ ||A − A V_k V_k^T||_F^2 ||I_n ΩS||_2^2.

These two results, along with Lemma 8, give the intuition for the method DeterministicSamplingII in Algorithm 2 that accomplishes the claim of Theorem 4 (Algorithm 3 uses this same trick as well).
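As a concrete illustration of what Lemma 10 asks a feature selection algorithm to control, the sketch below evaluates the three error terms and the resulting additive error ∆ for an arbitrary (here: uniformly random, unweighted) choice of columns. The random choice is purely for illustration and carries none of the guarantees of the deterministic constructions that follow.

```python
import numpy as np

def lemma10_terms(A, X_in, k, cols, weights, gamma=1.0):
    """Evaluate the three quantities of Lemma 10 for a given sampling/rescaling choice.
    Returns (term_in, term_vk, sigma_k_sq, delta)."""
    m, n = A.shape
    r = len(cols)
    omega = np.zeros((n, r)); omega[cols, np.arange(r)] = 1.0
    S = np.diag(weights)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :].T                                        # n x k
    term_in = np.linalg.norm((A - X_in @ X_in.T @ A) @ omega @ S) ** 2
    term_vk = np.linalg.norm((A - A @ Vk @ Vk.T) @ omega @ S) ** 2
    sigma_k_sq = np.linalg.svd(Vk.T @ omega @ S, compute_uv=False)[k - 1] ** 2
    delta = 2 * gamma * (term_in + term_vk) / sigma_k_sq    # the additive error of Lemma 10
    return term_in, term_vk, sigma_k_sq, delta

# toy usage: random data, a random k-partition as X_in, and r random unweighted columns
rng = np.random.default_rng(0)
m, n, k, r = 40, 20, 3, 8
A = rng.standard_normal((m, n))
labels = rng.integers(k, size=m)
X_in = np.zeros((m, k))
for j in range(k):
    X_in[labels == j, j] = 1.0 / np.sqrt(np.sum(labels == j))
cols = rng.choice(n, size=r, replace=False)
print(lemma10_terms(A, X_in, k, cols, weights=np.ones(r)))
```

The algorithms below are exactly the procedures that make the first two terms small while keeping σ_k^2(V_k^T ΩS) bounded away from zero.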
Algorithm 4: DeterministicSamplingI (Lemma 7)
Input: V^T = [v_1, v_2, . . . , v_n] ∈ R^{k×n}, B = [b_1, b_2, . . . , b_n] ∈ R^{ℓ1×n}, and r > k.
Output: Sampling matrix Ω ∈ R^{n×r} and rescaling matrix S ∈ R^{r×r}.
1: Initialize A_0 = 0_{k×k}, Ω = 0_{n×r}, and S = 0_{r×r}.
2: Set constants δ_B = ||B||_F^2 · (1 − √(k/r))^{-1} and δ_L = 1.
3: for τ = 0 to r − 1 do
4:   Let L_τ = τ − √(rk) and U_B = r·δ_B.
5:   Pick an index i_τ ∈ {1, 2, ..., n} and a number t_τ > 0 such that (see the text for the definition of U and L)
       U(b_{i_τ}, δ_B) ≤ 1/t_τ ≤ L(v_{i_τ}, δ_L, A_{τ−1}, L_τ).
6:   Update A_τ = A_{τ−1} + t_τ v_{i_τ} v_{i_τ}^T; set Ω_{i_τ, τ+1} = 1 and S_{τ+1, τ+1} = √t_τ.
7: end for
8: Multiply all the weights in S by √(r^{-1}(1 − √(k/r))).
9: Return Ω and S.

Algorithm 5: DeterministicSamplingII (Lemma 8)
Input: V^T = [v_1, v_2, . . . , v_n] ∈ R^{k×n}, Q = [q_1, q_2, . . . , q_n] ∈ R^{ℓ2×n}, and r > k.
Output: Sampling matrix Ω ∈ R^{n×r} and rescaling matrix S ∈ R^{r×r}.
1: Initialize A_0 = 0_{k×k}, B_0 = 0_{ℓ2×ℓ2}, Ω = 0_{n×r}, and S = 0_{r×r}.
2: Set constants δ_Q = (1 + √(ℓ2/r)) · (1 − √(k/r))^{-1} and δ_L = 1.
3: for τ = 0 to r − 1 do
4:   Let L_τ = τ − √(rk) and U_τ = δ_Q·τ + √(ℓ2·r).
5:   Pick an index i_τ ∈ {1, 2, ..., n} and a number t_τ > 0 such that (see the text for the definition of Û and L)
       Û(q_{i_τ}, δ_Q, B_{τ−1}, U_τ) ≤ 1/t_τ ≤ L(v_{i_τ}, δ_L, A_{τ−1}, L_τ).
6:   Update A_τ = A_{τ−1} + t_τ v_{i_τ} v_{i_τ}^T and B_τ = B_{τ−1} + t_τ q_{i_τ} q_{i_τ}^T; set Ω_{i_τ, τ+1} = 1 and S_{τ+1, τ+1} = √t_τ.
7: end for
8: Multiply all the weights in S by √(r^{-1}(1 − √(k/r))).
9: Return Ω and S.

4.1 Running Times

We now comment on the running times of Algorithms 1, 2, and 3. Algorithm 1 computes the matrix V_k in O(mn·min{m, n}) time; then, A − A V_k V_k^T and A − X_in X_in^T A can be computed in O(mnk). DeterministicSamplingI takes time O(rk^2 n + mn), from Lemma 7. Overall, the running time of Algorithm 1 in Theorem 2 is O(mn·min{m, n} + rk^2 n). Similar arguments suffice to show that this is also the case for Algorithm 2 in Theorem 4. Finally, the running time of Algorithm 3 is O(mnk + rk^3 log(k) + r log r), since it employs the approximate SVD of Lemma 6, the randomized technique of Lemma 9, and the deterministic technique of Lemma 8.

4.2 Description of the Algorithms Accomplishing Lemmas 7, 8, and 9

DeterministicSamplingI. At a high level, Algorithm 4 selects columns in a greedy way such that the desired bounds hold at every iteration of the algorithm (r iterations in total). The key equation is in step 5 of the algorithm. If an index i_τ satisfies this equation, then the corresponding columns satisfy the desired bounds as well. Thus, a key requirement in the algorithm is that such an index i_τ exists at every step. These two observations give the high-level idea in DeterministicSamplingI. To describe the algorithm in more detail, it is convenient to view the input matrices as two sets of n vectors, V^T = [v_1, v_2, . . . , v_n] and B = [b_1, b_2, . . . , b_n]. Given k and r > k, introduce the iterator τ = 0, 1, 2, ..., r − 1, and define the parameter L_τ = τ − √(rk). For a square symmetric matrix A ∈ R^{k×k} with eigenvalues λ_1, . . . , λ_k, a vector v ∈ R^k, and a scalar L ∈ R, define

φ(L, A) = Σ_{i=1}^k 1/(λ_i − L),

and let L(v, δ_L, A, L) be defined as

L(v, δ_L, A, L) = v^T(A − L'I_k)^{-2}v / (φ(L', A) − φ(L, A)) − v^T(A − L'I_k)^{-1}v,  where L' = L + δ_L = L + 1.

For a vector z and a scalar δ > 0, define the function U(z, δ) = δ^{-1} z^T z. At each iteration τ, the algorithm selects i_τ and t_τ > 0 for which

U(b_{i_τ}, δ_B) ≤ 1/t_τ ≤ L(v_{i_τ}, δ_L, A_{τ−1}, L_τ).

Such i_τ, t_τ exist, as was shown in [4]. The running time of the algorithm is dominated by the search for an index i_τ satisfying this inequality (one can achieve that by exhaustive search). One needs φ(L, A), and hence the eigenvalues of A. This takes O(k^3) time, once per iteration, for a total of O(rk^3). Then, for i = 1, . . . , n, we need to compute L for every v_i. This takes O(nk^2) per iteration, for a total of O(rnk^2). To compute U, we need b_i^T b_i for i = 1, . . . , n, which need to be computed only once for the whole algorithm, taking O(ℓ1 n). So, the total running time is O(nrk^2 + ℓ1 n).

DeterministicSamplingII. Algorithm 5 is similar to Algorithm 4; we only need to define the function Û. For a square symmetric matrix B ∈ R^{ℓ2×ℓ2} with eigenvalues λ_1, . . . , λ_{ℓ2}, a vector q ∈ R^{ℓ2}, and a scalar U ∈ R, define

φ̂(U, B) = Σ_{i=1}^{ℓ2} 1/(U − λ_i),

and let Û(q, δ_Q, B, U) be defined as

Û(q, δ_Q, B, U) = q^T(B − U'I_{ℓ2})^{-2}q / (φ̂(U, B) − φ̂(U', B)) − q^T(B − U'I_{ℓ2})^{-1}q,  where U' = U + δ_Q = U + (1 + √(ℓ2/r))·(1 − √(k/r))^{-1}.

The running time of the algorithm is O(nrk^2 + nrℓ2^2).

RandomizedSampling. Finally, Algorithm 6 takes as input only V; it selects the columns randomly, based on a probability distribution that is computed via V. It needs O(nk) time to compute the probabilities and O(n + r log r) to select the indices, for a total of O(nk + r log r) time.

Algorithm 6: RandomizedSampling (Lemma 9)
Input: V^T = [v_1, v_2, . . . , v_n] ∈ R^{k×n}, and the number of sampled columns r > 4k ln k.
Output: Sampling matrix Ω ∈ R^{n×r} and rescaling matrix S ∈ R^{r×r}.
1: For i = 1, ..., n compute p_i = ||v_i||_2^2 / k.
2: Initialize Ω = 0_{n×r} and S = 0_{r×r}.
3: for τ = 1 to r do
4:   Select an index i ∈ {1, 2, ..., n} independently, with the probability of selecting index i equal to p_i.
5:   Set Ω_{i,τ} = 1 and S_{τ,τ} = 1/√(p_i r).
6: end for
7: Return Ω and S.
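The following numpy sketch transcribes Algorithm 6 directly and gives one possible rendering of Algorithm 4 (DeterministicSamplingI). The exhaustive argmax search and the choice of 1/t_τ as the midpoint of the admissible interval are illustration choices (the lemma only requires any admissible pair), and the final lines check the two guarantees of Lemma 7 on random inputs; this is a sketch, not a reference implementation.

```python
import numpy as np

def randomized_sampling(Vt, r, rng=np.random.default_rng(0)):
    """Algorithm 6: sample r columns i.i.d. with p_i = ||v_i||^2 / k and rescale by 1/sqrt(p_i r)."""
    k, n = Vt.shape
    p = (Vt ** 2).sum(axis=0) / k
    idx = rng.choice(n, size=r, p=p)
    omega = np.zeros((n, r)); omega[idx, np.arange(r)] = 1.0
    return omega, np.diag(1.0 / np.sqrt(p[idx] * r))

def deterministic_sampling_I(Vt, B, r):
    """One rendering of Algorithm 4 (Lemma 7). Vt is k x n with orthonormal rows, B is l1 x n, r > k."""
    k, n = Vt.shape
    omega, S = np.zeros((n, r)), np.zeros((r, r))
    A = np.zeros((k, k))                                  # running sum of t * v v^T
    delta_B = (B ** 2).sum() / (1.0 - np.sqrt(k / r))     # Frobenius "budget"
    U = (B ** 2).sum(axis=0) / delta_B                    # U(b_i, delta_B) = ||b_i||^2 / delta_B
    for tau in range(r):
        Lt = tau - np.sqrt(r * k)
        lam = np.linalg.eigvalsh(A)                       # eigenvalues of A, once per iteration
        phi_L, phi_Lp = np.sum(1.0 / (lam - Lt)), np.sum(1.0 / (lam - (Lt + 1.0)))
        M = A - (Lt + 1.0) * np.eye(k)                    # A - L' I_k (positive definite here)
        MinvV = np.linalg.solve(M, Vt)
        q1 = np.einsum('ij,ij->j', Vt, MinvV)             # v_i^T (A - L'I)^{-1} v_i
        q2 = np.einsum('ij,ij->j', MinvV, MinvV)          # v_i^T (A - L'I)^{-2} v_i
        Lg = q2 / (phi_Lp - phi_L) - q1                   # L(v_i, delta_L = 1, A, L_tau)
        i = int(np.argmax(Lg - U))                        # an index with U <= L exists (Lemma 7)
        t = 2.0 / (U[i] + Lg[i])                          # 1/t is the midpoint of [U, L]
        A += t * np.outer(Vt[:, i], Vt[:, i])
        omega[i, tau] = 1.0
        S[tau, tau] = np.sqrt(t)
    S *= np.sqrt((1.0 - np.sqrt(k / r)) / r)              # final rescaling of all the weights
    return omega, S

# check the Lemma 7 guarantees on random input
rng = np.random.default_rng(1)
n, k, l1, r = 80, 3, 6, 15
V, _ = np.linalg.qr(rng.standard_normal((n, k)))          # V^T V = I_k
B = rng.standard_normal((l1, n))
omega, S = deterministic_sampling_I(V.T, B, r)
sigma_k = np.linalg.svd(V.T @ omega @ S, compute_uv=False)[-1]
print(sigma_k >= 1 - np.sqrt(k / r))                      # sigma_k(V^T Omega S) >= 1 - sqrt(k/r)
print(np.linalg.norm(B @ omega @ S) <= np.linalg.norm(B)) # ||B Omega S||_F <= ||B||_F
```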
5 PROOFS

We first prove Lemma 10, which is the main technical contribution of this work. Then, combining Lemma 10 with Lemmas 7, 8, and 9, we prove Theorems 2, 4, and 5, respectively.

5.1 Proof of Lemma 10

We start by manipulating the term ||A − X_out X_out^T A||_F^2. First, we decompose the matrix A as

A = A_k + (A − A_k).

Now, from the Pythagorean theorem for matrices (if Y_1, Y_2 ∈ R^{m×n} satisfy Y_1 Y_2^T = 0_{m×m}, then ||Y_1 + Y_2||_F^2 = ||Y_1||_F^2 + ||Y_2||_F^2), we have that

||A − X_out X_out^T A||_F^2 = ||(I_m − X_out X_out^T)A_k + (I_m − X_out X_out^T)(A − A_k)||_F^2
= ||(I_m − X_out X_out^T)A_k||_F^2 + ||(I_m − X_out X_out^T)(A − A_k)||_F^2.

Using that I_m − X_out X_out^T is a projection matrix and that

||A − A_k||_F^2 ≤ ||(I_m − X_opt X_opt^T)A||_F^2,

we obtain

||A − X_out X_out^T A||_F^2 ≤ ||(I_m − X_out X_out^T)A_k||_F^2 + ||(I_m − X_opt X_opt^T)A||_F^2.

We now bound the first term. Given Ω and S, for some residual matrix Y ∈ R^{m×n}, let

A_k = AΩS(V_k^T ΩS)^+ V_k^T + Y.

Then,

||(I_m − X_out X_out^T)A_k||_F^2
(a) ≤ 2||(I_m − X_out X_out^T)AΩS(V_k^T ΩS)^+ V_k^T||_F^2 + 2||Y||_F^2
(b) ≤ 2||(I_m − X_out X_out^T)AΩS||_F^2 ||(V_k^T ΩS)^+||_2^2 + 2||Y||_F^2
(c) ≤ 2γ||(I_m − X_in X_in^T)AΩS||_F^2 ||(V_k^T ΩS)^+||_2^2 + 2||Y||_F^2.

In (a), we used ||Y_1 + Y_2||_F^2 ≤ 2||Y_1||_F^2 + 2||Y_2||_F^2 (for any two matrices Y_1, Y_2), which follows from the triangle inequality for matrix norms; further, we have removed the projection matrix I_m − X_out X_out^T from the second term, which can be done without increasing the Frobenius norm. In (b), we used spectral submultiplicativity and the fact that V_k^T is orthonormal, so it can be dropped without increasing the spectral norm. Finally, in (c), we replaced X_out by X_in and the factor γ appeared in the first term. To understand why this can be done, notice that, by assumption, X_out was constructed by running the γ-approximation on C = AΩS. So, for any indicator matrix X,

||(I_m − X_out X_out^T)AΩS||_F^2 ≤ γ||(I_m − XX^T)AΩS||_F^2.

Setting X = X_in shows the claim. Finally, we bound the term ||Y||_F^2. Recall that

Y = A_k − AΩS(V_k^T ΩS)^+ V_k^T = A_k − A_k ΩS(V_k^T ΩS)^+ V_k^T − (A − A_k)ΩS(V_k^T ΩS)^+ V_k^T.
Since A_k = U_k Σ_k V_k^T, we have that

A_k ΩS(V_k^T ΩS)^+ V_k^T = U_k Σ_k V_k^T ΩS(V_k^T ΩS)^+ V_k^T = U_k Σ_k V_k^T,

where the last equality follows because rank(V_k^T ΩS) = k, and so

V_k^T ΩS(V_k^T ΩS)^+ = I_k.

Thus, the first two terms in the expression for Y cancel and we have

||Y||_F^2 = ||(A − A_k)ΩS(V_k^T ΩS)^+ V_k^T||_F^2 ≤ ||(A − A_k)ΩS||_F^2 ||(V_k^T ΩS)^+ V_k^T||_2^2 ≤ ||(A − A_k)ΩS||_F^2 ||(V_k^T ΩS)^+||_2^2 = ||(A − A_k)ΩS||_F^2 / σ_k^2(V_k^T ΩS).

In the first three steps we used spectral submultiplicativity, and in the last step we used the definition of the spectral norm of the pseudo-inverse. Combining all these bounds together (and using γ ≥ 1),

||A − X_out X_out^T A||_F^2 ≤ ||A − X_opt X_opt^T A||_F^2 + ∆,

where

∆ = 2γ · (||(I_m − X_in X_in^T)AΩS||_F^2 + ||(A − A_k)ΩS||_F^2) / σ_k^2(V_k^T ΩS).

The lemma now follows because

||A − X_opt X_opt^T A||_F^2 ≤ ||A − X_in X_in^T A||_F^2,

which holds by the optimality of the indicator matrix X_opt on the high-dimensional points contained in the rows of A.

5.2 Proof of Theorem 2

Theorem 2 follows by using Lemmas 10 and 7. We would like to apply Lemma 10 to the matrices Ω and S constructed with DeterministicSamplingI; to do that, we need rank(V_k^T ΩS) = k. This rank requirement follows from Lemma 7, because

σ_k(V_k^T ΩS) ≥ 1 − √(k/r) > 0.

Hence, Lemma 10 gives

||A − X_out X_out^T A||_F^2 ≤ ||A − X_in X_in^T A||_F^2 + ∆,

where

∆ = 2γ · (||(A − X_in X_in^T A)ΩS||_F^2 + ||(A − A V_k V_k^T)ΩS||_F^2) / σ_k^2(V_k^T ΩS).

We can now use the bound for σ_k^2(V_k^T ΩS) from Lemma 7:

σ_k^2(V_k^T ΩS) ≥ (1 − √(k/r))^2.

Also, we can use the Frobenius norm bound for the matrix B of Algorithm 4: ||BΩS||_F^2 ≤ ||B||_F^2. From our choice of B,

||BΩS||_F^2 = ||(A − A V_k V_k^T)ΩS||_F^2 + ||(A − X_in X_in^T A)ΩS||_F^2,

and

||B||_F^2 = ||A − A V_k V_k^T||_F^2 + ||A − X_in X_in^T A||_F^2;

so,

||(A − A V_k V_k^T)ΩS||_F^2 + ||(A − X_in X_in^T A)ΩS||_F^2 ≤ ||A − A V_k V_k^T||_F^2 + ||A − X_in X_in^T A||_F^2.

Combine all these bounds together and use ||A − A V_k V_k^T||_F^2 ≤ F_opt ≤ F(P, S_in) to wrap up.

5.3 Proof of Theorem 4

Theorem 4 follows by combining Lemmas 10 and 8. The rank requirement in Lemma 10 is satisfied by the bound for the smallest singular value of V_k^T ΩS in Lemma 8. In Lemma 10, let X_in = X_opt. So,

||A − X_out X_out^T A||_F^2 ≤ ||A − X_opt X_opt^T A||_F^2 + ∆,

where

∆ = 2γ · (||(A − X_opt X_opt^T A)ΩS||_F^2 + ||(A − A V_k V_k^T)ΩS||_F^2) / σ_k^2(V_k^T ΩS).

Now, from spectral submultiplicativity and using ||A − A V_k V_k^T||_F^2 ≤ ||A − X_opt X_opt^T A||_F^2, we obtain

||(A − X_opt X_opt^T A)ΩS||_F^2 ≤ ||A − X_opt X_opt^T A||_F^2 ||I_n ΩS||_2^2,

and

||(A − A V_k V_k^T)ΩS||_F^2 ≤ ||A − A V_k V_k^T||_F^2 ||I_n ΩS||_2^2 ≤ ||A − X_opt X_opt^T A||_F^2 ||I_n ΩS||_2^2.

The result now follows by using the following bounds from Lemma 8:

σ_k^{-1}(V_k^T ΩS) ≤ (1 − √(k/r))^{-1};  ||I_n ΩS||_2 ≤ 1 + √(n/r).

5.4 Proof of Theorem 5

To prove the theorem, we are going to need a more general version of our crucial structural result, Lemma 10.
Lemma 11. Fix A ∈ R^{m×n}, k > 0, and X_in ∈ R^{m×k}. Let Ω ∈ R^{n×r} and S ∈ R^{r×r} be any matrices, so C = AΩS ∈ R^{m×r}. Let Z ∈ R^{n×k} be any orthonormal matrix such that A = AZZ^T + E (note that E = A − AZZ^T ∈ R^{m×n}). Let X_out be the output of some γ-approximation algorithm on (C, k). Then, if rank(Z^T ΩS) = k,

||A − X_out X_out^T A||_F^2 ≤ ||A − X_in X_in^T A||_F^2 + ∆,

where

∆ = 2γ · (||(A − X_in X_in^T A)ΩS||_F^2 + ||EΩS||_F^2) / σ_k^2(Z^T ΩS).

Proof: The proof follows exactly the same argument as the proof of Lemma 10, after replacing

V_k = Z,  A_k = AZZ^T,  and  A_{ρ−k} = E = A − AZZ^T.

Notice that Lemma 10 is a special case of the above lemma, obtained by taking Z = V_k, the matrix of the top k right singular vectors of A, in which case E = A − A_k.

To prove Theorem 5, we start with the general bound of Lemma 11; to apply the lemma, we need to satisfy the rank assumption, which will become clear shortly, during the course of the proof. The algorithm of Theorem 5 constructs the matrix Z by using the algorithm of Lemma 6 with ε set to a constant, ε = 1/2. Using the same notation as in Lemma 6,

E = A − AZZ^T,  and  E[||E||_F^2] ≤ (3/2) ||A − A_k||_F^2.

We can now apply Markov's inequality to obtain that, with probability at least 0.9,

||E||_F^2 ≤ 15 ||A − A_k||_F^2.

The randomized construction in the third step of Algorithm 3 gives sampling and rescaling matrices Ω_1 and S_1; the deterministic construction in the fifth step of Algorithm 3 gives sampling and rescaling matrices Ω and S. To apply Lemma 11, we will choose X_in = X_opt, since the lemma gives us the luxury of picking any indicator matrix X_in. Note that we do not need to actually compute X_opt in the algorithm; we are just using it in the proof to get the desired result. Algorithm 3 first selects c columns using Ω_1 S_1 ∈ R^{n×c}. Let

Y = Z^T Ω_1 S_1 ∈ R^{k×c},

and consider its SVD,

Y = Ũ Σ̃ Ṽ^T,

with Ũ ∈ R^{k×k}, Σ̃ ∈ R^{k×k}, and Ṽ ∈ R^{c×k}. Lemma 9 now implies that, with probability 0.9,

σ_k(Y) ≥ 1 − √(4k ln(20k)/c) = 1/2,

because c = 16k ln(20k), which means that rank(Ṽ^T) = k. Since Ṽ^T is an input to DeterministicSamplingII, it follows from Lemma 8 that

σ_k(Ṽ^T ΩS) ≥ 1 − √(k/r) > 0.

Since rank(YΩS) = rank(Ṽ^T ΩS) = k, we can apply Lemma 11:

||A − X_out X_out^T A||_F^2 ≤ ||A − X_in X_in^T A||_F^2 + ∆,

where

∆ = 2γ · (||(A − X_opt X_opt^T A)Ω_1 S_1 ΩS||_F^2 + ||EΩ_1 S_1 ΩS||_F^2) / σ_k^2(Z^T Ω_1 S_1 ΩS).

To bound the denominator, observe that

σ_k^2(Z^T Ω_1 S_1 ΩS) = σ_k^2(Ũ Σ̃ Ṽ^T ΩS) = σ_k^2(Σ̃ Ṽ^T ΩS).

Now, since Ũ is a full rotation,

σ_k^2(Σ̃ Ṽ^T ΩS) ≥ σ_k^2(Σ̃) σ_k^2(Ṽ^T ΩS) = σ_k^2(Z^T Ω_1 S_1) σ_k^2(Ṽ^T ΩS).

Thus,

σ_k^2(Z^T Ω_1 S_1 ΩS) ≥ (1/4)(1 − √(k/r))^2.

We now bound ||(A − X_opt X_opt^T A)Ω_1 S_1 ΩS||_F^2:

||(A − X_opt X_opt^T A)Ω_1 S_1 ΩS||_F^2 = ||(A − X_opt X_opt^T A)Ω_1 S_1 I_c ΩS||_F^2 ≤ ||(A − X_opt X_opt^T A)Ω_1 S_1||_F^2 ||I_c ΩS||_2^2 ≤ (1 + √(c/r))^2 ||(A − X_opt X_opt^T A)Ω_1 S_1||_F^2,

where the last inequality holds because I_c is the input to DeterministicSamplingII whose spectral norm is controlled, with ℓ2 = c. Similarly, we bound ||EΩ_1 S_1 ΩS||_F^2 as

||EΩ_1 S_1 ΩS||_F^2 ≤ (1 + √(c/r))^2 ||EΩ_1 S_1||_F^2.

From Lemma 9,

E[||EΩ_1 S_1||_F^2] = ||E||_F^2,
and

E[||(A − X_opt X_opt^T A)Ω_1 S_1||_F^2] = ||A − X_opt X_opt^T A||_F^2.

By a simple application of Markov's inequality and the union bound, both of the following inequalities hold with probability at least 0.6:

||EΩ_1 S_1||_F^2 ≤ 5||E||_F^2;  ||(A − X_opt X_opt^T A)Ω_1 S_1||_F^2 ≤ 5||A − X_opt X_opt^T A||_F^2.

We further manipulate the bound for the matrix E as

||EΩ_1 S_1||_F^2 ≤ 5||E||_F^2 ≤ 75||A_{ρ−k}||_F^2.

This introduces another failure probability of 0.1 (from the discussion at the beginning of this section). Putting all these bounds together, and using that

F_opt = ||A − X_opt X_opt^T A||_F^2  and  ||A_{ρ−k}||_F^2 ≤ F_opt,

we conclude the proof as follows:

2γ · (||(A − X_opt X_opt^T A)Ω_1 S_1 ΩS||_F^2 + ||EΩ_1 S_1 ΩS||_F^2) / σ_k^2(Z^T Ω_1 S_1 ΩS)
≤ 2γ · 80(1 + √(c/r))^2 F_opt / ((1/4)(1 − √(k/r))^2)
= 640 γ F_opt · (1 + √(c/r))^2 / (1 − √(k/r))^2
= γ F_opt · (640 + O(√(c/r) + c/r))
= γ F_opt · O(c/r),

where the last expression follows because, for k < r < c, the dominant term is O(c/r). Since c = O(k log k), the result follows. Note that we have made no attempt to optimize the constants. Finally, the failure probability of the theorem follows by a union bound on the events corresponding to the bounds

||EΩ_1 S_1||_F^2 ≤ 5||E||_F^2;  ||(A − X_opt X_opt^T A)Ω_1 S_1||_F^2 ≤ 5||A − X_opt X_opt^T A||_F^2,

and the lower bound of Lemma 9.
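To tie the pieces of Theorem 2 together, here is a hedged end-to-end sketch of the supervised pipeline of Algorithm 1: it builds the stacked residual matrix B, selects and rescales r columns with the deterministic_sampling_I sketch given after Algorithm 6 above, clusters the reduced data with sklearn's KMeans as a stand-in for the γ-approximation step, and evaluates both partitions in the original space. All data and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cost(P, labels):                    # F(P, S), evaluated in the original space
    return sum(((P[labels == j] - P[labels == j].mean(axis=0)) ** 2).sum()
               for j in np.unique(labels))

rng = np.random.default_rng(2)
m, n, k, r = 300, 60, 4, 20
centers = 4 * rng.standard_normal((k, n))
A = centers[rng.integers(k, size=m)] + rng.standard_normal((m, n))

# the given partition S_in and its indicator matrix X_in
S_in = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A)
X_in = np.zeros((m, k))
for j in range(k):
    X_in[S_in == j, j] = 1.0 / np.sqrt(np.sum(S_in == j))

# Algorithm 1: V_k from the SVD, B = [A - A V_k V_k^T ; A - X_in X_in^T A], then sample columns
_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k, :].T
B = np.vstack([A - A @ Vk @ Vk.T, A - X_in @ X_in.T @ A])
omega, S = deterministic_sampling_I(Vk.T, B, r)      # sketch defined earlier in this document
C = A @ omega @ S                                    # m x r: the selected, rescaled features

# cluster the reduced data and compare both partitions on the ORIGINAL features
S_out = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(C)
print(cost(A, S_out) / cost(A, S_in))  # Theorem 2 bounds F(P, S_out) by gamma*(5 + O(sqrt(k/r)))*F(P, S_in)
```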
6 RELATED WORK
Feature selection has received considerable attention in the machine learning and pattern recognition communities. A large number of different techniques appeared in prior work, addressing the feature selection within the context of both clustering and classification. Surveys include [14], as well as [15], which reports the results of the NIPS 2003 challenge in feature selection. Popular feature selection techniques include the Laplacian scores
[18], the Fisher scores [12], or the constraint scores [36]. None of these feature selection algorithms have theoretical guarantees on the performance of the clusters obtained using the dimension-reduced features. We focus on the family of feature selection methods that resemble our feature selection techniques, in that they select features by looking at the right singular vectors of the matrix containing the data (the matrix A). Given the input m × n object-feature matrix A, and a positive integer k, a line of research tries to construct features for (unsupervised) data reconstruction, specifically for Principal Components Analysis (PCA). PCA corresponds to the task of identifying a subset of k linear combinations of columns from A that best reconstruct A. Subset selection for PCA asks to find the columns of A that reconstruct A with comparable error as do its top Principal Components. Jolliffe [19] surveys various methods for the above task. Four of them (called B1, B2, B3, and B4 in [19]) employ the Singular Value Decomposition of A in order to identify columns that are somehow correlated with its top k left singular vectors. In particular, B3 employs a deterministic algorithm which is very similar to Algorithm 3 that we used in this work; no theoretical results are reported. An experimental evaluation of the methods of [19] on real datasets appeared in [20]. Another approach employing the matrix of the top k right singular vectors of A and a Procrustes-type criterion appeared in [21]. From an applications perspective, [34] employed the methods of [19] and [21] for gene selection in microarray data analysis. Feature selection for clustering seeks to identify those features that have the most discriminative power among all the features. [24] describes a method where one first computes the matrix Vk ∈ Rn×k , and then clusters the rows of Vk by running, for example, the k-means algorithm. One finally selects those k rows of Vk that are closest to the centroids of the clusters computed by the previous step. The method returns those columns from A that correspond to the selected rows from Vk . A different approach is described in [8]. This method selects features one at a time; it first selects the column of A which is most correlated with the top left singular vector of A, then projects A to this singular vector, removes the projection from A, computes the top left singular vector of the resulting matrix, and selects the column of A which is most correlated with the latter singular vector, etc. Greedy approaches similar to the method of [8] are described in [26] and [27]. There are no known theoretical guarantees for any of these methods. While these methods are superficially similar to our method, in that they use the right singular matrix Vk and are based on some sort of greedy algorithm, the techniques we developed to obtain theoretical guarantees are entirely different and based on linear-algebraic sparsification results [3], [4]. The result most closely related to ours is the work in [5], [7]. This work provides a randomized algorithm which offers a theoretical guarantee. Specifically, for
r = Ω(kε^{-2} log k), it is possible to select r features such that the optimal clustering in the reduced-dimension space is a (3 + ε)-approximation to the optimal clustering. Our result improves upon this in two ways. First, our algorithms are deterministic; second, by using our deterministic algorithms in combination with this randomized algorithm, we can select r = O(k) features and obtain a competitive theoretical guarantee.

Finally, we should mention that if one allows linear combinations of the features (feature extraction), there are algorithms that offer theoretical guarantees. First, there is the SVD itself, which constructs k (mixed) features for which the optimal clustering in this feature space is a 2-approximation to the optimal clustering [Drineas et al. 1999]. It is possible to improve the efficiency of this SVD algorithm considerably by using the approximate SVD (as in Lemma 6) instead of the exact SVD, to get nearly the same approximation guarantee with k features. The exact statement of this improvement can be found in [7]. Boutsidis et al. [6] show how to select O(kε^{-2}) (mixed) features with random projections, also obtaining a (2 + ε)-guarantee. While these algorithms are interesting, they do not produce features that preserve the integrity of the original features. The focus of this work is on what one can achieve while preserving the original features.
ACKNOWLEDGEMENTS

We would like to thank Haim Avron and Sanmay Das for their comments on a preliminary version of this work, which helped us improve the presentation of the paper.
REFERENCES

[1] A. Aggarwal, A. Deshpande, and R. Kannan. Adaptive sampling for k-means clustering. In Proceedings of the Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX), 2009.
[2] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.
[3] J. Batson, D. Spielman, and N. Srivastava. Twice-Ramanujan sparsifiers. In Proceedings of the ACM Symposium on Theory of Computing (STOC), 2009.
[4] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near optimal column based matrix reconstruction. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2011.
[5] C. Boutsidis, M. W. Mahoney, and P. Drineas. Unsupervised feature selection for the k-means clustering problem. In Neural Information Processing Systems (NIPS), 2009.
[6] C. Boutsidis, A. Zouzias, and P. Drineas. Random projections for k-means clustering. In Neural Information Processing Systems (NIPS), 2010.
[7] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas. Stochastic dimensionality reduction for k-means clustering. Manuscript, under review, 2011.
[8] Y. Cui and J. G. Dy. Orthogonal principal feature selection. Manuscript.
[9] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434, 2007.
[Drineas et al. 1999] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1999.
[10] P. Drineas, M. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.
[11] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the Annual Symposium on Computational Geometry (SoCG), 2007.
[12] D. Foley and J. W. Sammon. An optimal set of discriminant vectors. IEEE Transactions on Computers, C-24(3):281–289, March 1975.
[13] G. Frahling and C. Sohler. A fast k-means implementation using coresets. In Proceedings of the Annual Symposium on Computational Geometry (SoCG), 2006.
[14] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research (JMLR), 3:1157–1182, 2003.
[15] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Neural Information Processing Systems (NIPS), 2005.
[16] S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. In Proceedings of the Annual Symposium on Computational Geometry (SoCG), 2005.
[17] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), 2004.
[18] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In Advances in Neural Information Processing Systems, 2006.
[19] I. Jolliffe. Discarding variables in a principal component analysis. I: Artificial data. Applied Statistics, 21(2):160–173, 1972.
[20] I. Jolliffe. Discarding variables in a principal component analysis. II: Real data. Applied Statistics, 22(1):21–31, 1973.
[21] W. Krzanowski. Selection of variables to preserve multivariate data structure, using principal components. Applied Statistics, 36(1):22–33, 1987.
[22] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2004.
[23] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[24] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian. Feature selection using principal feature analysis. In Proceedings of the 15th International Conference on Multimedia, pages 301–304, 2007.
[25] M. Magdon-Ismail. Row sampling for matrix algorithms via a non-commutative Bernstein bound. ArXiv report, http://arxiv.org/abs/1008.0587, 2010.
[26] A. Malhi and R. Gao. PCA-based feature selection scheme for machine defect classification. IEEE Transactions on Instrumentation and Measurement, 53(6):1517–1525, Dec. 2004.
[27] K. Mao. Identifying critical variables of principal components for unsupervised feature selection. IEEE Transactions on Systems, Man, and Cybernetics, 35(2):339–344, April 2005.
[28] R. Ostrovsky and Y. Rabani. Polynomial time approximation schemes for geometric k-clustering. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2000.
[29] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2006.
[30] M. Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60–72, 1999.
[31] M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM (JACM), 54, 2007.
[32] A. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the International Conference on Machine Learning (ICML), 2000.
[33] J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003.
[34] A. Wang and E. A. Gehan. Gene selection for microarray data analysis using principal component analysis. Statistics in Medicine, 24(13):2069–2087, July 2005.
[35] X. Wu et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008. [36] D. Zhang, S. Chen, and Z.-H. Zhou. Constraint score: A new filter method for feature selection with pairwise constraints. Pattern Recognition, 41(5):1440–1451, 2008.