Neural Comput & Applic (2012) 21:1791–1799 DOI 10.1007/s00521-012-0827-3
ORIGINAL ARTICLE
Unsupervised maximum margin feature selection via L2,1-norm minimization

Shizhun Yang • Chenping Hou • Feiping Nie • Yi Wu

Received: 8 November 2011 / Accepted: 5 January 2012 / Published online: 18 January 2012
Springer-Verlag London Limited 2012

S. Yang (corresponding author) • C. Hou • Y. Wu
Department of Mathematics and System Science, National University of Defense Technology, Changsha 410073, China
e-mail: [email protected]

F. Nie
Department of Computer Science and Engineering, University of Texas, Arlington, TX 76019, USA
Abstract In this article, we present an unsupervised maximum margin feature selection algorithm via sparse constraints. The algorithm combines feature selection and K-means clustering into a coherent framework. L2,1-norm regularization is applied to the transformation matrix to enable feature selection across all data samples. Our method is equivalent to solving a convex optimization problem, and the resulting iterative algorithm converges to an optimal solution. The convergence analysis of our algorithm is also provided. Experimental results demonstrate the efficiency of our algorithm.

Keywords Feature selection · K-means clustering · Maximum margin criterion · Regularization
1 Introduction

In recent years, the dimensionality of the data used in data mining has become larger and larger; high-dimensional data may contain millions of features, many of which are irrelevant. Analysis of such data is challenging. Feature selection and feature extraction are the two main techniques for dimensionality reduction [1].
The major difference between feature selection and feature extraction is that feature selection keeps the original features, while feature extraction generates new features. In certain applications, such as genetic analysis, where the original features must be kept for interpretation, feature extraction techniques may fail and we must employ feature selection techniques. Another advantage of feature selection is that we only pay attention to the features of interest rather than all of the features, which may be easier than feature extraction. Motivated by these advantages over feature extraction, we focus on feature selection in this article.

Feature selection has been playing an important role in many applications, since it can speed up the learning process, improve the model generalization capability, and alleviate the effect of the curse of dimensionality [2]. There are three types of feature selection algorithms: supervised feature selection [3–5], semi-supervised feature selection [6, 7], and unsupervised feature selection [1, 8–10]. Supervised feature selection determines feature relevance by evaluating a feature's correlation with the class labels, semi-supervised feature selection uses both (a small amount of) labeled and (a large amount of) unlabeled data, while unsupervised feature selection exploits data variance and separability to evaluate feature relevance without any class labels. In many real-world applications, it is expensive or impossible to recollect the needed training data and rebuild the models, so unsupervised and semi-supervised feature selection may be more practical.

Traditional feature selection algorithms simply use statistical characteristics to rank the features, and they do not take "learning the transformation matrix efficiently" into consideration. Sparsity regularization has played a central role in many areas, including machine learning, statistics, and applied mathematics. Recently, sparsity regularization has been applied to feature selection studies [11, 14, 15].
In this article, we combine feature selection and K-means clustering into a coherent framework to adaptively select the most discriminative subspace, and we employ a sparse regularization term to enable feature selection. Indeed, it is the L2,1-norm sparse constraint that forces many rows of the projection matrix W to zero, so that the most relevant features can be selected more efficiently. By alternately learning the optimal subspace and the clusters, we obtain a good feature selection result. Firstly, we use K-means clustering to generate class labels and obtain the cluster indicator matrix F. Secondly, we jointly maximize the margin criterion and minimize the L2,1-norm term to perform feature selection. The clustering process is thus integrated with the feature selection process: the data are clustered while feature selection is carried out. It is worthwhile to highlight our work:

• Unlike traditional feature selection algorithms, we inherit a "learning mechanism" to do feature selection by employing sparse constraints on the objective function.
• We propose an efficient algorithm to solve the problem and also provide a convergence analysis. The experiments show that the objective can be solved in fewer than 20 iterations.
The rest of the article is organized as follows. In Sect. 2, we review some related work. In Sect. 3, we formulate the unsupervised maximum margin feature selection problem and present our algorithm. Section 4 reports our experiments. Finally, we conclude the article and discuss future work in Sect. 5.
2 Related work

The works most closely related to the unsupervised maximum margin feature selection via sparse constraints problem are feature selection techniques, the maximum margin criterion, and L2,1-norm regularization.

2.1 Feature selection techniques

Feature selection techniques are designed to find the relevant subset of the original features that can facilitate clustering or classification. The feature selection problem is essentially a combinatorial optimization problem, which is computationally expensive. There are three types of feature selection algorithms: supervised, semi-supervised, and unsupervised feature selection.
Typical supervised feature selection methods include Pearson correlation coefficients [5] and the Fisher score [4], which use statistical characteristics to rank the features, and information gain [3], which computes the sensitivity of a feature with respect to the class label distribution of the data. The work by Zhao et al. [6] is the first to unify supervised and unsupervised feature selection and enable their joint study under a general framework (SPEC). The proposed framework is able to generate families of algorithms for both supervised and unsupervised feature selection. Zhao et al. [7] presented a semi-supervised feature selection algorithm based on spectral analysis, which uses both (a small amount of) labeled and (a large amount of) unlabeled data in feature selection. The algorithm exploits both labeled and unlabeled data through a regularization framework, which provides an effective way to address the "small labeled-sample" problem. Traditional unsupervised feature selection methods, including the PCA score (PcaScor) [9] and the Laplacian score (LapScor) [10], address this issue by selecting the top-ranked features based on scores computed independently for each feature. Cai et al. [8] proposed a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection. MCFS considers the possible correlation between different features and uses spectral analysis of the data (manifold learning) and L1-regularized models for subset selection. Liu et al. [2] proposed a spectral feature selection algorithm of an embedded model (MRSF), which evaluates the utility of a set of features jointly and can efficiently remove redundant features. Boutsidis et al. [11] presented a novel unsupervised feature selection algorithm for the K-means clustering problem, the first accurate feature selection algorithm for K-means clustering. In this article, we focus on unsupervised feature selection. The traditional unsupervised feature selection methods above do not take "learning the transformation matrix efficiently" into consideration, so we pay more attention to the transformation matrix and seek a mechanism that inherits this "learning ability".

2.2 Maximum margin criterion (MMC)

MMC [12] aims at maximizing the average margin between classes in the projected space. The feature extraction criterion is therefore defined as

J = \frac{1}{2} \sum_{i=1}^{C} \sum_{j=1}^{C} p_i p_j \, d(C_i, C_j),

where C is the number of distinct classes, p_i and p_j are the prior probabilities of class i and class j, respectively, and the interclass margin is defined as
d(C_i, C_j) = d(m_i, m_j) - s(C_i) - s(C_j), \qquad s(C_i) = tr(S_i), \quad s(C_j) = tr(S_j),

where m_i and m_j are the mean vectors of class C_i and class C_j, and S_i and S_j are the covariance matrices of class C_i and class C_j. After a simple mathematical manipulation, we obtain

J = tr(S_b - S_w).

The between-class scatter matrix S_b and the within-class scatter matrix S_w are defined as

S_b = \sum_{i=1}^{C} n_i (m_i - m)(m_i - m)^T, \qquad S_w = \sum_{i=1}^{C} (X_i - m_i)(X_i - m_i)^T,

where n_i is the number of samples in class C_i and m is the mean vector of all data. Then the MMC can be formulated as

J(W) = \arg\max_{W^T W = I} tr\big(W^T (S_b - S_w) W\big).
Obviously we can get the optimal W by solving the generalized eigenvalue problem:
(S_b - S_w) W = \lambda W.

Therefore, W is composed of the eigenvectors corresponding to the first d largest eigenvalues of S_b - S_w. We need not calculate the inverse of S_w, which allows us to avoid the small sample size problem easily; that is also why we do not use LDA as the base method. We also require W to be orthonormal, which may help preserve the shape of the distribution.

2.3 L2,1-norm

For a matrix W ∈ R^{m×n}, the L_{r,p}-norm is defined as

\|W\|_{r,p} = \Big( \sum_{i=1}^{m} \Big( \sum_{j=1}^{n} |w_{ij}|^r \Big)^{p/r} \Big)^{1/p} = \Big( \sum_{i=1}^{m} \|w^i\|_r^p \Big)^{1/p},

where w^i is the i-th row of W. It is easy to verify that the L_{r,p}-norm is a valid norm. The L_{2,1}-norm is then defined as

\|W\|_{2,1} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} w_{ij}^2} = \sum_{i=1}^{m} \|w^i\|_2.

We can easily verify that the L_{2,1}-norm is rotational invariant for rows, i.e., \|WR\|_{2,1} = \|W\|_{2,1} for any rotational matrix R. We also give an intuitive explanation of the L_{2,1}-norm. First, we compute the L_2-norm of each row w^i (corresponding to feature i), and then we compute the L_1-norm of the vector b(W) = (\|w^1\|_2, \|w^2\|_2, ..., \|w^m\|_2). The magnitudes of the components of the vector b(W) indicate how important each feature is. The L_{2,1}-norm favors a small number of nonzero rows in the matrix W, thereby ensuring that feature selection will be achieved.

The L_{2,1}-norm of a matrix was introduced in [13] as a rotational invariant L_1-norm and has been used for multi-task learning [14, 15]. Argyriou et al. [14] developed a novel non-convex multi-task generalization of the L_{2,1}-norm regularization that can be used to learn a few features common across multiple tasks. Obozinski et al. [15] proposed a novel type of joint regularization of the model parameters in order to couple feature selection across tasks. Liu et al. [16] considered the L_{2,1}-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in a probabilistic framework by assuming a suitable prior from the exponential family. One appealing property of L_{2,1}-norm regularization is that it encourages multiple predictors to share similar sparsity patterns. In this article, we employ L_{2,1}-norm regularization on the objective function to enable feature selection efficiently.
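As a concrete illustration (ours, not part of the original paper), the L_{2,1}-norm and its row-wise rotational invariance can be checked with a few lines of numpy; the function name l21_norm and the toy matrices below are our own.

```python
import numpy as np


def l21_norm(W):
    """L2,1-norm: the sum of the L2-norms of the rows of W."""
    row_norms = np.sqrt((W ** 2).sum(axis=1))   # b(W): one L2-norm per row (feature)
    return row_norms.sum()                      # L1-norm of b(W)


# Sanity check of the rotational invariance ||W R||_{2,1} = ||W||_{2,1}
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a random orthogonal (rotation) matrix
print(np.isclose(l21_norm(W @ R), l21_norm(W)))   # True
```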
3 Unsupervised maximum margin feature selection via L2,1-norm minimization

In this section, we first formulate the problem strictly in a mathematical way and describe our unsupervised maximum margin feature selection via L_{2,1}-norm minimization problem. Then we give an efficient alternate algorithm to solve the problem. At last, we give the convergence analysis of our algorithm.

3.1 Problem settings and notations

Before introducing our algorithm in detail, we specify the problem settings and notations. X = [x_1, x_2, ..., x_n] ∈ R^{m×n} represents m features and n samples. We would like to select d features to represent the original data, where d < m. W ∈ R^{m×d} is the transformation matrix. D ∈ R^{m×m} is the diagonal matrix with i-th diagonal element

d_{ii} = \frac{1}{2\|w^i\|_2}, \quad i = 1, 2, ..., m,

where w^i is the i-th row of W. The number of clusters is predefined as c, and F ∈ R^{n×c} is the indicator matrix, with F_{ij} = 1/\sqrt{l_j} if x_i belongs to the j-th cluster and F_{ij} = 0 otherwise, where l_j is the number of samples in the j-th cluster. Then we can formulate unsupervised maximum margin feature selection via sparse constraints (UMMFSSC) as follows:
\min_{W^T W = I} \; tr(W^T S_w W) - tr(W^T S_b W) + a \|W\|_{2,1}.    (1)
We can easily verify the following two equations:

n S_w = X (I - F F^T) X^T, \qquad n S_b = X F F^T X^T.    (2)
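To make relation (2) concrete, the following numpy sketch (our own illustration, with hypothetical variable names) builds the normalized indicator matrix F from a toy labeling and checks that X(I - FF^T)X^T equals the summed within-cluster scatter; the between-cluster counterpart in (2) holds in the same way when the data are centered, which is the convention assumed here.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, c = 4, 12, 3                        # features, samples, clusters (toy sizes)
X = rng.standard_normal((m, n))
labels = np.arange(n) % c                 # a toy assignment with no empty cluster

# Normalized indicator matrix: F[i, j] = 1/sqrt(l_j) if sample i is in cluster j
F = np.zeros((n, c))
for j in range(c):
    idx = np.where(labels == j)[0]
    F[idx, j] = 1.0 / np.sqrt(len(idx))

# X (I - F F^T) X^T versus the summed within-cluster scatter
lhs = X @ (np.eye(n) - F @ F.T) @ X.T
rhs = np.zeros((m, m))
for j in range(c):
    Xj = X[:, labels == j]
    mj = Xj.mean(axis=1, keepdims=True)
    rhs += (Xj - mj) @ (Xj - mj).T

print(np.allclose(lhs, rhs))              # True
```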
Then we can reformulate (1) as follows:

\min_{W^T W = I,\; F^T F = I} \; tr\big(W^T X (I - F F^T) X^T W\big) - tr\big(W^T X F F^T X^T W\big) + a\, tr(W^T D W),    (3)
where a is the balance parameter that controls the weight of the third term.

3.2 An efficient alternate algorithm

From (3), we can see that there are three variables to be optimized, namely W, D, and F. It is difficult to compute them simultaneously, so we optimize them alternately. Firstly, we fix W and compute F; the optimization problem becomes

\arg\max_{F^T F = I} tr(F^T X^T W W^T X F).

Clearly, we can use the spectral decomposition technique to solve this problem: the optimal F is formed by the eigenvectors corresponding to the c largest eigenvalues of the matrix X^T W W^T X. Secondly, we fix F and compute W and D. Since there are still two variables to be optimized, we use a nested optimization technique. When fixing W, we can easily update D as

D = (d_{ii}), where d_{ii} = 1/(2\|w^i\|_2).

When fixing D, the optimization problem becomes

\arg\min_{W^T W = I} tr\big(W^T (X (I - 2 F F^T) X^T + D) W\big).
We can also use the spectral decomposition technique to solve this problem: the optimal W is formed by the eigenvectors corresponding to the c smallest eigenvalues of the matrix X(I - 2FF^T)X^T + D. The iteration procedure is repeated until the algorithm converges; the convergence analysis is given in the next section. At last, we can do feature selection using the computed W. Since the sparse constraint forces many rows of W to zero, we rank the features in descending order of the norms of the corresponding rows of W and select the top-ranked features. The main algorithm is summarized in Table 1.

Table 1 The UMMFSSC algorithm
Input: data X = [x_1, x_2, ..., x_n] ∈ R^{m×n}, number of selected features d, balance parameter a
Output: selected feature subset
Set t = 0. Initialize D_t ∈ R^{m×m} as an identity matrix.
Step 1: Alternately update F, W, and D until convergence:
  Step 1.1: Fixing W_t, compute F_{t+1}. The optimal F_{t+1} is formed by the eigenvectors corresponding to the c largest eigenvalues of the matrix X^T W_t W_t^T X.
  Step 1.2: Fixing F_t, compute W_{t+1} and D_{t+1} with the nested optimization technique:
    Step 1.2.1: Fixing W_t, compute D_{t+1} = (d_{ii}), where d_{ii} = 1/(2\|w^i\|_2).
    Step 1.2.2: Fixing D_t, compute W_{t+1}. The optimal W_{t+1} is formed by the eigenvectors corresponding to the c smallest eigenvalues of the matrix X(I - 2F_t F_t^T)X^T + D_t.
Step 2: Do feature selection using the computed W. Rank the features in descending order according to W and select the top-ranked features.
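To complement Table 1, here is a rough Python sketch of the alternating updates, written by us for illustration only: the function and variable names (e.g., ummfssc) are hypothetical, the initialization of W is our own choice (Table 1 only fixes D_0), and the weighting of D by the balance parameter follows Eq. (3).

```python
import numpy as np
from scipy.linalg import eigh


def ummfssc(X, d, c, alpha=1.0, n_iter=20, eps=1e-8):
    """Illustrative sketch of the alternating updates in Table 1.

    X: (m, n) data matrix (m features, n samples); d: number of selected
    features; c: number of clusters; alpha: balance parameter a in Eq. (3).
    Returns the indices of the d top-ranked features.
    """
    m, n = X.shape
    D = np.eye(m)                                        # D_0 = identity, as in Table 1
    # Our own initialization of W (not specified in Table 1): eigenvectors of
    # X X^T for the d largest eigenvalues, i.e. a PCA-like starting subspace.
    _, W = eigh(X @ X.T, subset_by_index=[m - d, m - 1])

    for _ in range(n_iter):                              # often converges in < 20 iterations
        # Step 1.1: fix W, update F -- eigenvectors of X^T W W^T X
        # for the c largest eigenvalues (relaxed indicator matrix).
        _, F = eigh(X.T @ W @ W.T @ X, subset_by_index=[n - c, n - 1])

        # Step 1.2.1: fix W, update D with d_ii = 1 / (2 ||w^i||_2).
        row_norms = np.sqrt((W ** 2).sum(axis=1)) + eps  # eps guards against zero rows
        D = np.diag(1.0 / (2.0 * row_norms))

        # Step 1.2.2: fix D and F, update W -- eigenvectors of
        # X (I - 2 F F^T) X^T + alpha * D for the smallest eigenvalues
        # (we keep d of them so that W stays m x d).
        M = X @ (np.eye(n) - 2.0 * F @ F.T) @ X.T + alpha * D
        _, W = eigh(M, subset_by_index=[0, d - 1])

    # Step 2: rank features by the L2-norms of the rows of W, descending.
    scores = np.sqrt((W ** 2).sum(axis=1))
    return np.argsort(scores)[::-1][:d]
```

Under these assumptions, a call such as ummfssc(X, d=50, c=20) would return candidate feature indices; K-means (or NCut) clustering would then be run on the corresponding rows of X, as in Sect. 4.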
3.3 Algorithm convergence analysis

In this section, we prove that our algorithm monotonically decreases the objective of the problem in (1). Firstly, we give the following lemma from [17].

Lemma 1 For any nonzero vectors w, v ∈ R^d, the following inequality holds:

\|w\|_2 - \frac{\|w\|_2^2}{2\|v\|_2} \le \|v\|_2 - \frac{\|v\|_2^2}{2\|v\|_2}.

(The inequality holds because the difference between the left- and right-hand sides equals -(\|w\|_2 - \|v\|_2)^2 / (2\|v\|_2) \le 0.)

Theorem 1 The algorithm monotonically decreases the objective of the problem in (1) in each iteration and converges to the global optimum of the problem.

Proof It can be easily verified that optimizing (1) is equivalent to solving (3). As seen in the algorithm, when fixing D as D_t, we can compute W and F. In the t-th iteration, we solve the following problem:

W_{t+1}, F_{t+1} = \arg\min_{W^T W = I,\; F^T F = I} tr\big(W^T X (I - 2 F F^T) X^T W\big) + a\, tr(W^T D_t W).

Then we can get

tr\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + a\, tr(W_{t+1}^T D_t W_{t+1}) \le tr\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + a\, tr(W_t^T D_t W_t),

where w_t^i denotes the i-th row of W_t. The above inequality indicates that

tr\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + a \sum_{i=1}^{m} \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} \le tr\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + a \sum_{i=1}^{m} \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2}.    (4)

According to Lemma 1, for each i we have

\|w_{t+1}^i\|_2 - \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} \le \|w_t^i\|_2 - \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2}.

Since \|W\|_{2,1} = \sum_{i=1}^{m} \|w^i\|_2 and d_{ii} = 1/(2\|w^i\|_2), i = 1, 2, ..., m, the following inequality holds:

\sum_{i=1}^{m} \|w_{t+1}^i\|_2 - \sum_{i=1}^{m} \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} \le \sum_{i=1}^{m} \|w_t^i\|_2 - \sum_{i=1}^{m} \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2}.    (5)
Combining Eqs. (4) and (5), we can get the following result:

tr\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + a \sum_{i=1}^{m} \|w_{t+1}^i\|_2 \le tr\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + a \sum_{i=1}^{m} \|w_t^i\|_2.

That is to say,

tr\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + a \|W_{t+1}\|_{2,1} \le tr\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + a \|W_t\|_{2,1}.

This inequality indicates that the algorithm monotonically decreases the objective of the problem in (1) in each iteration. Besides, since the three terms in (1) are convex functions and the objective function is bounded below (for example, by zero), the iteration converges to the global solution. In the following experimental section, we can see that our algorithm converges quickly; the number of iterations is often less than 20. □

4 Experimental results

In the experiments, we use six real data sets to demonstrate the efficiency of our algorithm. To evaluate it, we perform three groups of experiments. The first group evaluates the algorithms using K-means clustering as the final clustering algorithm. The second group, in order to verify that our algorithm is not dominated by the final clustering algorithm, evaluates the algorithms using NCut clustering instead of K-means clustering. The last group discusses the parameter used in our algorithm. We compare our algorithm with the following classical feature selection algorithms: PcaScor [9], LapScor [10], SPEC [6], MCFS [8], and MRSF [2].

4.1 Data sets descriptions

In this article, we perform experiments on six different data sets: Umist [18], ORL [19], Coil [20], Isolet [21], Sonar [22], and Breast Cancer [23]. A simple description of these six data sets is given in Table 2. We employ two different metrics, clustering accuracy (ACC) and normalized mutual information (NMI), to measure the clustering performance.

Table 2 A simple description of the six data sets
Data set | Dimensionality (m) | Size (n) | Cluster
Cancer   | 30    | 98    | 2
Umist    | 644   | 575   | 20
ORL      | 1,024 | 400   | 20
Sonar    | 60    | 208   | 2
Coil     | 1,024 | 1,440 | 40
Isolet   | 671   | 1,559 | 26

4.2 Overall comparison results

4.2.1 Clustering results with K-means clustering

In the first group of experiments, we employ K-means clustering with a fixed number of selected features to evaluate our algorithm. Since the K-means clustering algorithm is sensitive to initialization, we repeat it 100 times and report the mean results. The ACC results are shown in Fig. 1. Tables 3, 4, 5, and 6 show the NMI results under different numbers of selected features on the ORL, Umist, Isolet, and Coil data sets, respectively.

[Fig. 1 The ACC results under different numbers of selected features on six different data sets (K-means clustering). The x-axis represents the number of selected features and the y-axis the ACC result. a Breast Cancer. b Coil. c Isolet. d ORL. e Sonar. f Umist]
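The paper does not spell out how ACC is computed; a common convention (assumed here) is to match cluster labels to class labels with the Hungarian algorithm, while NMI is available directly in scikit-learn. The helper below is our own sketch under that assumption, with labels taken to be integers starting at 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score


def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping of cluster labels to classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)             # contingency table
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(count.max() - count)  # maximize matched pairs
    return count[rows, cols].sum() / len(y_true)


# NMI can be taken directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```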
Table 3 The NMI results under different numbers of selected features on the ORL data set (mean ± SD)

Numbers of selected features | 15 | 30 | 45 | 60 | 75 | 90 | 105 | 120
PcaScor  | 0.5507 ± 0.0099 | 0.5754 ± 0.0111 | 0.5960 ± 0.0107 | 0.6072 ± 0.0129 | 0.6166 ± 0.0136 | 0.6174 ± 0.0124 | 0.6267 ± 0.0137 | 0.6368 ± 0.0142
LapScor  | 0.6305 ± 0.0107 | 0.6344 ± 0.0116 | 0.6811 ± 0.0131 | 0.6811 ± 0.0131 | 0.5847 ± 0.0200 | 0.6924 ± 0.0140 | 0.6862 ± 0.0156 | 0.6863 ± 0.0136
SPEC     | 0.6206 ± 0.0104 | 0.6400 ± 0.0111 | 0.6619 ± 0.0123 | 0.6752 ± 0.0136 | 0.6611 ± 0.0151 | 0.6911 ± 0.0132 | 0.7162 ± 0.0190 | 0.7163 ± 0.0201
MCFS     | 0.6793 ± 0.0126 | 0.6678 ± 0.0137 | 0.6918 ± 0.0158 | 0.6960 ± 0.0155 | 0.7008 ± 0.0184 | 0.7108 ± 0.0164 | 0.7098 ± 0.0164 | 0.7093 ± 0.0170
MRSF     | 0.6378 ± 0.0115 | 0.6779 ± 0.0156 | 0.6919 ± 0.0139 | 0.7026 ± 0.0165 | 0.7111 ± 0.0162 | 0.7176 ± 0.0158 | 0.7160 ± 0.0149 | 0.7213 ± 0.0183
UMMFSSC  | 0.6932 ± 0.0116 | 0.7041 ± 0.0133 | 0.7131 ± 0.0127 | 0.7394 ± 0.0136 | 0.7324 ± 0.0142 | 0.7500 ± 0.0128 | 0.7200 ± 0.0147 | 0.7521 ± 0.0151

The numbers shown in bold are the best results among the six methods
Table 4 The NMI results under different numbers of selected features on the Umist data set (mean ± SD)

Numbers of selected features | 15 | 30 | 45 | 60 | 75 | 90 | 105 | 120
PcaScor  | 0.4848 ± 0.0129 | 0.5347 ± 0.0126 | 0.6011 ± 0.0141 | 0.6073 ± 0.0177 | 0.6212 ± 0.0187 | 0.6385 ± 0.0193 | 0.6393 ± 0.0172 | 0.6420 ± 0.0174
LapScor  | 0.5013 ± 0.0095 | 0.5174 ± 0.0155 | 0.5240 ± 0.0150 | 0.5512 ± 0.0188 | 0.5847 ± 0.0200 | 0.6032 ± 0.0232 | 0.6136 ± 0.0215 | 0.6116 ± 0.0211
SPEC     | 0.5048 ± 0.0092 | 0.5230 ± 0.0188 | 0.5257 ± 0.0166 | 0.5696 ± 0.0181 | 0.5929 ± 0.0196 | 0.6045 ± 0.0209 | 0.6162 ± 0.0191 | 0.6163 ± 0.0201
MCFS     | 0.5972 ± 0.0172 | 0.6650 ± 0.0191 | 0.6701 ± 0.0220 | 0.6968 ± 0.0234 | 0.7016 ± 0.0188 | 0.7082 ± 0.0227 | 0.6913 ± 0.0223 | 0.6915 ± 0.0200
MRSF     | 0.5667 ± 0.0151 | 0.6437 ± 0.0168 | 0.6581 ± 0.0183 | 0.6567 ± 0.0186 | 0.6649 ± 0.0191 | 0.6648 ± 0.0180 | 0.6639 ± 0.0210 | 0.6686 ± 0.0194
UMMFSSC  | 0.6050 ± 0.0176 | 0.7183 ± 0.0181 | 0.6977 ± 0.0168 | 0.7253 ± 0.0207 | 0.7325 ± 0.0201 | 0.7163 ± 0.0160 | 0.7274 ± 0.0202 | 0.7291 ± 0.0190

The numbers shown in bold are the best results among the six methods
Table 5 The NMI results under different numbers of selected features on the Isolet data set (mean ± SD)

Numbers of selected features | 25 | 50 | 75 | 100 | 125 | 150 | 175 | 200
PcaScor  | 0.5032 ± 0.0141 | 0.5750 ± 0.0123 | 0.6080 ± 0.0127 | 0.6421 ± 0.0129 | 0.6724 ± 0.0177 | 0.6838 ± 0.0138 | 0.6885 ± 0.0146 | 0.6851 ± 0.0158
LapScor  | 0.4892 ± 0.0100 | 0.6093 ± 0.0128 | 0.6393 ± 0.0120 | 0.6278 ± 0.0138 | 0.6345 ± 0.0142 | 0.6464 ± 0.0152 | 0.6691 ± 0.0139 | 0.6682 ± 0.0154
SPEC     | 0.5221 ± 0.0100 | 0.6126 ± 0.0109 | 0.6383 ± 0.0133 | 0.6293 ± 0.0119 | 0.6342 ± 0.0136 | 0.6472 ± 0.0158 | 0.6677 ± 0.0138 | 0.6678 ± 0.0137
MCFS     | 0.6661 ± 0.0199 | 0.6916 ± 0.0162 | 0.7142 ± 0.0149 | 0.7274 ± 0.0193 | 0.7229 ± 0.0152 | 0.7208 ± 0.0182 | 0.6613 ± 0.0143 | 0.7143 ± 0.0141
MRSF     | 0.6667 ± 0.0151 | 0.6241 ± 0.0112 | 0.6351 ± 0.0134 | 0.6363 ± 0.0119 | 0.6706 ± 0.0147 | 0.6860 ± 0.0178 | 0.6870 ± 0.0161 | 0.6892 ± 0.0155
UMMFSSC  | 0.7275 ± 0.0073 | 0.7284 ± 0.0098 | 0.7588 ± 0.0141 | 0.7479 ± 0.0160 | 0.7553 ± 0.0144 | 0.6980 ± 0.0210 | 0.6945 ± 0.0170 | 0.6944 ± 0.0159

The numbers shown in bold are the best results among the six methods
Table 6 The NMI results under different numbers of selected features on the Coil data set (mean ± SD)

Numbers of selected features | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45
PcaScor  | 0.4986 ± 0.0219 | 0.5752 ± 0.0103 | 0.5199 ± 0.0169 | 0.5443 ± 0.0205 | 0.5707 ± 0.0173 | 0.5770 ± 0.0211 | 0.5931 ± 0.0194 | 0.5997 ± 0.0153
LapScor  | 0.4774 ± 0.0095 | 0.5515 ± 0.0138 | 0.5437 ± 0.0122 | 0.5608 ± 0.0108 | 0.5604 ± 0.0119 | 0.5921 ± 0.0155 | 0.6091 ± 0.0139 | 0.6125 ± 0.0148
SPEC     | 0.4776 ± 0.0089 | 0.6126 ± 0.0129 | 0.6303 ± 0.0103 | 0.5422 ± 0.0115 | 0.5618 ± 0.0108 | 0.5920 ± 0.0131 | 0.6085 ± 0.0150 | 0.6133 ± 0.0197
MCFS     | 0.5351 ± 0.0152 | 0.6026 ± 0.0148 | 0.6281 ± 0.0170 | 0.6590 ± 0.0173 | 0.6587 ± 0.0183 | 0.6754 ± 0.0193 | 0.6917 ± 0.0221 | 0.6971 ± 0.0210
MRSF     | 0.5076 ± 0.0118 | 0.5962 ± 0.0148 | 0.6223 ± 0.0168 | 0.6718 ± 0.0177 | 0.6622 ± 0.0184 | 0.6708 ± 0.0198 | 0.7135 ± 0.0221 | 0.7184 ± 0.0185
UMMFSSC  | 0.5716 ± 0.0119 | 0.6484 ± 0.0189 | 0.6521 ± 0.0168 | 0.6531 ± 0.0155 | 0.6897 ± 0.0164 | 0.7086 ± 0.0183 | 0.7396 ± 0.0142 | 0.7240 ± 0.0181

The numbers shown in bold are the best results among the six methods
[Fig. 2 The ACC and NMI results under different numbers of selected features on two data sets (NCut clustering). The x-axis represents the fixed number of selected features, and the y-axis is the ACC or the NMI result. a ACC on Coil data set. b NMI on Coil data set. c ACC on Umist data set. d NMI on Umist data set]
There are several observations from these results.

• From Fig. 1 and Tables 3, 4, 5, and 6, we can see that our algorithm outperforms the other methods significantly.
• From the ACC results on the Sonar data set, we can see that selecting more features is not always better. This may be caused by the redundancy that is added when more features are selected.
• We can also see that the performance is sensitive to the number of selected features. However, how to decide the number of selected features is still an open problem.
[Fig. 3 The ACC and NMI results under fixed selected features with a range of alpha (K-means clustering). a Umist. b Isolet]
4.2.2 Clustering results with NCut clustering

In the second group of experiments, in order to verify that our algorithm is not dominated by the final clustering algorithm, we employ NCut clustering with a fixed number of selected features. The results are shown in Fig. 2. From Fig. 2, we can see that when using the NCut clustering algorithm, our algorithm still outperforms the other methods significantly. That is to say, our algorithm is not dominated by the final clustering algorithm.

4.2.3 Clustering results with different parameters

In the third group of experiments, we show the effect of the parameter of our algorithm. There is only one parameter, a, to discuss, so we fix the number of selected features to 50 and vary a from 1 to 1.9. From Fig. 3, we can see that our algorithm is stable over this range of a.

5 Conclusion and discussion

In this article, we propose a novel unsupervised maximum margin feature selection algorithm. The algorithm inherits a "learning mechanism" and combines feature selection and K-means clustering into a coherent framework to adaptively select the most discriminative subspace. An efficient algorithm is proposed, and a convergence analysis is also given. Experiments on six real data sets demonstrate the efficiency of our algorithm. In the future, we plan to investigate semi-supervised maximum margin feature selection.

Acknowledgments We gratefully acknowledge the support of the National Natural Science Foundation of China, under Grant No. 60975038 and Grant No. 60105003.
References

1. Zhao Z, Wang L, Liu H (2010) Efficient spectral feature selection with minimum redundancy. In: Proceedings of the 24th AAAI conference on artificial intelligence (AAAI-10), pp 673–678
2. Liu H, Motoda H (1998) Feature selection for knowledge discovery and data mining. Springer, Berlin
3. Cover TM, Thomas JA (2006) Elements of information theory, 2nd edn. Wiley-Interscience, Hoboken
4. Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley-Interscience, Hoboken
5. Rodgers JL, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat 42(1):59–66
6. Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning (ICML-07), pp 1151–1157
7. Zhao Z, Liu H (2007) Semi-supervised feature selection via spectral analysis. In: Proceedings of the 7th SIAM international conference on data mining (SDM-07), pp 641–646
8. Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-10), pp 333–342
9. Krzanowski WJ (1987) Selection of variables to preserve multivariate data structure, using principal component analysis. Appl Stat J R Stat Soc Ser C 36:22–33
10. He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-06)
11. Boutsidis C, Mahoney MW, Drineas P (2009) Unsupervised feature selection for the k-means clustering problem. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-09)
12. Li H, Jiang T, Zhang K (2006) Efficient and robust feature extraction by maximum margin criterion. IEEE Trans Neural Netw 17(1):157–165
13. Ding C, Zhou D, He X, Zha H (2006) R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd international conference on machine learning (ICML-06), pp 281–288
14. Argyriou A, Evgeniou T, Pontil M (2007) Multi-task feature learning. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-07)
15. Obozinski G, Taskar B, Jordan M (2006) Multi-task feature selection. Technical report, Department of Statistics, University of California, Berkeley
16. Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient L2,1-norm minimization. In: Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence (UAI-09)
17. Nie F, Huang H, Cai X, Ding C (2010) Efficient and robust feature selection via joint l2,1-norms minimization. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-10)
18. http://images.ee.umist.ac.uk/danny/database.html
19. http://www.zjucadcg.cn/dengcai/Data/FaceData.html
20. http://www1.cs.columbia.edu/CAVE/research/softlib/coil-20.html
21. http://archive.ics.uci.edu/ml/machine-learning-databases/isolet
22. http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/
23. http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/