Neural Comput & Applic (2012) 21:1791–1799 DOI 10.1007/s00521-012-0827-3

ORIGINAL ARTICLE

Unsupervised maximum margin feature selection via L2,1-norm minimization

Shizhun Yang · Chenping Hou · Feiping Nie · Yi Wu



Received: 8 November 2011 / Accepted: 5 January 2012 / Published online: 18 January 2012
© Springer-Verlag London Limited 2012

Abstract In this article, we present an unsupervised maximum margin feature selection algorithm via sparse constraints. The algorithm combines feature selection and K-means clustering into a coherent framework. L2,1-norm regularization is applied to the transformation matrix to enable feature selection across all data samples. Our method is equivalent to solving a convex optimization problem and is solved by an iterative algorithm that converges to an optimal solution. The convergence analysis of the algorithm is also provided. Experimental results demonstrate the efficiency of our algorithm.

Keywords Feature selection · K-means clustering · Maximum margin criterion · Regularization

S. Yang (✉) · C. Hou · Y. Wu
Department of Mathematics and System Science, National University of Defense Technology, Changsha 410073, China
e-mail: [email protected]

F. Nie
Department of Computer Science and Engineering, University of Texas, Arlington, TX 76019, USA

1 Introduction

In recent years, the dimensionality of the data used in data mining has grown larger and larger; high-dimensional data may contain millions of features, many of which are irrelevant. Analyzing such data is challenging. Feature selection and feature extraction are the two main techniques for dimensionality reduction [1]. The major difference between them is that feature selection keeps the original features, while feature extraction generates new ones. In certain applications, such as genetic analysis, where the original features must be kept for interpretation, feature extraction techniques may fail and we must employ feature selection. Another advantage of feature selection is that we only pay attention to the features of interest rather than all of the features, which may be easier than feature extraction. Motivated by these advantages, we focus on feature selection in this article.

Feature selection plays an important role in many applications, since it can speed up the learning process, improve model generalization, and alleviate the effect of the curse of dimensionality [2]. There are three types of feature selection algorithms: supervised feature selection [3–5], semi-supervised feature selection [6, 7], and unsupervised feature selection [1, 8–10]. Supervised feature selection determines feature relevance by evaluating each feature's correlation with the class labels; semi-supervised feature selection uses both (few) labeled and (many) unlabeled data; unsupervised feature selection exploits data variance and separability to evaluate feature relevance without any class labels. In many real-world applications, it is expensive or impossible to recollect the needed training data and rebuild the models, so unsupervised and semi-supervised feature selection may be more practical. Traditional feature selection algorithms simply use statistical characteristics to rank the features and never take "learning the transformation matrix efficiently" into consideration.

Sparsity regularization has played a central role in many areas, including machine learning, statistics, and applied mathematics. Recently, sparsity regularization has been applied to feature selection [11, 14, 15].


In this article, we combine feature selection and K-means clustering into a coherent framework to adaptively select the most discriminative subspace, and we employ a sparse regularization term to enable feature selection. It is the L2,1-norm sparse constraint that forces many rows of the projection matrix W to zero, so that the most relevant features can be selected efficiently. By alternately learning the optimal subspace and the clusters, we obtain a good feature selection result. First, we use K-means clustering to generate class labels and obtain the cluster indicator matrix F. Second, we use a joint maximum margin criterion and L2,1-norm minimization term to perform feature selection. The clustering process is thus integrated with the feature selection process: clustering and feature selection are carried out simultaneously. It is worthwhile to highlight our contributions:

• Unlike traditional feature selection algorithms, we inherit a "learning mechanism" for feature selection by employing sparse constraints on the objective function.
• We propose an efficient algorithm to solve the problem and also provide its convergence analysis. The experiments show that the objective can be solved in fewer than 20 iterations.

The rest of the article is organized as follows. In Sect. 2, we review related work. In Sect. 3, we formulate the joint unsupervised feature selection problem and present our algorithm. Section 4 reports our experiments. Finally, we conclude the article and discuss future work in Sect. 5.

2 Related work

The work most closely related to unsupervised maximum margin feature selection via sparse constraints concerns feature selection techniques, the maximum margin criterion, and L2,1-norm regularization.

2.1 Feature selection techniques

Feature selection techniques are designed to find the relevant subset of the original features that can facilitate clustering or classification. The feature selection problem is essentially a combinatorial optimization problem that is computationally expensive. There are three types of feature selection algorithms: supervised, semi-supervised, and unsupervised feature selection.

Typical supervised feature selection methods include Pearson correlation coefficients [5] and Fisher score [4], which use statistical characteristics to rank the features, and information gain [3], which computes the sensitivity of a feature with respect to the class label distribution of the data. Zhao et al. [6] were the first to unify supervised and unsupervised feature selection and enable their joint study under a general framework (SPEC); the proposed framework is able to generate families of algorithms for both supervised and unsupervised feature selection. Zhao et al. [7] presented a semi-supervised feature selection algorithm based on spectral analysis, which uses both (few) labeled and (many) unlabeled data; it exploits both kinds of data through a regularization framework, which provides an effective way to address the "small labeled-sample" problem. Traditional unsupervised feature selection methods, including PCA score (PcaScor) [9] and Laplacian score (LapScor) [10], select the top-ranked features based on scores computed independently for each feature. Cai et al. [8] proposed a new approach, called Multi-Cluster Feature Selection (MCFS), for unsupervised feature selection; MCFS considers the possible correlation between different features and uses spectral analysis of the data (manifold learning) and L1-regularized models for subset selection. Liu et al. [2] proposed a spectral feature selection algorithm of an embedded model (MRSF), which evaluates the utility of a set of features jointly and can efficiently remove redundant features. Boutsidis et al. [11] presented a novel unsupervised feature selection algorithm for the K-means clustering problem, the first accurate feature selection algorithm for K-means clustering.

In this article, we focus on unsupervised feature selection. The traditional unsupervised feature selection methods above never take "learning the transformation matrix efficiently" into consideration, so we pay more attention to the transformation matrix and look for mechanisms that inherit this "learning ability".

2.2 Maximum margin criterion (MMC)

MMC [12] aims at maximizing the average margin between classes in the projected space. The feature extraction criterion is therefore defined as

$$J = \frac{1}{2} \sum_{i=1}^{C} \sum_{j=1}^{C} p_i\, p_j\, d(C_i, C_j)$$

where $C$ is the number of distinct classes, $p_i, p_j$ are the prior probabilities of class $i$ and class $j$, respectively, and the interclass margin is defined as


$$d(C_i, C_j) = d(m_i, m_j) - s(C_i) - s(C_j), \qquad s(C_i) = \mathrm{tr}(S_i), \quad s(C_j) = \mathrm{tr}(S_j)$$

where $m_i, m_j$ are the mean vectors of class $C_i$ and class $C_j$, and $S_i, S_j$ are the covariance matrices of class $C_i$ and class $C_j$. After some simple manipulation, we obtain

$$J = \mathrm{tr}(S_b - S_w)$$

The between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ are defined as

$$S_b = \sum_{i=1}^{C} n_i\, (m_i - m)(m_i - m)^T, \qquad S_w = \sum_{i=1}^{C} (X_i - m_i)(X_i - m_i)^T$$

where $n_i$ is the number of samples in class $C_i$ and $m$ is the mean vector of all the data. Then MMC can be formulated as

$$J(W) = \arg\max_{W^T W = I} \; \mathrm{tr}\big(W^T (S_b - S_w) W\big)$$

Obviously, we can obtain the optimal $W$ by solving the generalized eigenvalue problem

$$(S_b - S_w)\, W = \lambda W$$

Therefore, $W$ is composed of the eigenvectors corresponding to the first $d$ largest eigenvalues of $S_b - S_w$. We need not calculate the inverse of $S_w$, which allows us to avoid the small sample size problem easily; that is also why we do not use LDA as the base method. We also require $W$ to be orthonormal, which may help preserve the shape of the distribution.
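For concreteness, the following is a small sketch of how $S_b$, $S_w$, and the MMC projection could be computed from labeled data (our own illustration; the function and variable names are ours, not the paper's):

```python
import numpy as np

def mmc_projection(X, y, d):
    """Sketch of the maximum margin criterion (MMC) projection.

    X : (m, n) array with m features and n samples (samples in columns)
    y : (n,) array of class labels
    d : number of projection directions to keep
    """
    m, _ = X.shape
    mean_all = X.mean(axis=1, keepdims=True)
    Sb = np.zeros((m, m))
    Sw = np.zeros((m, m))
    for c in np.unique(y):
        Xc = X[:, y == c]                                       # samples of class c
        mc = Xc.mean(axis=1, keepdims=True)
        Sb += Xc.shape[1] * (mc - mean_all) @ (mc - mean_all).T  # between-class scatter
        Sw += (Xc - mc) @ (Xc - mc).T                            # within-class scatter
    # Optimal W: eigenvectors of Sb - Sw for the d largest eigenvalues.
    # No inverse of Sw is needed (unlike LDA), and the columns are orthonormal.
    evals, evecs = np.linalg.eigh(Sb - Sw)
    return evecs[:, np.argsort(evals)[::-1][:d]]
```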

2.3 L2,1-norm

For a matrix $W \in \mathbb{R}^{m \times n}$, the $L_{r,p}$-norm is defined as

$$\|W\|_{r,p} = \left( \sum_{i=1}^{m} \Big( \sum_{j=1}^{n} |w_{ij}|^{r} \Big)^{p/r} \right)^{1/p} = \left( \sum_{i=1}^{m} \|w^{i}\|_{r}^{p} \right)^{1/p}$$

where $w^{i}$ is the $i$th row of $W$. It is easy to verify that the $L_{r,p}$-norm is a valid norm. The $L_{2,1}$-norm is then defined as

$$\|W\|_{2,1} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} w_{ij}^{2}} = \sum_{i=1}^{m} \|w^{i}\|_{2}$$

We can easily verify that the $L_{2,1}$-norm is rotationally invariant for rows, i.e., $\|WR\|_{2,1} = \|W\|_{2,1}$ for any rotation matrix $R$. We also give an intuitive explanation of the $L_{2,1}$-norm: first compute the $L_2$-norm of each row $w^{i}$ (corresponding to feature $i$), and then compute the $L_1$-norm of the vector $b(W) = \big(\|w^{1}\|_2, \|w^{2}\|_2, \ldots, \|w^{m}\|_2\big)$. The magnitudes of the components of $b(W)$ indicate how important each feature is. The $L_{2,1}$-norm favors a small number of nonzero rows in the matrix $W$, thereby ensuring that feature selection will be achieved. The $L_{2,1}$-norm of a matrix was introduced in [13] as a rotationally invariant $L_1$-norm and has been used for multi-task learning [14, 15]. Argyriou et al. [14] developed a non-convex multi-task generalization of $L_{2,1}$-norm regularization that can be used to learn a few features common across multiple tasks. Obozinski et al. [15] proposed a type of joint regularization of the model parameters in order to couple feature selection across tasks. Liu et al. [16] considered the $L_{2,1}$-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in a probabilistic framework by assuming a suitable prior from the exponential family. One appealing property of $L_{2,1}$-norm regularization is that it encourages multiple predictors to share similar sparsity patterns. In this article, we employ $L_{2,1}$-norm regularization on the objective function to enable feature selection efficiently.
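A minimal numerical illustration of these definitions (our own sketch, not code from the paper):

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: L2-norm of every row, then the L1-norm (sum) of those values."""
    return np.linalg.norm(W, axis=1).sum()

W = np.array([[3.0, 4.0],    # row norm 5.0
              [0.0, 0.0],    # zero row contributes nothing (row sparsity)
              [1.0, 0.0]])   # row norm 1.0
b = np.linalg.norm(W, axis=1)          # b(W): one importance score per feature/row
print(b, l21_norm(W))                  # [5. 0. 1.] 6.0

# Rotational invariance for rows: ||W R||_{2,1} == ||W||_{2,1} for a rotation R
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.isclose(l21_norm(W @ R), l21_norm(W))
```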

3 Unsupervised maximum margin feature selection via L2,1-norm minimization

In this section, we first formulate the problem in a strict mathematical way and describe our unsupervised maximum margin feature selection via L2,1-norm minimization problem. We then give an efficient alternating algorithm to solve it, and finally provide the convergence analysis of the algorithm.

3.1 Problem settings and notations

Before introducing our algorithm in detail, we specify the problem settings and notations. $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ represents $m$ features and $n$ samples. We would like to select $d$ features to represent the original data, where $d < m$. $W \in \mathbb{R}^{m \times d}$ is the transformation matrix. $D \in \mathbb{R}^{m \times m}$ is the diagonal matrix with $i$th diagonal element

$$d_{ii} = \frac{1}{2\|w^{i}\|_{2}}, \quad i = 1, 2, \ldots, m$$

where $w^{i}$ is the $i$th row of $W$. The number of clusters is predefined as $c$, and $F \in \mathbb{R}^{n \times c}$ is the cluster indicator matrix, with $F_{ij} = 1/\sqrt{l_j}$ if $x_i$ belongs to the $j$th cluster and $F_{ij} = 0$ otherwise, where $l_j$ is the number of samples in the $j$th cluster. Then we can formulate unsupervised maximum margin feature selection via sparse constraints (UMMFSSC) as follows,

$$\min_{W^T W = I} \; \mathrm{tr}(W^T S_w W) - \mathrm{tr}(W^T S_b W) + \alpha \|W\|_{2,1} \qquad (1)$$

We can easily verify the following two equations,

$$n S_w = X (I - F F^T) X^T, \qquad n S_b = X F F^T X^T \qquad (2)$$

Then we can reformulate (1) as follows

$$\min_{W^T W = I,\; F^T F = I} \; \mathrm{tr}\big(W^T X (I - F F^T) X^T W\big) - \mathrm{tr}\big(W^T X F F^T X^T W\big) + \alpha\, \mathrm{tr}(W^T D W) \qquad (3)$$

where $\alpha$ is the balance parameter that controls the weight of the third term.
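As a quick sanity check of the identities behind (2), the following sketch verifies them under the unnormalized-scatter convention with globally centered data (the factor $n$ in (2) depends on how the scatter matrices are normalized; this is our own illustration, not the paper's code):

```python
import numpy as np

m, n, c = 5, 60, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))
labels = np.arange(n) % c                      # every cluster non-empty
X = X - X.mean(axis=1, keepdims=True)          # center so the global mean is zero

F = np.zeros((n, c))
for j in range(c):
    idx = labels == j
    F[idx, j] = 1.0 / np.sqrt(idx.sum())       # F_ij = 1/sqrt(l_j), as in Sect. 3.1

Sw = np.zeros((m, m))
Sb = np.zeros((m, m))
for j in range(c):
    Xj = X[:, labels == j]
    mj = Xj.mean(axis=1, keepdims=True)
    Sw += (Xj - mj) @ (Xj - mj).T              # within-cluster scatter
    Sb += Xj.shape[1] * mj @ mj.T              # between-cluster scatter (global mean is zero)

assert np.allclose(F.T @ F, np.eye(c))                       # F^T F = I
assert np.allclose(X @ F @ F.T @ X.T, Sb)
assert np.allclose(X @ (np.eye(n) - F @ F.T) @ X.T, Sw)
```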

3.2 An efficient alternating algorithm

From (3), we can see that there are three variables to be optimized: $W$, $D$, and $F$. It is difficult to compute them simultaneously, so we optimize them alternately.

First, we fix $W$ and compute $F$. The optimization problem becomes

$$\arg\max_{F^T F = I} \; \mathrm{tr}(F^T X^T W W^T X F)$$

Clearly, we can use the spectral decomposition technique to solve this problem: the optimal $F$ is formed by the eigenvectors corresponding to the $c$ largest eigenvalues of the matrix $X^T W W^T X$.

Second, we fix $F$ and compute $W$ and $D$. There are still two variables to be optimized, so we use a nested optimization technique. When fixing $W$, we can easily update $D$ as

$$D = \mathrm{diag}(d_{ii}), \quad \text{where } d_{ii} = \frac{1}{2\|w^{i}\|_{2}}$$

When fixing $D$, the optimization problem becomes

$$\arg\min_{W^T W = I} \; \mathrm{tr}\big(W^T \big(X(I - 2 F F^T) X^T + D\big) W\big)$$

We can again use the spectral decomposition technique: the optimal $W$ is formed by the eigenvectors corresponding to the $c$ smallest eigenvalues of the matrix $X(I - 2FF^T)X^T + D$. The iteration is repeated until the algorithm converges; the convergence analysis is given in the next section. Finally, we do feature selection using the computed $W$: the sparse constraint forces many rows of $W$ to zero, so we rank the features in descending order by the norms of the corresponding rows of $W$ and select the top-ranked ones. The main algorithm is summarized in Table 1.

Table 1 The UMMFSSC algorithm

Input: data $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$; number of selected features $d$; balance parameter $\alpha$
Output: selected feature subset
Set $t = 0$. Initialize $D_t \in \mathbb{R}^{m \times m}$ as an identity matrix.
Step 1: Alternately update $F$, $W$, and $D$ until convergence
  Step 1.1: Fixing $W_t$, compute $F_{t+1}$: the optimal $F_{t+1}$ is formed by the eigenvectors corresponding to the $c$ largest eigenvalues of $X^T W_t W_t^T X$.
  Step 1.2: Fixing $F_t$, compute $W_{t+1}$ and $D_{t+1}$ by nested optimization.
    Step 1.2.1: Fixing $W_t$, compute $D_{t+1} = \mathrm{diag}(d_{ii})$, where $d_{ii} = 1/(2\|w_t^{i}\|_{2})$.
    Step 1.2.2: Fixing $D_t$, compute $W_{t+1}$: the optimal $W_{t+1}$ is formed by the eigenvectors corresponding to the $c$ smallest eigenvalues of $X(I - 2 F_t F_t^T) X^T + D_t$.
Step 2: Do feature selection using the computed $W$: rank the features in descending order according to the row norms of $W$ and select the top-ranked features.
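A rough Python sketch of the alternating procedure in Table 1 (our reading of it, not the authors' code; the small `eps` guard and the explicit $\alpha$ factor in the $W$-step are our additions):

```python
import numpy as np

def ummfssc_select(X, n_clusters, n_selected, alpha=1.0, n_iter=20, eps=1e-8):
    """Sketch of the UMMFSSC iteration (Table 1), returning top-ranked feature indices.

    X : (m, n) data matrix with m features and n samples
    """
    m, n = X.shape
    W = np.eye(m, n_clusters)                  # any orthonormal starting point
    for _ in range(n_iter):
        # F-step: eigenvectors of X^T W W^T X for the c largest eigenvalues
        evals, evecs = np.linalg.eigh(X.T @ W @ W.T @ X)
        F = evecs[:, np.argsort(evals)[::-1][:n_clusters]]
        # D-step: d_ii = 1 / (2 ||w^i||_2); eps guards zero rows (our addition)
        D = np.diag(1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps)))
        # W-step: eigenvectors for the c smallest eigenvalues of
        # X (I - 2 F F^T) X^T + alpha * D (Table 1 writes the matrix without the
        # explicit alpha; we include it to match objective (3))
        A = X @ (np.eye(n) - 2.0 * F @ F.T) @ X.T + alpha * D
        evals, evecs = np.linalg.eigh(A)
        W = evecs[:, np.argsort(evals)[:n_clusters]]
    # Rank features by the L2-norm of the corresponding row of W, take the top ones
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(scores)[::-1][:n_selected]
```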

3.3 Algorithm convergence analysis

In this section, we prove that our algorithm monotonically decreases the objective of problem (1). First, we give a lemma from [17].

Lemma 1 For any nonzero vectors $w, v \in \mathbb{R}^{d}$, the following inequality holds:

$$\|w\|_{2} - \frac{\|w\|_{2}^{2}}{2\|v\|_{2}} \le \|v\|_{2} - \frac{\|v\|_{2}^{2}}{2\|v\|_{2}}$$

Theorem 1 The algorithm monotonically decreases the objective of problem (1) in each iteration and converges to the global optimum of the problem.

Proof It can easily be verified that optimizing (1) is equivalent to solving (3). As seen in the algorithm, when fixing $D$ as $D_t$, we can compute $W$ and $F$. In the $t$th iteration, we solve the following problem:

$$W_{t+1}, F_{t+1} = \arg\min_{W^T W = I,\; F^T F = I} \; \mathrm{tr}\big(W^T X (I - 2 F F^T) X^T W\big) + \alpha\, \mathrm{tr}(W^T D_t W)$$

Then we can get

$$\mathrm{tr}\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + \alpha\, \mathrm{tr}(W_{t+1}^T D_t W_{t+1}) \le \mathrm{tr}\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + \alpha\, \mathrm{tr}(W_t^T D_t W_t)$$


Recall that $w_t^{i}$ denotes the $i$th row of $W_t$. The above inequality indicates that

$$\mathrm{tr}\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + \alpha \sum_{i=1}^{m} \frac{\|w_{t+1}^{i}\|_{2}^{2}}{2\|w_{t}^{i}\|_{2}} \le \mathrm{tr}\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + \alpha \sum_{i=1}^{m} \frac{\|w_{t}^{i}\|_{2}^{2}}{2\|w_{t}^{i}\|_{2}} \qquad (4)$$

According to Lemma 1, for each $i$ we have

$$\|w_{t+1}^{i}\|_{2} - \frac{\|w_{t+1}^{i}\|_{2}^{2}}{2\|w_{t}^{i}\|_{2}} \le \|w_{t}^{i}\|_{2} - \frac{\|w_{t}^{i}\|_{2}^{2}}{2\|w_{t}^{i}\|_{2}}$$

Since $\|W\|_{2,1} = \sum_{i=1}^{m} \|w^{i}\|_{2}$ and $d_{ii} = 1/(2\|w^{i}\|_{2})$, $i = 1, 2, \ldots, m$, the following inequality holds:

$$\sum_{i=1}^{m} \left( \|w_{t+1}^{i}\|_{2} - \frac{\|w_{t+1}^{i}\|_{2}^{2}}{2\|w_{t}^{i}\|_{2}} \right) \le \sum_{i=1}^{m} \left( \|w_{t}^{i}\|_{2} - \frac{\|w_{t}^{i}\|_{2}^{2}}{2\|w_{t}^{i}\|_{2}} \right) \qquad (5)$$



Combining Eqs. (4) and (5), we obtain

$$\mathrm{tr}\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + \alpha \sum_{i=1}^{m} \|w_{t+1}^{i}\|_{2} \le \mathrm{tr}\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + \alpha \sum_{i=1}^{m} \|w_{t}^{i}\|_{2}$$

That is to say,

$$\mathrm{tr}\big(W_{t+1}^T X (I - 2 F_{t+1} F_{t+1}^T) X^T W_{t+1}\big) + \alpha \|W_{t+1}\|_{2,1} \le \mathrm{tr}\big(W_t^T X (I - 2 F_t F_t^T) X^T W_t\big) + \alpha \|W_t\|_{2,1}$$

This inequality indicates that the algorithm monotonically decreases the objective of problem (1) in each iteration. Moreover, since the three terms in (1) are convex functions and the objective function is bounded below (for example by zero), the iteration converges to the global solution. In the following experiment section, we will see that our algorithm converges fast; the number of iterations is often less than 20. □

4 Experimental results

In the experiments, we use six real data sets to demonstrate the efficiency of our algorithm. We perform three groups of experiments. The first group evaluates the algorithms using K-means clustering as the final clustering algorithm. The second group evaluates the algorithms using NCut clustering instead of K-means clustering, in order to verify that our algorithm is not dominated by the final clustering algorithm. The last group discusses the parameter used in our algorithm. We compare our algorithm with the following classical feature selection algorithms: PcaScor [9], LapScor [10], SPEC [6], MCFS [8], and MRSF [2].

4.1 Data sets descriptions

We perform experiments on six different data sets: Umist [18], ORL [19], Coil [20], Isolet [21], Sonar [22], and Breast Cancer [23]. A simple description of these six data sets is given in Table 2.

Table 2 A simple description of the six data sets

Data set   Dimensionality (m)   Size (n)   Cluster
Cancer     30                   98         2
Umist      644                  575        20
ORL        1,024                400        20
Sonar      60                   208        2
Coil       1,024                1,440      40
Isolet     671                  1,559      26

We employ two different metrics, clustering accuracy (ACC) and normalized mutual information (NMI), to measure the clustering performance.
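The paper does not give implementation details for these metrics; one common way to compute ACC (best one-to-one cluster-to-label matching via the Hungarian algorithm) and NMI, averaged over repeated K-means runs, is sketched below (function and variable names are ours, not the authors'):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between clusters and classes (Hungarian
    algorithm on the negated contingency table), then the matched fraction."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((clusters.size, classes.size))
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == k) & (y_true == c))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / y_true.size

def evaluate(X_selected, y_true, n_clusters, n_runs=100, seed=0):
    """Mean ACC and NMI over repeated K-means runs (samples in rows here)."""
    accs, nmis = [], []
    for r in range(n_runs):
        y_pred = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + r).fit_predict(X_selected)
        accs.append(clustering_accuracy(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred))
    return float(np.mean(accs)), float(np.mean(nmis))
```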

4.2 Overall comparison results

4.2.1 Clustering results with K-means clustering

In the first group of experiments, we employ K-means clustering on the selected features to evaluate our algorithm. Since the K-means clustering algorithm is sensitive to initialization, we repeat it 100 times and report the mean results. The ACC results are shown in Fig. 1; Tables 3, 4, 5, and 6 show the NMI results under different numbers of selected features on the ORL, Umist, Isolet, and Coil data sets, respectively.

Fig. 1 The ACC results under different numbers of selected features on six different data sets (K-means clustering). The x-axis represents the number of selected features and the y-axis the ACC. a Breast Cancer. b Coil. c Isolet. d Orl. e Sonar. f Umist

Table 3 The NMI results under different numbers of selected features on the ORL data set (mean ± SD)

Number of selected features: 15, 30, 45, 60, 75, 90, 105, 120

PcaScor    0.5507±0.0099  0.5754±0.0111  0.5960±0.0107  0.6072±0.0129  0.6166±0.0136  0.6174±0.0124  0.6267±0.0137  0.6368±0.0142
LapScor    0.6305±0.0107  0.6344±0.0116  0.6811±0.0131  0.6811±0.0131  0.5847±0.0200  0.6924±0.0140  0.6862±0.0156  0.6863±0.0136
SPEC       0.6206±0.0104  0.6400±0.0111  0.6619±0.0123  0.6752±0.0136  0.6611±0.0151  0.6911±0.0132  0.7162±0.0190  0.7163±0.0201
MCFS       0.6793±0.0126  0.6678±0.0137  0.6918±0.0158  0.6960±0.0155  0.7008±0.0184  0.7108±0.0164  0.7098±0.0164  0.7093±0.0170
MRSF       0.6378±0.0115  0.6779±0.0156  0.6919±0.0139  0.7026±0.0165  0.7111±0.0162  0.7176±0.0158  0.7160±0.0149  0.7213±0.0183
UMMFSSC    0.6932±0.0116  0.7041±0.0133  0.7131±0.0127  0.7394±0.0136  0.7324±0.0142  0.7500±0.0128  0.7200±0.0147  0.7521±0.0151

The numbers shown in bold are the best results among the six methods

Table 4 The NMI results under different numbers of selected features on the Umist data set (mean ± SD)

Number of selected features: 15, 30, 45, 60, 75, 90, 105, 120

PcaScor    0.4848±0.0129  0.5347±0.0126  0.6011±0.0141  0.6073±0.0177  0.6212±0.0187  0.6385±0.0193  0.6393±0.0172  0.6420±0.0174
LapScor    0.5013±0.0095  0.5174±0.0155  0.5240±0.0150  0.5512±0.0188  0.5847±0.0200  0.6032±0.0232  0.6136±0.0215  0.6116±0.0211
SPEC       0.5048±0.0092  0.5230±0.0188  0.5257±0.0166  0.5696±0.0181  0.5929±0.0196  0.6045±0.0209  0.6162±0.0191  0.6163±0.0201
MCFS       0.5972±0.0172  0.6650±0.0191  0.6701±0.0220  0.6968±0.0234  0.7016±0.0188  0.7082±0.0227  0.6913±0.0223  0.6915±0.0200
MRSF       0.5667±0.0151  0.6437±0.0168  0.6581±0.0183  0.6567±0.0186  0.6649±0.0191  0.6648±0.0180  0.6639±0.0210  0.6686±0.0194
UMMFSSC    0.6050±0.0176  0.7183±0.0181  0.6977±0.0168  0.7253±0.0207  0.7325±0.0201  0.7163±0.0160  0.7274±0.0202  0.7291±0.0190

The numbers shown in bold are the best results among the six methods


Table 5 The NMI results under different numbers of selected features on the Isolet data set (mean ± SD)

Number of selected features: 25, 50, 75, 100, 125, 150, 175, 200

PcaScor    0.5032±0.0141  0.5750±0.0123  0.6080±0.0127  0.6421±0.0129  0.6724±0.0177  0.6838±0.0138  0.6885±0.0146  0.6851±0.0158
LapScor    0.4892±0.0100  0.6093±0.0128  0.6393±0.0120  0.6278±0.0138  0.6345±0.0142  0.6464±0.0152  0.6691±0.0139  0.6682±0.0154
SPEC       0.5221±0.0100  0.6126±0.0109  0.6383±0.0133  0.6293±0.0119  0.6342±0.0136  0.6472±0.0158  0.6677±0.0138  0.6678±0.0137
MCFS       0.6661±0.0199  0.6916±0.0162  0.7142±0.0149  0.7274±0.0193  0.7229±0.0152  0.7208±0.0182  0.6613±0.0143  0.7143±0.0141
MRSF       0.6667±0.0151  0.6241±0.0112  0.6351±0.0134  0.6363±0.0119  0.6706±0.0147  0.6860±0.0178  0.6870±0.0161  0.6892±0.0155
UMMFSSC    0.7275±0.0073  0.7284±0.0098  0.7588±0.0141  0.7479±0.0160  0.7553±0.0144  0.6980±0.0210  0.6945±0.0170  0.6944±0.0159

The numbers shown in bold are the best results among the six methods

Table 6 The NMI results under different numbers of selected features on the Coil data set (mean ± SD)

Number of selected features: 10, 15, 20, 25, 30, 35, 40, 45

PcaScor    0.4986±0.0219  0.5752±0.0103  0.5199±0.0169  0.5443±0.0205  0.5707±0.0173  0.5770±0.0211  0.5931±0.0194  0.5997±0.0153
LapScor    0.4774±0.0095  0.5515±0.0138  0.5437±0.0122  0.5608±0.0108  0.5604±0.0119  0.5921±0.0155  0.6091±0.0139  0.6125±0.0148
SPEC       0.4776±0.0089  0.6126±0.0129  0.6303±0.0103  0.5422±0.0115  0.5618±0.0108  0.5920±0.0131  0.6085±0.0150  0.6133±0.0197
MCFS       0.5351±0.0152  0.6026±0.0148  0.6281±0.0170  0.6590±0.0173  0.6587±0.0183  0.6754±0.0193  0.6917±0.0221  0.6971±0.0210
MRSF       0.5076±0.0118  0.5962±0.0148  0.6223±0.0168  0.6718±0.0177  0.6622±0.0184  0.6708±0.0198  0.7135±0.0221  0.7184±0.0185
UMMFSSC    0.5716±0.0119  0.6484±0.0189  0.6521±0.0168  0.6531±0.0155  0.6897±0.0164  0.7086±0.0183  0.7396±0.0142  0.7240±0.0181

The numbers shown in bold are the best results among the six methods

There are several observations from these results:

• From Fig. 1 and Tables 3, 4, 5, and 6, we can see that our algorithm outperforms the other methods significantly.
• From the ACC results on the Sonar data set, we can see that selecting more features is not always better. This may be caused by the redundancy added when more features are selected.
• We can also see that the performance is sensitive to the number of selected features. However, how to decide the number of selected features is still an open problem.

4.2.2 Clustering results with NCut clustering

In the second group of experiments, in order to verify that our algorithm is not dominated by the final clustering algorithm, we employ NCut clustering with fixed numbers of selected features. The results are shown in Fig. 2. From Fig. 2, we can see that when using the NCut clustering algorithm, our algorithm still outperforms the other methods significantly. That is to say, our algorithm is not dominated by the final clustering algorithm.

Fig. 2 The ACC and NMI results under different numbers of selected features on two data sets (NCut clustering). The x-axis represents the fixed number of selected features, and the y-axis the ACC or NMI. a ACC on Coil data set. b NMI on Coil data set. c ACC on Umist data set. d NMI on Umist data set

4.2.3 Clustering results with different parameters

In the third group of experiments, we show the effect of different parameters of our algorithm. There is only one parameter, α, to discuss, so we fix the number of selected features to 50 and vary α from 1 to 1.9. From Fig. 3, we can see that our algorithm is stable over this range of α.

Fig. 3 The ACC and NMI results under a fixed number of selected features with a range of α (K-means clustering). a Umist. b Isolet

4.2.2 Clustering results with NCut clustering In the second group experiment, in order to verify that our algorithm is not dominated by the final clustering algorithm, we employ NCut clustering with fixed selected features. The results are in Fig. 2. From Fig. 2, we can see that when using NCut clustering algorithm, our algorithm still outperforms other methods significantly. That is to say our algorithm is not dominated by the final clustering algorithm. 4.2.3 Clustering results with different parameters In the third group experiment, we will show the effect of different parameters of our algorithm. In our algorithm, there is only one parameter a to discuss. So we fix the numbers of selected features to 50 and set a from 1 to 1.9. From Fig. 3, we can see our algorithm is stable with a range of a. 5 Conclusion and discussion In this article, we propose a novel unsupervised maximum margin feature selection algorithm. The algorithm inherits ‘‘learning mechanism’’ and combines feature selection and K-means clustering into a coherent framework to adaptively select the most discriminative subspace. An efficient algorithm is proposed and convergence analysis is also given in this article. Experiments on six real data sets demonstrate the efficiency of our algorithm. In future, we are planning to take into account semi-supervised maximum margin feature selection. Acknowledgments We gratefully acknowledge the supports from National Natural Science Foundation of China, under Grant No. 60975038 and Grant No. 60105003.

123


References

1. Zhao Z, Wang L, Liu H (2010) Efficient spectral feature selection with minimum redundancy. In: Proceedings of the 24th AAAI conference on artificial intelligence (AAAI-10), pp 673–678
2. Liu H, Motoda H (1998) Feature selection for knowledge discovery and data mining. Springer, Berlin
3. Cover TM, Thomas JA (2006) Elements of information theory, 2nd edn. Wiley-Interscience, Hoboken
4. Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley-Interscience, Hoboken
5. Rodgers JL, Nicewander WA (1988) Thirteen ways to look at the correlation coefficient. Am Stat 42(1):59–66
6. Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning (ICML-07), pp 1151–1157
7. Zhao Z, Liu H (2007) Semi-supervised feature selection via spectral analysis. In: Proceedings of the 7th SIAM international conference on data mining (SDM-07), pp 641–646
8. Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-10), pp 333–342
9. Krzanowski WJ (1987) Selection of variables to preserve multivariate data structure, using principal component analysis. Appl Stat (J R Stat Soc Ser C) 36:22–33
10. He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-06)
11. Boutsidis C, Mahoney MW, Drineas P (2009) Unsupervised feature selection for the k-means clustering problem. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-09)
12. Li H, Jiang T, Zhang K (2006) Efficient and robust feature extraction by maximum margin criterion. IEEE Trans Neural Netw 17(1):157–165
13. Ding C, Zhou D, He X, Zha H (2006) R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd international conference on machine learning (ICML-06), pp 281–288
14. Argyriou A, Evgeniou T, Pontil M (2007) Multi-task feature learning. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-07)
15. Obozinski G, Taskar B, Jordan M (2006) Multi-task feature selection. Technical report, Department of Statistics, University of California, Berkeley
16. Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient L2,1-norm minimization. In: Proceedings of the 25th conference on uncertainty in artificial intelligence (UAI-09)
17. Nie F, Huang H, Cai X, Ding C (2010) Efficient and robust feature selection via joint L2,1-norms minimization. In: Proceedings of the annual conference on advances in neural information processing systems (NIPS-10)
18. http://images.ee.umist.ac.uk/danny/database.html
19. http://www.zjucadcg.cn/dengcai/Data/FaceData.html
20. http://www1.cs.columbia.edu/CAVE/research/softlib/coil-20.html
21. http://archive.ics.uci.edu/ml/machine-learning-databases/isolet
22. http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/
23. http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/